Metadata-Version: 2.4
Name: investigation
Version: 0.1.2a3
Summary: Investigation is a machine learning experiment management library that couples with project source code to provide version-dependent experiment assistance.
License: MIT
License-File: LICENSE
Author: Yuxiang Luo
Author-email: yuxiang.lll@outlook.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: environs (>=14.6.0,<15.0.0)
Requires-Dist: flask (>=3.0.0,<4.0.0)
Requires-Dist: pre-commit (>=4.2.0,<5.0.0)
Requires-Dist: slurming (>=0.2.3a1,<0.3.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: tyro (>=0.9.24,<0.10.0)
Requires-Dist: yapf (>=0.43.0,<0.44.0)
Description-Content-Type: text/markdown

# Investigation

Investigation is a machine learning experiment management library that couples with project source code to provide version-dependent experiment assistance.

> *Do it faster, more robust, more solid.*
> *Let others wait in the queue, not you.*
> *Let your paper be rejected by idiot reviewers, not by reasonable critics.*

## Features

- 🔬 **Experiment Management**: Manage machine learning experiments with version control integration
- 📊 **Training Log Visualization**: Real-time visualization of training curves and metrics
- 🎯 **Design of Experiments (DoE)**: Generate and manage experiment configurations
- 📈 **Multi-Instance Comparison**: Compare and aggregate results across multiple training runs
- 🔍 **Advanced Filtering**: Filter and organize experiments using regex patterns
- 🏗️ **Nested Dataclass Support**: Organize hyperparameters hierarchically (Hydra.cc-style)

## Installation

```bash
# Using poetry (recommended)
poetry add investigation

# Or install with pip
pip install investigation flask
```

## Quick Start

### Basic Usage as Terminal Tools

Investigation provides two main terminal tools: `ivst` and `ivst-vsl`.

#### Experiment Management with `ivst`

In an external project, use `ivst` to manage experiments:

```bash
# Add investigation to your project
poetry add investigation

# Initialize a new experiment
poetry run ivst init --name doe

# Commit experiment configuration
poetry run ivst commit --commit githash --name doe

# Submit experiments
poetry run ivst submit --commit githash
```

### Example Usage with an ML Project

Here is a detailed example of integrating `investigation` into a machine learning project (`mlctrl`).

1.  **Clone the project and navigate into the directory:**

    ```bash
    # Replace with your actual repository URL
    git clone https://github.com/your-username/mlctrl.git
    cd mlctrl
    ```

2.  **Add `investigation` as a dependency:**

    ```bash
    poetry add investigation
    ```

3.  **Create `entry/init.py` for experiment configuration:**

    This file defines how a parameter dictionary maps to a job, such as a `SlurmJob`.

    ```python
    from investigation.doeargs.args import ExampleArgs
    from typing import Type, List, Dict
    from slurming.SlurmJob.job import SlurmJob
    from slurming.Slurm.shellUtils import make_command
    from pathlib import Path
    import environs
    import numpy as np
    import os


    # This file is part of EpisodicRL.

    from example.models.model1 import Args as FlowArgs
    from example.models.model2 import FourierDiffusionArgs, UnetDiffusionArgs, AttentionDiffusionArgs

    CLUSTER_CONFIG = {
        "clusterC-dense": {
            "hostname": "clusterC",
            "mem_per_core": 4.4,
            "min_cores": 1,
            "core_per_job": 6,
            "vip": ["clusterB", "clusterA"],
            "partition": ["cpu", "cpu-exp"],
        },
        "clusterA-dense": {
            "hostname": "clusterA",
            "mem_per_core": 4.84,
            "min_cores": 1,
            "core_per_job": 32,
            "vip": ["clusterB"],
            "partition": ["cpu", "gpu"],
        }
    }

    cwd = Path.cwd()
    env = environs.Env()
    env.read_env(str(cwd / ".env"))

    LIST_OF_ARGS: List[Type[ExampleArgs]] = [
        FlowArgs,
        UnetDiffusionArgs,
        FourierDiffusionArgs,
        AttentionDiffusionArgs,
    ]
    LIST_OF_CLUSTERS: List[str] = ["clusterA", "clusterC", "clusterB"]

    def parameter2SlurmJob(param_dict: Dict) -> SlurmJob:
        """
        Convert a parameter dictionary to a SlurmJob instance.

        Args:
            param_dict (Dict): A dictionary containing parameters for the Slurm job.

        Returns:
            SlurmJob: An instance of SlurmJob configured with the provided parameters.
        """
        python_run_json_head_lightning = "src.train"
        gpus = 1
        hostname = os.uname().nodename

        match hostname[0]:
            case "p":
                host = "clusterC-dense"
            case "c":
                host = "clusterA-dense"
            case _:
                host = "clusterC-dense"

        cores = max(
            int(np.ceil(CLUSTER_CONFIG[host]["core_per_job"])),
            CLUSTER_CONFIG[host]["min_cores"],
        )
        partition = CLUSTER_CONFIG[host]["partition"]
        job_duration = 40
        if gpus > 0:
            job_duration = job_duration // 20

        job = SlurmJob(
            account=env.str("SLURM_ACCOUNT"),
            content=[
                "wandb offline",
                "rm -rf logger.json videos wandb",
                "find example/ -type f -name \"*.pyc\" | xargs rm -fv",
                make_command(
                    "accelerate launch --main_process_port 54874 poetry run python",
                    params1_dict={
                        "m": python_run_json_head_lightning,
                    },
                    params2_dict=param_dict,
                    connection=" ",
                ),
            ],
            licenses={},
            modules=[],
            env_vars={
                "WANDB_CONSOLE": "off",
                "SLURM_CPU_PER_TASK": str(cores),
            },
            notify_email=[],
            aliases={},
            paths=[],
            sbatch_args={},
            hours=job_duration,
            cpus_per_task=cores,
            gpus=gpus,
            interactive=True,
            output_storage=["history.csv"],
            partition=",".join(partition),
        )
        return job
    ```

4.  **Create the main entry point `src/entry.py`:**

    This script uses the `@ctrl.main()` decorator to handle experiment logic.

    ```python
    import investigation as ctrl
    import tyro
    from investigation.doeargs.args import ExampleArgs

    @ctrl.main()
    def main() -> None:
        """Entry point for loading configurations and running training."""
        args, _ = tyro.cli(
            ExampleArgs,
            return_unknown_args=True,
        )
        match args.name:
            case "ModelName1":
                from example.models.model2 import FourierDiffusionArgs
                args = tyro.cli(FourierDiffusionArgs)
            case "ModelName2":
                from example.models.model2 import UnetDiffusionArgs
                args = tyro.cli(UnetDiffusionArgs)
            case _:
                raise NotImplementedError(f"{args.name} not implemented yet.")

        from src.train import train
        train(args)

    if __name__ == '__main__':
        main()
    ```

5.  **Initialize a DoE configuration file `doe.json`:**

    This JSON file defines the parameter space for the experiments.

    ```json
    {
        "meta": {
            "num_seeds": {"type": "<class 'int'>", "default": 1, "value": 1},
            "include": {
                "type": "List[Dict[str, Dict]]",
                "default": [],
                "value": []
            }
        },
        "shared": {
            "batch_size": {"type": "<class 'int'>", "default": 16, "range": []},
            "epochs": {"type": "<class 'int'>", "default": 100, "range": []}
        },
        "ModelName1": {
            "name": {"type": "<class 'str'>", "default": "ModelName1", "range": []}
        }
    }
    ```

6.  **Run the experiments:**

    Execute the entry script with the DoE configuration and a commit hash.

    ```bash
    poetry run python src/entry.py --doe-config doe.json --commit <your_commit_hash>
    ```

    This command will:
    -   Create a JSON case database under `.investigation/<your_commit_hash>/`.
    -   Generate shell scripts for each experiment case in `.investigation/<your_commit_hash>/shell/`.
    -   Submit the jobs defined in `doe.json`.
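
To make the idea of a parameter space concrete, the sketch below expands DoE entries into individual cases. It is not Investigation's actual expansion logic: `expand_cases` is a hypothetical helper, and it assumes that an empty `range` means "use `default`" and that cases are the Cartesian product of all non-empty ranges.

```python
import itertools
from typing import Any, Dict, List


def expand_cases(shared: Dict[str, Dict[str, Any]],
                 model: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand DoE entries into a list of concrete parameter dictionaries.

    Assumption: an empty ``range`` means "use ``default``"; a non-empty
    ``range`` lists every value to sweep over.
    """
    entries = {**shared, **model}  # model-specific entries override shared ones
    names = list(entries)
    choices = [entries[n]["range"] or [entries[n]["default"]] for n in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*choices)]


# Hypothetical sections mirroring the "shared" and "ModelName1" blocks of doe.json.
shared = {
    "batch_size": {"type": "<class 'int'>", "default": 16, "range": [16, 32]},
    "epochs": {"type": "<class 'int'>", "default": 100, "range": []},
}
model = {
    "name": {"type": "<class 'str'>", "default": "ModelName1", "range": []},
}

cases = expand_cases(shared, model)
for case in cases:
    print(case)
```

Under these assumptions, sweeping `batch_size` over two values while leaving the other entries at their defaults yields two cases, each of which would correspond to one generated shell script.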



#### Training Log Visualization with `ivst-vsl`

The visualization tool reads data from `.investigation/commithash/storage/*.csv`:

```bash
# Start the visualization server
poetry run ivst-vsl

# Or specify custom port
poetry run ivst-vsl --port 8080
```

Then open your browser at `http://127.0.0.1:5000`, or at the port you passed with `--port`.

Both `ivst` and `ivst-vsl` are terminal tools that operate on data in `workdir/.investigation`. The visualization tool specifically looks for CSV files matching `workdir/.investigation/commithash/storage/*.csv`.
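
The storage layout can be explored with a few lines of standard-library Python. This is only a sketch of the directory convention described here, not the code `ivst-vsl` runs; `find_training_logs` is a hypothetical helper, and the demo builds a throwaway directory with a made-up commit hash.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_training_logs(workdir: Path) -> Dict[str, List[Path]]:
    """Group CSV logs by commit directory under ``workdir/.investigation``."""
    root = workdir / ".investigation"
    logs: Dict[str, List[Path]] = {}
    for csv_path in sorted(root.glob("*/storage/*.csv")):
        # Layout: .investigation/<commit>/storage/<file>.csv
        commit = csv_path.parent.parent.name
        logs.setdefault(commit, []).append(csv_path)
    return logs


# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / ".investigation" / "abc1234" / "storage"
    storage.mkdir(parents=True)
    (storage / "run_0.csv").write_text("step,loss\n0,1.0\n")
    found = find_training_logs(Path(tmp))
    print({commit: [p.name for p in paths] for commit, paths in found.items()})
```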

#### Features:
- 📊 **Real-time Visualization**: Automatically refresh and display latest training data
- 🔍 **Instance Filtering**: Filter training instances using regex patterns
- 📈 **Multi-Instance Comparison**: Display curves from multiple training instances simultaneously
- 🎯 **Data Aggregation**: Aggregate multiple instances with mean, min, or max
- 🎨 **Interactive Charts**: Modern interactive charts powered by Chart.js
- 🌐 **Bilingual Interface**: Supports both Chinese and English
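
As an illustration of regex-based instance filtering, here is a minimal sketch. The instance names are invented, `filter_instances` is a hypothetical helper, and whether the tool matches anywhere in the name (as `re.search` does here) or requires a full match is an assumption.

```python
import re
from typing import Iterable, List


def filter_instances(names: Iterable[str], pattern: str) -> List[str]:
    """Return the instance names that contain a match for ``pattern``."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical instance names, for illustration only.
instances = ["ModelName1_seed0", "ModelName1_seed1", "ModelName2_seed0"]
print(filter_instances(instances, r"ModelName1_seed\d+"))
print(filter_instances(instances, r"seed0$"))
```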

For detailed usage instructions, see [Visualization Guide](docs/VISUALIZATION_GUIDE.md).

### Nested Dataclass Support

Investigation now supports hierarchical dataclass structures, similar to Hydra.cc configurations:

```python
from dataclasses import dataclass, field
from investigation.doeargs.args import ExampleArgs

@dataclass
class DataConfig:
    data_path: str = "./data/"
    batch_size: int = 16

@dataclass
class TrainingConfig:
    epochs: int = 100
    learning_rate: float = 0.001

@dataclass
class Args(ExampleArgs):
    name: str = "MyExperiment"
    data: DataConfig = field(default_factory=DataConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
```

Nested fields are automatically flattened using dot notation (`data.batch_size`, `training.epochs`) and work seamlessly with all Investigation features.
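
To show what dot-notation flattening means in practice, the sketch below walks a nested dataclass recursively. It is a standalone illustration rather than Investigation's implementation: `flatten` is a hypothetical helper, and `Args` is declared without the `ExampleArgs` base class so that the snippet runs without the library installed.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class DataConfig:
    data_path: str = "./data/"
    batch_size: int = 16


@dataclass
class TrainingConfig:
    epochs: int = 100
    learning_rate: float = 0.001


@dataclass
class Args:  # the real class would inherit from ExampleArgs
    name: str = "MyExperiment"
    data: DataConfig = field(default_factory=DataConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten a dataclass instance into {"outer.inner": value} pairs."""
    flat: Dict[str, Any] = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        key = f"{prefix}{f.name}"
        if is_dataclass(value) and not isinstance(value, type):
            flat.update(flatten(value, prefix=f"{key}."))  # recurse into nested config
        else:
            flat[key] = value
    return flat


for key, value in flatten(Args()).items():
    print(f"{key} = {value!r}")
```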

For more details, see [Nested Dataclass Support Guide](docs/NESTED_DATACLASS_SUPPORT.md).

### CSV Log Format

The visualization tool expects CSV files with one column per training metric. Example header row:

```csv
policy_loss,log_pi,cgqp,q_loss,qf1_loss,qf2_loss,cnqp,cnqn,-qf_res1_min,qf_res1_max,-qf_res2_min,qf_res2_max,episodic_return,episodic_length,last_step_reward,goal_reward,test_goal_reward,test_episodic_return,test_last_step_reward,SPS,global_step,buffer_filled,epoch,step
```

Key columns:
- `global_step` or `step`: X-axis (training steps)
- Other numeric columns: Selectable Y-axis metrics
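
The column roles above can be sketched with the standard `csv` module. This is an assumption-laden illustration, not the tool's parser: `load_metric` is a hypothetical helper, preferring `global_step` over `step` when both exist is a guess, and rows with missing or non-numeric cells are simply skipped.

```python
import csv
import io
from typing import List, Tuple


def load_metric(csv_text: str, metric: str) -> Tuple[str, List[float], List[float]]:
    """Return (x_column, x_values, y_values) for one metric of a CSV log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    # Assumption: prefer global_step and fall back to step.
    x_column = "global_step" if "global_step" in columns else "step"
    xs: List[float] = []
    ys: List[float] = []
    for row in reader:
        try:
            x, y = float(row[x_column]), float(row[metric])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric cells
        xs.append(x)
        ys.append(y)
    return x_column, xs, ys


# A tiny made-up log; the second row has no episodic_return value.
log = "episodic_return,global_step,step\n1.5,100,1\n,200,2\n3.0,300,3\n"
x_name, steps, returns = load_metric(log, "episodic_return")
print(x_name, steps, returns)
```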

## Project Structure

```
Investigation/
├── investigation/      # Core experiment management
├── visualization/      # Web-based visualization system
│   ├── app.py          # Flask application
│   ├── cli.py          # Command-line interface
│   ├── templates/      # HTML templates
│   └── README.md       # Visualization documentation
├── docs/               # Documentation
│   └── VISUALIZATION_GUIDE.md
└── .investigation/     # Default storage for training logs
```

## Usage Examples

### Basic Visualization

1. Start the server: `python -m visualization.cli`
2. Select a training instance from the left panel
3. Choose a metric to visualize (e.g., `episodic_return`)
4. View the training curve on the right

### Comparing Multiple Experiments

1. Check multiple training instances in the left panel
2. Select a metric to compare
3. The chart displays all selected instances with different colors

### Aggregating Experiments

1. Select multiple instances
2. Choose aggregation method (Mean/Min/Max)
3. Click "聚合选中实例" (Aggregate Selected Instances)
4. View the aggregated curve
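
Conceptually, aggregation reduces several curves to one value per step. The following sketch shows one way to do that; it is not the server's implementation. `aggregate` is a hypothetical helper, and aligning runs on the steps they all share is an assumption about how differing run lengths are handled.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "min": min,
    "max": max,
}


def aggregate(curves: List[Dict[int, float]], method: str = "mean") -> Dict[int, float]:
    """Aggregate several {step: value} curves at every step they all share."""
    if not curves:
        return {}
    reduce_fn = AGGREGATORS[method]
    shared_steps = set(curves[0]).intersection(*curves[1:])
    return {step: reduce_fn([c[step] for c in curves]) for step in sorted(shared_steps)}


# Two made-up runs; run_b stopped earlier, so step 200 is dropped.
run_a = {0: 1.0, 100: 2.0, 200: 4.0}
run_b = {0: 3.0, 100: 4.0}
print(aggregate([run_a, run_b], "mean"))
print(aggregate([run_a, run_b], "max"))
```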

## API Endpoints

The visualization module provides REST API endpoints:

- `GET /api/instances` - List all training instances
- `GET /api/data/<path>` - Get data for specific instance
- `POST /api/filter` - Filter instances by pattern
- `POST /api/aggregate` - Aggregate multiple instances

See [Visualization README](visualization/README.md) for full API documentation.
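
For scripted access, the two `GET` endpoints can be reached with the standard library. The sketch below only builds URLs and defines a fetch helper. The base URL assumes the default local server, the shape of `<path>` (a relative instance path) and the JSON response format are assumptions, and the `POST` endpoints are omitted because their request bodies are not described here.

```python
import json
from typing import Any
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000"  # assumed default address of the local server


def instances_url(base: str = BASE_URL) -> str:
    """URL of the endpoint that lists all training instances."""
    return urljoin(base, "/api/instances")


def data_url(instance_path: str, base: str = BASE_URL) -> str:
    """URL of the data endpoint for one instance (slashes are kept as-is)."""
    return urljoin(base, "/api/data/" + quote(instance_path))


def fetch_json(url: str, timeout: float = 5.0) -> Any:
    """GET a URL and decode the body as JSON (assumes the API returns JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building URLs needs no running server; the instance path is hypothetical.
print(instances_url())
print(data_url("abc1234/storage/run 0.csv"))
```

With the server running, `fetch_json(instances_url())` would return whatever structure the API actually provides; see the linked README for the authoritative schema.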

## Development

```bash
# Install dependencies
poetry install

# Run with debug mode
python -m visualization.cli --debug

# Run tests (if available)
pytest
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Author

Yuxiang Luo <yuxiang.lll@outlook.com>

## Acknowledgments

- Built with Flask for the backend
- Chart.js for interactive visualizations
- Designed for machine learning experiment tracking and analysis

