Metadata-Version: 2.1
Name: irisml
Version: 0.0.32
Summary: Simple ML pipeline platform
Home-page: https://github.com/microsoft/irisml
Author: irisdev
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

# IrisML

A proof of concept for a simple framework for creating ML pipelines.


# Features
- Run ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.

# Getting started
## Installation
Prerequisite: Python 3.8+

```
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
```

## Run an example job
```
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision

# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
```

## Available commands
```
# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]

# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]

# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
```

## Pipeline definition
```
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}

TaskDefinition = {
    "task": <task module name>,
    "name": <optional unique name of the task>,
    "inputs": <list of input objects>,
    "config": <config for the task. Use irisml_show command to find the available configurations.>
}
```
In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name>
  This variable will be replaced by the environment variable that was provided as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name>
  This variable will be replaced by the output of the specified previous task.

An exception is raised at runtime if the specified variable is not found.

If a task raises an exception, the tasks specified in the on_error field will be executed. The exception object will be assigned to the "$env.IRISML_EXCEPTION" variable.
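
The variable substitution described in this section can be illustrated with a small stand-alone sketch. This is not IrisML's implementation: the `resolve` function, the config keys, and the `create_model` task name are hypothetical and exist only to show how `$env.<variable_name>` and `$outputs.<task_name>.<field_name>` references might be replaced, and how a missing variable leads to an exception.

```python
from typing import Any, Dict


def resolve(value: Any, env: Dict[str, str], outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Recursively replace $env.* and $outputs.*.* strings in a JSON-like value.

    Hypothetical helper for illustration; IrisML's own resolver may differ.
    """
    if isinstance(value, dict):
        return {k: resolve(v, env, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env, outputs) for v in value]
    if isinstance(value, str) and value.startswith("$env."):
        name = value[len("$env."):]
        if name not in env:
            raise KeyError(f"Environment variable not found: {name}")
        return env[name]
    if isinstance(value, str) and value.startswith("$outputs."):
        parts = value.split(".")
        if len(parts) != 3:
            raise ValueError(f"Expected $outputs.<task_name>.<field_name>: {value}")
        _, task_name, field_name = parts
        try:
            return outputs[task_name][field_name]
        except KeyError:
            raise KeyError(f"Output not found: {value}") from None
    return value  # Plain values are passed through unchanged.


# A made-up config fragment; the key names are placeholders.
config = {
    "dataset_name": "$env.DATASET_NAME",
    "model": "$outputs.create_model.model",
    "epochs": 3,
}
resolved = resolve(
    config,
    env={"DATASET_NAME": "mnist"},
    outputs={"create_model": {"model": "<model object>"}},
)
print(resolved)
```

In this sketch, a reference to an unknown variable such as `$env.MISSING` raises `KeyError`, mirroring the runtime exception mentioned above.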

## Pipeline cache
With caching, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs and configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it can download the cached outputs and skip the task execution.

To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently Azure Blob Storage and the local filesystem are supported.

To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
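
The hash-based lookup can be sketched as follows. This is a simplified illustration, not IrisML's actual hashing scheme: it assumes inputs and configs are JSON-serializable, uses an in-memory dict in place of a real cache storage, and includes the task version in the hash as an assumption. The names `task_hash`, `run_with_cache`, and `multiply` are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


def task_hash(task_name: str, version: str, inputs: Dict[str, Any], config: Dict[str, Any]) -> str:
    """Derive a deterministic key from the task identity, inputs, and config."""
    payload = json.dumps(
        {"task": task_name, "version": version, "inputs": inputs, "config": config},
        sort_keys=True,  # Key order must not affect the hash.
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(
    storage: Dict[str, Any],
    task_name: str,
    version: str,
    inputs: Dict[str, Any],
    config: Dict[str, Any],
    execute: Callable[[Dict[str, Any], Dict[str, Any]], Any],
) -> Tuple[Any, bool]:
    """Return (outputs, cache_hit). The task runs only on a cache miss."""
    key = task_hash(task_name, version, inputs, config)
    if key in storage:
        return storage[key], True
    outputs = execute(inputs, config)
    storage[key] = outputs  # "Upload" the outputs for later runs.
    return outputs, False


def multiply(inputs: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, float]:
    # Stand-in for a real task's execute() method.
    return {"float_value": inputs["int_value"] * config["another_float"]}


storage: Dict[str, Any] = {}
first = run_with_cache(storage, "my_custom_task", "1.0.0", {"int_value": 2}, {"another_float": 1.5}, multiply)
second = run_with_cache(storage, "my_custom_task", "1.0.0", {"int_value": 2}, {"another_float": 1.5}, multiply)
changed = run_with_cache(storage, "my_custom_task", "1.0.0", {"int_value": 2}, {"another_float": 2.0}, multiply)
print(first, second, changed)
```

Here the second call is served from the cache because its inputs and config are identical to the first, while the third call runs again because the config changed.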

## Python API
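To run a pipeline from Python code, you can use the following API.
```python
import json
import pathlib
from irisml.core import JobRunner

# Load the pipeline definition from a JSON file.
job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)

# Run the pipeline; the dict supplies the values for $env.<variable_name>.
runner.run({'DATASET_NAME': 'mnist'})

# The same runner can be reused with different environment variables.
runner.run({'DATASET_NAME': 'cifar10'})
```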

# Available official tasks

To show the detailed help for each task, run the following command after installing the package.
```
irisml_show <task_name>
```

## [irisml-tasks](https://github.com/microsoft/irisml-tasks)
| Task                                     | Description                                                    |
| ---------------------------------------- | -------------------------------------------------------------- |
| assertion                                | Assert the given input.                                        |
| assign_class_to_strings                  | Assigns a class to a string based on the class name being present in the string. |
| branch                                   | 'If' conditional branch.                                       |
| calculate_cosine_similarity              | Calculate cosine similarity between two sets of vectors.       |
| check_model_parameters                   | Check Inf/NaN values in model parameters.                      |
| compare                                  | Compare two values.                                            |
| deserialize_tensor                       | Deserialize a pytorch tensor.                                  |
| divide_float                             | Floating point division.                                       |
| download_azure_blob                      | Download a single blob from Azure Blob Storage.                |
| extract_image_bytes_from_dataset         | Extract images from a dataset and convert them to bytes.       |
| get_current_time                         | Get the current time in seconds since the epoch.               |
| get_dataset_split                        | Get a train/val split of a dataset.                            |
| get_dataset_stats                        | Get statistics of a dataset.                                   |
| get_dataset_subset                       | Get a subset of a dataset.                                     |
| get_fake_image_classification_dataset    | Generate a fake image classification dataset.                  |
| get_fake_object_detection_dataset        | Generate a fake object detection dataset.                      |
| get_int_from_json_strings                | Get an integer from a JSON string.                             |
| get_item                                 | Get an item from the given list.                               |
| get_kfold_cross_validation_dataset       | Get train/test dataset for k-fold cross validation.            |
| get_secret_from_azure_keyvault           | Get a secret from Azure KeyVault.                              |
| get_topk                                 | Get the largest Topk values and indices.                       |
| join_filepath                            | Join a given dir_path and a filename.                          |
| load_state_dict                          | Load a state_dict from various sources.                        |
| make_cached_dataset                      | Save dataset cache on disk.                                    |
| make_prompt_for_each_string              | Make a prompt for each string.                                 |
| make_prompt_with_strings                 | Make a prompt with a list of strings.                          |
| pickling_object                          | Pickling an object.                                            |
| print                                    | Print or Pretty Print the input object.                        |
| print_environment_info                   | Print various environment information to stdout/stderr.        |
| read_file                                | Reads a file and returns its contents as bytes.                |
| repeat_tasks                             | Repeat the given tasks for multiple times.                     |
| run_parallel                             | Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name. |
| run_profiler                             | Run profiler on the given tasks.                               |
| run_sequential                           | Run the given tasks in sequence. Each task must have a unique name. |
| save_file                                | Save the given input binary to a file.                         |
| save_images_from_dataset                 | Save images from a dataset to disk.                            |
| save_state_dict                          | Save the model's state_dict to the specified file.             |
| search_grid_sequential                   | Grid search hyperparameters. Tasks are run in sequence.        |
| serialize_tensor                         | Serialize a pytorch tensor.                                    |
| switch_pick                              | Pick from vals based on conditions. The task returns the first val whose condition is True. |
| upload_azure_blob                        | Upload a binary file to Azure Storage Blob.                    |


## [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)

This package contains tasks related to PyTorch training.

| Task                                     | Description                                                    |
| ---------------------------------------- | -------------------------------------------------------------- |
| append_classifier                        | Append a classifier model to a given model. A predictor and a loss module will be added, too. |
| benchmark_dataset                        | Benchmark dataset loading and preprocessing.                   |
| benchmark_model                          | Benchmark a given model using a given dataset.                 |
| benchmark_model_with_grad_cache          | Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching. |
| build_classification_prompt_dataset      | Create a classification prompt dataset.                        |
| build_zero_shot_classifier               | Create a zero-shot classification layer.                       |
| concatenate_datasets                     | Concatenate the given two datasets together.                   |
| create_classification_prompt_generator   | Create a prompt generator for a classification task.           |
| evaluate_accuracy                        | Calculate accuracy of the given prediction results.            |
| evaluate_detection_average_precision     | Calculate mean average precision for object detection task results. |
| exclude_negative_samples_from_classification_dataset | Exclude negative samples from classification dataset.          |
| export_onnx                              | Export the given model as ONNX.                                |
| get_targets_from_dataset                 | Extract only targets from a given Dataset.                     |
| load_simple_classification_dataset       | Load a simple classification dataset from a directory of images and an index file. |
| make_classification_dataset_from_object_detection | Convert an object detection dataset into a classification dataset. |
| make_classification_dataset_from_predictions | Make a classification dataset from predictions.                |
| make_feature_extractor_model             | Make a wrapper model to extract a feature vector from a vision model. |
| make_fixed_prompt_image_transform        | Make a transform function for image and a fixed prompt.        |
| make_image_text_contrastive_model        | Make a model for image-text contrastive training.              |
| make_image_text_transform                | Make a transform function for image-text classification.       |
| make_oversampled_dataset                 | Make an oversampled dataset.                                   |
| num_iters_to_epochs                      | Convert number of iterations to number of epochs. Min value is 1. |
| predict                                  | Predict using a given model.                                   |
| remove_empty_images_from_dataset         | Remove empty images from dataset.                              |
| sample_few_shot_dataset                  | Few-shot sampling of an IC/OD dataset.                         |
| split_image_text_model                   | Split an image-text model into an image model and a text model. |
| train                                    | Train a pytorch model.                                         |
| train_with_gradient_cache                | Train a model using gradient cache. Useful for contrastive learning with a large model. |


## [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
Adapter tasks for torchvision library.

| Task                                     | Description                                                    |
| ---------------------------------------- | -------------------------------------------------------------- |
| create_torchvision_model                 | Create a torchvision model.                                    |
| create_torchvision_transform             | Create transform objects in torchvision library.               |
| load_torchvision_dataset                 | Load a dataset from torchvision package.                       |

## [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
Adapter tasks for the HuggingFace transformers library.

| Task | Description |
| ---- | ----------- |
| create_transformers_model | Create a model using the transformers library. |
| create_transformers_tokenizer | Create a tokenizer using the transformers library. |

## [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
Adapter tasks for models in the timm library.

| Task | Description |
| ---- | ----------- |
| create_timm_model | Create a model using the timm library. |
| create_timm_transform | Create a preprocessing function using the timm library. |

## [irisml-tasks-onnx](https://github.com/microsoft/irisml-tasks-onnx)
Adapter tasks for the OnnxRuntime library.

| Task | Description |
| ---- | ----------- |
| predict_onnx | Run inference for an ONNX model. |

## [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
| Task | Description |
| ---- | ----------- |
| run_azureml_child | Run tasks as a new child AzureML Run. |
| add_aml_tag | Tag the AML Run with a string key and optional value. |

## [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
| Task | Description |
| ---- | ----------- |
| launch_fiftyone | Launch a fiftyone interface. |

# Development
## Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
```python
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Illustrative nested config; the name and field are placeholders.
  child_value: int = 0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
  VERSION = '1.0.0'
  CACHE_ENABLED = True  # (default: True) This is optional.

  @dataclasses.dataclass
  class Inputs:  # You can remove this class if the task doesn't require inputs.
    int_value: int
    float_value: float

  @dataclasses.dataclass
  class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
    another_float: float
    child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

  @dataclasses.dataclass
  class Outputs:  # Can be removed if the task doesn't have outputs.
    float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

  def execute(self, inputs: Inputs) -> Outputs:
    return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

  def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
    return self.Outputs(0)  # Must return immediately without actual processing.
```

Each Task must define an "execute" method. The base class has empty implementations for Inputs, Config, Outputs, and dry_run(). For details, see the documentation for the TaskBase class.
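
The example notes that Outputs fields need a default value or default factory when dry_run() is not implemented. A plausible reading, stated here as an assumption, is that the framework then has to construct an Outputs instance without arguments. The following stand-alone sketch uses only the standard `dataclasses` module to check a class for fields lacking defaults; `fields_without_defaults`, `GoodOutputs`, and `BadOutputs` are hypothetical names and are not part of IrisML.

```python
import dataclasses
from typing import List


def fields_without_defaults(cls) -> List[str]:
    """Return the names of dataclass fields with neither a default nor a default_factory."""
    return [
        f.name
        for f in dataclasses.fields(cls)
        if f.default is dataclasses.MISSING and f.default_factory is dataclasses.MISSING
    ]


@dataclasses.dataclass
class GoodOutputs:
    float_value: float = 0.0
    items: list = dataclasses.field(default_factory=list)


@dataclasses.dataclass
class BadOutputs:
    float_value: float  # No default: BadOutputs() would raise TypeError.


print(fields_without_defaults(GoodOutputs))  # []
print(fields_without_defaults(BadOutputs))   # ['float_value']
```

A check like this could be run in a unit test for a custom task to catch an Outputs class that cannot be instantiated without arguments.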

# Related repositories
- [irisml-tasks](https://github.com/microsoft/irisml-tasks)
- [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)
- [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
- [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
- [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
- [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
- [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
