Metadata-Version: 2.1
Name: irisml
Version: 0.0.38
Summary: Simple ML pipeline platform
Home-page: https://github.com/microsoft/irisml
Author: irisdev
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: azure-identity
Requires-Dist: azure-storage-blob<13
Requires-Dist: numpy
Requires-Dist: tenacity
Requires-Dist: torch

# IrisML

A proof of concept for a simple framework for building ML pipelines.


# Features
- Run ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.

# Getting started
## Installation
Prerequisite: Python 3.8+

```
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
```

## Run an example job
```
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision

# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
```

## Available commands
```
# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]

# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]

# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
```

## Pipeline definition
```
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}

TaskDefinition = {
    "task": <task module name>,
    "name": <optional unique name of the task>,
    "inputs": <list of input objects>,
    "config": <config for the task. Use irisml_show command to find the available configurations.>
}
```
In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name>
  This variable will be replaced by the environment variable that was provided as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name>
  This variable will be replaced by the output of the specified previous task.

An exception is raised at runtime if a specified variable is not found.

If a task raises an exception, the tasks specified in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
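
The variable substitution described above can be sketched in plain Python. This is only an illustration of the idea and not IrisML's actual implementation: the `resolve` function, the config keys, and the placeholder values are all hypothetical, and the variable syntax simply mirrors the notation used in this section.

```python
import re

# Matches strings such as "$env.NAME" or "$outputs.task_name.field_name".
_VARIABLE = re.compile(r"^\$(env|outputs)\.(.+)$")


def resolve(value, env, outputs):
    """Recursively replace variable strings in a config-like structure.

    Hypothetical helper for illustration; IrisML's own resolver may differ.
    """
    if isinstance(value, dict):
        return {k: resolve(v, env, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env, outputs) for v in value]
    if not isinstance(value, str):
        return value

    match = _VARIABLE.match(value)
    if match is None:
        return value  # A plain string, not a variable.

    kind, path = match.groups()
    if kind == "env":
        if path not in env:
            raise KeyError(f"Environment variable not found: {path}")
        return env[path]

    task_name, _, field_name = path.partition(".")
    try:
        return outputs[task_name][field_name]
    except KeyError:
        raise KeyError(f"Task output not found: {path}") from None


# Hypothetical config fragment and values.
config = {"dataset": "$env.DATASET_NAME", "lr": 0.1, "model": "$outputs.create_model.model"}
resolved = resolve(
    config,
    env={"DATASET_NAME": "mnist"},
    outputs={"create_model": {"model": "<model object>"}},
)
print(resolved)
```

A missing name raises `KeyError` in this sketch, which corresponds to the runtime exception mentioned above.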

## Patch definition (Experimental)
```
PatchesDefinition = {"patches": List[PatchDefinition], "patches_on_error": List[PatchDefinition]}  # At least one of the fields must be specified.

PatchDefinition = {  # One of the filtering conditions and one of the actions must be specified.
    # Filtering conditions
    "match": List[MatchCondition],
    "match_if_exists": List[MatchCondition],  # Matches the task if it exists. If not, the patch will be ignored.
    "match_oneof": List[MatchCondition],  # Matches the first task that matches one of the conditions.
    "top": bool,  # Matches the top of the pipeline. Used with "insert" action.
    "bottom": bool,  # Matches the bottom of the pipeline. Used with "insert" action.

    # Actions
    "insert": List[TaskDefinition],
    "remove": bool,
    "replace": Tuple[List[TaskDefinition], Dict[str, str]], # The second element is a mapping from the old output name to the new output name. All "$output" variables will be replaced by the new output name.
    "update": TaskDefinition
}

MatchCondition = {  # All fields are optional.
    "task": str,
    "name": str,
    "config": Dict[str, Any]
}
```

The available actions are as follows:
- insert: Insert the specified tasks after the matched task.
- remove: Remove the matched task.
- replace: Replace the matched task with the specified tasks.
- update: Update the matched task with the given configuration.

Note that the patch command doesn't guarantee the correctness of the patched pipeline. It is recommended to validate the patched pipeline.
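
As a rough illustration of how a patch with "match_oneof" and an "insert" or "remove" action might behave, here is a minimal sketch in plain Python. It is not IrisML's patch implementation: `matches` and `apply_patch` are hypothetical helpers, and the sketch assumes that "first task" means the first task in pipeline order and that a "config" condition matches when every listed key has an equal value.

```python
import copy


def matches(task, condition):
    """Return True if every field present in the condition matches the task."""
    for key in ("task", "name"):
        if key in condition and task.get(key) != condition[key]:
            return False
    # Assumption: each key listed under "config" must be equal in the task's config.
    task_config = task.get("config", {})
    return all(task_config.get(k) == v for k, v in condition.get("config", {}).items())


def apply_patch(tasks, patch):
    """Apply one "match_oneof" patch with an "insert" or "remove" action."""
    tasks = copy.deepcopy(tasks)  # Leave the original pipeline untouched.
    conditions = patch["match_oneof"]
    index = next(
        (i for i, task in enumerate(tasks) if any(matches(task, c) for c in conditions)),
        None,
    )
    if index is None:
        raise ValueError("No task matched the patch.")
    if patch.get("remove"):
        del tasks[index]
    elif "insert" in patch:
        # Insert the new tasks right after the matched task.
        tasks[index + 1:index + 1] = copy.deepcopy(patch["insert"])
    return tasks


pipeline_tasks = [
    {"task": "create_torchvision_model", "name": "model"},
    {"task": "train", "name": "train"},
]
with_print = apply_patch(
    pipeline_tasks, {"match_oneof": [{"task": "train"}], "insert": [{"task": "print"}]}
)
without_train = apply_patch(with_print, {"match_oneof": [{"name": "train"}], "remove": True})
print([t["task"] for t in with_print])
print([t["task"] for t in without_train])
```

Because the sketch copies the task list before modifying it, the original pipeline can still be compared against the patched one when validating the result.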

## Pipeline cache
With the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs/configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.

To enable the cache, specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.

To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
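
The hash-based lookup described in this section can be sketched as follows. This is a simplified illustration, not IrisML's actual hashing scheme or storage layout: `compute_cache_key`, `run_with_cache`, the in-memory `cache` dict, and the `num_epochs` config key are all assumptions made for the example.

```python
import hashlib
import json


def compute_cache_key(task_name, task_version, config, input_hashes):
    """Derive a deterministic key from the task identity, its config, and its input hashes."""
    payload = json.dumps(
        {"task": task_name, "version": task_version, "config": config, "inputs": input_hashes},
        sort_keys=True,  # Stable key order so equal configs give equal hashes.
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # Stands in for the cache storage (Azure Blob Storage or local filesystem).


def run_with_cache(task_name, task_version, config, input_hashes, execute):
    """Return (outputs, cache_hit). The task is executed only on a cache miss."""
    key = compute_cache_key(task_name, task_version, config, input_hashes)
    if key in cache:
        return cache[key], True
    outputs = execute()
    cache[key] = outputs
    return outputs, False


first = run_with_cache("train", "1.0.0", {"num_epochs": 1}, ["abc"], lambda: "trained-model")
second = run_with_cache("train", "1.0.0", {"num_epochs": 1}, ["abc"], lambda: "trained-model")
changed = run_with_cache("train", "1.0.0", {"num_epochs": 2}, ["abc"], lambda: "trained-model")
print(first[1], second[1], changed[1])
```

In this sketch, the second call is served from the cache, while changing the config produces a different key and forces a new execution.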

## Python API
To run a pipeline from Python code, you can use the following APIs.
```python
import json
import pathlib
from irisml.core import JobRunner

job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)

runner.run({'DATASET_NAME': 'mnist'})

runner.run({'DATASET_NAME': 'cifar10'})
```

# Available official tasks

To show the detailed help for each task, run the following command after installing the package.
```
irisml_show <task_name>
```

## [irisml-tasks](https://github.com/microsoft/irisml-tasks)

| Task                                       | Description |
| ------------------------------------------ | ----------- |
| assertion                                  | Assert the given input. |
| assign_class_to_strings                    | Assigns a class to a string based on the class name being present in the string. |
| branch                                     | 'If' conditional branch. |
| calculate_cosine_similarity                | Calculate cosine similarity between two sets of vectors. |
| check_model_parameters                     | Check Inf/NaN values in model parameters. |
| compare                                    | Compare two values. |
| compare_ints                               | Compare two int values. |
| convert_detection_to_multilabel            | Convert targets or predictions of object detection to multilabel. |
| convert_string_to_string_list              | Convert a string to a list of strings. |
| deserialize_tensor                         | Deserialize a pytorch tensor. |
| divide_float                               | Floating point division. |
| download_azure_blob                        | Download a single blob from Azure Blob Storage. |
| emulate_fp8_quantization                   | Emulate FP8 quantization. |
| extract_image_bytes_from_dataset           | Extract images from a dataset and convert them to bytes. |
| get_current_time                           | Get the current time in seconds since the epoch. |
| get_dataset_split                          | Get a train/val split of a dataset. |
| get_dataset_stats                          | Get statistics of a dataset. |
| get_dataset_subset                         | Get a subset of a dataset. |
| get_fake_image_classification_dataset      | Generate a fake image classification dataset. |
| get_fake_image_text_classification_dataset | Generate a fake image-text classification dataset. |
| get_fake_object_detection_dataset          | Generate a fake object detection dataset. |
| get_fake_phrase_grounding_dataset          | Generate a fake phrase grounding dataset. |
| get_fake_visual_question_answering_dataset | Generate a fake visual question answering dataset. |
| get_int_from_json_strings                  | Get an integer from a JSON string. |
| get_int_list_from_json_strings             | Get a list of ints from a JSON string. |
| get_item                                   | Get an item from the given list. |
| get_key_and_int_list_from_json_string      | Parse a JSON string and return a list of keys and a list of lists of ints. |
| get_kfold_cross_validation_dataset         | Get train/test dataset for k-fold cross validation. |
| get_secret_from_azure_keyvault             | Get a secret from Azure KeyVault. |
| get_topk                                   | Get the largest Topk values and indices. |
| join_filepath                              | Join a given dir_path and a filename. |
| join_two_strings                           | Join two strings to one string. |
| load_coco_detections                       | Load coco detections from a JSON to a list of tensors. |
| load_float_tensor_jsonl                    | Load a 2D float tensor from a JSONL file. |
| load_state_dict                            | Load a state_dict from various sources. |
| load_str_list_jsonl                        | Load a list of strings from a JSONL file. |
| load_strs_from_json_file                   | Load strings from a JSON file. |
| load_tensor_list                           | Load a list of tensors from file. |
| make_cached_dataset                        | Save dataset cache on disk. |
| make_prompt_for_each_string                | Make a prompt for each string. |
| make_prompt_list_with_strings              | Make a list of prompts from a template and a list of strings. |
| make_prompt_with_strings                   | Make a prompt with a list of strings. |
| make_random_choice_text_transform          | Make a text transform function that randomly chooses one of the substrings separated by the delimiter. |
| make_text_transform                        | Make a text transform function. |
| map_int_list                               | Map a list of integers to a list of integers. |
| pickling_object                            | Pickling an object. |
| print                                      | Print or Pretty Print the input object. |
| print_environment_info                     | Print various environment information to stdout/stderr. |
| read_file                                  | Reads a file and returns its contents as bytes. |
| repeat_tasks                               | Repeat the given tasks for multiple times. |
| run_parallel                               | Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name. |
| run_profiler                               | Run profiler on the given tasks. |
| run_sequential                             | Run the given tasks in sequence. Each task must have a unique name. |
| save_file                                  | Save the given input binary to a file. |
| save_float_tensor_jsonl                    | Save a 2D float tensor to a JSONL file. |
| save_images_from_dataset                   | Save images from a dataset to disk. |
| save_jit_model                             | Save an offline version of a pytorch model. torch.jit.save() |
| save_state_dict                            | Save the model's state_dict to the specified file. |
| save_str_list_jsonl                        | Save a list of strings to a JSONL file. |
| search_grid_sequential                     | Grid search hyperparameters. Tasks are run in sequence. |
| serialize_tensor                           | Serialize a pytorch tensor. |
| split_string                               | Split string to a list of strings. |
| switch_pick                                | Pick from vals based on conditions. The task returns the first val whose condition is True. |
| upload_azure_blob                          | Upload a binary file to Azure Storage Blob. |
| upload_azure_blob_directory                | Upload a directory to Azure Blob Storage. |

## [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)

This package contains tasks related to pytorch training.

| Task                                                     | Description |
| -------------------------------------------------------- | ----------- |
| append_classifier                                        | Append a classifier model to a given model. A predictor and a loss module will be added, too. |
| benchmark_dataset                                        | Benchmark dataset loading and preprocessing. |
| benchmark_model                                          | Benchmark a given model using a given dataset. |
| benchmark_model_with_grad_cache                          | Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching. |
| build_classification_prompt_dataset                      | Create a classification prompt dataset. |
| build_zero_shot_classifier                               | Create a zero-shot classification layer. |
| concatenate_datasets                                     | Concatenate the given two datasets together. |
| convert_vqa_dataset_to_image_text_classification_dataset | Convert VQA dataset to image text classification dataset. |
| create_classification_prompt_generator                   | Create a prompt generator for a classification task. |
| create_prompt_generator                                  | Create a prompt generator that returns a list of prompts for a given label. |
| evaluate_accuracy                                        | Calculate accuracy of the given prediction results. |
| evaluate_captioning                                      | Evaluate captioning prediction results. |
| evaluate_detection_average_precision                     | Calculate mean average precision for object detection task results. |
| evaluate_phrase_grounding                                | Calculate precision/recall for phrase grounding. |
| evaluate_phrase_grounding_recall                         | Calculate recall for phrase grounding. |
| evaluate_string_matching_accuracy                        | Calculate accuracy of string matching. |
| exclude_negative_samples_from_classification_dataset     | Exclude negative samples from classification dataset. |
| export_coco_from_torch_dataset                           | Export coco dataset from a given torch dataset. Support IC and OD only. |
| export_onnx                                              | Export the given model as ONNX. |
| extract_val_by_key_from_jsonl                            | Extract value for each entry in a JSONL by a key. |
| find_incorrect_classification_indices                    | Find incorrect classification indices. |
| find_incorrect_classification_multilabel_indices         | Find incorrect classification indices for multilabel classification. |
| flatten_captioning_dataset                               | Flatten a captioning dataset with multiple targets per image into a dataset with a single target per image. |
| get_questions_from_vqa_dataset                           | Extracts questions from a VQA dataset. |
| get_subclass_dataset                                     | Get the sub-dataset with given class ids from a dataset. |
| get_targets_from_dataset                                 | Extract only targets from a given Dataset. |
| load_jsonl_vqa_dataset                                   | Load a VQA dataset from a jsonl file. |
| load_simple_classification_dataset                       | Load a simple classification dataset from a directory of images and an index file. |
| make_classification_dataset_from_object_detection        | Convert an object detection dataset into a classification dataset. |
| make_classification_dataset_from_predictions             | Make a classification dataset from predictions. |
| make_detection_dataset_from_predictions                  | Make a detection dataset from predictions. |
| make_feature_extractor_model                             | Make a wrapper model to extract a feature vector from a vision model. |
| make_fixed_prompt_image_transform                        | Make a transform function for image and a fixed prompt. |
| make_fixed_text_dataset                                  | Create a dataset with a list of strings. |
| make_image_text_contrastive_model                        | Make a model for image-text contrastive training. |
| make_image_text_transform                                | Make a transform function for image-text classification. |
| make_oversampled_dataset                                 | Make an oversampled dataset. |
| make_phrase_grounding_image_transform                    | Make phrase grounding image transform. |
| make_prompt_list_image_transform                         | Make a transform function for image and prompt list. |
| make_vqa_collate_function                                | Creates a collate_function for Visual Question Answering (VQA) and Phrase Grounding task. |
| make_vqa_image_transform                                 | Creates a transform function for VQA task. |
| map_classification_predictions_to_detection              | Map classification predictions back to detection predictions or targets. |
| num_iters_to_epochs                                      | Convert number of iterations to number of epochs. Min value is 1. |
| predict                                                  | Predict using a given model. |
| remove_empty_images_from_dataset                         | Remove empty images from dataset. |
| sample_few_shot_dataset                                  | Few-shot sampling of an IC/OD dataset. |
| save_jsonl_vqa_dataset                                   | Save a VQA dataset to a JSONL file. |
| split_image_text_model                                   | Split an image-text model into an image model and a text model. |
| train                                                    | Train a pytorch model. |
| train_with_gradient_cache                                | Train a model using gradient cache. Useful for contrastive learning with a large model. |


## [irisml-tasks-azure-computervision](https://github.com/microsoft/irisml-tasks-azure-computervision)

| Task                                                 | Description |
| ---------------------------------------------------- | ----------- |
| create_azure_computervision_caption_model            | Create Azure Computer Vision Caption Model. |
| create_azure_computervision_classification_model     | Create Azure Computer Vision Classification Model. |
| create_azure_computervision_custom_model             | Create a model that runs inference with a custom model in Azure Computer Vision. |
| create_azure_computervision_ocr_model                | Create Azure Computer Vision OCR model. |
| create_azure_computervision_product_recognizer_model | Create a model that runs inference with a product recognizer model in Azure Computer Vision. |
| create_azure_computervision_vectorization_model      | Create Azure Computer Vision Vectorization Model. |
| delete_azure_computervision_custom_model             | Delete Azure Computer Vision Custom Model. |
| train_azure_computervision_custom_model              | Train Azure Computer Vision Custom Model. |

## [irisml-tasks-azure-customvision](https://github.com/microsoft/irisml-tasks-azure-customvision)

| Task                                   | Description |
| -------------------------------------- | ----------- |
| create_azure_customvision_docker_model | Create a model from an exported Azure Custom Vision Docker image. |
| create_azure_customvision_model        | Create a prediction model from an Azure Custom Vision project. |
| create_azure_customvision_project      | Create a new Azure Custom Vision project. |
| delete_azure_customvision_project      | Delete an Azure Custom Vision project. |
| export_azure_customvision_model        | Export a model from an Azure Custom Vision project. |
| train_azure_customvision_project       | Train an Azure Custom Vision project. |

## [irisml-tasks-azure-openai](https://github.com/microsoft/irisml-tasks-azure-openai)

| Task                                 | Description |
| ------------------------------------ | ----------- |
| call_azure_openai_completion         | Call Azure OpenAI Text Completion API. |
| create_azure_openai_chat_model       | Create a model that generates text using Azure OpenAI completion API. |
| create_azure_openai_completion_model | Create a model that generates text using Azure OpenAI completion API. |

## [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
| Task | Description |
| ---- | ----------- |
| run_azureml_child | Run tasks as a new child AzureML Run. |

## [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
| Task            | Description |
| --------------- | ----------- |
| launch_fiftyone | Launch a fiftyone app. |

## [irisml-tasks-llava](https://github.com/microsoft/irisml-tasks-llava)
| Task               | Description |
| ------------------ | ----------- |
| create_llava_model | Create a LLaVA model from pretrained weights. |

## [irisml-tasks-onnx](https://github.com/microsoft/irisml-tasks-onnx)
Adapter tasks for OnnxRuntime library.

| Task           | Description |
| -------------- | ----------- |
| benchmark_onnx | Benchmark a given onnx model using onnxruntime. |
| predict_onnx   | Predict using a given onnx model traced with the export_onnx task. |

## [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
Adapter for models in timm library.

| Task                  | Description |
| --------------------- | ----------- |
| create_timm_model     | Create a timm model. |
| create_timm_transform | Create timm transforms. |

## [irisml-tasks-torchmetrics](https://github.com/microsoft/irisml-tasks-torchmetrics)
Adapter tasks for torchmetrics library.

| Task                                            | Description |
| ----------------------------------------------- | ----------- |
| evaluate_torchmetrics_classification_multiclass | Evaluate predictions results using torchmetrics classification metrics for multiclass classification problems. |
| evaluate_torchmetrics_classification_multilabel | Evaluate predictions results using torchmetrics classification metrics for multilabel classification problems. |

## [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
Adapter tasks for torchvision library.

| Task                            | Description |
| ------------------------------- | ----------- |
| create_torchvision_model        | Create a torchvision model. |
| create_torchvision_transform    | Create transform objects in torchvision library. |
| create_torchvision_transform_v2 | Create torchvision transform v2 object from string expressions. |
| load_torchvision_dataset        | Load a dataset from torchvision package. |

## [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
Adapter tasks for HuggingFace transformers library.

| Task                                                     | Description |
| -------------------------------------------------------- | ----------- |
| cache_transformers_model_on_azure_blob                   | Cache a model from transformers on Azure Blob Storage. |
| create_transformers_model                                | Create a model using transformers library. |
| create_transformers_raw_tokenizer                        | Create a Tokenizer using transformers library. Return the tokenizer as-is. |
| create_transformers_text_model                           | Create a text-generation model using transformers library. |
| create_transformers_tokenizer                            | Create a Tokenizer using transformers library. |


# Development
## Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
```python
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Illustrative nested config; the class name and field are placeholders.
  scale: float = 1.0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
  VERSION = '1.0.0'
  CACHE_ENABLED = True  # (default: True) This is optional.

  @dataclasses.dataclass
  class Inputs:  # You can remove this class if the task doesn't require inputs.
    int_value: int
    float_value: float

  @dataclasses.dataclass
  class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
    another_float: float
    child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

  @dataclasses.dataclass
  class Outputs:  # Can be removed if the task doesn't have outputs.
    float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

  def execute(self, inputs: Inputs) -> Outputs:
    return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

  def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
    return self.Outputs(0)  # Must return immediately without actual processing.
```

Each Task must define an "execute" method. The base class provides empty implementations of Inputs, Config, Outputs, and dry_run(). For details, see the documentation for the TaskBase class.

# Related repositories
- [irisml-tasks](https://github.com/microsoft/irisml-tasks)
- [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)
- [irisml-tasks-azure-computervision](https://github.com/microsoft/irisml-tasks-azure-computervision)
- [irisml-tasks-azure-customvision](https://github.com/microsoft/irisml-tasks-azure-customvision)
- [irisml-tasks-azure-openai](https://github.com/microsoft/irisml-tasks-azure-openai)
- [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
- [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
- [irisml-tasks-llava](https://github.com/microsoft/irisml-tasks-llava)
- [irisml-tasks-onnx](https://github.com/microsoft/irisml-tasks-onnx)
- [irisml-tasks-torchmetrics](https://github.com/microsoft/irisml-tasks-torchmetrics)
- [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
- [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
- [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
