Metadata-Version: 2.4
Name: hirundo
Version: 0.1.21
Summary: This package is used to interface with Hirundo's platform. It provides a simple API to optimize your ML datasets.
Author-email: Hirundo <dev@hirundo.io>
License: MIT License
        
        Copyright (c) 2024, Hirundo
        
        Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
Project-URL: Homepage, https://github.com/Hirundo-io/hirundo-python-sdk
Keywords: dataset,machine learning,data science,data engineering
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: types-PyYAML>=6.0.12
Requires-Dist: pydantic>=2.7.1
Requires-Dist: twine>=5.0.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: types-requests>=2.31.0
Requires-Dist: typer>=0.12.3
Requires-Dist: httpx>=0.27.0
Requires-Dist: stamina>=24.2.0
Requires-Dist: httpx-sse>=0.4.0
Requires-Dist: tqdm>=4.66.5
Requires-Dist: h11>=0.16.0
Requires-Dist: requests>=2.32.4
Requires-Dist: urllib3>=2.5.0
Requires-Dist: setuptools>=78.1.1
Provides-Extra: dev
Requires-Dist: pyyaml>=6.0.1; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.12; extra == "dev"
Requires-Dist: pydantic>=2.7.1; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: python-dotenv>=1.0.1; extra == "dev"
Requires-Dist: types-requests>=2.31.0; extra == "dev"
Requires-Dist: types-setuptools>=69.5.0; extra == "dev"
Requires-Dist: typer>=0.12.3; extra == "dev"
Requires-Dist: httpx>=0.27.0; extra == "dev"
Requires-Dist: stamina>=24.2.0; extra == "dev"
Requires-Dist: httpx-sse>=0.4.0; extra == "dev"
Requires-Dist: pytest>=8.2.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.6; extra == "dev"
Requires-Dist: uv>=0.8.6; extra == "dev"
Requires-Dist: pre-commit>=3.7.1; extra == "dev"
Requires-Dist: virtualenv>=20.6.6; extra == "dev"
Requires-Dist: ruff>=0.12.0; extra == "dev"
Requires-Dist: bumpver; extra == "dev"
Requires-Dist: platformdirs>=4.3.6; extra == "dev"
Requires-Dist: safety>=3.2.13; extra == "dev"
Requires-Dist: cryptography>=44.0.1; extra == "dev"
Requires-Dist: jinja2>=3.1.6; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.4.7; extra == "docs"
Requires-Dist: sphinx-autobuild>=2024.9.3; extra == "docs"
Requires-Dist: sphinx-click>=5.0.1; extra == "docs"
Requires-Dist: autodoc_pydantic>=2.2.0; extra == "docs"
Requires-Dist: furo; extra == "docs"
Requires-Dist: sphinx-multiversion; extra == "docs"
Requires-Dist: esbonio; extra == "docs"
Requires-Dist: starlette>=0.47.2; extra == "docs"
Requires-Dist: markupsafe>=3.0.2; extra == "docs"
Requires-Dist: jinja2>=3.1.6; extra == "docs"
Provides-Extra: pandas
Requires-Dist: pandas>=2.2.3; extra == "pandas"
Provides-Extra: polars
Requires-Dist: polars>=1.0.0; extra == "polars"
Dynamic: license-file

# Hirundo

This package provides access to the Hirundo APIs for machine-learning dataset QA.

Dataset QA is currently available for datasets labelled for classification and object detection.

Supported dataset storage configs include:

- Google Cloud (GCP) Storage
- Amazon Web Services (AWS) S3
- Git LFS (Large File Storage) repositories (e.g. GitHub or HuggingFace)

Note: This Python package must be used alongside a Hirundo server: the SaaS platform, a custom VPC deployment, or an on-premises installation.

## Optimizing a classification dataset

Currently `hirundo` requires a CSV file with the following columns (all columns are required):

- `image_path`: The location of the image within the dataset `data_root_url`
- `class_name`: The semantic label, i.e. the class name of the class that the image was annotated as belonging to
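
A labels file in this shape can be produced with Python's standard `csv` module. The sketch below is illustrative only: the image paths and class names are made-up sample values, the paths are assumed to be relative to `data_root_url`, and `write_classification_csv` is a hypothetical helper, not part of the `hirundo` API.

```python
import csv
import io

# Column names taken from the list above.
CLASSIFICATION_COLUMNS = ["image_path", "class_name"]


def write_classification_csv(rows, stream):
    """Write (image_path, class_name) pairs as a CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=CLASSIFICATION_COLUMNS)
    writer.writeheader()
    for image_path, class_name in rows:
        writer.writerow({"image_path": image_path, "class_name": class_name})


# Made-up sample rows (assumption: paths are relative to data_root_url).
sample_rows = [
    ("train/0001.png", "apple"),
    ("train/0002.png", "bicycle"),
]
buffer = io.StringIO()
write_classification_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the stream would be a file opened with `newline=""`, which is then uploaded to the storage location referenced by the dataset.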

`hirundo` then outputs two Pandas DataFrames that contain the dataset columns as well as the columns listed below.

Suspect DataFrame (filename: `mislabel_suspects.csv`) columns:

- ``suspect_score``: mislabel suspect score
- ``suspect_level``: mislabel suspect level
- ``suspect_rank``: mislabel suspect ranking
- ``suggested_class_name``: suggested semantic label
- ``suggested_class_conf``: suggested semantic label confidence

Errors and warnings DataFrame (filename: `invalid_data.csv`) columns:

- ``status``: status message (one of ``NO_LABELS`` / ``MISSING_IMAGE`` / ``INVALID_IMAGE``)
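
Once `mislabel_suspects.csv` has been saved locally, one simple way to review it is to list the rows where the suggested label differs from the annotated one. This is a minimal sketch using only the standard library. The sample rows are made up, the score, level, and rank values are placeholders that may not match the real value formats, and `relabel_candidates` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the shape of mislabel_suspects.csv (all values are placeholders).
SAMPLE_SUSPECTS = """\
image_path,class_name,suspect_score,suspect_level,suspect_rank,suggested_class_name,suggested_class_conf
train/0001.png,apple,0.91,HIGH,1,pear,0.88
train/0002.png,bicycle,0.12,LOW,2,bicycle,0.97
"""


def relabel_candidates(stream):
    """Return rows whose suggested label differs from the annotated label."""
    reader = csv.DictReader(stream)
    return [
        row for row in reader
        if row["suggested_class_name"] != row["class_name"]
    ]


candidates = relabel_candidates(io.StringIO(SAMPLE_SUSPECTS))
for row in candidates:
    print(row["image_path"], row["class_name"], "->", row["suggested_class_name"])
```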

## Optimizing an object detection (OD) dataset

Currently ``hirundo`` requires a CSV file with the following columns (all columns are required):

- ``image_path``: The location of the image within the dataset ``data_root_url``
- ``object_id``: The ID of the bounding box within the dataset. Used to indicate object suspects
- ``class_name``: Object semantic label, i.e. the class name of the object that was annotated
- ``xmin``: leftmost horizontal pixel coordinate of the object's bounding box
- ``ymin``: uppermost vertical pixel coordinate of the object's bounding box
- ``xmax``: rightmost horizontal pixel coordinate of the object's bounding box
- ``ymax``: lowermost vertical pixel coordinate of the object's bounding box
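
Before uploading, it can help to sanity-check the file locally. The sketch below verifies that all required columns are present and flags boxes whose minimum coordinate is not strictly smaller than the maximum. This strictness rule is an assumption made for illustration; the platform's own validation rules are not described here. The sample rows are made up and `find_degenerate_boxes` is a hypothetical helper.

```python
import csv
import io

# Column names taken from the list above.
OD_COLUMNS = ["image_path", "object_id", "class_name", "xmin", "ymin", "xmax", "ymax"]

# Made-up sample rows; the second box has xmin > xmax on purpose.
SAMPLE_OD_CSV = """\
image_path,object_id,class_name,xmin,ymin,xmax,ymax
images/a.jpg,obj-1,car,10,20,110,80
images/a.jpg,obj-2,person,50,60,40,90
"""


def find_degenerate_boxes(stream):
    """Return the object_id of every box with xmin >= xmax or ymin >= ymax."""
    reader = csv.DictReader(stream)
    missing = [c for c in OD_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    bad_ids = []
    for row in reader:
        xmin, ymin = float(row["xmin"]), float(row["ymin"])
        xmax, ymax = float(row["xmax"]), float(row["ymax"])
        # Image coordinates grow rightwards and downwards, so min must be below max.
        if xmin >= xmax or ymin >= ymax:
            bad_ids.append(row["object_id"])
    return bad_ids


bad_ids = find_degenerate_boxes(io.StringIO(SAMPLE_OD_CSV))
print(bad_ids)
```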


`hirundo` then outputs two Pandas DataFrames that contain the dataset columns as well as the columns listed below.

Suspect DataFrame (filename: `mislabel_suspects.csv`) columns:

- ``suspect_score``: object mislabel suspect score
- ``suspect_level``: object mislabel suspect level
- ``suspect_rank``: object mislabel suspect ranking
- ``suggested_class_name``: suggested object semantic label
- ``suggested_class_conf``: suggested object semantic label confidence

Errors and warnings DataFrame (filename: `invalid_data.csv`) columns:

- ``status``: status message (one of ``NO_LABELS`` / ``MISSING_IMAGE`` / ``INVALID_IMAGE`` / ``INVALID_BBOX`` / ``INVALID_BBOX_SIZE``)
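
A quick way to see which problems dominate in `invalid_data.csv` is to tally the `status` column. The following sketch uses only the standard library. The sample rows are made up and `count_statuses` is a hypothetical helper, not part of the `hirundo` API.

```python
import csv
import io
from collections import Counter

# Made-up sample in the shape of invalid_data.csv for an OD dataset.
SAMPLE_INVALID = """\
image_path,object_id,class_name,xmin,ymin,xmax,ymax,status
images/a.jpg,obj-2,person,50,60,40,90,INVALID_BBOX
images/missing.jpg,obj-3,car,0,0,10,10,MISSING_IMAGE
images/b.jpg,obj-4,car,5,5,4,9,INVALID_BBOX
"""


def count_statuses(stream):
    """Count how many rows carry each status value."""
    return Counter(row["status"] for row in csv.DictReader(stream))


status_counts = count_statuses(io.StringIO(SAMPLE_INVALID))
for status, count in status_counts.most_common():
    print(status, count)
```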

## Installation

Run `pip install hirundo` to install the latest release of this package. To install from the Git repository, or if you need a specific version or branch, clone the repository, check out the relevant commit, and run `pip install .`. The full list of dependencies is in `requirements.txt`; both commands install them automatically.
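
To confirm which version ended up in the active environment, the standard library's `importlib.metadata` can be queried. This is a small sketch, and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


hirundo_version = installed_version("hirundo")
if hirundo_version is None:
    print("hirundo is not installed in this environment")
else:
    print(f"hirundo {hirundo_version} is installed")
```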

## Usage

Classification example:

```python
import json
import os

from hirundo import (
    HirundoCSV,
    LabelingType,
    QADataset,
    StorageGCP,
    StorageConfig,
    StorageTypes,
)

gcp_bucket = StorageGCP(
    bucket_name="cifar100bucket",
    project="Hirundo-global",
    credentials_json=json.loads(os.environ["GCP_CREDENTIALS"]),
)
test_dataset = QADataset(
    name="TEST-GCP cifar 100 classification dataset",
    labeling_type=LabelingType.SINGLE_LABEL_CLASSIFICATION,
    storage_config=StorageConfig(
        name="cifar100bucket",
        type=StorageTypes.GCP,
        gcp=gcp_bucket,
    ),
    data_root_url=gcp_bucket.get_url(path="/pytorch-cifar/data"),
    labeling_info=HirundoCSV(
        csv_url=gcp_bucket.get_url(path="/pytorch-cifar/data/cifar100.csv"),
    ),
    # cifar100_classes must be defined beforehand, e.g. as a list of the CIFAR-100 class names
    classes=cifar100_classes,
)

test_dataset.run_qa()
results = test_dataset.check_run()
print(results)
```

Object detection example:

```python
from hirundo import (
    GitRepo,
    HirundoCSV,
    LabelingType,
    QADataset,
    StorageGit,
    StorageConfig,
    StorageTypes,
)

git_storage = StorageGit(
    repo=GitRepo(
        name="BDD-100k-validation-dataset",
        repository_url="https://huggingface.co/datasets/hirundo-io/bdd100k-validation-only",
    ),
    branch="main",
)
test_dataset = QADataset(
    name="TEST-HuggingFace-BDD-100k-validation-OD-validation-dataset",
    labeling_type=LabelingType.OBJECT_DETECTION,
    storage_config=StorageConfig(
        name="BDD-100k-validation-dataset",
        type=StorageTypes.GIT,
        git=git_storage,
    ),
    data_root_url=git_storage.get_url(path="/BDD100K Val from Hirundo.zip/bdd100k"),
    labeling_info=HirundoCSV(
        csv_url=git_storage.get_url(
            path="/BDD100K Val from Hirundo.zip/bdd100k/bdd100k.csv"
        ),
    ),
)

test_dataset.run_qa()
results = test_dataset.check_run()
print(results)
```

Note: Currently we only support the main CPython releases 3.9, 3.10, 3.11, 3.12 and 3.13. PyPy support may be introduced in the future.
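
A script can check up front that it is running on one of these interpreters. The sketch below uses `platform.python_implementation()` and `sys.version_info` from the standard library. `is_supported_interpreter` is a hypothetical helper, and the version range is copied from the note above.

```python
import platform
import sys

# Minor versions 9 through 13, i.e. Python 3.9 to 3.13 as listed in the note above.
SUPPORTED_MINORS = range(9, 14)


def is_supported_interpreter(implementation, version_info):
    """Return True for CPython 3.9 through 3.13, False otherwise."""
    return (
        implementation == "CPython"
        and version_info[0] == 3
        and version_info[1] in SUPPORTED_MINORS
    )


supported = is_supported_interpreter(platform.python_implementation(), sys.version_info)
print("supported interpreter" if supported else "unsupported interpreter")
```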

## Further documentation

To learn more about how to use this library, please visit the [documentation](http://docs.hirundo.io/) or see the [Google Colab examples](https://github.com/Hirundo-io/hirundo-python-sdk/tree/main/notebooks).
