Metadata-Version: 2.1
Name: datachain
Version: 0.6.9
Summary: Wrangle unstructured AI data at scale
Author-email: Dmitry Petrov <support@dvc.org>
License: Apache-2.0
Project-URL: Documentation, https://datachain.dvc.ai
Project-URL: Issues, https://github.com/iterative/datachain/issues
Project-URL: Source, https://github.com/iterative/datachain
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Development Status :: 2 - Pre-Alpha
Requires-Python: >=3.9
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: pyyaml
Requires-Dist: tomlkit
Requires-Dist: tqdm
Requires-Dist: numpy <3,>=1
Requires-Dist: pandas >=2.0.0
Requires-Dist: pyarrow
Requires-Dist: typing-extensions
Requires-Dist: python-dateutil >=2
Requires-Dist: attrs >=21.3.0
Requires-Dist: s3fs >=2024.2.0
Requires-Dist: gcsfs >=2024.2.0
Requires-Dist: adlfs >=2024.2.0
Requires-Dist: dvc-data <4,>=3.10
Requires-Dist: dvc-objects <6,>=4
Requires-Dist: shtab <2,>=1.3.4
Requires-Dist: sqlalchemy >=2
Requires-Dist: multiprocess ==0.70.16
Requires-Dist: cloudpickle
Requires-Dist: orjson >=3.10.5
Requires-Dist: pydantic <3,>=2
Requires-Dist: jmespath >=1.0
Requires-Dist: datamodel-code-generator >=0.25
Requires-Dist: Pillow <12,>=10.0.0
Requires-Dist: msgpack <2,>=1.0.4
Requires-Dist: psutil
Requires-Dist: huggingface-hub
Requires-Dist: iterative-telemetry >=0.0.9
Requires-Dist: platformdirs
Requires-Dist: dvc-studio-client <1,>=0.21
Requires-Dist: tabulate
Provides-Extra: dev
Requires-Dist: datachain[docs,tests] ; extra == 'dev'
Requires-Dist: mypy ==1.13.0 ; extra == 'dev'
Requires-Dist: types-python-dateutil ; extra == 'dev'
Requires-Dist: types-pytz ; extra == 'dev'
Requires-Dist: types-PyYAML ; extra == 'dev'
Requires-Dist: types-requests ; extra == 'dev'
Requires-Dist: types-tabulate ; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs >=1.5.2 ; extra == 'docs'
Requires-Dist: mkdocs-gen-files >=0.5.0 ; extra == 'docs'
Requires-Dist: mkdocs-material >=9.3.1 ; extra == 'docs'
Requires-Dist: mkdocs-section-index >=0.3.6 ; extra == 'docs'
Requires-Dist: mkdocstrings-python >=1.6.3 ; extra == 'docs'
Requires-Dist: mkdocs-literate-nav >=0.6.1 ; extra == 'docs'
Provides-Extra: examples
Requires-Dist: datachain[tests] ; extra == 'examples'
Requires-Dist: numpy <2,>=1 ; extra == 'examples'
Requires-Dist: defusedxml ; extra == 'examples'
Requires-Dist: accelerate ; extra == 'examples'
Requires-Dist: unstructured[embed-huggingface,pdf] <0.16.0 ; extra == 'examples'
Requires-Dist: pdfplumber ==0.11.4 ; extra == 'examples'
Requires-Dist: huggingface-hub[hf_transfer] ; extra == 'examples'
Requires-Dist: onnx ==1.16.1 ; extra == 'examples'
Provides-Extra: hf
Requires-Dist: numba >=0.60.0 ; extra == 'hf'
Requires-Dist: datasets[audio,vision] >=2.21.0 ; extra == 'hf'
Provides-Extra: remote
Requires-Dist: lz4 ; extra == 'remote'
Requires-Dist: requests >=2.22.0 ; extra == 'remote'
Provides-Extra: tests
Requires-Dist: datachain[hf,remote,torch,vector] ; extra == 'tests'
Requires-Dist: pytest <9,>=8 ; extra == 'tests'
Requires-Dist: pytest-sugar >=0.9.6 ; extra == 'tests'
Requires-Dist: pytest-cov >=4.1.0 ; extra == 'tests'
Requires-Dist: pytest-mock >=3.12.0 ; extra == 'tests'
Requires-Dist: pytest-servers[all] >=0.5.7 ; extra == 'tests'
Requires-Dist: pytest-benchmark[histogram] ; extra == 'tests'
Requires-Dist: pytest-xdist >=3.3.1 ; extra == 'tests'
Requires-Dist: virtualenv ; extra == 'tests'
Requires-Dist: dulwich ; extra == 'tests'
Requires-Dist: hypothesis ; extra == 'tests'
Requires-Dist: open-clip-torch ; extra == 'tests'
Requires-Dist: aiotools >=1.7.0 ; extra == 'tests'
Requires-Dist: requests-mock ; extra == 'tests'
Requires-Dist: scipy ; extra == 'tests'
Provides-Extra: torch
Requires-Dist: torch >=2.1.0 ; extra == 'torch'
Requires-Dist: torchvision ; extra == 'torch'
Requires-Dist: transformers >=4.36.0 ; extra == 'torch'
Provides-Extra: vector
Requires-Dist: usearch ; extra == 'vector'

================
|logo| DataChain
================

|PyPI| |Python Version| |Codecov| |Tests|

.. |logo| image:: docs/assets/datachain.svg
   :height: 24
.. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
   :target: https://pypi.org/project/datachain/
   :alt: PyPI
.. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
   :target: https://pypi.org/project/datachain
   :alt: Python Version
.. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
   :target: https://codecov.io/gh/iterative/datachain
   :alt: Codecov
.. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
   :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
   :alt: Tests

DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
data like images, audio, videos, text and PDFs. It integrates with external storage
(e.g., S3) to process data efficiently without data duplication and manages metadata
in an internal database for easy and efficient querying.


Use Cases
=========

1. **Multimodal Dataset Preparation and Curation**: Ideal for organizing and
   refining data in the pre-training, fine-tuning, or LLM evaluation stages.
2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
   ad-hoc analytics using LLMs.

Key Features
============

📂 **Multimodal Dataset Versioning.**
   - Version unstructured data without redundant data copies, by supporting
     references to S3, GCP, Azure, and local file systems.
   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
   - Unite files and metadata together into persistent, versioned, columnar datasets.

🐍 **Python-friendly.**
   - Operate on Python objects and object fields: float scores, strings, matrices,
     LLM response objects.
   - Run Python code on large, terabyte-scale datasets, with built-in
     parallelization and memory-efficient computing (no SQL or Spark required).

🧠 **Data Enrichment and Processing.**
   - Generate metadata using local AI models and LLM APIs.
   - Filter, join, and group datasets by metadata. Search by vector embeddings.
   - High-performance vectorized operations on Python objects: sum, count, avg, etc.
   - Pass datasets to PyTorch and TensorFlow, or export them back into storage.


Quick Start
-----------

.. code:: console

   $ pip install datachain


Selecting files using JSON metadata
======================================

A storage bucket contains images of cats and dogs (`dog.1048.jpg`, `cat.1009.jpg`),
annotated with ground truth and model inferences in the 'json-pairs' format,
where each image has a matching JSON file like `cat.1009.json`:

.. code:: json

    {
        "class": "cat", "id": "1009", "num_annotators": 8,
        "inference": {"class": "dog", "confidence": 0.68}
    }

The example below uses the JSON metadata to download only the images inferred as "high-confidence cat":


.. code:: py

    from datachain import Column, DataChain

    meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")
    images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")

    images_id = images.map(id=lambda file: file.path.split('.')[-2])
    annotated = images_id.merge(meta, on="id", right_on="meta.id")

    likely_cats = annotated.filter(
        (Column("meta.inference.confidence") > 0.93)
        & (Column("meta.inference.class_") == "cat")
    )
    likely_cats.export_files("high-confidence-cats/", signal="file")
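
The key extraction and the filter condition can also be traced in plain Python.
The sketch below does not use DataChain: the helper names (`image_id`,
`is_likely_cat`) and the two sample records are hypothetical, and it only mirrors
the logic of the `map`, `merge`, and `filter` steps above.

.. code:: py

    def image_id(path: str) -> str:
        """Return the id embedded in a name like 'cat.1009.jpg'."""
        return path.split(".")[-2]

    # Hypothetical metadata records keyed by id, shaped like the JSON above.
    meta_by_id = {
        "1009": {"class": "cat", "inference": {"class": "dog", "confidence": 0.68}},
        "1048": {"class": "dog", "inference": {"class": "cat", "confidence": 0.97}},
    }

    def is_likely_cat(meta: dict, threshold: float = 0.93) -> bool:
        """Keep records whose inferred class is 'cat' with high confidence."""
        inference = meta["inference"]
        return inference["class"] == "cat" and inference["confidence"] > threshold

    paths = ["dogs-and-cats/cat.1009.jpg", "dogs-and-cats/dog.1048.jpg"]

    # Join each image to its metadata by id, then apply the filter.
    selected = [p for p in paths if is_likely_cat(meta_by_id[image_id(p)])]

Note that the DataChain filter refers to `class_` rather than `class`, presumably
because `class` is a reserved word in Python and cannot be used as a field name.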


Data curation with a local AI model
===================================
Batch inference with a simple sentiment model using the `transformers` library:

.. code:: shell

    pip install transformers

The code below downloads files from the cloud, and applies a user-defined function
to each one of them. All files with a positive sentiment
detected are then copied to the local directory.

.. code:: py

    from transformers import pipeline
    from datachain import DataChain, Column

    classifier = pipeline("sentiment-analysis", device="cpu",
                    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

    def is_positive_dialogue_ending(file) -> bool:
        dialogue_ending = file.read()[-512:]
        return classifier(dialogue_ending)[0]["label"] == "POSITIVE"

    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                              object_name="file", type="text")
       .settings(parallel=8, cache=True)
       .map(is_positive=is_positive_dialogue_ending)
       .save("file_response")
    )

    positive_chain = chain.filter(Column("is_positive") == True)
    positive_chain.export_files("./output")

    print(f"{positive_chain.count()} files were exported")



The script prints "13 files were exported", and the exported files can be listed as follows:

.. code:: shell

    $ ls output/datachain-demo/chatbot-KiT/
    15.txt 20.txt 24.txt 27.txt 28.txt 29.txt 33.txt 37.txt 38.txt 43.txt ...
    $ ls output/datachain-demo/chatbot-KiT/ | wc -l
    13
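
The per-file function can be exercised without downloading a model. The sketch
below is an illustration only: `fake_classifier` is a hypothetical stand-in that
returns the same list-of-dicts shape as the `transformers` sentiment pipeline, and
the two sample dialogues are made up. It shows that only the last 512 characters
of each file influence the result.

.. code:: py

    def fake_classifier(text: str) -> list[dict]:
        # Hypothetical stand-in for the sentiment pipeline (same output shape).
        label = "POSITIVE" if "thank" in text.lower() else "NEGATIVE"
        return [{"label": label, "score": 1.0}]

    def is_positive_ending(text: str, classify=fake_classifier, tail: int = 512) -> bool:
        # Classify only the tail of the dialogue, as in the example above.
        return classify(text[-tail:])[0]["label"] == "POSITIVE"

    dialogues = {
        "1.txt": "x" * 1000 + " Thanks, that solved it!",          # thanks at the end
        "2.txt": "Thank you for contacting us. " + "y" * 1000,     # thanks at the start
    }

    # Map the function over every file, then keep the positive ones.
    positive = [name for name, text in dialogues.items() if is_positive_ending(text)]

Only `1.txt` is selected, because the thank-you in `2.txt` falls outside the
512-character tail.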


LLM judging chatbots
=============================

LLMs can work as universal classifiers. In the example below,
we employ a free API from Mistral to judge the `publicly available`_ chatbot dialogs. Please get a free
Mistral API key at https://console.mistral.ai


.. code:: shell

    $ pip install "mistralai>=1.0.0"
    $ export MISTRAL_API_KEY=_your_key_

DataChain can parallelize API calls; the free Mistral tier supports up to 4 requests at the same time.

.. code:: py

    from mistralai import Mistral
    from datachain import File, DataChain, Column

    PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

    def eval_dialogue(file: File) -> bool:
         client = Mistral()
         response = client.chat.complete(
             model="open-mixtral-8x22b",
             messages=[{"role": "system", "content": PROMPT},
                       {"role": "user", "content": file.read()}])
         result = response.choices[0].message.content
         return result.lower().startswith("success")

    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
       .settings(parallel=4, cache=True)
       .map(is_success=eval_dialogue)
       .save("mistral_files")
    )

    successful_chain = chain.filter(Column("is_success") == True)
    successful_chain.export_files("./output_mistral")

    print(f"{successful_chain.count()} files were exported")


With the instruction above, the Mistral model judges 31 of the 50 files to contain successful dialogues:

.. code:: shell

    $ ls output_mistral/datachain-demo/chatbot-KiT/
    1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
    $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
    31
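
The idea of capping concurrent API calls at four can be sketched with the standard
library. This is not a description of how DataChain implements the `parallel`
setting; `fake_api_call`, `parse_verdict`, and the sample dialogs are hypothetical,
and the thread pool is simply one common way to bound concurrency.

.. code:: py

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor

    MAX_PARALLEL = 4  # matches the limit of the free tier mentioned above
    stats = {"active": 0, "peak": 0}
    lock = threading.Lock()

    def fake_api_call(text: str) -> str:
        # Hypothetical stand-in for the LLM request; tracks concurrent calls.
        with lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)  # simulate network latency
        with lock:
            stats["active"] -= 1
        return "Success." if "resolved" in text else "Failure."

    def parse_verdict(reply: str) -> bool:
        # Same check as eval_dialogue, plus whitespace stripping.
        return reply.strip().lower().startswith("success")

    dialogs = [f"dialog {i} resolved" if i % 2 else f"dialog {i}" for i in range(12)]

    # The pool never runs more than MAX_PARALLEL calls at once.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        verdicts = [parse_verdict(reply) for reply in pool.map(fake_api_call, dialogs)]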



Serializing Python objects
==========================

LLM responses may contain valuable information for analytics – such as the number of tokens used, or the
model performance parameters.

Instead of extracting this information from the Mistral response data structure (class
`ChatCompletionResponse`), DataChain can serialize the entire LLM response to the internal DB:


.. code:: py

    from mistralai import Mistral
    from mistralai.models import ChatCompletionResponse
    from datachain import File, DataChain, Column

    PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

    def eval_dialog(file: File) -> ChatCompletionResponse:
         client = Mistral()
         return client.chat.complete(
             model="open-mixtral-8x22b",
             messages=[{"role": "system", "content": PROMPT},
                       {"role": "user", "content": file.read()}])

    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
       .settings(parallel=4, cache=True)
       .map(response=eval_dialog)
       .map(status=lambda response: response.choices[0].message.content.lower()[:7])
       .save("response")
    )

    chain.select("file.name", "status", "response.usage").show(5)

    success_rate = chain.filter(Column("status") == "success").count() / chain.count()
    print(f"{100*success_rate:.1f}% dialogs were successful")

Output:

.. code:: shell

         file   status      response     response          response
         name                  usage        usage             usage
                       prompt_tokens total_tokens completion_tokens
    0   1.txt  success           547          548                 1
    1  10.txt  failure          3576         3578                 2
    2  11.txt  failure           626          628                 2
    3  12.txt  failure          1144         1182                38
    4  13.txt  success          1100         1101                 1

    [Limited by 5 rows]
    64.0% dialogs were successful
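
The dotted column names in this output (for example `response.usage.prompt_tokens`)
suggest how a nested object maps onto flat columns. The sketch below illustrates
that mapping with standard-library dataclasses. It is an assumption-laden
illustration, not DataChain's actual storage code: the `Usage` and `Response`
classes and the `flatten` helper are hypothetical.

.. code:: py

    from dataclasses import asdict, dataclass

    @dataclass
    class Usage:
        prompt_tokens: int
        total_tokens: int
        completion_tokens: int

    @dataclass
    class Response:
        model: str
        usage: Usage

    def flatten(obj: dict, prefix: str = "") -> dict:
        """Turn nested dicts into a flat dict with dotted keys."""
        flat = {}
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, name))
            else:
                flat[name] = value
        return flat

    # Token counts taken from the first row of the output above.
    response = Response("open-mixtral-8x22b", Usage(547, 548, 1))
    row = flatten({"response": asdict(response)})

Each leaf field becomes one column, which is what allows a query to address a
single field such as `response.usage.total_tokens`.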


Iterating over Python data structures
=============================================

In the previous examples, datasets were saved in the embedded database
(`SQLite`_ in folder `.datachain` of the working directory).
These datasets were automatically versioned, and can be accessed using
`DataChain.from_dataset("dataset_name")`.

Here is how to retrieve a saved dataset and iterate over the objects:

.. code:: py

    from mistralai.models import ChatCompletionResponse
    from datachain import DataChain

    chain = DataChain.from_dataset("response")

    # Iterate one by one, which supports datasets larger than memory
    for file, response in chain.limit(5).collect("file", "response"):
        # verify the collected Python objects
        assert isinstance(response, ChatCompletionResponse)

        status = response.choices[0].message.content[:7]
        tokens = response.usage.total_tokens
        print(f"{file.get_uri()}: {status}, file size: {file.size}, tokens: {tokens}")

Output:

.. code:: shell

    gs://datachain-demo/chatbot-KiT/1.txt: Success, file size: 1776, tokens: 548
    gs://datachain-demo/chatbot-KiT/10.txt: Failure, file size: 11576, tokens: 3578
    gs://datachain-demo/chatbot-KiT/11.txt: Failure, file size: 2045, tokens: 628
    gs://datachain-demo/chatbot-KiT/12.txt: Failure, file size: 3833, tokens: 1207
    gs://datachain-demo/chatbot-KiT/13.txt: Success, file size: 3657, tokens: 1101


Vectorized analytics over Python objects
========================================

Some operations can run inside the DB without deserialization.
For instance, let's calculate the total cost of using the LLM APIs, assuming the Mixtral call costs $2 per 1M input tokens and $6 per 1M output tokens:

.. code:: py

    from datachain import DataChain

    chain = DataChain.from_dataset("response")

    cost = (
        chain.sum("response.usage.prompt_tokens") * 0.000002
        + chain.sum("response.usage.completion_tokens") * 0.000006
    )
    print(f"Spent ${cost:.2f} on {chain.count()} calls")

Output:

.. code:: shell

    Spent $0.08 on 50 calls
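
As a worked instance of the pricing formula, the sketch below applies it to the
five rows of the usage table shown earlier, with the prompt and completion token
counts copied from that output. It is plain Python and does not touch the
database; the constant and variable names are illustrative.

.. code:: py

    PRICE_PER_INPUT_TOKEN = 2 / 1_000_000   # $2 per 1M input tokens
    PRICE_PER_OUTPUT_TOKEN = 6 / 1_000_000  # $6 per 1M output tokens

    # (prompt_tokens, completion_tokens) for 1.txt, 10.txt, 11.txt, 12.txt, 13.txt
    usage = [(547, 1), (3576, 2), (626, 2), (1144, 38), (1100, 1)]

    prompt_total = sum(prompt for prompt, _ in usage)
    completion_total = sum(completion for _, completion in usage)

    cost = (
        prompt_total * PRICE_PER_INPUT_TOKEN
        + completion_total * PRICE_PER_OUTPUT_TOKEN
    )
    print(f"Spent ${cost:.4f} on {len(usage)} calls")

For these five rows the totals are 6993 prompt tokens and 44 completion tokens,
which comes to roughly 1.4 cents.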


PyTorch data loader
===================

Chain results can be exported or passed directly to a PyTorch data loader.
For example, to pass an image together with a label derived from the file
name prefix, use the following code:

.. code:: py

    from torch.utils.data import DataLoader
    from transformers import CLIPProcessor

    from datachain import DataChain

    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    chain = (
        DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image")
        .map(label=lambda name: name.split(".")[0], params=["file.name"])
        .select("file", "label").to_pytorch(
            transform=processor.image_processor,
            tokenizer=processor.tokenizer,
        )
    )
    loader = DataLoader(chain, batch_size=1)
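
The label rule used in the `map` step can be checked on its own. The sketch below
is plain Python with made-up file names. The integer encoding at the end is a
hypothetical extra step that a training loop might need; it is not part of the
example above.

.. code:: py

    def label_from_name(name: str) -> str:
        """Take the part of the file name before the first dot as the label."""
        return name.split(".")[0]

    names = ["dog.1048.jpg", "cat.1009.jpg", "cat.17.jpg"]
    labels = [label_from_name(name) for name in names]

    # Hypothetical follow-up: map each distinct label to a stable integer id.
    class_index = {label: i for i, label in enumerate(sorted(set(labels)))}
    targets = [class_index[label] for label in labels]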


DataChain Studio Platform
-------------------------

`DataChain Studio`_ is a proprietary solution for teams that offers:

- **Centralized dataset registry** to manage data, code, and
  dependencies in one place.
- **Data Lineage** for data sources as well as derivative datasets.
- **UI for Multimodal Data** like images, videos, and PDFs.
- **Scalable Compute** to handle large datasets (100M+ files) and in-house
  AI model inference.
- **Access control** including SSO and team-based collaboration.

Tutorials
---------

* `Getting Started`_
* `Multimodal <https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb>`__)
* `LLM evaluations <https://github.com/iterative/datachain-examples/blob/main/llm/llm_chatbot_evaluation.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/llm/llm_chatbot_evaluation.ipynb>`__)
* `Reading JSON metadata <https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb>`__)


Contributions
-------------

Contributions are very welcome.
To learn more, see the `Contributor Guide`_.


Community and Support
---------------------

* `Docs <https://datachain.dvc.ai/>`_
* `File an issue`_ if you encounter any problems
* `Discord Chat <https://dvc.org/chat>`_
* `Email <mailto:support@dvc.org>`_
* `Twitter <https://twitter.com/DVCorg>`_


.. _PyPI: https://pypi.org/
.. _file an issue: https://github.com/iterative/datachain/issues
.. github-only
.. _Contributor Guide: CONTRIBUTING.rst
.. _Pydantic: https://github.com/pydantic/pydantic
.. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
.. _SQLite: https://www.sqlite.org/
.. _Getting Started: https://docs.datachain.ai/
.. _DataChain Studio: https://studio.datachain.ai/
