Metadata-Version: 2.1
Name: pyDVL
Version: 0.9.0
Summary: The Python Data Valuation Library
Author: appliedAI Institute gGmbH
Project-URL: Source, https://github.com/aai-institute/pydvl
Project-URL: Documentation, https://pydvl.org
Project-URL: TransferLab, https://transferlab.ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Typing :: Typed
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: COPYING.LESSER
Requires-Dist: pyDeprecate>=0.3.2
Requires-Dist: numpy>=1.20
Requires-Dist: pandas>=1.3
Requires-Dist: scikit-learn
Requires-Dist: scipy>=1.7.0
Requires-Dist: cvxpy>=1.3.0
Requires-Dist: joblib>=1.3.0
Requires-Dist: cloudpickle
Requires-Dist: tqdm
Requires-Dist: matplotlib
Provides-Extra: cupy
Requires-Dist: cupy-cuda11x>=12.1.0; extra == "cupy"
Provides-Extra: memcached
Requires-Dist: pymemcache; extra == "memcached"
Provides-Extra: influence
Requires-Dist: torch>=2.0.0; extra == "influence"
Requires-Dist: dask>=2023.5.0; extra == "influence"
Requires-Dist: distributed>=2023.5.0; extra == "influence"
Requires-Dist: zarr>=2.16.1; extra == "influence"
Provides-Extra: ray
Requires-Dist: ray>=0.8; extra == "ray"

<p align="center" style="text-align:center;">
    <img alt="pyDVL Logo" src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/logo.svg" width="200"/>
</p>

<p align="center" style="text-align:center;">
    A library for data valuation.
</p>

<p align="center" style="text-align:center;">
    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
    <a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
    <a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV" alt="Coverage"/></a>
    <a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
page of our documentation for a list of all implemented methods. 

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance or outcome of some model trained on it. Some concepts of
value depend on a specific model of interest, while others are model-agnostic.
pyDVL focuses on model-dependent methods.

<div align="center" style="text-align:center;">
    <img
        width="70%"
        align="center"
        style="display: block; margin-left: auto; margin-right: auto;"
        src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
        alt="best sample removal"
    />
    <p align="center" style="text-align:center;">
        Comparison of different data valuation methods
        on best sample removal.
    </p>
</div>

The **Influence Function** is an infinitesimal measure of the effect that single
training points have on the parameters of a model, or on any function thereof.
In machine learning, it is also used to compute the effect
of training samples on individual test points.

<div align="center" style="text-align:center;">
    <img
        width="70%"
        align="center"
        style="display: block; margin-left: auto; margin-right: auto;"
        src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
        alt="influence functions example"
    />
    <p align="center" style="text-align:center;">
        Influences of input points with corrupted data.
        Highlighted points have flipped labels.
    </p>
</div>

# Installation

To install the latest release use:

```shell
pip install pyDVL
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

pyDVL also has extra dependencies for certain functionality.
For example, to use influence functions, run:

```shell
pip install pyDVL[influence]
```

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/#installation) in the
documentation.
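
The `influence` extra pulls in `torch`, `dask`, `distributed` and `zarr`, so a
quick way to check that an environment is ready for the influence examples is
to look for those modules. The following is a minimal sketch using only the
standard library. The helper `missing_modules` is a hypothetical name, not part
of pyDVL, and the sketch assumes that each package is importable under the same
name as its distribution.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level modules expected after `pip install pyDVL[influence]`
# (assumption: import names match the distribution names).
INFLUENCE_EXTRA = ("torch", "dask", "distributed", "zarr")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(INFLUENCE_EXTRA)
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All optional dependencies for influence functions were found.")
```

`find_spec` only locates the modules without importing them, so the check is
cheap even for heavy packages.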

# Usage

In the following subsections, we will showcase the usage of pyDVL
for Data Valuation and Influence Functions using simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your specific use case).

   ```python
   import torch
   from torch import nn
   from torch.utils.data import DataLoader, TensorDataset
   
   from pydvl.influence.torch import DirectInfluence
   from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
   from pydvl.influence import SequentialInfluenceCalculator
   ```

2. Create PyTorch data loaders for your train and test splits.

   ```python
   input_dim = (5, 5, 5)
   output_dim = 3
   train_x = torch.rand((10, *input_dim))
   train_y = torch.rand((10, output_dim))
   test_x = torch.rand((5, *input_dim))
   test_y = torch.rand((5, output_dim))

   train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
   test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
   ```

3. Instantiate your neural network model.

   ```python
   nn_architecture = nn.Sequential(
     nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
     nn.Flatten(),
     nn.Linear(27, 3),
   )
   ```

4. Define your loss:

   ```python
   loss = nn.MSELoss()
   ```

5. Instantiate an `InfluenceFunctionModel` and fit it to the training data:

   ```python
   infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
   infl_model = infl_model.fit(train_data_loader)
   ```

6. For small input data, call the `influences` method on the fitted instance:

   ```python
   influences = infl_model.influences(test_x, test_y, train_x, train_y)
   ```
   The result is a tensor of shape `(training samples x test samples)`
   that contains at index `(i, j)` the influence of training sample `i` on
   test sample `j`.

7. For larger data, wrap the model into a
   calculator and call methods on the calculator.
   ```python
   infl_calc = SequentialInfluenceCalculator(infl_model)
   
   # Lazy object providing arrays batch-wise in a sequential manner
   lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

   # Trigger computation and pull results to memory
   influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

   # Trigger computation and write results batch-wise to disk
   lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
   ```
   

   The higher the absolute value of the influence of a training sample
   on a test sample, the more influential it is for the chosen test sample, model
   and data loaders. The sign of the influence determines whether it is 
   useful (positive) or harmful (negative).

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
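
To build intuition for the sign and magnitude of influence values, here is a
minimal pure-Python sketch on a one-parameter least-squares model. It does not
use pyDVL: the function names are hypothetical, and the quantity computed is
the product of the test gradient, the inverse (optionally regularized) Hessian
and the training gradient, which is the assumed sign convention of this sketch.
Under that convention, a positive value means that removing the training point
would, to first order, increase the loss on the test point.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def fit_theta(train: List[Point]) -> float:
    """Least-squares slope of the model y = theta * x (no intercept)."""
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)


def loss_grad(theta: float, point: Point) -> float:
    """Gradient of 0.5 * (theta * x - y) ** 2 with respect to theta."""
    x, y = point
    return (theta * x - y) * x


def influences_on_test(
    train: List[Point], test_point: Point, reg: float = 0.0
) -> List[float]:
    """Influence of every training point on the loss at one test point."""
    theta = fit_theta(train)
    # Hessian of the mean training loss; a scalar for this toy model.
    hessian = sum(x * x for x, _ in train) / len(train) + reg
    g_test = loss_grad(theta, test_point)
    return [g_test * loss_grad(theta, z) / hessian for z in train]


# The true relation is y = x; the last training label is flipped.
train = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, -1.0)]
toy_influences = influences_on_test(train, test_point=(2.0, 2.0))
for point, value in zip(train, toy_influences):
    print(point, round(value, 4))
```

In this toy example the three clean points get positive values and the point
with the flipped label gets the only negative one.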

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your specific use case).

   ```python
   import matplotlib.pyplot as plt
   from sklearn.datasets import load_breast_cancer
   from sklearn.linear_model import LogisticRegression
   from pydvl.utils import Dataset, Scorer, Utility
   from pydvl.value import (
      compute_shapley_values,
      ShapleyMode,
      MaxUpdates,
   )
   ```
 
2. Create a `Dataset` object with your train and test splits.

   ```python
   data = Dataset.from_sklearn(
       load_breast_cancer(),
       train_size=10,
       stratify_by_target=True,
       random_state=16,
   )
   ```

3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible
   predictor).

   ```python
   model = LogisticRegression()
   ```  

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
   function.

   ```python
   u = Utility(
      model,
      data,
      Scorer("accuracy", default=0.0)
   )
   ```

5. Use one of the methods defined in the library to compute the values.
   In our example, we will use *Permutation Monte Carlo Shapley*,
   an approximate method for computing Data Shapley values.

   ```python
   values = compute_shapley_values(
      u,
      mode=ShapleyMode.PermutationMontecarlo,
      done=MaxUpdates(100),
      seed=16,  
      progress=True
   )
   ```
   The result is a variable of type `ValuationResult` that contains
   the indices and their values as well as other attributes.

   The higher the value for an index, the more important it is for the chosen
   model, dataset and scorer.

6. (Optional) Convert the valuation result to a dataframe to analyze and visualize the values.

   ```python
   df = values.to_dataframe(column="data_value")
   ```
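
To make the idea behind a permutation-based Shapley estimator concrete, here is
a minimal pure-Python sketch. It is not pyDVL's implementation:
`permutation_shapley` and `toy_utility` are hypothetical names, and the toy
utility stands in for "train the model on a subset and score it". Each random
permutation is traversed once, and every index is credited with the change in
utility that it causes when it joins the points preceding it.

```python
import random
from typing import Callable, FrozenSet, List


def permutation_shapley(
    n: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int,
    seed: int = 16,
) -> List[float]:
    """Average the marginal contributions of each index over random permutations."""
    rng = random.Random(seed)
    totals = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset = set()
        previous = utility(frozenset())
        for i in indices:
            subset.add(i)
            current = utility(frozenset(subset))
            totals[i] += current - previous  # marginal contribution of i
            previous = current
    return [total / n_permutations for total in totals]


def toy_utility(subset: FrozenSet[int]) -> float:
    """Hypothetical score: saturates after 3 points, and point 4 is harmful."""
    score = min(len(subset), 3) / 3
    if 4 in subset:
        score -= 0.5
    return score


toy_values = permutation_shapley(n=5, utility=toy_utility, n_permutations=100)
print([round(v, 3) for v in toy_values])
```

By construction, the estimates sum to the utility of the full set minus that of
the empty set, and the harmful point is the only one with a negative value.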

# Contributing

Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# License

pyDVL is distributed under
[LGPL-3.0](https://www.gnu.org/licenses/lgpl-3.0.html). The full license text
is in two files: [LICENSE](LICENSE) and [COPYING.LESSER](COPYING.LESSER).

All contributions will be distributed under this license.
