Metadata-Version: 2.1
Name: pyDVL
Version: 0.8.1
Summary: The Python Data Valuation Library
Home-page: UNKNOWN
Author: appliedAI Institute gGmbH
License: UNKNOWN
Project-URL: Source, https://github.com/aai-institute/pydvl
Project-URL: Documentation, https://aai-institute.github.io/pyDVL
Project-URL: TransferLab, https://transferlab.appliedai.de
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Typing :: Typed
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Description-Content-Type: text/markdown
Requires-Dist: pyDeprecate >=0.3.2
Requires-Dist: numpy >=1.20
Requires-Dist: pandas >=1.3
Requires-Dist: scikit-learn
Requires-Dist: scipy >=1.7.0
Requires-Dist: cvxpy >=1.3.0
Requires-Dist: joblib
Requires-Dist: cloudpickle
Requires-Dist: tqdm
Requires-Dist: matplotlib
Provides-Extra: cupy
Requires-Dist: cupy-cuda11x >=12.1.0 ; extra == 'cupy'
Provides-Extra: influence
Requires-Dist: torch >=2.0.0 ; extra == 'influence'
Requires-Dist: dask >=2023.5.0 ; extra == 'influence'
Requires-Dist: distributed >=2023.5.0 ; extra == 'influence'
Requires-Dist: zarr >=2.16.1 ; extra == 'influence'
Provides-Extra: memcached
Requires-Dist: pymemcache ; extra == 'memcached'
Provides-Extra: ray
Requires-Dist: ray >=0.8 ; extra == 'ray'

<p align="center" style="text-align:center;">
    <img alt="pyDVL Logo" src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/logo.svg" width="200"/>
</p>

<p align="center" style="text-align:center;">
    A library for data valuation.
</p>

<p align="center" style="text-align:center;">
    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
    <a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
    <a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/></a>
    <a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance or outcome of some model trained on it. Some concepts of
value depend on a specific model of interest, while others are model-agnostic.
pyDVL focuses on model-dependent methods.

<div align="center" style="text-align:center;">
    <img
        width="70%"
        align="center"
        style="display: block; margin-left: auto; margin-right: auto;"
        src="docs/value/img/mclc-best-removal-10k-natural.svg"
        alt="best sample removal"
    />
    <p align="center" style="text-align:center;">
        Comparison of different data valuation methods
        on best sample removal.
    </p>
</div>

The **Influence Function** is an infinitesimal measure of the effect that single
training points have on the parameters of a model, or on any function thereof.
In machine learning in particular, it is also used to compute the effect
of training samples on individual test points.

<div align="center" style="text-align:center;">
    <img
        width="70%"
        align="center"
        style="display: block; margin-left: auto; margin-right: auto;"
        src="docs/assets/influence_functions_example.png"
        alt="influence functions example"
    />
    <p align="center" style="text-align:center;">
        Influences of input points with corrupted data.
        Highlighted points have flipped labels.
    </p>
</div>

# Installation

To install the latest release use:

```shell
pip install pyDVL
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

pyDVL also has extra dependencies for certain functionalities, which are
installed as extras (e.g. `pip install pyDVL[influence]` for influence functions).

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/installation/) in the
documentation.

# Usage

In the following subsections, we will showcase the usage of pyDVL
for Data Valuation and Influence Functions using simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your specific use case).

   ```python
   import torch
   from torch import nn
   from torch.utils.data import DataLoader, TensorDataset

   from pydvl.influence.torch import DirectInfluence
   from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
   from pydvl.influence import SequentialInfluenceCalculator
   ```

2. Create PyTorch data loaders for your train and test splits.

   ```python
   input_dim = (5, 5, 5)
   output_dim = 3
   train_x = torch.rand((10, *input_dim))
   train_y = torch.rand((10, output_dim))
   test_x = torch.rand((5, *input_dim))
   test_y = torch.rand((5, output_dim))

   train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
   test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
   ```

3. Instantiate your neural network model.

   ```python
   nn_architecture = nn.Sequential(
     nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
     nn.Flatten(),
     nn.Linear(27, 3),
   )
   ```

4. Define your loss:

   ```python
   loss = nn.MSELoss()
   ```

5. Instantiate an `InfluenceFunctionModel` and fit it to the training data:

   ```python
   infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
   infl_model = infl_model.fit(train_data_loader)
   ```

6. For small input data, call the `influences` method on the fitted instance:

   ```python
   influences = infl_model.influences(test_x, test_y, train_x, train_y)
   ```
   The result is a tensor of shape `(training samples x test samples)`
   that contains at index `(i, j)` the influence of training sample `i` on
   test sample `j`.

7. For larger data, wrap the model into a
   calculator and call methods on the calculator.
   ```python
   infl_calc = SequentialInfluenceCalculator(infl_model)

   # Lazy object providing arrays batch-wise in a sequential manner
   lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

   # Trigger computation and pull results to memory
   influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

   # Trigger computation and write results batch-wise to disk
   lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
   ```


   The higher the absolute value of the influence of a training sample
   on a test sample, the more influential it is for the chosen test sample, model
   and data loaders. The sign of the influence determines whether it is 
   useful (positive) or harmful (negative).
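
The sign convention above can be illustrated without pyDVL. The following is a
minimal sketch in plain Python, assuming a hypothetical one-parameter model
`y ≈ theta * x` with squared loss, for which gradients and the Hessian are
available in closed form. All names (`grad`, `hessian`, `influence`) are
illustrative and not part of the pyDVL API, and the optional `reg` term is only
loosely analogous to `hessian_regularization`.

```python
# Toy illustration only: nothing here uses pyDVL, all names are hypothetical.

def grad(theta, x, y):
    """Gradient of the loss 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x


def hessian(xs, reg=0.0):
    """Hessian of the mean training loss, plus an optional regularization term."""
    return sum(x * x for x in xs) / len(xs) + reg


def influence(theta, train_point, test_point, xs, reg=0.0):
    """Product of the test gradient with the inverse Hessian and the training gradient.

    Assumed sign convention: positive means the training point is useful
    for the test point, negative means it is harmful.
    """
    g_train = grad(theta, *train_point)
    g_test = grad(theta, *test_point)
    return g_test * g_train / hessian(xs, reg)


# Three training points (x, y); the last one has a flipped label.
train = [(1.0, 1.0), (2.0, 2.0), (1.0, -1.0)]
test_point = (1.0, 1.0)

# Closed-form least-squares fit of the single parameter.
theta = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
xs = [x for x, _ in train]

influences = [influence(theta, z, test_point, xs) for z in train]
print(influences)
```

With these toy values, the two correctly labeled points receive a positive
influence on the test point, while the point with the flipped label receives a
negative one.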

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for JAX and perhaps TensorFlow or even Keras.

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your specific use case).

   ```python
   import matplotlib.pyplot as plt
   from sklearn.datasets import load_breast_cancer
   from sklearn.linear_model import LogisticRegression
   from pydvl.utils import Dataset, Scorer, Utility
   from pydvl.value import (
      compute_shapley_values,
      ShapleyMode,
      MaxUpdates,
   )
   ```

2. Create a `Dataset` object with your train and test splits.

   ```python
   data = Dataset.from_sklearn(
       load_breast_cancer(),
       train_size=10,
       stratify_by_target=True,
       random_state=16,
   )
   ```

3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible
   predictor).

   ```python
   model = LogisticRegression()
   ```  

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
   function.

   ```python
   u = Utility(
      model,
      data,
      Scorer("accuracy", default=0.0)
   )
   ```

5. Use one of the methods defined in the library to compute the values.
   In our example, we will use *Permutation Monte Carlo Shapley*,
   an approximate method for computing Data Shapley values.

   ```python
   values = compute_shapley_values(
      u,
      mode=ShapleyMode.PermutationMontecarlo,
      done=MaxUpdates(100),
      seed=16,  
      progress=True
   )
   ```
   The result is a variable of type `ValuationResult` that contains
   the indices and their values as well as other attributes.

   The higher the value for an index, the more important it is for the chosen
   model, dataset and scorer.

6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.

   ```python
   df = values.to_dataframe(column="data_value")
   ```
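
To give an intuition for what a permutation-based Monte Carlo estimate of
Shapley values does, here is a minimal sketch in plain Python. It does not use
pyDVL and is not its implementation: the `utility` function is a hypothetical
stand-in for training and scoring a model on a subset, `WEIGHTS` is made-up
data, and a fixed number of permutations serves as a simple stopping rule.

```python
import math
import random

# Made-up "quality" of four data points; the point at index 2 contributes nothing.
WEIGHTS = [4.0, 1.0, 0.0, 1.0]


def utility(subset):
    """Hypothetical utility: stand-in for the score of a model trained on `subset`."""
    return math.sqrt(sum(WEIGHTS[i] for i in subset))


def permutation_shapley(n_points, utility, n_permutations, seed=0):
    """Average the marginal contribution of each point over random permutations."""
    rng = random.Random(seed)
    totals = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset = set()
        previous = utility(subset)
        for i in indices:
            # Marginal contribution of point i given the points added before it
            subset.add(i)
            current = utility(subset)
            totals[i] += current - previous
            previous = current
    return [total / n_permutations for total in totals]


estimates = permutation_shapley(len(WEIGHTS), utility, n_permutations=100, seed=16)
print(estimates)
```

By construction, the estimates sum to the utility of the full set minus that of
the empty set, and a point that never changes the utility (index 2 here) gets a
value of exactly zero.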

# Contributing

Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# Papers

We currently implement the following papers:

## Data Valuation

- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
  Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
  Computers & Operations Research, Selected papers presented at the Tenth
  International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
  2009): 1726–30.
- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
  for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
  International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia. 
  [Improving Cooperative Game Theory-Based Data Valuation via Data Utility
  Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
  Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
  Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
  Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
  Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
  International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
  2021.
- Yan, Tom, and Ariel D. Procaccia. [If You Like Shapley Then You’ll Love the
  Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
  the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
  Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
  Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
  In 22nd International Conference on Artificial Intelligence and Statistics,
  1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
  Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
  arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
  Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
  In Proceedings of the 25th International Conference on Artificial Intelligence
  and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
- Kwon, Yongchan, and James Zou. [Data-OOB: Out-of-Bag Estimate as a Simple and
  Efficient Data Value](https://proceedings.mlr.press/v202/kwon23e.html). In
  Proceedings of the 40th International Conference on Machine Learning, 18135–52.
  PMLR, 2023.
- Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. [CS-Shapley: Class-Wise
  Shapley Values for Data Valuation in
  Classification](https://openreview.net/forum?id=KTOcrOR5mQ9). In Proc. of the
  Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS).
  New Orleans, Louisiana, USA, 2022.

## Influence Functions

- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
  Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
  Proceedings of the 34th International Conference on Machine Learning,
  70:1885–94. Sydney, Australia: PMLR, 2017.
- Agarwal, Naman, Brian Bullins, and Elad Hazan. [Second-Order Stochastic Optimization
  for Machine Learning in Linear Time](https://www.jmlr.org/papers/v18/16-491.html).
  Journal of Machine Learning Research 18 (2017): 1–40.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. 
  [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). 
  In Proceedings of the AAAI-22. arXiv, 2021.
- Martens, James, and Roger Grosse. [Optimizing Neural Networks with
  Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671).
  In International Conference on Machine Learning (ICML), 2015.
- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal
  Vincent. [Fast Approximate Natural Gradient Descent in a Kronecker-factored
  Eigenbasis](https://arxiv.org/abs/1806.03884). In Advances in Neural
  Information Processing Systems 31, 2018.

# License

pyDVL is distributed under
[LGPL-3.0](https://www.gnu.org/licenses/lgpl-3.0.html). The complete license text can
be found in two files: [LICENSE](LICENSE) and [COPYING.LESSER](COPYING.LESSER).

All contributions will be distributed under this license.


