Metadata-Version: 2.1
Name: tatter
Version: 1.0.0
Summary: Two-Sample Hypothesis Test. A hypothesis testing tool for multi-dimensional data.
Home-page: https://github.com/afarahi/tatter
Author: Arya Farahi
Author-email: arya.farahi@austin.utexas.edu
License: MIT
Keywords: MMD,Two-sample test,K-S,K-L
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Astronomy
Classifier: Topic :: Scientific/Engineering :: Physics
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: sklearn
Requires-Dist: joblib
Requires-Dist: tqdm
Requires-Dist: pathlib

![GitHub](https://img.shields.io/github/license/afarahi/tatter)
![PyPI](https://img.shields.io/pypi/v/tatter)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tatter)
<a href="http://ascl.net/2006.007"><img src="https://img.shields.io/badge/ascl-2006.007-blue.svg?colorB=262255" alt="ascl:2006.007" /></a>

<p align="center">
  <img src="logo.png" width="300" title="logo">
</p>

# Introduction

TATTER (Two-sAmple TesT EstimatoR) is a tool for performing two-sample hypothesis tests.
 The two-sample hypothesis test is concerned with whether distributions
 p(x) and q(x) are different on the basis of finite samples drawn from each
 of them. This ubiquitous problem appears in a legion of applications,
 ranging from data mining to data analysis and inference.
 This implementation can perform the Kolmogorov-Smirnov test
 (for one-dimensional data only), Kullback-Leibler divergence,
 and Maximum Mean Discrepancy (MMD) test. The module performs a bootstrap
 algorithm to estimate the null distribution and compute the p-value.
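
To make the idea of a resampled null distribution concrete, here is a minimal NumPy sketch of an MMD two-sample test. It is an illustration only, not TATTER's implementation: the helper names (`rbf_kernel`, `mmd2_biased`, `resampled_mmd_test`) are hypothetical, it uses the biased MMD estimator, and it builds the null by permuting the pooled sample, a common resampling scheme that may differ from the bootstrap TATTER uses.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2_biased(K, n):
    """Biased MMD^2 estimate from a pooled kernel matrix.

    The first n rows/columns of K belong to the first sample,
    the remaining ones to the second sample.
    """
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()


def resampled_mmd_test(X, Y, gamma=1.0, iterations=500, seed=0):
    """Return (observed statistic, null realizations, p-value).

    Assumption: the null is approximated by permuting the pooled sample,
    which is not necessarily the resampling scheme used by TATTER.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    K = rbf_kernel(Z, Z, gamma)  # computed once, reused for every resample
    observed = mmd2_biased(K, n)

    null = np.empty(iterations)
    for i in range(iterations):
        idx = rng.permutation(len(Z))
        null[i] = mmd2_biased(K[np.ix_(idx, idx)], n)

    # Fraction of null realizations at least as extreme as the observation,
    # with a +1 correction so the p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + iterations)
    return observed, null, p_value
```

A small p-value indicates that the observed discrepancy is unlikely if both samples were drawn from the same distribution.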

## Dependencies

`numpy`, `matplotlib`, `sklearn`, `joblib`, `tqdm`, `pathlib`

## Cautions

- The employed implementation of the Kullback-Leibler divergence is slow,
 so generating a few thousand bootstrap realizations when the
 sample size is large (n, m > 1000) is not practical.

- The provided tests reproduce Figures X, X, and X in the paper. Running
all of these tests takes ~30 minutes. If you are impatient to reproduce
one of the figures, try `mnist_digits_distance.py` first.

## References

[1]. A. Farahi and Y. Chen, "[TATTER: A hypothesis testing tool for multi-dimensional data](https://www.sciencedirect.com/science/article/abs/pii/S2213133720300998)." Astronomy and Computing, Volume 34, January (2021).

[2]. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola,
 "[A kernel two-sample test](http://www.jmlr.org/papers/v13/gretton12a.html)."
  Journal of Machine Learning Research 13, no. Mar (2012): 723-773.

[3]. Q. Wang, S. R. Kulkarni, and S. Verdú,
"[Divergence estimation for multidimensional densities via k-nearest-neighbor distances](https://ieeexplore.ieee.org/abstract/document/4839047)."
 IEEE Transactions on Information Theory 55, no. 5 (2009): 2392-2405.

[4]. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling,
 "Numerical recipes." (1989).



## Quickstart

To start using TATTER, import the primary function with
`from tatter import two_sample_test`. The exact requirements for the inputs
are listed in the docstring of the `two_sample_test()` function.
An example of using TATTER looks like this:

      from tatter import two_sample_test

      # X, Y, and gamma must be defined before this call.
      test_value, test_null, p_value = two_sample_test(
          X, Y,
          model='MMD',
          iterations=1000,
          kernel_function='rbf',
          gamma=gamma,
          n_jobs=4,
          verbose=True,
          random_state=0,
      )
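
The call above leaves `X`, `Y`, and `gamma` undefined. The following is a minimal sketch of one way to prepare them with synthetic data. It rests on assumptions this README does not confirm: that each sample is a 2-D array of shape `(n_samples, n_features)`, and that the `rbf` kernel follows the scikit-learn convention `exp(-gamma * ||x - y||^2)`. The median heuristic used to pick `gamma` is a common default, not necessarily the choice made in the paper; check the `two_sample_test()` docstring for the actual input requirements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic samples with different means (assumed layout: rows are
# observations, columns are features). The sample sizes may differ.
X = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
Y = rng.normal(loc=0.5, scale=1.0, size=(150, 3))

# Median heuristic: set the kernel bandwidth sigma to the median pairwise
# distance of the pooled sample, then gamma = 1 / (2 * sigma^2).
Z = np.vstack([X, Y])
diffs = Z[:, None, :] - Z[None, :, :]
sq_dists = np.sum(diffs**2, axis=-1)

# Use only the strict upper triangle so each pair is counted once and
# the zero self-distances are excluded.
upper = np.triu_indices_from(sq_dists, k=1)
median_sq_dist = np.median(sq_dists[upper])
gamma = 1.0 / (2.0 * median_sq_dist)
```

With these three names defined, the quickstart call can be run as written.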



