Metadata-Version: 2.1
Name: dolma
Version: 0.6.0
Classifier: Development Status :: 3 - Alpha
Classifier: Typing :: Typed
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: boto3
Requires-Dist: cached-path ==1.3.4
Requires-Dist: msgspec >=0.14.2
Requires-Dist: presidio_analyzer ==2.2.32
Requires-Dist: pycld2 ==0.41
Requires-Dist: fasttext >=0.9.2
Requires-Dist: tokenizers >=0.13.3, <1.0.0
Requires-Dist: omegaconf >=2.3.0
Requires-Dist: anyascii >=0.3.2
Requires-Dist: uniseg
Requires-Dist: pyyaml
Requires-Dist: blingfire ==0.1.8
Requires-Dist: detect-secrets ==1.4.0
Requires-Dist: rich >=10.12.0
Requires-Dist: smart-open >=6.3.0
Requires-Dist: nltk ==3.8.1
Requires-Dist: fsspec >=2021.10.0
Requires-Dist: s3fs >=2021.10.0
Requires-Dist: black >=22.6.0 ; extra == 'dev'
Requires-Dist: isort >=5.10.1 ; extra == 'dev'
Requires-Dist: mypy >=0.971 ; extra == 'dev'
Requires-Dist: pytest >=5.2 ; extra == 'dev'
Requires-Dist: ipython >=8.4.0 ; extra == 'dev'
Requires-Dist: autopep8 >=1.7.0 ; extra == 'dev'
Requires-Dist: flake8 >=5.0 ; extra == 'dev'
Requires-Dist: ipdb >=0.13.0 ; extra == 'dev'
Requires-Dist: flake8-pyi >=22.8.1 ; extra == 'dev'
Requires-Dist: Flake8-pyproject >=1.1.0 ; extra == 'dev'
Requires-Dist: awscli >=1.16.0 ; extra == 'dev'
Provides-Extra: dev
License-File: LICENSE
Summary: Data filters
Author-email: Allen Institute for Artificial Intelligence <contact@allenai.org>, Luca Soldaini <luca@soldaini.net>, Kyle Lo <kylel@allenai.org>, Rodney Kinney <rodneyk@allenai.org>, Aakanksha Naik <aakankshan@allenai.org>, Abhilasha Ravichander <abhilashar@allenai.org>, Akshita Bhagia <akshitab@allenai.org>, Dirk Groeneveld <dirkg@allenai.org>, Dustin Schwenk <dustins@allenai.org>, Ian Magnusson <ianm@allenai.org>, Khyathi Chandu <khyathic@allenai.org>
Maintainer-email: Allen Institute for Artificial Intelligence <contact@allenai.org>
License: Apache-2.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/allenai/dolma

# dolma

*Data to feed OLMo's Appetite*


<img alt="DOLMa logo. It's a watercolor of grape leaves with the word DOLMa in the top left." src="res/logo.png" width="256"></img>

Data and tools for generating and inspecting OLMo pre-training data.


## Setup

Install Rust
```
curl https://sh.rustup.rs -sSf | sh
```

Install [CMake](https://cmake.org/install/)

  * On **Mac OSX** with `brew install cmake`
  * On **Linux** with `apt-get install cmake`


Install [OpenSSL](https://www.openssl.org/)

  * On **Mac OSX** with `brew install openssl re2`
  * On **Linux** with `apt-get install openssl`

Install Protobuf

  * On **Mac OSX** with `brew install protobuf`
  * On **Linux** with `apt-get install protobuf-compiler`
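
Before moving on to the Python steps, it can help to confirm that the tools installed above are actually on your `PATH`. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper, and the binary names (`cargo` for Rust, `cmake`, `openssl`, and `protoc` for Protobuf) are assumptions about what each package installs on your system.

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed binary names for the prerequisites listed above.
check_tools cargo cmake openssl protoc || echo "install the missing tools before continuing"
```

If `cargo` is reported missing right after running the Rust installer, opening a new shell usually picks up the updated `PATH`.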

Set up Python
```
conda create -n dolma python=3.10
conda activate dolma
```


Install [Maturin](https://www.maturin.rs/)

```
pip install maturin
maturin develop
```


Install this repository
```
cd dolma
pip install -e .
```
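
Once the steps above finish, a quick import check can confirm that the package is visible to the active Python environment. This is a rough sketch under two assumptions: that the importable module is named `dolma` (matching the package name), and that the interpreter is reachable as `python` unless the hypothetical `PYTHON` variable overrides it. `can_import` is an illustrative helper, not something this repository provides.

```shell
# Hypothetical helper: succeed if the given module can be imported.
# Uses $PYTHON if set, otherwise falls back to `python`.
can_import() {
    "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

# Assumes the installed module is named `dolma`.
if can_import dolma; then
    echo "dolma is importable"
else
    echo "dolma is not importable; check that the conda environment is active and re-run the steps above"
fi
```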


## Citation

If you use this repository, please cite it as:

```bibtex
@software{dolma,
    author = {Soldaini, Luca and Lo, Kyle and Kinney, Rodney and Naik, Aakanksha and Ravichander, Abhilasha and Bhagia, Akshita and Groeneveld, Dirk and Schwenk, Dustin and Magnusson, Ian and Chandu, Khyathi},
    license = {{Apache-2.0}},
    title = {{DOLMa}},
    url = {https://github.com/allenai/dolma}
}
```

