Metadata-Version: 2.4
Name: capricho
Version: 1.0.0
Summary: Transparent, flexible and reproducible ChEMBL data curation, aggregation and analysis.
Author-email: David Araripe <david@araripe.nl>
Maintainer-email: David Araripe <david@araripe.nl>
License: MIT License
        
        Copyright (c) 2023 David Araripe
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: homepage, https://github.com/David-Araripe/Capricho
Project-URL: repository, https://github.com/David-Araripe/Capricho
Project-URL: documentation, https://capricho.readthedocs.io/en/latest/
Keywords: ChEMBL,drug discovery,QSAR,cheminformatics,activity curation
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: chem-filters>=1.0.0
Requires-Dist: chembl_webresource_client
Requires-Dist: chembl_downloader
Requires-Dist: tabulate
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: loguru
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: joblib
Requires-Dist: rdkit>=2022.09.1
Requires-Dist: pyarrow
Requires-Dist: typer
Requires-Dist: job-tqdflex
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: furo; extra == "docs"
Requires-Dist: myst-parser>=3.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints; extra == "docs"
Dynamic: license-file

<div align="center">
  <img src="logo.svg" alt="" width=240>
  <p><strong>The ChEMBL data curator that flags issues instead of silently dropping them.</strong></p>

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Code style: black](https://img.shields.io/badge/code%20style-black-black?style=flat-square)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat-square&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![License: MIT](https://img.shields.io/badge/License-MIT-purple?style=flat-square)](https://opensource.org/licenses/MIT)
[![GH Actions](https://github.com/David-Araripe/Capricho/actions/workflows/ci.yml/badge.svg?event=push)](https://github.com/David-Araripe/Capricho/actions)

</div>

> Inspired by the Portuguese word "*capricho*" [🔊](https://ipa-reader.com/?text=ka%CB%88p%C9%BEi.%CA%83u&voice=Ricardo). Doing something *with capricho* means doing it *meticulously*, *with care* and *attention to detail*.

CAPRICHO (**C**hEMBL **A**ggregation **P**ackage with **R**obust **I**nspection and **C**uration **H**andling **O**ptions) is a Python package that streamlines fetching, curating, and aggregating ChEMBL data into a machine-learning-ready format for drug discovery, in a flexible and reproducible manner. Instead of making opinionated decisions about the source data, CAPRICHO curates it with quality control filters chosen by the user. Its guiding principle is to never silently drop data: entries that don't meet the criteria are flagged, so the user can analyze how each curation step affects the comparability of assay readouts for the same compound.
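
The flag-instead-of-drop idea can be illustrated with a minimal sketch. This is not CAPRICHO's implementation or API; the `Measurement` class, its field names, and the `missing_value` flag are hypothetical and exist only to show the principle.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Measurement:
    """One hypothetical activity record (illustrative field names)."""
    compound_id: str
    value: Optional[float]
    flags: list = field(default_factory=list)


def flag_missing_values(records):
    """Mark records that have no value instead of removing them."""
    for record in records:
        if record.value is None:
            record.flags.append("missing_value")
    return records


records = flag_missing_values([
    Measurement("CPD-1", 6.2),
    Measurement("CPD-2", None),
])

# Every input row is still present; the user decides what to do with flagged ones.
clean = [r for r in records if not r.flags]
flagged = [r for r in records if r.flags]
```

Because nothing is deleted, the effect of a filter can be inspected afterwards by comparing `clean` against `flagged`.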

## 🎯 Goals

The development of CAPRICHO is guided by two core principles:
- **Transparency Above All**: Data curation should never be a black box. Data points removed during curation should be saved so the user can scrutinize them, and the original data should always be preserved to ensure data integrity.
- **Flexibility by Design**: Every modeling project is unique. The tool must support flexible data collection and aggregation, allowing any ChEMBL metadata column to be carried along when bioactivity values for the same compound are aggregated.
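
As a rough illustration of the second principle, the sketch below aggregates repeated measurements per compound while keeping a metadata column alongside the aggregated value. The row layout, the `|` separator, and the use of the mean are assumptions made for this example, not CAPRICHO's actual output format or defaults.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (compound key, activity value, assay ID).
rows = [
    ("CPD-1", 6.0, "ASSAY-A"),
    ("CPD-1", 7.0, "ASSAY-B"),
    ("CPD-2", 5.5, "ASSAY-A"),
]


def aggregate(rows):
    """Average values per compound and keep the contributing assay IDs."""
    grouped = defaultdict(list)
    for compound, value, assay in rows:
        grouped[compound].append((value, assay))

    result = {}
    for compound, items in grouped.items():
        result[compound] = {
            "mean_value": mean([v for v, _ in items]),
            "n_measurements": len(items),
            # The metadata column is concatenated rather than discarded.
            "assay_ids": "|".join(a for _, a in items),
        }
    return result


aggregated = aggregate(rows)
```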

## ✨ Features

- Data retrieval by any ChEMBL identifier (molecule IDs, target IDs, assay IDs, or document IDs)
- ADMET data curation support with unit conversion and non-pChEMBL aggregation
- Quality control through data flagging — never silently drops data
- Customizable filtering options, including the maximal curation standards introduced by [Landrum & Riniker (2024)](https://doi.org/10.1021/acs.jcim.4c00049)
- Configurable data aggregation options
- Binary classification support with censored data handling
- Save a fetching and processing recipe for reproducibility
- Command-line interface for easy use
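
To make the idea of binary classification with censored data concrete, here is a small sketch of one possible labeling rule. It assumes potency values on a concentration scale (nM, lower is more potent) with a relation of `=`, `<`, or `>`; the function name, the threshold, and the choice to return `None` for undecidable cases are assumptions for illustration and may differ from what the `binarize` command does.

```python
from typing import Optional


def binarize(value_nm: float, relation: str, threshold_nm: float) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None when the label is undecidable."""
    if relation == "=":
        return 1 if value_nm <= threshold_nm else 0
    if relation == "<":
        # The true value lies below value_nm, so the compound is certainly
        # active only if that upper bound is itself within the threshold.
        return 1 if value_nm <= threshold_nm else None
    if relation == ">":
        # The true value lies above value_nm, so the compound is certainly
        # inactive only if that lower bound already reaches the threshold.
        return 0 if value_nm >= threshold_nm else None
    raise ValueError(f"Unsupported relation: {relation!r}")


labels = [
    binarize(500.0, "=", 1000.0),     # exact value below the threshold
    binarize(10000.0, ">", 1000.0),   # censored, but clearly inactive
    binarize(100.0, ">", 1000.0),     # censored and undecidable
]
```

Censored measurements that cannot be placed on either side of the threshold are left unlabeled here rather than guessed.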

## ⚙️ Installation

The most recent release can be installed from PyPI with uv:
```shell
uv pip install capricho
```

or with pip:
```shell
python -m pip install capricho
```

Alternatively, install directly from the GitHub repository with uv:
```shell
uv pip install git+https://github.com/David-Araripe/Capricho.git
```
or with pip:
```shell
python -m pip install git+https://github.com/David-Araripe/Capricho.git
```

## 🚀 Quick Start

### Basic Usage
```bash
# Download ChEMBL database
capricho download

# Get bioactivity data for EGFR
capricho get --target-ids CHEMBL203 --output-path egfr_data.csv

# Get high-confidence data for multiple targets
capricho get --target-ids CHEMBL203,CHEMBL204 --confidence-scores 8,9 --output-path results.csv
```
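
The same `get` call can be assembled from Python when it needs to be scripted. The helper below is a hypothetical convenience wrapper, not part of CAPRICHO; it only builds the argument list from the options shown above and leaves execution to the caller.

```python
import shlex


def build_get_command(target_ids, output_path, confidence_scores=None):
    """Build the argument list for `capricho get` using the options shown above."""
    command = ["capricho", "get", "--target-ids", ",".join(target_ids)]
    if confidence_scores:
        scores = ",".join(str(score) for score in confidence_scores)
        command += ["--confidence-scores", scores]
    command += ["--output-path", output_path]
    return command


command = build_get_command(["CHEMBL203", "CHEMBL204"], "results.csv", [8, 9])
print(shlex.join(command))

# To actually run it (requires capricho to be installed):
# import subprocess
# subprocess.run(command, check=True)
```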

### Tab Completion

Our CLI supports tab completion for commands and options. To enable it, run the following command in your terminal:

```bash
capricho --install-completion
```

### Key Features
- **Five main commands**: `download`, `explore`, `get`, `prepare`, `binarize`
- **Flexible filtering**: By confidence, assay type, bioactivity type
- **Transparent processing**: All filtering steps are logged and flagged
- **Reproducible workflows**: Automatic recipe generation
- **Multiple backends**: Local SQL or web API
- **Binary classification support**: Convert continuous activity values to binary labels

## 📖 Documentation

For comprehensive documentation including detailed CLI options, advanced usage, tutorials, and API reference, visit our [full documentation](https://capricho.readthedocs.io/en/latest/).

**Quick Links:**
- [Installation Guide](https://capricho.readthedocs.io/en/latest/installation.html)
- [Quick Start](https://capricho.readthedocs.io/en/latest/quickstart.html)
- [CLI Reference](https://capricho.readthedocs.io/en/latest/cli-reference.html)
- [Key Concepts](https://capricho.readthedocs.io/en/latest/concepts.html)
- [ADMET Data Guide](https://capricho.readthedocs.io/en/latest/guides/admet-data.html)
- [API Reference](https://capricho.readthedocs.io/en/latest/api/index.html)

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
