Metadata-Version: 2.4
Name: debase
Version: 0.6.1
Summary: Enzyme lineage analysis and sequence extraction package
Home-page: https://github.com/YuemingLong/DEBase
Author: DEBase Team
Author-email: DEBase Team <ylong@caltech.edu>
License: MIT
Project-URL: Homepage, https://github.com/YuemingLong/DEBase
Project-URL: Documentation, https://github.com/YuemingLong/DEBase#readme
Project-URL: Repository, https://github.com/YuemingLong/DEBase
Project-URL: Issues, https://github.com/YuemingLong/DEBase/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Requires-Dist: PyMuPDF>=1.18.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: google-generativeai>=0.3.0
Requires-Dist: biopython>=1.78
Requires-Dist: requests>=2.25.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: PyPDF2>=2.0.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: networkx>=2.5
Provides-Extra: rdkit
Requires-Dist: rdkit>=2020.03.1; extra == "rdkit"
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme; extra == "docs"
Requires-Dist: myst-parser; extra == "docs"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# DEBase

DEBase is a Python package for extracting and analyzing enzyme lineage data from scientific papers using AI-powered parsing.

## Features

- Extract enzyme variant lineages from PDF documents
- Parse protein and DNA sequences with mutation annotations
- Extract reaction performance metrics: yield, total turnover number (TTN), and enantiomeric excess (ee)
- Extract and organize substrate scope data
- Match enzyme variants across different data sources using AI
- Generate structured CSV outputs for downstream analysis
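
As a starting point for downstream analysis, the sketch below inspects a CSV produced by the pipeline using only the standard library. The column names are not documented here, so the code reads whatever header the file contains and makes no assumption about it. The helper name `summarize_csv` is hypothetical and is not part of DEBase.

```python
import csv
from typing import List, Tuple


def summarize_csv(path: str) -> Tuple[List[str], int]:
    """Return the column names and the number of data rows in a CSV file.

    Hypothetical helper for illustration; it does not rely on any
    particular DEBase column layout.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)  # consume all data rows
        columns = list(reader.fieldnames or [])  # header row, if any
    return columns, len(rows)
```

For example, `summarize_csv("results.csv")` returns the header and row count of the file written by `--output results.csv`. Since pandas is already a dependency, `pandas.read_csv` is an equally reasonable choice for heavier analysis.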

## Installation

```bash
pip install debase
```

## Quick Start

```bash
# Run the complete pipeline
debase --manuscript paper.pdf --si supplementary.pdf --output results.csv

# Enable debug mode to save Gemini prompts and responses
debase --manuscript paper.pdf --si supplementary.pdf --output results.csv --debug-dir ./debug_output

# Individual components with debugging
python -m debase.enzyme_lineage_extractor --manuscript paper.pdf --output lineage.csv --debug-dir ./debug_output
python -m debase.reaction_info_extractor --manuscript paper.pdf --lineage-csv lineage.csv --output reactions.csv --debug-dir ./debug_output
python -m debase.substrate_scope_extractor --manuscript paper.pdf --lineage-csv lineage.csv --output substrate_scope.csv --debug-dir ./debug_output
python -m debase.lineage_format -r reactions.csv -s substrate_scope.csv -o final.csv -v
```
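
When the pipeline has to run over many papers, it can help to assemble the command line in Python. The sketch below only builds the argument list for the `debase` command using the flags shown above. The function name `build_debase_command` is hypothetical, and treating `--debug-dir` as the only optional flag is an assumption based on the examples in this section.

```python
import shlex
from typing import List, Optional


def build_debase_command(
    manuscript: str,
    si: str,
    output: str,
    debug_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the `debase` CLI (hypothetical helper).

    Only flags that appear in the Quick Start are used.
    """
    cmd = ["debase", "--manuscript", manuscript, "--si", si, "--output", output]
    if debug_dir is not None:
        # Assumption: --debug-dir is optional, as in the second example above.
        cmd += ["--debug-dir", debug_dir]
    return cmd


cmd = build_debase_command(
    "paper.pdf", "supplementary.pdf", "results.csv", debug_dir="./debug_output"
)
# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the pipeline. Passing a list instead of a single string avoids shell quoting problems with file names that contain spaces.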

## Debugging

Use the `--debug-dir` flag to save all Gemini API prompts and responses for debugging. The saved material includes:
- Location extraction prompts
- Sequence extraction prompts (can be very large, up to 150K characters)
- Enzyme matching prompts
- All API responses with timestamps

Note: `lineage_format.py` uses `-v` for verbose output instead of `--debug-dir`.
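
Because some saved prompts can be very large, listing the debug directory by file size is a quick way to find them. The sketch below is a minimal example that assumes nothing about how DEBase names or nests the saved files. The helper name `list_debug_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def list_debug_files(debug_dir: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) pairs, largest file first.

    Hypothetical helper; no DEBase file-naming scheme is assumed.
    """
    root = Path(debug_dir)
    if not root.is_dir():
        return []  # nothing saved yet, or the path is wrong
    files = [
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    ]
    return sorted(files, key=lambda item: item[1], reverse=True)


# Directory name taken from the Quick Start examples.
for name, size in list_debug_files("./debug_output"):
    print(f"{size:>10d}  {name}")
```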

## Requirements

- Python 3.8+
- Google Gemini API key (set as the `GEMINI_API_KEY` environment variable)
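
A short pre-flight check can catch a missing key or an unsupported interpreter before a long extraction run starts. The sketch below tests only the two requirements listed above. The function name `check_requirements` and its parameters are hypothetical and exist to keep the check easy to test.

```python
import os
import sys
from typing import List, Mapping, Optional, Tuple


def check_requirements(
    environ: Optional[Mapping[str, str]] = None,
    version_info: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    if environ is None:
        environ = os.environ
    if version_info is None:
        version_info = (sys.version_info[0], sys.version_info[1])

    problems = []
    if tuple(version_info) < (3, 8):
        problems.append("Python 3.8 or newer is required")
    if not environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY environment variable is not set")
    return problems


for problem in check_requirements():
    print(f"Requirement not met: {problem}")
```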

## Version

0.6.1

## License

MIT License

## Authors

DEBase Team - Caltech

## Contact

ylong@caltech.edu
