Metadata-Version: 2.4
Name: information-composer
Version: 0.3.0
Summary: A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing
Author-email: Tao Zhang <forrest_zhang@163.com>
License: MIT License
        
        Copyright (c) 2024 Tao Zhang
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/yourusername/information-composer
Project-URL: Documentation, https://information-composer.readthedocs.io/
Project-URL: Repository, https://github.com/yourusername/information-composer.git
Project-URL: Issues, https://github.com/yourusername/information-composer/issues
Keywords: web scraping,information collection,data composition,markdown,llm,filter,academic,paper,nlp,ai,dashscope,llama-index
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: habanero>=1.2.0
Requires-Dist: pandas>=2.2.0
Requires-Dist: pubmed-parser>=0.5.0
Requires-Dist: biopython>=1.84
Requires-Dist: pony>=0.7.17
Requires-Dist: llama-index>=0.10.0
Requires-Dist: dashscope>=1.14.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: llama-index-llms-dashscope>=0.1.0
Requires-Dist: markdown>=3.5.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: nltk>=3.8.0
Requires-Dist: spacy>=3.7.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: asyncio-throttle>=1.0.0
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: pypdfium2>=4.0.0
Requires-Dist: feedparser>=6.0.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: click>=8.0.0
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-core>=0.1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: types-requests; extra == "dev"
Requires-Dist: types-beautifulsoup4; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Dynamic: license-file

# Information Composer

[![Code Quality](https://github.com/yourusername/information-composer/actions/workflows/code-quality.yaml/badge.svg)](https://github.com/yourusername/information-composer/actions/workflows/code-quality.yaml)
[![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing.

## Features

### Core Modules
- **PDF Validation**: Validate PDF file formats and integrity
- **Markdown Processing**: Advanced markdown processing with LLM filtering
- **DOI Management**: Download and manage academic papers by DOI
- **PubMed Integration**: Query and process PubMed data with CLI tool

### AI-Powered Features
- **LLM Filtering**: Support for DashScope, Ollama, and OpenAI
- **PubMed Analyzer**: AI-powered literature analysis
- **Markdown Filter**: Intelligent content extraction and filtering

### Web Scraping & Data Collection
- **Crossref Integration**: Query Crossref API for bibliographic data
- **Google Scholar Integration**: Crawl and process Google Scholar citations
- **RSS Feed Processing**: Parse and manage scientific RSS feeds
- **RiceDataCN Parser**: Extract gene data from RiceDataCN database

### Developer Tools
- **Code Quality**: Ruff linter and formatter (primary tool)
- **Testing**: Pytest with 51%+ coverage (570 tests passed)
- **Multi-format Support**: PDF, Markdown, JSON, XML, TXT

## Installation

### Prerequisites

- Python 3.12 or later (the package metadata lists 3.12 and 3.13)
- Virtual environment (recommended)
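
To confirm that an interpreter meets the version requirement before installing, a small check like the one below can help. It is a minimal sketch: `python_is_supported` is a hypothetical helper written for this README, not part of the package, and the minimum version is copied from the `Requires-Python: >=3.12` metadata.

```python
import sys

# Minimum version taken from the package metadata (Requires-Python: >=3.12).
MIN_VERSION = (3, 12)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK")
    else:
        print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required")
```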

### Setup

1. Clone the repository:
```bash
git clone https://github.com/yourusername/information-composer.git
cd information-composer
```

2. Create and activate a virtual environment:
```bash
# Linux/macOS
python -m venv .venv
source .venv/bin/activate

# Windows
python -m venv .venv
.venv\Scripts\activate
```

3. Install the package in editable mode, along with its dependencies:
```bash
pip install -e .
```

## Quick Start

### Activate Environment
```bash
# Linux/macOS
source activate.sh

# Windows
activate.bat
```

### Available CLI Commands

| Command | Description |
|---------|-------------|
| `pdf-validator` | Validate PDF files |
| `md-llm-filter` | Filter markdown with LLM |
| `pubmed-cli` | Search and fetch PubMed data |
| `google-scholar-crawler` | Crawl Google Scholar citations |
| `rss-fetcher` | Fetch and process RSS feeds |
| `crossref-cli` | Query Crossref API |

### Examples

```bash
# Validate PDF files
pdf-validator document.pdf

# Validate directory of PDFs
pdf-validator -d /path/to/directory -r

# Filter markdown with LLM
md-llm-filter -i input.md -o output.md

# Search PubMed
pubmed-cli search "cancer research" -e user@example.com

# Get details for specific PMIDs
pubmed-cli details 12345678 23456789 -e user@example.com

# Crawl Google Scholar
google-scholar-crawler -q "machine learning" -n 20

# Fetch RSS feeds
rss-fetcher -u "https://example.com/feed.xml" -o output.json

# Query Crossref
crossref-cli query --doi "10.1038/nature12373"
```

## Python API Usage

### PubMed Integration
```python
from information_composer.pubmed import query_pmid, fetch_pubmed_details_batch_sync

# Search for articles
pmids = query_pmid("cancer immunotherapy", "your-email@example.com", 50)

# Fetch detailed information
details = fetch_pubmed_details_batch_sync(pmids, "your-email@example.com")
```
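
The name `fetch_pubmed_details_batch_sync` suggests that PMIDs are processed in batches. How the library splits the list is not described here, so the following is only a sketch of one common batching approach. `chunked` is a hypothetical helper, not part of `information_composer`, and the PMIDs are placeholders.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder PMIDs used only for illustration.
pmids = ["12345678", "23456789", "34567890"]
batches = list(chunked(pmids, 2))
```

Smaller batches keep individual requests short and make it easier to retry a single failed batch rather than the whole list.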

### Crossref Integration
```python
from information_composer import CrossrefClient, query_crossref

# Query Crossref API
client = CrossrefClient()
results = client.query_works(query="machine learning", limit=10)

# Or use the convenience function
works = query_crossref("machine learning")
```
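
For orientation, a free-text Crossref search corresponds to a request against the public Crossref REST API `works` endpoint. The sketch below only builds such a URL and does not send it. `build_works_url` is a hypothetical helper, and whether `CrossrefClient` issues requests in exactly this form is an assumption.

```python
from urllib.parse import urlencode

# Public Crossref REST API endpoint for works.
CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_works_url(query: str, rows: int = 10) -> str:
    """Build a Crossref works search URL (hypothetical helper, no request is sent)."""
    params = {"query": query, "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


url = build_works_url("machine learning", rows=10)
# url == "https://api.crossref.org/works?query=machine+learning&rows=10"
```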

### DOI Downloader
```python
from information_composer import DOIDownloader

# Download paper by DOI
downloader = DOIDownloader()
result = downloader.download_doi("10.1038/nature12373")
```
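
DOIs are often pasted with a `doi:` prefix or as full resolver links, so it can help to normalize them before passing them to a downloader. The sketch below shows one way to do that with the standard library. `doi_to_url` is a hypothetical helper, and nothing here is claimed about how `DOIDownloader` handles its input.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Strip common prefixes from a DOI and return its doi.org resolver URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Every DOI begins with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep "/" readable; percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


resolver_url = doi_to_url("doi:10.1038/nature12373")
```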

### Markdown Processing
```python
from information_composer import jsonify, markdownify

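# Sample input; replace with your own markdown text
markdown_content = "# Title\n\nSome body text."

# Convert markdown to JSON
json_data = jsonify(markdown_content)

# Convert JSON back to markdown
markdown_round_trip = markdownify(json_data)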
```

### PDF Validation
```python
from information_composer import PDFValidator

# Validate PDF
validator = PDFValidator(verbose=True)
is_valid, error = validator.validate_single_pdf("document.pdf")
```

### Google Scholar Crawling
```python
from information_composer.sites.google_scholar import SearchConfig, google_scholar_search

# Search Google Scholar
config = SearchConfig(query="deep learning", num_results=20)
results = google_scholar_search(config)
```

## Development

### Code Quality

This project uses Ruff as the primary code quality tool:

```bash
# Run all checks
python scripts/check_code.py

# Auto-fix issues
python scripts/check_code.py --fix

# With tests
python scripts/check_code.py --with-tests

# Verbose output
python scripts/check_code.py --verbose
```

### Testing

```bash
# Run tests
pytest tests/ -v

# Run tests with coverage
python scripts/check_code.py --with-tests
```

### Project Structure

```
information-composer/
├── src/information_composer/
│   ├── core/              # Core functionality (DOI downloader)
│   ├── crossref/          # Crossref API integration
│   ├── llm_filter/        # LLM-based markdown filtering
│   ├── markdown/          # Markdown processing utilities
│   ├── pdf/               # PDF validation
│   ├── pubmed/            # PubMed integration
│   ├── rss/               # RSS feed processing
│   └── sites/             # Web scraping (Google Scholar, RiceDataCN)
├── examples/              # Usage examples
├── scripts/               # Utility scripts
├── docs/                  # Documentation
└── tests/                 # Test files
```

## Documentation

- [📚 Complete Documentation](docs/README.md) - Full project documentation
- [🚀 Quick Start](docs/quickstart.md) - Get started in 5 minutes
- [⚙️ Configuration](docs/configuration.md) - Configuration options
- [📖 Feature Guides](docs/guides/) - Detailed feature documentation
  - [PDF Validation](docs/guides/pdf-validator.md)
  - [Markdown Processing](docs/guides/markdown-processing.md)
  - [PubMed Integration](docs/guides/pubmed-integration.md)
  - [Crossref Integration](docs/guides/crossref-integration.md)
  - [Google Scholar](docs/guides/google-scholar-integration.md)
  - [RSS Processing](docs/guides/rss-integration.md)
  - [DOI Download](docs/guides/doi-download.md)
  - [LLM Filtering](docs/guides/llm-filtering.md)
- [🔧 Development](docs/development/) - Development and contributing guide

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run code quality checks: `python scripts/check_code.py --fix`
5. Run tests: `python scripts/check_code.py --with-tests`
6. Submit a pull request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

For questions and support, please open an issue on GitHub.
