Metadata-Version: 2.4
Name: pyrtex
Version: 0.1.3
Summary: A Python library for batch text extraction and processing using Google Cloud Vertex AI
Author-email: CaptainTrojan <your-email@example.com>
License: MIT
Project-URL: Homepage, https://github.com/CaptainTrojan/pyrtex
Project-URL: Repository, https://github.com/CaptainTrojan/pyrtex
Project-URL: Issues, https://github.com/CaptainTrojan/pyrtex/issues
Project-URL: Documentation, https://github.com/CaptainTrojan/pyrtex#readme
Project-URL: Changelog, https://github.com/CaptainTrojan/pyrtex/releases
Keywords: ai,vertex-ai,google-cloud,text-extraction,batch-processing,gemini,pydantic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: jinja2>=3.0.0
Requires-Dist: google-cloud-aiplatform>=1.40.0
Requires-Dist: google-cloud-storage>=2.10.0
Requires-Dist: google-cloud-bigquery>=3.11.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: bump2version>=1.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Dynamic: license-file

# PyRTex

[![CI](https://github.com/CaptainTrojan/pyrtex/actions/workflows/ci.yml/badge.svg)](https://github.com/CaptainTrojan/pyrtex/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

A simple Python library for batch text extraction and processing using Google Cloud Vertex AI.

PyRTex makes it easy to process multiple documents, images, or text snippets with Gemini models and get back structured, type-safe results using Pydantic models.

## ✨ Features

- **🚀 Simple API**: Three steps: configure, submit, get results
- **📦 Batch Processing**: Process multiple inputs efficiently  
- **🔒 Type Safety**: Pydantic models for structured output
- **🎨 Flexible Templates**: Jinja2 templates for prompt engineering
- **☁️ GCP Integration**: Seamless Vertex AI and BigQuery integration
- **🧪 Testing Mode**: Simulate without GCP costs

## 📦 Installation

Install from PyPI (recommended):
```bash
pip install pyrtex
```

Or install from source:
```bash
git clone https://github.com/CaptainTrojan/pyrtex.git
cd pyrtex
pip install -e .
```

For development:
```bash
pip install -e ".[dev]"  # quotes keep shells such as zsh from expanding the brackets
```

## 🚀 Quick Start

```python
from pydantic import BaseModel
from pyrtex import Job

# Define your data structures
class TextInput(BaseModel):
    content: str

class Analysis(BaseModel):
    summary: str
    sentiment: str
    key_points: list[str]

# Create a job
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=Analysis,
    prompt_template="Analyze this text: {{ content }}",
    simulation_mode=True  # Set to False for real processing
)

# Add your data
job.add_request("doc1", TextInput(content="Your text here"))
job.add_request("doc2", TextInput(content="Another document"))

# Process and get results
for result in job.submit().wait().results():
    if result.was_successful:
        print(f"Summary: {result.output.summary}")
        print(f"Sentiment: {result.output.sentiment}")
    else:
        print(f"Error: {result.error}")
```
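
To check what a prompt will look like before submitting anything, you can render the same template locally with Jinja2. This is a minimal sketch that assumes PyRTex fills the template from the input model's fields, as the `{{ content }}` placeholder above suggests. `preview_prompt` is a hypothetical helper for illustration, not part of the PyRTex API.

```python
from jinja2 import Template


def preview_prompt(template: str, **fields: str) -> str:
    """Render a prompt template locally (hypothetical helper, not a PyRTex API)."""
    return Template(template).render(**fields)


# Same template and field value as in the Quick Start above
prompt = preview_prompt("Analyze this text: {{ content }}", content="Your text here")
print(prompt)  # Analyze this text: Your text here
```

Previewing prompts this way costs nothing and makes template typos visible early.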

## 📋 Core Workflow

PyRTex uses a simple 3-step workflow:

### 1. Configure & Add Data
```python
job = Job[YourSchema](model="gemini-2.0-flash-lite-001", ...)
job.add_request("key1", YourModel(data="value1"))
job.add_request("key2", YourModel(data="value2"))
```

### 2. Submit & Wait  
```python
job.submit().wait()  # Can be chained
```

### 3. Get Results
```python
for result in job.results():
    if result.was_successful:
        print(result.output)  # Use result.output (typed!)
    else:
        print(result.error)  # Handle result.error
```
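
If you want to handle successes and failures separately, the loop above can be wrapped in a small helper. The sketch below relies only on the three attributes shown in this README (`was_successful`, `output`, `error`). `split_results` and `FakeResult` are hypothetical names used so the snippet runs without PyRTex or GCP, and treating `error` as a string is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class FakeResult:
    """Stand-in for a PyRTex result; exposes only the attributes used in this README."""
    was_successful: bool
    output: Optional[Any] = None
    error: Optional[str] = None  # assumption: errors are reported as text


def split_results(results: Iterable[Any]) -> tuple[list[Any], list[Any]]:
    """Separate successful outputs from errors (hypothetical helper)."""
    outputs, errors = [], []
    for result in results:
        if result.was_successful:
            outputs.append(result.output)
        else:
            errors.append(result.error)
    return outputs, errors


outputs, errors = split_results([
    FakeResult(True, output={"summary": "ok"}),
    FakeResult(False, error="quota exceeded"),
])
print(f"{len(outputs)} succeeded, {len(errors)} failed")
```

With a real job, the same helper should work as `split_results(job.results())`, provided the result objects expose those attributes.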

## ⚙️ Configuration

### For Simulation Mode (No GCP Required)
```python
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt",
    simulation_mode=True  # No GCP setup needed
)
```

### For Production (GCP Required)

Set your GCP project ID:
```bash
export GOOGLE_PROJECT_ID="your-project-id"
```

Or configure directly in code:
```python
from pyrtex.config import InfrastructureConfig

config = InfrastructureConfig(project_id="your-project-id")
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt",
    config=config,
    simulation_mode=False
)
```

Then authenticate with GCP:
```bash
gcloud auth application-default login
```

### Troubleshooting

**Error: "Project was not passed and could not be determined from the environment"**

This happens when the GCP project ID is not set. You have three options:

1. **Use simulation mode** (recommended for testing):
   ```python
   simulation_mode=True  # No GCP setup needed
   ```

2. **Set environment variable**:
   ```bash
   export GOOGLE_PROJECT_ID="your-project-id"
   ```

3. **Configure in code**:
   ```python
   from pyrtex.config import InfrastructureConfig
   config = InfrastructureConfig(project_id="your-project-id")
   job = Job(..., config=config)
   ```
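
The options above can also be combined into a pre-flight check that falls back to simulation mode when no project ID is configured. This is a sketch under one assumption: `GOOGLE_PROJECT_ID` is the only place the project ID comes from. `should_simulate` is a hypothetical helper, not part of PyRTex.

```python
import os
from typing import Mapping, Optional


def should_simulate(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when GOOGLE_PROJECT_ID is missing or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return not env.get("GOOGLE_PROJECT_ID", "").strip()


print(should_simulate({}))  # True
print(should_simulate({"GOOGLE_PROJECT_ID": "your-project-id"}))  # False
```

The result could then be passed along as `simulation_mode=should_simulate()` when creating a job, so local runs without credentials never reach GCP.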

## 📚 Examples

The `examples/` directory contains complete working examples:

```bash
cd examples

# Generate sample files
python generate_sample_data.py

# Extract contact info from business cards
python 01_simple_text_extraction.py

# Parse product catalogs  
python 02_pdf_product_parsing.py

# Describe images
python 03_image_description.py
```

### Example Use Cases

- **📇 Business Cards**: Extract contact information
- **📄 Documents**: Process PDFs, images (PNG, JPEG)  
- **🛍️ Product Catalogs**: Parse pricing and inventory
- **🧾 Invoices**: Extract structured financial data
- **📊 Batch Processing**: Handle multiple files efficiently

## 🧪 Development

### Running Tests

```bash
# All tests (mocked, safe)
./test_runner.sh

# Specific test types
./test_runner.sh --unit
./test_runner.sh --integration
./test_runner.sh --flake

# Real GCP tests (costs money!)
./test_runner.sh --real --project-id your-project-id
```

Windows users:
```cmd
test_runner.bat --unit
test_runner.bat --flake
```

### Code Quality

- **flake8**: Linting
- **black**: Code formatting  
- **isort**: Import sorting
- **pytest**: Testing with coverage

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `./test_runner.sh`
5. Submit a pull request

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

## 🆘 Support

- **Issues**: [GitHub Issues](https://github.com/CaptainTrojan/pyrtex/issues)
- **Examples**: Check the `examples/` directory
- **Testing**: Use `simulation_mode=True` for development

### Common Issues

**"Project was not passed and could not be determined from the environment"**
- Solution: Set the `GOOGLE_PROJECT_ID` environment variable or use `simulation_mode=True`

**"Failed to initialize GCP clients"**  
- Solution: Run `gcloud auth application-default login` or use simulation mode

**Examples not working**
- Solution: Run `python generate_sample_data.py` in the `examples/` directory first to create the sample files
