Metadata-Version: 2.4
Name: kallia
Version: 0.1.2
Summary: Semantic Document Processing Library
Author-email: CK <ck@kallia.net>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/kallia-project/kallia
Project-URL: Issues, https://github.com/kallia-project/kallia/issues
Keywords: document-processing,semantic-chunking,document-analysis,text-processing,machine-learning
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi[standard]==0.116.1
Requires-Dist: docling==2.41.0
Dynamic: license-file

# Kallia

**Semantic Document Processing Library**

Kallia is a FastAPI-based document processing service that converts documents into intelligent semantic chunks. The library specializes in extracting meaningful content segments from documents while preserving context and semantic relationships.

## 🚀 Features

- **Document-to-Markdown Conversion**: Standardized processing pipeline for consistent output
- **Semantic Chunking**: Intelligent content segmentation that respects document structure and meaning
- **PDF Support**: Currently supports PDF documents with extensible architecture for additional formats
- **RESTful API**: Clean, well-documented API interface with comprehensive error handling
- **Configurable Processing**: Adjustable parameters for temperature, token limits, and page selection
- **Docker Ready**: Containerized deployment with Docker and docker-compose support
- **Vision-Language Model Integration**: Leverages advanced AI models for content understanding
- **Interactive Playground**: Chainlit-based demo application for testing and exploration

## 📋 Prerequisites

- Python 3.11 or higher (3.11, 3.12, 3.13 supported)
- Docker (optional, for containerized deployment)
- Access to a compatible language model API (OpenRouter, Ollama, etc.)

## 🛠️ Installation

### PyPI Installation (Recommended)

Install Kallia directly from PyPI:

```bash
pip install kallia
```

### Local Development Setup

1. **Clone the repository**

   ```bash
   git clone https://github.com/kallia-project/kallia.git
   cd kallia
   ```

2. **Install dependencies**

   ```bash
   cd kallia
   pip install -r requirements.txt
   ```

3. **Configure environment variables**

   ```bash
   cp .env.example .env
   ```

   Edit `.env` with your configuration:

   ```env
   KALLIA_PROVIDER_API_KEY=your_api_key_here
   KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
   KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
   ```

4. **Run the application**

   ```bash
   fastapi run kallia/main.py --port 8000
   ```

### Docker Deployment

1. **Using Docker Compose (Recommended)**

   ```bash
   cd kallia
   docker-compose up -d
   ```

2. **Manual Docker Build**

   ```bash
   # Build the Docker image
   docker build -t kallia:0.1.2 .

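   # Run the container. Inside the container, localhost refers to the container
   # itself, so point KALLIA_PROVIDER_BASE_URL at an address the container can
   # reach (for example host.docker.internal on Docker Desktop).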
   docker run -p 8000:80 \
     -e KALLIA_PROVIDER_API_KEY=ollama \
     -e KALLIA_PROVIDER_BASE_URL=http://localhost:11434/v1 \
     -e KALLIA_PROVIDER_MODEL=qwen2.5vl:32b \
     kallia:0.1.2
   ```

## ⚙️ Configuration

### Environment Variables

| Variable                   | Description                              | Example                     |
| -------------------------- | ---------------------------------------- | --------------------------- |
| `KALLIA_PROVIDER_API_KEY`  | API key for your language model provider | `ollama`                    |
| `KALLIA_PROVIDER_BASE_URL` | Base URL for the API endpoint            | `http://localhost:11434/v1` |
| `KALLIA_PROVIDER_MODEL`    | Model identifier to use                  | `qwen2.5vl:32b`             |
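
As an illustration of how these three variables fit together, the sketch below reads them from the environment and fails early if one is missing. It is not Kallia's own `settings.py`; the `ProviderConfig` class and `load_provider_config` function are hypothetical names used only for this example.

```python
import os
from dataclasses import dataclass

# The variable names come from the table above. Everything else in this
# sketch (class name, function name, error type) is hypothetical.
REQUIRED_VARS = (
    "KALLIA_PROVIDER_API_KEY",
    "KALLIA_PROVIDER_BASE_URL",
    "KALLIA_PROVIDER_MODEL",
)


@dataclass(frozen=True)
class ProviderConfig:
    api_key: str
    base_url: str
    model: str


def load_provider_config(env=None):
    """Read the provider settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both are reported as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ProviderConfig(
        api_key=env["KALLIA_PROVIDER_API_KEY"],
        # Drop a trailing slash so paths can be appended consistently.
        base_url=env["KALLIA_PROVIDER_BASE_URL"].rstrip("/"),
        model=env["KALLIA_PROVIDER_MODEL"],
    )
```

Calling `load_provider_config()` with no arguments reads from the process environment; passing a plain dictionary makes the function easy to exercise in tests.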

### Supported Providers

- **OpenRouter**: Use OpenRouter API for access to various models
- **Ollama**: Local model deployment with Ollama
- **Custom Endpoints**: Any OpenAI-compatible API endpoint

## 📖 Usage

### API Endpoint

**POST** `/documents`

Converts a document into semantic chunks with concise summaries.

#### Request Body

```json
{
  "url": "https://raw.githubusercontent.com/kallia-project/kallia/refs/tags/v0.1.2/assets/pdf/01.pdf",
  "page_number": 1,
  "temperature": 0.0,
  "max_tokens": 8192
}
```

#### Parameters

- `url` (string, required): URL to the document to process
- `page_number` (integer, optional): Specific page to process (default: 1)
- `temperature` (float, optional): Model temperature for processing (default: 0.0)
- `max_tokens` (integer, optional): Maximum tokens for processing (default: 8192)
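
A small helper can apply the documented defaults and catch obvious mistakes before a request is sent. The sketch below is a client-side convenience only: `build_documents_request` is a hypothetical name, and the range checks (positive `page_number` and `max_tokens`, non-negative `temperature`) are assumptions rather than the server's documented validation rules.

```python
def build_documents_request(url, page_number=1, temperature=0.0, max_tokens=8192):
    """Return a JSON-serializable body for POST /documents.

    Defaults mirror the parameter list above. The validation rules are
    assumptions made for this sketch, not the API's documented behavior.
    """
    if not isinstance(url, str) or not url:
        raise ValueError("url is required and must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    for name, value in (("page_number", page_number), ("max_tokens", max_tokens)):
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        raise ValueError("temperature must be a number")
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    return {
        "url": url,
        "page_number": page_number,
        "temperature": float(temperature),
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be passed as the JSON body of the request.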

#### Response

```json
{
  "documents": [
    {
      "page_number": 1,
      "chunks": [
        {
          "original_text": "Original document content...",
          "concise_summary": "Concise summary of the content..."
        }
      ]
    }
  ]
}
```
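
If you prefer typed objects over nested dictionaries, the response shape above maps naturally onto two small dataclasses. This is a minimal sketch: the field names follow the JSON shown above, while `Chunk`, `PageResult`, and `parse_documents_response` are hypothetical names and not part of Kallia's public API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    original_text: str
    concise_summary: str


@dataclass
class PageResult:
    page_number: int
    chunks: List[Chunk]


def parse_documents_response(payload):
    """Convert the decoded JSON body of POST /documents into PageResult objects."""
    return [
        PageResult(
            page_number=document["page_number"],
            chunks=[
                Chunk(
                    original_text=chunk["original_text"],
                    concise_summary=chunk["concise_summary"],
                )
                for chunk in document["chunks"]
            ],
        )
        for document in payload["documents"]
    ]
```

A missing key raises `KeyError`, which makes unexpected response shapes visible immediately instead of failing later in downstream code.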

### Example Usage

#### cURL

```bash
curl -X POST "http://localhost:8000/documents" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://raw.githubusercontent.com/kallia-project/kallia/refs/tags/v0.1.2/assets/pdf/01.pdf",
    "page_number": 1,
    "temperature": 0.0,
    "max_tokens": 4096
  }'
```

#### Python API Client

```python
import requests

response = requests.post(
    "http://localhost:8000/documents",
    json={
        "url": "https://raw.githubusercontent.com/kallia-project/kallia/refs/tags/v0.1.2/assets/pdf/01.pdf",
        "page_number": 1,
        "temperature": 0.0,
        "max_tokens": 4096
    }
)

# Raise an exception on 4xx/5xx responses before parsing the body
response.raise_for_status()
result = response.json()
documents = result["documents"]

for document in documents:
    print(f"Page {document['page_number']}:")
    for chunk in document["chunks"]:
        print(f"  Original: {chunk['original_text']}")
        print(f"  Summary: {chunk['concise_summary']}")
        print("  ---")
```

### Programmatic Usage

You can also use Kallia directly as a Python library without running the API server:

#### Convert PDF to Markdown

```python
from kallia.documents import Documents

# Convert a PDF document to markdown
url = "./assets/pdf/01.pdf"
page_number = 1
temperature = 0.0
max_tokens = 8192

markdown_content = Documents.to_markdown(
    source=url,
    page_number=page_number,
    temperature=temperature,
    max_tokens=max_tokens,
)

print(markdown_content)
```

#### Create Semantic Chunks

```python
from kallia.documents import Documents
from kallia.chunker import Chunker

# First convert document to markdown
url = "./assets/pdf/01.pdf"
page_number = 1
temperature = 0.0
max_tokens = 8192

markdown_content = Documents.to_markdown(
    source=url,
    page_number=page_number,
    temperature=temperature,
    max_tokens=max_tokens,
)

# Then create semantic chunks from the markdown
semantic_chunks = Chunker.create(
    text=markdown_content,
    temperature=temperature,
    max_tokens=max_tokens,
)

# Process the chunks
for chunk in semantic_chunks:
    print(f"Original: {chunk.original_text}")
    print(f"Summary: {chunk.concise_summary}")
    print("---")
```
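
Once chunks are available, a common next step is to persist them for indexing or later inspection. The sketch below serializes chunks to JSON Lines; it assumes only that each chunk exposes the `original_text` and `concise_summary` attributes used in the loop above. The `chunks_to_jsonl` helper and the `index` field are hypothetical and not part of Kallia.

```python
import json


def chunks_to_jsonl(chunks):
    """Serialize chunk objects to a JSON Lines string, one chunk per line."""
    lines = []
    for index, chunk in enumerate(chunks):
        record = {
            "index": index,  # Position of the chunk in the input sequence
            "original_text": chunk.original_text,
            "concise_summary": chunk.concise_summary,
        }
        # ensure_ascii=False keeps non-ASCII document text readable in the output.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

The result can be written to disk with an ordinary `open(path, "w", encoding="utf-8")` call.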

## 🎮 Interactive Playground

Kallia includes an interactive playground built with Chainlit for easy testing and exploration:

### Running the Playground

1. **Navigate to the playground directory**

   ```bash
   cd kallia-playground
   ```

2. **Install playground dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Configure environment variables**

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

4. **Run the playground**

   ```bash
   chainlit run kallia-playground/main.py
   ```

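5. **Access the interface**

   Open your browser to `http://localhost:8000`. If the Kallia API is already listening on port 8000, start the playground on another port with the `--port` option of `chainlit run`.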

### Playground Features

- **File Upload**: Upload PDF documents directly through the web interface
- **Real-time Processing**: Watch as documents are processed page by page
- **Interactive Q&A**: Ask questions about uploaded documents
- **Source References**: View the original chunks that inform each answer
- **Multi-page Support**: Automatically processes all pages of uploaded documents

## 🏗️ Project Structure

```
kallia/
├── kallia/                      # Main package directory
│   ├── kallia/
│   │   ├── __init__.py
│   │   ├── main.py              # FastAPI application entry point
│   │   ├── models.py            # Pydantic models for API
│   │   ├── constants.py         # Application constants
│   │   ├── documents.py         # Document processing logic
│   │   ├── chunker.py           # Semantic chunking implementation
│   │   ├── utils.py             # Utility functions
│   │   ├── logger.py            # Logging configuration
│   │   ├── settings.py          # Application settings
│   │   ├── exceptions.py        # Custom exceptions
│   │   ├── messages.py          # Message handling
│   │   ├── prompts.py           # AI model prompts
│   │   ├── image_caption_serializer.py
│   │   └── unordered_list_serializer.py
│   ├── requirements.txt         # Python dependencies
│   ├── Dockerfile               # Docker container configuration
│   ├── docker-compose.yml       # Docker Compose setup
│   └── .env.example             # Environment variables template
├── kallia-playground/           # Interactive demo application
│   ├── kallia-playground/
│   │   ├── __init__.py
│   │   ├── main.py              # Chainlit application
│   │   ├── qa.py                # Q&A functionality
│   │   ├── settings.py          # Playground settings
│   │   ├── constants.py         # Playground constants
│   │   └── chainlit.md          # Chainlit welcome screen content
│   ├── requirements.txt         # Playground dependencies
│   ├── Dockerfile               # Playground Docker config
│   ├── docker-compose.yml       # Playground Docker Compose
│   └── .env.example             # Playground environment template
├── tests/                       # Test suite
│   ├── __init__.py
│   ├── test_pdf_to_markdown.py
│   └── test_markdown_to_chunks.py
├── assets/                      # Test assets
│   └── pdf/
│       └── 01.pdf               # Sample PDF for testing
├── LICENSE                      # Apache 2.0 License
├── pyproject.toml               # Project configuration
└── README.md                    # This file
```

## 🔧 Development

### Code Style

The project follows Python best practices and uses:

- FastAPI as the web framework
- Pydantic for data validation
- Structured logging
- Comprehensive error handling

### Testing

The project includes comprehensive tests for core functionality:

```bash
# Run tests
pytest tests/

# Run specific tests
pytest tests/test_pdf_to_markdown.py
pytest tests/test_markdown_to_chunks.py
```

Test coverage includes:

- PDF to markdown conversion
- Markdown to semantic chunks processing
- End-to-end document processing pipeline

## 📦 Dependencies

### Core Dependencies

- **FastAPI**: Modern, fast web framework for building APIs
- **Docling**: Document processing and conversion library

### Full Dependency List

See `kallia/requirements.txt` for complete dependency specifications:

- `fastapi[standard]==0.116.1`
- `docling==2.41.0`

### Playground Dependencies

The interactive playground has additional dependencies listed in `kallia-playground/requirements.txt`:

- `chainlit`: For the interactive web interface
- `langchain`: For document processing and Q&A functionality
- `pdfminer`: For PDF metadata extraction

## 🚨 Error Handling

The API provides comprehensive error handling with appropriate HTTP status codes:

- **400 Bad Request**: Invalid parameters or unsupported file format
- **500 Internal Server Error**: Processing errors
- **503 Service Unavailable**: External service connectivity issues
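
A client can use these status codes to decide how to react to a failed request. The sketch below is one possible policy, not behavior defined by Kallia: treating only 503 as retryable is an assumption, and `describe_status` is a hypothetical helper name.

```python
def describe_status(status_code):
    """Map a status code from POST /documents to a (retryable, message) pair.

    The retry policy here is an assumption for illustration: only 503 is
    treated as worth retrying, since it signals a connectivity problem
    with an external service.
    """
    if 200 <= status_code < 300:
        return (False, "ok")
    known = {
        400: (False, "Invalid parameters or unsupported file format"),
        500: (False, "Processing error"),
        503: (True, "External service connectivity issue"),
    }
    return known.get(status_code, (False, f"Unexpected status {status_code}"))
```

With the `requests` client shown earlier, this would be called as `describe_status(response.status_code)` before deciding whether to retry or report the error.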

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 👨‍💻 Author

**CK**

- Email: ck@kallia.net
- GitHub: [@kallia-project](https://github.com/kallia-project/kallia)

## 🔗 Links

- [GitHub Repository](https://github.com/kallia-project/kallia)
- [Issues](https://github.com/kallia-project/kallia/issues)
- [PyPI Package](https://pypi.org/project/kallia/)
- [Docker Hub](https://hub.docker.com/r/overheatsystem/kallia)

## 📈 Version

Current version: **0.1.2**

---

Built with ❤️ for intelligent document processing
