Metadata-Version: 2.4
Name: ontocast
Version: 0.1.5
Summary: Agentic ontology and knowledge graph co-generation
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: asyncio>=3.4.3
Requires-Dist: click>=8.1.8
Requires-Dist: docling>=2.32.0
Requires-Dist: langchain-core>=0.3.60
Requires-Dist: langchain-experimental>=0.3.4
Requires-Dist: langchain-huggingface>=0.2.0
Requires-Dist: langchain-ollama>=0.3.3
Requires-Dist: langchain-openai>=0.3.17
Requires-Dist: langchain>=0.3.25
Requires-Dist: langgraph>=0.2.35
Requires-Dist: neo4j>=5.28.1
Requires-Dist: owlready2>=0.47
Requires-Dist: rapidfuzz>=3.13.0
Requires-Dist: rdflib>=7.1.4
Requires-Dist: rich>=14.0.0
Requires-Dist: robyn>=0.66.2
Requires-Dist: sentence-transformers>=4.1.0
Requires-Dist: simsimd>=6.2.1
Requires-Dist: suthing>=0.4.1
Description-Content-Type: text/markdown

# OntoCast <img src="https://raw.githubusercontent.com/growgraph/ontocast/refs/heads/main/docs/assets/favicon.ico" alt="Agentic Ontology Triplecast logo" style="height: 32px; width:32px;"/>

### Agentic ontology-assisted framework for semantic triple extraction

![Python](https://img.shields.io/badge/python-3.12-blue.svg) 
[![PyPI version](https://badge.fury.io/py/ontocast.svg)](https://badge.fury.io/py/ontocast)
[![PyPI Downloads](https://static.pepy.tech/badge/ontocast)](https://pepy.tech/projects/ontocast)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![pre-commit](https://github.com/growgraph/ontocast/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/growgraph/ontocast/actions/workflows/pre-commit.yml)

---

## Overview

OntoCast is a framework for extracting semantic triples (creating a Knowledge Graph) from documents using an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.

---

## Key Features

- **Ontology-Guided Extraction**: Ensures semantic consistency and co-evolves ontologies
- **Entity Disambiguation**: Resolves references across document chunks
- **Multi-Format Support**: Handles text, JSON, PDF, and Markdown
- **Semantic Chunking**: Splits text based on semantic similarity
- **MCP Compatibility**: Implements Model Control Protocol endpoints
- **RDF Output**: Produces standardized RDF/Turtle
- **Triple Store Integration**: Supports Neo4j (n10s) and Apache Fuseki

---

## Applications

OntoCast can be used for:

- **Knowledge Graph Construction**: Build domain-specific or general-purpose knowledge graphs from documents
- **Semantic Search**: Power search and retrieval with structured triples
- **GraphRAG**: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
- **Ontology Management**: Automate ontology creation, validation, and refinement
- **Data Integration**: Unify data from diverse sources into a semantic graph

---

## Installation

```sh
uv add ontocast 
# or
pip install ontocast
```

---

## Configuration

### Environment Variables

Copy the example file and edit as needed:

```bash
cp env.example .env
# Edit with your values
```

**Main options:**
```bash
# LLM Configuration
# common
LLM_PROVIDER=openai # or ollama
LLM_MODEL_NAME=gpt-4o-mini # ollama model
LLM_TEMPERATURE=0.0

# openai
OPENAI_API_KEY=your_openai_api_key_here

# ollama
LLM_BASE_URL=

# Server
PORT=8999
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Optional: Triple Store Configuration (Fuseki preferred over Neo4j)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin/abc123-qwe

NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j/test!passfortesting
```

---

## Triple Store Setup

OntoCast supports multiple triple store backends. When both Fuseki and Neo4j are configured, **Fuseki is preferred**.

- See [Triple Store Setup](docs/user_guide/triple_stores.md) for detailed Docker Compose instructions and sample `.env.example` files.
- Quick summary: copy and edit the provided `.env.example` in `docker/fuseki` or `docker/neo4j`, then run `docker compose --env-file .env <service> up -d` in the respective directory.

---

## Running OntoCast Server

```bash
uv run serve \
    --ontology-directory ONTOLOGY_DIR \
    --working-directory WORKING_DIR
```

---

## API Usage

- **POST /process**: Accepts `application/json` or file uploads (`multipart/form-data`).
- Returns: JSON with extracted facts (Turtle), ontology (Turtle), and processing metadata. Triples are also serialized to the configured triple store.

**Example:**
```bash
curl -X POST http://localhost:8999/process \
    -H "Content-Type: application/json" \
    -d '{"text": "Your document text here"}'
    
# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"

# Process a json file
curl -X POST http://url:port/process -F "file=@test2/sample.json"
```

---

## MCP Endpoints

- `GET /health`: Health check
- `GET /info`: Service info
- `POST /process`: Document processing

---

## Filesystem Mode

If no triple store is configured, OntoCast stores ontologies and facts as Turtle files in the working directory.

---

## Notes

- JSON documents must contain a `text` field, e.g.:
  ```json
  { "text": "abc" }
  ```
- `recursion_limit` is calculated as `max_visits * estimated_chunks` (default 30, or set via `.env`)
- Default port: 8999

---

## Docker

To build the OntoCast Docker image:
```sh
docker buildx build -t growgraph/ontocast:0.1.4 . 2>&1 | tee build.log
```

---

## Project Structure

```
ontocast/
├── agent/           # Agent workflow and orchestration
├── cli/             # CLI utilities and server
├── prompt/          # LLM prompt templates
├── stategraph/      # State graph logic
├── tool/            # Triple store, chunking, and ontology tools
├── toolbox.py       # Toolbox for agent tools
├── onto.py          # Ontology and RDF graph handling
├── util.py          # Utilities
```
Other directories:
- `docker/` – Docker Compose and .env.example files for triple stores
- `data/` – Example data, ontologies, and test files
- `docs/` – Documentation and user guides
- `test/` – Test suite

---

## Workflow

The extraction follows a multi-stage workflow:

<img src="https://github.com/growgraph/ontocast/blob/main/docs/assets/graph.png?raw=true" alt="Workflow diagram" width="350" style="float: right; margin-left: 20px;"/>

1. **Document Preparation**
    - [Optional] Convert to Markdown
    - Text chunking
2. **Ontology Processing**
    - Ontology selection
    - Text to ontology triples
    - Ontology critique
3. **Fact Extraction**
    - Text to facts
    - Facts critique
    - Ontology sublimation
4. **Chunk Normalization**
    - Chunk KG aggregation
    - Entity/Property Disambiguation
5. **Storage**
    - Triple/KG serialization

---

## Documentation

- Full documentation: [growgraph.github.io/ontocast](https://growgraph.github.io/ontocast)

---

## Roadmap

- [x] Add Jena Fuseki triple store for triple serialization
- [x] Add Neo4j n10s for triple serialization
- [ ] Replace triple prompting with a tool for local graph retrieval

---

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

---

## Acknowledgments

- Uses RDFlib for semantic triple management
- Uses docling for pdf/pptx conversion
- Uses OpenAI language models / open models served via Ollama for fact extraction
- Uses langchain/langgraph
