Metadata-Version: 2.4
Name: ungraph
Version: 0.1.3
Summary: Python package for Knowledge graph construction in the Graph Query Language Standard
Author-email: Alejandro Giraldo Londoño <alejandro@qnow.tech>
License: MIT
License-File: LICENSE
Keywords: knowledge-graph,langchain,neo4j,rag,unstructured-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Requires-Dist: chardet>=5.2.0
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-docling>=2.0.0
Requires-Dist: langchain-experimental>=0.4.1
Requires-Dist: langchain-huggingface>=1.2.0
Requires-Dist: langchain-neo4j>=0.6.0
Requires-Dist: langchain-text-splitters>=1.1.0
Requires-Dist: langchain>=1.2.0
Requires-Dist: langgraph>=1.0.5
Requires-Dist: markdown>=3.10
Requires-Dist: neo4j>=5.28.2
Requires-Dist: pydantic-settings>=2.12.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: sentence-transformers>=5.2.0
Requires-Dist: unstructured>=0.18.21
Provides-Extra: all
Requires-Dist: graphdatascience>=1.18; extra == 'all'
Requires-Dist: matplotlib>=3.10.8; extra == 'all'
Requires-Dist: mypy>=1.19.1; extra == 'all'
Requires-Dist: opik>=1.9.66; extra == 'all'
Requires-Dist: pip>=25.3; extra == 'all'
Requires-Dist: ruff>=0.14.10; extra == 'all'
Requires-Dist: spacy>=3.8.11; extra == 'all'
Requires-Dist: yfiles-jupyter-graphs-for-neo4j>=1.7.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: matplotlib>=3.10.8; extra == 'dev'
Requires-Dist: mypy>=1.19.1; extra == 'dev'
Requires-Dist: pip>=25.3; extra == 'dev'
Requires-Dist: ruff>=0.14.10; extra == 'dev'
Provides-Extra: experiments
Requires-Dist: opik>=1.9.66; extra == 'experiments'
Provides-Extra: gds
Requires-Dist: graphdatascience>=1.18; extra == 'gds'
Provides-Extra: infer
Requires-Dist: spacy>=3.8.11; extra == 'infer'
Provides-Extra: infer-all
Requires-Dist: spacy>=3.8.11; extra == 'infer-all'
Provides-Extra: infer-en
Requires-Dist: spacy>=3.8.11; extra == 'infer-en'
Provides-Extra: infer-es
Requires-Dist: spacy>=3.8.11; extra == 'infer-es'
Provides-Extra: ynet
Requires-Dist: yfiles-jupyter-graphs-for-neo4j>=1.7.0; extra == 'ynet'
Description-Content-Type: text/markdown

# Ungraph

<div align="center">
  <img src="https://assets.zyrosite.com/cdn-cgi/image/format=auto,w=375,h=375,fit=crop/YKbJXD52PBIp5944/logo-negro-mxBMJBrV9GTPeBBq.png" alt="Qnow Logo" width="200">
</div>

<div align="center">
  <strong>A Python framework for building Knowledge Graphs from unstructured text.</strong>
</div>

---

## What is Ungraph?

Ungraph transforms unstructured documents into structured **Lexical Graphs** stored in Neo4j, enabling advanced information retrieval and semantic search through GraphRAG patterns. Built on the **Extract-Transform-Inference (ETI)** pattern, Ungraph goes beyond traditional ETL by adding an explicit inference phase that generates traceable knowledge artifacts with PROV-O provenance.

**Note:** GraphRAG refers to retrieval patterns that represent text as graph structures. Neo4j is the graph database where Ungraph stores these graphs (other graph databases could be supported).

**Universal Knowledge Extraction**: Ungraph is designed for **any knowledge domain**, such as scientific papers, financial reports, or research literature, wherever structured knowledge extraction is needed.

Ungraph uses a **File-Page-Chunk** topology as its base graph structure, providing hierarchical document representation that preserves document structure while enabling granular semantic search. The framework implements the ETI pattern with full traceability, ensuring that every extracted fact can be traced back to its source document.


### Problems It Solves

- **Information Overload**: Converts unstructured text into queryable knowledge graphs
- **Context Loss**: Preserves document structure and relationships through hierarchical graph patterns
- **Limited Search**: Enables semantic, hybrid, and graph-enhanced search beyond keyword matching
- **Knowledge Fragmentation**: Connects related concepts across documents through entity extraction and relationships

### Project Orientation

Ungraph is designed for:

- **RAG Applications**: Enhanced retrieval for LLM-based systems using GraphRAG patterns
- **Knowledge Management**: Building searchable knowledge bases from document collections
- **Research & Analysis**: Extracting and connecting entities, facts, and relationships from text
- **Production Systems**: Clean architecture with comprehensive testing and error handling

### Cross-Domain Applicability

Ungraph is **domain-agnostic** and designed to extract knowledge from any field: sciences, finance, quantum computing, machine learning, biomedical research, legal documents, or any other knowledge domain. The framework's **inference phase** enables domain-specific knowledge discovery through NER for general entities or LLM-based extraction for domain-specific relationships.

## Installation

### Requirements

- Python 3.12+
- Neo4j 5.x+ (running and accessible)

### Basic Installation

```bash
pip install ungraph
```

### Optional Add-ons

```bash
# Quote the extras so shells such as zsh do not expand the brackets

# Entity extraction and inference (spaCy NER)
pip install "ungraph[infer]"
python -m spacy download en_core_web_sm  # or es_core_news_sm for Spanish

# Advanced search patterns (Neo4j GDS)
pip install "ungraph[gds]"

# Graph visualization in Jupyter
pip install "ungraph[ynet]"

# Development tools
pip install "ungraph[dev]"

# All extensions
pip install "ungraph[all]"
```

### Neo4j Setup

**Docker (recommended):**

```bash
docker run -d --name neo4j -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your_password neo4j:latest
```

**Or download:** [Neo4j Desktop](https://neo4j.com/download/) | [Community Edition](https://neo4j.com/download-center/#community)

### Configuration

```python
import ungraph

ungraph.configure(
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="your_password",
    neo4j_database="neo4j"
)
```

**Or use environment variables:**

```bash
export UNGRAPH_NEO4J_URI="bolt://localhost:7687"
export UNGRAPH_NEO4J_USER="neo4j"
export UNGRAPH_NEO4J_PASSWORD="your_password"
export UNGRAPH_NEO4J_DATABASE="neo4j"
```
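
The following is a minimal sketch of reading these variables in plain Python, for example to check them before starting a job. `Neo4jSettings` and `load_settings` are hypothetical names, not part of Ungraph's API, and the fallback values are assumptions that mirror the examples above; how Ungraph itself resolves the variables is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Neo4jSettings:
    """Hypothetical container for the four connection values."""
    uri: str
    user: str
    password: str
    database: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Neo4jSettings:
    """Read the UNGRAPH_NEO4J_* variables and fail early if no password is set."""
    env = os.environ if env is None else env
    password = env.get("UNGRAPH_NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("UNGRAPH_NEO4J_PASSWORD is not set")
    return Neo4jSettings(
        # Fallbacks are assumptions taken from the examples in this README.
        uri=env.get("UNGRAPH_NEO4J_URI", "bolt://localhost:7687"),
        user=env.get("UNGRAPH_NEO4J_USER", "neo4j"),
        password=password,
        database=env.get("UNGRAPH_NEO4J_DATABASE", "neo4j"),
    )


if __name__ == "__main__":
    demo_env = {"UNGRAPH_NEO4J_PASSWORD": "your_password"}
    print(load_settings(demo_env))
```

The resulting values can be passed to `ungraph.configure(...)` using the keyword names shown in the Configuration example.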

## Core Functions: The ETI Pattern

Ungraph implements the **Extract-Transform-Inference (ETI)** pattern, an evolution of traditional ETL that adds an explicit inference phase to generate traceable knowledge artifacts. The goal is to turn information into a queryable form that goes beyond raw data: LLMs, neuro-symbolic reasoning systems, and other inference mechanisms extract knowledge and discover new relationships within any knowledge domain.

The ETI pattern is designed for building **traceable knowledge graphs** with PROV-O provenance, making it suitable for any domain requiring reliable knowledge extraction: sciences, finance, quantum computing, machine learning, and beyond.

### 1. Extract

Extract text from documents and split into semantically meaningful chunks.

```python
import ungraph

# Extract and chunk a document
chunks = ungraph.ingest_document("document.pdf")
print(f"Extracted {len(chunks)} chunks")

# Get intelligent chunking recommendations
recommendation = ungraph.suggest_chunking_strategy("document.md")
print(f"Strategy: {recommendation.strategy}")
print(f"Chunk size: {recommendation.chunk_size}")
```

**Supported formats:** Markdown, TXT, Word, PDF

**Features:** Automatic encoding detection, intelligent chunking, text cleaning
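
To make the idea of chunking concrete, here is a minimal sketch of the simplest baseline: fixed-size character windows with overlap. It is not Ungraph's chunking logic, which is described above as semantic and strategy-driven; `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary is still partly visible in the next chunk.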

### 2. Transform

Transform extracted chunks into a structured graph with embeddings and relationships.

```python
import ungraph

# Transform document into graph (automatic with ingest_document)
chunks = ungraph.ingest_document("document.md")

# The graph structure is automatically created:
# File → Page → Chunk (with NEXT_CHUNK relationships)
# Each chunk has vector embeddings for semantic search
```

**Graph Pattern:**

```
File -[:CONTAINS]-> Page -[:HAS_CHUNK]-> Chunk
Chunk -[:NEXT_CHUNK]-> Chunk
```
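
The pattern above can be sketched in plain Python as a list of `(source, relationship, target)` triples. The identifier scheme and the `build_lexical_graph` helper are hypothetical, and the sketch assumes that `NEXT_CHUNK` links continue across page boundaries, which the pattern itself does not specify.

```python
def build_lexical_graph(
    file_name: str, pages: list[list[str]]
) -> list[tuple[str, str, str]]:
    """Return File-Page-Chunk edges for one document.

    `pages` holds one list of chunk texts per page.
    """
    edges = []
    previous_chunk = None
    for page_number, chunks in enumerate(pages, start=1):
        page_id = f"{file_name}#page{page_number}"  # hypothetical id scheme
        edges.append((file_name, "CONTAINS", page_id))
        for chunk_number, _text in enumerate(chunks, start=1):
            chunk_id = f"{page_id}/chunk{chunk_number}"
            edges.append((page_id, "HAS_CHUNK", chunk_id))
            if previous_chunk is not None:
                # Assumption: reading order is preserved across pages.
                edges.append((previous_chunk, "NEXT_CHUNK", chunk_id))
            previous_chunk = chunk_id
    return edges


if __name__ == "__main__":
    for edge in build_lexical_graph("doc.md", [["intro", "body"], ["summary"]]):
        print(edge)
```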

**Features:** Vector embeddings (HuggingFace), configurable graph patterns, automatic indexing


### Graph Topology Constructor

Ungraph provides services that enable the construction of any graph pattern topology. These services allow you to define custom graph structures beyond the base File-Page-Chunk pattern, creating domain-specific knowledge graph topologies while maintaining traceability and provenance.


### 3. Infer

The **Inference** phase distinguishes ETI from traditional ETL. It generates normalized facts, relations, and explanations with confidence scores and PROV-O traceability using inference models (NER, LLM, or neuro-symbolic systems).

**Key Capabilities:**
- **Entity Extraction**: Named Entity Recognition (NER) for general entities
- **Relation Extraction**: Identify relationships between entities
- **Fact Generation**: Create structured facts (subject-predicate-object triplets) with confidence scores
- **Provenance Tracking**: Every fact is traceable to its source via PROV-O `wasDerivedFrom` relationships

**Inference Modes:**

- **NER** (default): Fast, production-ready entity extraction with spaCy. Generates simple facts like `(chunk_id, "MENTIONS", entity_name)` and co-occurrence relationships.
- **LLM** (experimental): Domain-specific extraction using language models for complex relationship extraction and entity normalization
- **Hybrid** (planned): Intended to combine the speed of NER with the accuracy of LLM-based extraction

**Traceability:** All inferred facts include provenance metadata, allowing you to trace any fact back to its source document, page, and chunk.
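
As an illustration of what such facts might look like, the sketch below builds `MENTIONS` and co-occurrence triples from entity names that have already been extracted for a chunk. The `Fact` class, the `CO_OCCURS_WITH` label, and the fixed confidence value are assumptions made for this example; they are not Ungraph's data model.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Fact:
    """Hypothetical subject-predicate-object triple with provenance."""
    subject: str
    predicate: str
    obj: str
    confidence: float
    was_derived_from: str  # id of the source chunk, as in PROV-O wasDerivedFrom


def facts_from_entities(
    chunk_id: str, entities: list[str], confidence: float = 1.0
) -> list[Fact]:
    """Create MENTIONS facts plus pairwise co-occurrence facts for one chunk."""
    unique = sorted(set(entities))  # drop repeated mentions, keep a stable order
    facts = [
        Fact(chunk_id, "MENTIONS", name, confidence, chunk_id) for name in unique
    ]
    for left, right in combinations(unique, 2):
        # "CO_OCCURS_WITH" is an assumed relationship name.
        facts.append(Fact(left, "CO_OCCURS_WITH", right, confidence, chunk_id))
    return facts


if __name__ == "__main__":
    for fact in facts_from_entities("chunk-1", ["Neo4j", "Apple Inc.", "Neo4j"]):
        print(fact)
```

Because every fact carries the id of the chunk it came from, following the File-Page-Chunk edges from that chunk leads back to the page and the source file.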

## Example Usage

```python
import ungraph

# Configure and ingest a document
ungraph.configure(neo4j_uri="bolt://localhost:7687", neo4j_password="your_password")
chunks = ungraph.ingest_document("document.pdf", extract_entities=True)

# Search by entity
results = ungraph.search_by_entity("Apple Inc.", limit=5)
for result in results:
    print(f"Content: {result.content}")
    print(f"Entities: {[e.name for e in result.entities]}")
```

## Search Capabilities

### Basic Search

```python
# Text search
results = ungraph.search("quantum computing", limit=5)

# Vector search (semantic similarity)
results = ungraph.vector_search("machine learning", limit=5)

# Hybrid search (text + vector)
results = ungraph.hybrid_search(
    "artificial intelligence",
    limit=10,
    weights=(0.4, 0.6)  # text_weight, vector_weight
)
```
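
One common way to combine text and vector results is a weighted sum of per-result scores. The sketch below shows that idea with the same `(text_weight, vector_weight)` ordering as the `weights` argument above. It assumes both score sets are already normalized to the range 0 to 1, and it is not necessarily the formula Ungraph uses; `hybrid_scores` is a hypothetical helper.

```python
def hybrid_scores(
    text_scores: dict[str, float],
    vector_scores: dict[str, float],
    weights: tuple[float, float] = (0.4, 0.6),
) -> list[tuple[str, float]]:
    """Rank result ids by a weighted sum of text and vector scores."""
    text_weight, vector_weight = weights
    combined = {}
    for result_id in set(text_scores) | set(vector_scores):
        # A result missing from one ranking contributes 0 for that component.
        combined[result_id] = (
            text_weight * text_scores.get(result_id, 0.0)
            + vector_weight * vector_scores.get(result_id, 0.0)
        )
    # Highest score first; ties are broken by id so the order is deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ranked = hybrid_scores({"a": 1.0, "b": 0.5}, {"b": 1.0, "c": 0.5})
    for result_id, score in ranked:
        print(result_id, round(score, 3))
```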

### GraphRAG Patterns

```python
# Basic Retriever: Direct vector search
results = ungraph.search_with_pattern(
    "neural networks",
    pattern_type="basic",
    limit=5
)

# Parent-Child Retriever: Small chunks + full context
results = ungraph.search_with_pattern(
    "quantum entanglement",
    pattern_type="parent_child",
    limit=3
)

# Graph-Enhanced Search (requires ungraph[gds])
results = ungraph.search_with_pattern(
    "machine learning",
    pattern_type="graph_enhanced",
    limit=5,
    max_traversal_depth=2
)
```
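
The parent-child idea (match on small chunks, return the larger parent as context) can be sketched without a database. In the example below, keyword overlap stands in for vector similarity, and `parent_child_search` and the dictionary layout are hypothetical; the sketch illustrates the pattern only, not Ungraph's retriever.

```python
def parent_child_search(
    query: str,
    chunks: list[dict[str, str]],
    pages: dict[str, str],
    limit: int = 3,
) -> list[dict[str, str]]:
    """Score small chunks, then return each best chunk with its parent page."""
    terms = set(query.lower().split())
    scored = []
    for chunk in chunks:
        overlap = len(terms & set(chunk["text"].lower().split()))
        if overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    results = []
    seen_pages = set()
    for _score, chunk in scored:
        if len(results) >= limit:
            break
        page_id = chunk["page_id"]
        if page_id in seen_pages:
            continue  # return each parent only once
        seen_pages.add(page_id)
        results.append({"matched_chunk": chunk["text"], "context": pages[page_id]})
    return results


if __name__ == "__main__":
    demo_chunks = [
        {"page_id": "p1", "text": "quantum entanglement links particles"},
        {"page_id": "p1", "text": "entanglement is fragile"},
        {"page_id": "p2", "text": "classical bits"},
    ]
    demo_pages = {"p1": "Full text of page one", "p2": "Full text of page two"}
    print(parent_child_search("quantum entanglement", demo_chunks, demo_pages))
```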

## Research Foundation

Ungraph implements the **Extract-Transform-Inference (ETI)** pattern, a research-driven approach to building traceable knowledge graphs. The ETI pattern is formally defined as a pipeline P = (E, T, I, O, M) where:

- **E (Extractors)**: Extract structured documents with metadata from various sources
- **T (Transformers)**: Transform documents into chunks with embeddings and semantic annotations
- **I (Inference)**: Generate facts, relations, and explanations with confidence and traceability
- **O (Ontology)**: Formal schema defining entity types, relationships, and mappings to standard vocabularies (schema.org, PROV-O)
- **M (Metadata)**: PROV-O provenance structure tracking derivation chains
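
As a rough illustration of how the five components fit together, the sketch below models the tuple as a small Python class. `ETIPipeline`, its field types, and the provenance record layout are assumptions made for this example and do not reflect Ungraph's internal classes.

```python
from dataclasses import dataclass, field
from typing import Callable

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class ETIPipeline:
    """Hypothetical container for P = (E, T, I, O, M)."""
    extract: Callable[[str], list[str]]          # E: source -> documents
    transform: Callable[[list[str]], list[str]]  # T: documents -> chunks
    infer: Callable[[str], list[Triple]]         # I: chunk -> candidate facts
    ontology: set[str]                           # O: allowed predicates
    provenance: list[dict] = field(default_factory=list)  # M: derivation records

    def run(self, source: str) -> list[Triple]:
        facts = []
        chunks = self.transform(self.extract(source))
        for index, chunk in enumerate(chunks):
            for fact in self.infer(chunk):
                if fact[1] not in self.ontology:
                    continue  # drop facts whose predicate is outside the schema
                facts.append(fact)
                self.provenance.append(
                    {"fact": fact, "wasDerivedFrom": f"{source}#chunk{index}"}
                )
        return facts


if __name__ == "__main__":
    pipeline = ETIPipeline(
        extract=lambda source: ["Neo4j stores graphs. Ungraph builds graphs."],
        transform=lambda docs: [
            s.strip() for d in docs for s in d.split(".") if s.strip()
        ],
        infer=lambda chunk: [(chunk.split()[0], "MENTIONS", chunk.split()[-1])],
        ontology={"MENTIONS"},
    )
    print(pipeline.run("doc.md"))
    print(pipeline.provenance)
```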

### ETI vs Traditional ETL

| Aspect | ETL (Traditional) | ETI (Ungraph) |
|--------|-------------------|---------------|
| **Phases** | Extract, Transform, Load | Extract, Transform, **Infer** |
| **Objective** | Prepare data for storage | Build traceable knowledge |
| **Artifacts** | Structured data | Facts, relations, explanations |
| **Traceability** | Limited (basic metadata) | Complete (PROV-O provenance) |
| **Validation** | Format verification | Knowledge validation |
| **Reasoning** | Not included | Explicit inference phase |
| **Domain Focus** | Data preparation | Knowledge construction |

**Key Research Contributions:**
- **Evolution from ETL to ETI**: Adding explicit inference phase for knowledge construction, not just data transformation
- **PROV-O traceability**: Every fact is traceable to its source through provenance chains, enabling validation and trust
- **Domain-agnostic design**: Validated across finance, biomedical, scientific, and general domains
- **GraphRAG patterns**: Implements retrieval patterns from the GraphRAG literature (Peng et al., 2024) for expressing text in graph structures
- **Neuro-symbolic computing**: Combines statistical models (LLMs) with symbolic reasoning for explainable inferences

For detailed research methodology, experimental design, and validation results, see the [research article](article/ungraph.md) (in preparation).

## Architecture

Ungraph follows **Clean Architecture** principles:

```
src/
├── domain/          # Entities, Value Objects, Interfaces
├── application/     # Use cases
├── infrastructure/  # Neo4j, LangChain implementations
└── utils/           # Legacy code (being migrated)
```

**Key Features:**

- **Domain-driven design**: Clear separation of concerns with domain, application, and infrastructure layers
- **Configurable graph patterns**: Build custom graph topologies through graph construction services
- **Production-ready**: Comprehensive testing, error handling, and clean architecture
- **Modular design**: Optional dependencies for inference (`ungraph[infer]`), graph algorithms (`ungraph[gds]`), and visualization (`ungraph[ynet]`)
- **ETI pattern implementation**: Full Extract-Transform-Inference pipeline with PROV-O traceability

## Documentation

- [Complete Documentation](docs/README.md)
- [Quick Start Guide](docs/guides/quickstart.md)
- [GraphRAG Search Patterns](docs/api/search-patterns.md)
- [Advanced Search Patterns](docs/api/advanced-search-patterns.md)
- [Graph Patterns](docs/concepts/graph-patterns.md)

## Contributing

Contributions are welcome! Please see our contributing guidelines for code style, testing requirements, and pull request process.

## License

MIT License

## Author

Alejandro Giraldo Londoño - alejandro@qnow.tech

<div align="center">
  <small>
    Developed by <a href="https://www.linkedin.com/company/qnow-tech" target="_blank">Qnow</a>
  </small>
</div>

## Citation

If you use Ungraph in your research, please cite:

```bibtex
@software{ungraph2026,
  author = {Giraldo Londoño, Alejandro},
  title = {Ungraph: Knowledge Graph Construction with GraphRAG Patterns},
  year = {2026},
  note = {In preparation},
  url = {https://github.com/Alejandro-qnow}
}

@article{giraldo2026eti,
  author = {Giraldo Londoño, Alejandro},
  title = {Extract-Transform-Inference: A Pattern for Building Traceable Knowledge Graphs in GraphRAG Systems},
  journal = {arXiv preprint},
  year = {2026},
  note = {In preparation}
}
```
