Metadata-Version: 2.4
Name: raglineage
Version: 0.2.8
Summary: Lineage-aware RAG engine for auditable, reproducible, versioned retrieval and answers
Author-email: Pranav Motarwar <pranav.motarwar@example.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/PranavMotarwar/raglineage
Project-URL: Documentation, https://github.com/PranavMotarwar/raglineage
Project-URL: Repository, https://github.com/PranavMotarwar/raglineage
Project-URL: Issues, https://github.com/PranavMotarwar/raglineage/issues
Keywords: rag,lineage,provenance,vector-search,nlp,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: networkx>=3.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Dynamic: license-file

# raglineage

**Lineage-aware RAG engine for auditable, reproducible, versioned retrieval and answers**

[![PyPI version](https://badge.fury.io/py/raglineage.svg)](https://badge.fury.io/py/raglineage)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
[![PyPI Downloads](https://static.pepy.tech/personalized-badge/raglineage?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/raglineage)

> **v0.2.8**: `rag.retrieve()` for raw chunks + lineage (use with your own LLM), `raglineage --version`

## The Unique Idea

Most RAG tools store text chunks and embeddings. They lose provenance and cannot explain answer drift.

**raglineage** treats RAG as a data lineage and provenance problem, not just vector search. Every retrievable unit is a **Lineage Node (LN)** with:

- Immutable ID and dataset version
- Precise source reference (file path, page, row, URL)
- Full transform chain (ordered list of transforms applied)
- Content hash for integrity
- Timestamps for auditing

The system maintains a **Lineage Graph (DAG)** linking nodes through structural and semantic relationships, enabling:

- Dataset versioning and diffing
- Incremental rebuilds (only recompute what changed)
- Answer auditing (reconstruct provenance of any answer)
- Version consistency checks
- Staleness detection

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Data Sources                            │
│  (PDFs, CSVs, JSON, APIs, Text Files)                       │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Ingestion Layer                           │
│  AutoIngestor → FileIngestor → TabularIngestor              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                   Transform Layer                            │
│  Chunkers → Dedupe → Normalize                              │
│  (Each transform recorded in transform_chain)               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Lineage Node Creation                           │
│  ln_id, source, transform_chain, content_hash, version      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Lineage Graph (DAG)                             │
│  networkx DAG: nodes=LN, edges=relationships                │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Embedding + Vector Store                        │
│  Embeddings → FAISS Store → LN ID Mapping                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Retrieval + Audit                               │
│  Query → Top-K → Graph Walk → Answer + Lineage             │
│  Audit → Version Check → Staleness → Risk Flags             │
└─────────────────────────────────────────────────────────────┘
```

## Lineage Node Example

Every retrievable chunk is a Lineage Node with complete provenance:

```json
{
  "ln_id": "ln_92af",
  "content": "Revenue declined due to supply constraints",
  "source": {
    "type": "pdf",
    "uri": "data/10Q_Q3_2023.pdf",
    "page": 14,
    "section": "Management Discussion"
  },
  "dataset_version": "v3.1",
  "transform_chain": [
    "pdf_parse",
    "section_split",
    "semantic_chunk",
    "deduplicate"
  ],
  "content_hash": "sha256:a3f5b8c9d2e1f4a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0",
  "created_at": "2026-01-20T00:00:00Z"
}
```

## Audited Answer Example

Every answer includes full lineage and audit metadata:

```json
{
  "question": "Why did revenue fall in Q3?",
  "answer": "Revenue declined primarily due to supply constraints affecting shipments.",
  "lineage": [
    {
      "ln_id": "ln_92af",
      "score": 0.91,
      "source": {
        "uri": "data/10Q_Q3_2023.pdf",
        "page": 14
      },
      "dataset_version": "v3.1",
      "transform_chain": ["pdf_parse","section_split","semantic_chunk","deduplicate"]
    }
  ],
  "audit": {
    "staleness_check": "pass",
    "version_consistency": "single_version",
    "transform_risk_flags": []
  }
}
```

## Quickstart

### Installation

```bash
pip install raglineage
```

### Basic Usage

```python
from raglineage import RagLineage

rag = RagLineage(
    source="examples/data",
    store_backend="faiss",
    embed_backend="local"
)

# Build initial version
rag.build(version="v1.0")

# Query with lineage
ans = rag.query("What is the refund policy?", k=5)
print(ans.model_dump_json(indent=2))

# Audit the answer
report = rag.audit(ans)
print(report.model_dump_json(indent=2))

# Get dataset statistics
stats = rag.stats()
print(f"Nodes: {stats.node_count}, Version: {stats.current_version}")

# Batch query multiple questions
answers = rag.batch_query(["Question 1?", "Question 2?"], k=3)

# Export answer to Markdown for reports
print(ans.to_markdown())

# Retrieve raw chunks only (use with your own LLM)
hits = rag.retrieve("What is the refund policy?", k=5)
context = "\n\n".join(h.content for h in hits)
# Pass context to OpenAI, Claude, etc.
```

### Examples

Check out the examples directory:
- `examples/example/example.py` - Full walkthrough of all features
- `examples/basic_usage/example.py` - Quick start example

### CLI Usage

```bash
# Initialize a project
raglineage init ./my_project

# Build from source
raglineage build --source ./data --version v1.0

# Update incrementally
raglineage update --source ./data --version v1.1 --changed-only

# Query (table or JSON output)
raglineage query "What is the refund policy?" --k 5
raglineage query "What is the refund policy?" --output json  # For piping, CI/CD

# Show dataset statistics
raglineage stats --source ./data

# Diff versions
raglineage diff v1.0 v1.1
```

## Comparison with Other RAG Tools

| Feature | raglineage | LangChain | LlamaIndex |
|---------|-----------|-----------|------------|
| **Lineage Tracking** | First-class | Not built-in | Not built-in |
| **Dataset Versioning** | Native | Manual | Manual |
| **Incremental Updates** | Automatic | Full rebuild | Full rebuild |
| **Answer Auditing** | Built-in | Manual | Manual |
| **Transform Chain Tracking** | Every LN | Not tracked | Not tracked |
| **Version Diffing** | Structured | Not available | Not available |
| **Graph Relationships** | DAG-based | Optional | Optional |
| **Source Provenance** | Complete | Basic | Basic |

**Key Difference**: raglineage treats lineage as a core requirement, not an afterthought. Every operation preserves and tracks provenance.

## Core Concepts

### Lineage Nodes (LN)

A Lineage Node is the atomic unit of retrieval. Each LN has:
- **ln_id**: Stable, deterministic identifier
- **content**: The actual text content
- **source**: Precise reference to origin (file, page, row, etc.)
- **dataset_version**: Version tag for the dataset
- **transform_chain**: Ordered list of transforms applied
- **content_hash**: SHA-256 hash for integrity
- **timestamps**: Created/updated timestamps

### Lineage Graph

A directed acyclic graph (DAG) where:
- **Nodes**: Lineage Node IDs
- **Edges**: Typed relationships (adjacent, semantic, references, same_entity, etc.)

Enables graph-walk retrieval and relationship exploration.

### Dataset Versioning

Each dataset build produces a versioned manifest:
- Tracks all source files and their hashes
- Enables diffing between versions
- Supports incremental updates (only recompute changed files)

### Answer Auditing

Every answer includes:
- **Lineage**: List of LNs used with scores and metadata
- **Audit Report**: 
  - Version consistency check
  - Staleness detection
  - Transform risk flags

## Complete Feature Set

### 1. Data Ingestion
- **Auto-detection**: Automatically detects file types and routes to appropriate ingestor
- **File Ingestion**: Text files (.txt, .md, .rst)
- **Tabular Ingestion**: CSV, JSON files (row-by-row processing)
- **Extensible**: Easy to add custom ingestors for PDFs, APIs, databases

### 2. Transform Pipeline
- **Chunking Strategies**:
  - Simple chunking (character-based with overlap)
  - Semantic chunking (sentence-aware with overlap)
- **Deduplication**: Content hash-based duplicate detection
- **Normalization**: Text cleanup and normalization (with aggressive mode)
- **Transform Chain Tracking**: Every transform is recorded in the lineage

### 3. Embedding Backends
- **Local Embeddings**: sentence-transformers (default: all-MiniLM-L6-v2)
- **OpenAI Embeddings**: Optional OpenAI API integration
- **Extensible**: Easy to add custom embedding backends

### 4. Vector Storage
- **FAISS Store**: Efficient similarity search with L2 distance
- **LN ID Mapping**: Bidirectional mapping between vector indices and Lineage Node IDs
- **Persistence**: Stores index and mappings to disk
- **Incremental Updates**: Add/update vectors without full rebuild

### 5. Retrieval
- **Top-K Retrieval**: Standard vector similarity search
- **Graph-Walk Expansion**: Expand results using graph relationships
- **Filtering**: Filter by version, source URI, source type, minimum score
- **Configurable Depth**: Control graph walk depth for expansion

### 6. Lineage Graph
- **DAG Structure**: NetworkX-based directed acyclic graph
- **Relationship Types**: adjacent, semantic, references, same_entity, derived, parent_child
- **Graph Operations**:
  - Add nodes and edges
  - Get neighbors at specified depth
  - Export/import as JSON
  - Query node information

### 7. Dataset Versioning
- **Version Manifests**: Track all versions with file lists and hashes
- **File Tracking**: SHA-256 hashes for change detection
- **Version Comparison**: Diff between any two versions
- **Current Version**: Track active version

### 8. Incremental Updates
- **Change Detection**: Automatically detects changed files using hashes
- **Selective Processing**: Only processes added/modified files
- **Efficient Rebuilds**: Recomputes embeddings only for changed content
- **Graph Updates**: Incrementally updates graph relationships

### 9. Answer Auditing
- **Staleness Detection**: Checks if answer uses outdated data
- **Version Consistency**: Verifies all sources are from same version
- **Transform Risk Analysis**: Flags risky transforms (OCR, aggressive normalization, etc.)
- **Complete Audit Reports**: JSON-serializable audit metadata

### 10. CLI Interface
- **raglineage init**: Initialize a new project
- **raglineage build**: Build database from source
- **raglineage update**: Incrementally update database
- **raglineage query**: Query with lineage output (supports `--output json` for CI/piping)
- **raglineage stats**: Show dataset statistics (node count, versions, build status)
- **raglineage diff**: Compare dataset versions

### 11. Export and Integration
- **JSON Export**: Export lineage graph, answers, audit reports as JSON
- **Markdown Export**: `answer.to_markdown()` for reports and sharing
- **Retrieve only**: `rag.retrieve(question, k)` returns raw chunks + lineage for your own LLM (no built-in answer)
- **Batch Query**: `rag.batch_query(questions)` for processing multiple questions
- **Dataset Stats**: `rag.stats()` for node count, versions, and build status
- **CLI version**: `raglineage --version` to check installed version
- **Python API**: Full programmatic access to all features
- **Type Hints**: Complete type annotations for IDE support
- **Pydantic Models**: All data structures are Pydantic models for validation

## Requirements

- Python ≥ 3.10
- Strict type hints throughout
- Pydantic models for schemas
- NetworkX for graph operations
- FAISS for vector storage
- Sentence-transformers for local embeddings

## Development

```bash
# Clone repository
git clone https://github.com/PranavMotarwar/raglineage.git
cd raglineage
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

Apache-2.0 License. See [LICENSE](LICENSE) for details.

## Author

Pranav Motarwar - [GitHub](https://github.com/PranavMotarwar)

---

**raglineage** - Where every answer has a traceable origin.
