Metadata-Version: 2.4
Name: chunkhive
Version: 0.1.8
Summary: Hierarchical, semantic code chunking for AI systems
Author-email: ChunkHive <contact@chunkhive.ai>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/AgentAhmed/ChunkHive
Project-URL: Documentation, https://github.com/AgentAhmed/ChunkHive#readme
Project-URL: Issues, https://github.com/AgentAhmed/ChunkHive/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tree-sitter>=0.25.2
Requires-Dist: tree-sitter-python>=0.25.0
Requires-Dist: toml>=0.10.2
Requires-Dist: typer>=0.21.1
Requires-Dist: rich>=14.2.0
Requires-Dist: PyYAML>=6.0.3
Provides-Extra: javascript
Requires-Dist: tree-sitter-javascript>=0.25.0; extra == "javascript"
Provides-Extra: typescript
Requires-Dist: tree-sitter-typescript>=0.23.0; extra == "typescript"
Provides-Extra: cpp
Requires-Dist: tree-sitter-cpp>=0.23.0; extra == "cpp"
Provides-Extra: java
Requires-Dist: tree-sitter-java>=0.23.0; extra == "java"
Provides-Extra: csharp
Requires-Dist: tree-sitter-c-sharp>=0.23.0; extra == "csharp"
Provides-Extra: all
Requires-Dist: tree-sitter-javascript>=0.25.0; extra == "all"
Requires-Dist: tree-sitter-typescript>=0.23.0; extra == "all"
Requires-Dist: tree-sitter-cpp>=0.23.0; extra == "all"
Requires-Dist: tree-sitter-java>=0.23.0; extra == "all"
Requires-Dist: tree-sitter-c-sharp>=0.23.0; extra == "all"
Dynamic: license-file

# ChunkHive

**Semantic, hierarchical code chunking for AI systems**

ChunkHive is a production-grade code chunking engine designed for modern AI workflows such as
code embeddings, retrieval-augmented generation (RAG), agentic systems, and dataset synthesis.

It converts raw repositories into **clean, structured, semantically accurate chunks**
with byte-level precision and preserved hierarchy.

---

## 🚀 Why ChunkHive?

Modern AI systems need **more than naive text splitting**.

ChunkHive provides:
- AST-first semantic correctness
- Hierarchical structure awareness
- Byte-accurate spans
- Robust parsing across real-world repositories

---

## 🧠 Core Principle

> **AST is the Authority, Tree-sitter is Enrichment**

- **Primary source of truth**: Language AST (semantic accuracy)
- **Fallback & enrichment**: Tree-sitter (structural robustness)
- **Result**: Maximum parsing success across diverse codebases
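
The principle above can be sketched in a few lines of Python. This is a minimal illustration, not ChunkHive's implementation: `extract_symbols` is a hypothetical helper, and the regex scan is only a stand-in for the Tree-sitter fallback, used here so the sketch runs with the standard library alone.

```python
import ast
import re

# Hypothetical stand-in for a structural parser such as Tree-sitter:
# a regex scan that still finds definitions when the file does not parse.
_DEF_RE = re.compile(r"^\s*(?:async\s+def|def|class)\s+([A-Za-z_]\w*)", re.MULTILINE)


def extract_symbols(source: str) -> dict:
    """Return symbol names plus the name of the parser that produced them."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback path: no AST is available, so use the structural scan.
        return {"parser": "fallback", "symbols": _DEF_RE.findall(source)}

    symbols = [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    return {"parser": "ast", "symbols": symbols}


good = "class A:\n    def f(self):\n        return 1\n"
broken = "def g(:\n    pass\nclass B:\n    pass\n"

print(extract_symbols(good))    # parsed by the AST path
print(extract_symbols(broken))  # syntax error, so the fallback path runs
```

The point of the design is that a file with a syntax error still yields symbols instead of being dropped, while well-formed files get the more accurate AST result.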

---

## ✨ Features

- Semantic AST-first chunking (no filename-based chunks)
- Preserves hierarchy: Module → Class → Method / Function
- Accurate parent–child relationships
- Byte-level precision (`start_byte`, `end_byte`)
- Clean symbol naming (`ast.name`)
- Import & decorator capture
- Robust handling of edge cases (empty files, `__init__.py`)
- Supports documentation + code chunking flows
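
To make the hierarchy and byte-span features concrete, here is a rough sketch built on Python's standard `ast` module. The function name `chunk_spans` and the returned dictionary layout are assumptions for illustration; ChunkHive's internals may differ. The sample source contains a non-ASCII character so that byte offsets and character offsets diverge.

```python
import ast

DEF_NODES = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_spans(source: str) -> list:
    """Collect classes and functions with their parent name and byte span."""
    # Byte offset at which each line starts (line_starts[0] belongs to line 1).
    line_starts = [0]
    for raw_line in source.encode("utf-8").splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(raw_line))

    chunks = []

    def visit(node, parent_name, parent_is_class):
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, DEF_NODES):
                # Keep the current parent while descending through other nodes.
                visit(child, parent_name, parent_is_class)
                continue
            if isinstance(child, ast.ClassDef):
                chunk_type = "class"
            else:
                chunk_type = "method" if parent_is_class else "function"
            chunks.append({
                "name": child.name,
                "parent": parent_name,
                "chunk_type": chunk_type,
                # ast column offsets are UTF-8 byte offsets within the line.
                "start_byte": line_starts[child.lineno - 1] + child.col_offset,
                "end_byte": line_starts[child.end_lineno - 1] + child.end_col_offset,
            })
            visit(child, child.name, isinstance(child, ast.ClassDef))

    visit(ast.parse(source), None, False)
    return chunks


source = '# caf\u00e9\nclass Greeter:\n    def hello(self):\n        return "hi"\n'
data = source.encode("utf-8")
for chunk in chunk_spans(source):
    # Slicing the raw bytes with the span recovers the exact definition text.
    snippet = data[chunk["start_byte"]:chunk["end_byte"]].decode("utf-8")
    print(chunk["chunk_type"], chunk["name"], chunk["parent"], repr(snippet.splitlines()[0]))
```

Slicing the encoded bytes rather than the string is what keeps the spans valid for files that contain multi-byte characters.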

---

## 🔄 Multi-Language Support

### Currently Supported:
- **Python**: Full AST parsing with decorators, imports, docstrings
- **Markdown/RST**: Documentation chunking with code block detection
- **Configuration files**: JSON, YAML, TOML, INI, Dockerfiles
- **Text files**: README, LICENSE, requirements.txt, scripts

### 🔄 Coming Soon:
- JavaScript/TypeScript
- C++/Java/Go

## 🗂 Supported Chunk Types

- `module`
- `class`
- `method`
- `function`
- `documentation`
- `configuration` (JSON, YAML, TOML)
- `text`
- `imports`

## 🏢 Production Features
- **Deterministic IDs**: Same code → same chunk ID across runs
- **Progress indicators**: Real-time processing feedback
- **Error resilience**: Graceful handling of malformed code
- **Statistics generation**: Detailed analytics and metrics
- **Batch processing**: Process multiple repositories from a config file
- **Permission handling**: Intelligent output path resolution
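
One way to obtain deterministic IDs is to hash the content that defines a chunk. The sketch below is an assumption about how such an ID could be derived, not a description of ChunkHive's actual scheme: the `chunk_id` helper, the choice of SHA-256, and the hashed fields are all illustrative. Only the `primary_` prefix followed by eight hex characters mirrors the sample ID in the output schema.

```python
import hashlib


def chunk_id(file_path: str, start_byte: int, end_byte: int, code: str,
             prefix: str = "primary") -> str:
    """Derive a stable ID from a chunk's location and content (hypothetical scheme)."""
    # A NUL separator keeps adjacent fields from running into each other.
    payload = "\x00".join([file_path, str(start_byte), str(end_byte), code])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:8]}"


first = chunk_id("src/example.py", 123, 456, "def my_function():\n    pass\n")
second = chunk_id("src/example.py", 123, 456, "def my_function():\n    pass\n")
changed = chunk_id("src/example.py", 123, 456, "def my_function():\n    return 1\n")

print(first)
print(first == second)   # identical input gives an identical ID
print(first == changed)  # edited code gives a different ID
```

Because nothing in the payload depends on time, process state, or iteration order, re-running the chunker on unchanged code reproduces the same IDs.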



## 📦 Installation

    pip install chunkhive

## Quick Start

### Basic usage (creates in current directory)

    chunkhive chunk local ./my_project

### With output directory

    chunkhive chunk local ./my_project -o ./output

### With custom name and statistics

    chunkhive chunk local ./my_project --name my_dataset --stats
    chunkhive chunk repo https://github.com/user/repo --name my_dataset --stats

### Clone and chunk any GitHub repository

    chunkhive chunk repo https://github.com/user/repo

### With filtering and limits

    chunkhive chunk repo https://github.com/langchain-ai/langchain \
      --extensions .py,.md \
      --max-files 100 \
      --name langchain_chunks

### Single File Processing

    chunkhive chunk file example.py
    chunkhive chunk file example.py -o ./chunks.jsonl --stats

## Repository Analysis

### Analyze repository metadata

    chunkhive analyze https://github.com/crewAIInc/crewAI
    chunkhive analyze ./local/repo --output analysis.json

### Show Examples

    chunkhive examples

### Check Version & Info

    chunkhive version    # Show current version
    chunkhive info       # Show system information


## 📦 Output Schema (Simplified)

```json
{
  "chunk_id": "primary_a1b2c3d4",
  "file_path": "src/example.py",
  "chunk_type": "function",
  "language": "python",
  "code": "...",
  "ast": {
    "name": "my_function",
    "parent": "MyClass",
    "symbol_type": "function",
    "docstring": "Function documentation",
    "decorators": ["@decorator"],
    "imports": ["import module"]
  },
  "span": {
    "start_byte": 123,
    "end_byte": 456,
    "start_line": 10,
    "end_line": 25
  },
  "hierarchy": {
    "parent_id": "parent_chunk_id",
    "children_ids": ["child1", "child2"],
    "depth": 2,
    "is_primary": true
  },
  "metadata": {
    "byte_accuracy": "exact_bytes",
    "repo_info": {
      "agentic_detection": {"langchain": "usage"},
      "dependencies": {"python_packages": ["pandas", "numpy"]},
      "git": {"remote_url": "https://github.com/user/repo"},
      "structure": {"file_types": {".py": 50, ".md": 10}}
    },
    "repository_context": {
      "similar_files": ["src/other.py"],
      "total_similar_files": 5
    }
  }
}

```
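
A downstream consumer might rebuild the parent/child tree from the chunk records. The following sketch assumes the output is JSON Lines (one object per line, as the `.jsonl` extension in the quick start suggests) and that root chunks carry a null `parent_id`. Both points are assumptions, and the sample records are made up and trimmed to the fields used here.

```python
import io
import json
from collections import defaultdict

# Made-up records that follow the simplified schema (only the fields used below).
records = [
    {"chunk_id": "primary_00000001", "chunk_type": "class",
     "ast": {"name": "MyClass"}, "hierarchy": {"parent_id": None, "depth": 1}},
    {"chunk_id": "primary_00000002", "chunk_type": "method",
     "ast": {"name": "run"}, "hierarchy": {"parent_id": "primary_00000001", "depth": 2}},
    {"chunk_id": "primary_00000003", "chunk_type": "method",
     "ast": {"name": "stop"}, "hierarchy": {"parent_id": "primary_00000001", "depth": 2}},
]
# Stand-in for an opened chunk file such as ./chunks.jsonl.
sample_file = io.StringIO("\n".join(json.dumps(r) for r in records) + "\n")


def load_chunks(lines) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def children_by_parent(chunks: list) -> dict:
    """Map each parent_id to the IDs of its direct children, in file order."""
    index = defaultdict(list)
    for chunk in chunks:
        index[chunk["hierarchy"]["parent_id"]].append(chunk["chunk_id"])
    return dict(index)


chunks = load_chunks(sample_file)
tree = children_by_parent(chunks)
methods = [c["ast"]["name"] for c in chunks if c["chunk_type"] == "method"]

print(tree)
print(methods)
```

With a real file, `sample_file` would be replaced by `open(path, encoding="utf-8")`; the rest of the sketch stays the same.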

## 🛠 Use Cases

- Code embedding model training
- RAG pipelines
- Agentic AI systems
- Code search & navigation
- QA dataset generation
- Static analysis & tooling
- Enterprise codebase intelligence
- AI training data generation


## 📜 License

Apache License 2.0: free to use, modify, and distribute, including for commercial use.

