Metadata-Version: 2.4
Name: mschunker
Version: 0.1.3
Summary: MSchunker – Smart text chunker for LLM preprocessing
Author: MS
License: MIT
Project-URL: Homepage, https://github.com/cspnms/MSchunker
Project-URL: Source, https://github.com/cspnms/MSchunker
Project-URL: Issues, https://github.com/cspnms/MSchunker/issues
Keywords: llm,rag,chunking,nlp,ai
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# MSchunker – Intelligent Text Chunking for LLMs

**MSchunker** is a lightweight, structure-aware, and deterministic text chunker designed for modern LLM pipelines.

It transforms long documents into **LLM-ready chunks** that maintain semantic integrity and are optimized for:

- Retrieval-Augmented Generation (**RAG**)
- Question Answering (**QA**)
- Summarization
- Memory systems
- Any workflow requiring precise text segmentation

MSchunker respects natural document structure (sections → paragraphs → sentences) and provides rich metadata, task-aware defaults, and optional overlap for cross-chunk context.

---

## Features

- **Structure-aware splitting**
  - Detects headings, sections, paragraphs, and sentences
- **Token/character limits**
  - Enforces `max_tokens` and/or `max_chars`
- **Hierarchical strategy**
  - Paragraphs → sentences → hard splits (fallback)
- **Optional token overlap**
  - Adds continuity across chunks
- **Rich metadata**
  - Section index, paragraph indices, sentence indices, split reasons, offsets
- **Deterministic output**
  - Same input + same settings → identical chunks
- **Lightweight**
  - Zero heavy NLP / ML dependencies
- **Clean, simple API**
  - `chunk_text(...)` handles everything
  - `Chunker` for stateful usage

---

## Installation

Install from PyPI:

```bash
pip install mschunker
```

Or directly from GitHub:

```bash
pip install git+https://github.com/cspnms/MSchunker.git
```


---

## Quickstart

```python
from mschunker import chunk_text

text = "... your long document ..."

chunks = chunk_text(
    text,
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

for ch in chunks:
    print("---- CHUNK ----")
    print(ch.text[:200], "...")
    print(ch.meta)
```

---

## API Reference

### `chunk_text(...)` — Main function

```python
chunks = chunk_text(
    text: str,
    max_tokens: int | None = 512,
    max_chars: int | None = None,
    overlap_tokens: int = 64,
    strategy: str = "auto",
    token_counter: callable | None = None,
    source_id: str | None = None,
    task: str | None = None,   # "rag" | "qa" | "summarization" | "memory"
)
```

Returns: `List[Chunk]`
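The `token_counter` hook accepts any callable mapping a string to a token count. A minimal sketch of such a counter (a rough word-and-punctuation approximation, not the library's built-in counting):

```python
import re

def simple_token_counter(text: str) -> int:
    """Rough token estimate: words plus standalone punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

print(simple_token_counter("Hello, world! This is a test."))  # → 9
```

It could then be passed as `chunk_text(text, token_counter=simple_token_counter)`.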

---

### `Chunker` — Stateful wrapper

```python
from mschunker import Chunker

c = Chunker(
    max_tokens=512,
    overlap_tokens=64,
    strategy="auto",
    task="rag",
)

chunks = c.chunk(text, source_id="doc-1")
```


---

### `Chunk` — Data Model

Each chunk contains:

- `.text` – the chunk content
- `.meta` – a dictionary with:
  - `section_index`
  - `section_heading`
  - `paragraph_indices`
  - `sentence_indices`
  - `split_reason`
  - `strategy`
  - `chunk_index`
  - `overlap_from_prev`
  - `overlap_tokens`
  - `source_id`
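The metadata makes downstream filtering straightforward. A sketch that groups chunk texts by section, using a hypothetical stand-in dataclass for `Chunk` (the real class's internals may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:  # stand-in for illustration; mirrors the .text / .meta shape above
    text: str
    meta: dict = field(default_factory=dict)

chunks = [
    Chunk("Intro text", {"section_index": 0, "chunk_index": 0}),
    Chunk("More intro", {"section_index": 0, "chunk_index": 1}),
    Chunk("Methods", {"section_index": 1, "chunk_index": 2}),
]

# Group chunk texts by the section they came from.
by_section: dict[int, list[str]] = {}
for ch in chunks:
    by_section.setdefault(ch.meta["section_index"], []).append(ch.text)

print(by_section)  # → {0: ['Intro text', 'More intro'], 1: ['Methods']}
```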

---

### `analyze_chunks(chunks)` — Statistics

```python
from mschunker import analyze_chunks

stats = analyze_chunks(chunks)
print(stats)
```

Example output:

```python
{
  "num_chunks": 12,
  "min_tokens": 118,
  "max_tokens": 482,
  "avg_tokens": 311.9
}
```

---

### `explain_chunk(chunk)` — Human-readable explanation

```python
from mschunker import explain_chunk

print(explain_chunk(chunks[0]))
```

Example:

```text
Strategy: auto | Split reason: paragraph_boundary |
Section #0 heading='Introduction' |
Paragraphs: (0, 1) | Chunk index: 0
```

---

## How MSchunker Works

MSchunker uses a hierarchical, structure-preserving algorithm:

1. **Sections / headings**
2. **Paragraphs**
3. **Sentences**
4. **Hard splits** (when a single paragraph or sentence exceeds the limits)

This keeps chunks semantically coherent and sized for LLM input.
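The fallback hierarchy can be illustrated with a toy character-based splitter (an illustration of the idea only, not the library's implementation):

```python
def split_hierarchically(paragraph: str, max_chars: int) -> list[str]:
    """Toy fallback: keep the paragraph whole if it fits, else split by
    sentence, and hard-split any sentence that is still too long."""
    if len(paragraph) <= max_chars:
        return [paragraph]
    pieces: list[str] = []
    for sentence in paragraph.split(". "):
        if len(sentence) <= max_chars:
            pieces.append(sentence)
        else:
            # Hard split: cut at the character limit as a last resort.
            pieces.extend(sentence[i:i + max_chars]
                          for i in range(0, len(sentence), max_chars))
    return pieces

print(split_hierarchically("Short one. " + "x" * 25, max_chars=10))
# → ['Short one', 'xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxx']
```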

Optional `overlap_tokens` adds continuity across chunks, which is ideal for RAG and QA.
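The effect of overlap can be sketched on pre-tokenized chunks (a conceptual illustration; the library applies overlap during chunking, with richer bookkeeping):

```python
def with_overlap(chunk_tokens: list[list[str]], overlap: int) -> list[list[str]]:
    """Prepend the last `overlap` tokens of each chunk to its successor."""
    out = [chunk_tokens[0]]
    for prev, cur in zip(chunk_tokens, chunk_tokens[1:]):
        out.append(prev[-overlap:] + cur)
    return out

chunks = [["a", "b", "c", "d"], ["e", "f"], ["g"]]
print(with_overlap(chunks, overlap=2))
# → [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g']]
```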

---

## Design Principles

- **Semantic integrity first**
  Meaning is preserved whenever possible.
- **Deterministic and transparent**
  Output and reasoning are reproducible.
- **Lightweight**
  No NLP or transformer dependencies.
- **Extensible foundation**
  Future roadmap:
  - Semantic (embedding-aware) chunking
  - Multi-granularity chunk outputs
  - Benchmark-driven tuning
  - RAG framework adapters

---

## License

MIT License © 2025 MS

---

## Contributing

Issues and pull requests are welcome.
MSchunker aims to grow into a fully featured, intelligent chunking engine.
