Metadata-Version: 2.4
Name: docchunker
Version: 0.2.0
Summary: A specialized document chunking library for complex DOCX and PDF document structures
Author-email: Vlad Griguta <vlad.griguta@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/vladGriguta/DocChunker
Project-URL: Bug Tracker, https://github.com/vladGriguta/DocChunker/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-docx>=0.8.11
Requires-Dist: tiktoken>=0.9.0
Requires-Dist: pypdf>=3.15.1
Requires-Dist: pymupdf>=1.23.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.1; extra == "dev"
Requires-Dist: black>=23.9.1; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: ruff>=0.0.292; extra == "dev"
Requires-Dist: pyyaml>=6.0.2; extra == "dev"
Requires-Dist: notebook>=7.4.0; extra == "dev"
Requires-Dist: langchain>=0.3.0; extra == "dev"
Requires-Dist: langchain-community>=0.3.0; extra == "dev"
Requires-Dist: openai>=1.80.0; extra == "dev"
Requires-Dist: faiss-cpu>=1.11.0; extra == "dev"
Dynamic: license-file

# DocChunker

A specialized document chunking library designed to handle complex document structures in DOCX and PDF files. DocChunker intelligently processes structured documents containing tables, nested lists, images, and other complex elements to create semantically meaningful chunks that preserve context.

DocChunker accepts documents as file paths, raw bytes, or file-like objects, which makes it suitable for web applications, database integration, and cloud-based document processing pipelines.

## Key Features

*   **In-Memory Processing**: Process documents from bytes, BytesIO objects, or file paths, which suits web uploads, databases, and cloud storage.
*   **Multi-Format Support**: Full support for DOCX and PDF documents with intelligent structure detection.
*   **Advanced Document Parsing**: Handles complex elements like nested lists, tables with merged cells, headings, and paragraphs.
*   **Contextual Chunking**: Preserves document hierarchy (headings, etc.) within chunks for better semantic understanding.
*   **Overlap Control**: Configure element-based overlap between chunks to maintain context continuity.
*   **Configurable Strategy**: Tune chunk size (tokens) and overlap parameters for optimal performance.
*   **Semantic Cohesion**: Aims to keep related content (list items, table rows) together.
*   **RAG-Optimized**: Produces context-preserving chunks intended for retrieval-augmented generation (RAG) pipelines.

## Installation

```bash
pip install docchunker
```

DocChunker requires Python 3.9+ and supports both DOCX and PDF processing out of the box.

## Quick Start

### Basic Usage

```python
from docchunker import DocChunker

# Initialize the chunker with desired settings
chunker = DocChunker(chunk_size=200)

# Process DOCX from file path
chunks = chunker.process_document("document.docx")

# Process PDF from file path
chunks = chunker.process_document("document.pdf")

# Process from bytes (web uploads, database, etc.)
with open("document.docx", "rb") as f:
    document_bytes = f.read()
chunks = chunker.process_document_bytes(document_bytes, "docx")

# Work with chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.metadata['node_type']} - {len(chunk.text)} chars")
    print(f"Headings: {chunk.metadata['headings']}")
```

### Advanced Configuration

```python
from docchunker import DocChunker

# Configure chunk size and overlap for better context preservation
chunker = DocChunker(
    chunk_size=300,              # Target tokens per chunk
    num_overlapping_elements=2,  # Elements to overlap between chunks
)

chunks = chunker.process_document("complex_document.pdf")

# Overlap provides better context continuity
for chunk in chunks:
    if chunk.metadata.get('has_overlap'):
        print(f"Overlapped {chunk.metadata['overlap_elements']} elements from previous chunk")
```

### Common Use Cases

```python
from docchunker import DocChunker
from io import BytesIO
import requests

chunker = DocChunker(chunk_size=200, num_overlapping_elements=1)

# 1. Web uploads/API
response = requests.get("https://example.com/document.pdf")
response.raise_for_status()  # Fail early on HTTP errors instead of chunking an error page
chunks = chunker.process_document_bytes(response.content, "pdf")

# 2. BytesIO objects (direct processor access)
file_obj = BytesIO(response.content)  # Reuse the PDF bytes downloaded above
pdf_processor = chunker.processors["pdf"]
chunks = pdf_processor.process(file_obj)

# 3. Database BLOBs
# document_bytes = database.get_document_blob(doc_id)
# chunks = chunker.process_document_bytes(document_bytes, "docx")

# 4. Batch processing
for file_path in ["doc1.docx", "doc2.pdf", "doc3.docx"]:
    chunks = chunker.process_document(file_path)
    print(f"Processed {len(chunks)} chunks from {file_path}")
```
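
When documents arrive as raw bytes, the format string passed to `process_document_bytes` has to come from somewhere, typically the uploaded file name. The sketch below shows one way to derive it. `infer_format` and `SUPPORTED_FORMATS` are hypothetical helpers written for this example, not part of DocChunker, and the mapping assumes the format strings are exactly `"docx"` and `"pdf"` as used above.

```python
from pathlib import Path

# Formats named in this README; the mapping itself is an illustrative assumption.
SUPPORTED_FORMATS = {".docx": "docx", ".pdf": "pdf"}


def infer_format(filename: str) -> str:
    """Return the format string ("docx" or "pdf") for an uploaded file name."""
    suffix = Path(filename).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {filename!r}") from None


# Usage with a hypothetical upload that provides a name and raw bytes:
# chunks = chunker.process_document_bytes(upload_bytes, infer_format(upload_name))
```

Rejecting unknown extensions up front keeps unsupported files from reaching the chunker at all.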

## RAG Demo

For an end-to-end example of building a simple RAG system using DocChunker with LangChain, check out the `examples/RAG_demo.ipynb` notebook.

## Development

To contribute to DocChunker:

```bash
# Clone the repository
git clone https://github.com/vladGriguta/DocChunker
cd DocChunker

# Set up development environment (the install step uses uv; plain `pip install -e ".[dev]"` also works)
python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Run tests
pytest
```

## Configuration Parameters

### DocChunker Parameters

- **`chunk_size`** (int, default: 200): Target number of tokens per chunk. Chunks may exceed this size to maintain semantic cohesion.

- **`num_overlapping_elements`** (int, default: 0): Number of elements (list items, table rows) to overlap between adjacent chunks. This provides better context continuity for information retrieval:
  - `0`: No overlap - each element appears in only one chunk
  - `1-3`: Recommended for most use cases - provides context while minimizing duplication
  - `>3`: High overlap - useful for very context-sensitive applications but increases chunk redundancy
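
To make the interaction between the two parameters concrete, here is a minimal sketch of greedy element packing with element-based overlap. It is an illustration under stated assumptions, not DocChunker's actual implementation: `pack_with_overlap` is a hypothetical function, elements are plain strings, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
from typing import Callable, List


def pack_with_overlap(
    elements: List[str],
    chunk_size: int,
    num_overlapping_elements: int = 0,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[List[str]]:
    """Greedily group elements into chunks, repeating trailing elements as overlap."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    fresh = 0  # Elements in `current` that are not carried-over overlap

    for element in elements:
        size = count_tokens(element)
        if fresh > 0 and current_tokens + size > chunk_size:
            chunks.append(current)
            # Seed the next chunk with the last N elements of the chunk just closed.
            if num_overlapping_elements > 0:
                current = current[-num_overlapping_elements:]
            else:
                current = []
            current_tokens = sum(count_tokens(e) for e in current)
            fresh = 0
        # Elements are never split, so an oversized one can push a chunk past chunk_size.
        current.append(element)
        current_tokens += size
        fresh += 1

    if fresh > 0:
        chunks.append(current)
    return chunks


rows = ["a b c", "d e f", "g h i", "j k l"]
print(pack_with_overlap(rows, chunk_size=6, num_overlapping_elements=1))
# [['a b c', 'd e f'], ['d e f', 'g h i'], ['g h i', 'j k l']]
```

With `num_overlapping_elements=0` the same input yields two disjoint chunks, and a single element longer than `chunk_size` still forms its own chunk, which is one way a chunk can exceed the target size.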

### When to Use Overlap

Use `num_overlapping_elements > 0` when:
- Building RAG systems where context is critical
- Processing documents with closely related list items or table rows
- Working with technical documentation where missing context reduces comprehension

Use `num_overlapping_elements = 0` when:
- Processing very large documents where duplication is costly
- Building simple search indices where exact deduplication is important
- Working with documents where elements are largely independent
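
One way to weigh this trade-off is to measure how much duplication a given overlap setting introduces. The sketch below is a hypothetical helper, not part of DocChunker: it assumes each chunk can be represented as a sequence of element identifiers (or element texts) and reports stored elements per distinct element.

```python
from typing import Hashable, Sequence


def redundancy_ratio(chunks: Sequence[Sequence[Hashable]]) -> float:
    """Total stored elements divided by distinct elements (1.0 means no duplication)."""
    total = sum(len(chunk) for chunk in chunks)
    # Elements with identical values count as one, so prefer unique IDs over raw text.
    distinct = len({element for chunk in chunks for element in chunk})
    return total / distinct if distinct else 1.0


no_overlap = [["a", "b"], ["c", "d"]]
one_overlap = [["a", "b"], ["b", "c"], ["c", "d"]]
print(redundancy_ratio(no_overlap))   # 1.0
print(redundancy_ratio(one_overlap))  # 1.5
```

A ratio of 1.5 means the index stores 50% more elements than the document contains, which translates roughly into extra embedding and storage cost.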

## Future Roadmap

- [ ] **Chunk Size Homogenization**: Implement strategies to reduce chunk size variance.
- [ ] **Enhanced Unit Testing**: Add more tests for complex tables and lists.
- [ ] **Retrieval Evaluation Framework**: Develop a framework to assess chunk effectiveness.
- [ ] **Increased Test Coverage**: Systematically improve overall code coverage.
- [x] **PDF Support**: Full PDF parsing and chunking support with structure detection.
- [x] **Element Overlap**: Configurable overlap between chunks for better context preservation.
- [ ] **Advanced Element Handling**: Support for images (captions/alt-text), headers/footers, footnotes.
- [ ] **Performance Optimizations**: Profile and optimize for very large documents.

## License

MIT

## About the Author

DocChunker is developed by **Vlad Griguta**. Connect with me on [LinkedIn](https://www.linkedin.com/in/vlad-marius-griguta) or [GitHub](https://github.com/vladGriguta).
