Metadata-Version: 2.3
Name: semantic-chunker-langchain
Version: 0.1.2
Summary: Token-aware, LangChain-compatible semantic chunker with PDF and layout support
License: MIT
Author: Prajwal Shivaji Mandale
Author-email: prajwal.mandale333@gmail.com
Requires-Python: >=3.9,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: faiss-cpu (>=1.11.0,<2.0.0)
Requires-Dist: langchain (>=0.3.25,<0.4.0)
Requires-Dist: langchain-community (>=0.3.26,<0.4.0)
Requires-Dist: openai (>=1.84.0,<2.0.0)
Requires-Dist: pdfplumber (>=0.11.6,<0.12.0)
Requires-Dist: tiktoken (>=0.9.0,<0.10.0)
Description-Content-Type: text/markdown

# Semantic Chunker for LangChain

A **token-aware**, **LangChain-compatible** chunker that splits text (from PDF, markdown, or plain text) into semantically coherent chunks while respecting model token limits.

---

## 🚀 Features

* 🔍 **Model-Aware Token Limits**: Automatically adjusts chunk size to the token limit of the target model (GPT-3.5, GPT-4, Claude, and others).
* 📄 **Multi-format Input Support**:
  * PDF via `pdfplumber`
  * Plain `.txt`
  * Markdown
  * (Extendable to `.docx` and `.html`)
* 🔁 **Overlapping Chunks**: Smart overlap between paragraphs to preserve context.
* 🧠 **Smart Merging**: Merges chunks smaller than 300 tokens.
* 🧩 **Retriever-Ready**: Direct integration with `LangChain` retrievers via FAISS.
* 🔧 **CLI Support**: Run from terminal with one command.
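
The overlap and merging features above can be illustrated with a minimal, standard-library sketch. This is not the package's actual implementation: the function names (`pack_with_overlap`, `merge_small`), the paragraph-level overlap, and the whitespace-based `count_tokens` (a stand-in for a real tokenizer such as `tiktoken`) are all assumptions made for illustration.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_with_overlap(
    paragraphs: List[str],
    max_tokens: int,
    overlap: int = 1,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack paragraphs into chunks, carrying the last `overlap`
    paragraphs of each finished chunk into the next one.

    A single paragraph longer than max_tokens is kept whole in this sketch.
    """
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and counter("\n\n".join(candidate)) > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail forward as context, unless that would
            # already exceed the budget together with the new paragraph.
            tail = current[-overlap:] if overlap > 0 else []
            if counter("\n\n".join(tail + [para])) > max_tokens:
                tail = []
            current = tail + [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_small(
    chunks: List[str],
    min_tokens: int = 300,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Merge any chunk shorter than min_tokens with a neighbouring chunk."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and counter(merged[-1]) < min_tokens:
            # The previous chunk is still too small: absorb this one into it.
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk has no successor, so fold it into its predecessor.
    if len(merged) > 1 and counter(merged[-1]) < min_tokens:
        last = merged.pop()
        merged[-1] = merged[-1] + "\n\n" + last
    return merged


if __name__ == "__main__":
    paras = ["a b c", "d e f", "g h i", "j k l"]
    for chunk in pack_with_overlap(paras, max_tokens=6, overlap=1):
        print(repr(chunk))
```

Note that in this sketch merging can push a chunk above `max_tokens`; a production chunker would need to re-check the limit after merging.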

---

## 📦 Installation

```bash
pip install semantic-chunker-langchain
```

> Requires Python 3.9 to 3.12.

---

## 🛠️ Usage

### 🔸 Chunk a PDF and Save to JSON/TXT

```bash
semantic-chunker sample.pdf --txt chunks.txt --json chunks.json
```

### 🔸 From Code

```python
from semantic_chunker_langchain.chunker import SemanticChunker, SimpleSemanticChunker
from semantic_chunker_langchain.extractors.pdf import extract_pdf
from semantic_chunker_langchain.outputs.formatter import write_to_txt

# Extract
docs = extract_pdf("sample.pdf")

# Using SemanticChunker
chunker = SemanticChunker(model_name="gpt-3.5-turbo")
chunks = chunker.split_documents(docs)

# Save to file
write_to_txt(chunks, "output.txt")

# Using SimpleSemanticChunker
simple_chunker = SimpleSemanticChunker(model_name="gpt-3.5-turbo")
simple_chunks = simple_chunker.split_documents(docs)
```


### 🔸 Convert to Retriever

```python
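from langchain_community.embeddings import OpenAIEmbeddings

# Reuses `chunker` and `chunks` from the "From Code" example; OpenAIEmbeddings needs OPENAI_API_KEY set.
retriever = chunker.to_retriever(chunks, embedding=OpenAIEmbeddings())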
```

---

## 🧪 Testing

```bash
poetry run pytest tests/
```

---

## 👨‍💻 Authors

* Prajwal Shivaji Mandale
* Sudhnwa Ghorpade

---

## 📜 License

This project is licensed under the MIT License.

