Metadata-Version: 2.4
Name: splitter-mr
Version: 0.1.2
Summary: A modular text splitting library.
Author-email: Andrés Herencia <andresherencia2000@gmail.com>
License: MIT License
        
        Copyright (c) 2025 Andrés Herencia
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: homepage, https://github.com/andreshere00/splitter_mr
Project-URL: repository, https://github.com/andreshere00/splitter_mr
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: docling>=2.15.1
Requires-Dist: fastparquet>=2024.11.0
Requires-Dist: langchain-text-splitters>=0.3.8
Requires-Dist: markitdown>=0.1.2
Requires-Dist: pyyaml>=6.0.2
Dynamic: license-file

# Splitter MR

<img src="./assets/splitter_mr_logo.svg" alt="SplitterMR logo" width="80%"/>

## Description

**SplitterMR** is a library for chunking data into convenient text blocks compatible with your LLM applications.

## Features

### Different input formats

Splitter MR can read data from multiples sources and files. To read the files, it uses the Reader components, which inherits from a Base abstract class, `BaseReader`. This object allows you to read the files as a properly formatted string, or convert the files into another format (such as `markdown` or `json`). 

Currently, there are supported three readers: `VanillaReader`, and `MarkItDownReader` and `DoclingReader`. These are the differences between each Reader component:

| **Reader**         | **Unstructured files & PDFs**    | **MS Office suite files**         | **Tabular data**        | **Files with hierarchical schema**      | **Image files**                  | **Markdown conversion** |
|--------------------|----------------------------------|-----------------------------------|-------------------------|----------------------------------------|----------------------------------|----------------------------------|
| **Vanilla Reader**      | `txt`, `csv`                     | –                                 | `csv`, `tsv`, `parquet`| `json`, `yaml`, `html`, `xml`          || No |----------------------------------| –                                |
| **MarkItDown Reader**   | `txt`, `md`, `pdf`               | `docx`, `xlsx`, `pptx`            | `csv`, `tsv`                  | `json`, `html`, `xml`                  | `jpg`, `png`, `pneg`             | Yes |
| **Docling Reader**      | `txt`, `md`, `pdf`                     | `docx`, `xlsx`, `pptx`            | –                 | `html`, `xhtml`                        | `png`, `jpeg`, `tiff`, `bmp`, `webp` | Yes |

### Serveral splitting methods

Splitter_MR allows you to split the files on many different ways depending on your needs. 

Main splitting methods include:

| Splitting Technique       | Description                                                                                                                                                                                                                                                                                                                                                                                                   |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Character Splitter**    | Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars: int or %). <br> **Compatible with:** Text.                                                                                                                                             |
| **Word Splitter**         | Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage. <br> **Parameters:** `chunk_size` (max words per chunk), `chunk_overlap` (overlapping words: int or %). <br> **Compatible with:** Text.                                                                                                                                                       |
| **Sentence Splitter**     | Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., `.`, `!`, `?`). <br> **Parameters:** `chunk_size` (max sentences per chunk), `chunk_overlap` (overlapping words: int or %), `sentence_separators` (list of characters). <br> **Compatible with:** Text.     |
| **Paragraph Splitter**    | Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks. <br> **Parameters:** `chunk_size` (max paragraphs per chunk), `chunk_overlap` (overlapping words: int or %), `line_break` (delimiter(s) for paragraphs). <br> **Compatible with:** Text.                                                                     |
| **Recursive Splitter**    | Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars), `separators` (list of characters to split on, e.g., `["\n\n", "\n", " ", ""]`). <br> **Compatible with:** Text. |
| **Paged Splitter**        | Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap. <br> **Parameters:** `num_pages` (pages per chunk), `chunk_overlap` (overlapping words). <br> **Compatible with:** Word, PDF, Excel, PowerPoint.                                                                                                                    |
| **Row/Column Splitter**   | For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive. <br> **Parameters:** `num_rows`, `num_cols` (rows/columns per chunk), `overlap` (overlapping rows or columns). <br> **Compatible with:** Tabular formats (csv, tsv, parquet, flat json).                                                   |
| **Schema Based Splitter** | Splits hierarchical documents (XML, HTML) based on element tags or keys, preserving the schema/structure. Splitting can be done on a specified or inferred parent key/tag. <br> **Parameters:** `chunk_size` (approx. max chars per chunk), `key` (optional parent key or tag). <br> **Compatible with:** XML, HTML.                                                                                          |
| **JSON Splitter**         | Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema. <br> **Parameters:** `max_chunk_size` (max chars per chunk), `min_chunk_size` (min chars per chunk). <br> **Compatible with:** JSON.                                                                                                                                                                    |
| **Semantic Splitter**     | Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings. <br> **Parameters:** `embedding_model` (model for embeddings), `max_tokens` (max tokens per chunk). <br> **Compatible with:** Text.                                                                                                                      |
| **HTMLTagSplitter**       | Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag. <br> **Parameters:** `chunk_size` (max chars per chunk), `tag` (HTML tag to split on, optional). <br> **Compatible with:** HTML.                                                                                           |
| **HeaderSplitter**        | Splits Markdown or HTML documents into chunks using header levels (e.g., `#`, `##`, or `<h1>`, `<h2>`). Uses configurable headers for chunking. <br> **Parameters:** `headers_to_split_on` (list of headers and semantic names), `chunk_size` (unused, for compatibility). <br> **Compatible with:** Markdown, HTML.                                                                                          |

### Output Format

#### Reader

The output object is `ReaderOutput`, a dictionary with the following structure:

```python
{
  text: Optional[str] = ""  # The extracted text
  document_name: Optional[str] = None  # The base name of the file
  document_path: str = ""  # The path to the document
  document_id: Optional[str] = None  # The document identifier (given by default by an UUID)
  conversion_method: Optional[str] = None  # The format in which the file has been converted (markdown, json, etc.)
  ocr_method: Optional[str] = None  # The OCR method or VLM used to analyze images (TBD)
  metadata: Optional[List[str]]  # The appended metadata, introduced by the user (TBD)
}
```

#### Splitter

The output object is `SplitterOutput`, a dictionary with the following structure:

```python
{
  'chunks': List[str],  # The extracted chunks from the text
  'chunk_id': List[str],  # The identifier for the chunks (given by default with uuid)
  'document_name': Optional[str],  # The base name of the file.
  'document_path': str,  # The path to the document
  'document_id': Optional[str],  # The identifier for that document
  'conversion_method': Optional[str],  # The format in which the file has been converted (markdown, json, etc.)
  'ocr_method': Optional[str],  # The OCR method or VLM used to analyze images (TBD)
  'split_method': str,  # The splitting strategy used for chunking the document
  'split_params': Optional[Dict[str, Any]],  # The specific splitter parameters
  'metadata': Optional[List[str]]  # The appended metadata, introduced by the user (TBD)
}
```

## Architecture

![SplitterMR architecture diagram](./assets/architecture_splitter_mr.svg)

**SplitterMR** is designed around a modular pipeline that processes files from raw data all the way to chunked, LLM-ready text.

- **Reading**
    - A **`BaseReader`** implementation reads the (optionally converted) file.
    - Supported readers (e.g., **`VanillaReader`**, **`MarkItDownReader`**, **`DoclingReader`**) produce a `ReaderOutput` dictionary containing:
        - **Text** content (in `markdown`, `text`, `json` or another format).
        - Document **metadata**.
        - **Conversion** method.
- **Splitting**
    - A **`BaseSplitter`** implementation takes the **`ReaderOutput`** and divides the text into meaningful chunks for LLM or other downstream use.
    - Splitter classes (e.g., **`CharacterSplitter`**, **`SentenceSplitter`**, **`RecursiveSplitter`**, etc.) allow flexible chunking strategies with optional overlap and rich configuration.

**And that's it!** Your data is now prepared to be used in several LLM applications.

## How to install

Currently, the package can be installed executing the following instruction:

```python
pip install splitter-mr
```

We strongly recommend to install it using a python package management tool such as [`uv`](https://docs.astral.sh/uv/):

```python
uv add splitter-mr
```

## How to use

### Read files

Firstly, you need to instantiate an object from a BaseReader class, for example, `DoclingReader`.

```python
from splitter_mr.reader import DoclingReader

reader = DoclingReader()
```

To read any file, provide the file path within the `read()` method. If you use `DoclingReader` or `MarkItDownReader`, your files will be automatically parsed to markdown text format. The result of this reader will be a `ReaderOutput` object, a dictionary with the following shape:

```python 
reader_output = reader.read(file_path = "path/to/some/text.txt)
print(reader_output)
```
```bash
{'text': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet ultricies orci. Nullam et tellus dui.', 
'document_name': 'text.txt', 
'document_path': 'path/to/some/text.txt', 
'document_id': '47ccc8b1-7259-4c83-be03-60e9990aa5cd', 
'conversion_method': 'docling', 
'ocr_method': None, 
'metadata': {}
}
```

### Split text

To split the text, first import the class that implements your desired splitting strategy (e.g., by characters, recursively, by headers, etc.). Then, create an instance of this class and call its `split` method, which is defined in the `BaseSplitter` class.

For example, we will split by characters with a maximum chunk size of 50, with an overlapping betwen chunks of 10 characters:

```python
from splitter_mr.splitter import CharacterSplitter

char_splitter = CharacterSplitter(chunk_size=50, chunk_overlap = 10)
splitter_output = char_splitter.split(reader_output)
print(splitter_output)
```
```bash
{
'chunks': ['Lorem ipsum dolor sit amet, consectetur adipiscing', 'adipiscing elit. Vestibulum sit amet ultricies orc', 'ricies orci. Nullam et tellus dui'], 
'chunk_id': ['357ee998-407a-4476-b09a-ab753a62de76', '487a8b7f-cbb3-4c79-99c8-02f51d5dac71', '257fa636-04f7-4802-9da2-a0841be4e75b'], 
'document_name': 'text.txt', 
'document_path': 'path/to/some/text.txt', 
'document_id': '5a78e5a6-db00-4cfc-8fbd-b37eb7d699ff', 
'conversion_method': 'docling', 
'ocr_method': None, 
'split_method': 'character_splitter', 
'split_params': {'chunk_size': 50, 'chunk_overlap': 10}, 
'metadata': {}
}
```

The returned dictionary is a `SplitterOutput` object, which provides all the information you need to further process your data. You can easily add custom metadata, and you have access to details such as the document name, path, and type. Each chunk is uniquely identified by an UUID, allowing for easy traceability throughout your LLM workflow.

### Compatibility with vision tools for image processing and annotations

> Coming soon!

## Contact

If you want to collaborate, please, send me a mail to the following address: [andresherencia2000@gmail.com](mailto:andresherencia2000@gmail.com).
