Metadata-Version: 2.3
Name: kathairo
Version: 0.4.2
Summary: Scripture processing pipeline for converting USFM and USX formats into structured TSV files
Keywords: bible,scripture,usfm,usx,nlp,tokenization,versification
Author: Robertson Brinker
Author-email: robertsonbrinker@gmail.com
Requires-Python: >=3.10,<3.12
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Religion
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Religion
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: polars (>=1.4.0,<2.0.0)
Requires-Dist: sil-machine (==1.0.2)
Requires-Dist: spacy (>=3.7.5,<4.0.0)
Project-URL: Homepage, https://github.com/ClearBible/kathairo.py
Project-URL: Repository, https://github.com/ClearBible/kathairo.py
Description-Content-Type: text/markdown

# kathairo

### Scripture Processing Pipeline: Parse, Tokenize, and Versify

[![PyPI version](https://badge.fury.io/py/kathairo.svg)](https://badge.fury.io/py/kathairo)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

**kathairo** is a comprehensive text processing pipeline for Scripture that converts USFM and USX formats into structured TSV files. Built on [SIL's machine.py](https://github.com/sillsdev/machine.py), kathairo provides both verse-level and token-level outputs with enhanced support for psalm superscriptions and advanced tokenization.

## Installation

```bash
pip install kathairo
```

### Requirements

- Python 3.10 or 3.11
- Dependencies (automatically installed):
  - `sil-machine==1.0.2`
  - `spacy>=3.7.5`
  - `polars>=1.4.0`

## Quick Start

### Minimal Example

The simplest way to use kathairo:

```python
from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)
# Output: output/eng/MyBible/verse/verse_MyBible.tsv
#         output/eng/MyBible/token/token_MyBible.tsv
```

### Complete Workflow Example

Here's a realistic workflow from installation to processing:

```python
# Step 1: Install kathairo
# pip install kathairo

# Step 2: Import and setup
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
import kathairo

# Step 3: Locate built-in versification file
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

# Step 4: Configure your project
config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/home/user/bibles/ESV/usfm/",
    "targetVersificationPath": vrs_path,
    "latinWhiteSpaceIncludedTokenizer": True,
    "treatApostropheAsSingleQuote": False,
    "excludeBracketedText": False,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

# Step 5: Process the corpus
print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")

# Output files created at:
# - output/eng/ESV/verse/verse_ESV.tsv
# - output/eng/ESV/token/token_ESV.tsv
```

### Processing Multiple Projects in Parallel

Create a JSON configuration file (e.g., `projects_config.json`):

```json
[
  {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/scripture/ESV/usfm/",
    "targetVersificationPath": "/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "treatApostropheAsSingleQuote": false,
    "excludeBracketedText": false,
    "excludeCrossReferences": true,
    "stopWordsPath": "/config/english_stopwords.tsv",
    "zwRemovalPath": "/config/zw_removal.tsv",
    "psalmSuperscriptionTag": "d"
  },
  {
    "projectName": "RVR1960",
    "language": "spa",
    "targetUsfmCorpusPath": "/scripture/RVR1960/usfm/",
    "targetVersificationPath": "/versification/spa.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "excludeCrossReferences": true
  }
]
```

Then process all projects:

```python
from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")
```

This processes all projects in parallel using Python's `ProcessPoolExecutor`.

### Configuration Options

| Option                             | Type    | Description                                       |
| ---------------------------------- | ------- | ------------------------------------------------- |
| `projectName`                      | string  | Name of your project (used in output file naming) |
| `language`                         | string  | Language code (e.g., "eng", "spa")                |
| `targetUsfmCorpusPath`             | string  | Path to USFM files directory                      |
| `targetUsxCorpusPath`              | string  | Path to USX files directory (alternative to USFM) |
| `targetVersificationPath`          | string  | Path to versification file (.vrs)                 |
| `latinTokenizer`                   | boolean | Use standard Latin word tokenizer                 |
| `latinWhiteSpaceIncludedTokenizer` | boolean | Use Latin tokenizer with whitespace preservation  |
| `chineseTokenizer`                 | boolean | Use Chinese Bible word tokenizer                  |
| `treatApostropheAsSingleQuote`     | boolean | Handle apostrophes as single quotes               |
| `excludeBracketedText`             | boolean | Exclude text within square brackets               |
| `excludeCrossReferences`           | boolean | Exclude cross-reference text                      |
| `stopWordsPath`                    | string  | Path to TSV file containing stop words            |
| `zwRemovalPath`                    | string  | Path to TSV file for zero-width character removal |
| `regexRulesPath`                   | string  | Path to custom regex rules module                 |
| `psalmSuperscriptionTag`           | string  | USFM tag for psalm superscriptions (default: "d") |
| `tsvPath`                          | string  | Path to existing token TSV (for re-versification) |
| `solo`                             | boolean | If present, only this project will be processed   |

### Choosing a Tokenizer

You must specify **exactly one** tokenizer in your configuration:

- **`latinTokenizer`** - Standard Latin word tokenizer from SIL Machine
  - Use for: Most Latin-script languages (English, Spanish, French, etc.)
  - Behavior: Standard word boundary detection

- **`latinWhiteSpaceIncludedTokenizer`** - Enhanced Latin tokenizer with whitespace preservation
  - Use for: Latin-script languages where you need precise whitespace tracking
  - Behavior: Preserves whitespace information for accurate reconstruction
  - Recommended for most projects

- **`chineseTokenizer`** - Chinese Bible word tokenizer
  - Use for: Chinese scripture text
  - Behavior: Specialized Chinese word segmentation

Example:
```python
# Pick ONE of these:
"latinTokenizer": True                      # Standard
"latinWhiteSpaceIncludedTokenizer": True    # Recommended
"chineseTokenizer": True                    # Chinese only
```

## Output Format

### Verse-Level TSV

Generated at: `output/{language}/{projectName}/verse/verse_{projectName}.tsv`

| Column                   | Description                        |
| ------------------------ | ---------------------------------- |
| `id`                     | Verse identifier (BBCCCVVV format) |
| `source_verse`           | Source versification verse ID      |
| `text`                   | Complete verse text                |
| `id_range_end`           | End verse for verse ranges         |
| `source_verse_range_end` | End verse in source versification  |

### Token-Level TSV

Generated at: `output/{language}/{projectName}/token/token_{projectName}.tsv`

| Column                   | Description                                          |
| ------------------------ | ---------------------------------------------------- |
| `id`                     | Token identifier (BBCCCVVVWWW format)                |
| `source_verse`           | Source versification verse ID                        |
| `text`                   | Token text                                           |
| `skip_space_after`       | "y" if no space should follow this token             |
| `exclude`                | "y" if token should be excluded from analysis        |
| `id_range_end`           | End verse for verse ranges                           |
| `source_verse_range_end` | End verse in source versification                    |
| `required`               | "y" if token contains non-punctuation, "n" otherwise |

## Usage Examples

### Example 1: Processing USFM Files

```python
from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/scripture/ESV/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

process_corpus(config)
```

### Example 2: Processing USX Files

```python
config = {
    "projectName": "NIV",
    "language": "eng",
    "targetUsxCorpusPath": "/data/scripture/NIV/usx/",  # Use USX instead of USFM
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinTokenizer": True  # Standard Latin tokenizer
}

process_corpus(config)
```

### Example 3: Chinese Scripture with Custom Tokenizer

```python
config = {
    "projectName": "CUV",
    "language": "zho",
    "targetUsfmCorpusPath": "/data/scripture/CUV/usfm/",
    "targetVersificationPath": "/data/versification/chinese.vrs",
    "chineseTokenizer": True,  # Use Chinese-specific tokenizer
    "excludeBracketedText": True
}

process_corpus(config)
```

### Example 4: Re-versification from Existing Token TSV

If you already have a token-level TSV and want to re-versify it:

```python
config = {
    "projectName": "ESV_Reversified",
    "language": "eng",
    "tsvPath": "/data/output/eng/ESV/token/token_ESV.tsv",  # Use existing TSV
    "targetVersificationPath": "/data/versification/new_versification.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)
```

### Example 5: With Stop Words and Zero-Width Character Removal

Create stop words file (`stopwords.tsv`):
```tsv
stop_words
\u200b
\u200c
\u200d
```

Create zero-width removal file (`zw_removal.tsv`):
```tsv
words
word\u200bwith\u200bzwsp
another\u200cword
```

Then use them:

```python
config = {
    "projectName": "MyBible",
    "language": "hin",
    "targetUsfmCorpusPath": "/data/hindi/usfm/",
    "targetVersificationPath": "/data/versification/hindi.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)
```

## Advanced Usage

### Custom Regex Rules

Create a custom Python module (`custom_regex.py`):

```python
import re

class CustomRegexRules:
    # Define custom punctuation patterns for word-level punctuation
    WORD_LEVEL_PUNCT_REGEX = re.compile(
        r'(?<=[a-zA-Z])[\''](?=[a-zA-Z])|'  # Apostrophes within words
        r'(?<=[0-9])[,.](?=[0-9])'          # Commas/periods in numbers
    )
```

Reference it in your config:

```python
config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "regexRulesPath": "/path/to/custom_regex.py"  # Use custom rules
}

process_corpus(config)
```

### Solo Mode for Testing

When working with multiple projects in a JSON config, add `"solo": true` to process only one:

```json
[
  {
    "projectName": "TestProject",
    "language": "eng",
    "targetUsfmCorpusPath": "/test/usfm/",
    "targetVersificationPath": "/test/eng.vrs",
    "latinTokenizer": true,
    "solo": true
  },
  {
    "projectName": "OtherProject",
    "language": "spa",
    ...
  }
]
```

Only `TestProject` will be processed.

## Required Files and Structure

### Input Files

1. **Scripture Files** - Either USFM or USX format:
   - USFM: Plain text files with USFM markup (e.g., `01-GEN.usfm`, `40-MAT.usfm`)
   - USX: XML files following USX standard
   - Directory should contain all books you want to process

2. **Versification File** (`.vrs`) - Defines verse structure:
   ```
   # Versification  "English"
   GEN 1:31 2:25 3:24 4:26 5:32 ...
   EXO 1:22 2:25 3:22 ...
   ```
   - kathairo includes `eng.vrs` at `src/kathairo/versification/eng.vrs`
   - Format: `BOOK chapter:lastVerse chapter:lastVerse ...`

3. **Optional: Stop Words TSV**:
   ```tsv
   stop_words
   \u200b
   \u200c
   \u200d
   ```

4. **Optional: Zero-Width Removal TSV**:
   ```tsv
   words
   problematic\u200bword
   another\u200cexample
   ```

### Output Structure

kathairo automatically creates this directory structure:

```
output/
└── {language}/
    └── {projectName}/
        ├── verse/
        │   └── verse_{projectName}.tsv
        └── token/
            └── token_{projectName}.tsv
```

## Versification Support

kathairo handles complex versification mappings between different Bible versification systems.

### Using Built-in English Versification

```python
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus

# Get the package's built-in versification file
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": vrs_path,  # Use built-in versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)
```

### Creating Custom Versification

```python
from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification, VerseRef

# Load and modify a versification
versification = Versification.load("path/to/base_versification.vrs")

# Make modifications (example: adjust Genesis 1:1 mapping)
versification.mappings.add_mapping(
    VerseRef("GEN", 1, 1, versification),
    VerseRef("GEN", 1, 2, versification)
)

# Save the modified versification
save_versification(versification)
# Creates: src/kathairo/versification/output.vrs
```

### Re-versifying Existing TSV Output

If you need to apply a different versification to already-processed scripture:

```python
config = {
    "projectName": "ESV_NewVersification",
    "language": "eng",
    "tsvPath": "/output/eng/ESV/token/token_ESV.tsv",  # Existing token TSV
    "targetVersificationPath": "/path/to/new_versification.vrs",  # New versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)
```

## Common Issues and Troubleshooting

### Import Errors

If you get `ModuleNotFoundError: No module named 'kathairo'`:
```bash
pip install kathairo
# or
pip install kathairo --upgrade
```

### Versification File Not Found

If you see errors about missing versification files:
```python
# Use the built-in English versification
import os
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config["targetVersificationPath"] = vrs_path
```

### No Tokenizer Specified

Error: `TypeError` or no tokenizer found

Solution: Add exactly one tokenizer to your config:
```python
config["latinWhiteSpaceIncludedTokenizer"] = True
```

### Empty Output Files

If your TSV files are empty:
1. Check that your USFM/USX files are in the correct directory
2. Verify the files have proper USFM/USX formatting
3. Ensure the versification file matches your scripture structure

### Performance Issues

For large projects:
- Use the JSON config approach with multiple projects to leverage parallel processing
- Each project is processed in a separate process for better performance

```python
from kathairo.tsvs.build_tsv_json_parser import main
main("projects.json")  # Processes all projects in parallel
```

## API Reference

### Main Functions

#### `process_corpus(config: dict) -> dict`

Process a single scripture project and generate TSV files.

**Parameters:**
- `config` (dict): Configuration dictionary with project settings

**Returns:**
- Dictionary with processing results

**Example:**
```python
from kathairo.tsvs.build_tsv_json_parser import process_corpus

result = process_corpus({
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
})
```

#### `main(json_path: str) -> None`

Process multiple projects in parallel from a JSON configuration file.

**Parameters:**
- `json_path` (str): Path to JSON configuration file

**Example:**
```python
from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")
```

#### `save_versification(versification: Versification) -> None`

Save a versification object to a .vrs file.

**Parameters:**
- `versification` (Versification): SIL Machine Versification object

**Output:**
- Creates `src/kathairo/versification/output.vrs`

**Example:**
```python
from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification

vrs = Versification.load("input.vrs")
save_versification(vrs)
```

### Lower-Level Functions

For advanced usage, you can access lower-level functions:

#### `corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)`

Convert a corpus directly to TSV files.

#### `tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)`

Re-versify an existing token TSV file.

## Technical Details

### Enhanced SIL Machine Features

kathairo extends `sil-machine` with:

- **Psalm Superscription Support**: Modified USFM parser handles psalm superscription tags
- **Improved Tokenizers**: Enhanced Latin tokenizer with whitespace preservation
- **Versification Mapping**: Automatic verse mapping between source and target versifications

### Text Processing Pipeline

1. **Parsing**: USFM/USX files parsed into structured corpus
2. **Tokenization**: Text tokenized using selected tokenizer
3. **Versification**: Verses mapped between source and target versifications
4. **Filtering**: Optional exclusion of bracketed text, cross-references, stop words
5. **Output**: Structured TSV files generated for verse and token levels

## Project Structure

```
kathairo/
├── parsing/           # USFM and USX parsers
│   ├── usfm/         # USFM parsing with modified handlers
│   └── usx/          # USX parsing utilities
├── tokenization/      # Tokenizer implementations
├── tsvs/             # TSV building and processing
├── versification/    # Versification utilities
└── helpers/          # Utility functions
```

## Contributing

Contributions are welcome! This project processes Scripture text and aims to provide reliable, accurate text processing for Bible translation and analysis workflows.

## Author

**Robertson Brinker** - [robertson.brinker@biblica.com](mailto:robertson.brinker@biblica.com)

## Acknowledgments

Built on [SIL's machine.py](https://github.com/sillsdev/machine.py) - a machine learning and natural language processing library for Scripture.

---

### Note to self: To update python package

1. `poetry config repositories.pypi https://upload.pypi.org/legacy/`
2. `$env:PYPI_USERNAME="__token__"`
3. `$env:PYPI_PASSWORD="<api-token>"`
4. `poetry publish --build --username $env:PYPI_USERNAME --password $env:PYPI_PASSWORD`

