Metadata-Version: 2.1
Name: wtpsplit-lite
Version: 0.1.0
Summary: wtpsplit with minimal dependencies
Home-page: https://github.com/superlinear-ai/wtpsplit-lite
Author: Laurent Sorber
Author-email: laurent@superlinear.eu
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: huggingface-hub (>=0.25.2)
Requires-Dist: numpy (>=1.0.0)
Requires-Dist: onnxruntime (>=1.13.1)
Requires-Dist: tokenizers (>=0.20)
Project-URL: Repository, https://github.com/superlinear-ai/wtpsplit-lite
Description-Content-Type: text/markdown

[![Open in Dev Containers](https://img.shields.io/static/v1?label=Dev%20Containers&message=Open&color=blue&logo=data:image/svg%2bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0iI2ZmZiIgZD0iTTE3IDE2VjdsLTYgNU0yIDlWOGwxLTFoMWw0IDMgOC04aDFsNCAyIDEgMXYxNGwtMSAxLTQgMmgtMWwtOC04LTQgM0gzbC0xLTF2LTFsMy0zIi8+PC9zdmc+)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/superlinear-ai/wtpsplit-lite) [![Open in GitHub Codespaces](https://img.shields.io/static/v1?label=GitHub%20Codespaces&message=Open&color=blue&logo=github)](https://github.com/codespaces/new/superlinear-ai/wtpsplit-lite)

# ✂️ wtpsplit-lite

🪓 [wtpsplit](https://github.com/segment-any-text/wtpsplit) is a Python package that offers training, inference, and evaluation of state-of-the-art Segment any Text (SaT) models for partitioning text into sentences.

✂️ [wtpsplit-lite](https://github.com/superlinear-ai/wtpsplit-lite) is a lightweight version of wtpsplit that only retains accelerated ONNX inference of SaT models with minimal dependencies:

1. [huggingface-hub](https://github.com/huggingface/huggingface_hub) to download the model
2. [numpy](https://github.com/numpy/numpy) to process the model input and output
3. [onnxruntime](https://github.com/microsoft/onnxruntime) to run the model
4. [tokenizers](https://github.com/huggingface/tokenizers) to tokenize the text for the model

## Installing

To install this package, run:

```sh
pip install wtpsplit-lite
```

## Using

> [!TIP]
> For a complete list of Segment any Text (SaT) models and all `SaT.split` keyword arguments, see the [wtpsplit README](https://github.com/segment-any-text/wtpsplit).

Example usage:

```python
from wtpsplit_lite import SaT

text = """
It is known that Maxwell’s electrodynamics—as usually understood at the
present time—when applied to moving bodies, leads to asymmetries which do
not appear to be inherent in the phenomena. Take, for example, the recipro-
cal electrodynamic action of a magnet and a conductor.
"""

# Fast (~150ms/page), good quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256)

# Slow, highest quality:
sat = SaT("sat-12l-sm")
sentences = sat.split(text)
```

This package also [contributes a new 'hat' weighting scheme to wtpsplit](https://github.com/segment-any-text/wtpsplit/pull/147) that improves output quality when using large strides. To enable it, set `weighting="hat"` as follows:

```python
# Fast (~150ms/page), better quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256, weighting="hat")
```
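
To give some intuition for why a hat-shaped weighting can help with large strides, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code used by wtpsplit or wtpsplit-lite: the function names `hat_weights` and `blend_blocks`, the exact triangular formula, and the toy scores are all hypothetical. The idea it shows is that when overlapping blocks are averaged, predictions near the centre of a block (where the model sees context on both sides) count for more than predictions near its edges.

```python
def hat_weights(block_size: int) -> list[float]:
    """Triangular ("hat") weights: largest at the block centre, smaller at the edges.

    Hypothetical formula chosen for illustration; every weight is strictly positive.
    """
    centre = (block_size - 1) / 2
    return [1.0 - abs(i - centre) / (centre + 1) for i in range(block_size)]


def blend_blocks(
    block_scores: list[list[float]],
    stride: int,
    length: int,
    weights: list[float],
) -> list[float]:
    """Weighted average of per-position scores from overlapping blocks.

    Block number b is assumed to start at position b * stride.
    """
    totals = [0.0] * length
    norms = [0.0] * length
    for b, scores in enumerate(block_scores):
        start = b * stride
        for i, score in enumerate(scores):
            pos = start + i
            if pos < length:
                totals[pos] += weights[i] * score
                norms[pos] += weights[i]
    # Positions not covered by any block fall back to 0.0.
    return [t / n if n else 0.0 for t, n in zip(totals, norms)]


# Toy example: two blocks of size 4 with stride 2 overlap on positions 2 and 3.
weights = hat_weights(4)
blended = blend_blocks([[0.0] * 4, [1.0] * 4], stride=2, length=6, weights=weights)
uniform = blend_blocks([[0.0] * 4, [1.0] * 4], stride=2, length=6, weights=[1.0] * 4)
```

In this toy example, uniform averaging gives both overlapping positions a score of 0.5, whereas the hat weights pull each position towards the block in which it lies closer to the centre (roughly 1/3 at position 2 and 2/3 at position 3).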

> [!NOTE]
> In wtpsplit, the SaT implementation treats newlines as sentence boundaries by default. However, this leads to poor results on text extracted from PDFs, such as the example above. In wtpsplit-lite, newlines are therefore treated as whitespace by default. You can choose the behavior you prefer with the `treat_newline_as_space` boolean keyword argument of the `SaT.split` method.
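
As a small sketch of how the choice described in the note could be made explicit at the call site, the helper below forwards a flag to `SaT.split`. The helper name `split_with_newline_policy` and its `newlines_are_boundaries` parameter are hypothetical and not part of wtpsplit-lite; only the `treat_newline_as_space` keyword argument comes from this README.

```python
def split_with_newline_policy(sat, text, newlines_are_boundaries=False):
    """Split `text` with an explicit choice for how newlines are handled.

    `sat` is expected to be an already constructed SaT instance.
    """
    # treat_newline_as_space=True treats newlines as whitespace (the
    # wtpsplit-lite default); False treats them as sentence boundaries.
    return sat.split(text, treat_newline_as_space=not newlines_are_boundaries)


# Example usage (requires wtpsplit-lite and a model download):
# from wtpsplit_lite import SaT
# sat = SaT("sat-3l-sm")
# sentences = split_with_newline_policy(sat, text, newlines_are_boundaries=True)
```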

## Contributing

<details>
<summary>Prerequisites</summary>

<details>
<summary>1. Set up Git to use SSH</summary>

1. [Generate an SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent#generating-a-new-ssh-key) and [add the SSH key to your GitHub account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
1. Configure SSH to automatically load your SSH keys:
    ```sh
    cat << EOF >> ~/.ssh/config
    
    Host *
      AddKeysToAgent yes
      IgnoreUnknown UseKeychain
      UseKeychain yes
      ForwardAgent yes
    EOF
    ```

</details>

<details>
<summary>2. Install Docker</summary>

1. [Install Docker Desktop](https://www.docker.com/get-started).
    - _Linux only_:
        - Export your user's user id and group id so that [files created in the Dev Container are owned by your user](https://github.com/moby/moby/issues/3206):
            ```sh
            cat << EOF >> ~/.bashrc
            
            export UID=$(id --user)
            export GID=$(id --group)
            EOF
            ```

</details>

<details>
<summary>3. Install VS Code or PyCharm</summary>

1. [Install VS Code](https://code.visualstudio.com/) and [VS Code's Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers). Alternatively, install [PyCharm](https://www.jetbrains.com/pycharm/download/).
2. _Optional:_ install a [Nerd Font](https://www.nerdfonts.com/font-downloads) such as [FiraCode Nerd Font](https://github.com/ryanoasis/nerd-fonts/tree/master/patched-fonts/FiraCode) and [configure VS Code](https://github.com/tonsky/FiraCode/wiki/VS-Code-Instructions) or [configure PyCharm](https://github.com/tonsky/FiraCode/wiki/Intellij-products-instructions) to use it.

</details>

</details>

<details open>
<summary>Development environments</summary>

The following development environments are supported:

1. ⭐️ _GitHub Codespaces_: click on _Code_ and select _Create codespace_ to start a Dev Container with [GitHub Codespaces](https://github.com/features/codespaces).
1. ⭐️ _Dev Container (with container volume)_: click on [Open in Dev Containers](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/superlinear-ai/wtpsplit-lite) to clone this repository in a container volume and create a Dev Container with VS Code.
1. _Dev Container_: clone this repository, open it with VS Code, and run <kbd>Ctrl/⌘</kbd> + <kbd>⇧</kbd> + <kbd>P</kbd> → _Dev Containers: Reopen in Container_.
1. _PyCharm_: clone this repository, open it with PyCharm, and [configure Docker Compose as a remote interpreter](https://www.jetbrains.com/help/pycharm/using-docker-compose-as-a-remote-interpreter.html#docker-compose-remote) with the `dev` service.
1. _Terminal_: clone this repository, open it with your terminal, and run `docker compose up --detach dev` to start a Dev Container in the background, and then run `docker compose exec dev zsh` to open a shell prompt in the Dev Container.

</details>

<details>
<summary>Developing</summary>

- This project follows the [Conventional Commits](https://www.conventionalcommits.org/) standard to automate [Semantic Versioning](https://semver.org/) and [Keep A Changelog](https://keepachangelog.com/) with [Commitizen](https://github.com/commitizen-tools/commitizen).
- Run `poe` from within the development environment to print a list of [Poe the Poet](https://github.com/nat-n/poethepoet) tasks available to run on this project.
- Run `poetry add {package}` from within the development environment to install a runtime dependency and add it to `pyproject.toml` and `poetry.lock`. Add `--group test` or `--group dev` to install a CI or development dependency, respectively.
- Run `poetry update` from within the development environment to upgrade all dependencies to the latest versions allowed by `pyproject.toml`.
- Run `cz bump` to bump the package's version, update the `CHANGELOG.md`, and create a git tag.

</details>

