Metadata-Version: 2.4
Name: talk-tag
Version: 0.3.0
Summary: Transcript annotator for speaker-scoped CHAT corpus correction
License-Expression: MIT
Project-URL: Homepage, https://github.com/OliverHennhoefer/talk-tag
Project-URL: Repository, https://github.com/OliverHennhoefer/talk-tag
Project-URL: Issues, https://github.com/OliverHennhoefer/talk-tag/issues
Keywords: chat,transcript,annotation,linguistics,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: huggingface_hub>=0.29.3
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: orjson>=3.10.15
Requires-Dist: pandas>=2.2.3
Requires-Dist: tqdm>=4.67.1
Provides-Extra: runtime
Requires-Dist: numpy>=2.0.0; extra == "runtime"
Requires-Dist: peft>=0.12.0; extra == "runtime"
Requires-Dist: torch>=2.7.0; extra == "runtime"
Requires-Dist: transformers>=4.52.0; extra == "runtime"
Provides-Extra: dev
Requires-Dist: pytest>=8.3.5; extra == "dev"
Dynamic: license-file

# talk-tag

An adapter-only annotator for morphosyntactic errors in TalkBank CHAT transcripts. It reads `.cha` and `.jsonl` files.

The runtime deployment path is fixed to:

1. Base model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
2. Adapter: `mash-mash/talkbank-morphosyntax-annotator-final-recon_full_comp_preserve_final_seed3407`

The adapter is always applied on top of the base model at load time; no merged-model runtime path is used.

The package bundles the deployed CHAT token augmentation list and injects those
tokens into the tokenizer before loading the PEFT adapter. This step is required
to keep the tokenizer/model vocabulary aligned with the adapter checkpoint.
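
The load order described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual loader: the function name `load_annotator` and the `extra_tokens` parameter are hypothetical, the bundled token list itself is not reproduced here, and whether the package registers the tokens as regular or special tokens is not stated in this README. Pre-quantized 4-bit checkpoints typically also need `bitsandbytes` and a CUDA device, which is an assumption about the environment.

```python
BASE_MODEL = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
ADAPTER = (
    "mash-mash/talkbank-morphosyntax-annotator-final-"
    "recon_full_comp_preserve_final_seed3407"
)


def load_annotator(extra_tokens, base_model=BASE_MODEL, adapter=ADAPTER):
    """Hypothetical helper: extend the tokenizer, then attach the adapter."""
    # Imported lazily so the module imports without the `runtime` extra.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # Assumption: the CHAT tokens are added as regular tokens.
    tokenizer.add_tokens(list(extra_tokens))

    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Grow the embedding matrix to the new vocabulary size *before* the
    # adapter is attached, so its weights match the expected shapes.
    model.resize_token_embeddings(len(tokenizer))

    model = PeftModel.from_pretrained(model, adapter)
    return tokenizer, model
```

The ordering is the point of the sketch: if the adapter checkpoint was saved against an extended vocabulary, attaching it before the resize would likely fail with a shape mismatch.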

## Install

Python requirement: `>=3.10`.

```bash
pip install "talk-tag[runtime]"
```

The `runtime` extra installs `numpy`, `torch`, `transformers`, and `peft`.

## Hugging Face access

You need Hugging Face Hub access to both the base model and the adapter repository listed above. Set a token before the first run:

```bash
export HF_TOKEN=...
```

If the token or repository access is missing, `talk-tag doctor` and
`talk-tag model pull` report authentication or gated-repo errors.
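
If you want to check access from Python before running the CLI, something like the following may help. It is a sketch only and does not reflect how `talk-tag doctor` is implemented; `read_token` and `check_access` are hypothetical helper names. It relies on `HfApi.auth_check` from `huggingface_hub`.

```python
import os

REPOS = (
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "mash-mash/talkbank-morphosyntax-annotator-final-"
    "recon_full_comp_preserve_final_seed3407",
)


def read_token(env=os.environ):
    """Return HF_TOKEN from the given mapping, or None if unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def check_access(repo_ids=REPOS, token=None):
    """Hypothetical helper: map each repo id to a short access status."""
    # Imported lazily so the helper above works without network libraries.
    from huggingface_hub import HfApi
    from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

    api = HfApi()
    status = {}
    for repo_id in repo_ids:
        try:
            api.auth_check(repo_id, token=token)
            status[repo_id] = "ok"
        except GatedRepoError:
            # Must be caught first: it subclasses RepositoryNotFoundError.
            status[repo_id] = "gated: request access on the Hub"
        except RepositoryNotFoundError:
            status[repo_id] = "not found, or token lacks access"
    return status
```

Calling `check_access(token=read_token())` needs network access and returns one status string per repository.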

## First-run workflow

1. Check environment:

```bash
talk-tag doctor
```

2. Pull/warm model assets:

```bash
talk-tag model pull --device auto
```

3. Run annotation:

```bash
talk-tag annotate \
  --input-dir ./input \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --device auto
```

Single-file `.cha` example:

```bash
talk-tag annotate \
  --input-path ./input/sample.cha \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --device auto
```
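
To make the effect of `--target-speaker "*CHI"` concrete, here is a small sketch of speaker scoping on CHAT text. It is not the package's parser. It assumes only the basic CHAT layout: main tiers start with `*SPEAKER:`, headers with `@`, dependent tiers with `%`, and continuation lines with a tab. The function name and the sample transcript are made up for illustration.

```python
def speaker_utterances(chat_text, target="*CHI"):
    """Collect main-tier utterances for one speaker, joining continuations."""
    utterances = []
    current = None  # speaker of the main tier currently being read
    for line in chat_text.splitlines():
        if line.startswith("*"):
            speaker, _, text = line.partition(":")
            current = speaker
            if speaker == target:
                utterances.append(text.strip())
        elif line.startswith("\t") and current == target:
            # Tab-indented line continues the target speaker's main tier.
            utterances[-1] += " " + line.strip()
        else:
            # Header, dependent tier, or another speaker's continuation.
            current = None
    return utterances


sample = "\n".join([
    "@Begin",
    "*MOT:\twhat is that ?",
    "*CHI:\the goed to",
    "\tthe store .",
    "%com:\tillustrative dependent tier",
    "*CHI:\ttwo foots .",
    "@End",
])

print(speaker_utterances(sample))
# ['he goed to the store .', 'two foots .']
```

Only the `*CHI` lines are selected; the `*MOT` utterance and the dependent tier are left out of the result.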

## Inference defaults

- `batch_size = 4`
- `max_new_tokens = 128`
- `max_seq_length = 512`
- `max_context_chars = 1200`
- `limit = 0`
- greedy decoding (`do_sample = false`)
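
For reference, the defaults above can be mirrored in a small settings object. This is a sketch, not the package's configuration class: `InferenceDefaults` and `batched` are hypothetical names. The README does not say what `limit = 0` means; reading it as "no limit" is an assumption. `max_new_tokens` and `do_sample` are standard `transformers` generation arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceDefaults:
    batch_size: int = 4
    max_new_tokens: int = 128
    max_seq_length: int = 512
    max_context_chars: int = 1200
    limit: int = 0  # assumption: 0 means "process everything"
    do_sample: bool = False  # greedy decoding

    def generate_kwargs(self):
        """Keyword arguments one might pass to `model.generate`."""
        return {
            "max_new_tokens": self.max_new_tokens,
            "do_sample": self.do_sample,
        }


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


defaults = InferenceDefaults()
print(list(batched(list(range(10)), defaults.batch_size)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```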

## Supported runtime inputs

- `.cha`
- `.jsonl` (requires `--speaker-field` and `--text-field`)

The `annotate` command accepts either:

- `--input-dir` for folder annotation
- `--input-path` for a single `.cha` or `.jsonl` file

Other previously supported formats (`.txt`, `.csv`, `.json`, `.xlsx`) are rejected in adapter-only deployment mode.
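
The role of `--speaker-field` and `--text-field` for `.jsonl` input can be sketched as below. This is an illustration under assumptions, not the package's reader: the field names `speaker` and `text`, the record layout, and the function name are hypothetical, and the expected speaker value format (for example `*CHI` versus `CHI`) is not specified in this README. The sketch uses the standard-library `json` module to stay self-contained.

```python
import json


def iter_target_texts(lines, speaker_field, text_field, target="*CHI"):
    """Yield the text of each JSONL record whose speaker matches `target`."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if speaker_field not in record or text_field not in record:
            raise KeyError(
                f"line {number}: missing {speaker_field!r} or {text_field!r}"
            )
        if record[speaker_field] == target:
            yield record[text_field]


rows = [
    '{"speaker": "*MOT", "text": "what is that ?"}',
    '{"speaker": "*CHI", "text": "he goed to the store ."}',
]

print(list(iter_target_texts(rows, "speaker", "text")))
# ['he goed to the store .']
```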

## Colab quickstart

See [`examples/colab_quickstart.ipynb`](examples/colab_quickstart.ipynb) for a minimal setup flow.
