Metadata-Version: 2.4
Name: talk-tag
Version: 0.2.0
Summary: Transcript annotator for speaker-scoped CHAT corpus correction
License-Expression: MIT
Project-URL: Homepage, https://github.com/OliverHennhoefer/talk-tag
Project-URL: Repository, https://github.com/OliverHennhoefer/talk-tag
Project-URL: Issues, https://github.com/OliverHennhoefer/talk-tag/issues
Keywords: chat,transcript,annotation,linguistics,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: huggingface_hub>=0.29.3
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: orjson>=3.10.15
Requires-Dist: pandas>=2.2.3
Requires-Dist: tqdm>=4.67.1
Provides-Extra: runtime
Requires-Dist: numpy>=2.0.0; extra == "runtime"
Requires-Dist: peft>=0.12.0; extra == "runtime"
Requires-Dist: torch>=2.7.0; extra == "runtime"
Requires-Dist: transformers>=4.52.0; extra == "runtime"
Provides-Extra: dev
Requires-Dist: pytest>=8.3.5; extra == "dev"
Dynamic: license-file

# talk-tag

Adapter-only TalkBank CHAT morphosyntactic error annotator for `.cha` and `.jsonl` inputs.

The runtime deployment path is fixed to:

1. Base model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
2. Adapter: `mash-mash/Llama_TalkTag_CHAT_error_annotator_adapter`

No merged-model runtime path is used.
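For orientation, adapter-only loading with `transformers` and `peft` usually looks like the sketch below. This is not necessarily how talk-tag loads the model internally: the helper name `load_adapter_model` is hypothetical, and the sketch assumes the `runtime` extra is installed plus `accelerate` and `bitsandbytes` for the pre-quantized base.

```python
BASE_MODEL_ID = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
ADAPTER_ID = "mash-mash/Llama_TalkTag_CHAT_error_annotator_adapter"


def load_adapter_model(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the PEFT adapter without merging (hypothetical helper)."""
    # Imported lazily so this module can be imported without the runtime extra.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    # device_map="auto" is an assumption; it requires the accelerate package.
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    # The adapter stays a separate set of weights on top of the base model;
    # nothing is merged into the base weights here.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `load_adapter_model()` downloads both repositories on first use, so it needs the Hub access described in the next section.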

## Install

Python requirement: `>=3.10`.

```bash
pip install "talk-tag[runtime]"
```

The `runtime` extra installs `numpy`, `peft`, `torch`, and `transformers`.

## Hugging Face access

You need Hub access to both repositories above. Set a token before first run:

```bash
export HF_TOKEN=...
```

If the token is missing or lacks access to either repository, `talk-tag doctor` and
`talk-tag model pull` report authentication or gated-repo errors.
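If you prefer to check access from Python before running the CLI, the following is a minimal sketch using `huggingface_hub`. The helper names `read_token` and `check_repo_access` are hypothetical and are not part of talk-tag; the labels they return are illustrative only.

```python
import os

REPO_IDS = (
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "mash-mash/Llama_TalkTag_CHAT_error_annotator_adapter",
)


def read_token(env=os.environ):
    """Return the HF_TOKEN value, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def check_repo_access(repo_ids=REPO_IDS, token=None):
    """Map each repo id to "ok" or a short error label (hypothetical helper)."""
    # Lazy import: huggingface_hub is only needed when the check actually runs.
    from huggingface_hub import HfApi
    from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

    api = HfApi()
    results = {}
    for repo_id in repo_ids:
        try:
            api.auth_check(repo_id, token=token)
            results[repo_id] = "ok"
        except GatedRepoError:
            # GatedRepoError subclasses RepositoryNotFoundError, so catch it first.
            results[repo_id] = "gated: request access on the Hub"
        except RepositoryNotFoundError:
            results[repo_id] = "not found, or the token cannot see it"
    return results
```

A typical call is `check_repo_access(token=read_token())`, which needs network access to the Hub.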

## First-run workflow

1. Check environment:

```bash
talk-tag doctor
```

2. Pull/warm model assets:

```bash
talk-tag model pull --device auto
```

3. Run annotation:

```bash
talk-tag annotate \
  --input-dir ./input \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --device auto
```
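The three steps above can also be driven from a script. The sketch below uses only the commands and flags shown in this section; the function names are hypothetical, and it assumes each `talk-tag` command exits with a non-zero status on failure.

```python
import subprocess


def build_annotate_command(input_dir, output_dir, target_speaker="*CHI", device="auto"):
    """Build the argument list for `talk-tag annotate`."""
    return [
        "talk-tag", "annotate",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        "--target-speaker", target_speaker,
        "--device", device,
    ]


def run_first_run_workflow(input_dir, output_dir):
    """Run doctor, model pull, and annotate in order, stopping at the first failure."""
    steps = [
        ["talk-tag", "doctor"],
        ["talk-tag", "model", "pull", "--device", "auto"],
        build_annotate_command(input_dir, output_dir),
    ]
    for cmd in steps:
        # check=True raises CalledProcessError when a step exits non-zero.
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids the shell entirely, so `*CHI` reaches the CLI literally without any quoting.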

## Inference defaults

- `batch_size = 4`
- `max_new_tokens = 128`
- `max_seq_length = 512`
- `max_context_chars = 1200`
- `limit = 0`
- greedy decoding (`do_sample = false`)
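To make the defaults concrete, here is a sketch that collects them in one place and shows what `batch_size` and greedy decoding would mean for a `transformers`-style generation call. The class and helper names are hypothetical, and how talk-tag applies `max_seq_length`, `max_context_chars`, and `limit` internally is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceDefaults:
    """Values copied from the list above; the class itself is illustrative."""
    batch_size: int = 4
    max_new_tokens: int = 128
    max_seq_length: int = 512
    max_context_chars: int = 1200
    limit: int = 0
    do_sample: bool = False  # greedy decoding


def generation_kwargs(cfg):
    """Keyword arguments in the form accepted by transformers' generate()."""
    return {"max_new_tokens": cfg.max_new_tokens, "do_sample": cfg.do_sample}


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With the default `batch_size`, ten utterances would be processed as batches of 4, 4, and 2.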

## Supported runtime inputs

- `.cha`
- `.jsonl` (requires `--speaker-field` and `--text-field`)
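As an illustration of the `.jsonl` case, the sketch below writes one JSON object per line. The field names `speaker` and `text` are assumptions chosen for this example; whatever names your data uses are what you would pass to `--speaker-field` and `--text-field`. Whether speaker values should carry the leading `*` is also an assumption, mirroring the `--target-speaker "*CHI"` example.

```python
import json
from pathlib import Path

# Hypothetical records; the field names are chosen for illustration only.
EXAMPLE_RECORDS = [
    {"speaker": "*CHI", "text": "him goed to the store ."},
    {"speaker": "*MOT", "text": "he went to the store ?"},
]


def write_jsonl(path, records):
    """Write one JSON object per line, UTF-8 encoded."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A file written with `write_jsonl("input/sample.jsonl", EXAMPLE_RECORDS)` would then be annotated with `--speaker-field speaker --text-field text` added to the `talk-tag annotate` command.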

Other previously supported formats (`.txt`, `.csv`, `.json`, `.xlsx`) are rejected in adapter-only deployment mode.
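Before a run, it can help to see which files in an input directory fall outside the two supported formats. The following is a sketch with a hypothetical helper; it scans recursively and compares extensions case-insensitively, both of which are assumptions and may differ from how talk-tag itself discovers files.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".cha", ".jsonl"}


def split_by_support(input_dir):
    """Return (supported, rejected) lists of files found under input_dir."""
    supported, rejected = [], []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path)
        else:
            rejected.append(path)
    return supported, rejected
```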

## Colab quickstart

See [`examples/colab_quickstart.ipynb`](examples/colab_quickstart.ipynb) for a minimal setup flow.
