Metadata-Version: 2.1
Name: costa-utils
Version: 0.1.1
Summary: 
Author: Costa Huang
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: datasets (>=2.20.0,<3.0.0)
Requires-Dist: rich (>=13.7.1,<14.0.0)
Requires-Dist: tyro (>=0.8.5,<0.9.0)
Description-Content-Type: text/markdown

# costa-utils

This repo contains some personal utilities to do quick things. Currently we have utils to help visualize Hugging Face's preference and SFT datasets.


# Get started


Visualizing a HF SFT dataset:

```bash
# visualizing https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture
python -m costa_utils.hf_viz \
    --sft allenai/tulu-v2-sft-mixture \
    --split train \
    --sft_messages_column_name messages
python -m costa_utils.hf_viz \
    --sft AI-MO/NuminaMath-TIR \
    --split train \
    --sft_messages_column_name messages
```

![](static/sft.png)

which is a bit easier to read than

![](static/sft_hf.png)


Visualizing a HF preference dataset:

```bash
# visualizing https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
python -m costa_utils.hf_viz \
    --preference HuggingFaceH4/ultrafeedback_binarized \
    --split train_prefs \
    --preference_chosen_column_name chosen \
    --preference_rejected_column_name rejected
```

![](static/pref.png)

which is a bit easier to read than

![](static/pref_hf.png)



## dev note

It's simple to debug. Just replace `python -m costa_utils.hf_viz` with `python costa_utils/hf_viz.py`

```bash
python -m costa_utils.hf_viz \
    --preference HuggingFaceH4/ultrafeedback_binarized \
    --split train_prefs \
    --preference_chosen_column_name chosen \
    --preference_rejected_column_name rejected
```
