Metadata-Version: 2.4
Name: tabstar
Version: 0.1.0
Summary: TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
Author-email: Alan Arazi <alanarazi7@gmail.com>
License: MIT
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: pandas>=2.2.2
Requires-Dist: peft
Requires-Dist: scikit-learn
Requires-Dist: skrub
Requires-Dist: torch>=2.6.0
Requires-Dist: tqdm
Requires-Dist: transformers>=4.49.0

<img src="src/tabstar/resources/tabstar_logo.png" alt="TabSTAR Logo" width="50%">

**Welcome to the TabSTAR repository! 👋**   
You can use it in two modes: production mode, for fitting TabSTAR on your own dataset, and research mode, for pretraining TabSTAR and replicating the experiments in our paper.

🚧 The repository is under construction: found a bug or have a feature request? Please open an issue! 🚧

---

### 📚 Resources

* **Paper**: [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations](https://arxiv.org/abs/2505.18125)
* **Project Website**: [TabSTAR](https://eilamshapira.com/TabSTAR/)

<img src="src/tabstar/resources/tabstar_arch.png" alt="TabSTAR Architecture" width="200%">

---

## Production Mode

Use this mode if you want to fit a pretrained TabSTAR model to your own dataset.  
(Note: reloading a fitted model for later use is not supported yet, but it is coming soon! 🔜)

### Installation

```bash
source init.sh
```

### Inference Example

TabSTAR uses the sklearn API, and it is as simple as this:

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

x = pd.read_csv("src/tabstar/resources/imdb.csv")
y = x.pop('Genre_is_Drama')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
y_pred = tabstar.predict(x_test)
print(classification_report(y_test, y_pred))
```
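
The split in the example above is random, so the reported metrics change from run to run. Here is a minimal sketch, assuming you want a reproducible, class-balanced holdout: pass `random_state` and `stratify` to scikit-learn's `train_test_split`. The toy DataFrame and its column names are made up for illustration and are not part of TabSTAR; only the `train_test_split` arguments matter here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for your own table (hypothetical columns).
x = pd.DataFrame({
    "title": [f"movie_{i}" for i in range(20)],
    "rating": [5.0 + 0.1 * i for i in range(20)],
})
y = pd.Series([i % 2 for i in range(20)], name="is_drama")

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print(y_test.value_counts().to_dict())
```

The resulting `x_train` and `y_train` can be passed to `fit` exactly as in the example above.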

Below is a template you can use to quickly get started with TabSTAR in production mode.

```python
from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier, TabSTARRegressor

# --- USER-PROVIDED INPUTS ---
x_train = None  # TODO: load your feature DataFrame here
y_train = None  # TODO: load your target Series here
is_cls = None   # TODO: True for classification, False for regression
x_test = None   # TODO Optional: load your test feature DataFrame (or leave as None)
y_test = None   # TODO Optional: load your test target Series (or leave as None)
# -----------------------------

# Sanity checks
assert isinstance(x_train, DataFrame), "x_train should be a pandas DataFrame"
assert isinstance(y_train, Series), "y_train should be a pandas Series"
assert isinstance(is_cls, bool), "is_cls should be a boolean indicating classification or regression"

if x_test is None:
    assert y_test is None, "If x_test is None, y_test must also be None"
    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.1)

assert isinstance(x_test, DataFrame), "x_test should be a pandas DataFrame"
assert isinstance(y_test, Series), "y_test should be a pandas Series"

tabstar_cls = TabSTARClassifier if is_cls else TabSTARRegressor
tabstar = tabstar_cls()
tabstar.fit(x_train, y_train)
y_pred = tabstar.predict(x_test)
```
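
The template stops once `y_pred` is computed. As a rough sketch of a possible next step, the helper below scores predictions with one headline metric per task type. `evaluate_predictions` is a hypothetical helper, not part of TabSTAR, and the choice of accuracy and R² is an assumption; these are not necessarily the metrics reported in the paper. The demo values stand in for real predictions so the snippet runs without a trained model.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, r2_score


def evaluate_predictions(y_true, y_pred, is_cls: bool) -> dict:
    """Return one headline metric for the task type (hypothetical helper)."""
    if is_cls:
        return {"accuracy": float(accuracy_score(y_true, y_pred))}
    return {"r2": float(r2_score(y_true, y_pred))}


# Stand-in values so the sketch runs without a trained model.
y_test_demo = pd.Series([1, 0, 1, 1])
y_pred_demo = [1, 0, 0, 1]
print(evaluate_predictions(y_test_demo, y_pred_demo, is_cls=True))
```

With the template's variables, the call would be `evaluate_predictions(y_test, y_pred, is_cls)`.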

---

## Research Mode

Use this mode when you want to pretrain TabSTAR, finetune it, or run baselines against it. It assumes you are actively working on model development, experimenting with different datasets, or comparing against other methods.

### Prerequisites

After cloning the repo, run:

```bash
source init.sh
```

This will install all necessary dependencies, set up your environment, and download any example data needed to get started.

### Pretraining

To pretrain TabSTAR on a specified number of datasets:

```bash
python do_pretrain.py --n_datasets=256
```

`--n_datasets` determines how many datasets to use for pretraining. You can reduce this number for quick debugging, but note this will harm downstream performance.

### Finetuning

Once pretraining finishes, note the printed `<PRETRAINED_EXP>` identifier. Then run:

```bash
python do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655
```

`--dataset_id` is the ID of the downstream dataset you want to evaluate on. Only the 400 datasets used in the paper are supported.

### Baseline Comparison

If you want to compare TabSTAR against a classic baseline (e.g., random forest):

```bash
python do_baseline.py --model=rf --dataset_id=46655
```

You can also try the other model names supported by `do_baseline.py` (check the script for details).
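
To repeat the finetuning and baseline runs over several datasets, a small driver script can assemble the commands. This is a sketch, not part of the repository: `build_commands` is a hypothetical helper, and it relies only on the flags shown above (`--pretrain_exp`, `--dataset_id`, `--model`). It prints the commands as a dry run instead of executing them.

```python
import shlex


def build_commands(dataset_ids, pretrain_exp, baseline_model="rf"):
    """Build one finetuning and one baseline command per dataset ID (hypothetical helper)."""
    commands = []
    for dataset_id in dataset_ids:
        commands.append([
            "python", "do_finetune.py",
            f"--pretrain_exp={pretrain_exp}", f"--dataset_id={dataset_id}",
        ])
        commands.append([
            "python", "do_baseline.py",
            f"--model={baseline_model}", f"--dataset_id={dataset_id}",
        ])
    return commands


# Dry run: print the commands instead of executing them.
# "<PRETRAINED_EXP>" is a placeholder for the identifier printed by pretraining.
for cmd in build_commands([46655], pretrain_exp="<PRETRAINED_EXP>"):
    print(shlex.join(cmd))
```

To actually launch the runs, each `cmd` could be passed to `subprocess.run(cmd, check=True)` from the repository root. Extend the ID list only with dataset IDs that are among those supported by the paper.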

### License

This work is licensed under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

### Citation

If you use TabSTAR in your research, please cite:

```bibtex
@article{arazi2025tabstarf,
  title   = {TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations},
  author  = {Alan Arazi and Eilam Shapira and Roi Reichart},
  journal = {arXiv preprint arXiv:2505.18125},
  year    = {2025},
}
```
