Metadata-Version: 2.4
Name: tabicl
Version: 2.0.3
Summary: TabICL: A state-of-the-art tabular foundation model
Author: Jingang Qu, David Holzmüller, Marine Le Morvan, Gaël Varoquaux
License: BSD 3-Clause License
        
        Copyright (c) 2025, Soda team @ Inria
        
        Redistribution and use in source and binary forms, with or without
        modification, are permitted provided that the following conditions are met:
        
        1. Redistributions of source code must retain the above copyright notice, this
           list of conditions and the following disclaimer.
        
        2. Redistributions in binary form must reproduce the above copyright notice,
           this list of conditions and the following disclaimer in the documentation
           and/or other materials provided with the distribution.
        
        3. Neither the name of the copyright holder nor the names of its
           contributors may be used to endorse or promote products derived from
           this software without specific prior written permission.
        
        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
        
        Code in the src/tabicl/forecast directory is currently derived work from
        TabPFN-TS https://github.com/PriorLabs/tabpfn-time-series
        
        As such it is under the following license:
        
          Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright 2025 Prior Labs GmbH
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
License-File: LICENSE
Keywords: TabICL,foundation model,in-context learning,tabular data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Requires-Dist: einops>=0.7
Requires-Dist: huggingface-hub
Requires-Dist: numpy
Requires-Dist: psutil
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy
Requires-Dist: torch>=2.2
Requires-Dist: tqdm>=4.64.0
Provides-Extra: all
Requires-Dist: gluonts>=0.16.0; extra == 'all'
Requires-Dist: matplotlib>=3.10.3; extra == 'all'
Requires-Dist: pandas>=2.1.2; extra == 'all'
Requires-Dist: statsmodels>=0.14.5; extra == 'all'
Requires-Dist: transformers; extra == 'all'
Requires-Dist: wandb; extra == 'all'
Requires-Dist: xgboost; extra == 'all'
Provides-Extra: forecast
Requires-Dist: gluonts>=0.16.0; extra == 'forecast'
Requires-Dist: matplotlib>=3.10.3; extra == 'forecast'
Requires-Dist: pandas>=2.1.2; extra == 'forecast'
Requires-Dist: statsmodels>=0.14.5; extra == 'forecast'
Provides-Extra: pretrain
Requires-Dist: transformers; extra == 'pretrain'
Requires-Dist: wandb; extra == 'pretrain'
Requires-Dist: xgboost; extra == 'pretrain'
Description-Content-Type: text/markdown

[![test](https://github.com/soda-inria/tabicl/actions/workflows/testing.yml/badge.svg)](https://github.com/soda-inria/tabicl/actions/workflows/testing.yml)
[![PyPI version](https://badge.fury.io/py/tabicl.svg)](https://badge.fury.io/py/tabicl)
[![Downloads](https://img.shields.io/pypi/dm/tabicl)](https://pypistats.org/packages/tabicl)

# TabICLv2: A state-of-the-art tabular foundation model

This repository is the official implementation of **TabICLv2** ([arXiv](https://arxiv.org/abs/2602.11139)) 
and **TabICL** ([ICML 2025](https://arxiv.org/abs/2502.05564)).

**State-of-the-art accuracy even without hyperparameter tuning:** 
TabICLv2 is the new state-of-the-art model for tabular classification and regression 
on the [TabArena](https://tabarena.ai) and [TALENT](https://arxiv.org/abs/2407.00956) benchmarks. 
It does not require hyperparameter tuning 
and still outperforms heavily tuned XGBoost, CatBoost, or LightGBM on ~80% of the TabArena datasets.

**Easy to use:** TabICL is pip-installable and scikit-learn compliant. 
It is also **open source** (including [pre-training](#pre-training) for v1), 
with a permissive license.

**Speed:** TabICL performs `fit` and `predict` jointly via a single 
forward pass through a pre-trained transformer model. 
For larger datasets, we recommend a GPU.
On an H100 GPU, TabICLv2 can `fit` and `predict` a dataset 
with 50,000 samples and 100 features in under 10 seconds, 
which is 10x faster than TabPFN-2.5.
Through KV caching, TabICL supports faster repeated inference on the same training data.

**Scalability:** TabICL shows excellent performance on benchmarks 
with 300 to 100,000 training samples and up to 2,000 features. 
It can scale to even larger datasets (e.g., 500K samples) through CPU and disk offloading, 
though its accuracy may degrade at some point.

<img src="./figures/pareto_front_improvability_tabarena.png" width="70%" alt="Model comparison on TabArena" style="display: block; margin: auto;">

## Installation

```bash
pip install tabicl
```

Optional dependencies can be installed as needed:
```bash
pip install tabicl[forecast]   # time series forecasting
pip install tabicl[pretrain]   # pre-training
pip install tabicl[all]        # everything
```

## Basic usage

```python
from tabicl import TabICLClassifier, TabICLRegressor

clf = TabICLClassifier()
clf.fit(X_train, y_train)  # downloads checkpoint on first use, otherwise cheap
clf.predict(X_test)  # in-context learning happens here

reg = TabICLRegressor()
reg.fit(X_train, y_train)
reg.predict(X_test)
```

To speed up repeated inference on the same training data, enable KV caching. The cache is built during `fit` and reused across `predict` calls. Note that this consumes additional memory to store the cached projections, so consider the trade-off for your use case:

```python
clf = TabICLClassifier(kv_cache=True)
clf.fit(X_train, y_train)  # caches key-value projections for training data
clf.predict(X_test)  # fast: only processes test data by reusing the cached context
```
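
To build intuition for why the cache helps, here is a toy single-head attention sketch in plain Python. It is not TabICL's implementation: `ToyCachedAttention`, `project`, and the identity projection matrices are hypothetical names and values chosen only for illustration. The point is that the key/value projections of the training rows are computed once in `fit`, while `predict` only projects the test rows to queries.

```python
import math


def project(rows, weight_columns):
    """Multiply each row vector by a weight matrix given as a list of columns."""
    return [[sum(x * w for x, w in zip(row, col)) for col in weight_columns] for row in rows]


class ToyCachedAttention:
    """Toy attention layer that caches training keys/values (illustrative only)."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys = self.values = None
        self.n_projected = 0  # number of rows projected to keys/values so far

    def fit(self, train_rows):
        # Keys and values of the training rows are computed once and stored.
        self.keys = project(train_rows, self.w_k)
        self.values = project(train_rows, self.w_v)
        self.n_projected += len(train_rows)
        return self

    def predict(self, test_rows):
        # Only the queries of the test rows are computed; keys/values are reused.
        outputs = []
        for query in project(test_rows, self.w_q):
            scale = math.sqrt(len(query))
            scores = [sum(a * b for a, b in zip(query, key)) / scale for key in self.keys]
            top = max(scores)
            attn = [math.exp(s - top) for s in scores]
            total = sum(attn)
            n_out = len(self.values[0])
            outputs.append(
                [sum(a * v[j] for a, v in zip(attn, self.values)) / total for j in range(n_out)]
            )
        return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
layer = ToyCachedAttention(identity, identity, identity).fit([[1.0, 0.0], [0.0, 1.0]])
first = layer.predict([[5.0, 0.0]])   # attends mostly to the first training row
second = layer.predict([[0.0, 5.0]])  # reuses the same cached keys/values
```

After two `predict` calls, `layer.n_projected` is still 2: the training rows were projected only once. The stored `keys` and `values` are the extra memory the cache costs.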

Save and load a fitted classifier or regressor:

```python
clf.save(
    "classifier.pkl",
    save_model_weights=False,  # if False, reload from checkpoint on load
    save_training_data=True,   # if True, include training data; if False, discard it (requires KV cache)
    save_kv_cache=True,        # if True and KV cache exists, save it
)
clf = TabICLClassifier.load("classifier.pkl")
```

When a KV cache exists and is saved, you can set `save_training_data=False` to leave
the training data out of the saved file, which may be useful for data privacy.

## Advanced configuration

TabICL offers a set of parameters to customize its behavior. The following example shows all available parameters with their default values and brief descriptions:

```python
from tabicl import TabICLClassifier

clf = TabICLClassifier(
    n_estimators=8,  # number of ensemble members, more = better but slower
    norm_methods=None,  # normalization methods to try
    feat_shuffle_method="latin",  # feature permutation strategy
    class_shuffle_method="shift",  # class permutation strategy
    outlier_threshold=4.0,  # z-score threshold for outlier detection and clipping
    softmax_temperature=0.9,  # temperature to control prediction confidence
    average_logits=True,  # average logits (True) or probabilities (False)
    support_many_classes=True,  # handle >10 classes automatically
    batch_size=8,  # ensemble members processed together, lower to save memory
    kv_cache=False,  # cache training data KV projections for faster repeated inference
    model_path=None,  # path to checkpoint, None downloads from Hugging Face
    allow_auto_download=True,  # auto-download checkpoint if not found locally
    checkpoint_version="tabicl-classifier-v2-20260212.ckpt",  # pretrained checkpoint version
    device=None,  # inference device, None auto-selects CUDA or CPU; specify "mps" for Apple Silicon
    use_amp="auto",  # automatic mixed precision for faster inference
    use_fa3="auto",  # Flash Attention 3 for Hopper GPUs (e.g. H100)
    offload_mode="auto",  # automatically decide when to use cpu/disk offloading
    disk_offload_dir=None,  # directory for disk offloading
    random_state=42,  # random seed for reproducibility
    n_jobs=None,  # number of PyTorch threads for CPU inference
    verbose=False,  # print detailed information during inference
    inference_config=None,  # fine-grained inference control for advanced users
)
```

`TabICLRegressor` accepts the same parameters except for the classification-specific ones:
`class_shuffle_method`, `softmax_temperature`, `average_logits`, and `support_many_classes`.
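
As a rough illustration of what `softmax_temperature` and `average_logits` control, the sketch below combines the outputs of several ensemble members in plain Python. The helper names `softmax` and `combine` are hypothetical, and dividing logits by the temperature before the softmax is an assumption about the usual convention, not a description of TabICL's internals.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives sharper probabilities."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine(member_logits, temperature=0.9, average_logits=True):
    """Combine per-member logits into one probability vector (illustrative only)."""
    n_members = len(member_logits)
    n_classes = len(member_logits[0])
    if average_logits:
        # Average in logit space, then apply the softmax once.
        mean_logits = [sum(m[c] for m in member_logits) / n_members for c in range(n_classes)]
        return softmax(mean_logits, temperature)
    # Apply the softmax per member, then average the probabilities.
    probs = [softmax(m, temperature) for m in member_logits]
    return [sum(p[c] for p in probs) / n_members for c in range(n_classes)]


members = [[2.0, 0.0], [0.0, 1.0]]  # logits from two hypothetical ensemble members
from_logits = combine(members, average_logits=True)
from_probs = combine(members, average_logits=False)
sharp = softmax([2.0, 0.0], temperature=0.5)
soft = softmax([2.0, 0.0], temperature=2.0)
```

Both strategies return a valid probability vector, but they generally disagree when ensemble members disagree, so the choice can matter for calibration.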

## Available models

| Model | Classification checkpoint | Regression checkpoint |
|-------|--------------------------|----------------------|
| **TabICLv2** ([arXiv](https://arxiv.org/abs/2602.11139)) | `tabicl-classifier-v2-20260212.ckpt` (default) | `tabicl-regressor-v2-20260212.ckpt` (default) |
| **TabICLv1.1** (May 2025, no paper) | `tabicl-classifier-v1.1-20250506.ckpt` | — |
| **TabICLv1** ([ICML 2025](https://arxiv.org/abs/2502.05564)) | `tabicl-classifier-v1-20250208.ckpt` | — |

- **TabICLv2**: Our state-of-the-art model, supporting both classification and regression.
  Strongly improved accuracy over v1 through better synthetic pre-training data,
  architectural improvements, and better pre-training, with comparable runtime.
- **TabICLv1.1**: TabICLv1 post-trained on an early version of the v2 prior. Classification only.
- **TabICLv1**: Original model. Classification only.
  TabICLv1 and v1.1 originally used `n_estimators=32`; we reduced the default to 8 afterwards.

## Time series forecasting

TabICL can be used for zero-shot time series forecasting via `TabICLForecaster`.
Install the forecast dependencies first:

```bash
pip install tabicl[forecast]
```

`TabICLForecaster` accepts the following parameters:

```python
from tabicl import TabICLForecaster

forecaster = TabICLForecaster(
    max_context_length=4096,  # max historical timesteps to use as context
    temporal_features=None,  # timestep index, calendar patterns, and seasonality
    point_estimate="mean",  # point prediction method: "mean" or "median"
    tabicl_config=None,  # passed to TabICLRegressor; None uses default settings
)
```

The following example shows how it works for univariate forecasting:

```python
import pandas as pd
from tabicl import TabICLForecaster
from tabicl.forecast import TimeSeriesDataFrame, plot_forecast

df = pd.read_csv(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/test.csv",
    parse_dates=["timestamp"],
)
data = TimeSeriesDataFrame.from_data_frame(df)

prediction_length = 96
selected_items = data.item_ids[:2]
train_data, test_data = data.train_test_split(prediction_length)

context_df = train_data.reset_index()
context_df = context_df[context_df["item_id"].isin(selected_items)]
test_df = test_data.reset_index()
test_df = test_df[test_df["item_id"].isin(selected_items)]
test_df = test_df.groupby("item_id").tail(prediction_length)

forecaster = TabICLForecaster(max_context_length=10240)
pred_df = forecaster.predict_df(context_df, prediction_length=prediction_length)
fig, axes = plot_forecast(context_df=context_df, pred_df=pred_df, test_df=test_df)
```

<img src="./figures/tabiclv2_time_series.png" width="60%" alt="TabICLv2 time series forecasting example" style="display: block; margin: auto;">

`TabICLForecaster` is heavily inspired by [TabPFN-TS](https://arxiv.org/abs/2501.02945v3). We may later improve it to enhance the ability of TabICL for time series forecasting.
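
The general idea of recasting forecasting as tabular regression can be sketched with the standard library alone. The feature set below (a running timestep index, day of week, and a sine/cosine encoding of the hour) and the helper `temporal_features` are assumptions for illustration; the features `TabICLForecaster` actually derives may differ.

```python
import math
from datetime import datetime, timedelta


def temporal_features(timestamps):
    """Turn timestamps into rows of tabular features (hypothetical feature set)."""
    rows = []
    for index, ts in enumerate(timestamps):
        hour_angle = 2 * math.pi * ts.hour / 24
        rows.append(
            {
                "timestep_index": index,           # position in the series
                "day_of_week": ts.weekday(),       # calendar pattern
                "hour_sin": math.sin(hour_angle),  # daily seasonality
                "hour_cos": math.cos(hour_angle),
            }
        )
    return rows


start = datetime(2024, 1, 1)
history = [start + timedelta(hours=h) for h in range(48)]
future = [history[-1] + timedelta(hours=h) for h in range(1, 25)]

# Context and horizon share one running index, so the regressor can extrapolate trends.
features = temporal_features(history + future)
X_context, X_future = features[: len(history)], features[len(history):]
```

A regressor would then be fitted on `X_context` with the observed values as targets and asked to predict the targets for `X_future`.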

## Pre-training

Pre-training code (including synthetic data generation) is currently available for the v1 model. 
The scripts folder provides the commands for [stage 1](./scripts/train_stage1.sh), [stage 2](./scripts/train_stage2.sh), 
and [stage 3](./scripts/train_stage3.sh) of curriculum learning.
Pre-training code for v2 will be released upon publication.

## Nanotabicl: a minimal architecture implementation

We provide a minimal implementation of the TabICLv2 architecture 
[here](https://github.com/soda-inria/nanotabicl), 
for educational and experimental purposes.

## TODO

- [ ] Documentation
- [ ] Multi-GPU parallel inference

## FAQ

**What is TabICL?**
TabICL is a tabular foundation model (like TabPFN). 
It uses in-context learning (ICL) to learn from new data 
in a single forward pass through a Transformer model: 
`y_pred = model(X_train, y_train, X_test)` (this is called inside `predict()`).
It has acquired strong learning capabilities through 
pre-training on millions of synthetic datasets.

**How fast is TabICL?** On datasets with $n$ training rows and $m$ columns, 
the runtime complexity of TabICL (v1 and v2) is $O(n^2 + nm^2)$. 
On datasets with many rows and columns, it can be 10x faster than TabPFN-2.5. 
On modern GPUs, TabICL can handle a million samples 
in a few minutes without RAM overflow
thanks to CPU and disk offloading.

<img src="./figures/runtime_tabpfnv25_tabiclv2.png" width="70%" alt="Runtimes for different hardware and sample sizes" style="display: block; margin: auto;">
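
As a back-of-the-envelope reading of the $O(n^2 + nm^2)$ complexity, the sketch below compares how the cost proxy grows when rows or columns are doubled. The function `relative_cost` is a hypothetical helper that ignores all constants and hardware effects, so only the ratios are meaningful, not the absolute values.

```python
def relative_cost(n_rows, n_cols):
    """Unitless cost proxy following O(n^2 + n * m^2); constants are ignored."""
    return n_rows**2 + n_rows * n_cols**2


base = relative_cost(10_000, 100)
double_rows = relative_cost(20_000, 100) / base  # 3.0
double_cols = relative_cost(10_000, 200) / base  # 2.5
```

At this size the two terms are equal, so doubling the rows triples the proxy while doubling the columns multiplies it by 2.5. For much larger `n_rows` the quadratic term in the number of rows dominates.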

**What dataset sizes work well?** 
TabICLv2 is pre-trained on datasets with 300 to 48K training samples.
However, it can generalize to larger datasets to some extent, 
and we see good results even on some datasets with 600K samples. 
We have not tested whether TabICL generalizes to datasets smaller than 300 samples.

<img src="./figures/tabiclv2_perf_vs_n_samples.png" width="70%" alt="Average rank vs. number of samples" style="display: block; margin: auto;">

**What about the number of columns?**
TabICLv2 is pre-trained on datasets with 2 to 100 columns. 
We see good generalization to more columns and don't know where the limit is.

<img src="./figures/tabiclv2_perf_vs_n_features.png" width="70%" alt="Average rank vs. number of features" style="display: block; margin: auto;">
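
If you want a quick reminder of when a dataset leaves the pre-training regime described above, a small check like the following can help. The helper `outside_pretraining_range` is hypothetical and not part of the `tabicl` package, and reading "48K" as 48,000 is an assumption. Falling outside these ranges does not mean TabICL will fail, only that it is extrapolating.

```python
PRETRAIN_SAMPLES = (300, 48_000)  # "48K" read as 48,000 (assumption)
PRETRAIN_COLUMNS = (2, 100)


def outside_pretraining_range(n_samples, n_columns):
    """Return notes for each dimension that lies outside the pre-training ranges."""
    notes = []
    if not PRETRAIN_SAMPLES[0] <= n_samples <= PRETRAIN_SAMPLES[1]:
        notes.append(f"n_samples={n_samples} is outside {PRETRAIN_SAMPLES}")
    if not PRETRAIN_COLUMNS[0] <= n_columns <= PRETRAIN_COLUMNS[1]:
        notes.append(f"n_columns={n_columns} is outside {PRETRAIN_COLUMNS}")
    return notes


in_range = outside_pretraining_range(1_000, 50)
extrapolating = outside_pretraining_range(100, 500)
```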

## Preprocessing

### Simple built-in preprocessing
If the input `X` to TabICL is a pandas DataFrame, TabICL will automatically:
- Detect and ordinal encode categorical columns (including string, object, category, and boolean types)
- Create a separate category for missing values in categorical features
- Perform mean imputation for missing numerical values (encoded as NaN)

If the input `X` is a numpy array, TabICL assumes that ordinal encoding and missing value imputation have already been performed.
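
When passing a numpy array, these steps are your responsibility. The plain-Python sketch below illustrates the logic of ordinal encoding with a separate category for missing values, and of mean imputation. The helpers `ordinal_encode` and `mean_impute` are hypothetical, and details such as the alphabetical category order and the code chosen for missing values are assumptions, not TabICL's exact behavior.

```python
import math


def ordinal_encode(values):
    """Map categories to integers; missing values (None) get their own code."""
    categories = sorted({v for v in values if v is not None})
    codes = {category: i for i, category in enumerate(categories)}
    missing_code = len(categories)  # separate category for missing values
    return [missing_code if v is None else codes[v] for v in values]


def mean_impute(values):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]


color = ordinal_encode(["red", None, "blue", "red"])  # [1, 2, 0, 1]
size = mean_impute([1.0, float("nan"), 3.0, 2.0])     # [1.0, 2.0, 3.0, 2.0]
```

In practice, the category mapping and the imputation mean should be learned on the training data and then reused unchanged for the test data.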

For both input types, TabICL applies additional preprocessing:
- Outlier detection and clipping
- Feature scaling and normalization
- Feature shuffling for ensemble diversity
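
To illustrate what a z-score threshold such as `outlier_threshold=4.0` means, here is a minimal sketch of clipping a single numerical column. The helper `clip_outliers` is hypothetical, and computing the mean and standard deviation in one pass over the raw column is a simplifying assumption; TabICL's own procedure may differ.

```python
import statistics


def clip_outliers(column, threshold=4.0):
    """Clip values whose z-score exceeds the threshold (illustrative only)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return list(column)  # constant column: nothing to clip
    lower, upper = mean - threshold * std, mean + threshold * std
    return [min(max(x, lower), upper) for x in column]


column = [0.0] * 99 + [1000.0]  # one extreme value among 100 entries
clipped = clip_outliers(column)
```

The extreme value is pulled back to `mean + 4 * std` instead of being dropped, so the number of rows stays the same.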

### Advanced data preprocessing with skrub <img src="https://skrub-data.github.io/stable/_static/skrub.svg" width="8%" alt="skrub logo" style="display: inline; margin-left: 5px; margin-right: 5px;">

Real-world datasets often contain complex heterogeneous data that benefits from more sophisticated preprocessing. For these scenarios, we recommend [skrub](https://skrub-data.org/stable/index.html), a powerful library designed specifically for advanced tabular data preparation.

**Why use skrub?**
- Handles diverse data types (numerical, categorical, text, datetime, etc.)
- Provides robust preprocessing for dirty data
- Offers sophisticated feature engineering capabilities
- Supports multi-table integration and joins

#### Installation

```bash
pip install skrub -U
```

#### Basic Integration

Use skrub's [TableVectorizer](https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html) to transform your raw data before passing it to TabICLClassifier:

```python
from skrub import TableVectorizer
from tabicl import TabICLClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TableVectorizer(low_cardinality="passthrough"),  # Automatically handles various data types
    TabICLClassifier()
)

pipeline.fit(X_train, y_train)  # X should be a DataFrame
predictions = pipeline.predict(X_test)
```

## Citation
If you use TabICL for research purposes, 
please cite our papers for **[TabICL](https://arxiv.org/abs/2502.05564)** and **[TabICLv2](https://arxiv.org/abs/2602.11139)**:
```bibtex
@inproceedings{qu2025tabicl,
  title={Tab{ICL}: {A} Tabular Foundation Model for In-Context Learning on Large Data},
  author={Qu, Jingang and Holzm{\"u}ller, David and Varoquaux, Ga{\"e}l and Le Morvan, Marine},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@article{qu2026tabiclv2,
  title={{TabICLv2}: {A} better, faster, scalable, and open tabular foundation model},
  author={Qu, Jingang and Holzm{\"u}ller, David and Varoquaux, Ga{\"e}l and Le Morvan, Marine},
  journal={arXiv preprint arXiv:2602.11139},
  year={2026}
}
```

## Contributors

- [Jingang Qu](https://github.com/jingangQu)
- [David Holzmüller](https://github.com/dholzmueller)
- [Marine Le Morvan](https://github.com/marineLM)

## Star history

[![Star History Chart](https://api.star-history.com/svg?repos=soda-inria/tabicl&type=date&legend=top-left)](https://www.star-history.com/#soda-inria/tabicl&type=date&legend=top-left)