Assembling
fuzzy_join(): join tables on non-normalized categories with approximate matching.
Joiner, AggJoiner: transformers for joining multiple tables together.
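To make "approximate matching" concrete, here is a minimal standard-library sketch of a fuzzy left join. It is not how skrub implements fuzzy_join(): the helper fuzzy_join_rows, its cutoff parameter, and the toy rows are all hypothetical, and skrub itself operates on dataframes rather than lists of dicts.

```python
from difflib import get_close_matches


def fuzzy_join_rows(left, right, left_key, right_key, cutoff=0.8):
    """Left-join two lists of dicts on approximately equal string keys.

    Hypothetical helper for illustration only, not part of skrub.
    """
    # Index the right rows by lowercased key so matching ignores case.
    right_by_key = {row[right_key].lower(): row for row in right}
    candidates = list(right_by_key)
    joined = []
    for row in left:
        # Keep the single best candidate whose similarity is >= cutoff.
        match = get_close_matches(
            row[left_key].lower(), candidates, n=1, cutoff=cutoff
        )
        extra = right_by_key[match[0]] if match else {}
        # Unmatched left rows are kept without the right-hand columns.
        joined.append({**row, **extra})
    return joined


left = [
    {"country": "Untied States", "gdp_rank": 1},
    {"country": "Germny", "gdp_rank": 3},
    {"country": "Atlantis", "gdp_rank": 0},
]
right = [
    {"name": "United States", "capital": "Washington"},
    {"name": "Germany", "capital": "Berlin"},
]

result = fuzzy_join_rows(left, right, "country", "name")
for row in result:
    print(row["country"], "->", row.get("capital"))
```

The two misspelled countries pick up a capital, while "Atlantis" has no sufficiently close match and keeps only its own columns. Raising or lowering cutoff trades missed matches against false ones.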
Encoding
TableVectorizer: turn a pandas dataframe into a numerical array for machine learning.
GapEncoder: like OneHotEncoder, but robust to typos and non-normalized categories.
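The sketch below illustrates why exact one-hot encoding is brittle on dirty categories, and why comparing character n-grams is more forgiving. It is only a motivation for typo-robust encoders, not GapEncoder's actual algorithm; the functions char_ngrams and ngram_similarity and the sample categories are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


categories = ["Police Officer", "Police Oficer", "Librarian"]

# Exact one-hot encoding: every distinct spelling gets its own column,
# so the typo "Police Oficer" shares nothing with "Police Officer".
vocabulary = sorted(set(categories))
one_hot = [[int(c == v) for v in vocabulary] for c in categories]

# N-gram similarity: the typo stays close to the intended category,
# while an unrelated category does not.
typo_sim = ngram_similarity("Police Officer", "Police Oficer")
other_sim = ngram_similarity("Police Officer", "Librarian")

print(one_hot)
print(round(typo_sim, 2), round(other_sim, 2))
```

A one-hot encoder sees the two spellings as unrelated categories, whereas an encoder built on substring information can place them near each other.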
Cleaning
deduplicate(): merge categories with similar morphology (spelling).
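As an illustration of merging similarly spelled categories, here is a greedy standard-library sketch. It is not skrub's deduplicate() algorithm: the function merge_similar_spellings, its threshold parameter, and the sample values are assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_similar_spellings(values, threshold=0.8):
    """Map each value to the most frequent spelling it closely resembles.

    Hypothetical helper for illustration only, not part of skrub.
    """
    counts = Counter(values)
    canonical = []  # representative spellings, most frequent first
    mapping = {}
    # Visit spellings from most to least frequent, so common spellings
    # become the representatives that rarer variants are merged into.
    for value, _ in counts.most_common():
        for rep in canonical:
            ratio = SequenceMatcher(None, value.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                mapping[value] = rep
                break
        else:
            # No existing representative is close enough: start a new group.
            canonical.append(value)
            mapping[value] = value
    return [mapping[v] for v in values]


raw = ["black", "black", "black", "blacck", "white", "white", "whitte", "blak"]
cleaned = merge_similar_spellings(raw)
print(cleaned)
# ['black', 'black', 'black', 'black', 'white', 'white', 'white', 'black']
```

The rare variants collapse onto the dominant spelling of their group. The greedy pass depends on the threshold and on frequency order, so it is a rough stand-in for a proper clustering of the spellings.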
Less data wrangling, more machine learning
tabular_learner: easily create tabular-learning pipelines that wrangle complex dataframes.
Given a complex dataframe df:
>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> df = dataset.X
>>> y = dataset.y
>>> df
>>> from sklearn.model_selection import cross_val_score
>>> from skrub import tabular_learner
>>> cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])