Metadata-Version: 2.4
Name: scikit-sqlearn
Version: 0.1.0
Summary: Convert scikit-learn models and pipelines into executable SQL.
Author: Sofeikov
License-Expression: Apache-2.0
Keywords: machine-learning,scikit-learn,sql,duckdb,inference
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.3.3
Requires-Dist: polars>=1.34.0
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: scikit-learn>=1.8.0
Dynamic: license-file

# sqlearn

**sqlearn** is a Python library that converts scikit-learn models and pipelines into native SQL queries. This lets you run machine learning inference directly inside your database (e.g., DuckDB or Postgres) without extracting data or managing a separate inference service.

## Project Purpose

The goal of this project is to implement a generic converter that translates trained sklearn objects into SQL statements. This approach enables:
- **In-database inference**: Run predictions where your data lives, with no round trip to a separate model server.
- **Simplified architecture**: Remove the need for pickle files and separate ML microservices.
- **Portability**: Generate SQL that can be executed on various SQL dialects (powered by `sqlglot`).

Currently, we support:
- Linear Models (`LinearRegression`, `Ridge`, `Lasso`, `ElasticNet`, `SGDRegressor`) into SQL arithmetic expressions or CASE statements.
- `StandardScaler` preprocessing.
- `OneHotEncoder` preprocessing (`handle_unknown="ignore"`, `drop=None`).
- `ColumnTransformer` combining numeric and categorical branches.
- `Pipeline` with `ColumnTransformer` + linear regressor as final estimator.
- `DecisionTreeClassifier` and `RandomForestClassifier`.

## Getting Started

### Prerequisites

- Python >= 3.12
- `uv` for dependency management

### Installation

```bash
pip install scikit-sqlearn
```

For local development:

```bash
uv pip install -e .
```

## Running Tests

We use `pytest` for testing. You can run the test suite using the configured script in `pyproject.toml`:

```bash
uv run test
```

Or directly via pytest:

```bash
uv run pytest tests/ -v
```

## Development Rules

1.  **Test First**: Always add tests for new features or bug fixes.
2.  **Integrity**: Do not modify existing tests just to make them pass. If a test fails, fix the implementation, not the test (unless the test itself is incorrect).
3.  **Verification**: Ensure all tests pass before committing or submitting changes.
    ```bash
    uv run test
    ```
4.  **Usage Examples**: When adding new modules, include an `if __name__ == "__main__":` block with a runnable example, and confirm that it actually runs.
5.  **Full Suite**: Make sure all tests pass, not just the new ones.

## Examples

### Linear Model to SQL

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sqlearn.linear_model import LinearModelConverter

# Use fixed coefficients so the output SQL is deterministic
model = LinearRegression()
model.coef_ = np.array([2.0, -3.0])
model.intercept_ = 5.0

converter = LinearModelConverter(model)
sql = converter.to_sql(feature_names=["col1", "col2"], table_name="my_table")

print(sql)
# SELECT (2 * col1) + (-3 * col2) + 5 AS prediction FROM my_table
```
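
One way to sanity-check a generated query is to execute it on a small table and compare the result with the same arithmetic done by hand. The sketch below is not part of sqlearn: it pastes the SQL string from the example above into Python's built-in `sqlite3` module, and the table rows are made up for illustration.

```python
import sqlite3

# SQL copied from the example output above (not generated here).
sql = "SELECT (2 * col1) + (-3 * col2) + 5 AS prediction FROM my_table"

# Illustrative rows (assumed data); any numeric values would do.
rows = [(1.0, 1.0), (0.5, 2.0), (-4.0, 0.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (col1 REAL, col2 REAL)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)

# Predictions computed by the database engine.
sql_predictions = [row[0] for row in conn.execute(sql)]
conn.close()

# The same linear formula evaluated in plain Python: 2*col1 - 3*col2 + 5.
expected = [2.0 * c1 - 3.0 * c2 + 5.0 for c1, c2 in rows]

print(sql_predictions)
print(expected)
```

The same check should work against DuckDB or Postgres by swapping the connection, provided the query uses only arithmetic that those engines share.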

### Pipeline to SQL (StandardScaler + OneHotEncoder + LinearRegression)

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sqlearn.pipeline import PipelineConverter

X = pd.DataFrame({"age": [0, 2], "city": ["la", "ny"]})

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["city"]),
    ]
).fit(X)

model = LinearRegression()
model.coef_ = np.array([2.0, 10.0, -5.0])  # num__age, cat__city_la, cat__city_ny
model.intercept_ = 1.0

pipe = Pipeline([("preprocessor", preprocessor), ("model", model)])

sql = PipelineConverter(pipe).to_sql(feature_names=["age", "city"], table_name="people")

print(sql)
# WITH transformed AS (
#   SELECT age - 1 AS num__age,
#          CASE WHEN city = 'la' THEN 1 ELSE 0 END AS cat__city_la,
#          CASE WHEN city = 'ny' THEN 1 ELSE 0 END AS cat__city_ny
#   FROM people
# )
# SELECT (2 * num__age) + (10 * cat__city_la) + (-5 * cat__city_ny) + 1 AS prediction
# FROM transformed
```
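
The `age - 1` term follows from the fitted scaler: for `age` values `[0, 2]` the mean is 1 and the population standard deviation is 1, so `(age - mean) / scale` reduces to `age - 1`. As a rough check, the sketch below runs the SQL from the example above in Python's built-in `sqlite3` module. It does not call sqlearn, and the third row (`"sf"`) is an assumed extra input that shows how a category unseen during fitting contributes zero to both one-hot columns.

```python
import sqlite3

# SQL copied from the example output above (not generated here).
pipeline_sql = """
WITH transformed AS (
  SELECT age - 1 AS num__age,
         CASE WHEN city = 'la' THEN 1 ELSE 0 END AS cat__city_la,
         CASE WHEN city = 'ny' THEN 1 ELSE 0 END AS cat__city_ny
  FROM people
)
SELECT (2 * num__age) + (10 * cat__city_la) + (-5 * cat__city_ny) + 1 AS prediction
FROM transformed
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age REAL, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        (0.0, "la"),  # 2 * (0 - 1) + 10 * 1 + (-5) * 0 + 1 = 9
        (2.0, "ny"),  # 2 * (2 - 1) + 10 * 0 + (-5) * 1 + 1 = -2
        (1.0, "sf"),  # unseen city: 2 * (1 - 1) + 0 + 0 + 1 = 1
    ],
)
pipeline_predictions = [row[0] for row in conn.execute(pipeline_sql)]
conn.close()

print(pipeline_predictions)
```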

### DecisionTreeClassifier to SQL

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sqlearn.tree_model import DecisionTreeClassifierConverter

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X, y)

sql = DecisionTreeClassifierConverter(clf).to_sql(
    feature_names=["x0"],
    table_name="input_data",
)

print(sql)
# SELECT CASE WHEN (CASE WHEN x0 <= 1.5 THEN 1 ELSE 0 END) >= ...
# ... THEN 0 ELSE 1 END AS prediction FROM input_data
```
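
To build intuition for how a tree maps onto SQL, the sketch below turns a tiny hand-written tree into a nested `CASE` expression. It is an illustration only and not sqlearn's implementation: the tuple representation and the `tree_to_case` helper are hypothetical, and the expression it produces is simpler than the library output shown (truncated) above. It follows the scikit-learn convention that samples with `feature <= threshold` go to the left child.

```python
# Hypothetical tree representation, for illustration only:
# an internal node is (feature, threshold, left, right); a leaf is a class label.
def tree_to_case(node):
    """Recursively render a tree as a nested SQL CASE expression."""
    if not isinstance(node, tuple):
        return str(node)  # leaf: emit the predicted class
    feature, threshold, left, right = node
    return (
        f"CASE WHEN {feature} <= {threshold} "
        f"THEN {tree_to_case(left)} ELSE {tree_to_case(right)} END"
    )


# A depth-1 tree matching the split in the example above (x0 <= 1.5).
stump = ("x0", 1.5, 0, 1)
tree_sql = f"SELECT {tree_to_case(stump)} AS prediction FROM input_data"

print(tree_sql)
# SELECT CASE WHEN x0 <= 1.5 THEN 0 ELSE 1 END AS prediction FROM input_data
```

Deeper trees nest one `CASE` per internal node, so the SQL grows with the number of nodes in the tree.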
