Metadata-Version: 2.4
Name: cqla
Version: 0.1.0
Summary: LINQ-inspired query library for Python collections with Polars-style syntax
Keywords: query,linq,polars,filter,collections,dataframe,sql
Author: Ahmed Muhammad
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/ahmedmuhammad/cqla
Project-URL: Repository, https://github.com/ahmedmuhammad/cqla
Project-URL: Documentation, https://github.com/ahmedmuhammad/cqla#readme
Description-Content-Type: text/markdown

# Config Query Language API (cqla)

A query language for Python collections, inspired by LINQ with Polars-style syntax.

## Installation

```bash
pip install cqla
```

## The Problem

If you've worked with Pydantic models or dataclasses, you've probably written query methods on a container class, like this:

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    value: str
    enabled: bool
    priority: int

@dataclass
class ConfigStore:
    configurations: list[Config]

    def search_by_name(self, name: str) -> Config | None:
        for cfg in self.configurations:
            if cfg.name == name:
                return cfg
        return None  # no config with that name

    def get_enabled(self) -> list[Config]:
        return [cfg for cfg in self.configurations if cfg.enabled]

    def get_high_priority(self, threshold: int) -> list[Config]:
        return [cfg for cfg in self.configurations if cfg.priority > threshold]

    def get_enabled_high_priority(self, threshold: int) -> list[Config]:
        return [
            cfg for cfg in self.configurations
            if cfg.enabled and cfg.priority > threshold
        ]

    # ...and so on, a new method for every query pattern
```

This gets tedious. Every new query requirement means another method. The logic is scattered, repetitive, and hard to compose.

With cqla, you don't need any of those methods:

```python
import cqla as cq

configs = [...]  # list of Config objects

# Search by name
cq.Query(configs).filter(cq.field("name") == "database_url").first()

# Get enabled configs
cq.Query(configs).filter(cq.field("enabled") == True).collect()

# Enabled configs with priority above 5
(cq.Query(configs)
   .filter((cq.field("enabled") == True) & (cq.field("priority") > 5))
   .collect())
```
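
The `cq.field("priority") > 5` syntax relies on operator overloading: a comparison builds a predicate object instead of returning a `bool`. The sketch below illustrates the general idea only and is not cqla's implementation. `Expr`, `field`, and `_get` are hypothetical names, and it assumes a field is read as a dict key or an object attribute.

```python
from dataclasses import dataclass
from typing import Any, Callable


def _get(row: Any, name: str) -> Any:
    """Read `name` from a dict key or an object attribute (hypothetical helper)."""
    return row[name] if isinstance(row, dict) else getattr(row, name)


class Expr:
    """Wraps a function row -> value; operators return new Expr objects."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __call__(self, row: Any) -> Any:
        return self.fn(row)

    def __eq__(self, other: Any) -> "Expr":
        return Expr(lambda row: self.fn(row) == other)

    def __gt__(self, other: Any) -> "Expr":
        return Expr(lambda row: self.fn(row) > other)

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(lambda row: bool(self.fn(row)) and bool(other.fn(row)))


def field(name: str) -> Expr:
    """Build an expression that reads one field from a row."""
    return Expr(lambda row: _get(row, name))


@dataclass
class Config:
    name: str
    enabled: bool
    priority: int


configs = [
    Config("database_url", True, 9),
    Config("debug", False, 7),
    Config("timeout", True, 2),
]

# Nothing is evaluated until the predicate is called on a row.
predicate = (field("enabled") == True) & (field("priority") > 5)
matches = [cfg.name for cfg in configs if predicate(cfg)]
print(matches)  # ['database_url']
```

The parentheses around each comparison are required because `&` binds more tightly than `==` and `>` in Python.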

## Works With

cqla works with any Python object:

- Plain dicts (JSON-like data)
- dataclasses
- Pydantic models
- msgspec Structs
- Any object with attributes

## Inspiration: LINQ and Polars Had a Baby

cqla is inspired by [LINQ](https://learn.microsoft.com/en-us/dotnet/csharp/linq/) (Language Integrated Query) from C#/.NET. LINQ lets you query collections using a SQL-like, composable syntax:

```csharp
// C# LINQ
var results = configs
    .Where(c => c.Enabled && c.Priority > 5)
    .Select(c => new { c.Name, c.Value })
    .OrderBy(c => c.Name);
```

cqla brings this same idea to Python, but with syntax borrowed from Polars:

```python
# cqla (Python)
results = (
    cq.Query(configs)
    .filter((cq.field("enabled") == True) & (cq.field("priority") > 5))
    .select("name", "value")
    .collect()
)
```

## Alternatives?

Libraries like [pydash](https://pydash.readthedocs.io/) and [toolz](https://toolz.readthedocs.io/) are excellent for functional programming patterns:

```python
# pydash
import pydash as _

_.filter_(configs, lambda c: c.enabled and c.priority > 5)
_.map_(configs, lambda c: c.name.upper())
_.group_by(configs, "category")

# toolz
from toolz import filter, map, groupby
from toolz.curried import pipe

list(filter(lambda c: c.enabled, configs))
list(map(lambda c: c.name.upper(), configs))
groupby(lambda c: c.category, configs)

# composing operations in toolz
pipe(configs,
     lambda x: filter(lambda c: c.enabled, x),
     lambda x: map(lambda c: {"name": c.name.upper(), "priority": c.priority}, x),
     list)
```

These work, but if you think in SQL, they feel inside-out. The operations are free functions wrapped around the data rather than methods chained from it, and composing several of them requires nesting or piping.

cqla reads like SQL, and when you need custom transformations, `.apply()` lets you drop into a lambda:

```python
# cqla - reads top to bottom, left to right
(
    cq.Query(configs)
    .filter(cq.field("enabled") == True)        # WHERE enabled = true
    .filter(cq.field("priority") > 5)           # AND priority > 5
    .group_by("category")                       # GROUP BY category
    .having(cq.field("priority").count() > 2)   # HAVING COUNT(priority) > 2
    .agg(                                       # SELECT ...
        count=cq.field("name").count(),
        avg_priority=cq.field("priority").mean(),
    )
    .collect()
)

# apply() for custom transformations
(
    cq.Query(configs)
    .filter(cq.field("enabled") == True)
    .select(
        "name",
        name_upper=cq.field("name").apply(str.upper),
        slug=cq.field("name").apply(lambda s: s.lower().replace(" ", "-")),
    )
    .collect()
)
```
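
For readers who want to see what the grouped query computes, here is a plain-Python equivalent that uses only the standard library. The sample rows and `category` values are made up for illustration, and the sketch says nothing about how cqla implements grouping internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; only the field names come from the query above.
rows = [
    {"name": "a", "category": "db", "enabled": True, "priority": 9},
    {"name": "b", "category": "db", "enabled": True, "priority": 7},
    {"name": "c", "category": "db", "enabled": True, "priority": 8},
    {"name": "d", "category": "cache", "enabled": True, "priority": 6},
    {"name": "e", "category": "cache", "enabled": False, "priority": 9},
    {"name": "f", "category": "db", "enabled": True, "priority": 3},
]

# WHERE enabled = true AND priority > 5
kept = [r for r in rows if r["enabled"] and r["priority"] > 5]

# GROUP BY category
groups: dict[str, list[dict]] = defaultdict(list)
for r in kept:
    groups[r["category"]].append(r)

# HAVING COUNT(priority) > 2, then SELECT COUNT(name), AVG(priority)
result = [
    {
        "category": category,
        "count": len(members),
        "avg_priority": mean(m["priority"] for m in members),
    }
    for category, members in groups.items()
    if len(members) > 2
]
print(result)
```

Only the `db` group survives the `HAVING` step here, with three members and an average priority of 8. Each SQL clause becomes a separate loop or comprehension, which is the bookkeeping the chained query hides.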

## Why Not Polars or Pandas?

Polars and Pandas are built for tabular data — rows and columns, where every row has the same schema. They're optimized for numerical computation on large datasets.

But configuration data, API responses, and domain objects are often semi-structured or nested:

```python
configs = [
    {
        "name": "app",
        "settings": {
            "database": {"host": "localhost", "port": 5432},
            "features": ["auth", "logging", "metrics"],
        },
        "metadata": {"version": 1, "tags": ["production"]},
    },
    {
        "name": "worker",
        "settings": {
            "queue": "redis://localhost",
            # no "database" key here
        },
        "metadata": {"version": 2},  # no "tags" key
    },
]
```

Try loading this into Pandas:

```python
import pandas as pd

df = pd.DataFrame(configs)
print(df)
#      name                                           settings                              metadata
# 0     app  {'database': {'host': 'localhost', 'port': 54...  {'version': 1, 'tags': ['production']}
# 1  worker              {'queue': 'redis://localhost'}                         {'version': 2}

# Want to filter by database host? Good luck.
df[df["settings"].apply(lambda s: s.get("database", {}).get("host")) == "localhost"]
```

The nested dicts stay as opaque objects. You're back to writing lambdas and `.apply()`.

Polars runs into a related problem:

```python
import polars as pl

df = pl.DataFrame(configs)
# polars.exceptions.SchemaError:
# could not append value: {"database": {"host": "localhost" ...
# struct fields must have a consistent schema
```

Polars won't even load it because the schemas don't match.

cqla handles this naturally:

```python
import cqla as cq

# Filter by nested field
(
    cq.Query(configs)
    .filter(cq.field("settings.database.host") == "localhost")
    .collect()
)

# Access nested fields in select
(
    cq.Query(configs)
    .select(
        "name",
        db_host=cq.field("settings.database.host"),
        version=cq.field("metadata.version"),
    )
    .collect()
)
```
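
A dotted path such as `settings.database.host` can be resolved by walking one segment at a time. The helper below, `resolve_path`, is a hypothetical illustration and not part of cqla. It assumes that a missing key or attribute yields `None`, which may differ from how cqla treats missing fields.

```python
from collections.abc import Mapping
from typing import Any

_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def resolve_path(obj: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested mappings or object attributes."""
    current = obj
    for part in path.split("."):
        if isinstance(current, Mapping):
            current = current.get(part, _MISSING)
        else:
            current = getattr(current, part, _MISSING)
        if current is _MISSING:
            return default  # stop at the first missing segment
    return current


configs = [
    {"name": "app", "settings": {"database": {"host": "localhost", "port": 5432}}},
    {"name": "worker", "settings": {"queue": "redis://localhost"}},
]

hosts = [resolve_path(c, "settings.database.host") for c in configs]
print(hosts)  # ['localhost', None]
```

Because the walk checks for a mapping first and falls back to `getattr`, the same helper would also traverse dataclasses or other attribute-based objects nested inside dicts.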

## Features

cqla supports the operations you'd expect from a query language:

```python
import cqla as cq

data = [...]  # list of dicts, dataclasses, Pydantic models, or any objects

# Filtering
cq.Query(data).filter(cq.field("age") > 30).collect()
cq.Query(data).filter((cq.field("age") > 30) & (cq.field("active") == True)).collect()

# Selecting fields
cq.Query(data).select("name", "email").collect()
cq.Query(data).select("name", uppercased=cq.field("name").str.to_uppercase()).collect()

# Adding computed columns
cq.Query(data).with_columns(
    year=cq.field("created_at").dt.year(),
    name_lower=cq.field("name").str.to_lowercase(),
).collect()

# Conditional expressions
cq.Query(data).select(
    "name",
    tier=cq.when(cq.field("score") >= 90).then("gold")
          .when(cq.field("score") >= 70).then("silver")
          .otherwise("bronze"),
).collect()

# Grouping and aggregation
cq.Query(data).group_by("department").agg(
    count=cq.field("id").count(),
    avg_salary=cq.field("salary").mean(),
).collect()

# Filtering groups (HAVING)
cq.Query(data).group_by("department").having(
    cq.field("id").count() >= 5
).agg(
    count=cq.field("id").count(),
).collect()

# Window functions
cq.Query(data).with_columns(
    dept_avg=cq.field("salary").mean().over("department"),
).collect()

# Explode: expand list field into multiple rows
cq.Query(data).explode("tags").collect()
# [{"name": "alice", "tags": ["a", "b"]}] -> [{"name": "alice", "tags": "a"}, {"name": "alice", "tags": "b"}]

# Accessors for strings, lists, sets, datetimes
cq.field("name").str.contains("smith", literal=True)
cq.field("tags").list.len()
cq.field("categories").set.contains("electronics")
cq.field("created_at").dt.year()
```
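
Two of the operations above may be unfamiliar outside dataframe libraries. The plain-Python sketch below shows what `explode` and a windowed mean compute. The sample rows are invented, and the code mirrors the described results rather than cqla's internals. Rows with an empty list are dropped in this sketch; cqla may handle that case differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows for illustration.
rows = [
    {"name": "alice", "department": "eng", "salary": 120, "tags": ["a", "b"]},
    {"name": "bob", "department": "eng", "salary": 100, "tags": ["c"]},
    {"name": "carol", "department": "ops", "salary": 90, "tags": []},
]

# Explode: one output row per element of the list field.
exploded = [{**row, "tags": tag} for row in rows for tag in row["tags"]]

# Window: compute the mean per department, then attach it to every row.
# Unlike group_by + agg, the number of rows stays the same.
salaries = defaultdict(list)
for row in rows:
    salaries[row["department"]].append(row["salary"])
dept_avg = {dept: mean(values) for dept, values in salaries.items()}
windowed = [{**row, "dept_avg": dept_avg[row["department"]]} for row in rows]

print([r["tags"] for r in exploded])
print([(r["name"], r["dept_avg"]) for r in windowed])
```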

## Scalability

cqla is built on generators. Operations like `filter`, `select`, and `with_columns` don't materialize the full dataset until you iterate over the query or call `.collect()`. This means you can stream through large datasets without holding every intermediate result in memory:

```python
# Process a million records lazily
query = (
    cq.Query(huge_dataset)
    .filter(cq.field("status") == "active")
    .select("id", "name")
)

# Only materializes when you iterate or collect
for record in query:
    process(record)

# Or take just the first 10
query.limit(10).collect()
```
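
To make the laziness concrete, here is a generator pipeline in plain Python that behaves the way this section describes. It is a sketch of the general technique, not cqla's source. `records`, `lazy_filter`, and `lazy_select` are hypothetical names.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator

consumed = 0  # counts how many source records were actually read


def records(n: int) -> Iterator[dict[str, Any]]:
    """Yield n synthetic records one at a time."""
    global consumed
    for i in range(n):
        consumed += 1
        status = "active" if i % 2 == 0 else "inactive"
        yield {"id": i, "name": f"user{i}", "status": status}


def lazy_filter(rows: Iterable[dict], pred: Callable[[dict], bool]) -> Iterator[dict]:
    """Pass through only the rows for which pred is true, one at a time."""
    return (row for row in rows if pred(row))


def lazy_select(rows: Iterable[dict], *names: str) -> Iterator[dict]:
    """Project each row down to the named fields, one at a time."""
    return ({name: row[name] for name in names} for row in rows)


pipeline = lazy_select(
    lazy_filter(records(1_000_000), lambda r: r["status"] == "active"),
    "id",
    "name",
)
assert consumed == 0  # building the pipeline reads nothing

# Taking ten results pulls only as many source records as needed.
first_ten = list(islice(pipeline, 10))
print(len(first_ten), consumed)  # 10 19
```

Only 19 of the million source records are read, because every second record is active and the pipeline stops as soon as the tenth match is produced.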

## Examples

The `examples/` directory contains interactive [marimo](https://marimo.io/) notebooks demonstrating cqla with different data types:

- `json_example.py` — querying plain dicts, nested field access
- `pydantic_example.py` — querying Pydantic models, set operations
- `msgspec_example.py` — querying msgspec Structs
- `stress_test.py` — benchmarks with large datasets

To run the examples, clone the repo and install development dependencies:

```bash
git clone https://github.com/ahmedmuhammad/cqla.git
cd cqla
uv sync
uv run marimo edit examples/
```

## License

MIT
