Metadata-Version: 2.4
Name: cylint
Version: 0.1.3
Summary: PySpark anti-pattern linter — catch the code that costs you money
Project-URL: Homepage, https://github.com/clusteryieldanalytics/cylint
Project-URL: Issues, https://github.com/clusteryieldanalytics/cylint/issues
Author: Cluster Yield Analytics
License-Expression: Apache-2.0
Keywords: anti-pattern,linter,performance,pyspark,spark
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Provides-Extra: snapshot
Requires-Dist: requests; extra == 'snapshot'
Description-Content-Type: text/markdown

# cylint

A PySpark linter that catches the anti-patterns costing you real money.

Static analysis for PySpark code. No Spark runtime needed. No required dependencies (the optional `snapshot` extra pulls in `requests`). Runs anywhere Python runs.

## Install

```bash
pip install cylint
```

## Usage

```bash
# Lint files or directories
cy lint src/pipelines/

# JSON output for CI
cy lint --format json src/

# Only warnings and critical
cy lint --min-severity warning .
```

Example output:

```
pipeline.py:47:8: CY003 [critical] .withColumn() inside a loop creates O(n²) plan complexity.
  Use .select([...]) with all column expressions instead.

pipeline.py:82:4: CY001 [warning] .collect() called without filtering.
  Consider .limit(N).collect(), .take(N), or using .show() for inspection.

pipeline.py:103:4: CY005 [warning] .cache() with single downstream use.
  Cache is only beneficial when the same DataFrame is used in multiple actions.

Found 3 issues (1 critical, 2 warnings) in 1 file.
```
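If you need to post-process the default text output, each finding headline follows a `path:line:col: RULE [severity] message` shape. The snippet below is a minimal sketch of parsing it, based only on the example output above. `FINDING_RE` and `parse_findings` are hypothetical names, not part of `cylint`, and the sketch assumes file paths contain no `:`. For real tooling, `--format json` is likely the better choice.

```python
import re
from collections import Counter

# Matches the headline of each finding, e.g.
#   pipeline.py:47:8: CY003 [critical] .withColumn() inside a loop ...
# Assumption: paths contain no ":" (Windows drive letters would need extra handling).
FINDING_RE = re.compile(
    r"^(?P<path>[^:\s][^:]*):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<rule>CY\d+) \[(?P<severity>\w+)\] (?P<message>.+)$"
)


def parse_findings(output):
    """Return one dict per finding headline; hint and summary lines are skipped."""
    findings = []
    for raw in output.splitlines():
        match = FINDING_RE.match(raw)
        if match:
            finding = match.groupdict()
            finding["line"] = int(finding["line"])
            finding["col"] = int(finding["col"])
            findings.append(finding)
    return findings


sample = """\
pipeline.py:47:8: CY003 [critical] .withColumn() inside a loop creates O(n²) plan complexity.
  Use .select([...]) with all column expressions instead.

pipeline.py:82:4: CY001 [warning] .collect() called without filtering.
  Consider .limit(N).collect(), .take(N), or using .show() for inspection.

pipeline.py:103:4: CY005 [warning] .cache() with single downstream use.
  Cache is only beneficial when the same DataFrame is used in multiple actions.

Found 3 issues (1 critical, 2 warnings) in 1 file.
"""

findings = parse_findings(sample)
by_severity = Counter(f["severity"] for f in findings)
print(by_severity["critical"], by_severity["warning"])
```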

## Rules

| Rule | Severity | What it catches |
|------|----------|----------------|
| CY001 | warning | `.collect()` without `.filter()` or `.limit()` — the #1 OOM cause |
| CY002 | warning | UDF where a builtin exists (e.g. `udf(lambda x: x.lower())` → `F.lower()`) |
| CY003 | critical | `.withColumn()` in a loop — creates O(n²) Catalyst plans |
| CY004 | info | `SELECT *` in `spark.sql()` strings — prevents column pruning |
| CY005 | warning | `.cache()` / `.persist()` with ≤1 downstream use — wastes memory |
| CY006 | warning | `.toPandas()` on unfiltered DataFrame — collects everything to driver |
| CY007 | critical | `.crossJoin()` or `.join()` without condition — cartesian product |
| CY008 | info | `.repartition()` before `.write()` — unnecessary shuffle |
| CY009 | critical | UDF in `.filter()`/`.where()` — blocks predicate pushdown |
| CY010 | warning | `.join()` without explicit `how=` — ambiguous join type |
| CY011 | warning | `.withColumnRenamed()`/`.drop()` in a loop — O(n²) plan nodes |
| CY012 | warning | `.show()`/`.display()`/`.printSchema()` left in production code |
| CY013 | warning | `.coalesce(1)` before `.write()` — single-executor bottleneck |
| CY014 | critical | Multiple actions without `.cache()` — recomputes full lineage each time |
| CY015 | critical | Non-equi `.join()` condition — implicit cartesian product |
| CY016 | info | Invalid escape sequence in string literal — use raw strings for regex |

List all rules:

```bash
cy rules
```

## How it works

`cylint` uses Python's `ast` module to parse your source files and track DataFrame variables through assignment chains. It knows that anything coming from `spark.read.*`, `spark.sql()`, or `spark.table()` is a DataFrame, and follows method chains from there.

No type stubs. No Spark installation. No imports resolved. Just fast, heuristic analysis that catches the patterns that matter.
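To make the approach concrete, here is a minimal sketch of the same idea using only the standard library: track names assigned from `spark.read.*`, `spark.sql()`, or `spark.table()`, then flag `.withColumn()` calls on those names inside a loop. This is an illustration under simplifying assumptions, not `cylint`'s actual implementation. `attr_chain` and `LoopWithColumnFinder` are hypothetical names, and the sketch ignores scopes, function arguments, and reassignment to non-DataFrames.

```python
import ast

SPARK_ENTRY_POINTS = {"read", "sql", "table"}  # spark.read.*, spark.sql(), spark.table()


def attr_chain(node):
    """Return the names along a call/attribute chain, leftmost first.

    `spark.read.parquet("x")` gives ['spark', 'read', 'parquet'].
    Chains that do not start at a plain name give [].
    """
    names = []
    while isinstance(node, (ast.Call, ast.Attribute)):
        if isinstance(node, ast.Attribute):
            names.append(node.attr)
            node = node.value
        else:
            node = node.func
    if isinstance(node, ast.Name):
        names.append(node.id)
        return names[::-1]
    return []


class LoopWithColumnFinder(ast.NodeVisitor):
    def __init__(self):
        self.dataframes = set()  # variable names believed to hold DataFrames
        self.loop_depth = 0
        self.findings = []       # line numbers of flagged calls

    def _is_dataframe_expr(self, node):
        chain = attr_chain(node)
        if not chain:
            return False
        if chain[0] == "spark":
            return len(chain) > 1 and chain[1] in SPARK_ENTRY_POINTS
        return chain[0] in self.dataframes

    def visit_Assign(self, node):
        # Propagate "is a DataFrame" through assignment chains.
        if self._is_dataframe_expr(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.dataframes.add(target.id)
        self.generic_visit(node)

    def _visit_loop(self, node):
        self.loop_depth += 1
        self.generic_visit(node)
        self.loop_depth -= 1

    visit_For = visit_While = _visit_loop

    def visit_Call(self, node):
        func = node.func
        if (
            self.loop_depth > 0
            and isinstance(func, ast.Attribute)
            and func.attr == "withColumn"
            and self._is_dataframe_expr(func.value)
        ):
            self.findings.append(node.lineno)
        self.generic_visit(node)


SOURCE = '''
df = spark.read.parquet("events")
for name in ["a", "b"]:
    df = df.withColumn(name, df["x"] + 1)
clean = df.withColumn("ok", df["a"] > 0)
'''

finder = LoopWithColumnFinder()
finder.visit(ast.parse(SOURCE))
print(finder.findings)
```

Because the source is only parsed and never executed, the snippet runs without Spark installed, which is the property the linter relies on.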

## Configuration

Out of the box, every rule runs at its default severity with no exclusions. No config file needed.

If a rule doesn't apply to your codebase, or you want to skip certain directories, drop a `.cylint.yml` in your project root or add a `[tool.cylint]` section to your existing `pyproject.toml`. The linter picks it up automatically.

### .cylint.yml

```yaml
# Only fail on warnings and above (ignore info-level findings)
min-severity: warning

rules:
  CY004: off        # we use SELECT * intentionally in dynamic queries
  CY008: warning    # promote repartition-before-write to warning

exclude:
  - tests/
  - vendor/
  - notebooks/scratch/
```

### pyproject.toml

```toml
[tool.cylint]
min-severity = "warning"
exclude = ["tests/", "notebooks/scratch/"]

[tool.cylint.rules]
CY004 = "off"
CY008 = "warning"
```
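The sketch below illustrates how these settings would plausibly combine: per-rule overrides are applied first, `off` drops a rule, and `min-severity` then filters what remains. This is an assumption about the order of operations, not a description of `cylint` internals. `effective_findings` and `SEVERITY_ORDER` are hypothetical names, and `tomllib` requires Python 3.11+ (the `tomli` fallback is for 3.10).

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10
    import tomli as tomllib

SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

PYPROJECT = """
[tool.cylint]
min-severity = "warning"
exclude = ["tests/", "notebooks/scratch/"]

[tool.cylint.rules]
CY004 = "off"
CY008 = "warning"
"""


def effective_findings(findings, config):
    """Apply rule overrides, then the min-severity floor (assumed order)."""
    overrides = config.get("rules", {})
    floor = SEVERITY_ORDER[config.get("min-severity", "info")]
    kept = []
    for rule, default_severity in findings:
        severity = overrides.get(rule, default_severity)
        if severity == "off":
            continue  # rule disabled in config
        if SEVERITY_ORDER[severity] >= floor:
            kept.append((rule, severity))
    return kept


config = tomllib.loads(PYPROJECT)["tool"]["cylint"]

# (rule, default severity) pairs taken from the rules table above.
raw = [("CY003", "critical"), ("CY004", "info"), ("CY008", "info"), ("CY016", "info")]
result = effective_findings(raw, config)
print(result)
```

With this configuration, CY004 is switched off, CY008 is promoted to a warning and therefore survives the threshold, and CY016 stays at info and is filtered out.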

## CI Integration

### GitHub Actions

```yaml
name: PySpark Lint
on: pull_request

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install cylint
      - run: cy lint --format github src/
```

The `--format github` flag outputs findings as workflow annotations — they appear inline on the PR diff.
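For context, GitHub workflow annotations are plain lines on stdout in the form `::level file=...,line=...,col=...::message`. The snippet below is a sketch of how a finding could be rendered that way. The severity-to-level mapping and the use of `title` for the rule ID are assumptions, and `to_annotation` is a hypothetical helper, so the exact output of `--format github` may differ.

```python
# Assumed mapping from cylint severities to GitHub annotation levels.
LEVELS = {"info": "notice", "warning": "warning", "critical": "error"}


def _escape_data(text):
    """Escape a message the way GitHub workflow commands expect."""
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text):
    """Property values additionally escape ':' and ','."""
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def to_annotation(path, line, col, rule, severity, message):
    """Render one finding as a GitHub workflow command line."""
    level = LEVELS[severity]
    props = (
        f"file={_escape_property(path)},line={line},col={col},"
        f"title={_escape_property(rule)}"
    )
    return f"::{level} {props}::{_escape_data(message)}"


annotation = to_annotation(
    "pipeline.py", 82, 4, "CY001", "warning", ".collect() called without filtering."
)
print(annotation)
```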

### pre-commit

```yaml
repos:
  - repo: https://github.com/clusteryieldanalytics/cylint
    rev: <release-tag>  # required by pre-commit; pin to a published tag or commit SHA
    hooks:
      - id: spark-lint
        args: [--min-severity, warning]
```

## Exit codes

| Code | Meaning |
|------|---------|
| 0 | No findings |
| 1 | Warnings or info findings |
| 2 | Critical findings |
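
A wrapper script can use these codes to decide how strict a pipeline should be. The following is a minimal sketch, assuming `cy` is on `PATH`. `run_lint` and `should_block_merge` are hypothetical helpers, and treating unexpected exit codes as failures is a design choice of this example rather than documented behavior.

```python
import subprocess

EXIT_MEANINGS = {
    0: "no findings",
    1: "warnings or info findings",
    2: "critical findings",
}


def should_block_merge(exit_code, block_on_warnings=False):
    """Decide whether a CI job should fail for a given `cy lint` exit code."""
    if exit_code == 0:
        return False
    if exit_code == 1:
        return block_on_warnings
    # 2 means critical findings; anything else is unexpected, so fail closed.
    return True


def run_lint(paths, extra_args=()):
    """Run `cy lint` on the given paths and return its exit code."""
    cmd = ["cy", "lint", *extra_args, *paths]
    return subprocess.run(cmd, check=False).returncode


# Example usage (not executed here because it needs cylint installed):
#   code = run_lint(["src/"], extra_args=["--min-severity", "warning"])
#   print(EXIT_MEANINGS.get(code, "unexpected exit code"))
#   raise SystemExit(1 if should_block_merge(code) else 0)
```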

## Why these rules?

Every rule targets a pattern that causes OOM crashes, triggers unnecessary shuffles, or prevents Spark's Catalyst optimizer from doing its job. These aren't style opinions; they're the patterns you find in postmortems after a 3am page about a failed pipeline or a $40K surprise on your Databricks bill.

If you've read a "PySpark anti-patterns to avoid" blog post, you've seen these patterns described. This tool catches them automatically, before the code hits production.

## License

Apache 2.0