Metadata-Version: 2.4
Name: AuraIndex
Version: 0.1.3
Summary: AST-driven SQL pattern extraction and index recommendation engine.
Author: Mihir Patil
License-Expression: MIT
Project-URL: Homepage, https://github.com/mihir0209/AuraIndex
Project-URL: Repository, https://github.com/mihir0209/AuraIndex
Project-URL: Issues, https://github.com/mihir0209/AuraIndex/issues
Keywords: ast,sql,indexing,database-optimization,static-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sqlglot<28.0.0,>=26.0.0
Provides-Extra: dev
Requires-Dist: pytest<9.0.0,>=8.0.0; extra == "dev"
Requires-Dist: build<2.0.0,>=1.2.2; extra == "dev"
Requires-Dist: twine<7.0.0,>=5.1.1; extra == "dev"
Provides-Extra: db
Requires-Dist: psycopg[binary]<4.0.0,>=3.1.0; extra == "db"
Requires-Dist: PyMySQL<2.0.0,>=1.1.0; extra == "db"
Dynamic: license-file

# AuraIndex

AuraIndex is an AST-driven index recommendation engine for databases. It statically scans application code, extracts SQL and ORM query patterns, parses them into a SQL AST, aggregates workload-style usage signals, and proposes guarded index recommendations.

## Why this project

Traditional index tuning tools depend on live database telemetry. AuraIndex targets earlier design stages by mining query intent from source code:

- Language AST frontends extract query intent from code.
- A shared SQL AST backend normalizes patterns across sources.
- Heuristics generate index suggestions with explicit confidence and review flags.

This architecture makes it practical to support multiple languages over time while keeping one recommendation core.

## Current scope

- Tree-sitter WASM extraction for Python, JavaScript/TypeScript, and PHP
  - raw SQL call detection (`execute`, `executemany`, `query`, `raw`, `text`)
  - SQL-like query builder chain lifting (`where/filter/join/order/group` patterns)
- SQL file extraction (`.sql`)
- SQL AST analysis with `sqlglot`
- Index heuristics:
  - single-column filter keys
  - join keys
  - composite filter keys
  - where+order composite patterns
- Impact model (enabled by default):
  - read benefit vs write amplification scoring
  - net impact adjustment on suggestion ranking
- Schema-aware checks (enabled by default):
  - existing index detection from SQL DDL (`CREATE INDEX`)
  - exact/prefix overlap detection (including unique/partial overlaps)
  - automatic suppression of already-covered suggestions
- Migration artifact generation (enabled by default):
  - up/down SQL script for suggested indexes
- Optional EXPLAIN validation:
  - sqlite built-in runner (`EXPLAIN QUERY PLAN`)
  - custom external command runner for postgres/mysql
- Engine-aware index type hints (enabled by default):
  - btree / hash / brin / gin heuristics
  - typed DDL variant rendering (when applicable)
- Schema drift watch (enabled by default):
  - possibly unused existing index candidates
  - missing index candidate summary
- Query optimization advisories (enabled by default):
  - `SELECT *` detection
  - `ORDER BY` without `LIMIT`
  - leading wildcard `LIKE` warnings
- Guardrails:
  - deduplication
  - confidence scoring
  - per-table cap on suggestions
  - manual-review flags
- Unified **Tree-sitter WASM** frontend (`web-tree-sitter`) for supported languages

## Architecture

1. **Extractors (language-specific frontend)**
   - `WasmTreeSitterExtractor` invokes a Node script using Tree-sitter WASM for Python/JS/TS/PHP.
   - `extract_sql_file` loads SQL statements directly from `.sql` files.

2. **SQL AST analyzer (shared backend)**
   - Parses candidate SQL with `sqlglot`.
   - Extracts tables, predicate columns, join columns, ordering/grouping columns.
   - Builds query-level feature records.

3. **Workload aggregation**
   - Counts column participation by role (`WHERE`, `JOIN`, `ORDER BY`, `GROUP BY`).
   - Tracks composite filter patterns and where+order pair frequency.

4. **Index suggestion engine**
    - Applies heuristic scoring with confidence.
    - Applies impact scoring (read gain vs write cost).
    - Performs schema-aware overlap checks against discovered existing indexes.
    - Emits dialect-aware `CREATE INDEX` statements.
    - Adds ORM snippet equivalents for SQLAlchemy and Django.

## Installation
  
```bash
cd AuraIndex
python -m venv .venv
.venv\Scripts\activate
pip install -e .[dev]
```

AuraIndex wheels now bundle WASM runtime + grammar assets (Python/JS/TS/TSX/PHP), so npm setup is not required for normal package usage.

Optional (source-development override of local grammars/runtime):

```bash
npm install web-tree-sitter tree-sitter-python tree-sitter-javascript tree-sitter-typescript tree-sitter-php
```

For live database introspection (Postgres/MySQL drivers):

```bash
pip install -e .[dev,db]
```

After publishing to PyPI, install the package with:

```bash
pip install AuraIndex
```

## Quick start

```bash
auraindex --path . --dialect postgres --min-support 2 --show-top
```

Outputs:

- By default, AuraIndex writes reports to a temp folder: `%TEMP%\auraindex-report-*`
- On completion, AuraIndex automatically opens the interactive HTML report in your default browser.
- Artifacts:
  - `auraindex_report.json`: extraction, parsing, workload, and suggestion details
  - `auraindex_indexes.sql`: executable **SQL index script** (DDL = Data Definition Language)
  - `auraindex_migration.sql`: up/down migration SQL (create + rollback)
  - `auraindex_orm_snippets.md`: ORM equivalents
  - `auraindex_report.md`: detailed human-readable report with ranking and evidence
  - `auraindex_report.html`: interactive report with filters, copy buttons, and evidence drill-down

## Useful CLI flags

- `--report-dir D:\reports\scan-01`: write outputs to a fixed directory.
- `--json-out custom.json`, `--ddl-out custom.sql`, `--migration-out migration.sql`, `--orm-out custom.md`, `--md-out details.md`, `--html-out ui.html`
- `--show-top`: print ranked recommendations in terminal.
- `--limit-suggestions 25`: optionally limit ranked suggestions shown in markdown/HTML/terminal; by default AuraIndex shows all suggestions.
- `--strict-parse`: exit code `2` when parse failures are found.
- `--disable-wasm`: scan only `.sql` files (debug mode).
- `--disable-impact-scoring`: turn off net impact rank adjustment.
- `--disable-schema-checks`: turn off existing-index overlap suppression.
- `--disable-index-type-hints`: turn off engine-aware index type recommendation metadata.
- `--disable-drift-watch`: turn off schema drift watch output.
- `--disable-query-advisories`: turn off query optimization advisories.
- `--max-query-advisories 150`: cap advisory rows in report metadata.
- `--enable-db-introspection`: enable live DB schema/index introspection.
- `--db-dialect postgres|mysql|sqlite`: live DB dialect (defaults to `--dialect` if omitted).
- `--db-host`, `--db-port`, `--db-user`, `--db-password` / `--db-password-env`, `--db-name`
- `--db-path` (sqlite), `--db-schema` (postgres schema filter), `--db-timeout-seconds`
- `--disable-migration-output`: skip writing migration SQL artifact.
- `--explain-validate --explain-db-path D:\db.sqlite`: run sqlite EXPLAIN validation.
- `--explain-validate --explain-command "psql ... -c \"EXPLAIN {sql}\""`: run custom EXPLAIN command.

Use help anytime:

```bash
auraindex --help
```

## WASM runtime assets

```bash
npm install web-tree-sitter tree-sitter-python tree-sitter-javascript tree-sitter-typescript tree-sitter-php
auraindex --path .
```

AuraIndex bundles default WASM grammar/runtime assets in the wheel.
It can also auto-detect external grammars from local `node_modules`, and supports `AURAINDEX_WASM_GRAMMAR_DIR`.

If WASM grammars are unavailable, AuraIndex now degrades gracefully:

- Python files are still scanned using a built-in Python AST fallback.
- SQL files continue to be scanned normally.
- JS/TS/PHP source files are scanned with a built-in text fallback extractor (raw SQL + query-builder pattern lifting) if WASM is unavailable.

Default excluded directories include generated assets and build outputs:

- `.git`, `node_modules`, `__pycache__`, `.venv`, `venv`
- `build`, `dist`, `staticfiles`, `.next`

This avoids parse noise from minified vendor bundles in production/static folders.

## Cardinality hints (optional)

You can provide a JSON file to bias recommendations:

```json
{
  "users.email": "high",
  "orders.status": "low"
}
```

Use:

```bash
auraindex --path . --cardinality-hints cardinality_hints.json
```

Hints are guardrails only; they do not replace runtime telemetry.

## Example report fields

`auraindex_report.json` includes:

- `summary`: scanned files, candidate count, parse success/failures, suggestion count
- `candidates`: extracted SQL with `file`, `line`, `function`, `source_kind`
- `workload`: per-table and per-column usage metrics
- `suggestions`: index DDL, confidence, evidence, impact metadata, schema checks, and manual-review flags
- `detailed_index_evidence`: per-index support counts, matched query excerpts, and source locations

## How confidence is computed

Confidence is a bounded heuristic score in `[0.50, 0.98]`, not a probabilistic guarantee.

- Join-key suggestions start higher (`~0.80`) and rise with repeated JOIN evidence.
- Filter-key suggestions start lower (`~0.74`) and rise with repeated WHERE evidence.
- Composite suggestions start higher (`~0.82-0.84`) because repeated multi-column patterns are stronger signals.
- Cardinality hints can nudge confidence (`high` raises, `low` lowers).
- Manual-review flag is set for lower-confidence suggestions and wider composites.

In short: more repeated, structurally strong evidence => higher confidence.

## What `--min-support` means

`--min-support` is the minimum repetition threshold before AuraIndex recommends an index.

Example:

- `--min-support 1`: include one-off patterns (higher recall, more noise)
- `--min-support 3`: include only repeated patterns (cleaner, more conservative)

Use lower values for exploration, higher values for production-oriented recommendations.

## Composite index handling

AuraIndex explicitly models composite opportunities and prunes redundant suggestions:

- Detects repeated multi-column WHERE patterns
- Detects WHERE + ORDER BY patterns for left-prefix composites
- Suppresses redundant single-column suggestions when a stronger composite prefix exists
- Limits per-table output via `--max-indexes-per-table`

## Impact scoring model (default ON)

AuraIndex estimates whether an index is likely to be worth it:

- **Read benefit**: weighted by WHERE/JOIN/ORDER/GROUP usage and support evidence.
- **Write cost**: estimated from table write ratio and index width.
- **Net impact** = read benefit - write cost.

This net impact adjusts ranking (`score_adjustment`) and is shown in JSON/Markdown/HTML reports.

Disable with `--disable-impact-scoring` when you want raw heuristic-only ranking.

## Schema-aware checks (default ON)

AuraIndex scans discovered SQL DDL and extracts existing indexes, then checks each suggestion for:

- exact match (`skip_exact_existing`)
- covered by existing left-prefix (`skip_covered_by_existing_prefix`)
- narrower overlap requiring review (`review_overlaps_existing_prefix`)
- unique/partial overlap signals

Suggestions already covered by existing indexes are skipped by default to reduce noise.

Disable with `--disable-schema-checks` if you want full raw suggestions.

With `--enable-db-introspection`, AuraIndex also pulls existing indexes directly from live DB metadata:

- PostgreSQL: `pg_indexes`
- MySQL: `INFORMATION_SCHEMA.STATISTICS`
- SQLite: `PRAGMA index_list` + `PRAGMA index_info`

This significantly improves practical overlap detection when schema is not fully represented in code migrations.

## Migration SQL (up/down)

AuraIndex emits `auraindex_migration.sql` with:

- **Up**: suggested `CREATE INDEX ...`
- **Down**: corresponding rollback (`DROP INDEX ...`)

Disable file generation with `--disable-migration-output`.

## Optional EXPLAIN validation

AuraIndex can annotate suggestions with execution-plan checks:

- SQLite mode: `--explain-validate --explain-db-path <path>`
- External mode: `--explain-validate --explain-command "<command with {sql}>"`

Results are included in report metadata and rendered in Markdown/HTML.

## Engine-aware index type hints (default ON)

AuraIndex recommends an index access method per suggestion (dialect-aware):

- `btree` as baseline default
- `hash` for equality-heavy single-column patterns (low write ratio)
- `brin` for time-correlated columns in append-style workloads
- `gin` for JSON/metadata-like columns
- `ivfflat` hint for vector-like columns (manual operator-class tuning required)

The report includes rationale and typed DDL variant when safely renderable.

Disable with `--disable-index-type-hints`.

## Schema drift watch (default ON)

AuraIndex emits drift-oriented metadata:

- possibly unused existing indexes (non-unique indexes with no observed column usage)
- missing index candidates from current scan

Rendered in Markdown/HTML under **Schema Drift Watch**.

Disable with `--disable-drift-watch`.

When live DB introspection is enabled, drift watch uses live index inventory as the primary signal.

## Query optimization advisories (default ON)

AuraIndex includes static query-level improvement hints:

- `SELECT *` projection warning
- `ORDER BY` without `LIMIT` warning
- leading wildcard `LIKE '%x'` warning

Rendered in Markdown/HTML under **Query Optimization Advisories**.

Disable with `--disable-query-advisories`.

## Latest end-to-end validation snapshot

Run command:

```bash
auraindex --path D:\SuccessPilot --dialect postgres --min-support 2 --report-dir D:\AutoCure\AuraIndex\reports\successpilot-strong-adds
```

Observed output snapshot (latest run):

- scanned files: `1002`
- query candidates: `62`
- parsed queries: `62`
- parse failures: `0`
- suggested indexes: `17`
- schema checks: existing index discovery and overlap checks enabled by default

Generated artifacts:

- `auraindex_report.json`
- `auraindex_indexes.sql`
- `auraindex_migration.sql`
- `auraindex_orm_snippets.md`
- `auraindex_report.md`
- `auraindex_report.html`

Live DB mode examples:

```bash
# PostgreSQL
auraindex --path . --dialect postgres --enable-db-introspection --db-dialect postgres --db-host localhost --db-port 5432 --db-user app --db-password-env DB_PASSWORD --db-name appdb --db-schema public

# MySQL
auraindex --path . --dialect mysql --enable-db-introspection --db-dialect mysql --db-host localhost --db-port 3306 --db-user app --db-password-env DB_PASSWORD --db-name appdb

# SQLite
auraindex --path . --dialect sqlite --enable-db-introspection --db-dialect sqlite --db-path D:\data\app.db
```

Per-service verification (SuccessPilot microservices):

- `ai-worker`: `fail=0`
- `analytics`: `fail=0`
- `backend`: `fail=0` (after excluding `staticfiles` generated assets)
- `frontend`: `fail=0` (after excluding `build` artifacts)
- `frontend_new`: `fail=0`
- `live-updates`: `fail=0`
- `onboarding`: `fail=0`

## Limitations

- Static analysis cannot observe runtime frequency, parameter distributions, or write pressure.
- ORM-to-SQL reconstruction is approximate for complex query builders.
- EXPLAIN validation is optional and environment-dependent (DB connectivity and compatible SQL needed).
- Suggestions are intentionally conservative and should still be reviewed before production rollout.

## Testing

```bash
pytest -q
```

## Build and publish (PyPI)

```bash
python -m pip install -U build twine
python -m build
twine check dist/*
# twine upload dist/*
```

Release checklist:

- Package name available on PyPI.
- Version bumped (`pyproject.toml`).
- Long description renders (`twine check` passes).
- `LICENSE` and `README.md` included in source distribution.

## Roadmap

- Add richer ORM semantic lifting (SQLAlchemy Core/ORM constructs).
- Add log-based dynamic signal fusion.
- Add adaptive suggestion ranking using execution metrics.
- Expand language frontends while retaining the shared SQL AST backend.
