Metadata-Version: 2.4
Name: datanaut
Version: 0.1.1
Summary: A CLI SQL agent that converts natural language to optimized BigQuery SQL using GPT-4
Author-email: Your Name <you@example.com>
License: MIT
Project-URL: Homepage, https://github.com/yourname/datanaut
Project-URL: Issues, https://github.com/yourname/datanaut/issues
Keywords: bigquery,sql,ai,gpt4,cli,nlp,data
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.30.0
Requires-Dist: google-cloud-bigquery>=3.11.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: wcwidth>=0.2.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# 🚀 Datanaut — BigQuery SQL Agent

[![PyPI version](https://badge.fury.io/py/datanaut.svg)](https://pypi.org/project/datanaut/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A CLI agent that converts plain English questions into optimized BigQuery SQL using GPT-4.
Built for data teams who want to query BigQuery without writing SQL.

```
  ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧

  DATANAUT  >>  SQL Explorer v1.0
  Navigating the BigQuery galaxy...

  ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧ · ✦ · ✧
```

## Installation

```bash
pip install datanaut
```

## Quick start

```bash
# 1. Scaffold config in your project folder
datanaut init

# 2. Fill in your keys in .env
#    OPENAI_API_KEY=sk-...
#    BIGQUERY_PROJECT_ID=your-project

# 3. Edit schema.json with your BigQuery tables / table functions

# 4. Authenticate with GCP
gcloud auth application-default login

# 5. Launch
datanaut
```

## Usage

```bash
datanaut                              # interactive REPL
datanaut ask "total revenue today"    # one-shot query
datanaut ask "..." --dry-run          # validate SQL, skip execution
datanaut test                         # run accuracy test suite
datanaut init                         # scaffold .env + schema.json
datanaut --version
```

### Interactive REPL example

```
  🧑‍🚀  You: show me total revenue by region last month

  🗺️   Retrieving relevant schema...
  🤖   Generating SQL with GPT-4...
  ⚗️   Running BigQuery dry run...

  ╔══════════════════════════════════════════════╗
  ║  >>  MISSION QUERY                           ║
  ╠══════════════════════════════════════════════╣
  ║  WITH last_month AS (                        ║
  ║    SELECT region,                            ║
  ║           SUM(total_revenue) AS revenue      ║
  ║    FROM `project.dataset.orders`             ║
  ║    WHERE order_date >= ...                   ║
  ║  )                                           ║
  ║  SELECT * FROM last_month                    ║
  ║  ORDER BY revenue DESC LIMIT 1000            ║
  ╚══════════════════════════════════════════════╝

  ✅  Estimated scan: 84.2 MB
  🚀  Launch this query? (y/n):
```

## Configuration

### .env

```env
OPENAI_API_KEY=sk-...
BIGQUERY_PROJECT_ID=your-gcp-project-id
BQ_COST_THRESHOLD_BYTES=5368709120   # 5 GB — queries above this are rejected
CACHE_ENABLED=true
CACHE_TTL_SECONDS=3600
GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json   # optional
```

### schema.json

Describe your BigQuery tables or table functions. The richer the descriptions,
the more accurate GPT-4's SQL will be.

```json
[
  {
    "dataset": "your_dataset",
    "table_name": "orders",
    "description": "Customer orders with revenue and status",
    "partition_column": "order_date",
    "tags": ["sales", "revenue", "orders"],
    "columns": [
      { "name": "order_id",      "type": "STRING",  "description": "Unique order ID" },
      { "name": "order_date",    "type": "DATE",    "description": "Order date — partition column" },
      { "name": "total_revenue", "type": "FLOAT64", "description": "Revenue in USD" }
    ]
  }
]
```

For BigQuery **table functions (TVFs)**, use:

```json
{
  "function_name": "healing_sessions",
  "description": "Session data filtered by date range",
  "call_syntax": "`project`.`dataset`.healing_sessions(start_date_param, end_date_param)",
  "parameters": [
    { "name": "start_date_param", "type": "DATE" },
    { "name": "end_date_param",   "type": "DATE" }
  ],
  "columns": [ ... ]
}
```

## GCP Authentication

```bash
# Option A — personal credentials (easiest)
gcloud auth application-default login

# Option B — service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
```

## Features

- GPT-4 with function calling for structured, reliable SQL output
- BigQuery dry run before every execution — validates SQL and estimates cost
- Configurable cost threshold — rejects expensive queries automatically
- Self-correcting retry loop — passes BigQuery errors back to GPT-4
- In-memory result cache with TTL (Redis-ready interface)
- Multi-turn conversation memory — refine queries across turns
- Keyword-based schema retrieval (swap for vector search in production)
- Batch test runner for accuracy regression testing

## License

MIT
