Metadata-Version: 2.4
Name: hexadruid
Version: 0.1.5
Summary: Smart Spark Optimizer: Skew Rebalancer + Key Detector + DRTree
Home-page: https://github.com/OmarAttia95/hexadruid
Author: Omar Attia
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyspark>=3.5.1
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# HexaDruid 🧠⚡

[![PyPI version](https://badge.fury.io/py/hexadruid.svg)](https://badge.fury.io/py/hexadruid)
[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**HexaDruid** is a Spark optimizer that tackles **data skew**, **ambiguous key detection**, and **schema bloat** using smart salting, recursive shard-aware rule trees, and adaptive tuning. It aims to improve parallelism, keep memory layout safe, and expose diagnostics for skewed datasets, all through PySpark’s native DataFrame API.

---

## 🚀 Installation

```bash
pip install hexadruid
```

---

## 🔍 Features

- 📊 **Smart Salting** using Z-score or IQR skew analysis + percentile bucketing
- 🌲 **Recursive DRTree** for shard-based logical filtering with SQL predicates
- 🔑 **Primary & Composite Key Detection** (UUIDs, alphanumerics, hex — optional)
- 🧠 **Schema Inference** with safe type coercion, length introspection & metadata tags
- ⚙️ **Auto-Parameter Advisor** for optimal salt count and shuffle parallelism
- 📉 **Z-Score Plots** and **partition size diagnostics** for visibility
- ✅ Fully **PySpark-native** — No RDDs, no CLI dependencies, no black-box wrappers

---

## 🧠 Quickstart

```python
from hexadruid import HexaDruid

# df is an existing PySpark DataFrame
hd = HexaDruid(df)

# Step 1: Apply smart salting to balance skew
df_salted = hd.apply_smart_salting("sales_amount")

# Step 2 (Optional): Detect candidate primary or composite keys
key_info = hd.detect_keys()

# Step 3: Run schema optimizer + DRTree analyzer
typed_df, inferred_schema, dr_tree = HexaDruid.schemaVisor(df)
```

---

## 📚 What Does It Do?

Imagine a typical DataFrame:

| order_id (UUID) | amount  |
|-----------------|---------|
| a12e...         | 500.0   |
| b98c...         | 5000.0  |
| ...             | ...     |

You're doing:

```python
df.groupBy("amount").agg(...)
```

But **most rows share the same `amount`**, so Spark routes nearly all of the work for that key to a single task, which is skew 💥

---

### ⚖️ Smart Salting to the Rescue

```python
df2 = hd.apply_smart_salting("amount")
```

What happens?

```
 Step 1: Analyze column distribution via IQR or Z-score
 Step 2: Generate N percentile buckets
 Step 3: Assign salt ID per row using bucket bounds
 Step 4: Create salted_key = amount_salt
 Step 5: Repartition on salted_key for parallelism
```

📈 This rebalances the shuffle phase for joins, groupBy, and aggregates.
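
Steps 2 to 4 can be illustrated in plain Python on a small list. This is only a sketch of the bucketing idea, not HexaDruid's implementation: the helper names (`percentile_bounds`, `assign_salt`, `salted_key`) are hypothetical, and the standard library's `statistics.quantiles` stands in for Spark's percentile computation.

```python
from bisect import bisect_right
from statistics import quantiles


def percentile_bounds(values, n_buckets):
    """Return the n_buckets - 1 interior cut points that split values into percentile buckets."""
    return quantiles(values, n=n_buckets, method="inclusive")


def assign_salt(value, bounds):
    """Map a value to a salt ID: the index of the percentile bucket it falls in."""
    # A value equal to a cut point goes to the upper bucket (lower <= value < upper).
    return bisect_right(bounds, value)


def salted_key(value, bounds):
    """Build the '<value>_<salt>' key described in Step 4."""
    return f"{value}_{assign_salt(value, bounds)}"


amounts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bounds = percentile_bounds(amounts, n_buckets=4)
salts = [assign_salt(a, bounds) for a in amounts]
print(bounds)
print(salts)
```

In Spark, the resulting key column would then be used for the repartition in Step 5.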

---

### 🧠 DRTree Explained Visually

The DRTree is a **decision-rule tree**, not a classifier.

It recursively splits data into shards by applying SQL-style predicates. Each leaf is a filtered logical subset of the DataFrame.

```
                        [Root: sales_amount]
                                |
                   ┌───────────┴────────────┐
        [amount <= 500]             [amount > 500]
               |                           |
       ┌───────┴───────┐            ┌──────┴───────┐
 [amount <= 100] [>100, ≤500]   [>500, ≤1000]  [>1000]
       |         |                  |             |
   [Leaf A]   [Leaf B]          [Leaf C]       [Leaf D]
 (shard_1)   (shard_2)         (shard_3)      (shard_4)
```

Each **leaf** holds:
- A filtered subset of the DataFrame (as a Spark SQL query)
- Associated metadata such as row count, min/max, and schema drift

Auto key detection can also run **within** these shards.
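
The recursive splitting in the diagram can be sketched in plain Python. This is a hedged illustration and not the library's DRTree: `RuleNode`, `build_tree`, and `leaf_predicates` are hypothetical names, and the thresholds are taken from the diagram rather than learned from data.

```python
class RuleNode:
    """One node of a decision-rule tree; `predicate` is a SQL-style filter string."""

    def __init__(self, predicate, left=None, right=None):
        self.predicate = predicate
        self.left = left
        self.right = right


def build_tree(column, thresholds, lo=None, hi=None):
    """Recursively split the (lo, hi] range of `column` at the middle threshold."""
    parts = []
    if lo is not None:
        parts.append(f"{column} > {lo}")
    if hi is not None:
        parts.append(f"{column} <= {hi}")
    node = RuleNode(" AND ".join(parts) or "TRUE")
    if thresholds:
        mid = len(thresholds) // 2
        split = thresholds[mid]
        node.left = build_tree(column, thresholds[:mid], lo, split)
        node.right = build_tree(column, thresholds[mid + 1:], split, hi)
    return node


def leaf_predicates(node):
    """Collect the predicates of all leaves, left to right."""
    if node.left is None and node.right is None:
        return [node.predicate]
    return leaf_predicates(node.left) + leaf_predicates(node.right)


tree = build_tree("amount", [100, 500, 1000])
leaves = leaf_predicates(tree)
for shard_id, predicate in enumerate(leaves, start=1):
    print(f"shard_{shard_id}: {predicate}")
```

Each predicate string could be passed to a DataFrame filter to materialize the corresponding shard.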

---

### 🔬 Leaf-Level Parallelization

DRTree enables **parallel insight**:

- Each leaf is **autonomous** (you can infer schema, key, and stats per leaf)
- Makes the system robust to changes over time (drift detection)
- Enables controlled analytics:
  
```
[DRTree Output]
Leaf A:
  - rows: 30K
  - key_confidence: 0.92
  - type: Float(5,2)

Leaf D:
  - rows: 300K (hotspot!)
  - key_confidence: 0.12
  - type: String(255)
```
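
A per-leaf summary like the one above can be sketched in plain Python. The function name `summarize_leaves` and the `hotspot_share` threshold of 0.5 are assumptions made for illustration; the source does not state how HexaDruid decides that a leaf is a hotspot.

```python
def summarize_leaves(rows, leaves, hotspot_share=0.5):
    """Count rows per leaf and flag leaves holding more than `hotspot_share` of all rows."""
    total = len(rows)
    summary = {}
    for name, belongs in leaves.items():
        count = sum(1 for row in rows if belongs(row))
        is_hotspot = total > 0 and count / total > hotspot_share
        summary[name] = {"rows": count, "hotspot": is_hotspot}
    return summary


# Synthetic rows; each leaf is a Python predicate mirroring a shard filter.
rows = [{"amount": a} for a in [50, 80, 300, 700, 1500, 2000, 2500, 3000, 4000, 5000]]
leaves = {
    "Leaf A": lambda r: r["amount"] <= 100,
    "Leaf B": lambda r: 100 < r["amount"] <= 500,
    "Leaf C": lambda r: 500 < r["amount"] <= 1000,
    "Leaf D": lambda r: r["amount"] > 1000,
}
report = summarize_leaves(rows, leaves)
for name, stats in report.items():
    print(name, stats)
```

Because each leaf is evaluated independently, the same loop body could run per shard in parallel.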

---

## 🔑 Key Detection (Optional & Shard-Aware)

```python
key_info = hd.detect_keys()
```

You **don’t need to force primary keys**.

This is just **analysis** — it evaluates uniqueness confidence for each column (or combination of columns):

- **Primary Key:**

```python
score = (approx_count_distinct(col) / total_rows) - null_ratio
```

If `score ≥ 0.99`, it’s a good candidate.

- **Composite Key:**

```python
combo_key = concat_ws("_", col1, col2, ...)
score = approx_count_distinct(combo_key) / total_rows - null_ratio
```

DRTree passes its **shard filters** into `detect_keys()` to evaluate keys per **subgroup** — boosting accuracy.
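
The scoring formulas can be tried out on a handful of rows in plain Python. This sketch uses an exact distinct count instead of `approx_count_distinct`, and it assumes a combination counts as null when any of its columns is null; `key_score` is a hypothetical helper, not part of the HexaDruid API.

```python
def key_score(rows, columns):
    """Uniqueness score for a column or column combination: distinct ratio minus null ratio."""
    total = len(rows)
    if total == 0:
        return 0.0
    combos = [tuple(row.get(c) for c in columns) for row in rows]
    # Assumption: a combination is "null" if any of its columns is null.
    complete = [combo for combo in combos if None not in combo]
    null_ratio = (total - len(complete)) / total
    return len(set(complete)) / total - null_ratio


rows = [
    {"order_id": "a1", "region": "EU", "day": 1},
    {"order_id": "b2", "region": "EU", "day": 2},
    {"order_id": "c3", "region": "US", "day": 1},
    {"order_id": "d4", "region": "US", "day": 2},
]
print(key_score(rows, ["order_id"]))       # single-column candidate
print(key_score(rows, ["region"]))         # low uniqueness on its own
print(key_score(rows, ["region", "day"]))  # composite candidate
```

Running the same function over the rows of a single shard gives the per-subgroup score described above.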

---

## 🧠 Smart Salting Internals

### 🧪 Step-by-step:

1. **Detect Skew**  
   - The `z_score` range is too large, or  
   - The quartiles show asymmetry (Q3 - Q2 ≫ Q2 - Q1)

2. **Split by Percentiles**

```python
# Deciles from 0.0 to 1.0 in steps of 0.1
percentiles = percentile_approx("amount", [i / 10 for i in range(11)])
```

3. **Salt Bucketing Logic**

```python
# Parenthesize each comparison: in Python, & binds tighter than >= and <
salt = when((col >= p0) & (col < p1), 0) \
     .when((col >= p1) & (col < p2), 1) ...
```

4. **Create Salted Key**

```python
salted_key = concat_ws("_", col("amount"), col("salt"))
df = df.withColumn("salted_key", salted_key).repartition("salted_key")
```

5. **Auto-Tune Salt Count**

- If distribution is dense, fewer buckets suffice
- Otherwise, more salting is applied dynamically
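
The skew check from step 1 can be sketched with the standard library. The thresholds `max_z_range=6.0` and `max_asymmetry=3.0` are placeholder assumptions, since the source does not say what counts as "too large", and the function names are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def z_score_range(values):
    """Spread of z-scores: (max - min) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    z_scores = [(v - mu) / sigma for v in values]
    return max(z_scores) - min(z_scores)


def quartile_asymmetry(values):
    """Ratio (Q3 - Q2) / (Q2 - Q1); values far above 1 indicate a long right tail."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    lower, upper = q2 - q1, q3 - q2
    if lower == 0:
        return float("inf") if upper > 0 else 1.0
    return upper / lower


def is_skewed(values, max_z_range=6.0, max_asymmetry=3.0):
    """Flag skew if either the z-score range or the quartile asymmetry exceeds its threshold."""
    return z_score_range(values) > max_z_range or quartile_asymmetry(values) > max_asymmetry


uniform = list(range(1, 101))
long_tail = [1, 2, 4, 8, 16, 32, 64, 128, 256]
print(is_skewed(uniform), is_skewed(long_tail))
```

A check like this only decides whether salting is worth applying; the salt count itself would come from the auto-tuning in step 5.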

---

## 📈 Visualization Example

Output from `schemaVisor()`:

```
Leaf Node A [shard_0]
- size: 102,391
- type: Float(8,2)
- confidence: 92%

Leaf Node B [shard_1]
- size: 489,128 (dense zone)
- skew detected!
- Recommended salt count: 10
```

You can visualize the Z-score distribution:

```
Before:
  [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■             ]

After:
  [■■■■■■■■■■■■■■■■■■■■■      ■■■■■■■■■■■■■■■■■■■■■■   ]
```

---

## 🧪 Testing

```bash
pytest tests/
```

The test suite uses a mocked `SparkSession` and synthetic data, aiming for full coverage.

---

## 🧱 Suggested Project Structure

```
hexadruid/
├── __init__.py
├── core.py                # HexaDruid entry point
├── skew_balancer.py       # Smart salting logic
├── drtree.py              # DRTree shard splitting
├── key_detection.py       # Unique key checker
├── schema_optimizer.py    # Type inference
├── advisor.py             # Parameter tuning
├── utils.py               # Logging, plots, etc.
└── tests/                 # Test suite
```

---

## 🔧 Roadmap

- [ ] CLI interface  
- [ ] Delta Lake + Iceberg support  
- [ ] JupyterLab extension  
- [ ] DRTree JSON export for audits  
- [ ] Cost metrics estimation  
- [ ] Column statistics and visualization dashboard

---

## 📄 License

MIT License

---

## 🤝 Contributing

Pull requests, ideas, and contributions welcome!

We believe Spark shouldn’t be slow. Let’s make it smarter together.

---
