Metadata-Version: 2.4
Name: tedcheck
Version: 0.1.0
Summary: UMAP Segment Validation tool
Author-email: Tergel Munkhbaatar <tergelitu@example.com>
License: MIT
Project-URL: Homepage, https://github.com/tergelitu/tedcheck
Project-URL: Repository, https://github.com/tergelitu/tedcheck.git
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# TEDCHECK - UMAP Segment Validation Tool

A Python package for visualizing and validating customer segments with UMAP.

## Features

- **Flexible Configuration**: Customize column names, exclude/include features dynamically
- **UMAP Dimensionality Reduction**: 2D visualization of customer segments
- **Quality Metrics**: Calculate silhouette scores, feature importance, and purity metrics
- **Interactive Visualizations**: Plotly-based interactive charts
- **Preset Configurations**: Built-in presets for different use cases
- **CLI and Library**: Run from the terminal or import as a Python package

## Installation

```bash
# Editable install; run from the root of a local checkout of the repository
pip install -e .
```

## Quick Start

### Terminal Usage

```bash
# Using default configuration
tedcheck data.csv

# With custom columns
tedcheck data.csv --user-id customer_id --segment-col tier --time-col month

# Using presets
tedcheck data.csv --preset ecommerce --metrics

# Custom configuration file
tedcheck data.csv --config my_config.json

# Skip time column if not available
tedcheck data.csv --skip-time
```

### Python Usage

```python
from tedcheck import Config, apply_umap_reduction, calculate_umap_metrics
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Create config
config = Config(
    user_id_col='customer_id',
    segment_col='tier',
    skip_time=False
)

# Apply UMAP reduction
df_umap, embedding, num_cols = apply_umap_reduction(df, config=config)

# Calculate metrics
metrics = calculate_umap_metrics(df_umap, embedding, df, config=config)
```
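
To try the commands above without real data, a small synthetic file can be generated first. The sketch below is only an illustration: `user_id`, `base_month`, and `segment` are the documented default column names, while the `spend` and `visits` features, the segment labels, and the `2024-01` month format are made-up assumptions.

```python
import csv
import random


def write_sample_csv(path, n_users=200, seed=42):
    """Write a synthetic CSV that uses tedcheck's default column names."""
    rng = random.Random(seed)
    segments = ["low", "mid", "high"]  # hypothetical segment labels
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        # user_id, base_month and segment are the documented defaults;
        # spend and visits are invented numeric features.
        writer.writerow(["user_id", "base_month", "segment", "spend", "visits"])
        for user_id in range(n_users):
            level = rng.randrange(len(segments))
            writer.writerow([
                user_id,
                "2024-01",  # assumed month format
                segments[level],
                round(rng.gauss(100 * (level + 1), 15), 2),
                max(0, round(rng.gauss(5 * (level + 1), 2))),
            ])
    return path
```

Calling `write_sample_csv("data.csv")` produces a file that can then be passed to `tedcheck data.csv` or loaded with `pd.read_csv`.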

## Configuration

### Default Configuration

```json
{
  "user_id_col": "user_id",
  "time_col": "base_month",
  "segment_col": "segment",
  "cluster_col": "cluster_kmeans",
  "exclude_cols": ["user_id", "base_month", "segment"],
  "include_cols": null,
  "n_neighbors": 50,
  "min_dist": 0.1,
  "random_state": 42,
  "skip_time": false
}
```
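
A custom configuration file for `--config` can be produced from these defaults with the standard library. This is a minimal sketch: the keys and values are copied from the default configuration, but `make_config` and `save_config` are hypothetical helpers, not part of tedcheck, and writing every key is a precaution because it is not stated whether partial files are accepted.

```python
import json

# Keys and default values copied from the documented default configuration.
DEFAULTS = {
    "user_id_col": "user_id",
    "time_col": "base_month",
    "segment_col": "segment",
    "cluster_col": "cluster_kmeans",
    "exclude_cols": ["user_id", "base_month", "segment"],
    "include_cols": None,
    "n_neighbors": 50,
    "min_dist": 0.1,
    "random_state": 42,
    "skip_time": False,
}


def make_config(**overrides):
    """Return the defaults with selected keys overridden (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def save_config(config, path):
    """Write the configuration as indented JSON."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
```

For example, `save_config(make_config(user_id_col="customer_id", n_neighbors=15), "my_config.json")` writes a file that can be passed as `tedcheck data.csv --config my_config.json`.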

## Preset Configurations

### Default (General Purpose)
```bash
tedcheck data.csv --preset default
```

### E-commerce
```bash
tedcheck data.csv --preset ecommerce
# Uses: customer_id, purchase_month, customer_tier
```

### SaaS
```bash
tedcheck data.csv --preset saas
# Uses: account_id, billing_month, account_segment
```

## CLI Options

```
Usage: tedcheck <csv_file> [OPTIONS]

Options:
  --base-month <value>          Filter by specific month
  --user-id <col>               User ID column name
  --time-col <col>              Time column name
  --segment-col <col>           Segment column name
  --cluster-col <col>           Cluster column name
  --exclude-cols <col1,col2>    Columns to exclude
  --include-cols <col1,col2>    Columns to include only
  --n-neighbors <int>           UMAP n_neighbors
  --min-dist <float>            UMAP min_dist
  --metrics                     Calculate and save metrics
  --config <json_file>          Load config from JSON
  --preset <name>               Load preset (default, ecommerce, saas)
  --skip-time                   Skip time column if not available
```
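
When tedcheck is driven from a script, for instance to run several files with the same options, the argument list can be assembled programmatically. The sketch below only builds the list; it uses the flags documented above, and `build_command` itself is a hypothetical helper rather than part of the package.

```python
# Flags taken from the CLI options above. Value flags take one argument,
# switches take none.
VALUE_FLAGS = {
    "base-month", "user-id", "time-col", "segment-col", "cluster-col",
    "exclude-cols", "include-cols", "n-neighbors", "min-dist",
    "config", "preset",
}
SWITCHES = {"metrics", "skip-time"}


def build_command(csv_file, **options):
    """Return an argv list for one tedcheck run (hypothetical helper)."""
    argv = ["tedcheck", csv_file]
    for name, value in options.items():
        flag = name.replace("_", "-")
        if flag in SWITCHES:
            if value:  # switches are only emitted when truthy
                argv.append(f"--{flag}")
        elif flag in VALUE_FLAGS:
            if isinstance(value, (list, tuple)):
                value = ",".join(value)  # col1,col2 form
            argv += [f"--{flag}", str(value)]
        else:
            raise ValueError(f"not a documented tedcheck option: --{flag}")
    return argv
```

The result can be handed to `subprocess.run(argv, check=True)`; passing a list avoids shell quoting problems with unusual column names.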

## Output Files

- `*_umap_results.csv` - UMAP coordinates with user IDs and segments
- `umap_segment_*.html` - Interactive visualizations by month (if time column exists)
- `umap_segment_all.html` - Single visualization (if no time column)
- `umap_metrics.json` - Quality metrics (with `--metrics` flag)
- `feature_importance.csv` - Feature importance scores (with `--metrics` flag)
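
After a run, the files listed above can be collected with a few glob patterns. This is a sketch under the assumption that all outputs land in one directory (the working directory by default); `find_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Patterns taken from the output file list above.
OUTPUT_PATTERNS = {
    "results": "*_umap_results.csv",
    "plots": "umap_segment_*.html",
    "metrics": "umap_metrics.json",
    "feature_importance": "feature_importance.csv",
}


def find_outputs(directory="."):
    """Map each output kind to the sorted file names found in `directory`."""
    root = Path(directory)
    return {
        kind: sorted(p.name for p in root.glob(pattern))
        for kind, pattern in OUTPUT_PATTERNS.items()
    }
```

An empty list under `metrics` or `feature_importance` usually just means the run did not use `--metrics`.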

## Metrics Explained

### Silhouette Score
- Range: -1 to 1
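- Near 1: Well-separated clusters; points sit close to their own cluster and far from others
- Near 0: Overlapping clusters
- Near -1: Points are likely assigned to the wrong cluster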
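
To make the scale concrete, the score can be computed by hand on a toy dataset. The sketch below is a plain-Python version of the standard silhouette definition with Euclidean distance; it is not tedcheck's implementation, which may rely on a library routine instead.

```python
import math


def silhouette_avg(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)  # b is the mean distance to the nearest other cluster
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_avg(points, [0, 0, 1, 1])  # labels match the two groups
bad = silhouette_avg(points, [0, 1, 0, 1])   # labels cut across the groups
```

Here `good` comes out close to 1 and `bad` is negative, matching the interpretation above.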

### Feature Importance
- Importance of each feature in UMAP dimensions
- Higher = More influential

### Purity Metrics
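- **Homogeneity**: Degree to which each cluster contains members of only one class (0-1)
- **Completeness**: Degree to which all members of a class fall in the same cluster (0-1)
- **V-Measure**: Harmonic mean of homogeneity and completeness (0-1)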
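
These three scores follow from label entropies, which a short plain-Python sketch can show. It uses the standard entropy-based definitions; which column tedcheck treats as the reference classes and which as the clusters is an assumption here, and the returned key names are illustrative only.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) from two parallel label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def purity_metrics(classes, clusters):
    """Homogeneity, completeness and V-measure for two labelings."""
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    homogeneity = (
        1.0 if h_classes == 0
        else 1.0 - _conditional_entropy(classes, clusters) / h_classes
    )
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - _conditional_entropy(clusters, classes) / h_clusters
    )
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return {
        "homogeneity": homogeneity,
        "completeness": completeness,
        "v_measure": v_measure,
    }
```

Splitting every class into its own small clusters keeps homogeneity at 1 but lowers completeness, which is why the V-measure is reported alongside both.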

## Package Structure

```
tedcheck/
├── __init__.py          # Package initialization
├── config.py            # Configuration class
├── features.py          # Core UMAP functions
├── cli.py               # Command-line interface
├── utils.py             # Utility functions
├── exceptions.py        # Custom exceptions
├── logger.py            # Logging setup
└── configs/             # Preset configurations
    ├── default.json
    ├── ecommerce.json
    └── saas.json
```

## API Reference

### `Config` Class
```python
from tedcheck import Config

config = Config(
    user_id_col='id',
    segment_col='tier',
    skip_time=True
)

# Validate columns
missing = config.validate_columns(df)

# Load from file
config = Config.from_json('config.json')

# Load preset
config = Config.from_preset('ecommerce')

# Save config
config.to_json('my_config.json')
```

### `apply_umap_reduction()` Function
```python
from tedcheck import apply_umap_reduction, Config

df_umap, embedding, num_cols = apply_umap_reduction(
    df,
    config=config
)
```

### `calculate_umap_metrics()` Function
```python
from tedcheck import calculate_umap_metrics

metrics = calculate_umap_metrics(
    df_umap,
    embedding,
    df,
    config=config
)

print(metrics['silhouette_avg'])
print(metrics['feature_importance'])
print(metrics['purity_metrics'])
```

## Troubleshooting

### Missing Column Error
```bash
# Check available columns
python -c "import pandas as pd; print(pd.read_csv('data.csv').columns.tolist())"

# Use --skip-time if time column doesn't exist
tedcheck data.csv --skip-time
```
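
The same check can be scripted so that it reports exactly which expected columns are absent. The sketch below reads only the CSV header with the standard library; the default names come from the default configuration, and `missing_columns` is a hypothetical standalone helper, not the behaviour of `Config.validate_columns`.

```python
import csv


def missing_columns(csv_path, user_id_col="user_id", segment_col="segment",
                    time_col="base_month", skip_time=False):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="") as fh:
        header = next(csv.reader(fh), [])  # empty file -> empty header
    required = [user_id_col, segment_col]
    if not skip_time:
        required.append(time_col)  # only required when time is not skipped
    return [col for col in required if col not in header]
```

A non-empty result indicates which of `--user-id`, `--segment-col`, `--time-col`, or `--skip-time` needs to be set for the file.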

### Wrong Column Names
```bash
# Specify correct column names
tedcheck data.csv --user-id id --segment-col group --time-col month
```

### Memory Issues
```bash
# Use --include-cols to restrict the run to the most important features
tedcheck data.csv --include-cols feature1,feature2,feature3
```

## Contributing

Contributions are welcome. Please open an issue or submit a pull request on GitHub.

## License

MIT License - See LICENSE file for details

## Author

Tergel Munkhbaatar

## Version

0.1.0
