Metadata-Version: 2.4
Name: dataset-profiler
Version: 2.1.0
Summary: Data profiling tool for CSV and Excel files with cross-table relationship detection
Author: HelloPareto
License: MIT
Project-URL: Homepage, https://github.com/HelloPareto/dataset-profiler
Project-URL: Repository, https://github.com/HelloPareto/dataset-profiler.git
Keywords: data-profiling,csv,excel,synthetic-data,data-quality
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.5.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: typing_extensions>=4.0.0

# Data Profiler

A modular data profiling tool for CSV and Excel files that enables quick decisions about dataset compatibility and context priming for synthetic data generation.

## Features

- **Multi-format support**: Profile CSV files and Excel workbooks (each sheet as separate table)
- **Cross-table analysis**: Automatically detects foreign key relationships and primary key candidates
- **Intra-table relationships**: Numeric correlations, categorical associations (Cramér's V), and conditional statistics
- **Token-efficient output**: TXT output by default (compact, ideal for LLM context priming)
- **Comprehensive profiling**: Type inference, distributions, anomalies, high-cardinality analysis
- **Batch processing**: Profile all tables in a folder with a single command

## Installation

Requires Python 3.9+ and the following dependencies:

```bash
pip install pandas openpyxl typing_extensions
```

## Quick Start

```bash
# Profile all CSV/Excel files in a folder (TXT output)
python -m data_profiler /path/to/data/folder

# Also generate JSON output
python -m data_profiler /path/to/data/folder --json

# Specify output directory
python -m data_profiler /path/to/data/folder --out-dir ./reports
```

## CLI Options

```
positional arguments:
  folder                Folder containing CSV and/or Excel files to profile

options:
  --json                Also generate JSON output (default: TXT only)
  --out-dir PATH        Output directory (default: same as input folder)
  --sep SEP             CSV delimiter (default: ,)
  --encoding ENC        CSV encoding (optional)
  --nrows N             Profile only first N rows per file
  --low-memory          Pass low_memory=True to pandas.read_csv
  --date-threshold F    Fraction for date-like detection (default: 0.80)
  --topk N              Top-K examples for distributions (default: 10)
  --max-examples-scan N Max rows to scan for expensive ops (default: 200000)
  --seed N              Random seed for sampling (default: 7)
  --numeric-parse-threshold F
                        Parse-rate threshold for numeric detection (default: 0.90)
  --hc-prefix-len N     Prefix length for high-cardinality analysis (default: 3)
  -v, --verbose         Enable verbose logging
```

## Output Formats

### TXT Report (Default)

Compact, token-efficient format ideal for LLM context priming:

```
PROFILE REPORT
============================================================
folder: /data/retail/store_data
tables: 3
generated: 2026-01-14T10:30:00
runtime: 0.142s

== customers.csv (rows=1000, cols=5) [0.045s]
col | dtype | sem | card | null% | distinct%
------------------------------------------------------------------------
customer_id | object | id-like | low | 0.00% | 100.00%
  hc len(p50/p90/p95/p99)=8/8/8/8
    pref CUS:1000
name | object | text | high | 0.00% | 95.00%
  hc len(p50/p90/p95/p99)=12/18/22/25
email | object | text | high | 0.00% | 100.00%
signup_date | object | date-like | med | 0.50% | 45.00%
  date parse=99.50% range=2022-01-01..2024-12-31
    years 2022:250 2023:400 2024:350
is_active | bool | boolean | low | 0.00% | 40.00%
  num 0..1 mean=0.75 p50=1 p95=1

== orders.csv (rows=5000, cols=6) [0.082s]
...

============================================================
CROSS-TABLE RELATIONSHIPS
============================================================

Primary Key Candidates:
  customers.csv: customer_id
  orders.csv: order_id

Foreign Key Hints:
  orders.csv.customer_id -> customers.csv.customer_id (confidence=0.95, evidence=both)
```

### JSON Report (Optional)

Full detailed output including anomaly analysis. Generated with `--json` flag.

## Semantic Types

The profiler infers one of 9 semantic types for each column:

| Type | Description |
|------|-------------|
| `empty` | No rows or all-null column |
| `boolean` | Bool dtype or string vocab {true, false, yes, no, 0, 1} |
| `year-like` | Integer column with values in year range (1900-2100) |
| `numeric` | Numeric dtype or string that parses as numeric |
| `date-like` | Datetime dtype or string that parses as datetime |
| `categorical` | Low cardinality with low distinct fraction |
| `id-like` | Near-unique with low nulls and ID-like pattern (primary key candidates) |
| `fk-like` | Numeric column with ID-like name pattern but not near-unique (foreign key candidates) |
| `text` | Fallback for unclassified strings |

## Intra-Table Column Relationships

The profiler analyzes relationships between columns within each table:

### Numeric Correlations
- **Pearson correlation** between numeric columns
- Reports pairs with |r| ≥ 0.5
- Example: `OT Hours ~ Total OT Paid (r=0.92)`

### Categorical Associations (Cramér's V)
- **Cramér's V** measures association between categorical columns
- Includes high-cardinality columns (up to 500 categories, aggregated to top-20)
- Reports associations with V ≥ 0.3
- **Co-occurrence patterns** show which values tend to appear together
- Example:
  ```
  Agency Name ~ Pay Basis (V=0.71 strong)
    "DEPT OF ED PEDAGOGICAL" -> "per Annum" (100%)
    "POLICE DEPARTMENT" -> "per Annum" (94%)
  ```

### Conditional Statistics
- Numeric column ranges grouped by categorical values
- Reveals constraints like "Base Salary for hourly workers is typically $20-$50"
- Only reports when there's meaningful variation between groups (CV > 10%)

## Cross-Table Relationship Detection

The profiler automatically detects:

1. **Primary Key Candidates**: Columns with ≥99% distinct values and ≤1% nulls
2. **Foreign Key Hints**: Based on:
   - Column name pattern matching (e.g., `customer_id` → `customers.id`)
   - Value overlap analysis (sampled)
   - Type compatibility

Confidence scores (0.0-1.0) indicate relationship strength.

## Programmatic Usage

The package provides a clean Python API for integration into your workflows:

```python
from data_profiler import (
    profile_folder,
    profile_file,
    profile_dataframe,
    render_txt,
    render_json,
    ProfilerConfig,
)

# Profile all files in a folder
report = profile_folder("/data/retail")

# Profile with custom configuration
config = ProfilerConfig(
    date_threshold=0.85,
    topk=20,
    max_examples_scan=100_000,
)
report = profile_folder("/data/retail", config=config)

# Profile a single file (CSV or Excel)
report = profile_file("customers.csv")
report = profile_file("data.xlsx")  # All sheets profiled

# Profile a pandas DataFrame directly
import pandas as pd
df = pd.read_csv("data.csv")
profile = profile_dataframe(df, name="my_table")

# Render outputs
txt_output = render_txt(report)   # Token-efficient TXT for LLM context
json_output = render_json(report) # Full JSON with all details

# Access results
for file_profile in report["files"]:
    print(f"{file_profile['table_name']}: {file_profile['profile']['rows']} rows")

# Access FK hints
if report["relationships"]:
    for hint in report["relationships"]["fk_hints"]:
        print(f"{hint['from_table']}.{hint['from_column']} -> "
              f"{hint['to_table']}.{hint['to_column']}")
```

### API Reference

| Function | Description |
|----------|-------------|
| `profile_folder(path, config=None)` | Profile all CSV/Excel files in a folder |
| `profile_file(path, config=None)` | Profile a single CSV or Excel file |
| `profile_dataframe(df, name, config=None)` | Profile a pandas DataFrame directly |
| `render_txt(report)` | Render report as compact TXT string |
| `render_json(report)` | Render report as JSON string |

## Architecture

```mermaid
flowchart TD
    subgraph input [Input Layer]
        CLI[cli.py]
        CLI --> Reader[readers.py]
    end
    
    subgraph readers [File Discovery]
        Reader --> CSV[CSV Files]
        Reader --> Excel[Excel Sheets]
        CSV --> Tables[List of DataFrames]
        Excel --> Tables
    end
    
    subgraph profiling [Profiling Layer]
        Tables --> Core[profiler.py]
        Core --> Distributions[distributions.py]
        Core --> Anomalies[anomalies.py]
        Core --> Correlations[correlations.py]
        Core --> Shape[shape.py]
        Core --> TableProfiles[Per-Table Profiles]
    end
    
    subgraph intra [Intra-Table Analysis]
        Correlations --> NumCorr[Numeric Correlations]
        Correlations --> CramersV[Categorical Associations]
        Correlations --> CondStats[Conditional Statistics]
    end
    
    subgraph cross [Cross-Table Analysis]
        TableProfiles --> Relations[relationships.py]
        Relations --> FKHints[FK Hints]
        Relations --> PKCandidates[PK Candidates]
    end
    
    subgraph output [Output Layer]
        TableProfiles --> Render[output.py]
        FKHints --> Render
        PKCandidates --> Render
        NumCorr --> Render
        CramersV --> Render
        CondStats --> Render
        Render --> TXT[profile_report.txt]
        Render -.->|--json flag| JSON[profile_report.json]
    end
```

### Data Flow

1. **Input Layer**: CLI parses arguments and invokes the reader
2. **File Discovery**: Discovers CSV files and Excel sheets, loads as DataFrames
3. **Profiling Layer**: Profiles each table using distribution and anomaly modules
4. **Cross-Table Analysis**: Detects FK relationships and PK candidates across tables
5. **Output Layer**: Renders TXT (always) and JSON (optional) reports

## Repository Structure

```
dataset-profiler/
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Package configuration
├── .gitignore
├── data_profiler/           # Main package
│   ├── __init__.py          # Public API exports
│   ├── __main__.py          # Entry point for python -m data_profiler
│   ├── api.py               # Public API functions (profile_folder, profile_file, etc.)
│   ├── cli.py               # Argparse CLI + orchestration
│   ├── readers.py           # CSV/Excel file discovery and loading
│   ├── profiler.py          # Core DataFrame profiling logic
│   ├── distributions.py     # Categorical, datelike, high-cardinality summaries
│   ├── anomalies.py         # Duplicates, outliers, invalid dates, rare categories
│   ├── correlations.py      # Intra-table: numeric correlations, Cramér's V, conditional stats
│   ├── shape.py             # Distribution shape detection (normal, skewed, bimodal, etc.)
│   ├── relationships.py     # Cross-table FK hints and PK detection
│   ├── output.py            # TXT and JSON rendering
│   ├── helpers.py           # Shared utility functions
│   ├── types.py             # TypedDicts and dataclasses
│   └── ANALYSIS.md          # Detailed analysis reference
├── example_outputs/         # Sample profile reports
│   ├── retail/
│   ├── hr/
│   ├── ecomm-sales/
│   ├── company/
│   └── payroll/
└── tests/                   # API and integration tests
    ├── test_api.py          # Public API tests
    └── data/                # Test data files
```

## Detailed Analysis Reference

For comprehensive documentation on what analysis is performed for each column type, see [data_profiler/ANALYSIS.md](data_profiler/ANALYSIS.md), which covers:

- Core stats and cardinality bucketing
- Semantic type inference rules (including `year-like` detection)
- Numeric coercion for string columns
- Distribution summaries (categorical, datelike, high-cardinality)
- Distribution shape detection (normal, skewed, bimodal, sparse, discrete)
- Intra-table column relationships (correlations, associations, conditional stats)
- Anomaly detection (duplicates, outliers, invalid dates, rare categories)
- Cross-table relationship detection algorithms
