Metadata-Version: 2.4
Name: dataprof
Version: 0.4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
License-File: LICENSE
Summary: Fast, lightweight data profiling and quality assessment library
Keywords: data,profiling,quality,csv,json,analysis,performance
Author-email: Andrea Bozzo <andrea@example.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/AndreaBozzo/dataprof
Project-URL: Repository, https://github.com/AndreaBozzo/dataprof
Project-URL: Issues, https://github.com/AndreaBozzo/dataprof/issues

# DataProfiler 📊

[![CI](https://github.com/AndreaBozzo/dataprof/workflows/CI/badge.svg)](https://github.com/AndreaBozzo/dataprof/actions)
[![License](https://img.shields.io/github/license/AndreaBozzo/dataprof)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.70%2B-orange.svg)](https://www.rust-lang.org)
[![Crates.io](https://img.shields.io/crates/v/dataprof.svg)](https://crates.io/crates/dataprof)
[![PyPI](https://img.shields.io/pypi/v/dataprof.svg)](https://pypi.org/project/dataprof/)

**High-performance data quality library for production pipelines**

DataProfiler analyzes datasets 13x faster than pandas, streams files larger than RAM (tested up to 100GB), and detects 30+ quality issues automatically. Built in Rust with Python bindings and direct database connectivity.

![DataProfiler HTML Report](assets/animations/HTML.gif)

## ✨ Key Features

- **⚡ High Performance**: 13x faster than pandas with Apache Arrow integration
- **🌊 Scalable**: Stream processing for files larger than RAM (tested up to 100GB)
- **🔍 Smart Quality Detection**: Automatically finds nulls, duplicates, outliers, and format issues
- **🗃️ Database Connectivity**: Direct profiling of PostgreSQL, MySQL, SQLite, DuckDB
- **🐍 Python & Rust APIs**: Library-first design with comprehensive bindings
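
To make the streaming and quality-detection bullets concrete, here is a minimal standard-library sketch of a single-pass check for empty cells and exact duplicate rows. It is an illustration of the general idea only, not DataProfiler's implementation; the function name `stream_quality_counts` and the returned dictionary layout are assumptions made for this example.

```python
import csv
import io
from collections import Counter


def stream_quality_counts(lines):
    """Count empty cells per column and exact duplicate rows in one pass.

    `lines` is any iterable of CSV text lines (an open file works), so rows
    are parsed one at a time instead of loading the whole file.
    Hypothetical helper for illustration; not part of dataprof.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return {"rows": 0, "nulls": {}, "duplicates": 0}

    null_counts = Counter()
    seen_rows = set()  # grows with the number of distinct rows
    duplicates = 0
    rows = 0
    for row in reader:
        if not row:  # skip completely blank lines
            continue
        rows += 1
        for column, value in zip(header, row):
            if value.strip() == "":
                null_counts[column] += 1
        key = tuple(row)
        if key in seen_rows:
            duplicates += 1
        else:
            seen_rows.add(key)
    return {"rows": rows, "nulls": dict(null_counts), "duplicates": duplicates}


sample = io.StringIO("id,name\n1,Ada\n2,\n1,Ada\n")
print(stream_quality_counts(sample))
```

Note that the duplicate check here keeps every distinct row in a set, so only the null counting is truly constant-memory; a production tool would likely use hashing or sampling for that part.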

## 🚀 Quick Start

### Python
```bash
pip install dataprof
```

```python
import dataprof

# Analyze CSV with quality assessment
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Profile database table directly
profiles = dataprof.analyze_database("postgresql://user:pass@host/db", "users")
```

### Rust
```bash
cargo add dataprof --features arrow
```

```rust
use dataprof::*;

// High-performance Arrow processing
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("large_dataset.csv")?;
```

### CLI
```bash
# Basic profiling
dataprof data.csv --quality --html report.html

# Database profiling
dataprof users --database "postgresql://user:pass@host:5432/db" --quality

# Large files with progress
dataprof huge_file.csv --streaming --progress
```

## 📊 Performance

| Tool | Time (100MB CSV) | Memory | Quality Checks | Larger-than-RAM Support |
|------|-----------|--------|----------------|--------------|
| **DataProfiler (Arrow)** | **0.5s** | 30MB | ✅ 30+ checks | ✅ |
| DataProfiler (Standard) | 2.1s | 45MB | ✅ 30+ checks | ✅ |
| pandas.describe() | 8.4s | 380MB | ❌ Basic stats | ❌ |
| Great Expectations | 12.1s | 290MB | ✅ Rule-based | ❌ |
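
The relative figures implied by this table can be worked out directly. The short sketch below copies the time and memory values from the rows above and computes ratios against `pandas.describe()`; the helper name `relative_to` is made up for this example.

```python
# Figures copied from the table: (seconds for a 100MB CSV, memory in MB).
benchmarks = {
    "DataProfiler (Arrow)": (0.5, 30),
    "DataProfiler (Standard)": (2.1, 45),
    "pandas.describe()": (8.4, 380),
    "Great Expectations": (12.1, 290),
}


def relative_to(baseline, tool):
    """Return (speedup, memory_ratio) of `tool` relative to `baseline`."""
    base_time, base_mem = benchmarks[baseline]
    tool_time, tool_mem = benchmarks[tool]
    return round(base_time / tool_time, 1), round(base_mem / tool_mem, 1)


for tool in ("DataProfiler (Arrow)", "DataProfiler (Standard)"):
    speedup, mem_ratio = relative_to("pandas.describe()", tool)
    print(f"{tool}: {speedup}x faster, {mem_ratio}x less memory than pandas.describe()")
```

These ratios describe only this single 100MB benchmark; other file sizes and schemas may behave differently.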

## 💡 Real-World Examples

**Production Quality Gate**
```python
from dataprof import quick_quality_check

def validate_pipeline_data(file_path):
    """Return the quality score, or raise if it falls below 85%."""
    quality_score = quick_quality_check(file_path)
    if quality_score < 85.0:
        raise ValueError(f"Data quality too low: {quality_score:.1f}%")
    return quality_score
```
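
The same gate can be extended to several files at once. The sketch below is one possible shape, assuming the scoring function is passed in as a parameter so it runs without dataprof installed; `gate_files` and `DataQualityError` are hypothetical names, and in a real pipeline you would presumably pass `dataprof.quick_quality_check` as `check`.

```python
from typing import Callable, Dict, Iterable


class DataQualityError(ValueError):
    """Raised when any file scores below the threshold (hypothetical name)."""


def gate_files(
    paths: Iterable[str],
    check: Callable[[str], float],
    threshold: float = 85.0,
) -> Dict[str, float]:
    """Score every path with `check` and fail if any score is too low."""
    scores = {path: check(path) for path in paths}
    failing = {p: s for p, s in scores.items() if s < threshold}
    if failing:
        details = ", ".join(f"{p}: {s:.1f}%" for p, s in sorted(failing.items()))
        raise DataQualityError(f"Data quality below {threshold:.1f}%: {details}")
    return scores


# Stand-in scorer with made-up scores, so the sketch is runnable on its own.
fake_scores = {"orders.csv": 97.2, "customers.csv": 91.0}
print(gate_files(fake_scores, fake_scores.get))
```

Collecting all scores before raising means a single run reports every failing file rather than stopping at the first one.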

**Jupyter Data Exploration**
```python
import dataprof

report = dataprof.analyze_csv_with_quality("experiment_data.csv")

# Quick overview
print(f"📊 Quality: {report.quality_score():.1f}% | Rows: {report.scan_info.rows_scanned:,}")

# Identify issues
for issue in report.issues:
    print(f"⚠️ {issue.severity}: {issue.description}")
```

**Database Monitoring**
```bash
# Monitor daily data loads
dataprof daily_sales --database "mysql://user:pass@prod-db/warehouse" \
  --query "SELECT * FROM sales WHERE date = CURRENT_DATE" \
  --quality --json | jq '.quality_score'
```
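
If the monitoring step lives in Python rather than a shell pipeline, the `jq '.quality_score'` step can be approximated as below. This sketch assumes only what the `jq` filter implies, namely that the JSON output is an object with a top-level `quality_score` field; the function names, the sample payload, and the 85.0 threshold (borrowed from the quality gate example) are illustrative.

```python
import json


def extract_quality_score(json_text: str) -> float:
    """Return the top-level `quality_score` from a JSON report string."""
    report = json.loads(json_text)
    if not isinstance(report, dict) or "quality_score" not in report:
        raise KeyError("quality_score")
    return float(report["quality_score"])


def should_alert(json_text: str, threshold: float = 85.0) -> bool:
    """True when the reported score falls below the threshold."""
    return extract_quality_score(json_text) < threshold


# Hand-written sample payload, not real dataprof output.
sample_output = '{"quality_score": 78.5}'
print(should_alert(sample_output))
```

In practice `json_text` would come from the captured standard output of the `dataprof ... --json` command, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.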

## 📖 Documentation

| Guide | Description |
|-------|-------------|
| **[Python API](https://github.com/AndreaBozzo/dataprof/wiki/Python-Bindings)** | Complete Python reference with examples |
| **[Database Connectors](https://github.com/AndreaBozzo/dataprof/wiki/Database-Connectors)** | PostgreSQL, MySQL, SQLite, DuckDB integration |
| **[Apache Arrow Integration](https://github.com/AndreaBozzo/dataprof/wiki/Apache-Arrow-Integration)** | High-performance columnar processing guide |

Additional resources: [CHANGELOG](CHANGELOG.md) • [CONTRIBUTING](CONTRIBUTING.md) • [LICENSE](LICENSE)

## 🛠️ Development

```bash
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof

# Quick setup (run the script that matches your platform)
bash scripts/setup-dev.sh    # Linux/macOS
pwsh scripts/setup-dev.ps1   # Windows

# Build and test
cargo build --release
cargo test --all
```

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

- 🐛 [Report bugs](https://github.com/AndreaBozzo/dataprof/issues)
- ✨ [Request features](https://github.com/AndreaBozzo/dataprof/issues)
- 📖 [Improve docs](https://github.com/AndreaBozzo/dataprof/wiki)

## 📄 License

Licensed under [GPL-3.0](LICENSE) • Commercial use allowed with source disclosure

