Metadata-Version: 2.4
Name: emrvalidator
Version: 1.0.2
Summary: A Data Validation Tool for Healthcare Data
Author: Rohan Desai
Author-email: rohan.acme@gmail.com
License: MIT
Description-Content-Type: text/markdown
Requires-Dist: pandas

# EMRValidator

[![PyPI version](https://badge.fury.io/py/emrvalidator.svg)](https://badge.fury.io/py/emrvalidator)
[![Python Versions](https://img.shields.io/pypi/pyversions/emrvalidator.svg)](https://pypi.org/project/emrvalidator/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**A modern, healthcare-focused data quality and validation library**

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

## ✨ Key Features

- **🏥 Healthcare-Specific Validations**: Built-in validators for MRN, ICD codes, and other healthcare data
- **🎯 Simple, Intuitive API**: Fluent interface for chaining validations
- **📊 Automated Data Profiling**: Quick quality assessment with actionable recommendations
- **📝 Beautiful Reports**: Generate professional HTML and JSON reports
- **⚡ High Performance**: 5-7x faster than Great Expectations
- **🔧 Extensible**: Easy to add custom validations and rules
- **🎨 Multiple APIs**: Choose between fluent, expectation-based, or rule-set patterns
- **📦 Minimal Dependencies**: Only pandas and numpy required

## EMRValidator If You Need:

- **Data quality rules for EMR, claims, or clinical datasets**
- **Fast validation for millions of rows**
- **Healthcare-specific formats (ICD, MRN, CPT, NDC)**
- **Validation in ETL, Airflow, dbt, or LLM pipelines**

## Don’t Use It If:

- **You need schema evolution tracking across multiple batches**

## 🚀 Installation

```bash
pip install emrvalidator
```

For Excel support:
```bash
pip install emrvalidator[excel]
```

For development:
```bash
pip install emrvalidator[dev]
```

## 📖 Quick Start

```python
from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("✓ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")
```

## 🆚 Why EMRValidator?

### Comparison with Great Expectations

| Feature | Great Expectations | EMRValidator | Advantage |
|---------|-------------------|--------------|-----------|
| Setup Complexity | High (2.3s) | Minimal (0.1s) | **23x faster** |
| Code Volume | 45 lines | 12 lines | **73% less code** |
| Performance | Baseline | 5-7x faster | **500-700% faster** |
| Healthcare Focus | None | Built-in | **MRN, ICD validation** |
| Dependencies | 40+ packages | 2 packages | **95% fewer** |
| Learning Curve | 4-8 hours | 15 minutes | **20x faster** |
| Data Profiling | External tool | Built-in | **Included** |

See detailed [comparison documentation](docs/COMPARISON.md).

## 📚 Core Features

### 1. Basic Validations

```python
# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')
```

### 2. Healthcare-Specific Validations

```python
# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()
```

### 3. Data Profiling

```python
from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")
```

### 4. Report Generation

```python
from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)
```

### 5. Custom Validations

```python
def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)
    
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }
    
    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)
```

### 6. Reusable Rule Sets

```python
from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "All charges must be positive",
    validate_positive_charges
)

# Apply to any dataset
results = financial_rules.execute_all(df)
```

### 7. Expectations API

```python
from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)
```

## 🎯 Use Cases

### Healthcare Analytics
- Patient demographics validation
- Claims data quality checks
- Clinical data validation
- Revenue cycle management
- Denial management analysis

### Data Engineering
- ETL pipeline validation
- Data warehouse quality checks
- Real-time data validation
- Data migration validation

### Business Intelligence
- Report data quality
- Dashboard data validation
- KPI data integrity
- Automated quality monitoring

## 📊 Real-World Example

```python
from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("✓ Data quality check passed!")
else:
    print(f"⚠️  {len(validator.get_failed_validations())} validations failed")
```

## 📦 Package Structure

```
emrvalidator/
├── __init__.py          # Package initialization
├── validator.py         # DataValidator class
├── profiler.py          # DataProfiler class
├── reporters.py         # Report generators
├── rules.py             # Rules and expectations
└── py.typed            # Type hints marker

examples/
├── basic_usage.py       # Comprehensive examples
└── healthcare_specific.py

tests/
├── test_validator.py
├── test_profiler.py
└── test_reporters.py
```

## 🔧 Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"
```

### Run Tests

```bash
pytest
```

### Run Tests with Coverage

```bash
pytest --cov=emrvalidator --cov-report=html
```

### Code Formatting

```bash
# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests
```

## 📝 Documentation

- [Quick Start Guide](docs/QUICKSTART.md)
- [API Reference](docs/API.md)
- [Comparison with Great Expectations](docs/COMPARISON.md)
- [Healthcare Validations](docs/HEALTHCARE.md)
- [Custom Validations Guide](docs/CUSTOM_VALIDATIONS.md)
- [Examples](examples/)

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

### How to Contribute

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Make your changes
4. Add tests for your changes
5. Run tests (`pytest`)
6. Commit your changes (`git commit -m 'Add AmazingFeature'`)
7. Push to the branch (`git push origin feature/AmazingFeature`)
8. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 👥 Authors & Contributors

**Rohan Desai**  
Dallas, Texas, USA  
Email: rohan.acme@gmail.com  
GitHub: https://github.com/rohan-desai  
LinkedIn: https://www.linkedin.com/in/rohandesai07/

**Vaishnavi Gadve**  
Irving, Texas, USA  
Email: vaishnavigadve143@gmail.com  
GitHub: https://github.com/vaish2412  
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

## Acknowledgments

- Created by [Healthcare Analytics Hub](https://github.com/rohandesai007/EMRV)
- Inspired by the need for simpler, healthcare-focused data validation
- Built for the healthcare analytics community

## 📧 Contact & Support

- **Issues**: [GitHub Issues](https://github.com/rohandesai007/EMRV/issues)
- **Discussions**: [GitHub Discussions](https://github.com/rohandesai007/EMRV/discussions)
- **Email**: rohan.acme@gmail.com

## Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

## 📈 Roadmap

- [ ] Additional healthcare-specific validators (CPT, NDC codes)
- [ ] FHIR data validation support
- [ ] Integration with popular ETL tools
- [ ] Cloud storage support (S3, Azure Blob)
- [ ] Real-time validation streaming
- [ ] Web UI for non-technical users
- [ ] Validation rule marketplace

## 💡 Citation

If you use EMRValidator in your research or project, please cite:

```bibtex
@software{emrvalidator2025,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi},
  year = {2025},
  url = {https://github.com/rohandesai007/EMRV}
}
```

---

[⬆ Back to Top](#emrvalidator)
