Metadata-Version: 2.4
Name: dsr-data-tools
Version: 1.0.0
Summary: Generic data handling utilities including data splitting and analysis.
Author-email: Scott Roberts <scottrdeveloper@gmail.com>
License: MIT
Keywords: data,splitting,analysis,ml-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dsr-utils>=1.0.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: scikit-learn>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Requires-Dist: pytest-cov>=4.0; extra == "test"
Dynamic: license-file

# dsr-data-tools

[![PyPI version](https://img.shields.io/pypi/v/dsr-data-tools.svg)](https://pypi.org/project/dsr-data-tools/)
[![Python versions](https://img.shields.io/pypi/pyversions/dsr-data-tools.svg)](https://pypi.org/project/dsr-data-tools/)
[![License](https://img.shields.io/pypi/l/dsr-data-tools.svg)](https://pypi.org/project/dsr-data-tools/)
[![Changelog](https://img.shields.io/badge/changelog-available-blue.svg)](https://github.com/scottroberts140/dsr-data-tools/releases)

Data analysis and exploration tools for exploratory data analysis (EDA).

**Version 1.0.0**: This release is breaking and not backward-compatible with prior 0.x versions.

## Features

- **Dataset Analysis**: Comprehensive statistical summaries and data quality assessment
- **Data Exploration**: Tools for understanding data distributions, correlations, and patterns
- **Quality Metrics**: Missing value detection, data type analysis, and anomaly identification
- **Statistically Guided Feature Interactions**: Automatic discovery of meaningful feature interactions using Mutual Information and Pearson Correlation

## Installation

```bash
pip install dsr-data-tools
```

## Usage

```python
import pandas as pd
from dsr_data_tools import analyze_dataset

# Load your data
df = pd.read_csv('data.csv')

# Perform comprehensive analysis
analyze_dataset(df)
```

### Datetime Conversion Recommendation

`generate_recommendations()` detects object/string columns that are likely datetimes and recommends converting them to a proper datetime dtype.

```python
import pandas as pd
from dsr_data_tools.analysis import generate_recommendations
from dsr_data_tools.recommendations import apply_recommendations

# Example column with mostly valid date strings
df = pd.DataFrame({
	'date_str': [
		'2025-01-01', '2025-01-02', '2025-01-03',
		'2025-01-04', 'invalid',  # one invalid value
	] * 10  # scale up rows
})

recs = generate_recommendations(df)

# If detected, apply the datetime conversion recommendation
if 'date_str' in recs and 'datetime_conversion' in recs['date_str']:
	df_converted = apply_recommendations(df, {
		'date_str': recs['date_str']['datetime_conversion']
	})
	# Column is now datetime64; invalid entries coerced to NaT
	print(df_converted['date_str'].dtype)  # datetime64[ns]
```

## Performance

This library is optimized for large-scale data processing using vectorized operations.

* Vectorized Integer Checks: Optimized from $O(N)$ Python-level application to vectorized modulo operations, resulting in a __5-6× speed increase__.

* Cached Data Scans: Implemented caching for common operations like dropna() and unique() to ensure each data column is scanned as few times as possible, maintaining high efficiency for wide datasets.

## Benchmarks

A benchmark script compares per-element `apply(is_integer)` against a vectorized modulo check for detecting integer-like floats. On large series, the vectorized approach is typically 5–6× faster.

- Script: [scripts/benchmark_integer_checks.py](scripts/benchmark_integer_checks.py)

Run via Python:

```bash
python scripts/benchmark_integer_checks.py           # default size (2,000,000)
python scripts/benchmark_integer_checks.py 5000000  # custom size
```

Or via Makefile target:

```bash
make benchmark                # default N=2,000,000
make benchmark N=5000000      # custom size
```

## Requirements

- Python >= 3.10
- pandas
- numpy
- scikit-learn
- dsr-utils >= 1.0.0

## License

MIT License - see LICENSE file for details
