Metadata-Version: 2.4
Name: dataexaminer
Version: 0.1.0
Summary: Dataset summary and anomaly analysis library
Author-email: Aeron Zentner <aeronzentner@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yourname/dataexaminer
Project-URL: Issues, https://github.com/yourname/dataexaminer/issues
Keywords: data,analysis,anomaly,pandas,summary
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.0
Provides-Extra: excel
Requires-Dist: openpyxl>=3.1; extra == "excel"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: numpy>=1.26; extra == "dev"

# dataexaminer

A Python library that examines a dataset and produces a per-column summary: data type, record count, missing values, and detected anomalies.

## Installation

```bash
pip install dataexaminer
```

For Excel file support:

```bash
pip install "dataexaminer[excel]"   # quotes keep shells such as zsh from expanding the brackets
```

## Quick Start

```python
from dataexaminer import examine

# From a CSV or Excel file
result = examine("data.csv")             # returns a dict
examine("data.csv", output="console")   # prints a table

# From a pandas DataFrame
import pandas as pd
df = pd.read_csv("data.csv")
examine(df, output="console")
```

### Example Output

```
DataExaminer Report — sales_data.csv
Examined: 2026-03-22 14:05:00   Rows: 5,000   Cols: 4

Column      Dtype           Kind         Records  Missing  Miss%   Anomalies  Detail
----------  --------------  -----------  -------  -------  ------  ---------  -------------------------------------------------
revenue     float64         numeric      5,000    23       0.5%    12         12 IQR outliers (fence: 105.0 – 9842.0)
category    object          categorical  5,000    41       0.8%    8          8 values in rare categories (<1%): ['UNKN', 'N/a']
order_date  datetime64[us]  datetime     5,000    5        0.1%    3          3 gap(s) >5.0× median (median: 1d, largest: 47d)
notes       object          categorical  5,000    5,000    100.0%  0          —
```

## What Gets Detected

| Column Type | Anomaly Method | Description |
|---|---|---|
| Numeric | IQR fence (Tukey) | Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR |
| Categorical | Rare category | Values with frequency below 1% of non-null rows |
| Datetime | Gap detection | Intervals exceeding 5× the median gap; dates outside 1900–2100 |
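
The rules in the table can be approximated with plain pandas. The sketch below is not the library's implementation; the function names are hypothetical, the defaults (1.5, 1%, 5×) are taken from the table, and the 1900–2100 date range check is left out.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, multiplier: float = 1.5) -> tuple[int, float, float]:
    """Count values outside the Tukey fences; returns (count, lower, upper)."""
    clean = series.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    count = int(((clean < lower) | (clean > upper)).sum())
    return count, float(lower), float(upper)


def rare_category_rows(series: pd.Series, threshold_pct: float = 1.0) -> int:
    """Count non-null rows whose category covers less than threshold_pct of non-null rows."""
    clean = series.dropna()
    share = clean.value_counts(normalize=True) * 100
    rare = share[share < threshold_pct].index
    return int(clean.isin(rare).sum())


def large_gaps(series: pd.Series, gap_multiplier: float = 5.0) -> int:
    """Count intervals between consecutive sorted timestamps above gap_multiplier x median gap."""
    gaps = series.dropna().sort_values().diff().dropna()
    if gaps.empty:
        return 0
    return int((gaps > gaps.median() * gap_multiplier).sum())


# Illustrative data only.
revenue = pd.Series([10, 12, 11, 13, 12, 11, 500])
category = pd.Series(["A"] * 150 + ["B"] * 49 + ["UNKN"])
dates = pd.Series(pd.to_datetime(
    ["2026-01-01", "2026-01-02", "2026-01-03", "2026-01-04", "2026-02-20"]
))

print(iqr_outliers(revenue))         # (1, 8.75, 14.75)
print(rare_category_rows(category))  # 1
print(large_gaps(dates))             # 1
```

Note that the rare-category count is a number of rows, not a number of distinct categories, which matches the wording "8 values in rare categories" in the example output.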

## Programmatic Usage

```python
import pandas as pd

from dataexaminer import DataExaminer
from dataexaminer.formatters import ConsoleFormatter, DictFormatter

df = pd.read_csv("data.csv")

examiner = DataExaminer()
report = examiner.examine(df)

# report.columns is a list of ColumnSummary objects
for col in report.columns:
    print(col.column_name, col.missing_count, col.anomaly_count)
```

### Working with the Report Object

```python
report = examiner.examine(df)

print(report.row_count)       # total rows
print(report.col_count)       # total columns
print(report.examined_at)     # datetime of analysis

for col in report.columns:
    print(col.column_name)    # column name
    print(col.dtype)          # pandas dtype string
    print(col.kind)           # "numeric" | "categorical" | "datetime" | "boolean"
    print(col.record_count)   # total rows
    print(col.missing_count)  # number of null values
    print(col.missing_pct)    # percentage of null values
    print(col.anomaly_count)  # number of anomalies detected
    print(col.anomaly_detail) # human-readable description
```

### Dict Output

```python
result = examine("data.csv")  # default output="dict"
print(result["row_count"])
print(result["columns"][0]["anomaly_detail"])
```
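
Because the result is a plain dict, it is easy to filter. The sketch below picks out columns that probably need a closer look. It assumes each entry in `result["columns"]` uses keys that mirror the `ColumnSummary` attributes (`column_name`, `missing_pct`, `anomaly_count`); only `row_count`, `columns`, and `anomaly_detail` are confirmed above. The helper name and the sample dict are hypothetical.

```python
def columns_needing_attention(result: dict, max_missing_pct: float = 5.0) -> list[str]:
    """Return names of columns with any anomalies or with too many missing values."""
    flagged = []
    for col in result["columns"]:
        if col["anomaly_count"] > 0 or col["missing_pct"] > max_missing_pct:
            flagged.append(col["column_name"])
    return flagged


# Hand-built stand-in for the dict returned by examine(...); key names are assumed.
sample = {
    "row_count": 100,
    "columns": [
        {"column_name": "price", "missing_pct": 0.0, "anomaly_count": 2},
        {"column_name": "sku", "missing_pct": 1.0, "anomaly_count": 0},
        {"column_name": "notes", "missing_pct": 100.0, "anomaly_count": 0},
    ],
}

print(columns_needing_attention(sample))  # ['price', 'notes']
```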

## Custom Detectors

The library is designed to be extended. Subclass `AnomalyDetector`, implement `supports` and `detect`, and pass an instance to `DataExaminer` through the `detectors` argument:

```python
from dataexaminer import DataExaminer
from dataexaminer.detectors import AnomalyDetector, IQRAnomalyDetector
import pandas as pd

class ZeroInflationDetector(AnomalyDetector):
    """Flags columns where more than 50% of non-null values are zero."""

    def supports(self, series: pd.Series) -> bool:
        return pd.api.types.is_numeric_dtype(series)

    def detect(self, series: pd.Series) -> tuple[int, str]:
        clean = series.dropna()
        if clean.empty:
            return 0, "—"
        zeros = (clean == 0).sum()
        pct = zeros / len(clean) * 100
        if pct > 50:
            return int(zeros), f"{zeros} zero values ({pct:.1f}%)"
        return 0, "—"

examiner = DataExaminer(detectors=[IQRAnomalyDetector(), ZeroInflationDetector()])
report = examiner.examine(df)
```
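
A detector can be checked on small Series before it is wired into `DataExaminer`. The sketch below does this for a second, hypothetical detector aimed at text columns. So that it runs without the package installed, it declares a stand-in `AnomalyDetector` base class with the two methods used above; the real base class may differ.

```python
from abc import ABC, abstractmethod

import pandas as pd

NO_DETAIL = "\u2014"  # same "nothing found" placeholder the detector above returns


class AnomalyDetector(ABC):
    """Stand-in for dataexaminer.detectors.AnomalyDetector (assumed interface)."""

    @abstractmethod
    def supports(self, series: pd.Series) -> bool: ...

    @abstractmethod
    def detect(self, series: pd.Series) -> tuple[int, str]: ...


class WhitespaceDetector(AnomalyDetector):
    """Flags string values with leading or trailing whitespace (hypothetical example)."""

    def supports(self, series: pd.Series) -> bool:
        return pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)

    def detect(self, series: pd.Series) -> tuple[int, str]:
        clean = series.dropna().astype(str)
        padded = int((clean != clean.str.strip()).sum())
        if padded:
            return padded, f"{padded} values with leading/trailing whitespace"
        return 0, NO_DETAIL


detector = WhitespaceDetector()
names = pd.Series([" a", "b", "c ", None])

print(detector.supports(names))             # True
print(detector.supports(pd.Series([1, 2])))  # False
print(detector.detect(names))               # (2, '2 values with leading/trailing whitespace')
```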

### Tuning Built-in Detectors

```python
from dataexaminer import DataExaminer
from dataexaminer.detectors import IQRAnomalyDetector, RareCategoryDetector, DatetimeAnomalyDetector

examiner = DataExaminer(detectors=[
    IQRAnomalyDetector(iqr_multiplier=3.0),       # less sensitive to outliers
    RareCategoryDetector(threshold_pct=5.0),       # flag categories below 5%
    DatetimeAnomalyDetector(gap_multiplier=10.0),  # only flag very large gaps
])
```

## Supported File Formats

| Format | Requires |
|---|---|
| CSV (`.csv`) | `pandas` (included) |
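| Excel (`.xlsx`, `.xls`) | `pip install "dataexaminer[excel]"` (installs `openpyxl`; pandas reads legacy `.xls` files through `xlrd`, which the extra does not list) |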

## Requirements

- Python 3.10+
- pandas 2.0+

## License

MIT © Aeron Zentner
