Metadata-Version: 2.4
Name: csvmedic
Version: 0.1.0
Summary: Automatic locale-aware CSV and Excel reader with encoding, delimiter, date format, and number locale detection.
Project-URL: Homepage, https://github.com/csvmedic/csvmedic
Project-URL: Documentation, https://csvmedic.readthedocs.io
Project-URL: Repository, https://github.com/csvmedic/csvmedic
Author: csvmedic contributors
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Requires-Dist: charset-normalizer>=3.0.0
Requires-Dist: pandas>=1.5.0
Provides-Extra: all
Requires-Dist: clevercsv>=0.8.0; extra == 'all'
Requires-Dist: openpyxl>=3.1.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mkdocs-material>=9.0; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pandas-stubs>=2.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: excel
Requires-Dist: openpyxl>=3.1.0; extra == 'excel'
Provides-Extra: fast
Requires-Dist: clevercsv>=0.8.0; extra == 'fast'
Description-Content-Type: text/markdown

# csvmedic

Automatic locale-aware CSV and Excel reader. One line to clean messy data:

```python
import csvmedic

df = csvmedic.read("messy_file.csv")
print(df.diagnosis)  # See what was detected and converted
```

## What it does

| Detects | Examples |
|--------|----------|
| **Encoding** | UTF-8, Windows-1252, ISO-8859-1, Shift-JIS, BOM |
| **Delimiter** | Comma, semicolon, tab, pipe |
| **Dates** | DD-MM vs MM-DD resolved statistically; ISO, European, US formats |
| **Numbers** | European (1.234,56) vs US (1,234.56); locale hint |
| **Booleans** | Yes/No, Ja/Nein, Oui/Non, Sí/No, and more |
| **Strings** | Preserves leading zeros (IDs like 00742) |

Every transformation is recorded in the `.diagnosis` attribute so you can audit what was changed.
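
To make the **Numbers** row concrete, here is a minimal sketch of how a decimal separator could be guessed from a single value. It is an illustration only and not csvmedic's actual implementation; the helper names `guess_decimal_separator` and `to_float` are hypothetical.

```python
from typing import Optional


def guess_decimal_separator(text: str) -> Optional[str]:
    """Guess '.' or ',' as the decimal separator; None if undecidable (hypothetical helper)."""
    last_dot = text.rfind(".")
    last_comma = text.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return None  # plain integer, no evidence either way
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the one that appears last is the decimal mark.
        return "." if last_dot > last_comma else ","
    sep = "." if last_dot != -1 else ","
    other = "," if sep == "." else "."
    if text.count(sep) > 1:
        return other  # a repeated separator can only be thousands grouping
    digits_after = len(text) - text.rfind(sep) - 1
    if digits_after == 3:
        return None  # "1.234" could mean 1234 or 1.234
    return sep


def to_float(text: str, decimal: str) -> float:
    """Convert a numeric string once the decimal separator is known."""
    thousands = "," if decimal == "." else "."
    return float(text.replace(thousands, "").replace(decimal, "."))


print(guess_decimal_separator("1.234,56"))  # ,
print(guess_decimal_separator("1,234.56"))  # .
print(to_float("1.234,56", ","))            # 1234.56
```

A value such as `1.234` stays undecided on its own, which is why a real detector would need to look at many values in the column before converting.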

## Installation

```bash
pip install csvmedic
```

Optional extras:

- `pip install csvmedic[fast]` — better dialect detection (clevercsv)
- `pip install csvmedic[excel]` — .xlsx support (openpyxl)
- `pip install csvmedic[all]` — both

## Configuration

Override auto-detection when you know better:

```python
df = csvmedic.read(
    "file.csv",
    encoding="utf-8",
    delimiter=";",
    dayfirst=True,              # Force DD-MM dates
    preserve_strings=["ID"],    # Never convert these columns
    sample_rows=2000,           # Rows to use for detection
    confidence_threshold=0.75,  # Min confidence to convert (0–1)
)
```

## Analyze without converting

```python
profile = csvmedic.read_raw("file.csv")
print(profile.summary())
print(profile.columns["Date"].details)
```

## Schema pinning (recurring files)

Save the detected schema after the first read and reuse it so the next read skips detection:

```python
df = csvmedic.read("monthly_export.csv")
csvmedic.save_schema(df.attrs["diagnosis"].file_profile, "monthly_export.csvmedic.json")

# Next time: same encoding, delimiter, and column types, no re-detection
df2 = csvmedic.read("monthly_export.csv", schema="monthly_export.csvmedic.json")
```

## Batch read with consensus

When reading many similar CSVs (e.g. one per month), use consensus so every file gets the same encoding and delimiter:

```python
dfs = csvmedic.read_batch(["jan.csv", "feb.csv", "mar.csv"], use_consensus=True)
# Encoding and delimiter are chosen by majority across the three files.
```
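
The majority vote mentioned in the comment above can be pictured roughly as follows. This is a sketch under assumptions, not the library's code: the per-file detection results are modelled as plain dictionaries, and `majority_settings` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def majority_settings(detected: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the most common encoding and delimiter across files (hypothetical helper)."""
    consensus = {}
    for key in ("encoding", "delimiter"):
        counts = Counter(item[key] for item in detected)
        # most_common(1) returns a one-element list of (value, count) pairs.
        consensus[key] = counts.most_common(1)[0][0]
    return consensus


# Made-up detection results for three files, one of which was misdetected per setting.
per_file = [
    {"encoding": "utf-8", "delimiter": ";"},
    {"encoding": "utf-8", "delimiter": ","},
    {"encoding": "cp1252", "delimiter": ";"},
]
print(majority_settings(per_file))  # {'encoding': 'utf-8', 'delimiter': ';'}
```

The point of voting is that one short or unusual file cannot pull its own settings away from the rest of the batch.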

## Diff: pandas vs csvmedic

Compare a default pandas read with a csvmedic read to see exactly which values pandas would have changed or corrupted and which csvmedic preserves:

```python
result = csvmedic.diff("leading_zeros.csv")
print(result.summary())           # Columns/rows that differ
print(result.pandas_df)           # Default pandas read
print(result.csvmedic_df)         # csvmedic read (e.g. keeps "00742" as string)
print(result.sample_differences)  # Example (row, column, pandas_val, csvmedic_val)
```

## How disambiguation works

For ambiguous dates like `03/04/2025` (March 4 or April 3?), csvmedic uses the data itself: if any value in the column has a first component greater than 12 (e.g. `25/03/2025`), that component cannot be a month, so the column is treated as day-first. It also uses cross-column inference, separator hints (e.g. a period separator suggests a European format), and sequential order. If it still can’t decide, the column stays as string and is marked ambiguous in the diagnosis.
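
The "greater than 12" rule can be sketched in a few lines. This is a simplified illustration, not csvmedic's actual code: it covers only that one rule (no cross-column inference, separator hints, or ordering), and `infer_dayfirst` is a hypothetical name.

```python
import re
from typing import Iterable, Optional

# Two 1-2 digit components followed by a year, separated by '/', '.' or '-'.
_DATE_PARTS = re.compile(r"^\s*(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{2,4})\s*$")


def infer_dayfirst(values: Iterable[str]) -> Optional[bool]:
    """Return True for day-first, False for month-first, None if undecided."""
    first_over_12 = False
    second_over_12 = False
    for value in values:
        match = _DATE_PARTS.match(value)
        if match is None:
            continue  # skip values that do not look like numeric dates
        first, second = int(match.group(1)), int(match.group(2))
        if first > 12:
            first_over_12 = True  # first component cannot be a month
        if second > 12:
            second_over_12 = True  # second component cannot be a month
    if first_over_12 and not second_over_12:
        return True
    if second_over_12 and not first_over_12:
        return False
    return None  # no evidence, or contradictory evidence


print(infer_dayfirst(["03/04/2025", "25/03/2025"]))  # True
print(infer_dayfirst(["03/04/2025", "03/25/2025"]))  # False
print(infer_dayfirst(["03/04/2025", "05/06/2025"]))  # None
```

Returning `None` rather than guessing mirrors the behaviour described above, where an undecided column stays as string and is flagged as ambiguous.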

## Documentation

- [Quickstart](docs/quickstart.md)
- [How it works](docs/how-it-works.md)
- [API reference](docs/api-reference.md)
- [FAQ](docs/faq.md)

## License

MIT
