Metadata-Version: 2.4
Name: scitex-io
Version: 0.2.2
Summary: Universal scientific data I/O with plugin registry
Project-URL: Homepage, https://github.com/ywatanabe1989/scitex-io
Project-URL: Documentation, https://scitex-io.readthedocs.io/
Project-URL: Repository, https://github.com/ywatanabe1989/scitex-io.git
Project-URL: Issues, https://github.com/ywatanabe1989/scitex-io/issues
Author-email: Yusuke Watanabe <ywatanabe@scitex.ai>
License-Expression: AGPL-3.0
License-File: LICENSE
Keywords: csv,hdf5,io,numpy,pandas,plugin,registry,scientific-data,scitex,zarr
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Requires-Dist: click>=8.0.0
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: rich>=13.0.0
Requires-Dist: scitex-dev>=0.1.0
Provides-Extra: all
Requires-Dist: bibtexparser; extra == 'all'
Requires-Dist: fastmcp>=2.0; extra == 'all'
Requires-Dist: h5py; extra == 'all'
Requires-Dist: joblib; extra == 'all'
Requires-Dist: matplotlib; extra == 'all'
Requires-Dist: myst-parser>=2.0; extra == 'all'
Requires-Dist: numcodecs; extra == 'all'
Requires-Dist: openpyxl; extra == 'all'
Requires-Dist: pdfplumber; extra == 'all'
Requires-Dist: pillow; extra == 'all'
Requires-Dist: pypdf2; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: python-docx; extra == 'all'
Requires-Dist: pyyaml; extra == 'all'
Requires-Dist: scipy; extra == 'all'
Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == 'all'
Requires-Dist: sphinx-copybutton>=0.5; extra == 'all'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'all'
Requires-Dist: sphinx>=7.0; extra == 'all'
Requires-Dist: xlrd; extra == 'all'
Requires-Dist: zarr; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: myst-parser>=2.0; extra == 'docs'
Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == 'docs'
Requires-Dist: sphinx-copybutton>=0.5; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'docs'
Requires-Dist: sphinx>=7.0; extra == 'docs'
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0; extra == 'mcp'
Provides-Extra: scientific
Requires-Dist: h5py; extra == 'scientific'
Requires-Dist: matplotlib; extra == 'scientific'
Requires-Dist: numcodecs; extra == 'scientific'
Requires-Dist: scipy; extra == 'scientific'
Requires-Dist: zarr; extra == 'scientific'
Description-Content-Type: text/markdown

# scitex-io

<p align="center">
  <a href="https://scitex.ai">
    <img src="docs/scitex-logo-blue-cropped.png" alt="SciTeX" width="400">
  </a>
</p>

<p align="center"><b>Universal scientific data I/O with plugin registry</b></p>

<p align="center">
  <a href="https://badge.fury.io/py/scitex-io"><img src="https://badge.fury.io/py/scitex-io.svg" alt="PyPI version"></a>
  <a href="https://scitex-io.readthedocs.io/"><img src="https://readthedocs.org/projects/scitex-io/badge/?version=latest" alt="Documentation"></a>
  <a href="https://github.com/ywatanabe1989/scitex-io/actions/workflows/ci.yml"><img src="https://github.com/ywatanabe1989/scitex-io/actions/workflows/ci.yml/badge.svg" alt="Tests"></a>
  <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL--3.0-blue.svg" alt="License: AGPL-3.0"></a>
</p>

<p align="center">
  <a href="https://scitex-io.readthedocs.io/">Full Documentation</a> · <code>pip install scitex-io</code>
</p>

---

## Problem

Three problems recur across scientific Python projects:

1. **Format fragmentation.** Loading a CSV requires `pandas.read_csv()`, an HDF5 file requires `h5py.File()`, a NumPy array requires `numpy.load()`. Each format demands its own library, its own API, and its own boilerplate. Operating systems solved this decades ago: double-click any file and the OS dispatches to the right application. Python's standard library has no built-in equivalent.

2. **Hard-coded parameters scattered across scripts.** Sample rates, thresholds, model hyperparameters, plot dimensions — magic numbers buried in code, duplicated across files, impossible to track or share. Changing one parameter means grepping through the entire project.

3. **Figures without provenance.** A saved PNG has no record of the code, parameters, or session that produced it. Months later, reproducing a figure means reverse-engineering which script with which settings generated it.

## Solution

scitex-io addresses all three:

- **`save()`/`load()`** — One interface for 30+ formats with automatic extension-based dispatch. A plugin registry lets you add custom formats without modifying the library.
- **`load_configs()`** — Loads all YAML files from a `config/` directory into a single `DotDict` with dot-notation access. Parameters are version-controlled, centralized, and separate from code.
- **`embed_metadata()`/`read_metadata()`** — Embeds provenance (timestamps, session IDs, parameters) directly into image and PDF files. The figure carries its own history.

<details>
<summary><b>Supported Formats (30+)</b></summary>

<br>

| Category | Extensions |
|----------|-----------|
| Spreadsheet | `.csv`, `.tsv`, `.xlsx`, `.xls` |
| Scientific | `.npy`, `.npz`, `.mat`, `.hdf5`, `.h5`, `.zarr` |
| Serialization | `.pkl`, `.pickle`, `.pkl.gz`, `.joblib` |
| ML/DL | `.pth`, `.pt`, `.cbm` |
| Config | `.json`, `.yaml`, `.yml` |
| Documents | `.txt`, `.md`, `.pdf`, `.docx`, `.tex` |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.tiff`, `.tif`, `.svg` |
| Media | `.mp4` |
| Web | `.html` |
| Bibliography | `.bib` |

</details>

## Installation

Requires Python >= 3.9.

```bash
pip install scitex-io
```

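For MCP server support, install the `mcp` extra. Quote the requirement so that shells such as zsh do not expand the brackets:

```bash
pip install "scitex-io[mcp]"
```

The `scientific` extra adds h5py, zarr, numcodecs, scipy, and matplotlib; the `all` extra installs every optional dependency.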

> **SciTeX users**: `pip install scitex` already includes scitex-io.

## Quickstart

### Save and Load

```python
import numpy as np
import pandas as pd

from scitex_io import load, save

# Universal save/load: the format is detected from the file extension
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
save(df, "data.csv")
loaded = load("data.csv")

# Other formats use the same two calls
save(np.array([1, 2, 3]), "data.npy")
save({"key": "value"}, "config.yaml")
save({"nested": [1, 2]}, "data.json")
```

### Project Configuration

Hard-coded parameters belong in config files, not in code. Use **UPPER_CASE** keys — Python's convention for constants — to signal that these are user-defined values:

```
project/
  config/
    PATHS.yaml          # DATA_DIR: /data/experiment_01
    PREPROCESS.yaml     # SAMPLE_RATE: 1000, BANDPASS: [0.5, 40]
    MODEL.yaml          # HIDDEN_DIM: 256, DROPOUT: 0.3
    PLOT.yaml           # FIGSIZE: [180, 60], DPI: 300
    IS_DEBUG.yaml       # IS_DEBUG: true
```

```python
from scitex_io import load_configs

CONFIG = load_configs()          # loads ./config/*.yaml
CONFIG.PATHS.DATA_DIR            # "/data/experiment_01"
CONFIG.PREPROCESS.SAMPLE_RATE    # 1000
CONFIG.MODEL.HIDDEN_DIM          # 256

# Debug mode: DEBUG_ prefixed keys override their counterparts
# In MODEL.yaml: { HIDDEN_DIM: 256, DEBUG_HIDDEN_DIM: 32 }
CONFIG = load_configs(IS_DEBUG=True)
CONFIG.MODEL.HIDDEN_DIM          # 32 (debug value promoted)
```

`load_configs()` returns a `DotDict`, a nested dictionary with dot-notation access. Keeping parameters in YAML files makes them version-controlled, shareable, and separate from code.
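To make the dot-notation access and the `DEBUG_` override concrete, here is a minimal sketch of the idea. It is not scitex-io's implementation: `DotDictSketch` and `promote_debug_keys` are hypothetical names, the YAML parsing step is replaced by a plain dictionary, and dropping unused `DEBUG_` keys in normal mode is an assumption.

```python
from typing import Any, Dict


class DotDictSketch(dict):
    """Dict whose keys are also readable as attributes; nested dicts are wrapped on access."""

    def __getattr__(self, name: str) -> Any:
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        return DotDictSketch(value) if isinstance(value, dict) else value


def promote_debug_keys(section: Dict[str, Any], is_debug: bool) -> Dict[str, Any]:
    """Return a copy in which DEBUG_X replaces X when is_debug is true."""
    # Assumption: DEBUG_ keys are hidden from the result in both modes.
    out = {k: v for k, v in section.items() if not k.startswith("DEBUG_")}
    if is_debug:
        for key, value in section.items():
            if key.startswith("DEBUG_"):
                out[key[len("DEBUG_"):]] = value
    return out


# Stand-in for the parsed contents of config/MODEL.yaml
raw = {"MODEL": {"HIDDEN_DIM": 256, "DROPOUT": 0.3, "DEBUG_HIDDEN_DIM": 32}}

normal = DotDictSketch({k: promote_debug_keys(v, False) for k, v in raw.items()})
debug = DotDictSketch({k: promote_debug_keys(v, True) for k, v in raw.items()})

print(normal.MODEL.HIDDEN_DIM)  # 256
print(debug.MODEL.HIDDEN_DIM)   # 32
```

Keys without a `DEBUG_` counterpart, such as `DROPOUT` here, keep their normal value in both modes.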

### Metadata Embedding

Embed provenance into figures so they carry their own history:

```python
from scitex_io import embed_metadata, read_metadata, has_metadata

# Embed metadata into an image
embed_metadata("figure.png", {
    "experiment": "exp_042",
    "model": "resnet50",
    "accuracy": 0.94,
    "timestamp": "2026-03-11",
})

# Read it back — months later, from the file alone
meta = read_metadata("figure.png")
print(meta["experiment"])    # "exp_042"

# Check if a file has embedded metadata
has_metadata("figure.png")   # True
```

Supports PNG (tEXt chunks), JPEG (EXIF), SVG (XML metadata), and PDF (Info Dictionary).
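For readers curious about what a PNG `tEXt` chunk looks like on disk, the following standard-library sketch writes and reads one by hand. It only illustrates the file-format mechanism and is not how scitex-io is implemented: the helper names, the `provenance` keyword, and the JSON encoding of the payload are all assumptions.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        yield chunk_type, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_text(png: bytes, keyword: str, metadata: dict) -> bytes:
    """Insert a tEXt chunk holding JSON metadata just before IEND."""
    text = json.dumps(metadata).encode("latin-1")  # json.dumps emits ASCII by default
    payload = keyword.encode("latin-1") + b"\x00" + text
    iend_start = len(png) - 12  # IEND is the last chunk and has no data
    return png[:iend_start] + make_chunk(b"tEXt", payload) + png[iend_start:]


def read_text(png: bytes, keyword: str):
    """Return the decoded metadata stored under keyword, or None if absent."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, text = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return json.loads(text.decode("latin-1"))
    return None


# A 1x1 grayscale PNG built in memory so the example needs no input file
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", ihdr)
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)

tagged = embed_text(png, "provenance", {"experiment": "exp_042", "accuracy": 0.94})
print(read_text(tagged, "provenance"))  # {'experiment': 'exp_042', 'accuracy': 0.94}
```

Because the metadata lives in an ancillary chunk, image viewers ignore it and the pixel data is unchanged.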

<details>
<summary><b>Custom Format Registration</b></summary>

<br>

```python
from scitex_io import register_saver, register_loader, save, load

@register_saver(".custom")
def save_custom(obj, path, **kwargs):
    with open(path, "w") as f:
        f.write(str(obj))

@register_loader(".custom")
def load_custom(path, **kwargs):
    with open(path) as f:
        return f.read()

save("hello", "data.custom")
assert load("data.custom") == "hello"
```

</details>

## Three Interfaces

<details>
<summary><strong>Python API</strong></summary>

<br>

```python
from scitex_io import save, load, list_formats, register_saver, register_loader
from scitex_io import load_configs, DotDict
from scitex_io import embed_metadata, read_metadata, has_metadata

save(obj, "path.ext")        # Save any object
data = load("path.ext")      # Load any file
fmts = list_formats()        # Show all registered formats
cfg  = load_configs()        # Load ./config/*.yaml as DotDict
embed_metadata("fig.png", d) # Embed provenance into figure
```

> **[Full API reference](https://scitex-io.readthedocs.io/)**

</details>

<details>
<summary><strong>CLI Commands</strong></summary>

<br>

```bash
scitex-io --help-recursive          # Show all commands
scitex-io info                      # Show registered formats
scitex-io configs                   # Load and display project configs
scitex-io configs -d ./my_configs   # Custom config directory
scitex-io configs --json            # Output as JSON
scitex-io list-python-apis -vv      # List Python APIs with signatures
scitex-io version                   # Show version
scitex-io mcp start                 # Start MCP server
scitex-io mcp doctor                # Check MCP health
scitex-io mcp list-tools -vv        # List MCP tools with parameters
```

> **[Full CLI reference](https://scitex-io.readthedocs.io/)**

</details>

<details>
<summary><strong>MCP Server — for AI Agents</strong></summary>

<br>

AI agents can save, load, and discover formats autonomously.

| Tool | Description |
|------|-------------|
| `io_list_formats` | List all registered save/load formats |
| `io_load` | Load data from any supported format |
| `io_save` | Save data to any supported format |
| `io_load_configs` | Load YAML project configurations |
| `io_register_info` | Show how to register custom formats |

```bash
scitex-io mcp start
```

> **[Full MCP specification](https://scitex-io.readthedocs.io/)**

</details>

## Lint Rules

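These rules are reported by [scitex-linter](https://github.com/ywatanabe1989/scitex-linter) when this package is installed. In the messages, `stx` refers to the `scitex` package, conventionally imported as `import scitex as stx`.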

| Rule | Severity | Message |
|------|----------|---------|
| `STX-IO001` | warning | `np.save()` detected — use `stx.io.save()` for provenance tracking |
| `STX-IO002` | warning | `np.load()` detected — use `stx.io.load()` for provenance tracking |
| `STX-IO003` | warning | `pd.read_csv()` detected — use `stx.io.load()` for provenance tracking |
| `STX-IO004` | warning | `.to_csv()` detected — use `stx.io.save()` for provenance tracking |
| `STX-IO005` | warning | `pickle.dump()` detected — use `stx.io.save()` for provenance tracking |
| `STX-IO006` | warning | `json.dump()` detected — use `stx.io.save()` for provenance tracking |
| `STX-IO007` | warning | `.savefig()` detected — use `stx.io.save(fig, path)` for metadata embedding |
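As a rough illustration of how call-based rules like these can be detected, the sketch below walks a module's syntax tree with the standard `ast` module. It is an assumption about the general technique, not scitex-linter's code: the `RULES` mapping, `dotted_name`, and `find_violations` are hypothetical, and only literal names such as `np.save` are matched, with no alias resolution.

```python
import ast

# Hypothetical mapping from a dotted call name to a rule ID from the table above
RULES = {
    "np.save": "STX-IO001",
    "np.load": "STX-IO002",
    "pd.read_csv": "STX-IO003",
    "pickle.dump": "STX-IO005",
    "json.dump": "STX-IO006",
}


def dotted_name(node: ast.AST) -> str:
    """Render simple attribute chains such as np.save; return '' for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def find_violations(source: str):
    """Return sorted (line number, rule ID) pairs for calls listed in RULES."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            rule = RULES.get(dotted_name(node.func))
            if rule:
                hits.append((node.lineno, rule))
    return sorted(hits)


sample = "import numpy as np\nnp.save('a.npy', [1, 2])\narr = np.load('a.npy')\n"
print(find_violations(sample))  # [(2, 'STX-IO001'), (3, 'STX-IO002')]
```

Method-style rules such as `.to_csv()` and `.savefig()` would need a match on the attribute name alone, which this sketch leaves out.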

## Part of SciTeX

scitex-io is part of [**SciTeX**](https://scitex.ai). Inside a SciTeX session, the same load and save calls are tracked automatically:

```python
import scitex

@scitex.session
def main(CONFIG=scitex.INJECTED):
    data = scitex.io.load("input.csv")     # auto-tracked by clew
    result = process(data)                 # process() stands in for your own analysis code
    scitex.io.save(result, "output.csv")   # auto-tracked by clew
    return 0
```

`scitex.io` delegates to `scitex_io` — they share the same API and registry.

The SciTeX system follows the Four Freedoms for Research below, inspired by [the Free Software Definition](https://www.gnu.org/philosophy/free-sw.en.html):

>Four Freedoms for Research
>
>0. The freedom to **run** your research anywhere — your machine, your terms.
>1. The freedom to **study** how every step works — from raw data to final manuscript.
>2. The freedom to **redistribute** your workflows, not just your papers.
>3. The freedom to **modify** any module and share improvements with the community.
>
>AGPL-3.0 — because we believe research infrastructure deserves the same freedoms as the software it runs on.

---

<p align="center">
  <a href="https://scitex.ai" target="_blank"><img src="docs/scitex-icon-navy-inverted.png" alt="SciTeX" width="40"/></a>
</p>

