Metadata-Version: 2.4
Name: scitex-dataset
Version: 0.3.1
Summary: Multi-domain scientific dataset fetcher — neuroscience, biology, pharmacology, medical (OpenNeuro, DANDI, PhysioNet, GEO, ChEMBL, ClinicalTrials.gov)
Project-URL: Homepage, https://github.com/ywatanabe1989/scitex-dataset
Project-URL: Documentation, https://scitex-dataset.readthedocs.io
Project-URL: Repository, https://github.com/ywatanabe1989/scitex-dataset.git
Project-URL: Issues, https://github.com/ywatanabe1989/scitex-dataset/issues
Author-email: Yusuke Watanabe <ywatanabe@scitex.ai>
License-Expression: AGPL-3.0-only
License-File: LICENSE
Keywords: bids,dandi,datasets,eeg,fmri,mri,neuroscience,nwb,openneuro,physionet
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Requires-Dist: click>=8.0.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: scitex-dev
Provides-Extra: all
Requires-Dist: fastmcp>=2.0.0; extra == 'all'
Requires-Dist: myst-parser>=2.0; extra == 'all'
Requires-Dist: pre-commit>=3.5.0; extra == 'all'
Requires-Dist: pytest-cov>=4.0.0; extra == 'all'
Requires-Dist: pytest>=7.0.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: scitex>=2.0.0; extra == 'all'
Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == 'all'
Requires-Dist: sphinx-copybutton>=0.5; extra == 'all'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'all'
Requires-Dist: sphinx>=7.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: pre-commit>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: myst-parser>=2.0; extra == 'docs'
Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == 'docs'
Requires-Dist: sphinx-copybutton>=0.5; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'docs'
Requires-Dist: sphinx>=7.0; extra == 'docs'
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0.0; extra == 'mcp'
Provides-Extra: scitex
Requires-Dist: scitex>=2.0.0; extra == 'scitex'
Description-Content-Type: text/markdown

# SciTeX Dataset (<code>scitex-dataset</code>)

<p align="center">
  <a href="https://scitex.ai">
    <img src="docs/scitex-logo-blue-cropped.png" alt="SciTeX" width="400">
  </a>
</p>

<p align="center"><b>Unified access to neuroscience and scientific datasets</b></p>

<p align="center">
  <a href="https://badge.fury.io/py/scitex-dataset"><img src="https://badge.fury.io/py/scitex-dataset.svg" alt="PyPI version"></a>
  <a href="https://scitex-dataset.readthedocs.io/"><img src="https://readthedocs.org/projects/scitex-dataset/badge/?version=latest" alt="Documentation"></a>
  <a href="https://github.com/ywatanabe1989/scitex-dataset/actions/workflows/test.yml"><img src="https://github.com/ywatanabe1989/scitex-dataset/actions/workflows/test.yml/badge.svg" alt="Tests"></a>
  <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL--3.0-blue.svg" alt="License: AGPL-3.0"></a>
</p>

<p align="center">
  <a href="https://scitex-dataset.readthedocs.io/">Full Documentation</a> · <code>pip install scitex-dataset</code>
</p>

---

## Problem

Neuroscience datasets are scattered across multiple repositories -- OpenNeuro, DANDI Archive, PhysioNet, Zenodo -- each with its own API, data format, and query interface. Researchers waste time navigating incompatible APIs to discover relevant data. AI agents lack a unified way to search and evaluate datasets programmatically.

## Solution

SciTeX Dataset provides a **single Python API, CLI, and MCP (Model Context Protocol) server** to discover and query metadata from major scientific data repositories. It focuses on fast metadata retrieval without downloading full datasets.

| Repository | Description | Data Types |
|------------|-------------|------------|
| **OpenNeuro** | Open platform for sharing neuroimaging data | MRI, EEG, MEG, iEEG, PET |
| **DANDI** | BRAIN Initiative data archive | Electrophysiology, Ophys |
| **PhysioNet** | Physiological signal databases | ECG, EEG, clinical data |
| **Zenodo** | General scientific data repository (CERN) | Any research data |

<p align="center"><sub><b>Table 1.</b> Supported data repositories. Each source is queried via its public API; no authentication required for metadata access.</sub></p>

## Installation

Requires Python >= 3.10.

```bash
pip install scitex-dataset
```

> **MCP support**: `pip install scitex-dataset[mcp]`

## Quick Start

```python
from scitex_dataset import fetch_all_datasets, format_dataset

# Fetch datasets from OpenNeuro
datasets = fetch_all_datasets(max_datasets=10)

# Format for analysis
for ds in datasets:
    formatted = format_dataset(ds)
    print(f"{formatted['id']}: {formatted['name']} ({formatted['n_subjects']} subjects)")
```

## Four Interfaces

<details>
<summary><strong>Python API</strong></summary>

<br>

```python
from scitex_dataset import fetch_all_datasets, format_dataset, search_datasets, sort_datasets
from scitex_dataset import neuroscience, database

# Fetch from specific sources
datasets = fetch_all_datasets(max_datasets=100)                    # OpenNeuro
dandi_ds = neuroscience.dandi.fetch_all_datasets(max_datasets=50)  # DANDI
phys_ds = neuroscience.physionet.fetch_all_datasets()              # PhysioNet

# Search and filter
eeg_datasets = search_datasets(datasets, modality="eeg", min_subjects=20)
popular = sort_datasets(datasets, by="downloads", descending=True)

# Local database for fast full-text search
database.build()                                        # index all sources
results = database.search("alzheimer EEG", min_subjects=20)
```

> **[Full API reference](https://scitex-dataset.readthedocs.io/)**

</details>

<details>
<summary><strong>CLI Commands</strong></summary>

<br>

```bash
scitex-dataset --help-recursive             # Show all commands

# Fetch from repositories
scitex-dataset openneuro -n 100 -o datasets.json -v
scitex-dataset dandi -n 50 -o dandi.json -v
scitex-dataset physionet -n 50 -v
scitex-dataset zenodo -q "neuroscience" -n 20

# Local database
scitex-dataset db build                     # index all sources
scitex-dataset db search "epilepsy EEG"     # full-text search
scitex-dataset db stats                     # show statistics

# Introspection
scitex-dataset list-python-apis -v          # list Python API tree
scitex-dataset mcp list-tools -v            # list MCP tools
```

> **[Full CLI reference](https://scitex-dataset.readthedocs.io/)**

</details>

<details>
<summary><strong>MCP Server -- for AI Agents</strong></summary>

<br>

AI agents can discover and query neuroscience datasets autonomously.

| Tool | Description |
|------|-------------|
| `dataset_openneuro_fetch` | Fetch datasets from OpenNeuro |
| `dataset_dandi_fetch` | Fetch datasets from DANDI Archive |
| `dataset_physionet_fetch` | Fetch datasets from PhysioNet |
| `dataset_zenodo_fetch` | Fetch datasets from Zenodo |
| `dataset_search` | Filter datasets by modality, subjects, etc. |
| `dataset_list_sources` | List available data repositories |
| `dataset_db_build` | Build local search database |
| `dataset_db_search` | Full-text search across all sources |
| `dataset_db_stats` | Database statistics |

<sub><b>Table 2.</b> Nine MCP tools available for AI-assisted dataset discovery. All tools accept JSON parameters and return JSON results.</sub>

```bash
scitex-dataset mcp start
```

> **[Full MCP specification](https://scitex-dataset.readthedocs.io/)**

</details>

<details>
<summary><strong>Skills — for AI Agent Discovery</strong></summary>

<br>

Skills provide workflow-oriented guides that AI agents query to discover capabilities and usage patterns.

```bash
scitex-dataset skills list              # List available skill pages
scitex-dataset skills get SKILL         # Show main skill page
scitex-dev skills export --package scitex-dataset  # Export to Claude Code
```

| Skill | Content |
|-------|---------|
| `quick-start` | Basic usage |
| `data-sources` | OpenNeuro, DANDI, PhysioNet |
| `cli-reference` | CLI commands |
| `mcp-tools` | MCP tools for AI agents |

</details>

## Part of SciTeX

SciTeX Dataset is part of [**SciTeX**](https://scitex.ai). When used inside the SciTeX framework, dataset discovery integrates with reproducible research sessions:

```python
import scitex
from scitex_dataset import fetch_all_datasets, format_dataset

@scitex.session
def main(logger=scitex.INJECTED):
    datasets = fetch_all_datasets(max_datasets=100, logger=logger)
    formatted = [format_dataset(ds) for ds in datasets]
    scitex.io.save(formatted, "openneuro_datasets.json")
    return 0
```

The SciTeX ecosystem follows the Four Freedoms for Research, inspired by [the Free Software Definition](https://www.gnu.org/philosophy/free-sw.en.html):

>Four Freedoms for Research
>
>0. The freedom to **run** your research anywhere -- your machine, your terms.
>1. The freedom to **study** how every step works -- from raw data to final manuscript.
>2. The freedom to **redistribute** your workflows, not just your papers.
>3. The freedom to **modify** any module and share improvements with the community.
>
>AGPL-3.0 -- because we believe research infrastructure deserves the same freedoms as the software it runs on.

---

<p align="center">
  <a href="https://scitex.ai" target="_blank"><img src="docs/scitex-icon-navy-inverted.png" alt="SciTeX" width="40"/></a>
</p>

<!-- EOF -->
