Metadata-Version: 2.4
Name: ml4t-data
Version: 0.1.0a11
Summary: High-performance market data management library with unified multi-provider interface
Project-URL: Homepage, https://github.com/stefan-jansen/ml4t-data
Project-URL: Documentation, https://ml4t-data.readthedocs.io
Project-URL: Repository, https://github.com/stefan-jansen/ml4t-data
Project-URL: Issues, https://github.com/stefan-jansen/ml4t-data/issues
Project-URL: Changelog, https://github.com/stefan-jansen/ml4t-data/blob/main/CHANGELOG.md
Author-email: ML4T Team <info@ml4trading.io>
Maintainer-email: ML4T Contributors <dev@ml4trading.io>
License: MIT
License-File: LICENSE
Keywords: backtesting,binance,cryptocompare,databento,finance,market-data,oanda,parquet,polars,quantitative-finance,trading,yahoo-finance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: filelock>=3.19.1
Requires-Dist: html5lib>=1.1
Requires-Dist: httpx>=0.25.0
Requires-Dist: lxml>=6.0.2
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas-market-calendars>=4.3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: polars>=0.20.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pybreaker>=1.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: structlog>=23.0.0
Requires-Dist: tenacity>=8.0.0
Provides-Extra: all
Requires-Dist: cot-reports>=0.1.0; extra == 'all'
Requires-Dist: databento>=0.38.0; extra == 'all'
Requires-Dist: hypothesis>=6.80.0; extra == 'all'
Requires-Dist: ipdb>=0.13.0; extra == 'all'
Requires-Dist: ipython>=8.14.0; extra == 'all'
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'all'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'all'
Requires-Dist: mkdocs>=1.6.0; extra == 'all'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'all'
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all'
Requires-Dist: pre-commit>=3.3.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'all'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'all'
Requires-Dist: yfinance>=0.2.0; extra == 'all'
Provides-Extra: all-providers
Requires-Dist: cot-reports>=0.1.0; extra == 'all-providers'
Requires-Dist: databento>=0.38.0; extra == 'all-providers'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all-providers'
Requires-Dist: yfinance>=0.2.0; extra == 'all-providers'
Provides-Extra: cot
Requires-Dist: cot-reports>=0.1.0; extra == 'cot'
Provides-Extra: databento
Requires-Dist: databento>=0.38.0; extra == 'databento'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.80.0; extra == 'dev'
Requires-Dist: ipdb>=0.13.0; extra == 'dev'
Requires-Dist: ipython>=8.14.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: oandapyv20>=0.7.0; extra == 'dev'
Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Provides-Extra: oanda
Requires-Dist: oandapyv20>=0.7.0; extra == 'oanda'
Provides-Extra: yahoo
Requires-Dist: yfinance>=0.2.0; extra == 'yahoo'
Description-Content-Type: text/markdown

# ml4t-data

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/pypi/v/ml4t-data)](https://pypi.org/project/ml4t-data/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Unified market data acquisition and storage for quantitative research workflows.

## Part of the ML4T Library Ecosystem

This library is one of five interconnected libraries supporting the machine learning for trading workflow described in [Machine Learning for Trading](https://mlfortrading.io):

![ML4T Library Ecosystem](docs/images/ml4t_ecosystem_workflow_print.jpeg)

Each library addresses a distinct stage: data infrastructure, feature engineering, signal evaluation, strategy backtesting, and live deployment.

## What This Library Does

Quantitative research requires consistent, reproducible access to market data from multiple sources. ml4t-data provides:

- A unified interface across 19 data providers (equities, crypto, forex, futures, economic data)
- Automated storage in Hive-partitioned Parquet format with metadata tracking
- Incremental updates, gap detection, and backfill capabilities via CLI
- Built-in data validation (OHLC invariants, deduplication, anomaly detection)

The goal is to support an ongoing research workflow rather than one-off downloads. Data is stored locally, tracked for freshness, and queryable with tools like DuckDB or Polars.
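
Gap detection can be pictured as comparing the stored dates against an expected calendar. The sketch below is not the library's implementation: `find_gaps` is a hypothetical helper that treats every Monday to Friday as an expected session and ignores exchange holidays, which a real trading calendar (for example from `pandas-market-calendars`, one of the declared dependencies) would handle.

```python
from datetime import date, timedelta


def find_gaps(stored: list[date], start: date, end: date) -> list[date]:
    """Return the weekdays in [start, end] that are absent from `stored`."""
    have = set(stored)
    missing = []
    day = start
    while day <= end:
        # weekday() is 0-4 for Monday-Friday, 5-6 for the weekend
        if day.weekday() < 5 and day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Illustrative data: Thursday 2024-01-04 and Monday 2024-01-08 were never stored.
stored = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5)]
gaps = find_gaps(stored, date(2024, 1, 2), date(2024, 1, 8))
print(gaps)
```

The missing dates returned by such a check are the ones a backfill would then request from the provider.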

![ml4t-data Architecture](docs/images/ml4t_data_architecture_print.jpeg)

## Installation

```bash
pip install ml4t-data
```

## Quick Start

```python
from ml4t.data.providers import YahooFinanceProvider

provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")
print(data.head())
```

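Other providers follow the same call pattern (an identifier, a start date, and an end date), although the method name depends on the data type: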

```python
from ml4t.data.providers import CoinGeckoProvider, FREDProvider

# Crypto
crypto = CoinGeckoProvider().fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")

# Economic data
fred = FREDProvider().fetch_series("GDP", "2020-01-01", "2024-12-31")
```
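
A shared call pattern means downstream code can be written once and reused across providers. The sketch below illustrates the idea with a structural type. `OHLCVProvider`, `StaticProvider`, and `fetch_many` are hypothetical names, the parameter names are assumptions, and the stand-in returns plain dictionaries rather than the Polars DataFrames the library returns.

```python
from typing import Protocol


class OHLCVProvider(Protocol):
    """Hypothetical structural type; parameter names are assumptions."""

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> list[dict]: ...


class StaticProvider:
    """Stand-in provider that returns one fixed bar, for illustration only."""

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> list[dict]:
        return [{"symbol": symbol, "date": start, "open": 1.0,
                 "high": 2.0, "low": 0.5, "close": 1.5, "volume": 100}]


def fetch_many(provider: OHLCVProvider, symbols: list[str],
               start: str, end: str) -> dict[str, list[dict]]:
    """Call the same method on any object that matches the protocol."""
    return {s: provider.fetch_ohlcv(s, start, end) for s in symbols}


bars = fetch_many(StaticProvider(), ["AAPL", "MSFT"], "2024-01-01", "2024-12-31")
print(sorted(bars))
```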

## Data Providers

### Free (No API Key)

| Provider | Coverage |
|----------|----------|
| Yahoo Finance | US/global equities, ETFs, crypto, forex |
| CoinGecko | 10,000+ cryptocurrencies |
| FRED | 850,000 economic series |
| Fama-French | Academic factor data |
| AQR | Alternative factors (QMJ, BAB, TSMOM) |

### API Key Required

| Provider | Coverage |
|----------|----------|
| EODHD | 60+ global exchanges |
| Tiingo | US equities, with an emphasis on data quality |
| Twelve Data | Multi-asset coverage |
| Alpha Vantage | Equities, forex, crypto |
| Databento | CME, CBOE, ICE futures/options |
| Polygon | US equities, options, forex, crypto |
| Finnhub | 70+ global exchanges |
| Binance | Crypto exchange data |
| OANDA | Forex broker data |

## CLI for Automated Updates

```bash
# Fetch specific symbols
ml4t-data fetch AAPL MSFT GOOGL --provider yahoo --start 2020-01-01

# Configuration-driven batch updates
ml4t-data update-all -c config.yaml

# Detect and fill gaps
ml4t-data update-all -c config.yaml --detect-gaps
```

Configuration example:

```yaml
storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01

  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
```
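
Each dataset names its symbols either inline (`symbols`) or through a file (`symbols_file`). How the CLI reads that file is not documented here, so the sketch below assumes one symbol per line. `resolve_symbols` is a hypothetical helper, the file contents are sample values, and the configuration is written as a dictionary that mirrors the YAML above to avoid a parser dependency.

```python
import tempfile
from pathlib import Path


def resolve_symbols(dataset: dict, base_dir: Path) -> list[str]:
    """Return symbols from an inline list or a one-symbol-per-line file."""
    if "symbols" in dataset:
        return list(dataset["symbols"])
    text = (base_dir / dataset["symbols_file"]).read_text()
    # Skip blank lines and surrounding whitespace
    return [line.strip() for line in text.splitlines() if line.strip()]


config = {
    "datasets": {
        "sp500_daily": {"provider": "yahoo", "symbols_file": "symbols/sp500.txt"},
        "crypto": {"provider": "coingecko",
                   "symbols": ["bitcoin", "ethereum", "solana"]},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "symbols").mkdir()
    # Sample file contents, not the real index membership
    (base / "symbols" / "sp500.txt").write_text("AAPL\nMSFT\n\nGOOGL\n")
    resolved = {name: resolve_symbols(ds, base)
                for name, ds in config["datasets"].items()}

print(resolved)
```

When the file is loaded with PyYAML instead, an unquoted value such as `2015-01-01` is parsed as a `datetime.date` rather than a string.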

## Storage Format

Data is stored as Hive-partitioned Parquet, with one `symbol=` directory per instrument under each provider and frequency:

```
~/data/market/
├── yahoo/daily/symbol=AAPL/data.parquet
├── yahoo/daily/symbol=MSFT/data.parquet
└── coingecko/daily/symbol=bitcoin/data.parquet
```

Query with DuckDB or Polars:

```python
import duckdb

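# hive_partitioning exposes each `symbol=...` directory name as a `symbol` column
result = duckdb.execute("""
    SELECT * FROM read_parquet(
        '~/data/market/yahoo/daily/**/*.parquet',
        hive_partitioning = true
    )
    WHERE symbol IN ('AAPL', 'MSFT')
    AND date >= '2024-01-01'
""").pl()  # return the result as a Polars DataFrame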
```
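
The directory layout carries information: the `symbol=AAPL` component is what Hive-aware readers turn into a `symbol` column. The following sketch shows how such a path can be built and parsed with the standard library. `partition_file` and `partition_values` are hypothetical helpers that follow the tree shown above, not functions from ml4t-data.

```python
from pathlib import Path


def partition_file(root: Path, provider: str, frequency: str, symbol: str) -> Path:
    """Build the Parquet path for one symbol under the documented layout."""
    return root / provider / frequency / f"symbol={symbol}" / "data.parquet"


def partition_values(path: Path) -> dict[str, str]:
    """Extract key=value directory components from a partitioned path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


path = partition_file(Path("data/market"), "yahoo", "daily", "AAPL")
values = partition_values(path)
print(path.as_posix(), values)
```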

## Data Validation

```python
from ml4t.data.validation import validate_ohlcv

# `data` is the DataFrame returned by fetch_ohlcv in the Quick Start
issues = validate_ohlcv(data)
# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies
```
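
To make the listed checks concrete, here is a minimal pure-Python sketch of the OHLC invariants and duplicate detection. It does not reproduce `validate_ohlcv`. The helpers `ohlc_violations` and `duplicate_dates`, the dictionary-based bars, and the sample prices are assumptions for illustration.

```python
def ohlc_violations(bar: dict) -> list[str]:
    """Return the names of the OHLC invariants that a single bar breaks."""
    op, hi, lo, cl = bar["open"], bar["high"], bar["low"], bar["close"]
    problems = []
    if hi < lo:
        problems.append("high < low")
    if hi < max(op, cl):
        problems.append("high < open/close")
    if lo > min(op, cl):
        problems.append("low > open/close")
    return problems


def duplicate_dates(bars: list[dict]) -> list[str]:
    """Return every date that appears again after its first occurrence."""
    seen, dupes = set(), []
    for bar in bars:
        if bar["date"] in seen:
            dupes.append(bar["date"])
        seen.add(bar["date"])
    return dupes


# Made-up bars: the second has a high below its open, the third repeats a date.
bars = [
    {"date": "2024-01-02", "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
    {"date": "2024-01-03", "open": 10.5, "high": 10.2, "low": 10.0, "close": 10.1},
    {"date": "2024-01-03", "open": 10.5, "high": 10.8, "low": 10.0, "close": 10.1},
]
report = [ohlc_violations(b) for b in bars]
dupes = duplicate_dates(bars)
print(report, dupes)
```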

## Technical Characteristics

- **Polars-based**: Native Polars DataFrames throughout
- **Consistent schema**: All providers return the same column structure
- **Metadata tracking**: Last update timestamps, row counts, date ranges
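
Tracked metadata is what makes incremental updates cheap: knowing the last stored date tells an update where to resume. The sketch below assumes a record shape that matches the fields listed above. `DatasetMetadata`, `next_fetch_start`, and `is_stale` are hypothetical names and the values are made up, so this should not be read as the library's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical per-symbol record of what is stored locally."""
    symbol: str
    row_count: int
    first_date: date
    last_date: date
    last_updated: datetime


def next_fetch_start(meta: DatasetMetadata) -> date:
    """An incremental update only needs data after the last stored date."""
    return meta.last_date + timedelta(days=1)


def is_stale(meta: DatasetMetadata, now: datetime,
             max_age: timedelta = timedelta(days=1)) -> bool:
    """Flag datasets whose last update is older than `max_age`."""
    return now - meta.last_updated > max_age


meta = DatasetMetadata(
    symbol="AAPL",
    row_count=1257,
    first_date=date(2020, 1, 2),
    last_date=date(2024, 12, 30),
    last_updated=datetime(2024, 12, 30, 22, 0, tzinfo=timezone.utc),
)
now = datetime(2025, 1, 2, 9, 0, tzinfo=timezone.utc)
print(next_fetch_start(meta), is_stale(meta, now))
```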

## Related Libraries

- **ml4t-engineer**: Feature engineering and technical indicators
- **ml4t-diagnostic**: Signal evaluation and statistical validation
- **ml4t-backtest**: Event-driven backtesting
- **ml4t-live**: Live trading with broker integration

## Development

```bash
git clone https://github.com/stefan-jansen/ml4t-data.git
cd ml4t-data
uv sync
uv run pytest tests/ -q
uv run ty check
```

## License

MIT License - see [LICENSE](LICENSE) for details.
