Metadata-Version: 2.4
Name: ml4t-data
Version: 0.1.0a10
Summary: High-performance market data management library with unified multi-provider interface
Project-URL: Homepage, https://github.com/stefan-jansen/ml4t-data
Project-URL: Documentation, https://ml4t-data.readthedocs.io
Project-URL: Repository, https://github.com/stefan-jansen/ml4t-data
Project-URL: Issues, https://github.com/stefan-jansen/ml4t-data/issues
Project-URL: Changelog, https://github.com/stefan-jansen/ml4t-data/blob/main/CHANGELOG.md
Author-email: ML4T Team <info@ml4trading.io>
Maintainer-email: ML4T Contributors <dev@ml4trading.io>
License: MIT
License-File: LICENSE
Keywords: backtesting,binance,cryptocompare,databento,finance,market-data,oanda,parquet,polars,quantitative-finance,trading,yahoo-finance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: filelock>=3.19.1
Requires-Dist: html5lib>=1.1
Requires-Dist: httpx>=0.25.0
Requires-Dist: lxml>=6.0.2
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas-market-calendars>=4.3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: polars>=0.20.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pybreaker>=1.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: structlog>=23.0.0
Requires-Dist: tenacity>=8.0.0
Provides-Extra: all
Requires-Dist: cot-reports>=0.1.0; extra == 'all'
Requires-Dist: databento>=0.38.0; extra == 'all'
Requires-Dist: hypothesis>=6.80.0; extra == 'all'
Requires-Dist: ipdb>=0.13.0; extra == 'all'
Requires-Dist: ipython>=8.14.0; extra == 'all'
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'all'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'all'
Requires-Dist: mkdocs>=1.6.0; extra == 'all'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'all'
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all'
Requires-Dist: pre-commit>=3.3.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'all'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'all'
Requires-Dist: yfinance>=0.2.0; extra == 'all'
Provides-Extra: all-providers
Requires-Dist: cot-reports>=0.1.0; extra == 'all-providers'
Requires-Dist: databento>=0.38.0; extra == 'all-providers'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all-providers'
Requires-Dist: yfinance>=0.2.0; extra == 'all-providers'
Provides-Extra: cot
Requires-Dist: cot-reports>=0.1.0; extra == 'cot'
Provides-Extra: databento
Requires-Dist: databento>=0.38.0; extra == 'databento'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.80.0; extra == 'dev'
Requires-Dist: ipdb>=0.13.0; extra == 'dev'
Requires-Dist: ipython>=8.14.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Provides-Extra: oanda
Requires-Dist: oandapyv20>=0.7.0; extra == 'oanda'
Provides-Extra: yahoo
Requires-Dist: yfinance>=0.2.0; extra == 'yahoo'
Description-Content-Type: text/markdown

# ml4t-data

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/pypi/v/ml4t-data)](https://pypi.org/project/ml4t-data/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Market data acquisition and management library providing unified access to 19 data providers with automated incremental updates and Hive-partitioned storage.

## Features

- **19 Data Providers**: Crypto, equities, forex, futures, and factor data behind a consistent API
- **Unified Interface**: `fetch_ohlcv()` works identically across all providers
- **Automated Updates**: CLI for incremental updates, gap detection, backfilling
- **Smart Storage**: Hive-partitioned Parquet with metadata tracking
- **Data Validation**: OHLC invariant checks, deduplication, anomaly detection
- **Polars-Based**: Polars DataFrames throughout; the project cites 10-100x speedups over pandas alternatives, though the actual gain depends on the workload

## Installation

```bash
pip install ml4t-data
```

## Quick Start

```python
from ml4t.data.providers import YahooFinanceProvider

# No API key needed
provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")

print(data)
# shape: (1_258, 7)
# ┌────────────┬────────┬────────┬────────┬────────┬──────────┬────────┐
# │ date       │ open   │ high   │ low    │ close  │ volume   │ symbol │
```

```python
from ml4t.data.providers import CoinGeckoProvider

# Crypto data (no API key needed)
provider = CoinGeckoProvider()
btc = provider.fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")
```

## Providers

### Free (No API Key)

| Provider | Coverage | Rate Limit | Best For |
|----------|----------|------------|----------|
| **Yahoo Finance** | Stocks, ETFs, crypto, forex | ~2000/hour | US equities, research |
| **CoinGecko** | 10,000+ cryptocurrencies | 50/min | Crypto historical |
| **FRED** | 850,000 economic series | 120/min | Macro indicators |
| **Fama-French** | Academic factor data | Unlimited | Factor research |
| **AQR** | Alternative factors | Unlimited | QMJ, BAB, TSMOM |

### Free (API Key Required)

| Provider | Free Tier | Paid From | Best For |
|----------|-----------|-----------|----------|
| **EODHD** | 500 calls/day | €20/month | Global markets (60+ exchanges) |
| **Tiingo** | 1000 calls/day | $30/month | High-quality US data |
| **Twelve Data** | 800 calls/day | $10/month | Multi-asset |
| **Alpha Vantage** | 25 calls/day | $50/month | Conservative research |

### Professional

| Provider | Starting Price | Coverage |
|----------|---------------|----------|
| **Databento** | $9/month | CME, CBOE, ICE futures/options |
| **Polygon** | $99/month | Stocks, options, forex, crypto |
| **Finnhub** | $60/month | 70+ global exchanges |

### Specialized

| Provider | Type | Coverage |
|----------|------|----------|
| **Binance** | Crypto exchange | 600+ pairs (geo-restricted) |
| **OANDA** | Forex broker | Major/minor pairs |
| **Wiki Prices** | Static dataset | 3,199 US stocks (1962-2018) |
| **AlgoSeek** | Static dataset | NASDAQ 100 minute bars (2015-2017) |
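
The rate limits listed above are set by the providers. How ml4t-data throttles requests internally is not described in this README, so the following is only a stdlib sketch of what staying under a limit such as CoinGecko's 50 requests/minute involves. `SlidingWindowLimiter` is a hypothetical name, not part of the library, and the clock is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window (illustrative only)."""

    def __init__(self, max_calls: int, period: float) -> None:
        self.max_calls = max_calls
        self.period = period
        self._calls: deque[float] = deque()  # timestamps of recent calls

    def wait_time(self, now: float) -> float:
        """Seconds to wait before a call at time `now` is allowed."""
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        # Window is full: wait until the oldest call expires.
        return self.period - (now - self._calls[0])

    def record(self, now: float) -> None:
        """Register that a call was made at time `now`."""
        self._calls.append(now)


limiter = SlidingWindowLimiter(max_calls=50, period=60.0)
for i in range(50):
    limiter.record(now=i * 0.1)  # 50 calls within the first 5 seconds
delay = limiter.wait_time(now=5.0)  # window is full, so a wait is required
```

In real use the caller would pass `time.monotonic()` as `now` and sleep for the returned delay before issuing the request.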

## CLI Usage

```bash
# Fetch data
ml4t-data fetch AAPL MSFT GOOGL --provider yahoo --start 2020-01-01

# Configuration-driven updates
ml4t-data update-all -c config.yaml

# Preview changes (dry run)
ml4t-data update-all -c config.yaml --dry-run

# Detect and fill gaps
ml4t-data update-all -c config.yaml --detect-gaps

# List stored data
ml4t-data list -c config.yaml

# Check specific symbol
ml4t-data info AAPL -c config.yaml
```
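
How `--detect-gaps` works internally is not documented here. As a rough sketch of the idea for daily data, the hypothetical helper below lists weekdays that are missing from a set of stored dates. It ignores exchange holidays, which a real implementation would need a trading calendar for.

```python
from datetime import date, timedelta


def missing_weekdays(stored: set[date], start: date, end: date) -> list[date]:
    """Return weekdays in [start, end] that are absent from `stored`."""
    missing = []
    day = start
    while day <= end:
        # weekday() is 0-4 for Monday-Friday, 5-6 for the weekend.
        if day.weekday() < 5 and day not in stored:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Tue, Wed and Fri are stored; Thu (Jan 4) and the following Mon (Jan 8) are not.
stored = {date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5)}
gaps = missing_weekdays(stored, date(2024, 1, 2), date(2024, 1, 8))
```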

### Configuration File

```yaml
# ml4t-data.yaml
storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01

  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
```

### Automated Updates (Cron)

```bash
# Weekdays (Mon-Fri) at 5 PM system time, after market close
0 17 * * 1-5 ml4t-data update-all -c ~/ml4t-data.yaml
```

## Storage

Data is stored in Hive-partitioned Parquet format:

```
~/data/market/
├── yahoo/
│   └── daily/
│       ├── symbol=AAPL/
│       │   └── data.parquet
│       └── symbol=MSFT/
│           └── data.parquet
└── coingecko/
    └── daily/
        └── symbol=bitcoin/
            └── data.parquet
```
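
The layout above maps each (provider, frequency, symbol) triple to one file. The sketch below rebuilds that mapping with `pathlib`; `partition_path` and `parse_symbol` are hypothetical helpers written for illustration and are not part of the library's API.

```python
from pathlib import Path


def partition_path(root: str, provider: str, frequency: str, symbol: str) -> Path:
    """Build <root>/<provider>/<frequency>/symbol=<symbol>/data.parquet."""
    base = Path(root).expanduser()  # resolves a leading "~" if present
    return base / provider / frequency / f"symbol={symbol}" / "data.parquet"


def parse_symbol(path: Path) -> str:
    """Recover the symbol from the `symbol=<value>` directory name."""
    key, _, value = path.parent.name.partition("=")
    if key != "symbol":
        raise ValueError(f"not a symbol partition: {path}")
    return value


path = partition_path("/data/market", "yahoo", "daily", "AAPL")
symbol = parse_symbol(path)
```

Encoding the symbol in the directory name is what lets query engines that understand Hive partitioning filter on `symbol` without opening every file.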

Query with DuckDB (zero-copy):

```python
import duckdb

conn = duckdb.connect()
result = conn.execute("""
    SELECT * FROM read_parquet('~/data/market/yahoo/daily/**/*.parquet')
    WHERE symbol IN ('AAPL', 'MSFT')
    AND date >= '2024-01-01'
""").pl()
```

## Data Validation

Built-in OHLC validation:

```python
from ml4t.data.validation import validate_ohlcv

# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies
issues = validate_ohlcv(data)
if issues:
    print(f"Found {len(issues)} data quality issues")
```
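
To make the invariants in the comments above concrete, here is a stdlib-only sketch that applies them to plain dictionaries. It is not the implementation of `validate_ohlcv`; `check_ohlc` and its string-based issue format are assumptions made for illustration, and gap and anomaly detection are left out.

```python
def check_ohlc(rows: list[dict]) -> list[str]:
    """Describe each OHLC invariant violation or duplicate date in `rows`."""
    issues = []
    seen = set()
    for i, r in enumerate(rows):
        if r["high"] < r["low"]:
            issues.append(f"row {i}: high < low")
        if r["high"] < max(r["open"], r["close"]):
            issues.append(f"row {i}: high below open/close")
        if r["low"] > min(r["open"], r["close"]):
            issues.append(f"row {i}: low above open/close")
        if r["date"] in seen:
            issues.append(f"row {i}: duplicate date {r['date']}")
        seen.add(r["date"])
    return issues


rows = [
    {"date": "2024-01-02", "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
    # high (10.2) is below open (10.5)
    {"date": "2024-01-03", "open": 10.5, "high": 10.2, "low": 10.0, "close": 10.1},
    # same date as the previous row
    {"date": "2024-01-03", "open": 10.1, "high": 10.6, "low": 10.0, "close": 10.4},
]
issues = check_ohlc(rows)
```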

## API Reference

### Providers

```python
from ml4t.data.providers import (
    # Free (no key)
    YahooFinanceProvider,
    CoinGeckoProvider,
    FREDProvider,
    FamaFrenchProvider,
    AQRProvider,

    # Free (key required)
    EODHDProvider,
    TiingoProvider,
    TwelveDataProvider,
    AlphaVantageProvider,

    # Professional
    DataBentoProvider,
    PolygonProvider,
    FinnhubProvider,

    # Specialized
    BinanceProvider,
    OANDAProvider,
)
```

### Common Interface

All providers implement the same interface:

```python
import polars as pl


class DataProvider:
    def fetch_ohlcv(
        self,
        symbol: str,
        start_date: str,
        end_date: str | None = None,
        frequency: str = "daily",
    ) -> pl.DataFrame:
        """Fetch OHLCV data for a symbol."""

    def fetch_multiple(
        self,
        symbols: list[str],
        start_date: str,
        end_date: str | None = None,
        frequency: str = "daily",
    ) -> dict[str, pl.DataFrame]:
        """Fetch OHLCV data for multiple symbols."""
```

### Storage

```python
from ml4t.data.storage import HiveStorage

storage = HiveStorage("~/data/market")

# Save data
storage.save(data, provider="yahoo", frequency="daily", symbol="AAPL")

# Load data
data = storage.load(provider="yahoo", frequency="daily", symbol="AAPL")

# Query metadata
metadata = storage.get_metadata(provider="yahoo", frequency="daily", symbol="AAPL")
print(f"Last updated: {metadata['last_updated']}")
```

## Integration with ML4T Libraries

ml4t-data is part of the ML4T library ecosystem:

```python
from ml4t.data import DataManager
from ml4t.engineer import compute_features
from ml4t.backtest import Engine, Strategy

# Complete workflow
data = DataManager().fetch("SPY", "2020-01-01", "2023-12-31")
features = compute_features(data, ["rsi", "macd", "atr"])
# ... backtest with features
```

## Ecosystem

- **ml4t-data**: Market data acquisition and storage (this library)
- **ml4t-engineer**: Feature engineering and indicators
- **ml4t-diagnostic**: Statistical validation and evaluation
- **ml4t-backtest**: Event-driven backtesting
- **ml4t-live**: Live trading platform

## Testing

```bash
# Run tests (3,071 tests)
uv run pytest tests/ -q

# Type checking
uv run ty check

# Linting
uv run ruff check src/
```

## Development

```bash
git clone https://github.com/stefan-jansen/ml4t-data.git
cd ml4t-data

# Install with dev dependencies
uv sync

# Run tests
uv run pytest tests/ -q

# Type checking
uv run ty check
```

## Environment Variables

```bash
# API keys
export EODHD_API_KEY="your_key"
export DATABENTO_API_KEY="your_key"
export FINNHUB_API_KEY="your_key"
export TIINGO_API_KEY="your_key"
export FRED_API_KEY="your_key"

# Configuration
export ML4T_DATA_CONFIG="$HOME/ml4t-data.yaml"  # "~" is not expanded inside quotes
export ML4T_DATA_LOG_LEVEL="INFO"
```

## License

MIT License - see [LICENSE](LICENSE) for details.
