Metadata-Version: 2.4
Name: turboloader
Version: 0.6.0
Summary: High-performance multi-framework data loading library (10,146 img/s, 12x faster than PyTorch). Features: TensorFlow/Keras, JAX/Flax, PyTorch support, WebDataset format, cloud storage (S3/GCS/HTTP), SIMD-optimized JPEG decoder, and comprehensive benchmarking. Developed and tested on Apple M4 Max (48GB RAM) with C++20 and Python 3.8+
Author: TurboLoader Contributors
Author-email: Arnav Jain <arnav@example.com>
Maintainer-email: Arnav Jain <arnav@example.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/arnavjain/turboloader
Project-URL: Documentation, https://github.com/arnavjain/turboloader/blob/main/README.md
Project-URL: Repository, https://github.com/arnavjain/turboloader
Project-URL: Bug Tracker, https://github.com/arnavjain/turboloader/issues
Project-URL: Changelog, https://github.com/arnavjain/turboloader/blob/main/CHANGELOG.md
Keywords: machine-learning,deep-learning,data-loading,pytorch,performance,simd,imagenet,computer-vision
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS.md
Requires-Dist: numpy>=1.19.0
Requires-Dist: torch>=1.8.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.9; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Requires-Dist: build>=0.7.0; extra == "dev"
Requires-Dist: twine>=3.4.0; extra == "dev"
Provides-Extra: benchmarks
Requires-Dist: Pillow>=8.0.0; extra == "benchmarks"
Requires-Dist: tqdm>=4.60.0; extra == "benchmarks"
Requires-Dist: matplotlib>=3.3.0; extra == "benchmarks"
Requires-Dist: psutil>=5.8.0; extra == "benchmarks"
Provides-Extra: all
Requires-Dist: turboloader[benchmarks,dev]; extra == "all"
Dynamic: author
Dynamic: license-file
Dynamic: requires-python

# TurboLoader

**High-Performance ML Data Loading Library**

[![PyPI version](https://badge.fury.io/py/turboloader.svg)](https://badge.fury.io/py/turboloader)
[![C++20](https://img.shields.io/badge/C%2B%2B-20-blue.svg)](https://en.wikipedia.org/wiki/C%2B%2B20)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## Overview

TurboLoader is a high-performance data loading library designed to accelerate ML training by replacing Python's multiprocessing-based data loaders with efficient C++ native threads and thread-safe concurrent data structures.

**Key Features**:
- 🚀 **Native C++ Implementation** with Python bindings via pybind11
- ⚡ **SIMD-Optimized Transforms** using AVX2/AVX-512/NEON
- 🔒 **Thread-Safe Concurrent Queues** for reliable multi-threaded data passing
- 🧵 **C++ Native Threads** (no Python GIL, no multiprocessing overhead)
- 💾 **Zero-Copy Memory-Mapped I/O** for efficient file reading
- 📦 **WebDataset TAR Format** support for sharded datasets
- 🎯 **SIMD-Accelerated Image Decoders** (JPEG, PNG, WebP)
- 🎨 **7 Data Augmentation Transforms** with SIMD optimization
- 🐍 **PyTorch-Compatible API** intended as a drop-in replacement

---

## Performance

### Current Status (v0.6.0)

**Comprehensive Benchmark Results** (2000 images, 8 workers, batch_size=32, 3 epochs):

| Rank | Framework | Throughput | vs TurboLoader | Avg Epoch Time |
|------|-----------|------------|----------------|----------------|
| 1 | **TurboLoader** | **10,146 img/s** | **1.00x** | **0.18s** |
| 2 | TensorFlow tf.data | 7,569 img/s | 0.75x | 0.26s |
| 3 | PyTorch Cached | 3,123 img/s | 0.31x | 0.64s |
| 4 | PyTorch Optimized | 835 img/s | 0.08x | 2.40s |
| 5 | PIL Baseline | 277 img/s | 0.03x | 7.22s |
| 6 | PyTorch Naive | 85 img/s | 0.01x | 23.67s |

**Key Highlights**:
- **12x faster** than PyTorch Optimized DataLoader
- **3.2x faster** than PyTorch with local file caching
- **1.3x faster** than TensorFlow tf.data
- **Extremely stable**: ±0.005s standard deviation across epochs
- **Memory efficient**: 848 MB peak memory usage
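
The speedup figures above follow from the throughput column of the table. The short check below reproduces that arithmetic using only the numbers reported there; the dictionary and function names are illustrative and not part of TurboLoader.

```python
# Throughput figures (img/s) copied from the benchmark table above.
THROUGHPUT = {
    "TurboLoader": 10146,
    "TensorFlow tf.data": 7569,
    "PyTorch Cached": 3123,
    "PyTorch Optimized": 835,
    "PIL Baseline": 277,
    "PyTorch Naive": 85,
}


def speedup(baseline: str, reference: str = "TurboLoader") -> float:
    """Return how many times faster `reference` is than `baseline`."""
    return THROUGHPUT[reference] / THROUGHPUT[baseline]


def relative(framework: str, reference: str = "TurboLoader") -> float:
    """Return a framework's throughput as a fraction of the reference.

    This is the quantity shown in the 'vs TurboLoader' column.
    """
    return THROUGHPUT[framework] / THROUGHPUT[reference]


for name in THROUGHPUT:
    print(f"{name:20s} speedup={speedup(name):7.2f}x  relative={relative(name):.2f}x")
```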

**Test Configuration**:
- Hardware: Apple M1 Pro (8 cores, 16GB RAM)
- Dataset: 2000 synthetic 256x256 JPEG images (117 MB TAR archive)
- Configuration: 8 workers, batch size 32, 3 epochs
- Backend: C++ multi-threaded pipeline with SIMD optimizations

See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for a detailed analysis and the interactive [benchmark report](BENCHMARK_REPORT.html).

---

## Installation

```bash
pip install turboloader
```

**Requirements**:
- Python 3.8+
- C++20 compiler (GCC 10+, Clang 12+, MSVC 19.29+)
- CMake 3.15+

**Optional Dependencies**:
- libjpeg-turbo (JPEG decoding)
- libpng (PNG decoding)
- libwebp (WebP decoding)

---

## Quick Start

### Basic Usage

```python
import turboloader

# Create pipeline
pipeline = turboloader.Pipeline(
    tar_paths=['imagenet.tar'],
    num_workers=8,
    decode_jpeg=True
)

pipeline.start()

# Get batches
for _ in range(100):
    batch = pipeline.next_batch(32)
    for sample in batch:
        img = sample.get_image()  # NumPy array (H, W, C)
        # Your training code here...

pipeline.stop()
```
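
The example above calls `start()` and `stop()` explicitly, so an exception raised inside the training loop would skip `stop()`. One way to guard against that is a small context manager. This is a sketch only: `running` is a hypothetical helper, not a TurboLoader function, and it assumes nothing beyond the `start()` and `stop()` methods listed in the API Reference. A stand-in class is used so the snippet runs without a dataset.

```python
from contextlib import contextmanager


@contextmanager
def running(pipeline):
    """Start `pipeline`, yield it, and always call stop() on exit.

    Hypothetical helper: it only assumes the object has start() and stop().
    """
    pipeline.start()
    try:
        yield pipeline
    finally:
        pipeline.stop()


class FakePipeline:
    """Stand-in that records calls, used here instead of a real pipeline."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakePipeline()
try:
    with running(fake):
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass

print(fake.events)  # ['start', 'stop']
```

With a real pipeline the same pattern would read `with running(turboloader.Pipeline(...)) as pipeline:` followed by the batch loop.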

### With SIMD Transforms

```python
import turboloader

# Configure SIMD-accelerated transforms
config = turboloader.TransformConfig()
config.enable_resize = True
config.resize_width = 224
config.resize_height = 224
config.enable_normalize = True
config.mean = [0.485, 0.456, 0.406]
config.std = [0.229, 0.224, 0.225]

pipeline = turboloader.Pipeline(
    tar_paths=['imagenet.tar'],
    num_workers=8,
    decode_jpeg=True,
    enable_simd_transforms=True,
    transform_config=config
)

pipeline.start()
batch = pipeline.next_batch(256)
pipeline.stop()
```
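
The `mean` and `std` values above are the widely used ImageNet per-channel statistics. This README does not spell out the exact formula TurboLoader applies, so the sketch below assumes the common convention: scale 8-bit values to `[0, 1]`, then compute `(x - mean) / std` per channel. It is a pure-Python reference for a single pixel, not the SIMD implementation, and `normalize_pixel` is an illustrative name.

```python
from typing import Sequence, Tuple

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(
    rgb: Sequence[int],
    mean: Sequence[float] = MEAN,
    std: Sequence[float] = STD,
) -> Tuple[float, ...]:
    """Normalize one 8-bit RGB pixel channel by channel.

    Assumed convention: value / 255, minus the channel mean, divided by the
    channel standard deviation.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


print(normalize_pixel((255, 255, 255)))  # brightest pixel: all channels positive
print(normalize_pixel((0, 0, 0)))        # darkest pixel: all channels negative
```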

### With Data Augmentation

```python
import turboloader

# Create augmentation pipeline
aug_pipeline = turboloader.AugmentationPipeline()
aug_pipeline.add_transform(turboloader.RandomHorizontalFlip(0.5))
aug_pipeline.add_transform(turboloader.ColorJitter(brightness=0.2, contrast=0.2))
aug_pipeline.add_transform(turboloader.RandomCrop(224, 224))

# Use with data loader (planned feature)
# pipeline = turboloader.Pipeline(tar_paths=['data.tar'], augmentations=aug_pipeline)
```

---

## Architecture

TurboLoader is built on several high-performance components:

### Core Components

1. **Thread-Safe Concurrent Queues**
   - Mutex-based synchronization for reliable multi-threaded operation
   - Thread-safe data passing between reader and worker threads
   - Stable performance with high worker counts (8+ workers)

2. **Memory-Mapped I/O**
   - `mmap()` for zero-copy file reading
   - Efficient TAR archive parsing
   - Minimizes memory allocations

3. **SIMD Transforms**
   - AVX2/AVX-512 on x86_64
   - NEON on ARM (Apple Silicon, ARM servers)
   - Vectorized resize, normalize, color conversion

4. **Thread-Local Decoders**
   - Per-thread JPEG/PNG/WebP decoders
   - Eliminates decoder allocation overhead
   - Maximizes cache locality
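
To make the interplay between the queue and the thread-local decoders concrete, here is a minimal Python sketch of the pattern described above: a mutex-protected bounded queue feeding worker threads, each of which creates its decoder once and reuses it. All names (`BoundedQueue`, `get_decoder`, `run`) are hypothetical, and the "decoder" is a placeholder. The real components are implemented in C++; Python threads are shown only to illustrate the structure, not the performance.

```python
import threading
from collections import deque


class BoundedQueue:
    """Mutex-protected bounded FIFO with blocking put/get."""

    def __init__(self, capacity: int):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        # Both conditions share one mutex, so a single lock guards the deque.
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def put(self, item) -> None:
        with self._not_full:
            while len(self._items) >= self._capacity:
                self._not_full.wait()
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._not_empty:
            while not self._items:
                self._not_empty.wait()
            item = self._items.popleft()
            self._not_full.notify()
            return item


_local = threading.local()
STOP = object()  # sentinel telling a worker to exit


def get_decoder():
    """Return this thread's decoder, creating it on first use."""
    if not hasattr(_local, "decoder"):
        # Placeholder: a real decoder would wrap a JPEG/PNG/WebP codec.
        _local.decoder = lambda raw: raw.upper()
    return _local.decoder


def worker(inbox: BoundedQueue, outbox: BoundedQueue) -> None:
    decode = get_decoder()  # allocated once per thread, then reused
    while True:
        raw = inbox.get()
        if raw is STOP:
            break
        outbox.put(decode(raw))


def run(raw_samples, num_workers: int = 4):
    inbox = BoundedQueue(8)
    outbox = BoundedQueue(max(len(raw_samples), 1))  # large enough to never block
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for raw in raw_samples:  # the calling thread plays the reader role
        inbox.put(raw)
    for _ in threads:  # one sentinel per worker
        inbox.put(STOP)
    for thread in threads:
        thread.join()
    # Completion order is not deterministic, so sort for a stable result.
    return sorted(outbox.get() for _ in raw_samples)


print(run([b"a", b"b", b"c"]))  # [b'A', b'B', b'C']
```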

### Supported Transforms

TurboLoader includes 7 SIMD-accelerated augmentation transforms:

- **RandomHorizontalFlip**: SIMD-optimized horizontal flip
- **RandomVerticalFlip**: SIMD-optimized vertical flip
- **ColorJitter**: Brightness, contrast, saturation adjustments
- **RandomRotation**: Bilinear interpolation rotation
- **RandomCrop**: Random crop with padding
- **RandomErasing**: Cutout augmentation
- **GaussianBlur**: Separable Gaussian filter (SIMD)
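
As a reference for what a probabilistic transform in this list does, the sketch below implements horizontal flipping in plain Python on a tiny single-channel image. It illustrates the semantics only (flip with a given probability, otherwise return the image unchanged) and makes no claim about how TurboLoader's SIMD version is written; the function name and the list-of-rows image layout are assumptions made for brevity.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values; one channel for brevity


def random_horizontal_flip(image: Image, probability: float, rng: random.Random) -> Image:
    """Mirror every row left-to-right with the given probability."""
    if rng.random() < probability:
        return [row[::-1] for row in image]
    return [row[:] for row in image]  # unchanged copy


image = [[1, 2, 3],
         [4, 5, 6]]

# probability=1.0 always flips, because rng.random() is always < 1.0
print(random_horizontal_flip(image, 1.0, random.Random(0)))  # [[3, 2, 1], [6, 5, 4]]
# probability=0.0 never flips
print(random_horizontal_flip(image, 0.0, random.Random(0)))  # [[1, 2, 3], [4, 5, 6]]
```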

---

## API Reference

### Pipeline

```python
class Pipeline:
    def __init__(
        self,
        tar_paths: List[str],
        num_workers: int = 4,
        queue_size: int = 256,
        shuffle: bool = False,
        decode_jpeg: bool = False,
        enable_simd_transforms: bool = False,
        transform_config: Optional[TransformConfig] = None
    ) -> None: ...

    def start(self) -> None: ...
    def stop(self) -> None: ...
    def reset(self) -> None: ...
    def next_batch(self, batch_size: int) -> List[Sample]: ...
    def total_samples(self) -> int: ...
```
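
The methods above can be combined into a simple per-epoch iterator. The sketch below is hedged: this README does not say what `next_batch()` returns when the data runs out, so the helper assumes it returns at most `batch_size` samples and possibly fewer (or none) at the end. `iter_epoch` and `FakePipeline` are hypothetical names; the fake class exists only so the snippet runs without TurboLoader installed.

```python
from typing import Iterator, List


def iter_epoch(pipeline, batch_size: int) -> Iterator[List]:
    """Yield batches until total_samples() samples have been consumed.

    Assumption: next_batch(n) returns at most n samples and may return
    fewer, or an empty list, near the end of the data.
    """
    remaining = pipeline.total_samples()
    while remaining > 0:
        batch = pipeline.next_batch(min(batch_size, remaining))
        if not batch:  # nothing left: stop instead of spinning
            break
        remaining -= len(batch)
        yield batch


class FakePipeline:
    """In-memory stand-in exposing the same three methods."""

    def __init__(self, num_samples: int):
        self._samples = list(range(num_samples))
        self._cursor = 0

    def total_samples(self) -> int:
        return len(self._samples)

    def next_batch(self, batch_size: int) -> List[int]:
        batch = self._samples[self._cursor:self._cursor + batch_size]
        self._cursor += len(batch)
        return batch

    def reset(self) -> None:
        self._cursor = 0


pipeline = FakePipeline(10)
print([len(batch) for batch in iter_epoch(pipeline, batch_size=4)])  # [4, 4, 2]
```

Calling `reset()` between epochs would then rewind the pipeline before the next `iter_epoch` call.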

### TransformConfig

```python
class TransformConfig:
    enable_resize: bool = False
    resize_width: int = 224
    resize_height: int = 224
    resize_method: ResizeMethod = ResizeMethod.BILINEAR

    enable_normalize: bool = False
    mean: List[float] = [0.0, 0.0, 0.0]
    std: List[float] = [1.0, 1.0, 1.0]

    enable_color_convert: bool = False
    src_color: ColorSpace = ColorSpace.RGB
    dst_color: ColorSpace = ColorSpace.RGB
    output_float: bool = False
```

### Augmentation Transforms

```python
class AugmentationPipeline:
    def __init__(self, seed: Optional[int] = None) -> None: ...
    def add_transform(self, transform: AugmentationTransform) -> None: ...
    def clear(self) -> None: ...
    def num_transforms(self) -> int: ...

class RandomHorizontalFlip(AugmentationTransform):
    def __init__(self, probability: float = 0.5) -> None: ...

class ColorJitter(AugmentationTransform):
    def __init__(
        self,
        brightness: float = 0.0,
        contrast: float = 0.0,
        saturation: float = 0.0,
        hue: float = 0.0
    ) -> None: ...
```

---

## Roadmap

### v2.0 (Q1 2025) - HIGH PRIORITY

**Complete pipeline rewrite to fix critical performance issues**

See [ARCHITECTURE_V2.md](https://github.com/ALJainProjects/TurboLoader/blob/TurboLoader-rewrite/ARCHITECTURE_V2.md) for full design.

#### Core Infrastructure
- [ ] Lock-free SPSC ring buffers (~50x faster than mutex queues)
- [ ] Object pool for buffer reuse (eliminate malloc/free overhead)
- [ ] Zero-copy sample struct using `std::span` views
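
For readers unfamiliar with the first item, the sketch below shows the index protocol behind a single-producer/single-consumer (SPSC) ring buffer. It is a Python illustration with hypothetical names, not the planned implementation: in C++ the two indices would typically be `std::atomic` values with acquire/release ordering, which is what allows the structure to work without a mutex. The sketch says nothing about the speedup estimate quoted above.

```python
from typing import Any, List, Tuple


class SpscRing:
    """Fixed-capacity ring buffer for one producer and one consumer.

    Only the producer writes `_tail`; only the consumer writes `_head`.
    One slot is kept empty so that "full" and "empty" are distinguishable.
    """

    def __init__(self, capacity: int):
        self._slots: List[Any] = [None] * (capacity + 1)
        self._head = 0  # next slot to read (consumer-owned)
        self._tail = 0  # next slot to write (producer-owned)

    def try_push(self, item: Any) -> bool:
        """Store `item` and return True, or return False if the ring is full."""
        next_tail = (self._tail + 1) % len(self._slots)
        if next_tail == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = next_tail  # publish only after the slot is written
        return True

    def try_pop(self) -> Tuple[bool, Any]:
        """Return (True, item), or (False, None) if the ring is empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head]
        self._slots[self._head] = None  # drop the reference
        self._head = (self._head + 1) % len(self._slots)
        return True, item


ring = SpscRing(capacity=2)
print(ring.try_push("a"), ring.try_push("b"), ring.try_push("c"))  # True True False
print(ring.try_pop())  # (True, 'a')
print(ring.try_push("c"))  # True
print(ring.try_pop(), ring.try_pop(), ring.try_pop())  # (True, 'b') (True, 'c') (False, None)
```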

#### I/O Layer
- [ ] Per-worker TAR file handles (eliminate mutex bottleneck)
- [ ] Memory-mapped I/O for true zero-copy reads
- [ ] Worker-based sample partitioning

#### Decoding & Performance
- [ ] TurboJPEG SIMD decoder integration (2-3x faster)
- [ ] Object pool for decoded buffers
- [ ] Fallback to libjpeg for compatibility

#### Testing & Validation
- [ ] Comprehensive unit tests (all components)
- [ ] Performance benchmarks vs PyTorch (target: >100 img/s)
- [ ] Memory leak checks (valgrind)
- [ ] Thread safety verification (ThreadSanitizer)

**Expected Performance**: 150-200 img/s (3-4x faster than PyTorch baseline)

**Estimated Timeline**: 11-17 hours of development

**Branch**: `TurboLoader-rewrite`

---

### v0.4.0 (Q2 2025)
- [ ] Full ImageNet benchmark suite
- [ ] TensorFlow/JAX integration
- [ ] Additional image formats (TIFF, BMP)
- [ ] Video decoding support

### v0.5.0 (Q3 2025)
- [ ] GPU-accelerated JPEG decoding (nvJPEG)
- [ ] Distributed training support
- [ ] S3/GCS remote dataset loading

### v1.0.0 (Q4 2025)
- [ ] Production-ready API stability
- [ ] Comprehensive documentation
- [ ] Full test coverage
- [ ] Performance optimization

---

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Development Setup

```bash
# Clone repository
git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader

# Install dependencies
brew install cmake libjpeg-turbo libpng libwebp  # macOS
# or
apt-get install cmake libjpeg-turbo8-dev libpng-dev libwebp-dev  # Ubuntu

# Build from source
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j8

# Run tests
./tests/turboloader_tests
./tests/test_simd_transforms
```

---

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Citation

If you use TurboLoader in your research, please cite:

```bibtex
@software{turboloader2025,
  author = {Jain, Arnav},
  title = {TurboLoader: High-Performance ML Data Loading},
  year = {2025},
  url = {https://github.com/ALJainProjects/TurboLoader}
}
```

---

## Acknowledgments

- Inspired by [FFCV](https://github.com/libffcv/ffcv) and [NVIDIA DALI](https://github.com/NVIDIA/DALI)
- Built with [pybind11](https://github.com/pybind/pybind11)
- Uses [libjpeg-turbo](https://libjpeg-turbo.org/) for fast JPEG decoding

---

## Support

- **Issues**: [GitHub Issues](https://github.com/ALJainProjects/TurboLoader/issues)
- **Documentation**: [docs/](docs/)
- **PyPI**: [https://pypi.org/project/turboloader/](https://pypi.org/project/turboloader/)
