Metadata-Version: 2.2
Name: dstorage-gpu
Version: 1.0.0
Summary: Fast GPU tensor loading via DirectStorage on Windows (NVMe -> GPU, no CPU copy)
Keywords: directstorage,gpu,cuda,nvme,pytorch,tensor,windows
Author: James
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Project-URL: Homepage, https://github.com/jamesalexander/GpuFastLoadWindows
Project-URL: Repository, https://github.com/jamesalexander/GpuFastLoadWindows
Requires-Python: >=3.10
Requires-Dist: torch>=2.0
Requires-Dist: numpy
Description-Content-Type: text/markdown

# dstorage-gpu

[![PyPI](https://img.shields.io/pypi/v/dstorage-gpu)](https://pypi.org/project/dstorage-gpu/)
![Windows](https://img.shields.io/badge/platform-Windows-blue)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

Fast GPU tensor loading on Windows using Microsoft DirectStorage — loads data directly from NVMe to GPU memory, bypassing CPU RAM entirely.

## Problem

Loading large model weights into GPU memory on Windows is slow. `torch.load` copies data through CPU RAM (NVMe -> page cache -> CPU -> GPU). On Linux, NVIDIA GPUDirect Storage bypasses the CPU, but `torch.cuda.gds.GdsFile` raises `RuntimeError` on Windows.

**dstorage-gpu** solves this using Microsoft DirectStorage with D3D12-CUDA interop:

```
Traditional:  NVMe -> OS page cache -> CPU RAM -> cudaMemcpy H2D -> GPU  (~0.7 GB/s)
dstorage-gpu: NVMe -> DirectStorage -> D3D12 GPU buffer -> CUDA tensor   (~11 GB/s)
```

## Performance

Tested on NVIDIA RTX 3060 (12 GB), Windows 11, Crucial CT4000P3 NVMe (PCIe 3.0 x4).

### Single file (500 MB float32, 10 runs, 3 warmup)

| Method | Mean (ms) | GB/s | vs torch.load |
|---|---|---|---|
| **dstorage-gpu** | **42** | **11.7** | **16x faster** |
| torch.load -> .cuda() | 673 | 0.73 | baseline |
| np.fromfile -> .cuda() | 862 | 0.57 | 0.8x (slower) |

### Batch load (8x64 MB, single fence, 10 runs)

| Method | Mean (ms) | GB/s | vs baseline |
|---|---|---|---|
| **dstorage-gpu batch** | **54** | **9.4** | **17x faster** |
| np.fromfile (sequential) | 917 | 0.55 | baseline |

Peak throughput: **12.4 GB/s** (single file), **10.2 GB/s** (batch). Performance scales with NVMe speed — PCIe 4.0 drives should approach 14+ GB/s.
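
The GB/s and speedup columns can be reproduced from the mean times. The sketch below is a worked calculation under two assumptions that are inferred from the numbers rather than stated by the benchmark: the 500 MB test file holds 131,072,000 float32 values (the count used in Quick Start), and the table reports binary gigabytes (GiB/s). The helper names are illustrative only.

```python
FILE_BYTES = 131_072_000 * 4  # assumed test file size: 500 MiB of float32


def throughput_gib_s(num_bytes: int, mean_ms: float) -> float:
    """Throughput in GiB/s for num_bytes transferred in mean_ms milliseconds."""
    return (num_bytes / 2**30) / (mean_ms / 1000.0)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Mean times (ms) taken from the single-file table above.
rows = [
    ("dstorage-gpu", 42),
    ("torch.load -> .cuda()", 673),
    ("np.fromfile -> .cuda()", 862),
]
for label, ms in rows:
    gib_s = throughput_gib_s(FILE_BYTES, ms)
    print(f"{label:24s} {gib_s:6.2f} GiB/s  {speedup(673, ms):5.1f}x vs torch.load")
```

With the rounded 42 ms mean this gives roughly 11.6 GiB/s for dstorage-gpu, which is consistent with the table's 11.7 if the unrounded mean was slightly under 42 ms.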

## Installation

### From PyPI (pre-built wheel)

```powershell
pip install dstorage-gpu
```

Requires: Windows 10 1909+, NVIDIA GPU (Pascal or later), CUDA Toolkit 12.x, PyTorch 2.x (CUDA build).

### From source

```powershell
git clone https://github.com/jamesburton/dstorage_gpu.git
cd dstorage_gpu

# Download DirectStorage SDK
.\setup.ps1

# Build and install
pip install -e .
```

Build prerequisites: Visual Studio 2022 (C++ workload), CMake 3.25+, CUDA Toolkit 12.x, pybind11.

## Quick Start

```python
from dstorage_gpu import DirectStorageLoader

loader = DirectStorageLoader()

# Load a single binary file directly to GPU
tensor = loader.load_tensor("weights.bin", num_elements=131072000)
print(tensor.shape, tensor.device)  # torch.Size([131072000]) cuda:0
```

## API Reference

### `DirectStorageLoader()`

Creates a loader instance. Caches D3D12 device, DXGI factory, and DirectStorage queue internally — create once, reuse for all loads.

### `loader.load_tensor(filepath, num_elements, dtype=torch.float32, device="cuda")`

Loads a flat binary file directly into a CUDA tensor without staging the data in CPU RAM.

| Parameter | Type | Description |
|---|---|---|
| `filepath` | `str \| Path` | Path to raw binary file |
| `num_elements` | `int` | Number of elements (e.g., file_bytes // 4 for float32) |
| `dtype` | `torch.dtype` | Element type (default: `torch.float32`) |
| `device` | `str` | Must be `"cuda"` |

Returns: `torch.Tensor` on CUDA device.
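
Because `num_elements` has to match the file size, it can help to derive it from the file rather than hard-code it. The following is a minimal sketch using only the standard library; `count_elements` and the `ITEMSIZE` table are hypothetical helpers, not part of the package.

```python
from pathlib import Path

# Bytes per element for a few common dtypes, keyed by name (illustrative subset).
ITEMSIZE = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int64": 8}


def count_elements(filepath, dtype_name="float32"):
    """Return how many dtype_name elements a raw binary file contains.

    Raises ValueError if the file size is not a whole number of elements,
    which usually means the wrong dtype was assumed.
    """
    size = Path(filepath).stat().st_size
    itemsize = ITEMSIZE[dtype_name]
    if size % itemsize:
        raise ValueError(
            f"{filepath}: {size} bytes is not a multiple of {itemsize}"
        )
    return size // itemsize
```

The result could then be passed as the second argument, for example `loader.load_tensor(path, count_elements(path))`.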

### `loader.load_tensors(file_specs, dtype=torch.float32)`

Batch-load multiple files with a single DirectStorage submit + fence. Significantly faster than loading files sequentially.

| Parameter | Type | Description |
|---|---|---|
| `file_specs` | `list[tuple[str, int]]` | List of `(filepath, num_elements)` tuples |
| `dtype` | `torch.dtype` | Element type (default: `torch.float32`) |

Returns: `list[torch.Tensor]` on CUDA device.

### `loader.load_tensor_timed(filepath, num_elements, dtype=torch.float32)`

Same as `load_tensor` but returns `(tensor, elapsed_seconds)` for benchmarking.

### `load_tensor(filepath, num_elements, dtype=torch.float32)`

Module-level convenience function using a singleton `DirectStorageLoader`.

```python
from dstorage_gpu import load_tensor
tensor = load_tensor("weights.bin", num_elements=131072000)
```

## Usage Examples

### Loading raw model weights

```python
from pathlib import Path
from dstorage_gpu import DirectStorageLoader

loader = DirectStorageLoader()

# Load a raw float32 weight matrix
path = "model/attention_weight.bin"
num_floats = Path(path).stat().st_size // 4  # 4 bytes per float32
weight = loader.load_tensor(path, num_floats).reshape(768, 3072)
```

### Batch-loading transformer layers

```python
from pathlib import Path
from dstorage_gpu import DirectStorageLoader

loader = DirectStorageLoader()

# Build specs for all layer files
layer_specs = []
for i in range(24):
    path = f"model/layer_{i:02d}.bin"
    num_elements = Path(path).stat().st_size // 4
    layer_specs.append((path, num_elements))

# Single DirectStorage submit — all 24 layers loaded with one fence wait
layers = loader.load_tensors(layer_specs)
```

### Measuring throughput

```python
from dstorage_gpu import DirectStorageLoader

loader = DirectStorageLoader()
tensor, elapsed = loader.load_tensor_timed("weights.bin", 131_072_000)

size_gb = 131_072_000 * 4 / 1e9
print(f"{size_gb:.2f} GB loaded in {elapsed:.3f}s = {size_gb / elapsed:.1f} GB/s")
```
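
A single timing is noisy, so the benchmark figures in this README use warmup runs followed by several measured runs. A minimal sketch of that methodology is shown below; `summarize_runs` is a hypothetical helper that accepts any callable returning elapsed seconds, so it does not depend on the package.

```python
import statistics


def summarize_runs(timed_call, runs=10, warmup=3):
    """Call timed_call() warmup + runs times and summarize the measured runs.

    timed_call must return the elapsed time of one load, in seconds.
    """
    for _ in range(warmup):
        timed_call()  # warmup results are discarded
    samples = [timed_call() for _ in range(runs)]
    return {
        "mean_ms": statistics.mean(samples) * 1000,
        "min_ms": min(samples) * 1000,
        # stdev needs at least two samples
        "stdev_ms": statistics.stdev(samples) * 1000 if len(samples) > 1 else 0.0,
    }
```

With a real loader, the callable could be `lambda: loader.load_tensor_timed("weights.bin", 131_072_000)[1]`, which picks the elapsed time out of the returned tuple.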

### Preparing raw binary files from PyTorch checkpoints

```python
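from pathlib import Path

import torch

# Convert a .pt checkpoint to raw binary for fast loading
Path("model").mkdir(exist_ok=True)  # tofile() does not create directories
state_dict = torch.load("model.pt", map_location="cpu")
for name, param in state_dict.items():
    # Save each parameter as a flat binary file in its own dtype;
    # pass that dtype to load_tensor if it is not float32
    flat = param.detach().contiguous()
    flat.numpy().tofile(f"model/{name}.bin")
    size_mb = flat.numel() * flat.element_size() / 1e6
    print(f"{name}: {tuple(param.shape)} -> {size_mb:.1f} MB")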
```
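
Raw binary files carry no shape or dtype, so the loading side has to know them in advance (as with the `.reshape(768, 3072)` call earlier). One option is to write a small JSON sidecar next to the `.bin` files. This is a sketch under that assumption; `manifest.json`, `write_manifest`, and `read_specs` are hypothetical and not part of the package.

```python
import json
import math
from pathlib import Path


def write_manifest(out_dir, entries):
    """Write out_dir/manifest.json from {name: (shape, dtype_name)}."""
    manifest = {
        name: {"file": f"{name}.bin", "shape": list(shape), "dtype": dtype}
        for name, (shape, dtype) in entries.items()
    }
    path = Path(out_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


def read_specs(manifest_path):
    """Return (file_specs, shapes) recovered from a manifest.

    file_specs is a list of (filepath, num_elements) tuples in manifest order;
    shapes maps each parameter name to its original shape.
    """
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text())
    specs, shapes = [], {}
    for name, meta in manifest.items():
        num_elements = math.prod(meta["shape"])  # 1 for scalar (empty) shapes
        specs.append((str(manifest_path.parent / meta["file"]), num_elements))
        shapes[name] = tuple(meta["shape"])
    return specs, shapes
```

The returned `file_specs` list has the `(filepath, num_elements)` form that `load_tensors` expects, and `shapes` supplies the arguments for reshaping each flat tensor afterwards.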

## When to Use dstorage-gpu

**Best for:**
- Loading large model weights (500 MB+) from NVMe
- Batch loading many files (MoE experts, transformer layers)
- Cold loads (file not in OS page cache)
- Latency-sensitive inference pipelines
- Any scenario where NVMe-to-GPU bandwidth matters

**Not needed for:**
- Small files (<10 MB) — overhead of D3D12 setup dominates
- Files already in OS page cache and loaded repeatedly — `torch.load` is fast for cached reads
- CPU-only workflows
- Linux — use NVIDIA GPUDirect Storage (`cuFile`) instead
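
The guidance above can be turned into a simple runtime check that decides which loading path to take. The sketch below is one possible encoding; `prefer_directstorage` is a hypothetical helper, and treating "10 MB" as 10 MiB is an assumption.

```python
import sys

MIN_BYTES = 10 * 1024 * 1024  # assumed reading of the "<10 MB" threshold above


def prefer_directstorage(file_bytes, platform=sys.platform, has_cuda=True):
    """Return True when the guidance above favors dstorage-gpu for this load."""
    if not platform.startswith("win"):
        return False  # on Linux, GPUDirect Storage is the suggested route
    if not has_cuda:
        return False  # CPU-only workflow
    return file_bytes >= MIN_BYTES  # small files: setup overhead dominates
```

A caller could fall back to `torch.load` or `np.fromfile` whenever this returns `False`.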

## Project Structure

```
dstorage_gpu/
├── pyproject.toml                   # pip install config (scikit-build-core)
├── CMakeLists.txt                   # Top-level build (builds _native.pyd)
├── LICENSE                          # MIT License
├── setup.ps1                        # Downloads DirectStorage SDK, sets env vars
├── dstorage_gpu/                    # Python package
│   ├── __init__.py                  # Public API: DirectStorageLoader, load_tensor
│   ├── _loader.py                   # Loader implementation
│   └── src/dstorage_gpu_ext.cpp     # C++ pybind11 extension (D3D12-CUDA interop)
├── benchmark/
│   ├── generate_test_file.py        # Generate test data (500 MB, batch, sweep)
│   ├── run_benchmark.py             # Benchmark suite
│   └── results/                     # Saved results
├── tests/                           # Test framework for contributors
│   └── conftest.py
└── docs/
    ├── ARCHITECTURE.md              # Technical deep-dive
    ├── BUILD_GUIDE.md               # Detailed build instructions
    └── CONTRIBUTING.md              # How to contribute (RTX IO, AMD, etc.)
```

## How It Works (Technical)

1. **GPU adapter matching**: The CUDA device LUID is matched to a DXGI adapter, ensuring DirectStorage and CUDA use the same physical GPU.

2. **D3D12 shared buffer**: A committed resource is created with `D3D12_HEAP_FLAG_SHARED` so it can be shared across APIs via an NT handle.

3. **DirectStorage fill**: The GPU-destination queue reads directly from NVMe into the D3D12 buffer via DMA — no CPU staging buffer.

4. **CUDA import**: `CreateSharedHandle` exports an NT handle; `cudaImportExternalMemory` + `cudaExternalMemoryGetMappedBuffer` maps it into CUDA address space.

5. **D2D copy**: A device-to-device `cudaMemcpy` copies into a PyTorch tensor. The external memory is then released.

For batch loads, multiple files are enqueued and submitted with a single fence, reducing per-request overhead.

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full technical deep-dive.

## Platform Support

| Platform | GPU | Status | Technology |
|---|---|---|---|
| **Windows** | **NVIDIA (CUDA)** | **Supported** | Microsoft DirectStorage + D3D12-CUDA interop |
| Windows | AMD (ROCm) | Open challenge | DirectStorage + D3D12-HIP interop (see CONTRIBUTING.md) |
| Linux | NVIDIA (CUDA) | Use native | NVIDIA GPUDirect Storage (`cuFile` / `nvidia-fs`) |
| Linux | AMD (ROCm) | Open challenge | AMD SmartAccess Storage (see CONTRIBUTING.md) |

## Contributing

We welcome contributions! Key open challenges:

- **AMD GPU support** via D3D12-HIP interop or AMD SmartAccess Storage
- **Linux NVIDIA support** via RTX IO / cuFile wrapper for API parity
- **GDeflate compression** support for compressed tensor files
- **Multi-GPU** loading with adapter selection

See [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) for details, test framework, and instructions.

## Hardware Used for Development

- GPU: NVIDIA GeForce RTX 3060 (12 GB)
- Storage: Crucial CT4000P3 4 TB NVMe (PCIe 3.0 x4, ~3.5 GB/s seq read)
- OS: Windows 11 Pro

## License

[MIT](LICENSE)
