Metadata-Version: 2.4
Name: torchbridge-ml
Version: 0.5.80
Summary: Cross-backend validation and configuration intelligence for PyTorch — NVIDIA, AMD, Trainium, and TPU
Author: TorchBridge Team
License: Apache-2.0
Project-URL: Homepage, https://github.com/CloudlyIO/torchbridge
Project-URL: Documentation, https://torchbridge.readthedocs.io
Project-URL: Repository, https://github.com/CloudlyIO/torchbridge
Project-URL: Bug Tracker, https://github.com/CloudlyIO/torchbridge/issues
Keywords: pytorch,hardware-abstraction,multi-backend,cuda,amd,trainium,tpu,machine-learning,deep-learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch<3.0.0,>=2.0.0
Requires-Dist: numpy<3.0.0,>=1.21.0
Provides-Extra: dev
Requires-Dist: pytest<9.0.0,>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov<6.0.0,>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist<4.0.0,>=3.0.0; extra == "dev"
Requires-Dist: pytest-benchmark<5.0.0,>=4.0.0; extra == "dev"
Requires-Dist: hypothesis<7.0.0,>=6.0.0; extra == "dev"
Requires-Dist: matplotlib<4.0.0,>=3.5.0; extra == "dev"
Requires-Dist: seaborn<1.0.0,>=0.11.0; extra == "dev"
Requires-Dist: jupyter<2.0.0,>=1.0.0; extra == "dev"
Requires-Dist: tensorboard<3.0.0,>=2.9.0; extra == "dev"
Requires-Dist: ruff<1.0.0,>=0.4.0; extra == "dev"
Requires-Dist: mypy<2.0.0,>=1.0.0; extra == "dev"
Requires-Dist: pre-commit<4.0.0,>=3.5.0; extra == "dev"
Requires-Dist: bandit[toml]<2.0.0,>=1.7.0; extra == "dev"
Provides-Extra: all
Requires-Dist: transformers<5.0.0,>=4.35.0; extra == "all"
Requires-Dist: datasets<3.0.0,>=2.14.0; extra == "all"
Requires-Dist: tokenizers<1.0.0,>=0.14.0; extra == "all"
Requires-Dist: triton<4.0.0,>=2.1.0; extra == "all"
Requires-Dist: flash-attn<4.0.0,>=2.3.0; extra == "all"
Requires-Dist: accelerate<1.0.0,>=0.24.0; extra == "all"
Provides-Extra: cloud
Requires-Dist: boto3<2.0.0,>=1.28.0; extra == "cloud"
Requires-Dist: google-cloud-storage<3.0.0,>=2.10.0; extra == "cloud"
Requires-Dist: azure-storage-blob<13.0.0,>=12.17.0; extra == "cloud"
Requires-Dist: kubernetes<32.0.0,>=27.2.0; extra == "cloud"
Provides-Extra: serving
Requires-Dist: fastapi<1.0.0,>=0.103.0; extra == "serving"
Requires-Dist: uvicorn[standard]<1.0.0,>=0.23.0; extra == "serving"
Requires-Dist: torchserve<1.0.0,>=0.8.0; extra == "serving"
Requires-Dist: gradio<6.0.0,>=3.45.0; extra == "serving"
Requires-Dist: streamlit<2.0.0,>=1.27.0; extra == "serving"
Provides-Extra: monitoring
Requires-Dist: prometheus-client<1.0.0,>=0.17.0; extra == "monitoring"
Requires-Dist: wandb<1.0.0,>=0.16.0; extra == "monitoring"
Requires-Dist: tensorboard<3.0.0,>=2.14.0; extra == "monitoring"
Requires-Dist: mlflow<3.0.0,>=2.7.0; extra == "monitoring"
Requires-Dist: optuna<4.0.0,>=3.4.0; extra == "monitoring"
Provides-Extra: benchmark
Requires-Dist: memory-profiler<1.0.0,>=0.61.0; extra == "benchmark"
Requires-Dist: py-spy<1.0.0,>=0.3.14; extra == "benchmark"
Requires-Dist: torch-tb-profiler<1.0.0,>=0.4.0; extra == "benchmark"
Requires-Dist: psutil<6.0.0,>=5.9.0; extra == "benchmark"
Requires-Dist: gpustat<2.0.0,>=1.1.0; extra == "benchmark"
Provides-Extra: tracing
Requires-Dist: opentelemetry-api<2.0.0,>=1.20.0; extra == "tracing"
Requires-Dist: opentelemetry-sdk<2.0.0,>=1.20.0; extra == "tracing"
Requires-Dist: opentelemetry-exporter-otlp-proto-http<2.0.0,>=1.20.0; extra == "tracing"
Provides-Extra: quantization
Requires-Dist: torchao<1.0.0,>=0.4.0; extra == "quantization"
Provides-Extra: checkpoint
Requires-Dist: s3fs<2026.0.0,>=2024.1.0; extra == "checkpoint"
Requires-Dist: gcsfs<2026.0.0,>=2024.1.0; extra == "checkpoint"
Requires-Dist: adlfs<2026.0.0,>=2024.1.0; extra == "checkpoint"
Provides-Extra: docs
Requires-Dist: sphinx<8.0.0,>=7.0.0; extra == "docs"
Requires-Dist: furo<2026.0.0,>=2024.1.0; extra == "docs"
Requires-Dist: myst-parser<4.0.0,>=2.0.0; extra == "docs"
Requires-Dist: sphinx-autobuild<2026.0.0,>=2024.1.0; extra == "docs"
Dynamic: license-file

# TorchBridge

TorchBridge **validates that your model produces correct outputs across PyTorch backends** and recommends optimal hardware configurations. It is designed to answer two questions, each with a single command:

1. **"Does my model produce correct outputs across backends?"** — Run it on CUDA and ROCm and get max_diff, cosine_sim, per-layer divergence, pass/fail against empirical tolerances.
2. **"What's the optimal configuration for my model on this hardware?"** — Compatibility matrices that translate `(backend, architecture) → format/kernel/method` with fallback chains.

[![Version](https://img.shields.io/pypi/v/torchbridge-ml?label=version&color=green)](./CHANGELOG.md) [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](./LICENSE) [![Tests](https://img.shields.io/badge/tests-1%2C900%2B%20passed-blue)](./docs/reference/hardware-matrix.md) [![Cloud GPU](https://img.shields.io/badge/platforms-8%20validated%2C%206%20GPU-brightgreen)](./docs/reference/cloud-validation.md) [![AWS A10G](https://img.shields.io/badge/AWS%20A10G-PASS-brightgreen)](./docs/reference/cloud-validation.md) [![GCP T4](https://img.shields.io/badge/GCP%20T4-PASS-brightgreen)](./docs/reference/cloud-validation.md) [![H100 NVL](https://img.shields.io/badge/H100%20NVL-PASS-brightgreen)](./docs/reference/cloud-validation.md) [![MI300X](https://img.shields.io/badge/MI300X-PASS-brightgreen)](./docs/reference/cloud-validation.md) [![TPU v5e](https://img.shields.io/badge/TPU%20v5e-PASS-brightgreen)](./docs/reference/cloud-validation.md) [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org) [![PyTorch](https://img.shields.io/badge/pytorch-2.0%2B-orange)](https://pytorch.org)

## Quick Start

```bash
pip install torchbridge-ml
```

### Cross-Backend Validation (the hero command)

```bash
# Compare CUDA vs ROCm outputs — max_diff, cosine_sim, pass/fail
tb-validate --compare cuda rocm --model ./model.pt

# Per-layer divergence report
tb-validate --compare cuda rocm --model ./model.pt --per-layer

# Multi-step agentic trace — track divergence amplification across 50 steps
tb-validate --compare cuda rocm --model ./model.pt --trace --steps 50 --autoregressive

# Compliance certificate + OTel span export (Langfuse, W&B, etc.)
tb-validate --compare cuda rocm --model ./model.pt --cert --otel

# CI mode — exits non-zero if max_diff exceeds tolerance
tb-validate --compare cuda rocm --model ./model.pt --ci
```
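
To make the multi-step trace more concrete, here is a minimal sketch of how a first-divergence step and an amplification factor could be derived from per-step `max_diff` values. The function name `summarize_trace` and the definition of the amplification factor (final `max_diff` divided by the first non-zero `max_diff`) are assumptions for illustration, not TorchBridge's actual implementation.

```python
def summarize_trace(step_diffs, tolerance):
    """Summarize per-step max_diff values from a multi-step comparison.

    Returns (first_divergence_step, amplification_factor).
    """
    # First step whose max_diff exceeds the tolerance, or None if none does.
    first_divergence = next(
        (i for i, d in enumerate(step_diffs) if d > tolerance), None
    )
    # Assumed definition: final max_diff relative to the first non-zero one.
    baseline = next((d for d in step_diffs if d > 0.0), None)
    amplification = step_diffs[-1] / baseline if baseline else 1.0
    return first_divergence, amplification


# Illustrative values only: max_diff doubling at every step.
diffs = [1e-5, 2e-5, 4e-5, 8e-5, 1.6e-4]
step, factor = summarize_trace(diffs, tolerance=5e-5)
print(step, round(factor, 2))
```

A growing amplification factor suggests that small per-step differences compound as generated tokens are fed back in, which a single forward-pass comparison would not reveal.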

### Hardware Configuration Advisor

```bash
# What's the optimal config for this hardware?
tb-advisor

# Disaggregated prefill/decode fleet config
tb-advisor --mode disaggregated --prefill-backend nvidia --decode-backend amd

# Heterogeneous cluster training config (NVIDIA + AMD mixed)
tb-advisor --mode heterogeneous --nvidia hopper:8 --amd cdna3:4

# Doctor — diagnose your hardware setup
tb-doctor
```

### Python API

```python
from torchbridge.backends import BackendFactory, detect_best_backend

backend_type = detect_best_backend()  # NVIDIA, AMD, Trainium, TPU, or CPU
backend = BackendFactory.create(backend_type)
print(backend.get_device_info())
```
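
Conceptually, picking the "best" backend is a walk down a priority chain. The sketch below illustrates that idea in plain Python. The order in `PRIORITY` and the helper `pick_backend` are hypothetical and do not describe how `detect_best_backend()` is implemented internally.

```python
# Assumed priority order, for illustration only.
PRIORITY = ["nvidia", "amd", "trainium", "tpu", "cpu"]


def pick_backend(available):
    """Return the highest-priority backend present in `available`."""
    for name in PRIORITY:
        if name in available:
            return name
    return "cpu"  # CPU is the last resort when nothing else is detected


choice = pick_backend({"amd", "cpu"})
print(choice)
```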

```python
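# Cross-backend validation
import torch
from torchbridge import UnifiedValidator

# Placeholder model for illustration; substitute your own nn.Module
model = torch.nn.Linear(768, 768)

validator = UnifiedValidator()
results = validator.validate_model(model, input_shape=(1, 768))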
print(f"Validation: {results.passed}/{results.total_tests} tests passed")
```

## What TorchBridge Does

| Capability | What TorchBridge adds |
|------------|----------------------|
| **Cross-backend validation** | `tb-validate --compare cuda rocm` — per-layer divergence, empirical tolerance DB (5 model families × 5 backends × 3 dtypes), CI-ready JSON |
| **Multi-step agentic trace** | `tb-validate --trace --steps 50 --autoregressive` — tracks how max_diff amplifies across N autoregressive steps; reports first-divergence-step and amplification factor |
| **Compliance certificates** | `tb-validate --cert` — SHA256-signed pass/fail certificate for KV handoff physical spec (page size, alignment, layout) |
| **Observability integration** | `tb-validate --otel` — emits validation spans (max_diff, cosine_sim, per-layer child spans) to any OTLP endpoint (Langfuse, W&B Weave, Honeycomb) |
| **Compatibility matrices** | 12 empirically-sourced matrices: `(backend, architecture) → optimal quant format / attention kernel / adapter method / FSDP strategy` |
| **Config advisory** | `tb-advisor` — FSDP, quantization, KV cache, speculative decoding, disaggregated fleet (`--mode disaggregated`), heterogeneous clusters (`--mode heterogeneous`) |
| **Backend detection** | Hardware identification, capability queries, priority chain across NVIDIA/AMD/Trainium/TPU/CPU |
| **Tolerance DB** | 80 empirical entries, 3-level fallback, `--model-family` flag — tolerances sourced from real Qwen3-0.6B runs across 6 GPU platforms |
| **CLI diagnostics** | `tb-doctor`, `tb-validate`, `tb-advisor`, `tb-quantize`, `tb-migrate`, `tb-benchmark` |
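
As one possible reading of the compliance certificate row, the sketch below attaches a SHA-256 digest to a canonical JSON payload so that later tampering can be detected. The field names, the example values, and the helpers `make_certificate` and `verify_certificate` are assumptions, not the format `tb-validate --cert` emits. A plain digest is also not a keyed signature; that would need an HMAC or an asymmetric key.

```python
import hashlib
import json


def _digest(payload):
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_certificate(spec, passed):
    """Bundle a spec and a verdict with a SHA-256 digest of both."""
    payload = {"spec": spec, "passed": passed}
    return {**payload, "sha256": _digest(payload)}


def verify_certificate(cert):
    """Recompute the digest over everything except the stored digest."""
    body = {k: v for k, v in cert.items() if k != "sha256"}
    return _digest(body) == cert["sha256"]


# Illustrative KV handoff spec; the values are made up.
cert = make_certificate({"page_size": 16, "alignment": 128, "layout": "paged"}, passed=True)
print(verify_certificate(cert))
```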

## What TorchBridge Is NOT

- **Not a quantization library**: format selection is dispatched to torchao; TorchBridge adds the compatibility matrix
- **Not a serving runtime**: the inference server is a validation demo, not a production replacement for vLLM or TGI
- **Not a training framework**: the LoRA/QLoRA adapter math is correct and is kept, but use PEFT for full training workflows
- **Not a PyTorch wrapper**: if a method body is `return torch.something(...)` with no selection logic, it doesn't belong here

## Supported Backends

| Backend | Hardware | Precision | Status |
|---------|----------|-----------|--------|
| **NVIDIA** | B200, H100, H200, A100, L4, T4 | FP4, FP8, BF16, FP16, FP32 | Production |
| **AMD** | MI350X, MI325X, MI300X, MI200 | FP8, BF16, FP16, FP32 | Production |
| **Trainium** | Trn1, Trn2, Trn3 (AWS NeuronX) | BF16, FP16, FP32 | Supported |
| **TPU** | v4, v5e, v5p, v6e, v7 | BF16, FP32 | Production |
| **CPU** | x86, ARM (Apple Silicon) | FP32, BF16 | Fallback |

See [Hardware Matrix](./docs/reference/hardware-matrix.md) for full details.
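
The precision column above lends itself to a fallback chain: try the preferred format first, then step down until the backend supports one. The sketch below copies the precision lists from the table. The `choose_precision` helper and the preference order are hypothetical. The table is per backend rather than per device, so a real lookup would also need to key on the architecture.

```python
# Precision support copied from the Supported Backends table.
SUPPORTED = {
    "nvidia": ["FP4", "FP8", "BF16", "FP16", "FP32"],
    "amd": ["FP8", "BF16", "FP16", "FP32"],
    "trainium": ["BF16", "FP16", "FP32"],
    "tpu": ["BF16", "FP32"],
    "cpu": ["FP32", "BF16"],
}


def choose_precision(backend, preferred):
    """Walk the preference chain and return the first supported precision."""
    supported = SUPPORTED[backend]
    for precision in preferred:
        if precision in supported:
            return precision
    return "FP32"  # every backend in the table lists FP32


# Assumed preference order, for illustration only.
chain = ["FP8", "FP16", "BF16", "FP32"]
for backend in SUPPORTED:
    print(backend, choose_precision(backend, chain))
```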

## Cloud Hardware Validation

Cross-backend numerical consistency validated on 8 platforms (6 real GPU/accelerator, 2 CPU-fallback†) using Qwen3-0.6B:

| Platform | Hardware | Max Diff | Cosine Sim | Latency | Status |
|----------|----------|----------|------------|---------|--------|
| AWS | NVIDIA A10G (24GB) | 1.96e-05 | 1.000001 | 41.8 ms | PASS |
| GCP | NVIDIA T4 (16GB) | 2.67e-05 | 1.000001 | 50.8 ms | PASS |
| RunPod | NVIDIA H100 NVL (100GB) | 2.29e-05 | 1.000001 | 18.8 ms | PASS |
| AMD DevCloud | AMD MI300X (192GB) | 4.82e-05 | 1.000001 | 30.0 ms | PASS |
| GCP | TPU v5e | 1.08e-01 | 0.999980 | 47.5 ms | PASS |
| Local | Apple Silicon (MPS) | 4.58e-05 | 1.000002 | 27.8 ms | PASS |
| AWS Trainium† | Trn1.2xlarge (NeuronX) | 0.00e+00 | 1.000001 | 103.3 ms (CPU) | PASS |
| AWS Inferentia2† | inf2.xlarge (NeuronX) | 0.00e+00 | 1.000001 | 321.7 ms (CPU) | PASS |

† **CPU fallback:** NeuronX SDK compilation requires quota-enabled Trn1/Inf2 instances, which were not available in the validation environment. These rows confirm correct CPU-path behavior only; validation on real NeuronX accelerators is pending quota approval.

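All six GPU/accelerator backends produce numerically consistent outputs (cosine similarity > 0.999). Cosine similarity is bounded by 1, so reported values slightly above 1 are floating-point rounding artifacts.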
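
For readers unfamiliar with the two metrics in the table, the sketch below computes a maximum absolute difference and a cosine similarity for two output vectors, then applies a pass/fail tolerance. Plain Python lists stand in for tensors, and the function name `compare_outputs` and the sample values are illustrative assumptions rather than TorchBridge code.

```python
import math


def compare_outputs(reference, candidate, tolerance):
    """Return (max_diff, cosine_sim, passed) for two equal-length vectors."""
    # Largest element-wise absolute difference.
    max_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    cosine_sim = dot / norm if norm else 0.0
    return max_diff, cosine_sim, max_diff <= tolerance


# Made-up outputs from two hypothetical backends.
ref = [0.5, -1.25, 2.0]
out = [0.5, -1.25, 2.00002]
max_diff, cosine_sim, passed = compare_outputs(ref, out, tolerance=1e-4)
print(f"max_diff={max_diff:.2e} cosine_sim={cosine_sim:.6f} passed={passed}")
```

The two metrics complement each other: `max_diff` catches a single badly wrong element, while cosine similarity summarizes whether the overall direction of the output is preserved.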

See [full validation report](./docs/reference/cloud-validation.md) for detailed benchmarks and results.

## Project Structure

```
src/torchbridge/
├── backends/          # Vendor-specific backend implementations
│   ├── nvidia/        #   NVIDIA CUDA backend
│   ├── amd/           #   AMD ROCm backend
│   ├── trainium/      #   AWS Trainium/NeuronX backend
│   └── tpu/           #   Google TPU/XLA backend
├── precision/         # Quantization compatibility matrix + torchao dispatch
├── attention/         # Attention kernel compatibility matrix + dispatcher
├── distributed/       # FSDP/pipeline config advisor
├── adapters/          # LoRA/QLoRA adapter injection (correct math)
├── inference/         # Speculative decoding compatibility matrix
├── checkpoint/        # DCP wrapper with cross-backend metadata
├── testing/           # DivergenceTracer, ToleranceDB, MultiStepTracer, @cross_backend
├── validation/        # UnifiedValidator — model structure, hardware, numerical stability
├── cli/               # Command-line tools (13 entry points)
├── models/            # LLM KV cache advisor
└── utils/             # Utilities
```

## Quality

- **2,223 tests passing** (hardware-gated skips on non-GPU environments)
- **0 ruff violations** (clean linting)
- **0 mypy errors** (full type coverage)
- **Cloud validated** on 8 platforms (6 GPU-validated: A10G, T4, H100 NVL, MI300X, TPU v5e, MPS; 2 CPU-fallback†: Trainium, Inferentia2)

```bash
python3 -m pytest tests/ -q
ruff check src/ tests/
```

## Documentation

| Document | Description |
|----------|-------------|
| [Installation](./docs/getting_started/installation.md) | Setup and requirements |
| [Quick Start](./docs/getting_started/quickstart.md) | First steps with TorchBridge |
| [Troubleshooting](./docs/getting_started/troubleshooting.md) | Common issues and fixes |
| [Backends Overview](./docs/backends/overview.md) | How the backend system works |
| [Backend Selection](./docs/guides/backend-selection.md) | Choosing the right backend |
| [Hardware Setup](./docs/guides/hardware-setup.md) | Driver and toolkit installation |
| [Distributed Training](./docs/guides/distributed-training.md) | Multi-GPU and multi-node |
| [Deployment](./docs/guides/deployment.md) | Serving and containerization |
| [CLI Reference](./docs/guides/cli.md) | Command-line tools |
| [Hardware Matrix](./docs/reference/hardware-matrix.md) | Full hardware support table |
| [Changelog](./CHANGELOG.md) | Version history |

## Community

The empirical tolerance database (`testing/tolerance_db.py`) is only as strong as the hardware it has been measured on. Contributions that add or correct tolerance entries for hardware you have access to — AMD MI350X, Trainium2, TPU v7 Ironwood, new PyTorch versions — directly expand the validation coverage for everyone. See [CONTRIBUTING.md](./CONTRIBUTING.md) for how to add entries and the source-label conventions (`"measured"`, `"derived"`, `"fallback"`).
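
To illustrate how the source labels and a multi-level fallback could fit together, here is a minimal sketch. The key layout, the three levels (exact match, backend-wide entry, dtype default), the helper `lookup_tolerance`, and every number are placeholders chosen for illustration; they are not measured values and do not reflect the actual schema of `testing/tolerance_db.py`.

```python
# Placeholder entries: (model_family, backend, dtype) -> tolerance record.
ENTRIES = {
    ("qwen3", "cuda", "fp32"): {"max_diff": 1e-4, "source": "measured"},
    (None, "cuda", "fp32"): {"max_diff": 5e-4, "source": "derived"},
}
# Placeholder last-resort defaults per dtype.
DEFAULTS = {"fp32": 1e-3, "fp16": 1e-2, "bf16": 1e-1}


def lookup_tolerance(family, backend, dtype):
    """Try an exact match, then a backend-wide entry, then a dtype default."""
    for key in ((family, backend, dtype), (None, backend, dtype)):
        if key in ENTRIES:
            return ENTRIES[key]
    return {"max_diff": DEFAULTS[dtype], "source": "fallback"}


for query in [("qwen3", "cuda", "fp32"), ("llama", "cuda", "fp32"), ("llama", "rocm", "bf16")]:
    print(query, lookup_tolerance(*query)["source"])
```

Under this reading, each contributed `"measured"` entry replaces a looser `"derived"` or `"fallback"` tolerance for that hardware and dtype.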

## License

Licensed under the Apache License, Version 2.0. See [LICENSE](./LICENSE) for the full text.
