Metadata-Version: 2.4
Name: tritoncu
Version: 1.0.0
Summary: A source translator for kernels written against the Triton API to CUDA C++
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

### tritoncu

A source-to-source translator that converts kernels written against the Triton API into CUDA C++. No GPU is required at translation time. Generated `.cu` files are readable, compile with NVCC 12.5.1, and can be embedded in any CUDA project.

Tested on an NVIDIA Tesla T4 cloud GPU with the following configuration:

- **Compute Capability**: 7.5
- **Max Threads per Block**: 1024
- **Max Grid Size**: 2147483647 x 65535 x 65535
- **Shared Memory per Block**: 48 KB
- **Total Global Memory**: 14.56 GB
- **Memory Clock Rate**: 5.00 GHz
- **Memory Bus Width**: 256 bits
- **Warp Size**: 32
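
Because each Triton program maps to one CUDA block (`tl.program_id` becomes `blockIdx.x`), the grid limits above bound how many programs a single launch can cover. A minimal sketch of that check follows. `MAX_GRID` and `grid_fits` are hypothetical names used for illustration and are not part of tritoncu.

```py
# Hypothetical helper, not a tritoncu API: checks a launch grid against
# the T4 grid limits listed above (2147483647 x 65535 x 65535).
MAX_GRID = (2147483647, 65535, 65535)

def grid_fits(grid):
    """Return True if every grid dimension is within the device limit."""
    if not 1 <= len(grid) <= 3:
        return False
    # Unspecified trailing dimensions default to 1, as in a CUDA launch.
    padded = tuple(grid) + (1,) * (3 - len(grid))
    return all(1 <= dim <= limit for dim, limit in zip(padded, MAX_GRID))

print(grid_fits((4096,)))     # True
print(grid_fits((1, 65536)))  # False: the y dimension exceeds 65535
```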

See [test.py](/test.py) for example kernels.

### Tile-to-thread semantic translation

Triton programs operate over tiles: each kernel body sees an entire vector or matrix as if it were a single value. CUDA, by contrast, is usually written with one thread per scalar element. tritoncu bridges the two by allocating each tile as a stack array on the single CUDA thread that executes each block, then expanding every tiled operation into an explicit `for` loop.
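
As a rough illustration of this expansion, the pure-Python sketch below mimics what one thread would do for one Triton program: each tile is a fixed-length array and each tiled operation is a loop over its elements. The real output is CUDA C++, and the function names here (`arange_tile`, `block_offsets`, `add_tiles`) are hypothetical, not tritoncu APIs.

```py
# Pure-Python emulation for illustration only; tritoncu emits CUDA C++.
# All function names here are hypothetical, not tritoncu APIs.
BLOCK_SIZE = 4

def arange_tile(n):
    """Like tl.arange(0, n): a fixed-size array filled by a loop."""
    arr = [0] * n
    for i in range(n):
        arr[i] = i
    return arr

def block_offsets(block_idx):
    """offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE), as a loop."""
    lanes = arange_tile(BLOCK_SIZE)
    offsets = [0] * BLOCK_SIZE
    for i in range(BLOCK_SIZE):
        offsets[i] = block_idx * BLOCK_SIZE + lanes[i]
    return offsets

def add_tiles(x, y):
    """Elementwise tile + tile, expanded into an explicit loop."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = x[i] + y[i]
    return out

print(block_offsets(1))                   # [4, 5, 6, 7]
print(add_tiles([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```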

| Tile-based model                         | CUDA threading model                                                    |
| ---------------------------------------- | ----------------------------------------------------------------------- |
| `tl.program_id(axis=0)`                  | `blockIdx.x`                                                            |
| `tl.arange(0, BLOCK_SIZE)`               | `int32_t arr[BLOCK_SIZE]; for (int i=0; i<BLOCK_SIZE; ++i) arr[i] = i;` |
| Tile variable `x[BLOCK_SIZE]`            | Stack array `float x[BLOCK_SIZE]` per thread                            |
| `tl.load(ptr + offsets, mask=m)`         | Loop: `out[i] = m[i] ? ((float*)ptr)[i] : other`                        |
| `tl.store(ptr + offsets, val, mask=m)`   | Loop: `if (m[i]) ((float*)ptr)[i] = val[i]`                             |
| Elementwise `x + y` (tile, tile)         | Loop: `out[i] = x[i] + y[i]`                                            |
| Elementwise `x * scalar` (tile, scalar)  | Loop: `out[i] = x[i] * scalar`                                          |
| `tl.where(cond, x, y)`                   | Loop: `out[i] = cond[i] ? x[i] : y`                                     |
| `tl.sum(x, axis=0)`                      | Loop: `acc += x[i]`                                                     |
| `tl.max(x, axis=0)`                      | Loop: `acc = x[i] > acc ? x[i] : acc`                                   |
| `tl.dot(a, b, acc)`                      | Triply-nested loop: `c[m][n] += a[m*K+k] * b[k*N+n]`                    |
| `tl.exp(x)`, `tl.sqrt(x)`, etc.          | Loop: `out[i] = expf(x[i])` / `sqrtf(x[i])`                             |
| `tl.atomic_add(ptr, val)`                | `atomicAdd(&ptr[i], val)`                                               |
| `tl.debug_barrier()`                     | `__syncthreads()`                                                       |
| `tl.constexpr` param                     | `const int` kernel parameter                                            |
| `tl.float32` pointer annotation          | `float*` kernel parameter                                               |
| `x[:, None]`, `x[None, :]`               | Scalar index extraction, `None`/slice dims stripped                     |
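
The reduction and `tl.dot` rows of the table can be read as the following reference loops, written in pure Python for clarity. This is a sketch under assumptions: the function names are hypothetical, `tile_max` assumes the accumulator starts at negative infinity, and `tile_dot` assumes flat row-major storage for all three matrices (the table writes the output as `c[m][n]`).

```py
# Hypothetical reference loops; names are illustrative, not tritoncu output.

def tile_sum(x):
    """tl.sum(x, axis=0) as a scalar accumulation loop."""
    acc = 0.0
    for i in range(len(x)):
        acc += x[i]
    return acc

def tile_max(x):
    """tl.max(x, axis=0); assumes the accumulator starts at -inf."""
    acc = float("-inf")
    for i in range(len(x)):
        acc = x[i] if x[i] > acc else acc
    return acc

def tile_dot(a, b, M, N, K, acc=None):
    """tl.dot(a, b, acc) as a triply-nested loop.

    a is M x K, b is K x N, both flat row-major (an assumption of this
    sketch). Returns the M x N result as a flat row-major list.
    """
    c = list(acc) if acc is not None else [0.0] * (M * N)
    for m in range(M):
        for n in range(N):
            for k in range(K):
                c[m * N + n] += a[m * K + k] * b[k * N + n]
    return c

print(tile_sum([1.0, 2.0, 3.0]))    # 6.0
print(tile_max([-3.0, -1.0, -2.0])) # -1.0
print(tile_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2, 2, 2))
# [19.0, 22.0, 43.0, 50.0]
```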

### Installation

```sh
pip install tritoncu
pip install cupy-cuda12x  # optional, for launching
```

### Usage

```py
import tritoncu
import tritoncu.language as tl

@tritoncu.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
  pid = tl.program_id(axis=0)
  offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
  mask = offsets < n_elements
  x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
  y = tl.load(y_ptr + offsets, mask=mask, other=0.0)
  tl.store(out_ptr + offsets, x + y, mask=mask)

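# Translate with BLOCK_SIZE fixed at 1024; no GPU is needed for this step.
handle = add_kernel.compile(meta={"BLOCK_SIZE": 1024})
handle.get_source()
handle.get_kernel_source()
# Write the generated source (and a header, since write_header=True) under ./out.
handle.write_to_disk("./out", "add", write_header=True)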
```
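
To make the semantics of `add_kernel` concrete, here is a pure-Python emulation of what one block computes once the masked loads, the addition, and the masked store are expanded into loops. It is a hedged sketch, not the generated CUDA: `add_kernel_block` is a hypothetical name, and the arrays are plain Python lists standing in for device memory.

```py
# Hypothetical emulation of one block of add_kernel; not tritoncu output.

def add_kernel_block(x, y, out, n_elements, block_size, block_idx):
    pid = block_idx  # tl.program_id(axis=0) maps to blockIdx.x

    # offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * block_size + i for i in range(block_size)]

    # mask = offsets < n_elements
    mask = [off < n_elements for off in offsets]

    # Masked loads: out-of-range lanes take the `other` value (0.0).
    xt = [0.0] * block_size
    yt = [0.0] * block_size
    for i in range(block_size):
        xt[i] = x[offsets[i]] if mask[i] else 0.0
        yt[i] = y[offsets[i]] if mask[i] else 0.0

    # Masked store: only in-range lanes write to memory.
    for i in range(block_size):
        if mask[i]:
            out[offsets[i]] = xt[i] + yt[i]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * 5
for block in range(2):  # two blocks of 4 lanes cover 5 elements
    add_kernel_block(x, y, out, n_elements=5, block_size=4, block_idx=block)
print(out)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

The mask is what keeps the second block from touching indices 5 through 7, which lie past the end of the arrays.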

#### Launching with CuPy

```py
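# Assumed: `grid` is a tuple of program counts per axis; `args` are CuPy device arrays plus scalars.
add_kernel[grid](*args, BLOCK_SIZE=1024)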
```
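
Assuming `grid` follows Triton's convention of a tuple holding the number of programs per axis, a 1D grid can be derived with a ceiling division so that the final, partially filled tile is still covered. The helpers `cdiv` and `launch_grid` below are hypothetical names for this sketch, not tritoncu functions.

```py
# Hypothetical helpers for sizing a 1D launch; not part of tritoncu.

def cdiv(n, d):
    """Ceiling division for positive integers."""
    return (n + d - 1) // d

def launch_grid(n_elements, block_size):
    """One program per tile of `block_size` elements."""
    return (cdiv(n_elements, block_size),)

print(launch_grid(1000, 1024))  # (1,)
print(launch_grid(1025, 1024))  # (2,)
```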

#### Multi-kernel builds

```py
from tritoncu import CudaSourceBuilder

# `handle` is the compiled kernel handle from the Usage example.
builder = CudaSourceBuilder()
for name, src in handle.builder.kernels.items():
  builder.add_kernel(name, src)
builder.write_to_disk("./out", "kernels")
```

### Testing

```sh
python3 test.py
```

### License

Apache License 2.0
