Metadata-Version: 2.4
Name: cuda_symreg
Version: 0.1.1
Summary: Regresión Simbólica con CUDA
Author-email: Ariel Lisímaco Pincay Pérez <arielpincay812@gmail.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20.0

# CUDA Symbolic Regression

🚀 **High-performance GPU-accelerated symbolic regression using NVIDIA CUDA and Genetic Programming**

A Python package that leverages CUDA to evolve mathematical expressions that fit your data. Perfect for discovering underlying equations in scientific datasets, feature engineering, and interpretable machine learning.

[![PyPI version](https://badge.fury.io/py/cuda-symreg.svg)](https://badge.fury.io/py/cuda-symreg)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![CUDA](https://img.shields.io/badge/CUDA-11.0+-green.svg)](https://developer.nvidia.com/cuda-toolkit)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## ✨ Features

- **⚡ GPU Acceleration**: 10-100x faster than CPU-based symbolic regression
- **🎯 Scikit-learn API**: Familiar `.fit()` and `.predict()` interface
- **🔧 Customizable Operators**: Control which mathematical operations to use
- **🌳 Tree-based Expressions**: Interpretable mathematical formulas
- **🎲 Advanced Evolution**: Tournament selection, crossover, mutation, and immigration
- **📊 Production Ready**: Efficient memory management and numerical stability

## 📋 Requirements

- **NVIDIA GPU** with CUDA support (Compute Capability 7.5+)
- **CUDA Toolkit** 11.0 or higher
- **Python** 3.8+
- **NumPy** 1.20.0+

## 🚀 Installation

### Via pip (Recommended)
```bash
pip install cuda-symreg
```

### From source
```bash
git clone https://github.com/arielpincayy/symreg_ga.git
cd cuda-symreg
pip install -e .
```

### Verify Installation
```python
import cuda_symreg
print(cuda_symreg.__version__)
```

## 🎯 Quick Start
```python
from cuda_symreg import CUDASymbolicRegressor
import numpy as np

# Generate sample data: y = x² + 2x + 1
X = np.linspace(-10, 10, 1000).reshape(-1, 1).astype(np.float32)
y = (X[:, 0]**2 + 2*X[:, 0] + 1).astype(np.float32)

# Define operator weights (optional)
# Order: ADD, SUB, MUL, DIV, SIN, COS, ABS, POW, LOG, EXP, NOP
cdf = np.array([0.25, 0.5, 0.75, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 1.0], dtype=np.float32)

# Create and train model
model = CUDASymbolicRegressor()
expression, fitness = model.fit(X, y, cdf, n_gen=100, n_ind=512)

print(f"📐 Discovered equation: {expression}")
print(f"📊 RMSE: {fitness:.6f}")

# Make predictions
y_pred = model.predict(X)
```

## 📖 Documentation

### `CUDASymbolicRegressor`

#### `fit(X, y, cdf, **params)`

Evolves a mathematical expression to fit the data.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `X` | ndarray | required | Input features (n_samples, n_features), float32 |
| `y` | ndarray | required | Target values (n_samples,), float32 |
| `cdf` | ndarray | required | Operator probabilities (11,), float32 |
| `n_gen` | int | 100 | Maximum number of generations |
| `n_ind` | int | 1024 | Population size |
| `tourn` | int | 15 | Tournament size for selection |
| `height` | int | 6 | Maximum tree height |
| `mut` | float | 0.2 | Mutation probability [0-1] |
| `repro` | float | 0.7 | Reproduction rate [0-1] |
| `rand` | float | 0.1 | Immigration rate [0-1] |

**Returns:**
- `expression` (str): Mathematical expression as string
- `fitness` (float): Root Mean Square Error (RMSE)

#### `predict(X)`

Evaluates the evolved expression on new data.

**Parameters:**
- `X` (ndarray): Input features (n_samples, n_features), float32

**Returns:**
- `y_pred` (ndarray): Predicted values (n_samples,)

### Operator CDF (Cumulative Distribution Function)

Control which operators are used during evolution:
```python
# Example 1: Only basic arithmetic (25% each)
cdf = np.array([0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)

# Example 2: Favor multiplication and division
cdf = np.array([0.1, 0.2, 0.6, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 1.0], dtype=np.float32)

# Example 3: Include trigonometric functions
cdf = np.array([0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.95, 0.95, 0.95, 1.0], dtype=np.float32)
```

**Operator indices:**
0. ADD (+), 1. SUB (-), 2. MUL (×), 3. DIV (÷)
4. SIN, 5. COS, 6. ABS (|·|)
7. POW (^), 8. LOG, 9. EXP, 10. NOP (no-op)

### Supported Operators

| Operator | Description | Notes |
|----------|-------------|-------|
| `+` | Addition | Binary |
| `-` | Subtraction | Binary |
| `*` | Multiplication | Binary |
| `/` | Division | Protected (returns ∞ if divisor < 1e-6) |
| `sin()` | Sine | Unary (uses left child) |
| `cos()` | Cosine | Unary (uses left child) |
| `abs()` or `\|·\|` | Absolute value | Unary |
| `pow(a,b)` | Power | Protected (uses \|a\|) |
| `log()` | Natural logarithm | Protected (uses \|a\|) |
| `exp()` | Exponential | Unary |

## 🎨 Examples

### Example 1: Polynomial Regression
```python
import numpy as np
from cuda_symreg import CUDASymbolicRegressor

# Target: f(x) = 3x³ - 2x² + x - 5
X = np.linspace(-2, 2, 500).reshape(-1, 1).astype(np.float32)
y = (3*X[:, 0]**3 - 2*X[:, 0]**2 + X[:, 0] - 5).astype(np.float32)

# Use only arithmetic operators
cdf = np.array([0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)

model = CUDASymbolicRegressor()
expr, rmse = model.fit(X, y, cdf, n_gen=200, height=6)

print(f"Found: {expr}")
print(f"RMSE: {rmse:.6f}")
```

### Example 2: Trigonometric Function
```python
# Target: f(x) = sin(2x) + cos(x)
X = np.linspace(0, 2*np.pi, 1000).reshape(-1, 1).astype(np.float32)
y = (np.sin(2*X[:, 0]) + np.cos(X[:, 0])).astype(np.float32)

# Enable trigonometric operators
cdf = np.array([0.2, 0.4, 0.6, 0.7, 0.85, 0.95, 0.95, 0.95, 0.95, 0.95, 1.0], dtype=np.float32)

model = CUDASymbolicRegressor()
expr, rmse = model.fit(X, y, cdf, n_gen=300, n_ind=1024, height=5)

print(f"Discovered: {expr}")
```

### Example 3: Multivariate Function
```python
# Target: f(x,y) = x² + xy + y²
n = 1000
X = np.random.randn(n, 2).astype(np.float32)
y = (X[:, 0]**2 + X[:, 0]*X[:, 1] + X[:, 1]**2).astype(np.float32)

cdf = np.array([0.3, 0.6, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 1.0], dtype=np.float32)

model = CUDASymbolicRegressor()
expr, rmse = model.fit(X, y, cdf, n_gen=150, height=5)

print(f"Equation: {expr}")

# Test on new data
X_test = np.random.randn(100, 2).astype(np.float32)
y_pred = model.predict(X_test)
```

### Example 4: Pagie Polynomial (Benchmark)
```python
# Classic symbolic regression benchmark
# f(x,y) = 1/(1+x⁻⁴) + 1/(1+y⁻⁴)
n = 4096
x = np.linspace(-5, 5, int(np.sqrt(n)))
x0, x1 = np.meshgrid(x, x)
X = np.stack([x0.flatten(), x1.flatten()], axis=1).astype(np.float32)
y = (X[:, 0]**4 / (X[:, 0]**4 + 1) + X[:, 1]**4 / (X[:, 1]**4 + 1)).astype(np.float32)

cdf = np.array([0.2, 0.4, 0.68, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 1.0], dtype=np.float32)

model = CUDASymbolicRegressor()
expr, rmse = model.fit(X, y, cdf, n_gen=1000, n_ind=2048, height=7)

print(f"Solution: {expr}")
print(f"RMSE: {rmse:.6f}")

# Calculate R² score
y_pred = model.predict(X)
r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - y.mean())**2)
print(f"R² score: {r2:.4f}")
```

## ⚙️ Algorithm Details

### Evolutionary Process

1. **Initialization**: Generate random expression trees on GPU
2. **Fitness Evaluation**: Calculate RMSE for each individual
3. **Selection**: Tournament selection chooses parents
4. **Crossover**: Combine parent trees at random midpoint
5. **Mutation**: Randomly modify operators, constants, or variables
6. **Immigration**: Inject random individuals to maintain diversity
7. **Elitism**: Best solutions are preserved

### Convergence

The algorithm stops when:
- Fitness < 1e-5 (near-perfect fit)
- No improvement for 20 generations (stagnation)
- Maximum generations reached

### Tree Representation

Expressions are stored as complete binary trees:
```
       +
      / \
     *   x1
    / \
   x0  2.5
```
Evaluates to: `(x0 * 2.5) + x1`

## 🎛️ Hyperparameter Tuning

### Population Size (`n_ind`)
- **Small (256-512)**: Fast, simple problems
- **Medium (512-1024)**: General purpose
- **Large (1024-2048)**: Complex problems, better exploration

### Tree Height (`height`)
- **3-4**: Simple linear/quadratic relationships
- **5-6**: Moderate complexity (recommended)
- **7-8**: Complex nested expressions
- **9+**: Very complex, slower convergence

### Mutation Rate (`mut`)
- **Low (0.1-0.2)**: Exploitation-focused
- **Medium (0.2-0.4)**: Balanced (recommended)
- **High (0.4-0.6)**: Exploration-focused

### Immigration Rate (`rand`)
- **Low (0.05-0.1)**: Maintains convergence
- **Medium (0.1-0.15)**: Prevents stagnation (recommended)
- **High (0.15-0.25)**: Maximum diversity

## 🔧 Troubleshooting

### "Library not found" Error
```python
# Check if library exists
import os
from cuda_symreg import core
print(os.path.exists(core.__file__.replace('core.py', 'lib/libgasymreg.so')))
```

### CUDA Out of Memory
```python
# Reduce population size and tree height
model.fit(X, y, cdf, n_ind=256, height=4)
```

### Poor Convergence
```python
# Increase diversity
model.fit(X, y, cdf, n_gen=500, n_ind=1024, rand=0.15, mut=0.3)
```

### Numerical Instability
```python
# Disable problematic operators (POW, EXP, LOG)
cdf = np.array([0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)
```

## 📊 Performance Benchmarks

| Dataset | Samples | Variables | GPU (RTX 3090) | CPU (gplearn) | Speedup |
|---------|---------|-----------|----------------|---------------|---------|
| Simple poly | 1,000 | 1 | 0.5s | 15s | 30x |
| Pagie-1 | 4,096 | 2 | 3.2s | 280s | 87x |
| Complex | 10,000 | 3 | 8.1s | 650s | 80x |

*Note: CPU baseline using gplearn with equivalent parameters*

## 🤝 Comparison with Other Libraries

| Feature | cuda-symreg | PySR | gplearn |
|---------|-------------|------|---------|
| GPU Support | ✅ CUDA | ✅ Julia | ❌ CPU only |
| Speed | ⚡ Very Fast | ⚡ Fast | 🐢 Slow |
| API Style | scikit-learn | Custom | scikit-learn |
| Dependencies | NumPy | Julia | sklearn |
| Tree Evaluation | Python | Julia | Python |

## 🗺️ Roadmap

- [ ] Multi-GPU support
- [ ] Batch prediction on GPU
- [ ] Model serialization (save/load)
- [ ] Automatic operator selection
- [ ] Parsimony pressure (simpler expressions)
- [ ] Expression simplification
- [ ] Cross-validation support
- [ ] Parallel island models

## 🐛 Known Issues

1. Small memory leak (~few KB per `.fit()` call)
2. Expression bloat without parsimony pressure
3. No automatic constant optimization (planned)

## 📚 Citation

If you use this package in your research, please cite:
```bibtex
@software{cuda_symreg_2025,
  title={CUDA Symbolic Regression: GPU-Accelerated Genetic Programming},
  author={Pincay Pérez, Ariel Lisímaco},
  year={2025},
  url={https://github.com/arielpincayy/symreg_ga},
  version={0.1.0}
}
```

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- NVIDIA CUDA Toolkit and CUB library
- Inspired by PySR, gplearn, and EQL
- Built with ❤️ for the scientific computing community

## 📧 Contact & Support

- **Issues**: [GitHub Issues](https://github.com/arielpincayy/symreg_ga/issues)
- **Email**: arielpincay812@gmail.com
- **Discussions**: [GitHub Discussions](https://github.com/arielpincayy/symreg_ga/discussions)

## 🌟 Star History

If you find this project useful, please consider giving it a star! ⭐

---

**Made with 🚀 by [Ariel Lisímaco Pincay Pérez](https://github.com/arielpincayy)**
