Metadata-Version: 2.4
Name: azure-gpu-functions
Version: 1.0.1
Summary: GPU-accelerated machine learning training for Azure Functions with distributed computing, monitoring, and cost optimization
Home-page: https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc
Author: Amadeus GPU Training Team
Author-email: Amadeus GPU Training Team <gpu-training@amadeus.com>
Maintainer-email: Amadeus GPU Training Team <gpu-training@amadeus.com>
License: MIT License
        
        Copyright (c) 2025 Amadeus IT Group
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc
Project-URL: Documentation, https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc/blob/main/FRAMEWORK_GUIDE.md
Project-URL: Repository, https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc
Project-URL: Bug Reports, https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc/issues
Project-URL: Changelog, https://github.com/pxcallen_amadeus/azureGPUtrainingappfunc/blob/main/CHANGELOG.md
Keywords: azure,gpu,functions,machine-learning,distributed-training,ray,monitoring,cost-optimization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: requests>=2.31.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: GPUtil>=1.4.0
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.18.0; extra == "azure"
Requires-Dist: azure-identity>=1.15.0; extra == "azure"
Requires-Dist: azure-mgmt-costmanagement>=4.0.1; extra == "azure"
Requires-Dist: azure-mgmt-monitor>=6.0.2; extra == "azure"
Provides-Extra: ml
Requires-Dist: torch>=2.1.0; extra == "ml"
Requires-Dist: torchvision>=0.16.0; extra == "ml"
Requires-Dist: transformers>=4.35.0; extra == "ml"
Requires-Dist: datasets>=2.14.0; extra == "ml"
Requires-Dist: accelerate>=0.24.0; extra == "ml"
Requires-Dist: peft>=0.6.0; extra == "ml"
Requires-Dist: tokenizers>=0.15.0; extra == "ml"
Requires-Dist: scikit-learn>=1.3.0; extra == "ml"
Requires-Dist: pandas>=2.1.0; extra == "ml"
Requires-Dist: matplotlib>=3.8.0; extra == "ml"
Provides-Extra: distributed
Requires-Dist: ray[default]>=2.9.0; extra == "distributed"
Requires-Dist: ray[serve]>=2.9.0; extra == "distributed"
Requires-Dist: ray[tune]>=2.9.0; extra == "distributed"
Provides-Extra: monitoring
Requires-Dist: websockets>=12.0; extra == "monitoring"
Requires-Dist: fastapi>=0.104.1; extra == "monitoring"
Requires-Dist: uvicorn>=0.24.0; extra == "monitoring"
Requires-Dist: prometheus-client>=0.19.0; extra == "monitoring"
Requires-Dist: opentelemetry-distro>=0.43b0; extra == "monitoring"
Requires-Dist: opentelemetry-instrumentation>=0.43b0; extra == "monitoring"
Provides-Extra: all
Requires-Dist: azure-storage-blob>=12.18.0; extra == "all"
Requires-Dist: azure-identity>=1.15.0; extra == "all"
Requires-Dist: azure-mgmt-costmanagement>=4.0.1; extra == "all"
Requires-Dist: azure-mgmt-monitor>=6.0.2; extra == "all"
Requires-Dist: torch>=2.1.0; extra == "all"
Requires-Dist: torchvision>=0.16.0; extra == "all"
Requires-Dist: transformers>=4.35.0; extra == "all"
Requires-Dist: datasets>=2.14.0; extra == "all"
Requires-Dist: accelerate>=0.24.0; extra == "all"
Requires-Dist: peft>=0.6.0; extra == "all"
Requires-Dist: tokenizers>=0.15.0; extra == "all"
Requires-Dist: scikit-learn>=1.3.0; extra == "all"
Requires-Dist: pandas>=2.1.0; extra == "all"
Requires-Dist: matplotlib>=3.8.0; extra == "all"
Requires-Dist: ray[default]>=2.9.0; extra == "all"
Requires-Dist: ray[serve]>=2.9.0; extra == "all"
Requires-Dist: ray[tune]>=2.9.0; extra == "all"
Requires-Dist: websockets>=12.0; extra == "all"
Requires-Dist: fastapi>=0.104.1; extra == "all"
Requires-Dist: uvicorn>=0.24.0; extra == "all"
Requires-Dist: prometheus-client>=0.19.0; extra == "all"
Requires-Dist: opentelemetry-distro>=0.43b0; extra == "all"
Requires-Dist: opentelemetry-instrumentation>=0.43b0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Azure Functions GPU Training & Testing System

A comprehensive Azure Functions application for GPU-accelerated computing with model caching, performance monitoring, cost analysis, and an interactive GPU testing dashboard.

## 📚 Documentation

### For Amadeus Teams, Apprentices & Interns
👉 **[Complete Framework Guide](./FRAMEWORK_GUIDE.md)** - Comprehensive learning resource covering:
- Step-by-step tutorials for beginners
- Advanced GPU computing concepts
- Infrastructure as Code patterns
- Performance optimization techniques
- Production deployment strategies
- Troubleshooting and best practices

### Quick Reference
- [Getting Started](#-quick-start) - Basic setup and deployment
- [API Endpoints](#-api-endpoints) - Function specifications
- [Testing Guide](#-testing--validation) - Validation procedures
- [Troubleshooting](#-troubleshooting) - Common issues and solutions

---

## 🚀 Key Features

### GPU Training Capabilities
- **Multi-stage Docker builds** with dependency and model caching layers
- **Azure Blob Storage integration** for shared model artifacts
- **Background model preloading** during container startup
- **Ray distributed computing** with cached initialization
- **4 Training Types**: Small, Medium, Large, XL configurations
- **Framework Support**: PyTorch and TensorFlow

### GPU Testing Capabilities
- **Matrix Multiplication Tests**: Performance benchmarking with FLOPS measurement
- **Tensor Operations**: Various PyTorch tensor operations (add, multiply, sin, exp)
- **Memory Bandwidth Tests**: GPU memory transfer performance measurement
- **Compute Intensity Tests**: Heavy computational workloads with multiple operations
- **Real-time Dashboard**: Web-based monitoring interface with auto-refresh
- **Performance Metrics**: FLOPS, bandwidth, efficiency percentages, timing
- **Cost Analysis**: GPU usage cost calculation ($1.60/hr T4, $4.80/hr A100)

### Monitoring & Analytics
- **Infrastructure as Code** with Bicep templates
- **Performance monitoring** and benchmarking tools
- **Historical Tracking**: Persistent storage of test results in Azure Blob Storage
- **Resource cleanup** and health monitoring scripts

## 📁 Project Structure

```
├── Dockerfile                    # Multi-stage build with caching
├── Dockerfile.simple            # GPU testing container
├── startup.sh                   # Background preloading script
├── function_app.py             # Main Azure Functions app
├── test_gpu_standalone.py      # Local GPU testing script
├── run_gpu_tests.sh           # Interactive test runner
├── launch_dashboard.sh        # Dashboard launcher
├── host.json                   # Function configuration
├── requirements.txt            # Python dependencies
├── scripts/
│   ├── cache_models.py         # Model caching script
│   ├── performance_monitor.py  # Performance monitoring
│   ├── quick_training_test.sh # Training test runner
│   ├── training_performance_test.sh # Full performance test
│   ├── deploy_simple.sh       # GPU deployment script
│   ├── deploy_cpu_test.sh     # CPU test deployment
│   └── health_check.sh        # Resource monitoring
├── model_storage.py            # Azure Blob Storage integration
├── infra/
│   ├── main.bicep              # Main infrastructure template
│   └── modules/
│       └── models-container.bicep  # Models storage template
├── deploy_with_caching.sh      # Training deployment script
└── README.md                   # This file
```

## 🏗️ Architecture

### Training System
1. **Build-time Caching**: Dependencies and common models cached in Docker layers
2. **Runtime Caching**: Models cached in Azure Blob Storage for sharing across instances
3. **Startup Optimization**: Background preloading of models and Ray initialization
4. **Local Cache**: Function instances maintain local cache for frequently used models
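
The lookup order described above (local cache first, then shared Blob Storage, then a fresh download) can be sketched as follows. This is a minimal illustration, not the contents of `model_storage.py`: `TwoTierModelCache`, `InMemoryRemoteStore`, and the `download` callable are hypothetical names, and the in-memory store only stands in for an Azure Blob Storage container.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


class InMemoryRemoteStore:
    """Hypothetical stand-in for a shared blob container."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._blobs.get(name)

    def put(self, name: str, data: bytes) -> None:
        self._blobs[name] = data


class TwoTierModelCache:
    """Check the local directory first, then the shared store, then download."""

    def __init__(self, local_dir: Path, remote: InMemoryRemoteStore,
                 download: Callable[[str], bytes]) -> None:
        self.local_dir = local_dir
        self.remote = remote
        self.download = download
        self.local_dir.mkdir(parents=True, exist_ok=True)

    def _local_path(self, name: str) -> Path:
        # Model IDs such as "microsoft/DialoGPT-small" contain slashes.
        return self.local_dir / name.replace("/", "--")

    def load(self, name: str) -> Tuple[bytes, str]:
        """Return (data, source); source is 'local', 'remote' or 'download'."""
        path = self._local_path(name)
        if path.exists():
            return path.read_bytes(), "local"
        data = self.remote.get(name)
        source = "remote"
        if data is None:
            # Not cached anywhere: fetch once and share it with other instances.
            data = self.download(name)
            self.remote.put(name, data)
            source = "download"
        path.write_bytes(data)
        return data, source


cache = TwoTierModelCache(
    Path(tempfile.mkdtemp()),
    InMemoryRemoteStore(),
    download=lambda name: f"weights for {name}".encode(),
)
first = cache.load("distilbert-base-uncased")[1]
second = cache.load("distilbert-base-uncased")[1]
print(first, second)
```

A second instance that shares the same remote store but starts with an empty local directory would report `remote` on its first load, which is the cross-instance sharing the runtime cache is meant to provide.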

### Testing System
1. **GPU Test Functions**: Standalone PyTorch operations for performance measurement
2. **Azure Storage Logging**: Persistent test result storage and retrieval
3. **Web Dashboard**: HTML interface for real-time monitoring and historical data
4. **Cost Calculation**: Automatic GPU usage cost tracking

### Components
- **Dockerfile**: Multi-stage build with dependency caching and model preloading
- **Dockerfile.simple**: GPU-enabled container for testing
- **startup.sh**: Bash script for background initialization of models and services
- **model_storage.py**: Python class for managing model caching between local and cloud storage
- **test_gpu_standalone.py**: Local GPU testing without Azure dependencies

## 🚀 Quick Start

### Prerequisites
- Azure CLI (`az`)
- Docker with NVIDIA support
- Python 3.9+
- Azure subscription with GPU quota approved

### 1. GPU Testing (Local)

Run the standalone GPU test suite to validate functionality:

```bash
python3 test_gpu_standalone.py
```

This runs every GPU test and reports performance metrics for the hardware that is available.

### 2. Interactive Testing Menu

Use the bash script for easy testing:

```bash
./run_gpu_tests.sh
```

### 3. Launch Dashboard (Local)

Start the Azure Functions runtime and open the dashboard:

```bash
./launch_dashboard.sh
```

### 4. Deploy Training System

Deploy the full training system with caching:

```bash
chmod +x deploy_with_caching.sh
./deploy_with_caching.sh
```

### 5. Deploy GPU Testing

Deploy to Azure Container Apps with GPU support:

```bash
./scripts/deploy_simple.sh
```

## 📊 API Endpoints

### GPU Testing
- `POST /api/gpu-test` - Run GPU performance tests
  - Query params: `test_type` (matrix_multiplication, tensor_operations, memory_bandwidth, compute_intensity)
  - Query params: `matrix_size`, `iterations`
- `GET /api/dashboard` - Interactive monitoring dashboard
- `GET /api/test-results` - Historical test results (JSON)
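
As an illustration of how the query parameters above fit together, the sketch below builds (without sending) a `POST /api/gpu-test` request using only the standard library. The host is the placeholder used elsewhere in this README, `build_gpu_test_request` is a hypothetical helper, and the response format is not specified here, so none is assumed.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder host; replace with the URL of your deployed app.
BASE_URL = "https://your-app.azurecontainerapps.io"


def build_gpu_test_request(test_type: str, matrix_size: int, iterations: int,
                           base_url: str = BASE_URL) -> Request:
    """Build, but do not send, a POST request for /api/gpu-test."""
    query = urlencode({
        "test_type": test_type,
        "matrix_size": matrix_size,
        "iterations": iterations,
    })
    return Request(f"{base_url}/api/gpu-test?{query}", method="POST")


req = build_gpu_test_request("matrix_multiplication", matrix_size=1024, iterations=5)
print(req.get_method(), req.full_url)
# Once the app is deployed, send it with urllib.request.urlopen(req).
```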

### Training (Legacy)
- `POST /api/train` - Start GPU training jobs
- `GET /api/status/{job_id}` - Check training status
- `GET /api/cost-analysis` - Training cost analysis

### Model Management
- `POST /api/model-management` - Load/cache models
- `GET /api/ray-monitoring` - Ray cluster status

## 🎯 Training Types & GPU Testing

### Training Configurations

#### Small Training
- **Model**: `distilbert-base-uncased`
- **Duration**: ~1 minute
- **Use Case**: Quick testing, validation

#### Medium Training
- **Model**: `microsoft/DialoGPT-small`
- **Duration**: ~2 minutes
- **Features**: LoRA fine-tuning

#### Large Training
- **Model**: `microsoft/DialoGPT-medium`
- **Duration**: ~4 minutes
- **Features**: Distributed training with Ray

#### XL Training
- **Model**: `mistralai/Mistral-7B-v0.1`
- **Duration**: ~5 minutes
- **Features**: Full distributed training

### GPU Test Types

#### Matrix Multiplication
- **Operations**: Large matrix multiplication with timing
- **Metrics**: TFLOPS, efficiency percentage, memory usage
- **Use Case**: General GPU compute performance
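
A minimal sketch of how a TFLOPS figure can be derived from a timed matrix multiplication, assuming the usual count of about 2·n³ floating-point operations for an n×n by n×n product. The helper names are hypothetical and this is not the code in `test_gpu_standalone.py`; the workload is passed in as a callable so the same timing helper could wrap a `torch.matmul` call, and the pure-Python stand-in only keeps the example dependency-free.

```python
import time
from typing import Callable, List


def average_seconds(workload: Callable[[], object], iterations: int) -> float:
    """Run the workload once to warm up, then return mean wall-clock time per run."""
    workload()
    start = time.perf_counter()
    for _ in range(iterations):
        workload()
    return (time.perf_counter() - start) / iterations


def matmul_tflops(n: int, seconds: float) -> float:
    """Throughput of one n x n by n x n product, counted as 2 * n**3 operations."""
    return (2 * n ** 3) / seconds / 1e12


def tiny_matmul(n: int = 32) -> List[List[float]]:
    """Pure-Python stand-in workload so the sketch runs without a GPU."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


elapsed = average_seconds(tiny_matmul, iterations=3)
print(f"{matmul_tflops(32, elapsed):.6f} TFLOPS")
```

When timing real GPU work with PyTorch, call `torch.cuda.synchronize()` before reading the timer, because CUDA kernels are launched asynchronously and the clock would otherwise stop before the computation finishes.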

#### Tensor Operations
- **Operations**: Add, multiply, sin, exp, sum operations
- **Metrics**: Operations per second, memory allocation
- **Use Case**: ML workload simulation

#### Memory Bandwidth
- **Operations**: GPU memory copy and transfer tests
- **Metrics**: GB/s bandwidth, data transfer amounts
- **Use Case**: Memory subsystem performance
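
The GB/s metric is bytes moved divided by elapsed time. The sketch below shows that arithmetic with a host-memory copy standing in for a device copy; the function names are hypothetical, and counting each copied byte twice (one read plus one write) is a convention assumed here, not necessarily the one the test suite uses.

```python
import time


def bandwidth_gb_per_s(bytes_moved: int, seconds: float) -> float:
    """Convert a byte count and an elapsed time into GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def time_copy(size_bytes: int, iterations: int = 5) -> float:
    """Return the best wall-clock time for copying a buffer of size_bytes."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        dst = bytes(src)  # host-memory copy; stand-in for a GPU memory copy
        best = min(best, time.perf_counter() - start)
        del dst
    return best


size = 16 * 1024 * 1024  # 16 MiB buffer
seconds = time_copy(size)
# A copy reads and writes every byte, so the traffic is counted twice.
print(f"{bandwidth_gb_per_s(2 * size, seconds):.1f} GB/s (host memory)")
```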

#### Compute Intensity
- **Operations**: Multiple matrix multiplications with trigonometry
- **Metrics**: Compute throughput, operation efficiency
- **Use Case**: Heavy computational workloads

## 📈 Performance Benchmarks

### Training Performance
- **Cold Start Time**: Reduced from 30-60 seconds to 5-15 seconds
- **Model Loading**: Cache hits for frequently used models
- **Training Throughput**: 1-20 samples/second depending on model size

### GPU Testing Performance (T4)
- **Matrix Multiplication**: ~65 TFLOPS theoretical peak (FP16 on Tensor Cores; the FP32 peak is about 8.1 TFLOPS)
- **Memory Bandwidth**: ~320 GB/s theoretical peak
- **Efficiency**: Measured throughput as a percentage of the theoretical peak
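
Efficiency here is the ratio of a measured figure to the corresponding theoretical peak. A minimal sketch, using the T4 peaks listed above; the measured values in the example are made up for illustration and are not benchmark results.

```python
T4_PEAK_TFLOPS = 65.0          # theoretical peak listed above
T4_PEAK_BANDWIDTH_GBS = 320.0  # theoretical peak listed above


def efficiency_percent(measured: float, theoretical_peak: float) -> float:
    """Return measured performance as a percentage of the theoretical peak."""
    if theoretical_peak <= 0:
        raise ValueError("theoretical_peak must be positive")
    return 100.0 * measured / theoretical_peak


# Hypothetical measurements, for illustration only.
print(efficiency_percent(13.0, T4_PEAK_TFLOPS))          # 20.0
print(efficiency_percent(240.0, T4_PEAK_BANDWIDTH_GBS))  # 75.0
```

The result is only meaningful when the measurement and the peak refer to the same precision, so an FP32 run should be compared against an FP32 peak.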

### Cost Analysis
- **T4 GPU**: $1.60/hour
- **A100 GPU**: $4.80/hour
- Per-test and per-training costs calculated automatically
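
The per-run cost follows directly from the hourly rate and the run duration: cost = rate × seconds / 3600. A minimal sketch using the rates above; `run_cost_usd` is a hypothetical helper, and pairing a duration with a particular GPU type is purely illustrative.

```python
GPU_HOURLY_RATES_USD = {"T4": 1.60, "A100": 4.80}  # rates listed above


def run_cost_usd(gpu_type: str, duration_seconds: float) -> float:
    """Cost of one run, billed pro rata from the hourly rate."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    rate = GPU_HOURLY_RATES_USD[gpu_type]  # KeyError for unknown GPU types
    return rate * duration_seconds / 3600.0


# A ~1 minute run on a T4 and a ~5 minute run on an A100.
print(f"${run_cost_usd('T4', 60):.4f}")    # $0.0267
print(f"${run_cost_usd('A100', 300):.4f}") # $0.4000
```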

## 🔧 Configuration

### Environment Variables

```bash
# Azure Storage
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
MODEL_STORAGE_CONTAINER=models

# GPU Settings
GPU_TYPE=T4  # or A100
GPU_COST_PER_HOUR=1.60

# Model Caching
MODEL_CACHE_ENABLED=true
TRANSFORMERS_CACHE=/home/site/wwwroot/.cache/huggingface

# Function Configuration
FUNCTION_URI=https://your-function-app.azurewebsites.net
```
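
One way to read these variables in Python is sketched below. The `Settings` class and `load_settings` function are hypothetical, and the fallback defaults simply mirror the example values shown above; the application itself may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    storage_connection_string: str
    model_container: str
    gpu_type: str
    gpu_cost_per_hour: float
    model_cache_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the example values."""
    return Settings(
        storage_connection_string=env.get("AZURE_STORAGE_CONNECTION_STRING", ""),
        model_container=env.get("MODEL_STORAGE_CONTAINER", "models"),
        gpu_type=env.get("GPU_TYPE", "T4"),
        gpu_cost_per_hour=float(env.get("GPU_COST_PER_HOUR", "1.60")),
        # Treat only the literal "true" (any casing) as enabled.
        model_cache_enabled=env.get("MODEL_CACHE_ENABLED", "true").strip().lower() == "true",
    )


settings = load_settings({"GPU_TYPE": "A100", "GPU_COST_PER_HOUR": "4.80"})
print(settings.gpu_type, settings.gpu_cost_per_hour, settings.model_cache_enabled)
```

Passing the environment as a mapping keeps the loader easy to test; calling `load_settings()` with no argument reads from the real process environment.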

## 🏛️ Infrastructure

### Bicep Templates
- `infra/main.bicep`: Main deployment template
- `infra/modules/models-container.bicep`: Models storage container

### Resources Created
- Azure Function App with GPU support
- Azure Container Registry
- Azure Storage Account with models/test results
- Application Insights for monitoring

## 🧪 Testing & Validation

### Local GPU Testing
```bash
# Run all GPU tests
python3 test_gpu_standalone.py

# Run specific test via API
python3 test_gpu_jobs.py --test matrix_multiplication --size 1024 --iterations 5

# Launch interactive dashboard
./launch_dashboard.sh
```

### Training Testing
```bash
# Test all training types
./scripts/quick_training_test.sh all

# Run comprehensive performance test
./scripts/training_performance_test.sh
```

### Azure Testing
```bash
# Test GPU endpoint
curl -X POST "https://your-app.azurecontainerapps.io/api/gpu-test?test_type=matrix_multiplication"

# View dashboard
open "https://your-app.azurecontainerapps.io/api/dashboard"
```

## 📊 Dashboard Features

- **Real-time Monitoring**: Live test execution tracking
- **Performance Charts**: Visual representation of metrics
- **Cost Tracking**: Running cost calculation for GPU usage
- **Historical Data**: Previous test results and comparisons
- **Training Status**: Active training job monitoring
- **Resource Usage**: GPU utilization and memory tracking

## 🔍 Monitoring & Cleanup

### Health Check Script
```bash
# Basic health check
./scripts/health_check.sh

# Automatic cleanup
./scripts/health_check.sh --auto-cleanup
```

### Resource Monitoring
```bash
# Check training resources
python scripts/check_training_resources.py

# Cleanup resources
python scripts/cleanup_resources.py --wet-run
```

## 🐳 Docker Build Options

### Training System Build
```bash
docker build -t gpu-training:latest .
```

### GPU Testing Build
```bash
docker build -f Dockerfile.simple -t gpu-testing:latest .
```

Both use multi-stage builds with dependency caching and CUDA optimization.

## 🔧 Troubleshooting

### Common Issues

1. **GPU Not Available**: Check that CUDA drivers are installed and that your Azure GPU quota is approved
2. **Slow Startup**: Verify that models have been preloaded into Blob Storage
3. **Cache Misses**: Check Azure Storage permissions
4. **Build Failures**: Ensure Docker has sufficient memory
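
For the first issue, a quick local check can narrow down where the problem lies. The sketch below is a hypothetical helper, not part of this repository; it only inspects the local machine or container and cannot tell you anything about Azure quota.

```python
import shutil


def gpu_status() -> str:
    """Return a short diagnosis string for the local machine."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi-missing"   # NVIDIA tooling is not on PATH
    try:
        import torch
    except ImportError:
        return "torch-not-installed"  # install the "ml" extra first
    if not torch.cuda.is_available():
        return "cuda-unavailable"     # driver/runtime mismatch or no visible GPU
    return f"ok: {torch.cuda.get_device_name(0)}"


status = gpu_status()
print(status)
```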

### Debug Commands
```bash
# Function logs
az functionapp logstream --name your-function-app --resource-group your-rg

# Container logs
az containerapp logs show --name your-container-app --resource-group your-rg

# Test model loading
curl -X POST https://your-app.azurewebsites.net/api/model-management \
  -H "Content-Type: application/json" \
  -d '{"action": "load_model", "model": "gpt2"}'
```

## 📚 Best Practices

### Model Management
- Preload frequently used models during deployment
- Use model versioning for updates
- Monitor cache hit rates

### Cost Optimization
- Use appropriate GPU types (T4 for dev, A100 for prod)
- Implement caching to reduce startup costs
- Monitor utilization to right-size instances

### Performance Tuning
- Adjust pre-warmed instances based on traffic
- Use Azure Front Door for global distribution
- Implement request batching for similar workloads

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Add comprehensive tests
4. Submit a pull request

## 📄 License

MIT License - see LICENSE file for details.

## 🆘 Support

- Check the [Troubleshooting](#-troubleshooting) section
- Review the Azure Functions documentation
- Open an issue in the repository

---

**Note**: GPU functionality requires an Azure subscription with approved GPU quota. Contact Azure support to enable GPU workloads.

