Metadata-Version: 2.4
Name: utdg-env
Version: 0.1.6
Summary: Gymnasium environment for UTDG (Untitled Tower Defense Game)
Author-email: Chris <christiancontrerascampana@gmail.com>
Project-URL: Homepage, https://huggingface.co/spaces/chrisjcc/utdg-train
Project-URL: Bug Tracker, https://github.com/chrisjcc/utdg/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: gymnasium>=0.29.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: hydra-core>=1.3.0
Requires-Dist: stable-baselines3>=2.2.0
Requires-Dist: sb3-contrib>=2.2.0
Requires-Dist: wandb>=0.16.0
Requires-Dist: huggingface_hub>=0.20.0
Requires-Dist: websockets>=12.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"

# UTDG Gymnasium Environment

A Gymnasium-compatible environment for training reinforcement learning agents on the Untitled Tower Defense Game (UTDG).

## Overview

This package provides a WebSocket-based interface between the Godot game engine and Python, allowing you to train RL agents using popular frameworks like Stable-Baselines3 or Ray RLlib.

## Features

- **Gymnasium API**: Standard `reset()`, `step()`, and `close()` methods
- **WebSocket Communication**: Real-time bidirectional communication with Godot
- **Flexible Action Space**: Discrete actions for tower placement
- **Rich Observations**: Game state including gold, health, enemy/tower positions
- **Easy Integration**: Compatible with Stable-Baselines3, RLlib, and other RL libraries

## Installation

### From source:

```bash
cd python_env
pip install -e .
```

### With development dependencies:

```bash
pip install -e ".[dev]"
```

### From requirements file:

```bash
pip install -r requirements.txt
```

## Configuration System

UTDG uses [Hydra](https://hydra.cc/) for configuration management, providing a flexible and composable way to manage experiment parameters.

### Configuration Structure

```
python_env/configs/
├── default.yaml              # Main composition file (for testing)
├── default_training.yaml     # Training composition file (includes PPO config)
├── env/
│   └── default.yaml         # Environment settings
├── game/
│   ├── default.yaml         # Game configuration (consolidated)
│   ├── enemy_spawn.yaml     # Alternative enemy spawn config
│   ├── rewards.yaml         # Alternative reward config
│   └── difficulty.yaml      # Alternative difficulty config
└── training/
    ├── common.yaml          # Common training settings
    └── ppo.yaml             # PPO hyperparameters
```

**Note:** `simple_test.py` uses `default.yaml`, while `train_ppo.py` uses `default_training.yaml`, which includes the PPO hyperparameters by default.

### Using Hydra CLI

Run scripts with default configuration:
```bash
# From python_env/ directory
python examples/simple_test.py

# From root UTDG directory
python python_env/examples/simple_test.py
```

Override specific parameters:
```bash
# Change game parameters
python examples/simple_test.py game.base_health=200 experiment.seed=123

# Configure runtime settings (Godot launch)
python examples/simple_test.py \
    runtime.auto_launch=true \
    runtime.godot_path="builds/UTDG-macOS.app/Contents/MacOS/UntitledTowerDefenseGame" \
    runtime.episodes=5

# Training with custom hyperparameters
python examples/train_ppo.py \
    ppo.learning_rate=0.0001 \
    training.training.total_timesteps=500000
```

Multi-run for parameter sweeps:
```bash
python examples/simple_test.py -m \
    game.difficulty_config.enemy_health_multiplier=1.0,1.5,2.0
```

### Configuration Parameters

**Experiment metadata** (`experiment.*`):
- `name`: Experiment name
- `seed`: Random seed for reproducibility
- `log_dir`: Directory for logs

**WebSocket connection** (`websocket.*`):
- `host`: WebSocket host (default: localhost)
- `port`: WebSocket port (default: 9090)
- `timeout`: Connection timeout in seconds

**Runtime settings** (`runtime.*`):
- `episodes`: Number of episodes to run
- `godot_path`: Path to Godot executable
- `auto_launch`: Auto-launch Godot if not running
- `headless`: Run Godot in headless mode

**Game configuration** (`game.*`):
- `base_health`: Starting base health
- `starting_gold`: Starting gold amount
- `enemy_spawn_config.*`: Enemy spawning parameters
- `reward_config.*`: Reward shaping settings
- `difficulty_config.*`: Difficulty multipliers

**Environment** (`env.*`):
- `observation_space.*`: Observation space configuration
- `action_space.*`: Action space settings
- `episode.*`: Episode settings

**Training** (`training.*` and `ppo.*`):
- See `configs/training/ppo.yaml` for PPO hyperparameters
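
To make the dotted override syntax concrete, the sketch below shows how a key such as `websocket.port=9876` addresses a nested setting. It is a plain-Python illustration, not Hydra's implementation: `DEFAULTS` and `apply_override` are hypothetical names, only the `localhost` and `9090` defaults come from the list above, and the `runtime` values are placeholders.

```python
from copy import deepcopy

# Only websocket.host and websocket.port reflect documented defaults;
# the runtime values are placeholders for illustration.
DEFAULTS = {
    "websocket": {"host": "localhost", "port": 9090},
    "runtime": {"episodes": 1, "auto_launch": False},
}


def apply_override(config, override):
    """Return a copy of config with one `a.b.c=value` override applied."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    result = deepcopy(config)
    node = result
    for name in parents:
        node = node.setdefault(name, {})
    # Value parsing is deliberately simple: booleans, integers, else strings.
    if raw_value.lower() in ("true", "false"):
        value = raw_value.lower() == "true"
    else:
        try:
            value = int(raw_value)
        except ValueError:
            value = raw_value
    node[leaf] = value
    return result


cfg = apply_override(DEFAULTS, "websocket.port=9876")
cfg = apply_override(cfg, "runtime.auto_launch=true")
print(cfg["websocket"]["port"], cfg["runtime"]["auto_launch"])
```

In the real scripts Hydra performs this composition (with far richer parsing), so the helper is only useful for reasoning about which nested key an override touches.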

## Quick Start

### 1. Start Godot Game

First, ensure the Godot game is running with the RL Bridge enabled:

```bash
# Headless mode (no GUI)
godot --headless --path /path/to/UTDG

# Or with GUI for visualization
godot --path /path/to/UTDG
```

**Note**: The game must have the `RLBridge` node added to the main scene (see the Godot Setup section below).

### 2. Train an Agent

```python
from utdg_gym import UntitledTowerDefenseEnv
from stable_baselines3 import PPO

# Create environment
env = UntitledTowerDefenseEnv(
    host="127.0.0.1",
    port=9876,
    max_episode_steps=1000
)

# Create PPO agent
model = PPO("MultiInputPolicy", env, verbose=1)

# Train the agent
model.learn(total_timesteps=100000)

# Save the model
model.save("ppo_utdg")

# Close environment
env.close()
```
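
Before committing to a long training run, it can help to confirm that the environment loop works end to end with random actions. The helper below is a minimal sketch that relies only on the standard Gymnasium `reset()` and `step()` signatures; `run_random_episode` is a hypothetical name and is not part of `utdg_gym`.

```python
def run_random_episode(env, max_steps=1000):
    """Play one episode with random actions and return (total_reward, steps)."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        steps += 1
        # Gymnasium reports natural endings and time limits separately.
        if terminated or truncated:
            break
    return total_reward, steps


# Usage with a running Godot instance (not executed here):
#   env = UntitledTowerDefenseEnv(host="127.0.0.1", port=9876)
#   print(run_random_episode(env))
#   env.close()
```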

### 3. Use the Example Scripts

```bash
# Simple test (displays configuration)
python examples/simple_test.py

# Simple test with Godot auto-launch
python examples/simple_test.py \
    runtime.auto_launch=true \
    runtime.godot_path="path/to/godot/executable"

# Train PPO agent with custom parameters
python examples/train_ppo.py \
    ppo.learning_rate=0.0003 \
    training.training.total_timesteps=100000
```

## Godot Setup

To enable RL integration in your Godot project:

1. **Add RLBridge Node**: Open `base_level.tscn` and add the `RLBridge` node as a child of the root node.

2. **Configure References**: Select the RLBridge node and set the following exported variables:
   - `bank`: Reference to the Bank node
   - `base`: Reference to the Base node
   - `enemy_manager`: Reference to the EnemyManager node
   - `tower_manager`: Reference to the TowerManager node
   - `difficulty_manager`: Reference to the DifficultyManager node

3. **Configure Port** (optional): Set the `port` variable to match your Python environment.
   - Hydra config default: **9090** (`configs/default.yaml`)
   - Environment default: **9876** (`utdg_gym/env.py`)
   - Override via CLI: `websocket.port=9876` or `websocket.port=9090`
   - Ensure Godot and Python use the same port!

4. **Auto-start**: Ensure `auto_start` is enabled to start the WebSocket server automatically.

## Environment Details

### Observation Space

The observation is a dictionary containing:

- `gold`: Current gold amount (Box: [0, 10000])
- `base_health`: Current base health (Box: [0, 100])
- `base_max_health`: Maximum base health (Box: [0, 100])
- `num_enemies`: Number of active enemies (Box: [0, max_enemies])
- `num_towers`: Number of placed towers (Box: [0, max_towers])
- `game_time`: Elapsed game time in seconds (Box: [0, 1000])
- `enemy_positions`: Flattened array of enemy positions (Box: [-100, 100] × max_enemies × 3)
- `tower_positions`: Flattened array of tower positions (Box: [-100, 100] × max_towers × 3)
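
The flattened position arrays have a fixed length of `max_enemies × 3` (or `max_towers × 3`) regardless of how many entities are alive. The sketch below shows one way such a vector could be built; `flatten_positions` is a hypothetical helper, and zero padding for unused slots is an assumption rather than documented behavior.

```python
def flatten_positions(positions, max_entities):
    """Flatten up to max_entities (x, y, z) tuples into a fixed-length list."""
    flat = []
    for x, y, z in positions[:max_entities]:
        flat.extend((float(x), float(y), float(z)))
    # Assumption: unused slots are filled with zeros.
    flat.extend([0.0] * (max_entities * 3 - len(flat)))
    return flat


enemies = [(1.0, 0.0, -4.5), (12.0, 0.0, 3.0)]
vector = flatten_positions(enemies, max_entities=4)
print(len(vector))  # 4 slots x 3 coordinates = 12 values
```

A fixed length is what lets the observation fit a `Box` space, which is why `num_enemies` and `num_towers` are reported separately.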

### Action Space

Discrete action space with the following mapping:
- **Action 0**: Wait/Skip (do nothing)
- **Actions 1-99**: Place tower at grid position (mapped to world coordinates)
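
As an illustration of how a discrete action could map to a grid cell, the sketch below decodes an action index into a zero-based `(row, column)` pair. The grid layout is an assumption: `GRID_COLUMNS = 11` (with 9 rows, giving 99 cells) is a hypothetical choice, and the actual mapping to world coordinates is defined on the Godot side.

```python
GRID_COLUMNS = 11  # hypothetical layout: 11 columns x 9 rows = 99 cells


def decode_action(action):
    """Return None for the wait action, else a zero-based (row, column) cell."""
    if not 0 <= action <= 99:
        raise ValueError(f"action must be in [0, 99], got {action}")
    if action == 0:
        return None  # Action 0: wait/skip
    return divmod(action - 1, GRID_COLUMNS)


print(decode_action(0), decode_action(1), decode_action(99))
```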

### Reward Structure

The reward function is designed to align agent behavior with the core tower defense objective: protect the castle by building towers that eliminate enemies.

#### Reward Components

The total reward is computed from four components:

```python
total_reward = 1.0 * kills - 10.0 * base_damage + 50.0 * waves_cleared + 5.0 * towers_built
```

**1. Kill Reward: `+1.0 per enemy killed`**
- Rewards tower effectiveness and enemy elimination
- Scaled from game economy (15 gold per kill × 0.067 scaling factor)
- Reduced from +1.5 to emphasize base protection over kill-chasing
- Provides continuous feedback on defensive performance
- Tracked as `custom/reward_kills` in W&B

**2. Base Damage Penalty: `-10 per damage point`**
- Heavily penalizes enemies reaching the castle
- Primary failure signal that drives defensive behavior
- Calibrated to be strong (10× the kill reward) without suppressing exploration
- Tracked as `custom/reward_damage` in W&B

**3. Wave Clear Bonus: `+50 per wave cleared`**
- Rewards overall progress and wave completion
- Sparse but significant milestone reward
- Encourages long-term survival and advancement
- Tracked as `custom/reward_wave` in W&B

**4. Tower Building Reward: `+5 per tower placed`**
- Encourages proactive tower placement and defensive expansion
- Addresses risk-aversion by providing immediate positive feedback for building
- Small enough not to dominate but helps overcome exploration barriers
- Tracked as `custom/reward_towers_built` in W&B
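
The four components can be combined in a few lines. The sketch below uses the coefficients listed above; `compute_reward` and its dictionary keys are hypothetical names for illustration, since the actual calculation lives in the Godot bridge.

```python
def compute_reward(kills, base_damage, waves_cleared, towers_built):
    """Return the per-component rewards and their total for one step."""
    components = {
        "kills": 1.0 * kills,
        "damage": -10.0 * base_damage,
        "wave": 50.0 * waves_cleared,
        "towers_built": 5.0 * towers_built,
    }
    components["total"] = sum(components.values())
    return components


# One step with 3 kills, 1 point of base damage, and 1 new tower:
# 3.0 - 10.0 + 0.0 + 5.0 = -2.0
step = compute_reward(kills=3, base_damage=1, waves_cleared=0, towers_built=1)
print(step["total"])
```

A single point of base damage outweighs three kills and a new tower combined, which is the intended priority.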

#### Economic Balance

The reward structure is calibrated to encourage tower building while maintaining base protection as the top priority:

| Metric | Value | Implication |
|--------|-------|-------------|
| Tower cost | 100 gold | Agent must justify investment |
| Tower building reward | +5 | Immediate feedback (computed from observation delta) |
| Reward per kill | +1.0 | **Experiment 1: Reduced from +1.5** |
| Net tower value | +5 initial, then +1.0 per kill | Encourages strategic over opportunistic kills |
| Base damage penalty | -10 | **10× kill reward (stronger than 6.7× baseline)** |
| Wave clear bonus | +50 | ~50 kills equivalent |

**Design rationale (Experiment 1):**
- **Reduced kill reward** (+1.0 vs +1.5) shifts focus toward base protection
- Achieves **10× damage dominance** (same ratio as Option B) with proven -10 penalty
- Tests hypothesis: lower kill reward → more strategic placement, less kill-greedy behavior
- Maintains proven penalty level (-10) to avoid exploration suppression seen with -15
- Tower reward (+5) now active via observation-based tracking (tower_count delta)
- Wave bonuses reward long-term survival and proper defense scaling

**Key difference from baseline:**
- Baseline: 6.7× damage dominance (kills more rewarding)
- Experiment 1: 10× damage dominance (protection more important)
- Option B failure: 10× but with -15 penalty (too harsh, suppressed exploration)
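
The dominance ratios quoted above follow directly from dividing the damage penalty by the kill reward. In the sketch below, the +1.5 kill reward for Option B is inferred from its stated 10× ratio and -15 penalty; `damage_dominance` is a hypothetical helper name.

```python
def damage_dominance(damage_penalty, kill_reward):
    """How many kills one point of base damage cancels out."""
    return abs(damage_penalty) / kill_reward


variants = {
    "Baseline": (-10, 1.5),
    "Experiment 1": (-10, 1.0),
    "Option B": (-15, 1.5),  # kill reward inferred from the 10x ratio
}
ratios = {name: round(damage_dominance(*v), 1) for name, v in variants.items()}
print(ratios)
```

Experiment 1 and Option B reach the same ratio by different routes: one lowers the kill reward, the other raises the penalty.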

#### Expected Behavioral Changes

Agents trained with this reward structure should exhibit:

- 🛡️ **Stronger focus on base protection** (10× damage dominance vs 6.7× baseline)
- 🎯 **More strategic tower placement** (less emphasis on high-kill locations)
- 📊 **Reduced kill-chasing behavior** (lower kill reward discourages opportunistic placement)
- ⚖️ **Better long-term planning** (wave bonus relatively more valuable: 50 kills equivalent)
- 🔍 **Careful defensive positioning** (proven -10 penalty maintains exploration)

**Experiment 1 hypothesis:**
- Lower kill reward (+1.0 vs +1.5) shifts agent toward defensive strategies
- Maintains proven -10 penalty to avoid Option B's exploration suppression
- Tests if 10× damage dominance improves performance without risk-aversion

The reward function treats base protection as paramount (a -10 penalty per damage point, 10× the kill reward) while still giving feedback on defensive effectiveness: kills (+1.0), waves cleared (+50), and towers built (+5, computed from the tower-count delta between observations).

### Episode Termination

An episode ends when:
- Base health reaches zero (**Loss**)
- All waves complete and all enemies defeated (**Win**)
- Maximum episode steps reached (**Truncated**)
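
In Gymnasium terms, the first two conditions set `terminated` and the third sets `truncated`. The sketch below expresses that mapping; `episode_status` and its parameters are hypothetical names, and the precedence (loss checked before win, both before truncation) is an assumption.

```python
def episode_status(base_health, waves_complete, enemies_alive,
                   step_count, max_episode_steps):
    """Return (terminated, truncated, outcome) for the current step."""
    if base_health <= 0:
        return True, False, "loss"
    if waves_complete and enemies_alive == 0:
        return True, False, "win"
    if step_count >= max_episode_steps:
        return False, True, "truncated"
    return False, False, "running"


print(episode_status(0, False, 5, 120, 1000))
print(episode_status(80, True, 0, 640, 1000))
print(episode_status(80, False, 3, 1000, 1000))
```

Keeping truncation separate from termination matters for value bootstrapping: a time-limited episode did not actually reach a terminal state.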

## Advanced Usage

### Custom Reward Function

You can modify the reward calculation in `RLBridge/rl_bridge.gd`:

```gdscript
func _calculate_reward(action_success: bool) -> float:
    var reward: float = 0.0
    # Your custom reward logic here
    return reward
```

### Multiple Parallel Environments

```python
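from stable_baselines3.common.vec_env import SubprocVecEnv
from utdg_gym import UntitledTowerDefenseEnv

def make_env(rank):
    """Return a factory that builds one environment on its own port."""
    def _init():
        return UntitledTowerDefenseEnv(
            port=9876 + rank  # Different port per environment
        )
    return _init

# Create 4 parallel environments, one factory per rank.
# SubprocVecEnv starts worker processes, so in a script run this
# under `if __name__ == "__main__":`.
env = SubprocVecEnv([make_env(rank) for rank in range(4)])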
```

**Note**: You'll need to run multiple Godot instances, each on a different port.

### Using with Ray RLlib

```python
from ray.rllib.algorithms.ppo import PPOConfig
from utdg_gym import UntitledTowerDefenseEnv

config = (
    PPOConfig()
    .environment(env=UntitledTowerDefenseEnv)
    .framework("torch")
    .training(train_batch_size=4000)
)

algo = config.build()
for i in range(100):
    result = algo.train()
    print(f"Iteration {i}: reward={result['episode_reward_mean']}")
```

## Message Protocol

Communication uses JSON messages with the following format:

```json
{
    "type": "message_type",
    "data": { ... }
}
```

### Message Types

- **reset**: Request environment reset
- **reset_response**: Initial observation after reset
- **step**: Execute action
- **step_response**: Observation, reward, done, info
- **close**: Close connection
- **config**: Configuration parameters
- **error**: Error message

See `utdg_gym/protocol.py` for detailed message schemas.
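
The envelope format can be produced and checked with the standard `json` module. This is a sketch of the `type`/`data` envelope only: `encode_message` and `decode_message` are hypothetical helpers, and the `action` and `gold` payload keys are illustrative assumptions, not the schemas from `utdg_gym/protocol.py`.

```python
import json


def encode_message(message_type, data=None):
    """Serialize a message using the {"type": ..., "data": ...} envelope."""
    return json.dumps({"type": message_type, "data": data or {}})


def decode_message(raw):
    """Parse a message and return (type, data); raise on error messages."""
    message = json.loads(raw)
    if not isinstance(message, dict) or "type" not in message:
        raise ValueError("message must be a JSON object with a 'type' field")
    if message["type"] == "error":
        raise RuntimeError(f"server reported an error: {message.get('data')}")
    return message["type"], message.get("data", {})


step_request = encode_message("step", {"action": 5})  # "action" key is assumed
reply = decode_message('{"type": "reset_response", "data": {"gold": 100}}')
print(step_request)
print(reply)
```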

## Examples

The `examples/` directory contains:

### Reinforcement Learning
- `trainer.py`: Complete PPO training pipeline with Stable-Baselines3
- `evaluate.py`: Evaluate trained policies
- `record_video.py`: Record video demonstrations of trained agents
- `rollout.py`: Generate policy rollouts

### Imitation Learning

Train agents from human demonstrations using various imitation learning algorithms:

#### Data Collection
- `record_demos.py`: Collect human demonstration trajectories for imitation learning

#### Training Scripts
- `train_bc.py`: Behavioral Cloning (supervised learning from demonstrations)
- `train_gail.py`: GAIL (Generative Adversarial Imitation Learning)
- `train_airl.py`: AIRL (Adversarial Inverse Reinforcement Learning)

#### Inference Scripts
- `run_bc.py`: Run inference with trained BC policies
- `run_gail.py`: Run inference with trained GAIL policies
- `run_airl.py`: Run inference with trained AIRL policies

**Requirements**: `pip install imitation`

**Basic workflow**:
```bash
# 1. Collect human demonstrations
python examples/record_demos.py

# 2. Train using one of the imitation learning algorithms
python examples/train_bc.py      # Behavioral Cloning (fastest, simplest)
python examples/train_gail.py    # GAIL (adversarial training)
python examples/train_airl.py    # AIRL (learns reward function)

# 3. Run inference with trained policy
python examples/run_bc.py        # Run BC policy
python examples/run_gail.py      # Run GAIL policy
python examples/run_airl.py      # Run AIRL policy
```

**Algorithm Comparison**:
- **BC (Behavioral Cloning)**: Direct supervised learning from demonstrations. Fastest to train, but may not generalize well to novel situations.
- **GAIL**: Uses adversarial training to match expert behavior. More robust than BC, uses BasicRewardNet.
- **AIRL**: Also adversarial but learns a shaped reward function (BasicShapedRewardNet). Can transfer learned rewards to different tasks.

## Troubleshooting

### Connection Issues

- Ensure the Godot game is running before starting the Python script
- Check that the ports match between Godot (RLBridge) and the Python environment
- Verify that firewall settings allow local WebSocket connections
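
A quick way to rule out the first two causes is to test whether anything is listening on the expected port. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and it checks TCP reachability only, not whether the listener completes a WebSocket handshake.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 9090 is the Hydra config default; 9876 is the environment default.
    for port in (9090, 9876):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {state}")
```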

### Performance

- Use headless mode for faster training: `godot --headless`
- Reduce observation space size if needed (adjust `max_enemies`, `max_towers`)
- Consider running multiple parallel environments for faster data collection

### Godot Errors

- Ensure all node references in RLBridge are properly set
- Check that `try_place_tower()` method exists in TowerManager
- Verify `is_spawning()` method exists in DifficultyManager

## Hyperparameter Sweeps

This project uses [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps) for automated hyperparameter optimization. Sweep configurations are stored in `configs/sweeps/` and use Bayesian optimization to search over model, training, and environment parameters.

### Running a Sweep

1. **Create a new sweep:**

```bash
wandb sweep --project utdg --entity rl4aa configs/sweeps/sweep.yaml
```

2. **Start the sweep agent** using the ID returned from the previous command:

```bash
wandb agent rl4aa/utdg/<SWEEP_ID>
```

You can run multiple agents in parallel (on different machines or terminals) to speed up the search.

### Sweep Configuration

The sweep uses Hydra override syntax to pass hyperparameters to the training script. Key parameters include:

| Parameter | Range | Description |
|-----------|-------|-------------|
| `model.learning_rate` | 1e-5 to 1e-3 | Learning rate (log-uniform) |
| `model.gamma` | 0.9 to 0.999 | Discount factor |
| `model.batch_size` | 32, 64, 128, 256 | Batch size |
| `model.n_steps` | 1024 to 4096 | Steps per rollout |
| `training.total_timesteps` | 100k to 300k | Total training steps |

See `configs/sweeps/sweep.yaml` for the full parameter specification.
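
For intuition about the log-uniform range used for the learning rate, the sketch below draws samples uniformly in log space, so roughly half of them fall below 1e-4, the geometric midpoint of the range. It illustrates the distribution only and is not how W&B's Bayesian search selects values; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed for repeatability
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]
below_midpoint = sum(s < 1e-4 for s in samples)
print(f"{below_midpoint} of {len(samples)} samples are below 1e-4")
```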

## Contributing

Contributions are welcome! Please ensure:
- Code follows Google-style docstrings
- Type hints are used throughout
- Code is modular and well-documented

## License

See the main UTDG repository for license information.

## Acknowledgments

- Built with [Godot Engine](https://godotengine.org/)
- Uses [Gymnasium](https://gymnasium.farama.org/)
- Training examples use [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
