SMDPfier Documentation
Welcome to SMDPfier, a Gymnasium wrapper that enables Semi-Markov Decision Process (SMDP) behavior in reinforcement learning environments through Options with simple, natural time semantics.
Overview
SMDPfier transforms any Gymnasium environment into an SMDP by allowing agents to execute Options (sequences of primitive actions) where each primitive action = 1 tick of time, enabling natural SMDP discounting.
🎯 Key Insight: Each primitive action = 1 tick. Option duration = number of actions executed. Simple and natural.
Key Features
- 🔗 Flexible Options: Static sequences or dynamic discovery via callable
- ⚡ Two Interfaces: Index-based (Discrete actions) or direct Option passing
- ⏱️ Simple Time Semantics: Each primitive action = 1 tick, duration = k_exec
- 🎭 Action Masking: Support for discrete action availability
- 📊 Rich Info: Comprehensive execution metadata in info["smdp"]
- 🛡️ Error Handling: Detailed validation and runtime errors
- 🔄 Continuous Actions: Full support for continuous action spaces
- 🎲 Built-in Defaults: Ready-to-use option generators and reward aggregators
Quick Start
Index Interface (Recommended for RL)
import gymnasium as gym
from smdpfier import SMDPfier, Option
# Create environment and define options
env = gym.make("CartPole-v1")
options = [
    Option(actions=[0, 0, 1], name="left-left-right"),   # 3 actions = 3 ticks
    Option(actions=[1, 1, 0], name="right-right-left"),  # 3 actions = 3 ticks
    Option(actions=[0, 1], name="left-right"),           # 2 actions = 2 ticks
]
# Wrap with SMDPfier
smdp_env = SMDPfier(
    env,
    options_provider=options,   # Static options list
    action_interface="index",   # Discrete(3) action space
    max_options=len(options),
)
# Use like any Gym environment
obs, info = smdp_env.reset()
obs, reward, term, trunc, info = smdp_env.step(0) # Execute first option
# Access SMDP metadata
smdp = info["smdp"]
print(f"Option '{smdp['option']['name']}' executed {smdp['k_exec']} steps")
print(f"Duration: {smdp['duration']} ticks (= k_exec)")
print(f"Per-step rewards: {smdp['rewards']}")
# SMDP discounting: the value of whatever follows this option is
# discounted by gamma ** duration (the option's own reward is not)
gamma = 0.99
discount = gamma ** smdp['duration']
Direct Interface (Intuitive)
# Pass Option objects directly
smdp_env = SMDPfier(env, options_provider=options, action_interface="direct")
# Execute with Option object
obs, reward, term, trunc, info = smdp_env.step(options[0])
Core Concepts
Options
Options are sequences of primitive actions that are executed atomically:
# An option with 3 primitive actions
option = Option(
    actions=[0, 1, 0],            # Action sequence
    name="left-right-left",       # Human-readable name
    meta={"strategy": "zigzag"},  # Optional metadata
)
Time Semantics (v0.2.0+)
Simple and natural:
- Each primitive action = 1 tick of time
- Option duration = k_exec (number of primitive actions executed)
- If option completes: duration = len(option.actions)
- If terminated early: duration < len(option.actions)
Example:
option = Option([0, 1, 0], "three-action-option") # 3 actions
# If it completes normally: duration = 3 ticks
# If episode terminates after 2 actions: duration = 2 ticks
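The rule "duration = number of primitive actions actually executed" can be sketched in plain Python. This is a minimal illustration, not SMDPfier's implementation: `run_option`, `make_toy_step`, and the `(reward, done)` step signature are hypothetical stand-ins for a real environment, and treating an episode that ends exactly on the last action as *not* terminated early is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical primitive step: takes an action, returns (reward, done).
StepFn = Callable[[int], Tuple[float, bool]]


def run_option(actions: Sequence[int], step: StepFn) -> dict:
    """Execute primitive actions in order, stopping as soon as the episode ends."""
    rewards: List[float] = []
    done = False
    for action in actions:
        reward, done = step(action)
        rewards.append(reward)
        if done:
            break
    k_exec = len(rewards)
    return {
        "k_exec": k_exec,
        "duration": k_exec,  # one tick per executed primitive action
        "rewards": rewards,
        # Assumption: "early" means the episode ended before the last action ran.
        "terminated_early": done and k_exec < len(actions),
    }


def make_toy_step(episode_length: int) -> StepFn:
    """Toy environment: reward 1.0 per step, episode ends after episode_length steps."""
    state = {"t": 0}

    def step(action: int) -> Tuple[float, bool]:
        state["t"] += 1
        return 1.0, state["t"] >= episode_length

    return step


full = run_option([0, 1, 0], make_toy_step(episode_length=10))
cut = run_option([0, 1, 0], make_toy_step(episode_length=2))
print(full["duration"], cut["duration"])
```

The first call runs all three actions; the second stops after two because the toy episode ends, so its duration is 2 ticks.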
SMDP Discounting
Standard MDP: γ^{1} per primitive step
SMDP: γ^{k} where k = option duration
# MDP: Each primitive step discounts by γ
mdp_return = r₁ + γ¹·r₂ + γ²·r₃ + γ³·r₄
# SMDP: Each option discounts by γ^{duration}
# Options with lengths [3, 2, 4]:
smdp_return = r₁ + γ³·r₂ + γ⁵·r₃ + γ⁹·r₄
# Exponents are cumulative durations: 3, 3+2 = 5, 3+2+4 = 9
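As a runnable counterpart to the formula above, the following sketch accumulates elapsed ticks and discounts each option-level reward by the time that passed before that option started. The function name `smdp_return` and its argument layout are illustrative assumptions, not part of the SMDPfier API.

```python
from typing import Sequence


def smdp_return(rewards: Sequence[float], durations: Sequence[int], gamma: float) -> float:
    """Discounted return over a sequence of options.

    rewards[i] is the aggregated reward of the i-th option and durations[i]
    is its duration in ticks. Each reward is discounted by gamma raised to
    the total number of ticks elapsed before that option started.
    """
    total = 0.0
    elapsed = 0
    for reward, duration in zip(rewards, durations):
        total += (gamma ** elapsed) * reward
        elapsed += duration
    return total


# Four options; the first three have durations 3, 2, 4, so the exponents are 0, 3, 5, 9.
value = smdp_return([1.0, 1.0, 1.0, 1.0], [3, 2, 4, 1], gamma=0.5)
print(value)
```

The duration of the final option does not affect the result, since nothing follows it.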
Action Interfaces
SMDPfier provides two ways to select options:
Index Interface (Recommended for RL)
action_interface="index"
# Creates Discrete(max_options) action space
# Use integer indices: env.step(0), env.step(1), etc.
Direct Interface (Intuitive)
action_interface="direct"
# Pass Option objects directly: env.step(option)
SMDP Info Payload
Every step returns comprehensive metadata in info["smdp"]:
{
    "option": {
        "id": "abc123...",          # Stable hash-based ID
        "name": "left-right-left",  # Human-readable name
        "len": 3,                   # Number of actions
        "meta": {}                  # User metadata
    },
    "k_exec": 3,                    # Steps actually executed
    "duration": 3,                  # Duration in ticks (= k_exec)
    "rewards": [1.0, 1.0, 1.0],     # Per-step rewards
    "terminated_early": False,      # Episode ended during option?
    "action_mask": [1, 1, 0],       # Available options (index interface only)
    "num_dropped": 0                # Dropped options (index interface only)
}
See the API Reference for complete details.
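One common use of this payload is building a bootstrapped learning target in which the next state's value is discounted by `gamma ** duration`. The sketch below is a generic SMDP Q-learning target, not something SMDPfier provides: `smdp_td_target` and its parameters are hypothetical, and it assumes `reward` is the aggregated option reward returned by `step`.

```python
from typing import Mapping, Sequence


def smdp_td_target(
    reward: float,
    smdp_info: Mapping,
    next_q_values: Sequence[float],
    terminated: bool,
    gamma: float = 0.99,
) -> float:
    """Target = option reward + gamma**duration * best next-state value."""
    if terminated:
        # No bootstrapping from a terminal state.
        return reward
    discount = gamma ** smdp_info["duration"]
    return reward + discount * max(next_q_values)


# A payload shaped like the one documented above (only the fields used here).
payload = {"k_exec": 3, "duration": 3, "rewards": [1.0, 1.0, 1.0]}
target = smdp_td_target(
    reward=3.0, smdp_info=payload, next_q_values=[0.0, 2.0],
    terminated=False, gamma=0.5,
)
print(target)
```

If discounting within an option matters for your algorithm, the per-step values in `rewards` can be discounted individually before they are summed.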
Documentation Guide
| Section | Focus | When to Read |
|---|---|---|
| API Reference | Complete SMDPfier API | Setting up your wrapper |
| Durations Guide | Duration = k_exec, SMDP discounting | Understanding time semantics |
| Index vs Direct | Interface comparison | Choosing action interface |
| Masking & Precheck | Action constraints | Handling invalid actions |
| Error Handling | Debugging failed options | Troubleshooting |
| FAQ | Common questions | Quick answers |
| Migration from 0.1.x | Upgrading to v0.2.0 | Updating existing code |
Quick Navigation
🚀 New to SMDPfier? Start with the Quick Start above and FAQ.
🤖 Building an RL agent? See Index Interface and Durations.
🔧 Need custom behavior? Check API Reference and examples/.
❓ Something not working? Try Error Handling and FAQ.
⬆️ Upgrading from 0.1.x? See Migration Guide.
Next: API Reference | Examples: ../examples/
Installation
pip install smdpfier
For development:
git clone https://github.com/smdpfier/smdpfier.git
cd smdpfier
pip install -e .[dev]
Examples
CartPole with Static Options (Index Interface)
import gymnasium as gym
from smdpfier import Option, SMDPfier
from smdpfier.defaults import sum_rewards
# Create base environment
env = gym.make("CartPole-v1")
# Define static options
static_options = [
    Option([0, 0, 1], "left-left-right", meta={"category": "mixed"}),
    Option([1, 1, 0], "right-right-left", meta={"category": "mixed"}),
    Option([0, 0, 0], "left-triple", meta={"category": "directional"}),
    Option([1, 1, 1], "right-triple", meta={"category": "directional"}),
]
# Create SMDPfier with index interface (in v0.2.0+ duration is always k_exec)
smdp_env = SMDPfier(
    env,
    options_provider=static_options,
    reward_agg=sum_rewards,
    action_interface="index",
    max_options=len(static_options),
)
# Execute
obs, info = smdp_env.reset(seed=42)
obs, reward, terminated, truncated, info = smdp_env.step(0)
# Check results
smdp_info = info["smdp"]
print(f"Executed option: {smdp_info['option']['name']}")
print(f"Steps: {smdp_info['k_exec']}/{smdp_info['option']['len']}")
print(f"Duration: {smdp_info['duration']} ticks")
Taxi with Dynamic Options & Masking (Index Interface)
import gymnasium as gym
from smdpfier import Option, SMDPfier
from smdpfier.defaults import mean_rewards
def create_taxi_options(obs, info):
    """Dynamic option provider based on current state."""
    return [
        Option([0], "south", meta={"type": "primitive"}),
        Option([1], "north", meta={"type": "primitive"}),
        Option([2], "east", meta={"type": "primitive"}),
        Option([3], "west", meta={"type": "primitive"}),
        Option([4], "pickup", meta={"type": "primitive"}),
        Option([5], "dropoff", meta={"type": "primitive"}),
        # Navigation sequences
        Option([0, 2], "south-east", meta={"type": "navigation"}),
        Option([1, 3], "north-west", meta={"type": "navigation"}),
    ]
def taxi_availability_function(obs):
    """Restrict certain actions based on state."""
    # Movement always available
    available = [0, 1, 2, 3]
    # Conditionally add pickup/dropoff
    if (obs + 42) % 10 < 7:  # Pseudo-random condition
        available.append(4)  # pickup
    if (obs + 17) % 10 < 6:
        available.append(5)  # dropoff
    return available
# Create SMDPfier with masking (in v0.2.0+ duration is always k_exec)
env = gym.make("Taxi-v3")
smdp_env = SMDPfier(
    env,
    options_provider=create_taxi_options,
    reward_agg=mean_rewards,
    action_interface="index",
    max_options=12,
    availability_fn=taxi_availability_function,
    precheck=True,
)
obs, info = smdp_env.reset(seed=42)
# Check masking
mask = info["action_mask"]
available_options = [i for i, avail in enumerate(mask) if avail]
print(f"Available options: {available_options}")
obs, reward, terminated, truncated, info = smdp_env.step(available_options[0])
print(f"Mean reward: {reward:.3f}")
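When an agent selects options under a mask, it needs to restrict both exploration and greedy selection to the available indices. The helpers below are a plain-Python sketch of that idea; `sample_valid_option` and `greedy_valid_option` are hypothetical names and are not part of SMDPfier. They only assume the mask is a sequence of 0/1 flags as in the example above.

```python
import random
from typing import Optional, Sequence


def valid_indices(mask: Sequence[int]) -> list:
    """Indices whose mask entry is truthy."""
    return [i for i, allowed in enumerate(mask) if allowed]


def sample_valid_option(mask: Sequence[int], rng: Optional[random.Random] = None) -> int:
    """Uniformly sample one available option index."""
    valid = valid_indices(mask)
    if not valid:
        raise ValueError("action mask has no available options")
    return (rng or random.Random()).choice(valid)


def greedy_valid_option(q_values: Sequence[float], mask: Sequence[int]) -> int:
    """Pick the available option index with the highest estimated value."""
    valid = valid_indices(mask)
    if not valid:
        raise ValueError("action mask has no available options")
    return max(valid, key=lambda i: q_values[i])


mask = [1, 1, 0]
best = greedy_valid_option([5.0, 1.0, 9.0], mask)  # index 2 is masked out
explore = sample_valid_option(mask, random.Random(0))
print(best, explore in (0, 1))
```

Raising on an all-zero mask is a design choice for the sketch; an agent could instead fall back to a no-op option if the environment defines one.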
Pendulum with Continuous Actions (Direct Interface)
import gymnasium as gym
from smdpfier import Option, SMDPfier
from smdpfier.defaults import discounted_sum
# Create continuous action options
continuous_options = [
    Option([[1.0], [-1.0], [1.0], [-1.0]], "oscillate-high",
           meta={"category": "oscillation"}),
    Option([[0.5], [-0.5], [0.5]], "oscillate-medium",
           meta={"category": "oscillation"}),
    Option([[0.0], [0.0], [0.0]], "hold-steady",
           meta={"category": "stabilization"}),
]
# Create SMDPfier with direct interface (in v0.2.0+ duration is always k_exec)
env = gym.make("Pendulum-v1")
smdp_env = SMDPfier(
    env,
    options_provider=continuous_options,
    reward_agg=discounted_sum,
    action_interface="direct",
)
obs, info = smdp_env.reset(seed=42)
# Execute by passing Option objects directly
option_to_execute = continuous_options[0] # oscillate-high
obs, reward, terminated, truncated, info = smdp_env.step(option_to_execute)
smdp_info = info["smdp"]
print(f"Executed {smdp_info['k_exec']} actions")
print(f"Total duration: {smdp_info['duration']} ticks")
print(f"Discounted reward: {reward:.3f}")
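The three examples use different reward aggregators (sum, mean, discounted sum). To make the difference concrete, here is a plain-Python sketch of what such aggregators typically compute over an option's per-step rewards. The function names, signatures, and default discount are assumptions for illustration; the library's own `sum_rewards`, `mean_rewards`, and `discounted_sum` may differ.

```python
from typing import Sequence


def sum_agg(rewards: Sequence[float]) -> float:
    """Undiscounted sum of per-step rewards."""
    return float(sum(rewards))


def mean_agg(rewards: Sequence[float]) -> float:
    """Average per-step reward; 0.0 for an empty sequence (an assumption)."""
    return float(sum(rewards)) / len(rewards) if rewards else 0.0


def discounted_sum_agg(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum of rewards where the i-th step is weighted by gamma**i."""
    return float(sum((gamma ** i) * r for i, r in enumerate(rewards)))


step_rewards = [-1.0, -2.0, -4.0]  # e.g. Pendulum-style negative costs
print(sum_agg(step_rewards))
print(mean_agg(step_rewards))
print(discounted_sum_agg(step_rewards, gamma=0.5))
```

With `gamma=0.5` the discounted sum weights the three steps by 1, 0.5, and 0.25, so later costs contribute less than in the plain sum.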
Next Steps
- API Reference - Complete API documentation
- Usage Guide - Interface comparison and examples
- Durations - Understanding duration metadata and SMDP discounting
- FAQ - Common questions and troubleshooting