Frequently Asked Questions
Common questions and answers about SMDPfier concepts, usage, and troubleshooting.
Core Concepts
What is SMDPfier?
SMDPfier is a Gymnasium wrapper that transforms any environment into a Semi-Markov Decision Process (SMDP) by enabling:
- Options: Sequences of primitive actions executed atomically
- Simple Time Semantics: Each primitive action = 1 tick (v0.2.0+)
- SMDP Discounting: Using γ^{k}, where k = number of actions executed
Key Insight (v0.2.0+): Each primitive action = 1 tick. Duration = k_exec. Simple and natural.
How is this different from other hierarchical RL libraries?
| Aspect | SMDPfier | Other Libraries |
|---|---|---|
| Focus | Option-based SMDP behavior | General hierarchical RL |
| Complexity | Single wrapper class | Full frameworks |
| Environment Support | Any Gymnasium env unchanged | Often require specific environments |
| Temporal Modeling | Each action = 1 tick | Usually step-based |
| Integration | Drop-in wrapper | Framework-specific |
When should I use SMDPfier?
✅ Use SMDPfier when you want to:
- Apply SMDP discounting with γ^{duration}, where duration = actions executed
- Test hierarchical policies with temporal abstractions
- Add option-level control to existing environments
- Research temporal abstractions with simple time semantics
❌ Don't use SMDPfier when:
- You need complex option discovery algorithms
- You want a full hierarchical RL framework (use an HRL library)
- You don't care about temporal discounting (a standard MDP is fine)
Time Semantics (v0.2.0+)
How does time work in SMDPfier?
Simple rule: Each primitive action = 1 tick.
option = Option([0, 1, 0], "three-actions") # 3 actions
# Executes 3 primitive actions → duration = 3 ticks
# If terminated early after 2 actions → duration = 2 ticks
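The rule above can be sketched as a small execution loop. This is a minimal illustration, not SMDPfier's actual implementation: `run_option` and `CountdownEnv` are hypothetical names, and the stand-in environment only mimics the five-value return of a Gymnasium `step()`. How the real wrapper sets its early-termination flag when the episode ends exactly on the last action is not specified here, so the sketch's choice is an assumption.

```python
class CountdownEnv:
    """Hypothetical stand-in env: the episode terminates on the n-th step."""

    def __init__(self, steps_until_done):
        self.remaining = steps_until_done

    def step(self, action):
        self.remaining -= 1
        terminated = self.remaining <= 0
        # Same 5-tuple shape as Gymnasium: obs, reward, terminated, truncated, info
        return None, 1.0, terminated, False, {}


def run_option(env, actions):
    """Execute primitive actions in order; return (k_exec, duration, terminated_early)."""
    k_exec = 0
    for action in actions:
        _, _, terminated, truncated, _ = env.step(action)
        k_exec += 1  # each primitive action costs exactly one tick
        if terminated or truncated:
            break  # stop as soon as the episode ends
    duration = k_exec  # duration equals the number of actions executed
    terminated_early = k_exec < len(actions)  # assumption: "early" means actions were left over
    return k_exec, duration, terminated_early


print(run_option(CountdownEnv(10), [0, 1, 0]))  # all 3 actions run
print(run_option(CountdownEnv(2), [0, 1, 0]))   # episode ends after 2 actions
```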
Why did duration modeling change in v0.2.0?
Simplicity and clarity. The old system separated "steps" from "ticks", which was confusing. Now:
- Each action = 1 tick (natural and intuitive)
- Duration = k_exec (number of actions executed)
- No complex duration functions needed
See Migration Guide for upgrading from 0.1.x.
How do I control option duration?
Use option length:
# Want 2 ticks? Use 2 actions:
short_option = Option([0, 1], "short") # 2 ticks
# Want 5 ticks? Use 5 actions:
long_option = Option([0, 1, 0, 1, 0], "long") # 5 ticks
# Want 1 tick? Use 1 action:
instant_option = Option([0], "instant") # 1 tick
SMDP Discounting
How do I apply SMDP discounting?
Use duration from the info payload:
obs, reward, term, trunc, info = env.step(action)
# Get duration (equals k_exec)
duration = info["smdp"]["duration"]
# Apply SMDP discounting
gamma = 0.99
discounted_reward = reward * (gamma ** duration)
# Track cumulative time for multi-step discounting
# (total_time must be initialized to 0 at the start of each episode)
total_time += duration
next_discount = gamma ** total_time
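For a whole episode, one common convention discounts each option's reward by the ticks that elapsed before that option started. The following is a minimal sketch under that assumption: each transition is a `(reward, duration)` pair taken from `env.step()` and `info["smdp"]["duration"]`, and `smdp_return` is a hypothetical helper, not part of SMDPfier. It treats the macro reward as a single lump received when the option starts; whether that suits your setup depends on the reward aggregation you use.

```python
def smdp_return(transitions, gamma=0.99):
    """Discounted return over (macro_reward, duration) pairs."""
    total_time = 0  # ticks elapsed before the current option started
    ret = 0.0
    for reward, duration in transitions:
        ret += (gamma ** total_time) * reward
        total_time += duration  # advance the clock by the option's actual duration
    return ret


# Three options lasting 3, 2, and 1 ticks
episode = [(3.0, 3), (2.0, 2), (1.0, 1)]
print(smdp_return(episode, gamma=0.9))
```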
What happens with early termination?
When the episode terminates before an option completes, duration reflects actual execution:
# Option has 5 actions, but episode terminates after 2 actions
info["smdp"] = {
    "k_exec": 2,               # 2 actions executed
    "duration": 2,             # 2 ticks (= k_exec)
    "terminated_early": True,  # Flag indicating early termination
}
Why this matters:
- Use actual duration for discounting, not planned length
- Prevents incorrect temporal credit assignment
- Handles partial execution correctly
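In value-based learning, the duration typically enters through the bootstrap term. The sketch below shows one standard form of the SMDP Q-learning target; it is not something SMDPfier computes for you, and `smdp_td_target` and its arguments are hypothetical names.

```python
def smdp_td_target(reward, duration, next_q_values, terminated, gamma=0.99):
    """SMDP Q-learning target: r + gamma**duration * max over next option values."""
    if terminated:
        return reward  # no bootstrapping past a terminal state
    return reward + (gamma ** duration) * max(next_q_values)


# A 5-action option cut short by truncation after 2 actions:
# bootstrap with gamma**2 (actual duration), not gamma**5 (planned length)
print(smdp_td_target(reward=2.0, duration=2, next_q_values=[1.0, 3.0],
                     terminated=False, gamma=0.9))
```

Using the planned length here would shrink the bootstrap term from `0.9 ** 2` to `0.9 ** 5`, which undervalues the next state.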
Action Interfaces
Index vs Direct: Which should I use?
| Use Case | Recommended Interface | Why |
|---|---|---|
| RL Training | Index | Discrete action space for algorithms |
| Scripting/Testing | Direct | Intuitive option objects |
| Action Masking | Index | Built-in mask support |
| Debugging | Direct | Clear option identification |
| Continuous Actions | Direct | Natural continuous support |
Common Pattern: Start with direct for prototyping, switch to index for RL training.
How does action masking work in index interface?
def availability_fn(obs):
    """Return available option indices."""
    if obs[0] > 0.5:  # Cart far right
        return [0, 2]  # Only left-based options
    else:
        return [0, 1, 2]  # All options
env = SMDPfier(..., availability_fn=availability_fn)
obs, info = env.reset()
mask = info["smdp"]["action_mask"] # e.g., [1, 0, 1] = options 0,2 available
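One way to consume the mask during training is to restrict the greedy choice to available options. This is a sketch under the assumption that the mask is a sequence of 0/1 flags, as in the example above; `indices_to_mask` and `masked_argmax` are hypothetical helpers, not part of SMDPfier.

```python
def indices_to_mask(available, max_options):
    """Turn a list of available option indices into a 0/1 mask."""
    allowed = set(available)
    return [1 if i in allowed else 0 for i in range(max_options)]


def masked_argmax(q_values, mask):
    """Index of the highest-valued option among those the mask allows."""
    candidates = [i for i, flag in enumerate(mask) if flag]
    if not candidates:
        raise ValueError("mask allows no options")
    return max(candidates, key=lambda i: q_values[i])


mask = indices_to_mask([0, 2], max_options=3)
print(mask)                                  # [1, 0, 1]
print(masked_argmax([0.1, 0.9, 0.4], mask))  # 2 (option 1 is masked out)
```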
What happens with dynamic options overflow?
When dynamic options exceed max_options:
env = SMDPfier(..., max_options=3)
# If generator returns 5 options:
# - First 3 options are used
# - Last 2 options are dropped
# - info["smdp"]["num_dropped"] = 2
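The truncation described above amounts to keeping a prefix of the generated list. A minimal sketch, assuming plain first-come-first-kept behavior; `clip_options` is a hypothetical name and the strings stand in for real option objects.

```python
def clip_options(options, max_options):
    """Keep the first max_options entries and report how many were dropped."""
    kept = list(options[:max_options])
    num_dropped = max(0, len(options) - max_options)
    return kept, num_dropped


generated = ["opt_a", "opt_b", "opt_c", "opt_d", "opt_e"]
kept, num_dropped = clip_options(generated, max_options=3)
print(kept, num_dropped)  # ['opt_a', 'opt_b', 'opt_c'] 2
```

Because only the tail is dropped, a generator that may overflow should return its most important options first.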
Environment Compatibility
Can I use continuous actions?
Yes! SMDPfier fully supports continuous action spaces:
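import gymnasium as gym

# Continuous options (Pendulum-v1 has a 1-D action space)
continuous_options = [
    Option([[-1.0], [0.0], [1.0]], "left-center-right"),  # 3 ticks
]
# Multi-dimensional actions work the same way, but only in an
# environment whose action space is 2-D (not Pendulum-v1):
# Option([[0.5, 0.2]], "multi-dim-action")  # 1 tick
env = SMDPfier(
    gym.make("Pendulum-v1"),
    options_provider=continuous_options,
    action_interface="direct",  # Recommended for continuous
)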
What environments work with SMDPfier?
Any Gymnasium environment! SMDPfier is a generic wrapper.
✅ Confirmed Compatible:
- CartPole, Pendulum, MountainCar
- Atari games
- MuJoCo environments
- Custom environments
- Multi-agent environments (with appropriate setup)
What happens if my environment terminates during an option?
SMDPfier handles this gracefully:
- Execution stops immediately when env.step() returns terminated=True or truncated=True
- Duration reflects actual execution (duration = k_exec)
- Info payload contains early termination flag
# Option with 5 actions, but episode terminates after 2 actions
info["smdp"] = {
    "k_exec": 2,               # Only 2 actions executed
    "duration": 2,             # 2 ticks (= k_exec)
    "terminated_early": True,  # Flag early termination
    # ...
}
Common Issues
My RL algorithm isn't learning - what's wrong?
Common causes:
- Wrong discounting: Use duration, not step count
    # Wrong
    discount = gamma ** step_count
    # Right
    discount = gamma ** info["smdp"]["duration"]
- Action space mismatch: Ensure max_options is correct
    # Correct for static options
    max_options = len(static_options)
    # Might be needed for dynamic options
    max_options = 10  # Set appropriately
- Reward aggregation: Check if default sum is appropriate
    # Try different aggregation
    reward_agg = mean_rewards  # Average instead of sum
    reward_agg = discounted_sum(0.99)  # Internal discounting
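To see how the aggregation choice changes the macro reward, the three strategies named above can be approximated in a few lines. These are illustrative stand-ins with hypothetical names (`sum_agg`, `mean_agg`, `discounted_sum_agg`); the library's own `sum_rewards`, `mean_rewards`, and `discounted_sum` may differ in signature or detail.

```python
def sum_agg(rewards):
    """Macro reward = plain sum of per-step rewards."""
    return sum(rewards)


def mean_agg(rewards):
    """Macro reward = average per-step reward (0.0 for an empty list)."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def discounted_sum_agg(gamma):
    """Build an aggregator that discounts the i-th step reward by gamma**i."""
    def agg(rewards):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))
    return agg


rewards = [1.0, 1.0, 1.0]  # per-step rewards from a 3-action option
print(sum_agg(rewards))                  # 3.0
print(mean_agg(rewards))                 # 1.0
print(discounted_sum_agg(0.9)(rewards))  # 1 + 0.9 + 0.81
```

With a sum, longer options collect proportionally larger rewards; a mean removes that length dependence, which can matter when options of different lengths compete.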
I'm getting SMDPOptionValidationError - what does it mean?
This means an option failed precheck validation:
try:
    env.step(action)
except SMDPOptionValidationError as e:
    print(f"Option '{e.option_name}' failed at step {e.failing_step_index}")
    print(f"Action {e.action_repr} is invalid in current state")
    print(f"State summary: {e.short_obs_summary}")
Solutions:
- Check if actions are valid for current environment state
- Verify action space compatibility
- Use availability_fn to mask invalid options
- Turn off precheck: precheck=False
My options aren't being executed correctly
Debug steps:
- Check option construction:
    option = Option([0, 1, 0], "test")
    print(f"Actions: {option.actions}")
    print(f"Length: {len(option.actions)}")
- Verify action validity:
    # Test each action individually in base environment
    base_env.reset()
    for action in option.actions:
        obs, reward, term, trunc, info = base_env.step(action)
        if term or trunc:
            print(f"Action {action} terminates episode!")
            break  # Do not step an environment that has already ended
- Check execution details:
    obs, reward, term, trunc, info = smdp_env.step(option)
    smdp = info["smdp"]
    print(f"Planned steps: {smdp['option']['len']}")
    print(f"Executed steps: {smdp['k_exec']}")
    print(f"Rewards: {smdp['rewards']}")
Performance is slow - how can I optimize?
Performance tips:
- Cache options when possible:
    # Pre-compute static options
    static_options = [
        Option([i, j], f"option_{i}_{j}")
        for i in range(2)
        for j in range(2)
    ]
- Use simple reward aggregators:
    # Simple aggregators are faster
    from smdpfier.defaults import sum_rewards, mean_rewards
    reward_agg = sum_rewards  # Fast
- Use appropriate max_options:
    # Don't over-allocate
    max_options = 5  # If you typically have 3-5 options
    # vs
    max_options = 100  # Wastes memory and computation
Advanced Usage
Can I modify options during execution?
No. Options are executed atomically. However, you can:
- Use dynamic options that change between executions
- Create context-aware generators that consider current state
- Implement early termination via environment design
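A context-aware generator can be as simple as a function that inspects the observation and returns a different option set. The sketch below is an assumption about shape only: `cartpole_option_specs` is a hypothetical name, it returns plain `(actions, name)` pairs rather than real `Option` objects, and the exact callable signature SMDPfier expects from a dynamic provider should be checked in the API Reference. It relies on CartPole's documented layout, where `obs[2]` is the pole angle, action 0 pushes left, and action 1 pushes right.

```python
def cartpole_option_specs(obs):
    """Return (actions, name) pairs chosen from the current observation."""
    pole_angle = obs[2]  # CartPole observation: [x, x_dot, theta, theta_dot]
    if pole_angle > 0:
        # Pole leans right: offer options that start by pushing right
        return [([1, 1], "right-right"), ([1, 0], "right-left")]
    # Pole leans left (or is upright): offer options that start by pushing left
    return [([0, 0], "left-left"), ([0, 1], "left-right")]


for actions, name in cartpole_option_specs([0.0, 0.0, 0.05, 0.0]):
    print(name, actions)
```

Each pair could then be wrapped as `Option(actions, name)`, following the constructor form used elsewhere in this FAQ.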
How do I debug option sequences?
Detailed info inspection:
obs, reward, term, trunc, info = env.step(action)
smdp = info["smdp"]
print("=== Option Execution Details ===")
print(f"Option: {smdp['option']['name']} (ID: {smdp['option']['id']})")
print(f"Actions: {smdp['option']['len']} total")
print(f"Executed: {smdp['k_exec']} steps")
print(f"Duration: {smdp['duration']} ticks (= k_exec)")
print(f"Per-step rewards: {smdp['rewards']}")
print(f"Macro reward: {reward}")
print(f"Early termination: {smdp['terminated_early']}")
# Compare against None: an array-valued mask has no unambiguous truth value
if smdp.get("action_mask") is not None:
    print(f"Available options next: {smdp['action_mask']}")
Can I nest SMDPfiers?
Technically possible but not recommended. SMDPfier is designed as a single-level abstraction. For multi-level hierarchy, consider:
- Option composition: Create complex options from simple ones
- Custom option generators: Generate hierarchical option sets
- External hierarchy: Use SMDPfier as one level in a larger system
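Option composition, the first alternative above, can be done before the wrapper ever sees the options: concatenate the primitive-action sequences of simple options into one longer sequence. A minimal sketch with a hypothetical helper `compose_actions`, operating on plain action lists:

```python
def compose_actions(*sequences):
    """Concatenate primitive-action sequences into one longer sequence."""
    combined = []
    for seq in sequences:
        combined.extend(seq)
    return combined


push_left = [0, 0]
push_right = [1, 1]
zigzag = compose_actions(push_left, push_right, push_left)
print(zigzag)       # [0, 0, 1, 1, 0, 0]
print(len(zigzag))  # 6 actions, so 6 ticks if fully executed
```

The result could then be passed to the `Option` constructor, for example `Option(zigzag, "zigzag")`, which keeps the hierarchy flat from SMDPfier's point of view.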
Still have questions? Check the API Reference or examples for more details.