Metadata-Version: 1.0
Name: yarlp
Version: 0.0.9
Summary: yarlp
Home-page: https://github.com/btaba/yarlp
Author: Baruch Tabanpour
Author-email: baruch@tabanpour.info
License: MIT
Description-Content-Type: UNKNOWN
Description: [![Build Status](https://travis-ci.org/btaba/yarlp.svg?branch=master)](https://travis-ci.org/btaba/yarlp)
        
        ## yarlp
        
        **Yet Another Reinforcement Learning Package**
        
        Implementations of [`CEM`](/yarlp/agent/cem_agent.py), [`REINFORCE`](/yarlp/agent/pg_agents.py), [`TRPO`](/yarlp/agent/trpo_agent.py), [`DDQN`](/yarlp/agent/ddqn_agent.py), and [`A2C`](/yarlp/agent/a2c_agent.py) with reproducible benchmarks. Experiments are templated using `jsonschema` and are compared to published results. This is meant to be a starting point for working implementations of classic RL algorithms. Unfortunately, even implementations from OpenAI baselines are [not always reproducible](https://github.com/openai/baselines/issues/176).
        
        A Docker image with `yarlp` installed can be built and run with:
        
        * `docker build -t "yarlpd" .`
        * `docker run -it yarlpd bash`
        
        To run a benchmark, start from the help output of the experiment script:
        
        `python yarlp/experiment/experiment.py --help`
        
        
        If you want to run things manually, see the `examples` directory or start from this snippet:
        
        ```python
        from yarlp.agent.trpo_agent import TRPOAgent
        from yarlp.utils.env_utils import NormalizedGymEnv
        
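        # Wrap the Gym environment in yarlp's normalizing wrapper.
        env = NormalizedGymEnv('MountainCarContinuous-v0')
        # Fix the seed so that runs are repeatable.
        agent = TRPOAgent(env, seed=123)
        # Train for one million environment time-steps.
        agent.train(max_timesteps=1000000)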
        ```
        
        ## Benchmarks
        
        We benchmark against published results and OpenAI [`baselines`](https://github.com/openai/baselines), where available, using [`yarlp/experiment/experiment.py`](/yarlp/experiment/experiment.py). Benchmark scripts for OpenAI `baselines` were written ad hoc, such as [this one](https://github.com/btaba/baselines/blob/master/baselines/trpo_mpi/run_trpo_experiment.py).
        
        ### Atari10M
        
        ||||
        |---|---|---|
        |![BeamRider](/assets/atari10m/ddqn/beamrider.gif)|![Breakout](/assets/atari10m/ddqn/breakout.gif)|![Pong](/assets/atari10m/ddqn/pong.gif)|
        |![QBert](/assets/atari10m/ddqn/qbert.gif)|![Seaquest](/assets/atari10m/ddqn/seaquest.gif)|![SpaceInvaders](/assets/atari10m/ddqn/spaceinvaders.gif)|
        
        
        #### DDQN with dueling networks and prioritized replay
        
        `python yarlp/experiment/experiment.py run_atari10m_ddqn_benchmark`
        
        
        I trained on 6 Atari environments for 10M time-steps (**40M frames**) each, using 1 random seed, since I only have 1 GPU and limited time on this Earth. I used DDQN with dueling networks but without prioritized replay (although it's implemented). I compare yarlp's final mean raw score over 100 episodes (with exploration of 0.01) against results from [Hasselt et al, 2015](https://arxiv.org/pdf/1509.06461.pdf) and [Wang et al, 2016](https://arxiv.org/pdf/1511.06581.pdf), which train for **200M frames** and evaluate on 100 episodes (exploration of 0.05).
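        
        As a rough illustration of the two quantities above, the sketch below converts time-steps to frames and computes a final mean score over the last 100 episodes. The frame skip of 4 is inferred from the 10M time-steps / 40M frames figures, and the helper names `timesteps_to_frames` and `final_mean_score` are hypothetical; this is not the code `yarlp` uses to produce the table.
        
        ```python
        from collections import deque
        
        # Assumption: one agent time-step spans 4 emulator frames (10M steps -> 40M frames).
        FRAME_SKIP = 4
        
        
        def timesteps_to_frames(timesteps, frame_skip=FRAME_SKIP):
            """Convert agent time-steps to emulator frames."""
            return timesteps * frame_skip
        
        
        def final_mean_score(episode_returns, window=100):
            """Mean raw return over the last `window` completed episodes."""
            # A bounded deque keeps only the most recent `window` returns.
            recent = deque(episode_returns, maxlen=window)
            if not recent:
                raise ValueError("no completed episodes")
            return sum(recent) / len(recent)
        ```
        
        For example, `timesteps_to_frames(10_000_000)` gives the 40M frames quoted above, and `final_mean_score` ignores everything except the last 100 episode returns.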
        
        I don't compare to OpenAI baselines because the OpenAI DDQN implementation is **not** currently able to reproduce published results as of 2018-01-20 (see [this GitHub issue](https://github.com/openai/baselines/issues/176)). That said, I found [these benchmark plots](https://github.com/openai/baselines-results/blob/master/dqn_results.ipynb) to be pretty helpful.
        
        |env|yarlp DUEL 40M Frames|Hasselt et al DDQN 200M Frames|Wang et al DUEL 200M Frames|
        |---|---|---|---|
        |BeamRider|8705|7654|12164|
        |Breakout|423.5|375|345|
        |Pong|20.73|21|21|
        |QBert|5410.75|14875|19220.3|
        |Seaquest|5300.5|7995|50245.2|
        |SpaceInvaders|1978.2|3154.6|6427.3|
        
        
        |   |   |   |   |
        |---|---|---|---|
        |![BeamRiderNoFrameskip-v4](/assets/atari10m/ddqn/BeamRiderNoFrameskip-v4.png)|![BreakoutNoFrameskip-v4](/assets/atari10m/ddqn/BreakoutNoFrameskip-v4.png)|![PongNoFrameskip-v4](/assets/atari10m/ddqn/PongNoFrameskip-v4.png)|![QbertNoFrameskip-v4](/assets/atari10m/ddqn/QbertNoFrameskip-v4.png)|
        |![SeaquestNoFrameskip-v4](/assets/atari10m/ddqn/SeaquestNoFrameskip-v4.png)|![SpaceInvadersNoFrameskip-v4](/assets/atari10m/ddqn/SpaceInvadersNoFrameskip-v4.png)||
        
        
        #### A2C
        
        `python yarlp/experiment/experiment.py run_atari10m_a2c_benchmark`
        
        
        A2C trained for 10M time-steps (**40M frames**) with 1 random seed. Results are compared to the learning curves in Figure 3 of [Mnih et al, 2016](https://arxiv.org/pdf/1602.01783.pdf), read off at 10M time-steps. You are invited to run multiple seeds and the full 200M frames for a better comparison.
        
        |env|yarlp A2C 40M|Mnih et al A3C 40M 16-threads|
        |---|---|---|
        |BeamRider|3150|~3000|
        |Breakout|418|~150|
        |Pong|20|~20|
        |QBert|3644|~1000|
        |SpaceInvaders|805|~600|
        
        |   |   |   |   |
        |---|---|---|---|
        |![BeamRiderNoFrameskip-v4](/assets/atari10m/a2c/BeamRiderNoFrameskip-v4.png)|![BreakoutNoFrameskip-v4](/assets/atari10m/a2c/BreakoutNoFrameskip-v4.png)|![PongNoFrameskip-v4](/assets/atari10m/a2c/PongNoFrameskip-v4.png)|![QbertNoFrameskip-v4](/assets/atari10m/a2c/QbertNoFrameskip-v4.png)|
        |![SeaquestNoFrameskip-v4](/assets/atari10m/a2c/SeaquestNoFrameskip-v4.png)|![SpaceInvadersNoFrameskip-v4](/assets/atari10m/a2c/SpaceInvadersNoFrameskip-v4.png)||
        
        Here are some [more plots](https://github.com/openai/baselines-results/blob/master/acktr_ppo_acer_a2c_atari.ipynb) from OpenAI to compare against.
        
        ### Mujoco1M
        
        #### TRPO
        
        `python yarlp/experiment/experiment.py run_mujoco1m_benchmark`
        
        We average over 5 random seeds instead of 3 for both `baselines` and `yarlp`, and report 95% confidence intervals. More seeds probably wouldn't hurt here.
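        
        One common way to get a 95% confidence interval from a handful of seeds is a Student's t interval on the per-seed returns. The sketch below is only an illustration under that assumption: the plotting code in `yarlp` may compute its intervals differently (for example by bootstrapping), and `mean_and_ci95` is a hypothetical helper.
        
        ```python
        import math
        import statistics
        
        # Two-sided 95% critical values of Student's t, keyed by the number of seeds n
        # (degrees of freedom are n - 1).
        T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}
        
        
        def mean_and_ci95(returns_per_seed):
            """Return (mean, half_width) of a 95% t interval over seeds."""
            n = len(returns_per_seed)
            if n not in T_CRIT_95:
                raise ValueError("this sketch only tabulates 2 to 5 seeds")
            mean = statistics.mean(returns_per_seed)
            # Standard error of the mean, using the sample standard deviation.
            sem = statistics.stdev(returns_per_seed) / math.sqrt(n)
            return mean, T_CRIT_95[n] * sem
        ```
        
        With 5 seeds the critical value is 2.776 rather than the familiar 1.96, so the interval is noticeably wider than a normal approximation would suggest.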
        
        |   |   |   |   |
        |---|---|---|---|
        |![Hopper-v1](/assets/mujoco1m/trpo/Hopper-v1.png)|![HalfCheetah-v1](/assets/mujoco1m/trpo/HalfCheetah-v1.png)|![Reacher-v1](/assets/mujoco1m/trpo/Reacher-v1.png)|![Swimmer-v1](/assets/mujoco1m/trpo/Swimmer-v1.png)|
        |![InvertedDoublePendulum-v1](/assets/mujoco1m/trpo/InvertedDoublePendulum-v1.png)|![Walker2d-v1](/assets/mujoco1m/trpo/Walker2d-v1.png)|![InvertedPendulum-v1](/assets/mujoco1m/trpo/InvertedPendulum-v1.png)|
        
        
        ## CLI scripts
        
        CLI convenience scripts will be installed with the package:
        
        * Run a benchmark:
        	* `python yarlp/experiment/experiment.py --help`
        * Plot `yarlp` compared to OpenAI `baselines` benchmarks:
        	* `compare_benchmark <yarlp-experiment-dir> <baseline-experiment-dir>`
        * Experiments:
        	* Experiments are defined in JSON and validated with `jsonschema`. See [here](/experiment_configs) for sample experiment configs. If a config specifies multiple parameter values, a grid search over them runs in parallel.
        	* Example: `run_yarlp_experiment --spec-file experiment_configs/trpo_experiment_mult_params.json`
        * Experiment plots:
        	* `make_plots <experiment-dir>`
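        
        To make the grid-search idea concrete, the sketch below expands a spec with list-valued parameters into one run per combination. The spec keys (`env`, `learning_rate`, `seed`) and the `expand_grid` helper are hypothetical and do not reflect the actual `yarlp` schema; see the sample configs for the real format.
        
        ```python
        import itertools
        import json
        
        
        def expand_grid(params):
            """Yield one flat dict per combination of list-valued parameters."""
            keys = sorted(params)
            # Treat scalar values as single-element lists so they appear in every run.
            value_lists = [
                params[k] if isinstance(params[k], list) else [params[k]] for k in keys
            ]
            for combo in itertools.product(*value_lists):
                yield dict(zip(keys, combo))
        
        
        # Hypothetical spec: two seeds x two learning rates on a single environment.
        spec = json.loads(
            '{"env": "Hopper-v1", "learning_rate": [0.01, 0.001], "seed": [1, 2]}'
        )
        runs = list(expand_grid(spec))
        ```
        
        Here `runs` holds four configurations, each of which could be handed to a separate worker process.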
        
Keywords: reinforcement learning,deep reinforcement learning,experiment,benchmark
Platform: UNKNOWN
