Categorical DQN (C51)
Overview
C51 introduces a distributional perspective for DQN: instead of learning a single value for an action, C51 learns a full distribution of values for that action. With this perspective, C51 applies a distributional variant of the Bellman equation to learn the approximate value distribution. Empirically, C51 demonstrates impressive performance in ALE.
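Concretely, instead of the expected return \(Q(s, a)\), C51 models the full return distribution \(Z(s, a)\), which obeys a distributional variant of the Bellman equation (where \(\stackrel{D}{=}\) denotes equality in distribution, and \(S', A'\) are the next state and action):

\[
Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(S', A')
\]

C51 approximates \(Z(s, a)\) with a categorical distribution supported on 51 fixed, evenly spaced atoms (hence the name), and minimizes the cross entropy between the predicted distribution and the projected Bellman target.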
Original papers:

- [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)
Implemented Variants
| Variants Implemented | Description |
|---|---|
| `c51_atari.py`, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques. |
| `c51.py`, docs | For classic control tasks like `CartPole-v1`. |
Below are our single-file implementations of C51:
c51_atari.py
c51_atari.py has the following features:

- For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques (a network sketch follows this list).
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
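The following is a minimal sketch of such a network, assuming the standard 84×84×4 frame-stacked input produced by the Atari wrappers; the layer sizes follow the classic Nature DQN trunk, and the exact architecture in `c51_atari.py` may differ slightly. The key difference from DQN is the head: it outputs `n_atoms` logits per action, and a softmax turns them into a per-action probability mass function over the fixed return support.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Sketch of a C51 network for Atari: conv trunk + per-action categorical head."""

    def __init__(self, n_actions: int, n_atoms: int = 51, v_min: float = -10, v_max: float = 10):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        # fixed support of n_atoms atoms, evenly spaced between v_min and v_max
        self.register_buffer("atoms", torch.linspace(v_min, v_max, n_atoms))
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512), nn.ReLU(),
            nn.Linear(512, n_actions * n_atoms),  # one categorical head per action
        )

    def forward(self, x):
        logits = self.network(x / 255.0).view(-1, self.n_actions, self.n_atoms)
        pmfs = torch.softmax(logits, dim=2)    # per-action return distribution
        q_values = (pmfs * self.atoms).sum(2)  # expected return per action
        return q_values, pmfs
```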
Usage
```bash
poetry install -E atari
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id PongNoFrameskip-v4
```
Explanation of the logged metrics
Running `python cleanrl/c51_atari.py` will automatically record various metrics such as actor or value losses in Tensorboard. Below is the documentation for these metrics:

- `charts/episodic_return`: episodic return of the game
- `charts/SPS`: number of steps per second
- `losses/loss`: the cross-entropy loss between the \(t\) step state value distribution and the projected \(t+1\) step state value distribution
- `losses/q_values`: implemented as `(old_pmfs * q_network.atoms).sum(1)`, which is the sum of the probability of getting returns \(x\) (`old_pmfs`) multiplied by \(x\) (`q_network.atoms`), averaged over the sample obtained from the replay buffer; useful for gauging whether under- or over-estimation happens (a sketch of both loss terms follows this list)
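To make `losses/loss` and `losses/q_values` concrete, here is a minimal sketch of the C51 target construction; the function and variable names below (such as `c51_loss` and `next_pmfs`) are illustrative rather than the exact ones in `c51_atari.py`. The \(t+1\) step distribution is shifted by the distributional Bellman update, projected back onto the fixed atom support, and compared to the \(t\) step distribution with a cross-entropy loss.

```python
import torch


def c51_loss(pmfs, next_pmfs, atoms, rewards, dones, gamma=0.99):
    """Sketch of the C51 cross-entropy loss with the categorical projection.

    pmfs:      (batch, n_atoms) distribution of the taken action at step t
    next_pmfs: (batch, n_atoms) distribution of the greedy action at step t+1
    atoms:     (n_atoms,) fixed support, evenly spaced in [v_min, v_max]
    """
    v_min, v_max = atoms[0], atoms[-1]
    n_atoms = atoms.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # distributional Bellman update: shift the support by the reward, shrink by gamma
    next_atoms = rewards.unsqueeze(1) + gamma * atoms.unsqueeze(0) * (1 - dones).unsqueeze(1)
    tz = next_atoms.clamp(v_min, v_max)

    # project the shifted support back onto the fixed atoms
    b = (tz - v_min) / delta_z
    l = b.floor().clamp(0, n_atoms - 1)
    u = b.ceil().clamp(0, n_atoms - 1)
    d_m_l = (u + (l == u).float() - b) * next_pmfs  # mass assigned to the lower atom
    d_m_u = (b - l) * next_pmfs                     # mass assigned to the upper atom
    target_pmfs = torch.zeros_like(next_pmfs)
    target_pmfs.scatter_add_(1, l.long(), d_m_l)
    target_pmfs.scatter_add_(1, u.long(), d_m_u)

    # cross entropy between projected target and current distribution (losses/loss)
    loss = -(target_pmfs * pmfs.clamp(min=1e-5).log()).sum(1).mean()
    # expected Q-values for logging (losses/q_values)
    q_values = (pmfs * atoms).sum(1)
    return loss, q_values
```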
Implementation details
c51_atari.py is based on (Bellemare et al., 2017)1 but presents a few implementation differences:

- (Bellemare et al., 2017)1 injects stochasticity: "on each frame the environment rejects the agent’s selected action with probability \(p = 0.25\)"; `c51_atari.py` does not do this.
- `c51_atari.py` uses a self-contained evaluation scheme: `c51_atari.py` reports the episodic returns obtained throughout training, whereas (Bellemare et al., 2017)1 is trained with `--end-e=0.01` but reports episodic returns using a separate evaluation process with `--end-e=0.001` (see "5.2. State-of-the-Art Results" on page 7).
- `c51_atari.py` rescales the gradient so that the global norm of the gradients does not exceed `0.5`, as done in PPO (ppo2/model.py#L102-L108); a minimal sketch follows this list.
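Below is a minimal sketch of that gradient-rescaling step in PyTorch, assuming `loss`, `optimizer`, and `q_network` come from the surrounding training loop:

```python
import torch

# rescale gradients so their global L2 norm does not exceed 0.5,
# then apply the parameter update
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(q_network.parameters(), max_norm=0.5)
optimizer.step()
```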
Experiment results
PR vwxyzjn/cleanrl#124 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.

Below are the average episodic returns for `c51_atari.py`.
| Environment | `c51_atari.py` 10M steps | (Bellemare et al., 2017, Figure 14)1 50M steps | (Hessel et al., 2017, Figure 5)3 |
|---|---|---|---|
| BreakoutNoFrameskip-v4 | 467.00 ± 96.11 | 748 | ~500 at 10M steps, ~600 at 50M steps |
| PongNoFrameskip-v4 | 19.32 ± 0.92 | 20.9 | ~20 at 10M steps, ~20 at 50M steps |
| BeamRiderNoFrameskip-v4 | 9986.96 ± 1953.30 | 14,074 | ~12000 at 10M steps, ~14000 at 50M steps |
Note that we save computational time by reducing timesteps from 50M to 10M, but our `c51_atari.py` scores the same or higher than (Mnih et al., 2015) in 10M steps.
Learning curves:



Tracked experiments and game play videos:
c51.py
c51.py has the following features:

- Works with the `Box` observation space of low-level features
- Works with the `Discrete` action space
- Works with envs like `CartPole-v1`
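Usage

A usage example in the same style as the Atari variant above (assuming the default dependency set covers classic control):

```bash
poetry install
python cleanrl/c51.py --env-id CartPole-v1
```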
Implementation details
c51.py includes the 11 core implementation details:
1. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML.
2. [Proposal] Formal API handling of truncation vs termination. https://github.com/openai/gym/issues/2510
3. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.