Categorical DQN (C51)
Overview
C51 introduces a distributional perspective for DQN: instead of learning a single value for an action, C51 learns a full distribution of values for that action. With this perspective, C51 applies a distributional variant of the Bellman equation to learn the approximate value distribution. Empirically, C51 demonstrates impressive performance in ALE.
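Concretely, instead of the expected return \(Q(s, a)\), C51 models the full return distribution \(Z(s, a)\), which obeys a distributional variant of the Bellman equation (where \(\stackrel{D}{=}\) denotes equality in distribution, and \(S', A'\) are the next state and action):

\[
Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(S', A')
\]

C51 approximates \(Z(s, a)\) with a categorical distribution supported on 51 fixed, evenly spaced atoms (hence the name), and minimizes the cross entropy between the predicted distribution and the projected Bellman target.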
Original papers:

- [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)
Implemented Variants
| Variants Implemented | Description |
|---|---|
| `c51_atari.py`, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques. |
| `c51.py`, docs | For classic control tasks like `CartPole-v1`. |
Below are our single-file implementations of C51:
c51_atari.py
c51_atari.py has the following features:

- For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques (a network sketch follows this list).
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
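The following is a minimal sketch of such a network, assuming the standard 84×84×4 frame-stacked input produced by the Atari wrappers; the layer sizes follow the classic Nature DQN trunk, and the exact architecture in `c51_atari.py` may differ slightly. The key difference from DQN is the head: it outputs `n_atoms` logits per action, and a softmax turns them into a per-action probability mass function over the fixed return support.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Sketch of a C51 network for Atari: conv trunk + per-action categorical head."""

    def __init__(self, n_actions: int, n_atoms: int = 51, v_min: float = -10, v_max: float = 10):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        # fixed support of n_atoms atoms, evenly spaced between v_min and v_max
        self.register_buffer("atoms", torch.linspace(v_min, v_max, n_atoms))
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512), nn.ReLU(),
            nn.Linear(512, n_actions * n_atoms),  # one categorical head per action
        )

    def forward(self, x):
        logits = self.network(x / 255.0).view(-1, self.n_actions, self.n_atoms)
        pmfs = torch.softmax(logits, dim=2)    # per-action return distribution
        q_values = (pmfs * self.atoms).sum(2)  # expected return per action
        return q_values, pmfs
```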
Usage
```bash
poetry install -E atari
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id PongNoFrameskip-v4
```
Explanation of the logged metrics
Running `python cleanrl/c51_atari.py` will automatically record various metrics such as actor or value losses in Tensorboard. Below is the documentation for these metrics:

- `charts/episodic_return`: episodic return of the game
- `charts/SPS`: number of steps per second
- `losses/loss`: the cross-entropy loss between the \(t\) step state value distribution and the projected \(t+1\) step state value distribution
- `losses/q_values`: implemented as `(old_pmfs * q_network.atoms).sum(1)`, which is the sum of the probability of getting returns \(x\) (`old_pmfs`) multiplied by \(x\) (`q_network.atoms`), averaged over the sample obtained from the replay buffer; useful for gauging whether under- or over-estimation happens (a sketch of both loss terms follows this list)
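To make `losses/loss` and `losses/q_values` concrete, here is a minimal sketch of the C51 target construction; the function and variable names below (such as `c51_loss` and `next_pmfs`) are illustrative rather than the exact ones in `c51_atari.py`. The \(t+1\) step distribution is shifted by the distributional Bellman update, projected back onto the fixed atom support, and compared to the \(t\) step distribution with a cross-entropy loss.

```python
import torch


def c51_loss(pmfs, next_pmfs, atoms, rewards, dones, gamma=0.99):
    """Sketch of the C51 cross-entropy loss with the categorical projection.

    pmfs:      (batch, n_atoms) distribution of the taken action at step t
    next_pmfs: (batch, n_atoms) distribution of the greedy action at step t+1
    atoms:     (n_atoms,) fixed support, evenly spaced in [v_min, v_max]
    """
    v_min, v_max = atoms[0], atoms[-1]
    n_atoms = atoms.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # distributional Bellman update: shift the support by the reward, shrink by gamma
    next_atoms = rewards.unsqueeze(1) + gamma * atoms.unsqueeze(0) * (1 - dones).unsqueeze(1)
    tz = next_atoms.clamp(v_min, v_max)

    # project the shifted support back onto the fixed atoms
    b = (tz - v_min) / delta_z
    l = b.floor().clamp(0, n_atoms - 1)
    u = b.ceil().clamp(0, n_atoms - 1)
    d_m_l = (u + (l == u).float() - b) * next_pmfs  # mass assigned to the lower atom
    d_m_u = (b - l) * next_pmfs                     # mass assigned to the upper atom
    target_pmfs = torch.zeros_like(next_pmfs)
    target_pmfs.scatter_add_(1, l.long(), d_m_l)
    target_pmfs.scatter_add_(1, u.long(), d_m_u)

    # cross entropy between projected target and current distribution (losses/loss)
    loss = -(target_pmfs * pmfs.clamp(min=1e-5).log()).sum(1).mean()
    # expected Q-values for logging (losses/q_values)
    q_values = (pmfs * atoms).sum(1)
    return loss, q_values
```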
Implementation details
c51_atari.py is based on (Bellemare et al., 2017)1 but presents a few implementation differences:

- (Bellemare et al., 2017)1 injects stochasticity: "on each frame the environment rejects the agent’s selected action with probability \(p = 0.25\)"; `c51_atari.py` does not do this.
- `c51_atari.py` uses a self-contained evaluation scheme: `c51_atari.py` reports the episodic returns obtained throughout training, whereas (Bellemare et al., 2017)1 is trained with `--end-e=0.01` but reports episodic returns using a separate evaluation process with `--end-e=0.001` (see "5.2. State-of-the-Art Results" on page 7).
- `c51_atari.py` rescales the gradient so that the global norm of the gradients does not exceed `0.5`, as done in PPO (ppo2/model.py#L102-L108); a minimal sketch follows this list.
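Below is a minimal sketch of that gradient-rescaling step in PyTorch, assuming `loss`, `optimizer`, and `q_network` come from the surrounding training loop:

```python
import torch

# rescale gradients so their global L2 norm does not exceed 0.5,
# then apply the parameter update
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(q_network.parameters(), max_norm=0.5)
optimizer.step()
```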
Experiment results
PR vwxyzjn/cleanrl#124 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.

Below are the average episodic returns for `c51_atari.py`.
| Environment | `c51_atari.py` 10M steps | (Bellemare et al., 2017, Figure 14)1 50M steps | (Hessel et al., 2017, Figure 5)3 |
|---|---|---|---|
| BreakoutNoFrameskip-v4 | 467.00 ± 96.11 | 748 | ~500 at 10M steps, ~600 at 50M steps |
| PongNoFrameskip-v4 | 19.32 ± 0.92 | 20.9 | ~20 at 10M steps, ~20 at 50M steps |
| BeamRiderNoFrameskip-v4 | 9986.96 ± 1953.30 | 14,074 | ~12000 at 10M steps, ~14000 at 50M steps |
Note that we save computational time by reducing timesteps from 50M to 10M, but our `c51_atari.py` scores the same or higher than (Mnih et al., 2015) in 10M steps.
Learning curves:



Tracked experiments and game play videos:
c51.py
c51.py has the following features:

- Works with the `Box` observation space of low-level features
- Works with the `Discrete` action space
- Works with envs like `CartPole-v1`
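Usage

A usage example in the same style as the Atari variant above (assuming the default dependency set covers classic control):

```bash
poetry install
python cleanrl/c51.py --env-id CartPole-v1
```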
Implementation details
c51.py includes the 11 core implementation details:
1. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML.
2. [Proposal] Formal API handling of truncation vs termination. https://github.com/openai/gym/issues/2510
3. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.