Low Seaquest avg score compared to A3C #3

Open
beniz opened this issue Sep 3, 2017 · 1 comment

@beniz

beniz commented Sep 3, 2017

Looking at a handful of A3C implementations and results on Seaquest, they appear to score around 50K.

PAAC, however, reaches a plateau around 2K according to our tests (similar to your paper). Visual inspection of the policy shows that the submarine does not resurface. While this is a common difficulty of the game, A3C appears able to overcome it (perhaps due to a modification in OpenAI Gym, since its Atari setup differs somewhat from ALE).

We've looked at various exploration strategies (epsilon-greedy, Boltzmann, Bayesian dropout), with no improvement so far.
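
For concreteness, by Boltzmann exploration we mean sampling actions from a tempered softmax over the network's action probabilities, roughly like this (a minimal NumPy sketch; the function name and temperature value are illustrative):

```python
import numpy as np

def boltzmann_action(policy_probs, temperature=1.0):
    """Sample an action from a tempered softmax over the policy output.

    temperature > 1 flattens the distribution (more exploration),
    temperature < 1 sharpens it (more exploitation).
    """
    logits = np.log(policy_probs + 1e-8) / temperature
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```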

Do you see any particular reason PAAC would underperform in this case? An LSTM might help, but judging from the two OpenAI Gym pointers above, it does not seem critical for Seaquest.

@Alfredvc
Owner

Alfredvc commented Sep 4, 2017

Hi,

Seaquest was part of the "test set", meaning that only the final algorithm, with the final set of hyperparameters, was tested on that game, so I know little about the specifics of the learning process there. However, I may be able to suggest some avenues of experimentation.

I have heard from other researchers that adding a "delay" between starting the different threads in A3C helps with learning. An analog to that for PAAC would be to perform, only at the very beginning of training, a random number of random actions in each environment before learning starts. This would leave the different environments in different stages of the game when training begins.
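
Something along these lines, assuming Gym-style environments with the classic 4-tuple step API (this is only a sketch, not code from this repository; the function name and `max_warmup` value are illustrative):

```python
import numpy as np

def desynchronize(envs, max_warmup=500):
    """Advance each environment by a random number of random actions
    before training starts, so they begin in different game stages."""
    for env in envs:
        env.reset()
        for _ in range(np.random.randint(0, max_warmup)):
            _, _, done, _ = env.step(env.action_space.sample())
            if done:
                env.reset()
```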

Since you are experimenting with different exploration techniques, you could also try increasing the policy entropy constant in the loss, or even starting with a high constant and annealing it over time. This constant regulates how "preferable" a uniform policy is relative to a higher return: no entropy loss leads to very fast convergence to a near-deterministic policy, while a high constant leads to a very uniform policy.
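
As a rough sketch of what that loss term looks like with a linearly annealed coefficient (variable names and the schedule are illustrative, not PAAC's actual implementation):

```python
import numpy as np

def entropy_coefficient(step, total_steps, start=0.1, end=0.01):
    """Linearly anneal the entropy bonus from `start` down to `end`."""
    frac = min(step / float(total_steps), 1.0)
    return start + frac * (end - start)

def policy_loss(log_probs_taken, advantages, policy_probs, beta):
    """Policy-gradient loss with an entropy bonus.

    log_probs_taken: log pi(a_t | s_t) for the actions actually taken.
    advantages:      advantage estimates (treated as constants here).
    policy_probs:    full action distributions, shape (batch, n_actions).
    beta:            entropy coefficient; 0 allows fast collapse to a
                     near-deterministic policy, large values keep it uniform.
    """
    entropy = -np.sum(policy_probs * np.log(policy_probs + 1e-8), axis=1)
    return -np.mean(log_probs_taken * advantages + beta * entropy)
```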
