Low Seaquest avg score compared to A3C #3

Open
beniz opened this issue Sep 3, 2017 · 1 comment

@beniz

beniz commented Sep 3, 2017

Looking at a handful of A3C implementations and results on Seaquest, they appear to score around 50K.

PAAC, however, reaches a plateau around 2K according to our tests (similar to your paper). Visual inspection of the policy shows that the submarine does not resurface. While this is a common difficulty of the game, A3C appears able to overcome it (perhaps due to a modification in OpenAI Gym, since its Atari setup differs somewhat from ALE).

We've looked at various exploration strategies (epsilon-greedy, Boltzmann, Bayesian dropout), with no improvement so far.
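
For concreteness, by Boltzmann exploration we mean sampling actions from a tempered softmax over the network's action probabilities, roughly like this (a minimal NumPy sketch; the function name and temperature value are illustrative):

```python
import numpy as np

def boltzmann_action(policy_probs, temperature=1.0):
    """Sample an action from a tempered softmax over the policy output.

    temperature > 1 flattens the distribution (more exploration),
    temperature < 1 sharpens it (more exploitation).
    """
    logits = np.log(policy_probs + 1e-8) / temperature
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```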

Do you see any particular reason PAAC would underperform in this case? An LSTM might help, but judging from the two OpenAI Gym pointers above, it does not seem critical for Seaquest.

@Alfredvc
Owner

Alfredvc commented Sep 4, 2017

Hi,

Seaquest was part of the "test set", meaning that only the final algorithm, with the final set of hyperparameters, was tested on that game, so I know little about the specifics of the learning process there. However, I may be able to suggest some avenues of experimentation.

I have heard from other researchers that adding a "delay" between starting the different threads in A3C helps with learning. An analog to that for PAAC would be to perform, only at the very beginning of training, a random number of random actions in each environment before learning starts. This would leave the different environments in different stages of the game when training begins.
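
Something along these lines, assuming Gym-style environments with the classic 4-tuple step API (this is only a sketch, not code from this repository; the function name and `max_warmup` value are illustrative):

```python
import numpy as np

def desynchronize(envs, max_warmup=500):
    """Advance each environment by a random number of random actions
    before training starts, so they begin in different game stages."""
    for env in envs:
        env.reset()
        for _ in range(np.random.randint(0, max_warmup)):
            _, _, done, _ = env.step(env.action_space.sample())
            if done:
                env.reset()
```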

Since you are experimenting with different exploration techniques, you could also try increasing the policy entropy constant in the loss, or even starting with a high constant and annealing it over time. This constant regulates how "preferable" a uniform policy is relative to a higher return: no entropy loss leads to very fast convergence to a near-deterministic policy, while a high constant leads to a very uniform policy.
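
As a rough sketch of what that loss term looks like with a linearly annealed coefficient (variable names and the schedule are illustrative, not PAAC's actual implementation):

```python
import numpy as np

def entropy_coefficient(step, total_steps, start=0.1, end=0.01):
    """Linearly anneal the entropy bonus from `start` down to `end`."""
    frac = min(step / float(total_steps), 1.0)
    return start + frac * (end - start)

def policy_loss(log_probs_taken, advantages, policy_probs, beta):
    """Policy-gradient loss with an entropy bonus.

    log_probs_taken: log pi(a_t | s_t) for the actions actually taken.
    advantages:      advantage estimates (treated as constants here).
    policy_probs:    full action distributions, shape (batch, n_actions).
    beta:            entropy coefficient; 0 allows fast collapse to a
                     near-deterministic policy, large values keep it uniform.
    """
    entropy = -np.sum(policy_probs * np.log(policy_probs + 1e-8), axis=1)
    return -np.mean(log_probs_taken * advantages + beta * entropy)
```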
