I think this one is really hard to train, as it may converge to a local optimum.
I concur. I am trying to write A2C to train this. Despite considerable effort, its results do not beat vanilla REINFORCE.