Experience Replay in Deep Deterministic Policy Gradient: how do we learn when we start with an empty buffer?

I have a question about the DDPG algorithm. I am trying to understand how the buffer works in the following pseudo-code.

[DDPG algorithm pseudo-code from the original post, not reproduced here]

I don't understand how we can sample a minibatch of transitions during the early episodes. At that point the buffer only holds the few most recent transitions, so sampling from it will not reduce the correlation between samples, and perhaps we will end up just memorizing them. Am I right? Could anyone explain this concept further? Thanks.
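To make the question concrete, here is a rough Python sketch of how I understand the buffer to be used. The `ReplayBuffer` class, the `env`/`agent` interfaces, and the `warmup` threshold are just my own illustration of a common pattern, not something taken from the DDPG paper:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped when full

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling over whatever has been stored so far.
        indices = random.sample(range(len(self.storage)), batch_size)
        return [self.storage[i] for i in indices]

    def __len__(self):
        return len(self.storage)


def train(env, agent, num_episodes, batch_size=64, warmup=1000):
    """Hypothetical training loop: env and agent are assumed interfaces, not a real library."""
    buffer = ReplayBuffer()
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)                       # actor output plus exploration noise
            next_state, reward, done = env.step(action)
            buffer.add((state, action, reward, next_state, done))
            state = next_state

            # Only start updating once the buffer holds enough transitions;
            # before that the agent just collects experience.
            if len(buffer) >= max(batch_size, warmup):
                batch = buffer.sample(batch_size)
                agent.update(batch)                         # critic, actor, and target-net updates
```

Even with this kind of warm-up threshold, the first updates still draw from a fairly small, correlated pool of transitions, which is exactly the situation I am asking about.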



Read more here: https://stackoverflow.com/questions/67393448/experience-replay-in-deep-deterministic-policy-gradient-how-do-we-learn-when-we

