I have a question about the DDPG algorithm. I am trying to understand how the buffer works in the following pseudo-code.
I don't understand how we can sample a minibatch of transitions during the early episodes, when the buffer holds only a few of them. In that case we would be sampling only from the most recent transitions, so the samples would still be correlated, and we might end up memorizing them. Am I right? Could anyone please explain this further? Thanks.
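
To make the question concrete, here is a minimal Python sketch of how I understand the buffer. It is only my assumption of the mechanism, not taken from the pseudo-code: the class name `ReplayBuffer`, the `capacity` and `batch_size` parameters, and the rule of capping the batch at the current buffer size are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # transition is assumed to be (state, action, reward, next_state)
        self.storage.append(transition)

    def sample(self, batch_size):
        # Assumption: if fewer than batch_size transitions are stored,
        # return all of them (uniformly, without replacement).
        k = min(batch_size, len(self.storage))
        return random.sample(list(self.storage), k)


# Early in training: only 5 transitions have been collected so far.
buffer = ReplayBuffer(capacity=1000)
for t in range(5):
    buffer.add((t, 0, 0.0, t + 1))

early_batch = buffer.sample(batch_size=64)
print(len(early_batch))
```

With only five transitions stored, a requested batch of 64 contains just those five consecutive transitions, which is the situation I am asking about.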