This is from the https://github.com/MoritzTaylor/ddpg-pytorch/blob/master/ddpg.py implementation, and I guess most DDPG implementations are written this way:
```python
self.critic_optimizer.zero_grad()
state_action_batch = self.critic(state_batch, action_batch)
value_loss = F.mse_loss(state_action_batch, expected_values.detach())
value_loss.backward()
self.critic_optimizer.step()

# Update the actor network
self.actor_optimizer.zero_grad()
policy_loss = -self.critic(state_batch, self.actor(state_batch))
policy_loss = policy_loss.mean()
policy_loss.backward()
self.actor_optimizer.step()
```
However, after `policy_loss.backward()`, I think gradients with respect to the critic's parameters are left accumulated in the critic network, since the critic is part of the computation graph for `policy_loss`. Shouldn't this affect the next critic update?

If it does, what could be the solution?
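To illustrate what I mean, here is a minimal, self-contained check with toy networks (the two small modules and tensor shapes are placeholders I made up, not the actual classes from the repo): after calling `backward()` on the policy loss, the critic's parameters also end up with non-`None` `.grad`.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the actor and critic, only to demonstrate gradient flow.
actor = nn.Linear(4, 2)                  # maps state -> action
critic = nn.Linear(6, 1)                 # maps concat(state, action) -> Q-value

state_batch = torch.randn(8, 4)

# Actor update as in the snippet above: the critic sits inside the graph.
policy_loss = -critic(torch.cat([state_batch, actor(state_batch)], dim=1)).mean()
policy_loss.backward()

# The actor received gradients, as intended...
print(all(p.grad is not None for p in actor.parameters()))   # True

# ...but the critic's parameters also received gradients from policy_loss.
print(all(p.grad is not None for p in critic.parameters()))  # True
```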