I'm trying to implement a TensorFlow version of this gist about reinforcement learning. Based on the comments, it uses binary cross-entropy from logits. I tried to use `tf.keras.losses.binary_crossentropy`, but it produces completely different gradients given the same inputs and initial weights. During training, the TensorFlow version performs terribly and does not learn at all, so something is definitely wrong with it, but I can't figure out what.

Here is the test I put together:

```python
import numpy as np
import tensorflow as tf

x_size = 2
h_size = 3
y_size = 1
rms_discount = 0.99
epsilon = 1e-7
learning_rate = 0.001

x = np.arange(x_size).astype('float32').reshape([1, -1])
y = np.zeros([1, y_size]).astype('float32')
r = np.ones([1, 1]).astype('float32')  # reward; not used in this test

# Version 1: plain NumPy weights and RMSprop caches
wh1 = np.arange(x_size * h_size).astype('float32').reshape([x_size, h_size])
wy1 = np.arange(h_size * y_size).astype('float32').reshape([h_size, y_size])
cache_wh1 = np.zeros_like(wh1)
cache_wy1 = np.zeros_like(wy1)

# Defined but not used below; RMSprop is applied by hand for both versions
optimizer = tf.keras.optimizers.RMSprop(learning_rate, rho=rms_discount, epsilon=epsilon)

# Version 2: Keras layers starting from the same weights. The arrays are copied
# because wh1/wy1 are updated in place before the layers are first built.
wh2 = tf.keras.layers.Dense(
    h_size,
    activation='relu',
    use_bias=False,
    kernel_initializer=tf.keras.initializers.Constant(wh1.copy()),
)
wy2 = tf.keras.layers.Dense(
    y_size,
    activation=None,
    use_bias=False,
    kernel_initializer=tf.keras.initializers.Constant(wy1.copy()),
)
cache_wh2 = np.zeros_like(wh1)
cache_wy2 = np.zeros_like(wy1)

for i in range(100):
    # NumPy forward pass
    h1 = np.matmul(x, wh1)
    h1[h1 < 0] = 0.  # ReLU
    y_pred1 = np.matmul(h1, wy1)

    # NumPy backward pass
    dCdy = -(y - y_pred1)  # assumed gradient of the loss w.r.t. the output
    dCdwy = np.matmul(h1.T, dCdy)
    dCdh = np.matmul(dCdy, wy1.T)
    dCdh[h1 <= 0] = 0  # backprop through ReLU; h1 is never negative after clipping
    dCdwh = np.matmul(x.T, dCdh)
    gradients1 = [dCdwh, dCdwy]

    # NumPy RMSprop update
    cache_wh1 = rms_discount * cache_wh1 + (1 - rms_discount) * dCdwh**2
    wh1 -= learning_rate * dCdwh / (np.sqrt(cache_wh1) + epsilon)
    cache_wy1 = rms_discount * cache_wy1 + (1 - rms_discount) * dCdwy**2
    wy1 -= learning_rate * dCdwy / (np.sqrt(cache_wy1) + epsilon)

    # TensorFlow forward pass and gradients
    with tf.GradientTape() as tape:
        h2 = wh2(x)
        y_pred2 = wy2(h2)
        loss = tf.keras.losses.binary_crossentropy(y, y_pred2, from_logits=True)
    variables = wh2.trainable_variables + wy2.trainable_variables
    gradients2 = [g.numpy() for g in tape.gradient(loss, variables)]

    # The same hand-written RMSprop update, applied to the Keras weights.
    # get_weights() returns a list, so index the kernel explicitly.
    cache_wh2 = rms_discount * cache_wh2 + (1 - rms_discount) * gradients2[0]**2
    wh2.set_weights([wh2.get_weights()[0] - learning_rate * gradients2[0] / (np.sqrt(cache_wh2) + epsilon)])
    cache_wy2 = rms_discount * cache_wy2 + (1 - rms_discount) * gradients2[1]**2
    wy2.set_weights([wy2.get_weights()[0] - learning_rate * gradients2[1] / (np.sqrt(cache_wy2) + epsilon)])

# Gradients from the last iteration: '1' is NumPy, '2' is TensorFlow
print('1', gradients1[0])
print('1', gradients1[1])
print('2', gradients2[0])
print('2', gradients2[1])
```

As far as I can tell, the partial derivatives of the cost/loss with respect to the prediction `y_pred` are the same in both versions, so the rest should be standard backpropagation, just with RMSprop. Yet the two versions behave differently. Why?
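
One way to check that assumption without TensorFlow is to differentiate binary cross-entropy from logits numerically and compare the result with both candidate expressions. The sketch below is NumPy-only; `sigmoid`, `bce_from_logits` and `numeric_grad` are hypothetical helper names introduced for illustration, not part of the gist or of TensorFlow. It assumes a single logit `z = 14`, which is what the first forward pass of the test gives (`[3, 4, 5] @ [0, 1, 2]`), and the label `y = 0`.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def bce_from_logits(y, z):
    """Binary cross-entropy for label y and logit z, in a numerically stable form.

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
    """
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))


def numeric_grad(f, z, h=1e-5):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


y = 0.0
z = 14.0  # assumed: logit of the first forward pass in the test

grad_numeric = numeric_grad(lambda t: bce_from_logits(y, t), z)
grad_sigmoid = sigmoid(z) - y  # analytic derivative of BCE w.r.t. the logit
grad_linear = z - y            # expression used for dCdy in the NumPy version

print('numeric      ', grad_numeric)
print('sigmoid(z)-y ', grad_sigmoid)
print('z - y        ', grad_linear)
```

If the finite-difference value agrees with `sigmoid(z) - y` rather than with `z - y`, then the two backward passes start from different values of dC/dy, which may account for at least part of the discrepancy. Whether it explains all of it has not been verified against TensorFlow here.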

Read more here: https://stackoverflow.com/questions/64942434/binary-cross-entropy-backpropagation-with-tensorflow

### Content Attribution

This content was originally published by Gergő Horváth at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.