Why are the models in the tutorials not converging on GPU (but working on CPU)?

I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into an issue running TF2 on a GPU: the network doesn't converge, even though the same code converges when run on a CPU. Following the CNN tutorials at https://www.tensorflow.org/tutorials, I have noticed that the models fail to learn during training. Any ideas on what is causing this?

Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable -- especially across multiple tutorials. I had this problem occur on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine-tuning, and Transfer learning with TF Hub.
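
In case it helps narrow things down, here is a minimal numerical sanity check (not from the tutorials; it assumes a single GPU visible as '/GPU:0', and the matrix size is arbitrary) that could show whether basic GPU arithmetic already disagrees with the CPU:

import numpy as np
import tensorflow as tf

# Run the same matmul on the CPU and the GPU and compare the results.
# A large difference would point at a numerical problem rather than a training bug.
x = tf.random.normal((256, 256), seed=0)
with tf.device('/CPU:0'):
    cpu_out = tf.matmul(x, x)
with tf.device('/GPU:0'):
    gpu_out = tf.matmul(x, x)
print('max abs difference:', np.max(np.abs(cpu_out.numpy() - gpu_out.numpy())))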

I am running:

  • TensorFlow version 2.3.0
  • CUDA compilation tools, release 11.2, V11.2.125
  • An NVIDIA GeForce RTX 3090 GPU or an Intel i7-10700K CPU
  • I had some trouble installing things initially, but the method described in this answer ended up working -- could that be the root issue? (A version-check snippet is included below.)
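
For reference, this is the kind of check I run to compare the CUDA/cuDNN versions my TensorFlow wheel was built against with what is installed on the machine (I believe tf.sysconfig.get_build_info() is available in TF 2.3, but treat the exact dictionary keys as an assumption):

import tensorflow as tf

# TF version plus the CUDA/cuDNN versions this TF build was compiled against.
# The 'cuda_version'/'cudnn_version' keys are what I expect, not guaranteed.
print('TF version:', tf.__version__)
build = tf.sysconfig.get_build_info()
print('Built against CUDA:', build.get('cuda_version'))
print('Built against cuDNN:', build.get('cudnn_version'))
print('Visible GPUs:', tf.config.list_physical_devices('GPU'))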

To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # minimize logs

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

RUN_ON_CPU = False

if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPU so training falls back to the CPU

print('gpu available', tf.config.list_physical_devices('GPU'))

# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()

# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)

The script plots the following training curves, depending on the RUN_ON_CPU flag:

[Plot: training and validation accuracy vs. epoch for the GPU and CPU runs]

GPU test loss and accuracy 2.302645444869995 0.10000000149011612

CPU test loss and accuracy 0.879743754863739 0.7060999870300293

The tutorial claims the CNN should achieve a test accuracy of about 70%, which the GPU run doesn't come close to. To confirm the GPU was actually being used, I logged tf.config.list_physical_devices('GPU'); the GPU run took 2-3 s per epoch whereas the CPU run took 11-14 s. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to turn off the GPU was the only code change between the runs (an equivalent in-code way of doing this is sketched below).
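
For completeness, an equivalent way to hide the GPU from inside the script (rather than through the environment variable) would be something along these lines; this is just a sketch, and it has to run before anything touches the GPU:

import tensorflow as tf

# Hide all GPUs from TensorFlow; must be called before the GPUs are initialized,
# otherwise TensorFlow may raise a RuntimeError.
tf.config.set_visible_devices([], 'GPU')
print('Logical GPUs:', tf.config.list_logical_devices('GPU'))  # expected to be empty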



Read more here: https://stackoverflow.com/questions/68488742/why-are-the-models-in-the-tutorials-not-converging-on-gpu-but-working-on-cpu

Content Attribution

This content was originally published by Brett S at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
