I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into an issue running TF2 on a GPU: the network isn't converging, even though the same code converges when run on a CPU. Working through the CNN tutorials on https://www.tensorflow.org/tutorials, I've noticed that the models fail to learn during training. Any ideas on what is causing this?
Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable, especially across multiple tutorials. I hit this problem on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine-tuning, and Transfer learning with TF Hub.
I am running:
- TensorFlow 2.3.0
- CUDA compilation tools, release 11.2, V11.2.125
- An NVIDIA GeForce RTX 3090 GPU or an Intel i7-10700K CPU
- I had some trouble installing things initially, but the method described in this answer ended up working (see the version check below). Could that be the root issue?
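In case it's relevant, this is roughly the check I run to confirm what the runtime actually sees. It's only a minimal sketch: `tf.sysconfig.get_build_info()` may not report the same keys on every TF release, so the dictionary lookups below are an assumption.

```python
import tensorflow as tf

# TensorFlow version and whether this build was compiled with CUDA support
print('TF version:', tf.__version__)
print('built with CUDA:', tf.test.is_built_with_cuda())

# GPUs visible to the runtime
print('GPUs:', tf.config.list_physical_devices('GPU'))

# CUDA/cuDNN versions the wheel was built against (key names may differ between TF releases)
build = tf.sysconfig.get_build_info()
print('build CUDA:', build.get('cuda_version', 'n/a'))
print('build cuDNN:', build.get('cudnn_version', 'n/a'))
```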
To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # minimize logs

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

RUN_ON_CPU = False
if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPU so the same script runs on CPU only

print('gpu available', tf.config.list_physical_devices('GPU'))

# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()

# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

# plot training curves
plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)
```
GPU test loss and accuracy 2.302645444869995 0.10000000149011612
CPU test loss and accuracy 0.879743754863739 0.7060999870300293
The tutorial says the CNN should reach a test accuracy of roughly 70%, which the GPU run doesn't come close to. To make sure the GPU was actually being used, I logged tf.config.list_physical_devices('GPU'); the GPU run took 2-3 s per epoch whereas the CPU run took 11-14 s. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to hide the GPU was the only code change between the two runs.
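If it helps, I can also turn on device placement logging to confirm the ops really land on the GPU. This is just a minimal sketch using the same RUN_ON_CPU toggle as the script above; tf.debugging.set_log_device_placement is the only extra call.

```python
import os
import tensorflow as tf

RUN_ON_CPU = False
if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPU so everything falls back to CPU

# log which device each op is placed on (prints lines ending in e.g. device:GPU:0)
tf.debugging.set_log_device_placement(True)

# a tiny matmul is enough to trigger the placement messages
a = tf.random.uniform((4, 4))
b = tf.random.uniform((4, 4))
print(tf.matmul(a, b))
```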