
Slurm - Job Script Example 08a: TensorFlow multi-GPU using virtualenv

Description

This example is from the TensorFlow tutorial Distributed training with Keras. It shows a simple use of the tf.distribute.Strategy API to distribute training across multiple GPU devices on a single node.

Installing TensorFlow in a virtualenv (venv) with Pip

This is an alternative to using Conda. Follow the instructions at TensorFlow#Example: Using Python 3.10 to install the latest TensorFlow.
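
In outline, the procedure looks roughly like the following. This is a minimal sketch, not the authoritative instructions: the module names and venv path are taken from the job script below, and the exact pip packages and versions may differ from those on the linked page.

module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10

# Create and activate the venv at the path used by the job script below
python3.10 -m venv /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu
source /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/activate

# TensorFlow plus TensorFlow Datasets (the example script imports tensorflow_datasets)
pip install --upgrade pip
pip install tensorflow tensorflow-datasets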

Python TensorFlow script

Create the Python script and save it as dist_mnist.py:

#!/usr/bin/env python3.10
#
# NAME THIS FILE dist_mnist.py
#
# Example from: https://www.tensorflow.org/tutorials/distribute/keras
#
# The tf.distribute.Strategy API provides an abstraction for distributing your
# training across multiple processing units. It allows you to carry out
# distributed training using existing models and training code with minimal
# changes.
#
# This tutorial demonstrates how to use the tf.distribute.MirroredStrategy
# to perform in-graph replication with synchronous training on many GPUs on
# one machine. The strategy essentially copies all of the model's variables
# to each processor. Then, it uses all-reduce to combine the gradients from
# all processors, and applies the combined value to all copies of the model.
#
# You will use the tf.keras APIs to build the model and Model.fit for
# training it. (To learn about distributed training with a custom training
# loop and the MirroredStrategy, check out this other tutorial.)
#
# MirroredStrategy trains your model on multiple GPUs on a single machine.
# For synchronous training on many GPUs on multiple workers, use the
# tf.distribute.MultiWorkerMirroredStrategy with the Keras Model.fit or a
# custom training loop. For other options, refer to the Distributed training
# guide.
#
# To learn about various other strategies, see the Distributed training
# with TensorFlow guide.

import tensorflow_datasets as tfds
import tensorflow as tf
import os

print(f'TensorFlow version {tf.__version__}')
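
# (Addition, not in the upstream tutorial.) Print the visible GPUs so the
# device list appears in the job log, as in the expected output below.
print(tf.config.list_physical_devices('GPU'))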

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']

strategy = tf.distribute.MirroredStrategy()

print(f'Number of devices: {strategy.num_replicas_in_sync}')

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

# Define the checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'

# Define the name of the checkpoint files.
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Define a function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
    if epoch < 3:
        return 1e-3
    elif epoch >= 3 and epoch < 7:
        return 1e-4
    else:
        return 1e-5

# Define a callback for printing the learning rate at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print(f'\nLearning rate for epoch {epoch + 1} is {model.optimizer.lr.numpy()}')

# Put all the callbacks together.
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

EPOCHS = 12
model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss, eval_acc = model.evaluate(eval_dataset)

print(f'Eval loss: {eval_loss}, Eval accuracy: {eval_acc}')

print()

# Export the graph and the variables to the platform-agnostic SavedModel format
# using Keras Model.save. After your model is saved, you can load it with or
# without the Strategy.scope.

PATH = 'saved_model/'
model.save(PATH, save_format='tf')

# Now, load the model without Strategy.scope:
unreplicated_model = tf.keras.models.load_model(PATH)

unreplicated_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'])

eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)

print(f'Model without Strategy.scope: Eval loss: {eval_loss}, Eval Accuracy: {eval_acc}')

# Load the model with Strategy.scope:
with strategy.scope():
    replicated_model = tf.keras.models.load_model(PATH)
    replicated_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                             optimizer=tf.keras.optimizers.Adam(),
                             metrics=['accuracy'])

    eval_loss, eval_acc = replicated_model.evaluate(eval_dataset)
    print(f'Model with Strategy.scope: Eval loss: {eval_loss}, Eval Accuracy: {eval_acc}')

Job Script

Create the following job script and save it as run_mnist.sh. To use this job script, you must already have created a Python 3.10 venv named py310-tf-gpu at /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu, as described in the installation section above.

#!/bin/bash -l
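# Request one node in the gpu partition: 4 GPUs, 12 CPU cores per GPU,
# 42G of memory per GPU, and an 8-hour time limit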
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-gpu=12
#SBATCH --mem-per-gpu=42G
#SBATCH --time=8:00:00

module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.2/toolkit cuda11.2/blas cuda11.2/fft tensorrt-cuda11.2 cutensor-cuda11.2

source /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/activate

# check python version
which python3
python3 --version

# check number of GPUs and number of CPU cores
echo "SLURM_CPUS_ON_NODE = $SLURM_CPUS_ON_NODE"
echo "SLURM_GPUS_ON_NODE = $SLURM_GPUS_ON_NODE"
echo "SLURM_STEP_GPUS = $SLURM_STEP_GPUS"

nvidia-smi

python3 dist_mnist.py
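
Submit the job from the directory that contains both dist_mnist.py and run_mnist.sh, since the Python script is referenced by a relative path:

sbatch run_mnist.sh
squeue -u $USER    # check that the job has started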

Expected Output

Much of the running status output has been edited out. Note the listing of the 4 GPU devices:

/ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/python3
Python 3.10.7
2022-11-12 18:33:32.848145: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-12 18:33:37.582422: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-12 18:33:39.592199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0
2022-11-12 18:33:39.593542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30988 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
2022-11-12 18:33:39.594712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 30988 MB memory:  -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
2022-11-12 18:33:39.595875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 30988 MB memory:  -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
2022-11-12 18:33:42.200688: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:547] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
TensorFlow version 2.9.2
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
Number of devices: 4
Epoch 1/12
2022-11-12 18:33:47.840642: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8600
2022-11-12 18:33:48.640983: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8600
2022-11-12 18:33:49.098066: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8600
2022-11-12 18:33:49.370112: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8600
  1/235 [..............................] - ETA: 37:33 - loss: 2.3298 - accuracy: 0.0977
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0060s vs `on_train_batch_end` time: 0.0098s). Check your callbacks.
...
Learning rate for epoch 12 is 9.999999747378752e-06
...
Eval loss: 0.05299859121441841, Eval accuracy: 0.9829999804496765
...
Model without Strategy.scope: Eval loss: 0.05299859121441841, Eval Accuracy: 0.9829999804496765
...
Model with Strategy.scope: Eval loss: 0.05299859121441841, Eval Accuracy: 0.9829999804496765

The saved model will be in the directory saved_model.
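
To sanity-check the export after the job finishes, the SavedModel can be inspected with the saved_model_cli tool that is installed along with the TensorFlow pip package, or reloaded in Python. For example, from the job's working directory with the venv still activated:

# List the signatures stored in the SavedModel
saved_model_cli show --dir saved_model --all

# Or reload it in Python and print the layer summary
python3 -c "import tensorflow as tf; tf.keras.models.load_model('saved_model/').summary()"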