Slurm - Job Script Example 08 TensorFlow using virtualenv

Description

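This example is taken from the TensorFlow tutorial Distributed training with Keras, which demonstrates the tf.distribute.Strategy API for distributing training across multiple GPU devices. Note that the script as listed here requests a single GPU and does not itself call tf.distribute.Strategy; it trains a small Keras model on the MNIST dataset.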

N.B. This example only trains and validates the model; it does not perform inference.

Installing TensorFlow in a virtualenv (venv) with Pip

This is an alternative to using Conda. To install the latest TensorFlow, follow the instructions at TensorFlow#Example: Using Python 3.10.

Python TensorFlow script

Create the Python script and save it as mnist.py:

#!/usr/bin/env python3
import tensorflow as tf
import tensorflow_datasets as tfds

print(f'TensorFlow version {tf.__version__}')

# Show GPU device
print(tf.config.list_physical_devices('GPU'))

#
# Step 1: Create your input pipeline
#

# Load the MNIST dataset with the following arguments:
#
# * shuffle_files=True: The MNIST data is only stored in a single file, but for
#                       larger datasets with multiple files on disk, it's good
#                       practice to shuffle them when training.
#
# * as_supervised=True: Returns a tuple (img, label) instead of a dictionary
#                       {'image': img, 'label': label}.
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

# Build a training pipeline
def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img,
                        num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

# Build an evaluation pipeline
ds_test = ds_test.map(normalize_img,
                      num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

#
# Step 2: Create and train the model
#
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
)

# N.B. This example does not do inference
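
The script above builds its model directly, so it trains on one device. As a minimal sketch of how the model-building step might be wrapped in tf.distribute.MirroredStrategy to use several GPUs on one node, consider the following. The function names global_batch_size and build_distributed_model are hypothetical and not part of the original script, and the job would also need to request more than one GPU for this to have any effect. TensorFlow is imported inside the function so that the batch-size helper can be used on its own.

```python
def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Batch size for the input pipeline when each replica
    should receive per_replica_batch examples per step."""
    return per_replica_batch * num_replicas


def build_distributed_model():
    """Build and compile the MNIST model inside a MirroredStrategy scope.

    Hypothetical helper: the original mnist.py does not define it.
    """
    # Imported here so global_batch_size() works without TensorFlow.
    import tensorflow as tf

    # MirroredStrategy replicates the model on every GPU visible to the job.
    strategy = tf.distribute.MirroredStrategy()
    print(f'Number of replicas: {strategy.num_replicas_in_sync}')

    # Model and optimizer variables must be created inside the scope.
    with strategy.scope():
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(
                from_logits=True),
            metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
        )
    return strategy, model
```

With this approach, the value passed to ds_train.batch() would typically become global_batch_size(128, strategy.num_replicas_in_sync), and model.fit() is called as before.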

Job Script

Create this script and save it as run_mnist.sh:

#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=1:00:00

module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.2/toolkit cuda11.2/blas cuda11.2/fft tensorrt-cuda11.2 cutensor-cuda11.2

source /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/activate

# check python version
which python3
python3 --version

python3 mnist.py
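
When debugging a job, it can help to record what Slurm actually allocated. The following is a small sketch, not part of the original example, that could be added to mnist.py or run on its own inside the job. The function name slurm_summary is hypothetical. The environment variables are ones Slurm normally sets for batch jobs, but SLURM_GPUS and SLURM_CPUS_PER_GPU are only set when the matching sbatch options are given, and CUDA_VISIBLE_DEVICES depends on how GPU allocation is configured on the cluster, so any of them may be missing.

```python
import os


def slurm_summary(env=None):
    """Return selected Slurm/CUDA variables, marking any that are unset."""
    env = os.environ if env is None else env
    keys = (
        'SLURM_JOB_ID',
        'SLURM_JOB_PARTITION',
        'SLURM_GPUS',            # set only when --gpus is given
        'SLURM_CPUS_PER_GPU',    # set only when --cpus-per-gpu is given
        'CUDA_VISIBLE_DEVICES',  # depends on the cluster's GPU configuration
    )
    return {key: env.get(key, '(not set)') for key in keys}


if __name__ == '__main__':
    for key, value in slurm_summary().items():
        print(f'{key}={value}')
```

Outside a Slurm job, every variable is likely to print as "(not set)", which is a quick way to notice that the script was started on a login node by mistake.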

Expected Output

/ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/python3
Python 3.10.7
2022-11-12 19:15:21.317129: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-12 19:15:24.018506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-12 19:15:24.518121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
TensorFlow version 2.9.2
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Epoch 1/6
469/469 [==============================] - 3s 2ms/step - loss: 0.3594 - sparse_categorical_accuracy: 0.9018 - val_loss: 0.1933 - val_sparse_categorical_accuracy: 0.9443
Epoch 2/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1637 - sparse_categorical_accuracy: 0.9536 - val_loss: 0.1351 - val_sparse_categorical_accuracy: 0.9604
Epoch 3/6
469/469 [==============================] - 1s 1ms/step - loss: 0.1188 - sparse_categorical_accuracy: 0.9658 - val_loss: 0.1113 - val_sparse_categorical_accuracy: 0.9675
Epoch 4/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0924 - sparse_categorical_accuracy: 0.9735 - val_loss: 0.0969 - val_sparse_categorical_accuracy: 0.9710
Epoch 5/6
469/469 [==============================] - 1s 1ms/step - loss: 0.0748 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0889 - val_sparse_categorical_accuracy: 0.9739
Epoch 6/6
469/469 [==============================] - 1s 1ms/step - loss: 0.0618 - sparse_categorical_accuracy: 0.9814 - val_loss: 0.0785 - val_sparse_categorical_accuracy: 0.9756
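
The step count of 469 per epoch in the output above follows from the batch size in mnist.py: the MNIST training split has 60,000 examples, and the last partial batch counts as a step. The short sketch below checks that arithmetic and reads the step count back out of a progress line. The helper names are hypothetical, and the regular expression assumes the progress-bar format shown in this output.

```python
import math
import re

TRAIN_EXAMPLES = 60000  # size of the MNIST training split
BATCH_SIZE = 128        # value passed to ds_train.batch() in mnist.py


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch, counting the final partial batch."""
    return math.ceil(num_examples / batch_size)


def steps_from_log(line):
    """Extract the total step count from a Keras progress line, or None."""
    match = re.match(r'(\d+)/(\d+) \[', line)
    return int(match.group(2)) if match else None


log_line = ('469/469 [==============================] - 3s 2ms/step '
            '- loss: 0.3594')

print(steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE))  # 469
print(steps_from_log(log_line))                     # 469
```

If the batch size in mnist.py is changed, the step count in the job output should change accordingly, which gives a quick sanity check that the job ran the intended script.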