Slurm - Job Script Example 08 TensorFlow using virtualenv
Description
This example is adapted from the TensorFlow tutorial Distributed training with Keras, which demonstrates the tf.distribute.Strategy API for distributing training across multiple GPU devices. As written, the script and job script below request and use a single GPU; see the sketch below for the tutorial's multi-GPU variant.
N.B. this example does only training, not inference.
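For reference, the tutorial's multi-GPU variant wraps model construction and compilation in a tf.distribute.MirroredStrategy scope. The following is a minimal sketch only; it is not part of the job below, and using it would also require requesting more than one GPU from Slurm (e.g. --gpus=2):

import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible to the job
strategy = tf.distribute.MirroredStrategy()
print(f'Number of devices: {strategy.num_replicas_in_sync}')

with strategy.scope():
    # The model must be created and compiled inside the strategy scope
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(0.001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )

# model.fit(...) is then called as usual; Keras distributes the batches across replicas.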
Installing TensorFlow in a virtualenv (venv) with Pip
This is an alternative to using Conda. Follow the instructions at TensorFlow#Example: Using Python 3.10 to install the latest TensorFlow into a virtualenv.
Python TensorFlow script
Create the Python script and save it as mnist.py:
#!/usr/bin/env python3
import tensorflow as tf
import tensorflow_datasets as tfds
print(f'TensorFlow version {tf.__version__}')
# Show GPU device
print(tf.config.list_physical_devices('GPU'))
#
# Step 1: Create your input pipeline
#
# Load the MNIST dataset with the following arguments:
#
# * shuffle_files=True: The MNIST data is only stored in a single file, but for
#                       larger datasets with multiple files on disk, it's good
#                       practice to shuffle them when training.
#
# * as_supervised=True: Returns a tuple (img, label) instead of a dictionary
#                       {'image': img, 'label': label}.
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
# Build a training pipeline
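# (Pipeline order matters: cache after the cheap normalization so the small
# MNIST set stays in memory, shuffle with a buffer covering the full training
# split, batch after shuffling, and prefetch last so input preparation
# overlaps with training.)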
def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img,
                        num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
# Build an evaluation pipeline
ds_test = ds_test.map(normalize_img,
                      num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)
#
# Step 2: Create and train the model
#
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
)
# N.B. This example does not do inference
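If the job should also report final test metrics or run inference, a minimal addition could be appended to mnist.py. This is a sketch only; it is not part of the original example and is not reflected in the expected output below:

# Hypothetical addition: report final test metrics and predict one batch
test_loss, test_acc = model.evaluate(ds_test)
print(f'Test loss: {test_loss:.4f}  Test accuracy: {test_acc:.4f}')

for images, labels in ds_test.take(1):
    logits = model(images, training=False)
    predictions = tf.argmax(logits, axis=-1)
    print('Predicted:', predictions[:10].numpy())
    print('Labels:   ', labels[:10].numpy())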
Job Script
Create this script and save it as run_mnist.sh:
#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=1:00:00
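# Load the CUDA-aware Python module and the CUDA libraries needed by
# TensorFlow, then activate the virtualenv created during installation
# (adjust the path below to your own group directory and venv name).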
module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.2/toolkit cuda11.2/blas cuda11.2/fft tensorrt-cuda11.2 cutensor-cuda11.2
source /ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/activate
# Check which Python is active and its version
which python3
python3 --version
python3 mnist.py
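Submit the job with sbatch run_mnist.sh. By default, the output shown below is written to a file named slurm-<jobid>.out in the submission directory.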
Expected Output
/ifs/groups/myrsrchGrp/py310-venvs/py310-tf-gpu/bin/python3
Python 3.10.7
2022-11-12 19:15:21.317129: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-12 19:15:24.018506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-12 19:15:24.518121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
TensorFlow version 2.9.2
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Epoch 1/6
469/469 [==============================] - 3s 2ms/step - loss: 0.3594 - sparse_categorical_accuracy: 0.9018 - val_loss: 0.1933 - val_sparse_categorical_accuracy: 0.9443
Epoch 2/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1637 - sparse_categorical_accuracy: 0.9536 - val_loss: 0.1351 - val_sparse_categorical_accuracy: 0.9604
Epoch 3/6
469/469 [==============================] - 1s 1ms/step - loss: 0.1188 - sparse_categorical_accuracy: 0.9658 - val_loss: 0.1113 - val_sparse_categorical_accuracy: 0.9675
Epoch 4/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0924 - sparse_categorical_accuracy: 0.9735 - val_loss: 0.0969 - val_sparse_categorical_accuracy: 0.9710
Epoch 5/6
469/469 [==============================] - 1s 1ms/step - loss: 0.0748 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0889 - val_sparse_categorical_accuracy: 0.9739
Epoch 6/6
469/469 [==============================] - 1s 1ms/step - loss: 0.0618 - sparse_categorical_accuracy: 0.9814 - val_loss: 0.0785 - val_sparse_categorical_accuracy: 0.9756