Slurm - Job Script Example 05 TensorFlow Singularity

This example is derived from the TensorFlow Core tutorial on basic image classification (TensorFlow Core Tutorials - Image classification, retrieved 2021-03-21). It runs TensorFlow in a Singularity container to train a classifier and saves the model to disk. It does not perform inference. The job should run no longer than 15 minutes.

Code

Flowers Dataset

The flower images dataset has already been downloaded to Picotte, and is accessible by all users. See: Sample TensorFlow Datasets#Flowers

Obtain an Image for Latest TensorFlow-GPU

Some TensorFlow images are available on Picotte in /beegfs/SingularityImages/

Alternatively, pull the appropriate Docker image and build a Singularity image from it:

- Create a directory named tfexample in your home (or group) directory.
- cd to that directory.
- Pull the latest tensorflow-gpu Docker image. This cannot be done on the login node; it must be done on a compute node or GPU node. NOTE: this may take more than a few minutes.

[juser@picotte001 tfexample]$ srun -p gpu --gres=gpu:1 --mem=16G --time=2:00:00 --pty /bin/bash
[juser@gpu001 tfexample]$ singularity pull docker://tensorflow/tensorflow:latest-gpu
[juser@gpu001 tfexample]$ exit

Once the image (.sif file) is obtained, create the Python script and job script below.
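
The job script further down expects the image under the name tensorflow_latest-gpu.sif. As a rough sketch of where that name comes from, the helper below derives a file name from the docker:// URI. It assumes Singularity's default behavior of saving a pulled image as <name>_<tag>.sif when no output file is given; default_sif_name is a hypothetical helper written for illustration, not part of Singularity.

```python
def default_sif_name(uri: str) -> str:
    """Guess the file name `singularity pull` writes for a docker:// URI.

    Assumption: with no explicit output name, the image is saved as
    <last path component>_<tag>.sif, and the tag defaults to "latest".
    """
    ref = uri.split("://", 1)[-1]        # drop the "docker://" scheme
    name = ref.rsplit("/", 1)[-1]        # keep only the last path component
    name, _, tag = name.partition(":")   # split "tensorflow:latest-gpu"
    return f"{name}_{tag or 'latest'}.sif"


print(default_sif_name("docker://tensorflow/tensorflow:latest-gpu"))
```

If the file in your tfexample directory has a different name, adjust TF_IMG in the job script to match.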

Python Script

Name this file classify_flowers.py:

#!/usr/bin/env python3
import numpy as np
import os
import sys
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib

###
### classify_flowers.py
###

dataset_url = "file:///tmp/flower_photos/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds.class_names

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)

normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))

num_classes = 5

data_augmentation = keras.Sequential(
  [
    layers.experimental.preprocessing.RandomFlip("horizontal",
                                                 input_shape=(img_height,
                                                              img_width,
                                                              3)),
    layers.experimental.preprocessing.RandomRotation(0.1),
    layers.experimental.preprocessing.RandomZoom(0.1),
  ]
)

model = Sequential([
  data_augmentation,
  layers.experimental.preprocessing.Rescaling(1./255),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.2),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.summary()

epochs = 15
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### save the model
print("Saving model ...")
model.save('my_model')

sys.exit(0)

### Nothing is executed after sys.exit(0)

### XXX infer
### XXX can't infer because PIL/Pillow is not installed
sunflower_url = "file:///tmp/somewhere/592px-Red_sunflower.jpg"
sunflower_path = tf.keras.utils.get_file('Red_sunflower', origin=sunflower_url)

img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0) # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
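
The layer sizes in the script determine the parameter counts that model.summary() reports in the Output section. The following is a small sketch of that arithmetic in plain Python, with no TensorFlow needed. The helper names conv2d_params and dense_params are made up for this illustration.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs: int, units: int) -> int:
    """One weight per input/unit pair plus one bias per unit."""
    return inputs * units + units


# Three 'same'-padded convolutions keep the spatial size; each 2x2 max
# pooling halves it (rounding down): 180 -> 90 -> 45 -> 22.
side = 180
for _ in range(3):
    side //= 2
flattened = side * side * 64  # 64 filters in the last convolution

counts = {
    "conv2d": conv2d_params(3, 3, 16),
    "conv2d_1": conv2d_params(3, 16, 32),
    "conv2d_2": conv2d_params(3, 32, 64),
    "dense": dense_params(flattened, 128),
    "dense_1": dense_params(128, 5),
}
total = sum(counts.values())
print(counts)
print("total:", total)
```

The augmentation, rescaling, pooling, dropout and flatten layers have no trainable weights, which is why only the five layers above contribute to the total.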

Job Script

Name this file tf_flowers.sh:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=tf_flowers
#SBATCH --output=tf_flowers-%A.out
#SBATCH --error=tf_flowers-%A.err
#SBATCH --nodes=1
#SBATCH --cpus-per-gpu=12
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=24:00:00
#SBATCH --mem-per-gpu=40G

###
### tf_flowers.sh
###

### Adapted from:
### https://www.tensorflow.org/tutorials/images/classification (2021-03-13)

### Singularity image file downloaded previously
TF_IMG=tensorflow_latest-gpu.sif

### Working directory will be BeeGFS scratch directory (created automatically for every job, and deleted at end of job)
### Copy the Singularity image file, and the Python classifier script to working directory
WORKDIR=$BEEGFS_TMPDIR/tensorflow
mkdir -p $WORKDIR
cp -v $TF_IMG classify_flowers.py $WORKDIR

### confirm that files are where they should be
ls -l $WORKDIR

### Picotte has some sample TensorFlow datasets already downloaded, to avoid repeated downloads from the Internet.
###    Location: /beegfs/Sample_TF_Datasets
### Images will be put in a subdirectory named "flower_photos" of local scratch (given by the $TMP environment variable)
FLOWER_PHOTOS_DIR=$TMP/flower_photos
mkdir -p $FLOWER_PHOTOS_DIR
cp -v /beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz $FLOWER_PHOTOS_DIR

### Run the classifier
### - binds $BEEGFS_TMPDIR to the /home directory in the Singularity container
### - binds $TMP (local scratch) to the /tmp directory in the Singularity container
### - above, we had copied the classify_flowers.py script to what is now /home/tensorflow in the Singularity container
singularity exec --home $BEEGFS_TMPDIR:/home --bind $TMP:/tmp --nv $TF_IMG python /home/tensorflow/classify_flowers.py

### The TensorFlow Singularity image does not come with PIL installed, so we cannot do inference.
### Instead, we save the model, and then copy it back to the directory where we did 'sbatch'
/bin/cp -fr $BEEGFS_TMPDIR/my_model $SLURM_SUBMIT_DIR
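
The two bind options mean that paths inside the container differ from the host paths behind them. The sketch below illustrates the mapping in Python. container_to_host is a hypothetical helper, and the host directories are the placeholder values from the sample output, standing in for $BEEGFS_TMPDIR and $TMP. It also assumes the script's working directory inside the container is /home, which is what the final cp in the job script implies.

```python
import posixpath


def container_to_host(path: str, binds: dict) -> str:
    """Translate a path seen inside the container to the host path behind it.

    `binds` maps container mount points to host directories, mirroring the
    --home and --bind options. The longest matching mount point wins.
    """
    for mount in sorted(binds, key=len, reverse=True):
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            rel = posixpath.relpath(path, mount)
            return posixpath.normpath(posixpath.join(binds[mount], rel))
    return path  # not under any bind: left unchanged in this sketch


# Placeholder host directories standing in for $BEEGFS_TMPDIR and $TMP.
binds = {"/home": "/beegfs/scratch/NNNNN", "/tmp": "/local/scratch/NNNNN"}

print(container_to_host("/home/tensorflow/classify_flowers.py", binds))
print(container_to_host("/tmp/flower_photos/flower_photos.tgz", binds))
print(container_to_host("/home/my_model", binds))
```

This is why the Python script reads the dataset from file:///tmp/flower_photos/flower_photos.tgz while the job script copies the saved model from $BEEGFS_TMPDIR/my_model.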

Run Job

[juser@picotte001 tfexample]$ sbatch tf_flowers.sh

Total run time should be about 45 seconds.

Output

In the file tf_flowers-NNNNN.out (the many "^M" characters are carriage returns written by the Keras progress bar, and can be ignored):

'tensorflow_latest-gpu.sif' -> '/beegfs/scratch/NNNNN/tensorflow/tensorflow_latest-gpu.sif'
'classify_flowers.py' -> '/beegfs/scratch/NNNNN/tensorflow/classify_flowers.py'
total 2426556
-rw-rw-r-- 1 juser juser       3154 Mar 13 19:04 classify_flowers.py
-rwxrwxr-x 1 juser juser 2484789248 Mar 13 19:04 tensorflow_latest-gpu.sif
'/beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz' -> '/local/scratch/NNNNN/flower_photos/flower_photos.tgz'
Downloading data from file:///tmp/flower_photos/flower_photos.tgz
^M^M^M^M^M^M^M228818944/228813984 [==============================] - 0s 0us/step
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
sequential (Sequential)      (None, 180, 180, 3)       0
_________________________________________________________________
rescaling_1 (Rescaling)      (None, 180, 180, 3)       0
_________________________________________________________________
conv2d (Conv2D)              (None, 180, 180, 16)      448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 90, 90, 16)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 90, 90, 32)        4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 45, 45, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 45, 45, 64)        18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 22, 22, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 30976)             0
_________________________________________________________________
dense (Dense)                (None, 128)               3965056
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
^M^M 6/^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 11s 32ms/step - loss: 1.6514 - accuracy: 0.2963 - val_loss: 1.1671 - val_accuracy: 0.5000
Epoch 2/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 1.1165 - accuracy: 0.5469 - val_loss: 1.1161 - val_accuracy: 0.5722
Epoch 3/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.9942 - accuracy: 0.6221 - val_loss: 1.1081 - val_accuracy: 0.5722
Epoch 4/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8751 - accuracy: 0.6516 - val_loss: 0.9247 - val_accuracy: 0.6294
Epoch 5/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8136 - accuracy: 0.6842 - val_loss: 0.8682 - val_accuracy: 0.6676
Epoch 6/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7599 - accuracy: 0.7129 - val_loss: 0.8457 - val_accuracy: 0.6635
Epoch 7/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7372 - accuracy: 0.7166 - val_loss: 0.8609 - val_accuracy: 0.6798
Epoch 8/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7117 - accuracy: 0.7269 - val_loss: 0.7796 - val_accuracy: 0.6962
Epoch 9/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6514 - accuracy: 0.7530 - val_loss: 0.7430 - val_accuracy: 0.7098
Epoch 10/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6667 - accuracy: 0.7534 - val_loss: 0.7548 - val_accuracy: 0.6935
Epoch 11/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6143 - accuracy: 0.7650 - val_loss: 0.7697 - val_accuracy: 0.7030
Epoch 12/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5986 - accuracy: 0.7772 - val_loss: 0.7782 - val_accuracy: 0.7275
Epoch 13/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5737 - accuracy: 0.7672 - val_loss: 0.7545 - val_accuracy: 0.7302
Epoch 14/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5251 - accuracy: 0.7994 - val_loss: 0.7316 - val_accuracy: 0.7153
Epoch 15/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5423 - accuracy: 0.7952 - val_loss: 0.6866 - val_accuracy: 0.7534
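
To pull the per-epoch numbers out of a log like the one above, something along these lines may help. It is a minimal sketch, assuming the .out file contains real carriage-return characters (which editors display as ^M) and that each epoch's summary line has the same "loss: ... - val_accuracy: ..." layout shown here. The names final_state and epoch_metrics are invented for this example.

```python
import re

EPOCH_RE = re.compile(
    r"loss: (?P<loss>[\d.]+) - accuracy: (?P<acc>[\d.]+)"
    r" - val_loss: (?P<val_loss>[\d.]+) - val_accuracy: (?P<val_acc>[\d.]+)"
)


def final_state(line: str) -> str:
    """Return what a terminal would show: the text after the last carriage return."""
    return line.rstrip("\r\n").rsplit("\r", 1)[-1]


def epoch_metrics(lines):
    """Collect one dict of float metrics per completed epoch."""
    metrics = []
    for line in lines:
        match = EPOCH_RE.search(final_state(line))
        if match:
            metrics.append({k: float(v) for k, v in match.groupdict().items()})
    return metrics


# Two lines shaped like the end of the log above, with real "\r" characters.
sample = [
    "Epoch 15/15\n",
    "\r 6/92 [=>....]\r92/92 [==============================] - 1s 12ms/step"
    " - loss: 0.5423 - accuracy: 0.7952 - val_loss: 0.6866 - val_accuracy: 0.7534\n",
]
print(epoch_metrics(sample))
```

For a real run, pass an open file instead of sample, for example epoch_metrics(open("tf_flowers-NNNNN.out", newline="")) with NNNNN replaced by the job ID. The newline="" argument keeps Python from treating each carriage return as a line break.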

Saved Model

The saved model will be in the directory named "my_model" contained in the directory where the job was submitted:

[juser@picotte001 tfexample]$ ls -lF my_model
total 736
drwxr-sr-x 2 juser someGrp      0 Mar 13 18:44 assets/
-rw-rw-r-- 1 juser someGrp 400142 Mar 13 19:05 saved_model.pb
drwxr-sr-x 2 juser someGrp     80 Mar 13 18:44 variables/
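
A quick way to confirm that the copy back from scratch succeeded is to check for the entries shown in the listing. The sketch below uses only the Python standard library. looks_like_saved_model is a hypothetical helper, and it checks only for the presence of saved_model.pb and variables/, not whether TensorFlow can actually load the model.

```python
from pathlib import Path


def looks_like_saved_model(model_dir) -> bool:
    """Check for the serialized graph file and the variables/ directory."""
    root = Path(model_dir)
    return (root / "saved_model.pb").is_file() and (root / "variables").is_dir()


# Run from the directory where the job was submitted.
print(looks_like_saved_model("my_model"))
```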

Files

Files for this example, except for the Singularity container image, are in:

/ifs/opt/Examples/Example_05_TensorFlow_Singularity