Slurm - Job Script Example 05 TensorFlow Singularity

This example is derived from the TensorFlow Core tutorial on basic image classification (TensorFlow Core Tutorials - Image classification, retrieved 2021-03-21). It runs TensorFlow in a Singularity container to train a classifier and saves the model to disk. It does not perform inference. The job should run for no longer than 15 minutes.

Code

Flowers Dataset

The flower images dataset has already been downloaded to Picotte, and is accessible by all users. See: Sample TensorFlow Datasets#Flowers
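
To see what the shared archive contains without unpacking it, the sketch below counts files per class directory. It assumes the archive is laid out as flower_photos/<class>/<image>, and the function name count_images_per_class is made up for this illustration; the archive path is the one used in the job script on this page.

```python
import tarfile
from collections import Counter
from pathlib import Path, PurePosixPath


def count_images_per_class(archive_path):
    """Count regular files under each <root>/<class>/ directory of a tar archive."""
    counts = Counter()
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts
            # Assumed layout: <root>/<class>/<file>. Files directly under
            # the root (for example a license file) are skipped.
            if len(parts) >= 3:
                counts[parts[1]] += 1
    return dict(counts)


# Path taken from the job script on this page; only read if it exists here.
archive = Path("/beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz")
if archive.exists():
    for name, n in sorted(count_images_per_class(archive).items()):
        print(f"{name}: {n}")
```

If the layout assumption holds, the per-class counts should add up to the 3670 files reported in the job output.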

Obtain an Image for Latest TensorFlow-GPU

Some TensorFlow images are available on Picotte in /beegfs/SingularityImages/

Alternatively, pull the appropriate Docker image and build a Singularity image from it:

  • Create a directory named tfexample in your home (or group) directory
    • cd to that directory
  • Pull the latest tensorflow-gpu Docker image. This cannot be done on the login node; it must be done on a compute node or GPU node. NOTE: this may take more than a few minutes.
[juser@picotte001 tfexample]$ srun -p gpu --gres=gpu:1 --mem=16G --time=2:00:00 --pty /bin/bash
[juser@gpu001 tfexample]$ singularity pull docker://tensorflow/tensorflow:latest-gpu
[juser@gpu001 tfexample]$ exit
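
The job script further down expects the image file to be named tensorflow_latest-gpu.sif. The sketch below derives that name from the Docker URI, assuming singularity pull uses its default <name>_<tag>.sif naming, and checks that a non-empty file is present. The helper names default_sif_name and image_ready are hypothetical, not part of Singularity.

```python
from pathlib import Path


def default_sif_name(uri):
    """Derive <name>_<tag>.sif from a docker:// URI (assumed default naming)."""
    ref = uri.split("://", 1)[-1]          # "tensorflow/tensorflow:latest-gpu"
    last = ref.rsplit("/", 1)[-1]          # "tensorflow:latest-gpu"
    name, _, tag = last.partition(":")
    return f"{name}_{tag or 'latest'}.sif"


def image_ready(uri, directory="."):
    """True if the expected image file exists in directory and is not empty."""
    path = Path(directory) / default_sif_name(uri)
    return path.is_file() and path.stat().st_size > 0


uri = "docker://tensorflow/tensorflow:latest-gpu"
print(default_sif_name(uri))
print("image present:", image_ready(uri))
```

Run this from the tfexample directory after the pull finishes. If the pull was given an explicit output filename, the derived name will not apply.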

Once the image (.sif file) has been obtained, create the Python script and job script below.

Python Script

Name this file classify_flowers.py:

#!/usr/bin/env python3
import numpy as np
import os
import sys
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib

###
### classify_flowers.py
###

dataset_url = "file:///tmp/flower_photos/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds.class_names

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)

normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))

num_classes = 5

data_augmentation = keras.Sequential(
  [
    layers.experimental.preprocessing.RandomFlip("horizontal",
                                                 input_shape=(img_height,
                                                              img_width,
                                                              3)),
    layers.experimental.preprocessing.RandomRotation(0.1),
    layers.experimental.preprocessing.RandomZoom(0.1),
  ]
)

model = Sequential([
  data_augmentation,
  layers.experimental.preprocessing.Rescaling(1./255),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.2),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.summary()

epochs = 15
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### save the model
print("Saving model ...")
model.save('my_model')

sys.exit(0)

### Nothing is executed after sys.exit(0)

### XXX infer
### XXX can't infer because PIL/Pillow is not installed
sunflower_url = "file:///tmp/somewhere/592px-Red_sunflower.jpg"
sunflower_path = tf.keras.utils.get_file('Red_sunflower', origin=sunflower_url)

img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0) # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
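
As a sanity check on the model defined above, the layer sizes and parameter counts can be worked out by hand. The sketch below uses only the values in the script (180x180 RGB input, three 3x3 'same'-padded convolutions each followed by 2x2 max pooling, a 128-unit dense layer, and 5 classes) plus the file count of 3670 reported in the job output. The helper names are made up for this illustration.

```python
import math


def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_ch + 1) * out_ch


def dense_params(in_units, out_units):
    """Weights plus one bias per unit for a Dense layer."""
    return (in_units + 1) * out_units


size = 180        # img_height == img_width
in_ch = 3         # RGB input
num_classes = 5
total = 0
for filters in (16, 32, 64):
    total += conv2d_params(3, in_ch, filters)
    size //= 2    # 2x2 max pooling halves each dimension, rounding down
    in_ch = filters

flat = size * size * in_ch
total += dense_params(flat, 128) + dense_params(128, num_classes)
print("final feature map:", size, "x", size, "x", in_ch)
print("flattened units:", flat)
print("total parameters:", total)

# Dataset split: 20% of the 3670 files are held out for validation.
val_files = 3670 * 20 // 100
train_files = 3670 - val_files
steps_per_epoch = math.ceil(train_files / 32)   # batch_size = 32
print("training files:", train_files, "validation files:", val_files)
print("steps per epoch:", steps_per_epoch)
```

These figures can be compared against the model.summary() table and the "92/92" step counter in the job output.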

Job Script

Name this file tf_flowers.sh:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=tf_flowers
#SBATCH --output=tf_flowers-%A.out
#SBATCH --error=tf_flowers-%A.err
#SBATCH --nodes=1
#SBATCH --cpus-per-gpu=12
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=24:00:00
#SBATCH --mem-per-gpu=40G

###
### tf_flowers.sh
###

### Adapted from:
### https://www.tensorflow.org/tutorials/images/classification (2021-03-13)

### Singularity image file downloaded previously
TF_IMG=tensorflow_latest-gpu.sif

### Working directory will be BeeGFS scratch directory (created automatically for every job, and deleted at end of job)
### Copy the Singularity image file, and the Python classifier script to working directory
WORKDIR=$BEEGFS_TMPDIR/tensorflow
mkdir -p $WORKDIR
cp -v $TF_IMG classify_flowers.py $WORKDIR

### confirm that files are where they should be
ls -l $WORKDIR

### Picotte has some sample TensorFlow datasets already downloaded, to avoid repeated downloads from the Internet.
###    Location: /beegfs/Sample_TF_Datasets
### Images will be put in a subdirectory named "flower_photos" of local scratch (given by the $TMP environment variable)
FLOWER_PHOTOS_DIR=$TMP/flower_photos
mkdir -p $FLOWER_PHOTOS_DIR
cp -v /beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz $FLOWER_PHOTOS_DIR

### Run the classifier
### - binds $BEEGFS_TMPDIR to the /home directory in the Singularity container
### - binds $TMP (local scratch) to the /tmp directory in the Singularity container
### - above, we had copied the classify_flowers.py script to what is now /home/tensorflow in the Singularity container
singularity exec --home $BEEGFS_TMPDIR:/home --bind $TMP:/tmp --nv $TF_IMG python /home/tensorflow/classify_flowers.py

### The TensorFlow Singularity image does not come with PIL installed, so we cannot do inference.
### Instead, we save the model, and then copy it back to the directory where we did 'sbatch'
/bin/cp -fr $BEEGFS_TMPDIR/my_model $SLURM_SUBMIT_DIR
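
The two bind options in the job script determine where container paths live on the host. The sketch below illustrates that mapping in Python. The helper to_host_path is hypothetical, and the scratch directories are the example values that appear in the job output on this page (NNNNN stands for the job ID).

```python
import shlex
from pathlib import PurePosixPath


def to_host_path(container_path, binds):
    """Translate a container path to a host path using (host, container) bind pairs."""
    cpath = PurePosixPath(container_path)
    # Try the longest container prefix first so nested binds resolve correctly.
    for host, target in sorted(binds, key=lambda b: len(b[1]), reverse=True):
        try:
            rel = cpath.relative_to(target)
        except ValueError:
            continue   # this bind does not cover the path
        return str(PurePosixPath(host) / rel)
    return None


# Example values taken from the job output on this page.
beegfs_tmpdir = "/beegfs/scratch/NNNNN"   # $BEEGFS_TMPDIR
local_tmp = "/local/scratch/NNNNN"        # $TMP
binds = [(beegfs_tmpdir, "/home"), (local_tmp, "/tmp")]

# The same command line the job script runs, assembled as an argument list.
cmd = [
    "singularity", "exec",
    "--home", f"{beegfs_tmpdir}:/home",
    "--bind", f"{local_tmp}:/tmp",
    "--nv", "tensorflow_latest-gpu.sif",
    "python", "/home/tensorflow/classify_flowers.py",
]
print(shlex.join(cmd))

for path in ("/home/tensorflow/classify_flowers.py",
             "/tmp/flower_photos/flower_photos.tgz",
             "/home/my_model"):
    print(path, "->", to_host_path(path, binds))
```

The last mapping shows why the job script copies the model from $BEEGFS_TMPDIR/my_model: the directory the Python script sees as /home/my_model is on BeeGFS scratch on the host.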

Run Job

[juser@picotte001 tfexample]$ sbatch tf_flowers.sh

Total run time should be about 45 seconds.

Output

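The file tf_flowers-NNNNN.out (where NNNNN is the job ID) should contain output like the following. The many "^M" characters are carriage returns written by the progress bar and can be ignored: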

'tensorflow_latest-gpu.sif' -> '/beegfs/scratch/NNNNN/tensorflow/tensorflow_latest-gpu.sif'
'classify_flowers.py' -> '/beegfs/scratch/NNNNN/tensorflow/classify_flowers.py'
total 2426556
-rw-rw-r-- 1 juser juser       3154 Mar 13 19:04 classify_flowers.py
-rwxrwxr-x 1 juser juser 2484789248 Mar 13 19:04 tensorflow_latest-gpu.sif
'/beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz' -> '/local/scratch/NNNNN/flower_photos/flower_photos.tgz'
Downloading data from file:///tmp/flower_photos/flower_photos.tgz
^M^M^M^M^M^M^M228818944/228813984 [==============================] - 0s 0us/step
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
sequential (Sequential)      (None, 180, 180, 3)       0
_________________________________________________________________
rescaling_1 (Rescaling)      (None, 180, 180, 3)       0
_________________________________________________________________
conv2d (Conv2D)              (None, 180, 180, 16)      448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 90, 90, 16)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 90, 90, 32)        4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 45, 45, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 45, 45, 64)        18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 22, 22, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 30976)             0
_________________________________________________________________
dense (Dense)                (None, 128)               3965056
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
^M^M 6/^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 11s 32ms/step - loss: 1.6514 - accuracy: 0.2963 - val_loss: 1.1671 - val_accuracy: 0.5000
Epoch 2/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 1.1165 - accuracy: 0.5469 - val_loss: 1.1161 - val_accuracy: 0.5722
Epoch 3/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.9942 - accuracy: 0.6221 - val_loss: 1.1081 - val_accuracy: 0.5722
Epoch 4/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8751 - accuracy: 0.6516 - val_loss: 0.9247 - val_accuracy: 0.6294
Epoch 5/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8136 - accuracy: 0.6842 - val_loss: 0.8682 - val_accuracy: 0.6676
Epoch 6/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7599 - accuracy: 0.7129 - val_loss: 0.8457 - val_accuracy: 0.6635
Epoch 7/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7372 - accuracy: 0.7166 - val_loss: 0.8609 - val_accuracy: 0.6798
Epoch 8/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7117 - accuracy: 0.7269 - val_loss: 0.7796 - val_accuracy: 0.6962
Epoch 9/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6514 - accuracy: 0.7530 - val_loss: 0.7430 - val_accuracy: 0.7098
Epoch 10/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6667 - accuracy: 0.7534 - val_loss: 0.7548 - val_accuracy: 0.6935
Epoch 11/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6143 - accuracy: 0.7650 - val_loss: 0.7697 - val_accuracy: 0.7030
Epoch 12/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5986 - accuracy: 0.7772 - val_loss: 0.7782 - val_accuracy: 0.7275
Epoch 13/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5737 - accuracy: 0.7672 - val_loss: 0.7545 - val_accuracy: 0.7302
Epoch 14/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5251 - accuracy: 0.7994 - val_loss: 0.7316 - val_accuracy: 0.7153
Epoch 15/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5423 - accuracy: 0.7952 - val_loss: 0.6866 - val_accuracy: 0.7534
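
To pull the per-epoch metrics out of a log like the one above, a short parser is enough. The sketch below assumes the final progress line of each epoch follows the last carriage return on its line. It also accepts the literal two-character "^M" shown on this page, so text pasted from here works as well. The function name parse_training_log is made up for this illustration.

```python
import re

EPOCH_RE = re.compile(r"^Epoch (\d+)/(\d+)$")
METRIC_RE = re.compile(r"(\w+): ([0-9.]+)")


def parse_training_log(text):
    """Return a list of (epoch, metrics) tuples from Keras fit() output."""
    results = []
    epoch = None
    # Split on "\n" only: str.splitlines() would also split on "\r".
    for raw in text.split("\n"):
        # Treat a literal "^M" as a carriage return, drop a trailing one,
        # then keep only the text after the last carriage return.
        line = raw.replace("^M", "\r").rstrip("\r").rsplit("\r", 1)[-1].strip()
        match = EPOCH_RE.match(line)
        if match:
            epoch = int(match.group(1))
            continue
        if epoch is not None and "val_accuracy" in line:
            metrics = {key: float(value) for key, value in METRIC_RE.findall(line)}
            results.append((epoch, metrics))
            epoch = None
    return results


# Two epochs copied from the output above (runs of ^M shortened).
sample = (
    "Epoch 1/15\n"
    "^M^M 6/^M^M92/92 [==============================] - 11s 32ms/step"
    " - loss: 1.6514 - accuracy: 0.2963 - val_loss: 1.1671 - val_accuracy: 0.5000\n"
    "Epoch 15/15\n"
    "^M^M92/92 [==============================] - 1s 12ms/step"
    " - loss: 0.5423 - accuracy: 0.7952 - val_loss: 0.6866 - val_accuracy: 0.7534\n"
)

for epoch, metrics in parse_training_log(sample):
    print(epoch, metrics["accuracy"], metrics["val_accuracy"])
```

To use it on a real job, read the .out file with open(path, newline="") so that Python does not translate the carriage returns into line breaks.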

Saved Model

The saved model will be in a directory named "my_model" inside the directory from which the job was submitted:

[juser@picotte001 tfexample]$ ls -lF my_model
total 736
drwxr-sr-x 2 juser someGrp      0 Mar 13 18:44 assets/
-rw-rw-r-- 1 juser someGrp 400142 Mar 13 19:05 saved_model.pb
drwxr-sr-x 2 juser someGrp     80 Mar 13 18:44 variables/
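
A quick way to confirm that the copy back from scratch succeeded is to check for the entries shown in the listing above. The sketch below is a minimal check using the standard library only; the function name looks_like_saved_model is hypothetical, and the check does not verify that the model actually loads.

```python
from pathlib import Path


def looks_like_saved_model(model_dir):
    """True if model_dir holds saved_model.pb and a variables/ directory."""
    d = Path(model_dir)
    return (d / "saved_model.pb").is_file() and (d / "variables").is_dir()


# Run from the directory where the job was submitted.
print("my_model looks complete:", looks_like_saved_model("my_model"))
```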

Files

Files for this example, except for the Singularity container image, are in:

/ifs/opt/Examples/Example_05_TensorFlow_Singularity

See Also

References