Slurm - Job Script Example 05 TensorFlow Singularity
This example is derived from a TensorFlow Core Tutorial on basic image classification (TensorFlow Core Tutorials - Image classification, retrieved 2021-03-21). It runs TensorFlow in a Singularity container to train a classifier and saves the model to disk. It does not perform inference. The job should run no longer than 15 minutes.
Code
Flowers Dataset
The flower images dataset has already been downloaded to Picotte, and is accessible by all users. See: Sample TensorFlow Datasets#Flowers
Obtain an Image for Latest TensorFlow-GPU
Some TensorFlow images are available on Picotte in the directory /beegfs/SingularityImages/
Alternatively, pull the appropriate Docker image and build a Singularity image from it:
- Create a directory named tfexample in your home (or group) directory, and cd to that directory.
- Pull the latest tensorflow-gpu Docker image. This cannot be done on the login node; it must be done on a compute node or GPU node. NOTE: this may take more than a few minutes.
[juser@picotte001 tfexample]$ srun -p gpu --gres=gpu:1 --mem=16G --time=2:00:00 --pty /bin/bash
[juser@gpu001 tfexample]$ singularity pull docker://tensorflow/tensorflow:latest-gpu
[juser@gpu001 tfexample]$ exit
Once the image (.sif file) is obtained, create the Python script and job script below.
Python Script
Name this file classify_flowers.py:
#!/usr/bin/env python3
import numpy as np
import os
import sys
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib
###
### classify_flowers.py
###
dataset_url = "file:///tmp/flower_photos/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
batch_size = 32
img_height = 180
img_width = 180
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
class_names = train_ds.class_names
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
num_classes = 5
data_augmentation = keras.Sequential(
    [
        layers.experimental.preprocessing.RandomFlip(
            "horizontal",
            input_shape=(img_height, img_width, 3)),
        layers.experimental.preprocessing.RandomRotation(0.1),
        layers.experimental.preprocessing.RandomZoom(0.1),
    ]
)
model = Sequential([
    data_augmentation,
    layers.experimental.preprocessing.Rescaling(1./255),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()
epochs = 15
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)
### save the model
print("Saving model ...")
model.save('my_model')
sys.exit(0)
### Nothing is executed after sys.exit(0)
### XXX infer
### XXX can't infer because PIL/Pillow is not installed
sunflower_url = "file:///tmp/somewhere/592px-Red_sunflower.jpg"
sunflower_path = tf.keras.utils.get_file('Red_sunflower', origin=sunflower_url)
img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0) # Create a batch
predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
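The inference lines at the end of the script are never reached, but they show the usual pattern: the final Dense layer emits raw logits (the loss is configured with from_logits=True), and a softmax turns them into per-class probabilities. The following is a minimal sketch of that conversion in plain Python, with no TensorFlow needed. The logit values are made up for illustration, and the class names are an assumption about the five subdirectory names of the flowers dataset.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum first so math.exp never overflows.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: the five class names, as read from the dataset's subdirectories.
class_names = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]

# Hypothetical logits for one image; a real model.predict() call would supply these.
example_logits = [0.3, -1.2, 0.8, 4.1, 1.5]

score = softmax(example_logits)
best = max(range(len(score)), key=score.__getitem__)
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[best], 100 * score[best])
)
```

The class with the largest logit always has the largest probability, so the softmax matters only for reporting a confidence value, not for choosing the predicted class.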
Job Script
Name this file tf_flowers.sh:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=tf_flowers
#SBATCH --output=tf_flowers-%A.out
#SBATCH --error=tf_flowers-%A.err
#SBATCH --nodes=1
#SBATCH --cpus-per-gpu=12
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=24:00:00
#SBATCH --mem-per-gpu=40G
###
### tf_flowers.sh
###
### Adapted from:
### https://www.tensorflow.org/tutorials/images/classification (2021-03-13)
### Singularity image file downloaded previously
TF_IMG=tensorflow_latest-gpu.sif
### Working directory will be BeeGFS scratch directory (created automatically for every job, and deleted at end of job)
### Copy the Singularity image file, and the Python classifier script to working directory
WORKDIR=$BEEGFS_TMPDIR/tensorflow
mkdir -p $WORKDIR
cp -v $TF_IMG classify_flowers.py $WORKDIR
### confirm that files are where they should be
ls -l $WORKDIR
### Picotte has some sample TensorFlow datasets already downloaded, to avoid repeated downloads from the Internet.
### Location: /beegfs/Sample_TF_Datasets
### Images will be put in a subdirectory named "flower_photos" of local scratch (given by the $TMP environment variable)
FLOWER_PHOTOS_DIR=$TMP/flower_photos
mkdir -p $FLOWER_PHOTOS_DIR
cp -v /beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz $FLOWER_PHOTOS_DIR
### Run the classifier
### - binds $BEEGFS_TMPDIR to the /home directory in the Singularity container
### - binds $TMP (local scratch) to the /tmp directory in the Singularity container
### - above, we had copied the classify_flowers.py script to what is now /home/tensorflow in the Singularity container
singularity exec --home $BEEGFS_TMPDIR:/home --bind $TMP:/tmp --nv $TF_IMG python /home/tensorflow/classify_flowers.py
### The TensorFlow Singularity image does not come with PIL installed, so we cannot do inference.
### Instead, we save the model, and then copy it back to the directory where we did 'sbatch'
/bin/cp -fr $BEEGFS_TMPDIR/my_model $SLURM_SUBMIT_DIR
Run Job
[juser@picotte001 tfexample]$ sbatch tf_flowers.sh
Total run time should be about 45 seconds.
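When the job finishes, the progress bars that Keras writes to the tf_flowers-NNNNN.out file contain carriage-return characters, which some viewers display as "^M". If a cleaner log is wanted, a small filter along the following lines may help. This is only a sketch: the function name is hypothetical, and it assumes the progress bar uses plain carriage returns (not backspaces) to overwrite the line.

```python
def strip_carriage_returns(text: str) -> str:
    """Keep only what a terminal would show: the text after the last CR on each line."""
    cleaned = []
    for line in text.split("\n"):
        # Drop a trailing CR, then keep whatever followed the final overwrite.
        cleaned.append(line.rstrip("\r").rsplit("\r", 1)[-1])
    return "\n".join(cleaned)


if __name__ == "__main__":
    # Made-up sample imitating two progress-bar updates followed by a normal line.
    sample = "\r 6/92 [..]\r92/92 [====] - 1s 12ms/step\nSaving model ..."
    print(strip_carriage_returns(sample))
```

To apply it to a real log, read the file with open(path, newline="") so that Python does not translate the carriage returns before the function sees them.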
Output
The file tf_flowers-NNNNN.out should look like the following; the many "^M" characters (carriage returns written by the progress bar) can be ignored:
'tensorflow_latest-gpu.sif' -> '/beegfs/scratch/NNNNN/tensorflow/tensorflow_latest-gpu.sif'
'classify_flowers.py' -> '/beegfs/scratch/NNNNN/tensorflow/classify_flowers.py'
total 2426556
-rw-rw-r-- 1 juser juser 3154 Mar 13 19:04 classify_flowers.py
-rwxrwxr-x 1 juser juser 2484789248 Mar 13 19:04 tensorflow_latest-gpu.sif
'/beegfs/Sample_TF_Datasets/Flowers/flower_photos.tgz' -> '/local/scratch/NNNNN/flower_photos/flower_photos.tgz'
Downloading data from file:///tmp/flower_photos/flower_photos.tgz
^M^M^M^M^M^M^M228818944/228813984 [==============================] - 0s 0us/step
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sequential (Sequential) (None, 180, 180, 3) 0
_________________________________________________________________
rescaling_1 (Rescaling) (None, 180, 180, 3) 0
_________________________________________________________________
conv2d (Conv2D) (None, 180, 180, 16) 448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 90, 90, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 90, 90, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 45, 45, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 45, 45, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 22, 22, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 30976) 0
_________________________________________________________________
dense (Dense) (None, 128) 3965056
_________________________________________________________________
dense_1 (Dense) (None, 5) 645
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
^M^M 6/^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 11s 32ms/step - loss: 1.6514 - accuracy: 0.2963 - val_loss: 1.1671 - val_accuracy: 0.5000
Epoch 2/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 1.1165 - accuracy: 0.5469 - val_loss: 1.1161 - val_accuracy: 0.5722
Epoch 3/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.9942 - accuracy: 0.6221 - val_loss: 1.1081 - val_accuracy: 0.5722
Epoch 4/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8751 - accuracy: 0.6516 - val_loss: 0.9247 - val_accuracy: 0.6294
Epoch 5/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.8136 - accuracy: 0.6842 - val_loss: 0.8682 - val_accuracy: 0.6676
Epoch 6/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7599 - accuracy: 0.7129 - val_loss: 0.8457 - val_accuracy: 0.6635
Epoch 7/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7372 - accuracy: 0.7166 - val_loss: 0.8609 - val_accuracy: 0.6798
Epoch 8/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.7117 - accuracy: 0.7269 - val_loss: 0.7796 - val_accuracy: 0.6962
Epoch 9/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6514 - accuracy: 0.7530 - val_loss: 0.7430 - val_accuracy: 0.7098
Epoch 10/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6667 - accuracy: 0.7534 - val_loss: 0.7548 - val_accuracy: 0.6935
Epoch 11/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.6143 - accuracy: 0.7650 - val_loss: 0.7697 - val_accuracy: 0.7030
Epoch 12/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5986 - accuracy: 0.7772 - val_loss: 0.7782 - val_accuracy: 0.7275
Epoch 13/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5737 - accuracy: 0.7672 - val_loss: 0.7545 - val_accuracy: 0.7302
Epoch 14/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5251 - accuracy: 0.7994 - val_loss: 0.7316 - val_accuracy: 0.7153
Epoch 15/15
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M92/92 [==============================] - 1s 12ms/step - loss: 0.5423 - accuracy: 0.7952 - val_loss: 0.6866 - val_accuracy: 0.7534
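Several numbers in this output follow directly from the settings in classify_flowers.py, which gives a quick way to confirm that a run used the expected data and model. The sketch below recomputes the training/validation split, the 92 steps per epoch, and the parameter total from the model summary. It assumes the validation subset size is the truncated product of the file count and validation_split; the helper function names are hypothetical.

```python
import math

# Values taken from classify_flowers.py and the job output above.
total_images = 3670
validation_split = 0.2
batch_size = 32
img_size = 180
num_classes = 5

# Assumption: the validation subset size is the truncated product.
val_count = int(total_images * validation_split)
train_count = total_images - val_count
steps_per_epoch = math.ceil(train_count / batch_size)


def conv2d_params(kernel, channels_in, filters):
    """Weights (kernel x kernel x channels_in per filter) plus one bias per filter."""
    return (kernel * kernel * channels_in + 1) * filters


def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return (inputs + 1) * units


# Three Conv2D layers with 3x3 kernels; 'same' padding keeps the spatial size,
# and each MaxPooling2D halves it (rounding down): 180 -> 90 -> 45 -> 22.
conv_total = 0
channels = 3
side = img_size
for filters in (16, 32, 64):
    conv_total += conv2d_params(3, channels, filters)
    channels = filters
    side //= 2

flattened = side * side * channels
total_params = conv_total + dense_params(flattened, 128) + dense_params(128, num_classes)

print(val_count, train_count, steps_per_epoch)
print(flattened, total_params)
```

The printed values agree with the log: 734 validation files, 2936 training files, 92 steps per epoch, a flattened size of 30976, and 3,989,285 total parameters.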
Saved Model
The saved model will be in the directory named "my_model" in the directory where the job was submitted:
[juser@picotte001 tfexample]$ ls -lF my_model
total 736
drwxr-sr-x 2 juser someGrp 0 Mar 13 18:44 assets/
-rw-rw-r-- 1 juser someGrp 400142 Mar 13 19:05 saved_model.pb
drwxr-sr-x 2 juser someGrp 80 Mar 13 18:44 variables/
Files
Files for this example, except for the Singularity container image, are in:
/ifs/opt/Examples/Example_05_TensorFlow_Singularity
See Also
- Slurm - Job Script Example 05a TensorFlow With Anaconda Python
- Slurm - Job Script Example 08 TensorFlow using virtualenv
- Slurm - Job Script Example 08a TensorFlow multi-GPU using virtualenv