GPU Memory Limits for BERT

Code

This example is derived from GPU Benchmarks for Fine-Tuning BERT [1], which focuses on the choice of training batch size. We will use the Anaconda Python installed on Picotte to run this example on a GPU.

Python script

The Python script is a file named gpu_memory_limit_parameter_search.py. It runs the experiment with several batch sizes and appends each result to a table. We use the information collected in the table to analyze how the training batch size affects memory usage.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""GPU Memory Limit Parameter Search.ipynb
Automatically generated by Colaboratory.
Original file is located at
    https://colab.research.google.com/drive/15OdPZPVx9OFGK_4eRp8PLOt2yb3Ma5dX
# Find GPU Memory Limits for BERT
By Chris McCormick
This is the Notebook used to gather experimental results for the parameters plot in [GPU Benchmarks for Fine-Tuning BERT.ipynb](https://colab.research.google.com/drive/1Gzm0mBTWQLtI5q8azxwudmZGBWyiq6II#scrollTo=K2He8biUzcij&uniqifier=1)
Here's the quick overview of the experiment:
1. Load BERT-base onto the GPU
2. Create a Tensor with the parameters you want:
    * Number of rows is batch size.
    * Number of columns is sequence length.
3. Train on the random data for 3 steps.
4. Record the peak GPU memory usage.
    * If training fails due to "out-of-memory", record the memory used as -1.
I've stored the results from all of my runs in a `.csv` file on Google Drive.

## S1. Setup
### 1.1. Connect to GPU
"""
import torch
# Tell PyTorch to use the GPU.
device = torch.device("cuda")
print('There are %d GPU(s) available.' % torch.cuda.device_count())
# Look up what kind of GPU we're given.
gpu_name = torch.cuda.get_device_name(0)
print('We will use the GPU:', gpu_name)
"""### 1.2. Install `transformers`
Next, let's install the [transformers](https://github.com/huggingface/transformers) package from Hugging Face which will give us a pytorch interface for working with BERT.
"""

import transformers
"""### 1.3. `check_gpu_mem`
Function to measure GPU memory usage.
"""
import os
import pandas as pd
import csv
def check_gpu_mem():
    '''
    Uses Nvidia's SMI tool to check the current GPU memory usage.
    '''
    # Run the command line tool and get the results.
    buf = os.popen('nvidia-smi --query-gpu=memory.total,memory.used --format=csv')
    # Use csv module to read and parse the result.
    reader = csv.reader(buf, delimiter=',')
    # Use a pandas table just for nice formatting.
    df = pd.DataFrame(reader)
    # Use the first row as the column headers.
    new_header = df.iloc[0] #grab the first row for the header
    df = df[1:] #take the data less the header row
    df.columns = new_header #set the header row as the df header
    # Display the formatted table.
    #display(df)
    return df
"""Let's run the function so you can see the output."""
df = check_gpu_mem()
print(df)
"""**NOTE!** - The `first_time` flag is set the first time this Notebook is run. On subsequent runs (within the same runtime), you should start from 2.1."""
# Flag used to determine whether to create a new table or append to existing.
first_time = True

"""## S2. Run Experiment
### 2.1. Specify Parameters (Re-Run From Here)
Set the key parameters and timestamp the experiment.
"""
import time
from datetime import datetime
# Maximum sequence length (number of columns) of the random input batch.
max_len = 240
# Sweep the batch size from 20 up to 100 in steps of 20 (see the increment at
# the bottom of the loop).
batch_size = 20
while batch_size <= 100:
    # Record when we started.
    timestamp = datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    # All parameters and results are placed in a table at the very end of the
    # notebook.
    from transformers import BertForSequenceClassification
    # Load BertForSequenceClassification, the pretrained BERT model with a single
    # linear classification layer on top.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
        num_labels = 20, # For our 20 newsgroups!
        output_attentions = False, # Whether the model returns attentions weights.
        output_hidden_states = False, # Whether the model returns all hidden-states.
    )
    # Tell pytorch to run this model on the GPU.
    desc = model.cuda()
    check_gpu_mem()
    """### 2.2. Random Data
    Generate some random data that looks like BERT tokens.
    """
    import torch
    # Create the bogus input batch.
    x = torch.randint(low=0, high=100, size=(batch_size, max_len))
    # Assign random labels (0 or 1). Note that `high` is exclusive, so `high=2`
    # is needed to actually get both values.
    labels = torch.randint(low=0, high=2, size=(batch_size, 1))
    print('x:     ', x.size())
    print('labels:', labels.size())
    # Copy to the GPU
    x = x.to(device)
    labels = labels.to(device)
    """### 2.3. Forward & Backward Training Pass
    Run the fake data through the model, and perform the gradient calculation, just to see the memory usage.
    I did this for three steps simply because, in my experience, the GPU doesn't seem to hit its peak memory usage until after the third step.
    """
    # Use a flag to capture whether the experiment either succeeds or runs out of
    # memory.
    success = False
    # It seems like the peak memory use isn't hit until after a few steps, so run
    # this through three times.
    for i in range(0, 3):
        # Put the model into training mode. Don't be misled--the call to
        # `train` just changes the *mode*, it doesn't *perform* the training.
        # `dropout` and `batchnorm` layers behave differently during training
        # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
        model.train()
        df = check_gpu_mem()
        print('    Before forward-pass: {:}'.format(df.iloc[0, 1]))
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns a different number of values depending on which arguments
        # are given and which flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
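        # Note: this tuple unpacking assumes an older transformers release (the
        # v2.2.0 docs linked above). In transformers v4+ the model returns a
        # ModelOutput object by default, so you would either pass
        # `return_dict=False` or read `outputs.loss` and `outputs.logits`.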
        loss, logits = model(x,
                            token_type_ids=None,
                            labels=labels)
        # Report GPU memory use for the first couple steps.
        df = check_gpu_mem()
        print('     After forward-pass: {:}'.format(df.iloc[0, 1]))
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Report GPU memory use for the first couple steps.
        df = check_gpu_mem()
        print('    After gradient calculation: {:}'.format(df.iloc[0, 1]))
    # If we made it here, we didn't run out of memory. Success!
    success = True

    """### 2.4. Record Results
    Record all of the parameters and results from this experiment.
    After putting these into a table row with pandas below, I've been copying them over to a `.csv` file.
    """
    # Record the peak memory use and the total available.
    # Record the memory usage.
    if success:
        mem_use = int(df.iloc[0, 1][:-4])
    # If it ran out of memory, use the value -1.
    else:
        mem_use = -1
    # Total memory this GPU.
    mem_total = int(df.iloc[0, 0][:-4])
    new_row = {'timestamp': timestamp, 'batch_size': batch_size, 'max_len': max_len, 'gpu': gpu_name, 'mem_use': mem_use, 'mem_total': mem_total }
    # If this is the first experiment performed within this runtime, then the table
    # doesn't exist yet.
    if first_time:
        # Create the table.
        df_res = pd.DataFrame(data=[new_row])
        first_time = False
    else:
        # Append to the table. (DataFrame.append was removed in pandas 2.0,
        # so use pd.concat instead.)
        df_res = pd.concat([df_res, pd.DataFrame([new_row])], ignore_index=True)
    # Display the accumulated results.
    print(df_res)
    # Dump the table to CSV and I'll manually copy it over to my aggregated results
    # file.
    print("TABLE OF RESULT:\n")
    print(df_res.to_csv())
    batch_size = batch_size + 20
    """## End"""

Job script

The job script is a file named bert.sh, in the same directory as gpu_memory_limit_parameter_search.py:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --mem-per-gpu=48GB

module load python/anaconda3
. ~/.bashrc
conda activate pytorch
python3 gpu_memory_limit_parameter_search.py

Job submission

Submit the job script:

[juser@picotte001 ~]$ sbatch bert.sh

Output

On a Tesla V100-SXM2-32GB GPU, memory usage increases with the training batch size.

When the batch size reaches 100, the run fails with an out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 282.00 MiB (GPU 0; 31.75 GiB total capacity; 30.15 GiB already allocated; 90.00 MiB free; 30.37 GiB reserved in total by PyTorch)
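
The per-run tables printed by the script can also be analyzed directly. The sketch below is a hypothetical example: it assumes the printed CSV has been saved to a file named results.csv with the columns used above (batch_size, mem_use, mem_total), and plots memory use against batch size, similar to the plot in [1].

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name: the CSV dumped by the script, saved as results.csv.
df = pd.read_csv('results.csv')

# Drop runs that failed with out-of-memory (recorded as -1).
ok = df[df['mem_use'] > 0]

# Plot memory used (MiB) against the training batch size.
plt.plot(ok['batch_size'], ok['mem_use'], marker='o', label='memory used')
plt.axhline(ok['mem_total'].iloc[0], linestyle='--', label='total GPU memory')
plt.xlabel('batch size')
plt.ylabel('GPU memory (MiB)')
plt.legend()
plt.savefig('memory_vs_batch_size.png')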

References

[1] Chris McCormick, "GPU Benchmarks for Fine-Tuning BERT". https://colab.research.google.com/drive/1Gzm0mBTWQLtI5q8azxwudmZGBWyiq6II