Job Script Example 03 Apache Spark

This is an example of using Apache Spark.

Basic Example - Wordcount

This simple example counts the words in a text file. It uses the Python (PySpark) API.[1]

The example code and data file are located in:

/mnt/HA/examples/Spark/Wordcount

The data file is the complete works of Shakespeare as distributed by Project Gutenberg.

The PySpark script, wordcount.py:

#!/usr/bin/env python3
import re
from pyspark import SparkContext

def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    return re.sub('_', '', re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).strip())

print("hello, spark")

sc = SparkContext(appName="This SparkApp")  # the master URL is supplied by spark-submit

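# Read the input into an RDD with at least 8 partitions and cache it in memory.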
shaketext = sc.textFile("shakespeare.txt", 8).cache()
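# Clean each line of punctuation, split it into words, and drop any empty tokens.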
shakewordsRDD = (shaketext
                 .map(removePunctuation)
                 .flatMap(lambda l: l.split())
                 .filter(lambda w: len(w) > 0)
                 )

print("")
print("No. of words = {}".format(shakewordsRDD.count()))
print("")

print("bye, spark")

The job script:

#!/bin/bash
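# UGE directives: use the spark2.intel job class and parallel environment with 64 slots
# on Intel Sandy Bridge nodes, exclusive node access, 30 minutes of run time,
# a 4 GiB hard virtual memory limit and 2 GiB free memory requested per slot.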
#$ -S /bin/bash
#$ -P myPrj
#$ -M myname@drexel.edu
#$ -m a
#$ -j y
#$ -cwd
#$ -R y
#$ -jc spark2.intel
#$ -l exclusive
#$ -l vendor=intel
#$ -l ua=sandybridge
#$ -pe spark2.intel 64
#$ -l h_rt=0:30:00
#$ -l h_vmem=4G
#$ -l m_mem_free=2g

. /etc/profile.d/modules.sh
module load shared
module load proteus
module load gcc
module load sge/univa
module load git
module load apache/spark/2.2.0

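# Per-job Spark configuration directory; the spark-env.sh inside it defines
# SPARK_MASTER_HOST, SPARK_MASTER_URL, SPARK_LOG_DIR, SPARK_WORKER_DIR, and
# SPARK_SLAVES, which are used below.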
export SPARK_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}
. ${SPARK_CONF_DIR}/spark-env.sh

echo "Starting master on ${SPARK_MASTER_HOST} ..."
start-master.sh
echo "Done starting master."

echo "Starting slaves..."
start-slaves.sh
echo "Done starting slaves."

echo "Submitting job..."
spark-submit --master ${SPARK_MASTER_URL} wordcount.py
echo "Done job."

echo "Stopping slaves..."
stop-slaves.sh
echo "Done stopping slaves."

echo "Stopping master..."
stop-master.sh
echo "Done stopping master."

### Optionally clean up all files.
echo "Cleanup..."
# wait for all processes to really finish up
sleep 10
/bin/rm -rf ${SPARK_CONF_DIR}
/bin/rm -rf ${SPARK_LOG_DIR}
/bin/rm -rf ${SPARK_WORKER_DIR}
/bin/rm ${SPARK_SLAVES}
echo "...Done."

References

[1] pyspark package documentation