Job Script Example 03: Apache Spark
This is an example of running Apache Spark as a batch job under Univa Grid Engine.
The example script and data file are in:
/mnt/HA/opt/Examples/Spark/Wordcount
Basic Example - Wordcount
This simple example computes some basic statistics (a word count) on a text file, using the Python API (PySpark).[1]
The data file is the complete works of Shakespeare, as provided by Project Gutenberg.
The Python Spark script:
#!/usr/bin/env python3
import re
from pyspark import SparkContext
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned-up string.
    """
    return re.sub(r'_', '', re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).strip())
print("hello, spark")
# The master URL is supplied by spark-submit in the job script, so it is not hard-coded here.
sc = SparkContext(appName="This SparkApp")
# Load the text file into an RDD with 8 partitions and cache it.
shaketext = sc.textFile("shakespeare.txt", 8).cache()
# Clean each line, split it into words, and drop any empty strings.
shakewordsRDD = (shaketext
    .map(removePunctuation)
    .flatMap(lambda l: l.split())
    .filter(lambda w: len(w) > 0)
)
print("")
print("No. of words = {}".format(shakewordsRDD.count()))
print("")
print("bye, spark")
The job script:
#!/bin/bash
#$ -S /bin/bash
# Project, email notification, and output settings; replace myPrj and the email address with your own.
#$ -P myPrj
#$ -M myname@drexel.edu
#$ -m a
#$ -j y
#$ -cwd
# Reserve slots as they free up, use the Spark job class, and request 64 slots on Intel Sandy Bridge nodes for exclusive use.
#$ -R y
#$ -jc spark2.intel
#$ -l exclusive
#$ -l vendor=intel
#$ -l ua=sandybridge
#$ -pe spark2.intel 64
# Resource limits: 30 minutes of wall-clock time, plus per-slot memory requests.
#$ -l h_rt=0:30:00
#$ -l h_vmem=4G
#$ -l m_mem_free=2g
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load gcc
module load sge/univa
module load git
module load apache/spark/2.2.0
# SPARK_CONF_DIR points at the per-job configuration directory (conf.<JOB_ID>); the spark-env.sh
# in it defines SPARK_MASTER_HOST, SPARK_MASTER_URL, and the log/worker/slaves variables used below.
export SPARK_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}
. ${SPARK_CONF_DIR}/spark-env.sh
echo "Starting master on ${SPARK_MASTER_HOST} ..."
start-master.sh
echo "Done starting master."
echo "Starting slaves..."
start-slaves.sh
echo "Done starting slaves."
echo "Submitting job..."
spark-submit --master ${SPARK_MASTER_URL} wordcount.py
echo "Done job."
echo "Stopping slaves..."
stop-slaves.sh
echo "Done stopping slaves."
echo "Stopping master..."
stop-master.sh
echo "Done stopping master."
### Optionally clean up all files.
echo "Cleanup..."
# wait for all processes to really finish up
sleep 10
/bin/rm -rf ${SPARK_CONF_DIR}
/bin/rm -rf ${SPARK_LOG_DIR}
/bin/rm -rf ${SPARK_WORKER_DIR}
/bin/rm ${SPARK_SLAVES}
echo "...Done."