A lecture on Apache Spark, the well-known open-source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark and its advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
3. Outline
› Part I: Setup Environment
› Ubuntu / Mac / Windows
› Part II: Introduction to Spark
› History / Features / Examples
› Part III: Hands-On Training
› Core / SQL / MLlib
5. Setting Up Docker on Ubuntu
› $ apt-get update
› $ apt-get -y install docker.io
› $ ln -sf /usr/bin/docker.io /usr/local/bin/docker
› $ sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io
› $ update-rc.d docker.io defaults
› $ docker pull jupyter/pyspark-notebook:latest
› $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook:latest
› # open browser and visit the address `localhost:8888`
› # click `New` and then `Python 2`
› # rename notebook, from `Untitled` to `Workshop`
6. Setting Up Docker on Windows / Mac
› download `Docker Toolbox`
› install `Docker Toolbox`, with default settings
› open `Docker Quickstart Terminal`
› click `Yes` on the `User Account Control` window, if it appears
› write down the `IP` address (e.g. 192.168.99.100), and then type:
› $ docker pull jupyter/pyspark-notebook
› $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook
› open browser and visit the aforementioned `IP` address, e.g. `192.168.99.100:8888`
› click `New` and then `Python 2`
› rename notebook, from `Untitled` to `Workshop`
8. Useful Links
› install Docker on other platforms:
› Ubuntu: https://www.youtube.com/watch?v=V9AKvZZCWLc
› Mac: https://www.youtube.com/watch?v=lNkVxDSRo7M
› Windows: https://www.youtube.com/watch?v=S7NVloq0EBc
› download the course material:
› datasets: http://bit.ly/23hdtq9
› notebooks: http://bit.ly/23hdsCO
› presentations: http://bit.ly/1TFttfE
› complete the course survey:
› http://bit.ly/1MC4xUF
› read the Apache Spark documentation:
› http://bit.ly/1UQBgrP
9. Simple Examples
plist = sc.parallelize(range(10000)) # from python list
path = "/home/jovyan/work/datasets/" # set datasets' path
tfile = sc.textFile(path+"hamlet.txt") # from text file
print(tfile.count()) # count lines
print(plist.count()) # count elements
plist.takeSample(False, 5) # sample and collect elements
fv = plist.filter(lambda x: x < 10) # filter elements
print(fv.count()) # count filtered elements
print(fv.collect()) # collect filtered elements
fv.reduce(lambda l,r: l + r) # merge filtered elements with an associative function
fv.saveAsTextFile(path+"filtered-elements.txt") # write filtered elements to local file system
11. Spark in a Nutshell
› general cluster computing platform:
› distributed in-memory computational framework
› SQL, Machine Learning, Stream Processing, etc.
› easy to use, powerful, high-level API:
› Scala, Java, Python and R
12. Limitations of MapReduce
› MapReduce use cases showed two major limitations:
› difficulty of programming directly in MapReduce
› performance bottlenecks, or batch not fitting the use cases
› in short, MR doesn’t compose well for large applications
› therefore, people built specialized systems as workarounds
14. Advantages of Spark
› handles batch, interactive, and real-time within a single framework
› native integration with Java, Python, Scala, R
› programming at a higher level of abstraction
› more general: map/reduce is just one set of supported constructs
15. Advantages (cont.): Generalized MapReduce
› unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within the same engine
› two reasonably small additions are enough to express the previous models:
› fast data sharing
› general DAGs
› this allows for an approach which is more efficient for the engine, and much simpler
for the end users
18. High Performance
› in-memory cluster computing
› ideal for iterative algorithms
› faster than Hadoop:
› 10x on disk
› 100x in memory
19. Brief History
› originally developed in 2009, UC Berkeley AMP Lab
› open-sourced in 2010
› as of 2014, Spark is a top-level Apache project
› fastest open-source engine for sorting 100 TB:
› won the 2014 Daytona GraySort contest
› throughput: 4.27 TB/min
20. End Users
› Data Scientists:
› analyze and model data
› data transformations and prototyping
› statistics and machine learning
› Data Engineers:
› implement production data processing systems
› require a reasonable API for distributed processing
› reliable, high performance, easy to monitor platform
21. Resilient Distributed Dataset
› RDD is an immutable and partitioned collection. RDD comes from the acronym:
› resilient: it can be recreated, when data in memory is lost
› distributed: stored in memory across the cluster
› dataset: data that comes from file or created programmatically
22. Resilient Distributed Dataset (cont.)
› an RDD feels like coding with a typical Scala collection; an RDD can be built:
› directly from a datasource (e.g., text file, HDFS, etc.),
› or by applying a transformation to another RDD
› main features:
› RDDs are computed lazily
› automatically rebuild on failure
› persistence for reuse (RAM and/or disk)
23. RDD Fault Tolerance
[lineage diagram: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))]
messages = textFile("file.log").filter(_.contains("error")).map(_.split('\t')(2))
› RDDs are the primary abstraction in Spark; a fault-tolerant collection of elements
that can be operated on in parallel
› RDDs track the series of transformations used to build them (their lineage) in order to recompute lost data
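The same pipeline can be written in PySpark; a minimal sketch, assuming a local file.log whose lines contain tab-separated fields (both are illustrative, not part of the original example):
# hypothetical PySpark equivalent of the Scala lineage example above
lines = sc.textFile("file.log")                    # base RDD read from a file
errors = lines.filter(lambda l: "error" in l)      # filtered RDD
messages = errors.map(lambda l: l.split("\t")[2])  # mapped RDD
# if a partition of 'messages' is lost, Spark replays this chain of
# transformations (its lineage) to rebuild it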
24. Loading and Saving RDDs
› File Systems: Local FS, Amazon S3 and HDFS
› Supported formats: Text files, JSON, Hadoop sequence files, parquet files, protocol
buffers and object files
› Structured data with Spark SQL: Hive, JSON, JDBC, Cassandra, HBase and
ElasticSearch
27. The Spark Context
› first thing that a Spark program does is create a SparkContext object, which tells
Spark how to access a cluster
› in the shell for either Scala or Python, this is the sc variable, which is created
automatically
› other programs must use a constructor to instantiate a new SparkContext (see the sketch below)
› then in turn SparkContext gets used to create other variables
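A minimal sketch of doing this in a standalone Python program; the app name and master value are placeholders, not taken from the slides:
# hypothetical standalone script: build a SparkContext explicitly
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("Workshop").setMaster("local[*]")  # placeholder values
sc = SparkContext(conf=conf)
# ... use sc to create RDDs, broadcast variables, accumulators, etc.
sc.stop()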
28. The Spark Master
› the master parameter for a SparkContext determines which cluster to use:
master               description
local                run Spark locally with one worker thread (i.e. no parallelism at all)
local[*]             run Spark locally with as many worker threads as logical cores on your machine
spark://HOST:PORT    connect to the given Spark standalone cluster master (port 7077 by default)
mesos://HOST:PORT    connect to the given Mesos cluster (port 5050 by default)
yarn                 connect to a YARN cluster in client or cluster mode (YARN_CONF_DIR variable)
29. The Spark Master (cont.)
[architecture diagram: Driver Program (SparkContext) → Cluster Manager → Worker Nodes, each with an Executor, cache, and tasks]
› connects to a cluster manager, which allocates resources across applications
› acquires executors on cluster nodes: worker processes that run computations and store data
› sends app code to the executors
› sends tasks for the executors to run
30. Word Count
› What is the goal? Count how often each word appears in a collection of text documents.
› Why is this so popular? Simple program provides a good test case for parallel
processing, since it:
› requires a minimal amount of code
› demonstrates use of both symbolic and numeric values
› isn’t many steps away from search indexing
› serves as a “Hello World” for big data applications
› Why should I care? A distributed computing framework that can run Word Count
efficiently in parallel at scale can likely handle much larger and more interesting
compute problems.
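Word Count itself takes only a few lines in PySpark; a minimal sketch, reusing the path and hamlet.txt dataset from the earlier examples:
# minimal word count sketch over the course's hamlet.txt dataset
lines = sc.textFile(path + "hamlet.txt")
counts = (lines.
    flatMap(lambda line: line.split()).       # split each line into words
    map(lambda word: (word, 1)).              # emit a (word, 1) pair per word
    reduceByKey(lambda l, r: l + r))          # sum the counts per word
print(counts.takeOrdered(5, key=lambda kv: -kv[1]))  # five most frequent words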
36. Spark Deconstructed (cont.) 3/3
# count requests based on status code
print('with status "200": %d' % splittedRDD.filter(lambda x: u'" 200 ' in x).count())
38. RDD Operations
› two types of operations on RDDs: transformations and actions
› transformations are lazy (not computed immediately)
› the transformed RDD gets recomputed when an action is run on it (default)
› however, an RDD can be persisted into storage in memory or disk
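To make the laziness concrete, a minimal sketch (variable names are illustrative only):
# transformations only record a plan; nothing runs yet
nums = sc.parallelize(range(1000000))
doubled = nums.map(lambda x: x * 2)  # returns immediately, no computation
# the action below triggers computation of the whole chain
print(doubled.count())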
39. RDD Operations (cont.)
› Transformations: define new RDDs based on the current one, e.g., filter, map, reduceByKey, groupBy, etc.
› Actions: return values to the driver, e.g., count, sum, collect, etc.
40. Transformations Vs. Actions: Basic Examples
# transformation 1: create RDD lazily
nums = sc.parallelize((1, 2, 3, 4, 5))
# transformation 2: pass each element through a function
squares = nums.map(lambda x: x * x)
# transformation 3: keep elements passing a predicate
evens = squares.filter(lambda x: x % 2 == 0)
# transformation 4: map each element to zero or more others
flats = nums.flatMap(lambda x: range(1, x+1))
# action 1: collect 'nums'
print(nums.collect())
# action 2: collect 'evens'
print(evens.collect())
42. Transformations Vs. Actions: K,V Examples
# transformation 1: create RDD lazily
petsAll = sc.parallelize((("cat", 1), ("dog", 1), ("cat", 2)))
# transformation 2: filter by key
petsCat = petsAll.filter(lambda (k,v): k == "cat")
# action 1: increase values by 1, then collect
petsAll.map(lambda (k,v): (k, v+1)).collect() # ver.1
petsAll.mapValues(lambda v: v+1).collect() # ver.2
# action 2: sum values by key, then collect
petsAll.reduceByKey(lambda l,r: l+r).collect()
# action 3: group by key, then collect
petsAll.groupByKey().map(lambda (k,v): (k, list(v))).collect()
# action 4: sort by key, then collect
print(petsAll.sortByKey().collect())
43. Transformations Vs. Actions: Join Examples
# transformation 1: RDD[(date, user, clicks)]
clk = sc.textFile(path+"clk.tsv").map(lambda x: x.split("\t"))
# transformation 2: RDD[(date, user, id, lat, lon)]
reg = sc.textFile(path+"reg.tsv").map(lambda x: x.split("\t"))
# transformation 3: RDD[(user, (date, clicks))]
clk_reordered = clk.map(lambda (date, user, clicks): (user, (date, clicks)))
# transformation 4: RDD[(user, (date, id, lat, lon))]
reg_reordered = reg.map(lambda (date, user, id, lat, lon): (user, (date, id, lat, lon)))
# transformation 5: RDD[(user, ((date, clicks), (date, id, lat, lon)))]
joined = clk_reordered.join(reg_reordered)
print(joined.count()) # action 1: print total number of successful joins
print(joined.first()) # action 2: print first element of newly-joined RDD
44. Units of Execution Model
› Job:
› work required to compute an RDD.
› Stage:
› each job is divided to stages.
› Task:
› unit of work within a stage.
› corresponds to one RDD partition.
[diagram: a Job is divided into Stage 0, Stage 1, ...; each Stage contains Task 0, Task 1, ...]
50. Persistence
› when we use the same RDD multiple times:
› Spark will recompute the RDD.
› this is expensive for iterative algorithms.
› Spark can persist RDDs, avoiding re-computations.
› each node stores in memory any slices of it that it computes and reuses them in
other actions on that dataset – often making future actions more than 10x faster.
› the cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it.
51. Levels of Persistence
# how to persist an RDD
result = input.map(<ExpensiveComputation>)
result.persist(LEVEL)
LEVEL                    SPACE  CPU     IN-MEMORY  ON-DISK
MEMORY_ONLY (default)    HIGH   LOW     YES        NO
MEMORY_ONLY_SER          LOW    HIGH    YES        NO
MEMORY_AND_DISK          HIGH   MEDIUM  SOME       SOME
MEMORY_AND_DISK_SER      LOW    HIGH    SOME       SOME
DISK_ONLY                LOW    HIGH    NO         YES
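A minimal sketch of requesting one of these levels from PySpark (the RDD and computation are illustrative):
# persist an RDD at an explicit storage level (illustrative data and computation)
from pyspark import StorageLevel
data = sc.parallelize(range(100000))
result = data.map(lambda x: x * x)            # stand-in for an expensive computation
result.persist(StorageLevel.MEMORY_AND_DISK)  # pick one of the levels above
result.count()      # first action computes and caches the partitions
result.count()      # later actions reuse the cached partitions
result.unpersist()  # manually remove from the cache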
52. Persistence Behaviour
› each node will store its computed partition.
› in case of a failure, Spark recomputes the missing partitions.
› least recently used (LRU) cache eviction policy:
› memory-only: evicted partitions are recomputed when needed.
› memory-and-disk: evicted partitions are written to disk.
› manually remove from cache: unpersist()
53. Shared Variables
› Accumulators: aggregate values from worker nodes back to the driver program.
› Broadcast Variables: distribute values to all worker nodes.
54. Broadcast Variables
› closures and the variables they use are sent separately to each task. we may want to share some variable (e.g., a map) across tasks/operations. this can be done efficiently with broadcast variables:
› broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
› for example, to give every node a copy of a large input dataset efficiently.
› Spark also attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost.
55. Example Without Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
# CAUTION: regDict is sent along with every task!
joined = (clk_reordered.
    map(lambda (user, (date, clicks)): (user, ((date, clicks), regDict[user]))))
# let's have a look at the output, transformed dataset
print(joined.first())
print(joined.count())
56. Example With Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
bcDict = sc.broadcast(regDict)
# bcDict is a read-only variable, cached on each machine
joined = (clk_reordered.
    map(lambda (user, (date, clicks)): (user, ((date, clicks), bcDict.value[user]))))
# let's have a look at the output, transformed dataset
print(joined.first())
print(joined.count())
57. Accumulators
› accumulators are variables that can only be “added” to through an associative
operation.
› used to implement counters and sums, efficiently in parallel.
› Spark natively supports accumulators of numeric value types and standard
mutable collections, and programmers can extend for new types.
› only the driver program can read an accumulator’s value, not the tasks.
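A minimal accumulator sketch in PySpark, following the pattern above (the values are illustrative):
# sum elements in parallel with an accumulator (illustrative values)
accum = sc.accumulator(0)
sc.parallelize((1, 2, 3, 4)).foreach(lambda x: accum.add(x))
print(accum.value)  # 10; only the driver can read the value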
59. Accumulators and Fault Tolerance
› Safe: Updates inside actions will only be applied once.
› Unsafe: Updates inside transformations may be applied more than once!
66. Create DataFrame
# create DataFrame from JSON file
df = sqlContext.read.json(path+"people.json")
# display the schema of the DataFrame
df.schema
# display the schema in a tree format
df.printSchema()
# display the content of the DataFrame
df.show()
67. DataFrame Operations
# select only the "name" column
df.select("name").show()
# select everybody but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# select people older than 21
df.filter(df['age'] > 21).show()
# count people by age
df.groupBy("age").count().show()
68. Infer the Schema with Reflection
# infer the schema and register the DataFrame as a table
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
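For context, a minimal sketch of how a people RDD of Row objects might be built beforehand; the people.txt file and its comma-separated name,age layout are assumptions, not part of the slides:
# hypothetical construction of the 'people' RDD used above
from pyspark.sql import Row
lines = sc.textFile(path + "people.txt")               # assumed input file
parts = lines.map(lambda l: l.split(","))              # assumed "name,age" layout
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))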
69. Run SQL Queries Programmatically
# run SQL over DataFrames that have been registered as a table
teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19")
70. DataFrames Interoperating with RDDs
# the results of SQL queries are RDDs and support all the normal RDD operations
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
71. Parquet Support via DataFrame Interface
# display the schema of the DataFrame
schemaPeople.schema
# display the schema in a tree format
schemaPeople.printSchema()
# display the content of the DataFrame
schemaPeople.show()
# DataFrames can be saved as Parquet files maintaining the schema information
schemaPeople.write.parquet(path+"people.parquet")
# Parquet files are self-describing so the schema is preserved; the result is also a DataFrame
parquetFile = sqlContext.read.parquet(path+"people.parquet")
72. Regression
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
def prepareDump(row):
    return LabeledPoint(row[0], Vectors.dense((row[1],row[2],...,row[10],row[11])))
# dummy split into train and test set
trainSet = dump.filter(lambda x: x.features[9] <= 4000)
testSet = dump.filter(lambda x: x.features[9] > 4000)
# build regression model: without such a small step size, the algorithm would diverge
model = LinearRegressionWithSGD.train(data=trainSet, iterations=100, step=0.000000001)
73. Regression (cont.)
# evaluate regression model
valuesANDpredictions = (testSet.
    map(lambda p: (p.label, model.predict(p.features))))
# print simple statistics about the model
from math import sqrt
mse = (valuesANDpredictions.
    map(lambda (v, p): (v - p) * (v - p)).
    sum()) / float(valuesANDpredictions.count())
print("mean squared error is: %.3f" % mse)
print("root mean squared error is: %.3f" % sqrt(mse))
74. Classification
# import required libraries
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import DecisionTree
# prepare dump
dump = (dump.
map(lambda line: prepareDump(line)).
map(lambda line: LabeledPoint(line[0],Vectors.dense(line[1]))))
75. Classification (cont.)
# build classification model
categoricalFeaturesInfo = {}
model = DecisionTree.trainClassifier(
    dump,                     # training data
    2,                        # number of classes
    categoricalFeaturesInfo,  # all features are continuous
    "gini",                   # impurity
    5,                        # max depth
    32)                       # max bins
# evaluate model
actual = dump.map(lambda x: x.label)
predicted = model.predict(dump.map(lambda x: x.features))
actualANDpredicted = actual.zip(predicted)
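A minimal sketch of turning these pairs into an accuracy figure (computed on the training data here, purely for illustration):
# fraction of points whose predicted label matches the actual label
correct = actualANDpredicted.filter(lambda (a, p): a == p).count()
accuracy = correct / float(dump.count())
print("accuracy is: %.3f" % accuracy)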
76. Clustering
# import required libraries
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans
# convert original data points into dense format
dump = dump.map(lambda line: Vectors.dense(line))
clusters = 2
iterations = 20
model = KMeans.train(dump, clusters, maxIterations=iterations)
# get the centers of the 2 clusters
_2_centers = [tuple(c) for c in model.clusterCenters]
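A minimal sketch of inspecting the model's cluster assignments for a few points (collected to the driver for illustration):
# predict the cluster index (0 or 1) for a handful of points
for point in dump.take(5):
    print(model.predict(point))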
77. Recommendations
# import required libraries
from pyspark.mllib.recommendation import Rating, ALS
# dummy split into three sets, namely train, validation and test
train = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) < 6))).
map(lambda x: parseRatings2(x)))
validation = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) >= 6) and ((x[3] % 10) < 8))).
map(lambda x: parseRatings2(x)))
test = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) >= 8))).
map(lambda x: parseRatings2(x)))
78. Recommendations (cont.)
# build model
rank = 10; iterations = 20
model = ALS.train(train,rank,iterations=iterations)
# make predictions
predictions = (model.
    predictAll(validation.map(lambda (user,product,rating): (user,product))))
# join validation set with predictions
ratingsANDpredictions = ((validation.
map(lambda (user,product,rating): ((user,product),rating))).
join(predictions.map(lambda (user,product,rating): ((user,product),rating))))
# evaluate the performance of the predictor
mse = ((ratingsANDpredictions.
    map(lambda ((user,product),(rating,prediction)):
        (rating - prediction) * (rating - prediction)).sum()) /
    float(ratingsANDpredictions.count()))
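The error can then be reported as on the regression slide; a minimal sketch:
# report the error of the recommender on the validation set
from math import sqrt
print("mean squared error is: %.3f" % mse)
print("root mean squared error is: %.3f" % sqrt(mse))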
80. Get Help and Contribute
› user@spark.apache.org
› usage questions, help, announcements.
› dev@spark.apache.org
› for people who want to contribute code!
81. Courses and Certifications
› Introduction to Spark (edX), Apr. 14, 2016
› Big Data Analysis with Spark (edX), May 19, 2016
› Distributed Machine Learning with Spark (edX), Jun. 2016
› Adv. Distributed Machine Learning with Spark (edX), Aug. 2016
› Adv. Spark for Data Science & Data Engineering (edX), Oct. 2016
› Data Science & Engineering with Spark (edX), TBA