Apache Spark
Workshop
Samos, 02/04/2016
HELLO!
I am Euangelos Linardos
Data Scientist at Pollfish
Outline
› Part I: Setup Environment
› Ubuntu / Mac / Windows
› Part II: Introduction to Spark
› History / Features / Examples
› Part III: Hands-On Training
› Core / SQL / MLlib
Part I:
Setup Environment
(...in seven easy steps!)
Setting Up Docker on Ubuntu
› $ apt-get update
› $ apt-get -y install docker.io
› $ ln -sf /usr/bin/docker.io /usr/local/bin/docker
› $ sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io
› $ update-rc.d docker.io defaults
› $ docker pull jupyter/pyspark-notebook:latest
› $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook:latest
› # open browser and visit the address `localhost:8888`
› # click `New` and then `Python 2`
› # rename notebook, from `Untitled` to `Workshop`
Setting Up Docker on Windows / Mac
› download `Docker Toolbox`
› install `Docker Toolbox`, with default settings
› open `Docker Quickstart Terminal`
› click `Yes` on the `User Account Control` window, if it appears
› write down the `IP` address (e.g. 192.168.99.100), and then type:
› $ docker pull jupyter/pyspark-notebook
› $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook
› open browser and visit the aforementioned `IP` address, e.g. `192.168.99.100:8888`
› click `New` and then `Python 2`
› rename notebook, from `Untitled` to `Workshop`
Validate Setup
# import required libraries
from pyspark import SparkConf, SparkContext
# create spark context
sc = SparkContext(conf=(SparkConf().setMaster("local[*]")))
# print spark context
print(sc)
# print spark configuration
print(sc._conf.getAll())
Useful Links
› install Docker on other platforms:
› Ubuntu: https://www.youtube.com/watch?v=V9AKvZZCWLc
› Mac: https://www.youtube.com/watch?v=lNkVxDSRo7M
› Windows: https://www.youtube.com/watch?v=S7NVloq0EBc
› download the course material:
› datasets: http://bit.ly/23hdtq9
› notebooks: http://bit.ly/23hdsCO
› presentations: http://bit.ly/1TFttfE
› complete the course survey:
› http://bit.ly/1MC4xUF
› read the Apache Spark documentation:
› http://bit.ly/1UQBgrP
Simple Examples
plist = sc.parallelize(range(10000)) # from python list
path = "/home/jovyan/work/datasets/" # set datasets' path
tfile = sc.textFile(path+"hamlet.txt") # from text file
print(tfile.count()) # count lines
print(plist.count()) # count elements
plist.takeSample(False, 5) # sample and collect elements
fv = plist.filter(lambda x: x < 10) # filter elements
print(fv.count()) # count filtered elements
print(fv.collect()) # collect filtered elements
fv.reduce(lambda l,r: l + r) # merge filtered elements with an associative function
fv.saveAsTextFile(path+"filtered-elements.txt") # write filtered elements to local file system
Part II:
Introduction to Spark
(section 1: get to know spark)
Spark in a Nutshell
› general cluster computing platform:
› distributed in-memory computational framework
› SQL, Machine Learning, Stream Processing, etc.
› easy to use, powerful, high-level API:
› Scala, Java, Python and R
Limitations of MapReduce
› MapReduce use cases showed two major limitations:
› difficulty of programming directly in MapReduce
› performance bottlenecks, or batch not fitting the use cases
› in short, MR doesn’t compose well for large applications
› therefore, people built specialized systems as workarounds
[diagram: MapReduce as the general batch processing engine, surrounded by the specialized systems built as workarounds: Giraph, Tez, Pregel, S4, Pig, GraphLab, Impala, Dremel, Drill, Storm (iterative, interactive, streaming, graph, etc.)]
Limitations (cont.): Specialized Systems
Advantages of Spark
› handles batch, interactive, and real-time within a single framework
› native integration with Java, Python, Scala, R
› programming at a higher level of abstraction
› more general: map/reduce is just one set of supported constructs
Advantages (cont.): Generalized MapReduce
› unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within the same engine
› two reasonably small additions are enough to express the previous models:
› fast data sharing
› general DAGs
› this allows for an approach which is more efficient for the engine, and much simpler
for the end users
Code Size
same functionality yet
in the form of libraries
[diagram: the unified stack: Spark Core with the Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX libraries on top, running on the Standalone, YARN or Mesos cluster managers]
Unified Stack
High Performance
› in-memory cluster computing
› ideal for iterative algorithms
› faster than Hadoop:
› 10x on disk
› 100x in memory
Brief History
› originally developed in 2009, UC Berkeley AMP Lab
› open-sourced in 2010
› as of 2014, Spark is a top-level Apache project
› fastest open-source engine for sorting 100 TB:
› won the 2014 Daytona GraySort contest
› throughput: 4.27 TB/min
End Users
› Data Scientists:
› analyze and model data
› data transformations and prototyping
› statistics and machine learning
› Data Engineers:
› implement production data processing systems
› require a reasonable API for distributed processing
› reliable, high performance, easy to monitor platform
Resilient Distributed Dataset
› an RDD is an immutable and partitioned collection; the acronym stands for:
› resilient: it can be recreated when data in memory is lost
› distributed: stored in memory across the cluster
› dataset: data that comes from a file or is created programmatically
[diagram: an RDD and its partitions]
Resilient Distributed Dataset (cont.)
› working with an RDD feels like coding with typical Scala collections; an RDD can be built:
› directly from a datasource (e.g., text file, HDFS, etc.),
› or by applying a transformation to another RDD
› main features:
› RDDs are computed lazily
› automatically rebuild on failure
› persistence for reuse (RAM and/or disk)
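A minimal sketch of these features, reusing the sc and path variables from the setup slides:
lines = sc.textFile(path+"hamlet.txt") # built directly from a datasource
words = lines.flatMap(lambda x: x.split(' ')) # built by applying a transformation to another RDD
words.persist() # marked for reuse; nothing is computed yet (RDDs are lazy)
print(words.count()) # first action: computes and caches the partitions
print(words.count()) # second action: reuses the cached partitions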
[lineage diagram: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))]
messages = textFile("file.log").filter(_.contains("error")).map(_.split('\t')(2))
RDD Fault Tolerance
› RDDs are the primary abstraction in Spark; a fault-tolerant collection of elements
that can be operated on in parallel
› RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data
Loading and Saving RDDs
› File Systems: Local FS, Amazon S3 and HDFS
› Supported formats: Text files, JSON, Hadoop sequence files, parquet files, protocol
buffers and object files
› Structured data with Spark SQL: Hive, JSON, JDBC, Cassandra, HBase and
ElasticSearch
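A brief sketch of a few of these combinations (the JSON read assumes the sqlContext created later in Part III; file names come from the workshop datasets):
rdd = sc.textFile(path+"hamlet.txt") # plain text from the local FS (s3:// and hdfs:// URIs work the same way)
rdd.saveAsTextFile(path+"hamlet-copy") # write it back out as a directory of text files
# structured formats go through Spark SQL, e.g. JSON:
# df = sqlContext.read.json(path+"people.json")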
Part II:
Introduction to Spark
(section 2: spark under the hood)
Spark Shell
The Spark Context
› first thing that a Spark program does is create a SparkContext object, which tells
Spark how to access a cluster
› in the shell for either Scala or Python, this is the sc variable, which is created
automatically
› other programs must use a constructor to instantiate a new SparkContext
› then in turn SparkContext gets used to create other variables
master / description:
› local: run Spark locally with one worker thread (i.e. no parallelism at all)
› local[*]: run Spark locally with as many worker threads as logical cores on your machine
› spark://HOST:PORT: connect to the given Spark standalone cluster master (port 7077 by default)
› mesos://HOST:PORT: connect to the given Mesos cluster (port 5050 by default)
› yarn: connect to a YARN cluster in client or cluster mode (YARN_CONF_DIR variable)
The Spark Master
› the master parameter for a SparkContext determines which cluster to use:
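For example, switching clusters is only a matter of the master URL passed to SparkConf; a sketch (the spark:// address is hypothetical, and this replaces the context created in the Validate Setup step, since only one SparkContext can be active):
conf = SparkConf().setAppName("Workshop").setMaster("local[*]") # local mode, one worker thread per logical core
# conf = conf.setMaster("spark://master-host:7077") # or point the same app at a standalone cluster
sc = SparkContext(conf=conf)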
[diagram: Driver Program (SparkContext) → Cluster Management → Worker Nodes, each running an Executor with a cache and tasks]
The Spark Master (cont.)
› connects to a cluster manager which allocates resources across applications
› acquires executors on cluster nodes – worker processes to run computations
and store data
› sends app code to the executors
› sends tasks for the executors to run
Word Count
› What is the goal? Count how often each word appears in a collection of text documents.
› Why is this so popular? Simple program provides a good test case for parallel
processing, since it:
› requires a minimal amount of code
› demonstrates use of both symbolic and numeric values
› isn’t many steps away from search indexing
› serves as a “Hello World” for big data applications
› Why should I care? A distributed computing framework that can run Word Count
efficiently in parallel at scale can likely handle much larger and more interesting
compute problems.
Word Count (cont.)
# calculate word frequencies
counts = (tfile.
flatMap(lambda x: x.split(' ')).
filter(lambda x: len(x) > 0).
map(lambda x: (x, 1)).
reduceByKey(lambda l,r: l + r).
sortBy(lambda x: x[1], ascending=False))
# print (word,count) sample
print(counts.take(5))
Mining Logs
# base RDD
logRDD = sc.textFile(path+"logs.txt")
# transformed RDDs
filteredRDD = logRDD.filter(lambda x: u' "GET ' in x)
splittedRDD = filteredRDD.map(lambda x: x.split(u' "GET ')).map(lambda x: x[1])
# count requests based on status code
print('with status “200”: %d' % splittedRDD.filter(lambda x: u'" 200 ' in x).count())
print('without status “200”: %d' % splittedRDD.filter(lambda x: u'" 200 ' not in x).count())
[diagram: RDD → transformation(s) → RDD → action → value]
Spark Deconstructed
› Looking at the RDD transformations and actions from another perspective:
Spark Deconstructed (cont.) 1/3
# base RDD
logRDD = sc.textFile(path+"logs.txt")
[diagram: base RDD]
Spark Deconstructed (cont.) 2/3
# transformed RDDs
filteredRDD = logRDD.filter(lambda x: u' "GET ' in x)
splittedRDD = filteredRDD.map(lambda x: x.split(u' "GET ')).map(lambda x: x[1])
[diagram: transformation(s) → RDD]
Spark Deconstructed (cont.) 3/3
# count requests based on status code
print('with status “200”: %d' % splittedRDD.filter(lambda x: u'" 200 ' in x).count())
[diagram: action → value]
Rich, High-level API
Transformations
map
filter
flatMap
sample
union
distinct
groupByKey
reduceByKey
sortByKey
join
...
Actions
reduce
collect
count
first
take
takeSample
saveAsTextFile
countByKey
foreach
saveAsSequenceFile
...
RDD Operations
› two types of operations on RDDs: transformations and actions
› transformations are lazy (not computed immediately)
› the transformed RDD gets recomputed when an action is run on it (default)
› however, an RDD can be persisted into storage in memory or disk
RDD Operations (cont.)
› Transformations: define new RDDs based on the current one, e.g., filter, map, reduceByKey, groupBy, etc. [base RDD → new RDD]
› Actions: return values to the driver, e.g., count, sum, collect, etc. [RDD → value]
Transformations Vs. Actions: Basic Examples
# transformation 1: create RDD lazily
nums = sc.parallelize((1, 2, 3, 4, 5))
# transformation 2: pass each element through a function
squares = nums.map(lambda x: x * x)
# transformation 3: keep elements passing a predicate
evens = squares.filter(lambda x: x % 2 == 0)
# transformation 4: map each element to zero or more others
flats = nums.flatMap(lambda x: range(1, x+1))
# action 1: collect 'nums'
print(nums.collect())
# action 2: collect 'evens'
print(evens.collect())
Transformations: Examples Illustrated
[lineage diagram: nums (ParallelCollectionRDD) → nums.flatMap(...) → flats (FlatMappedRDD) → collect() → value 1; nums (ParallelCollectionRDD) → nums.map(...) → squares (MappedRDD) → squares.filter(...) → evens (FilteredRDD) → collect() → value 2]
Transformations Vs. Actions: K,V Examples
# transformation 1: create RDD lazily
petsAll = sc.parallelize((("cat", 1), ("dog", 1), ("cat", 2)))
# transformation 2: filter by key
petsCat = petsAll.filter(lambda (k,v): k == "cat")
# action 1: increase values by 1, then collect
petsAll.map(lambda (k,v): (k, v+1)).collect() # ver.1
petsAll.mapValues(lambda v: v+1).collect() # ver.2
# action 2: sum values by key, then collect
petsAll.reduceByKey(lambda l,r: l+r).collect()
# action 3: group by key, then collect
petsAll.groupByKey().map(lambda (k,v): (k, list(v))).collect()
# action 4: sort by key, then collect
print(petsAll.sortByKey().collect())
Transformations Vs. Actions: Join Examples
# transformation 1: RDD[(date, user, clicks)]
clk = sc.textFile(path+"clk.tsv").map(lambda x: x.split("\t"))
# transformation 2: RDD[(date, user, id, lat, lon)]
reg = sc.textFile(path+"reg.tsv").map(lambda x: x.split("\t"))
# transformation 3: RDD[(user, (date, clicks))]
clk_reordered = clk.map(lambda (date, user, clicks): (user, (date, clicks)))
# transformation 4: RDD[(user, (date, id, lat, lon))]
reg_reordered = reg.map(lambda (date, user, id, lat, lon): (user, (date, id, lat, lon)))
# transformation 5: RDD[(user, ((date, clicks), (date, id, lat, lon)))]
joined = clk_reordered.join(reg_reordered)
print(joined.count()) # action 1: print total number of successful joins
print(joined.first()) # action 2: print first element of newly-joined RDD
Units of Execution Model
› Job:
› work required to compute an RDD.
› Stage:
› each job is divided to stages.
› Task:
› unit of work within a stage.
› corresponds to one RDD partition.
[diagram: a Job is divided into Stages (Stage 0, Stage 1, ...), and each Stage into Tasks (Task 0, Task 1, ...)]
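A rough sketch of how this plays out: one action submits one job, and the shuffle introduced by reduceByKey splits it into two stages, each running one task per partition:
pairs = sc.parallelize(range(1000), 4).map(lambda x: (x % 10, 1)) # 4 partitions, so 4 tasks per stage
counts = pairs.reduceByKey(lambda l, r: l + r) # shuffle boundary between the two stages
print(counts.count()) # the action submits a single job
print(counts.toDebugString()) # the indentation marks the stage boundary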
Execution Model
[diagram: the Driver Program (SparkContext) dispatches tasks to Executors, each with a cache, running on the Worker Nodes]
Lineage Graph
# calculate word frequencies
counts = (sc.textFile(path+"hamlet.txt"). # MappedRDD[1], HadoopRDD[0]
flatMap(lambda x: x.split(' ')). # FlatMappedRDD[2]
map(lambda x: (x, 1)). # MappedRDD[3]
reduceByKey(lambda l,r: l + r)) # ShuffledRDD[4]
# print lineage graph representation
print(counts.toDebugString())
[lineage: [0] HadoopRDD → [1] MappedRDD → [2] FlatMappedRDD → [3] MappedRDD → [4] ShuffledRDD]
Lineage Graph (cont.)
# calculate word frequencies
counts = (sc.textFile(path+"hamlet.txt"). # MappedRDD[1], HadoopRDD[0]
flatMap(lambda x: x.split(' ')). # FlatMappedRDD[2]
map(lambda x: (x, 1)). # MappedRDD[3]
reduceByKey(lambda l,r: l + r)) # ShuffledRDD[4]
# print lineage graph representation
print(counts.toDebugString())
[lineage: [0] HadoopRDD → [1] MappedRDD → [2] FlatMappedRDD → [3] MappedRDD → [4] ShuffledRDD]
Execution Plan
# calculate word frequencies
counts = (sc.textFile(path+"hamlet.txt"). # MappedRDD[1], HadoopRDD[0]
flatMap(lambda x: x.split(' ')). # FlatMappedRDD[2]
map(lambda x: (x, 1)). # MappedRDD[3]
reduceByKey(lambda l,r: l + r)) # ShuffledRDD[4]
# print lineage graph representation
print(counts.toDebugString())
[execution plan: [0] HadoopRDD → [1] MappedRDD → [2] FlatMappedRDD → [3] MappedRDD → [4] ShuffledRDD; Stage 1 covers [0]-[3] up to the shuffle, Stage 2 covers [4]]
Part II:
Introduction to Spark
(section 3: advanced features)
Persistence
› when we use the same RDD multiple times:
› Spark will recompute the RDD.
› expensive for iterative algorithms.
› Spark can persist RDDs, avoiding re-computations.
› each node stores in memory any slices of it that it computes and reuses them in
other actions on that dataset – often making future actions more than 10x faster.
› the cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it.
Levels of Persistence
# how to persist an RDD
result = input.map(<ExpensiveComputation>)
result.persist(LEVEL)
LEVEL SPACE CPU IN-MEMORY ON-DISK
MEMORY_ONLY (default) HIGH LOW YES NO
MEMORY_ONLY_SER LOW HIGH YES NO
MEMORY_AND_DISK HIGH MEDIUM SOME SOME
MEMORY_AND_DISK_SER LOW HIGH SOME SOME
DISK_ONLY LOW HIGH NO YES
Persistence Behaviour
› each node will store its computed partition.
› in case of a failure, Spark recomputes the missing partitions.
› least recently used (LRU) cache eviction policy:
› memory-only: evicted partitions are recomputed when needed again.
› memory-and-disk: evicted partitions are spilled to disk instead of being recomputed.
› manually remove from cache: unpersist()
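For example, the levels above are exposed through StorageLevel in PySpark; a brief sketch, reusing the result RDD persisted on the previous slide:
from pyspark import StorageLevel
result.persist(StorageLevel.MEMORY_AND_DISK) # keep it around across actions, spilling to disk if needed
result.unpersist() # manually drop it from the cache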
Shared Variables
› Accumulators: aggregate values from worker nodes back to the driver program.
› Broadcast Variables: distribute values to all worker nodes.
Broadcast Variables
› closures and the variables they use are sent separately to each task; we may want to share some variable (e.g., a map) across tasks/operations, and this can be done efficiently with broadcast variables:
› broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks.
› for example, to give every node a copy of a large input dataset efficiently.
› Spark also attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost.
Example Without Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
# CAUTION: regDict is sent along with every task!
joined = (clk_reordered.
    map(lambda (user, (date, clicks)): (user, ((date, clicks), regDict[user]))))
# let's have a look on the output, transformed dataset
print(joined.first())
print(joined.count())
Example With Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
bcDict = sc.broadcast(regDict)
# bcDict is a read-only variable, cached on each machine
joined = (clk_reordered.
    map(lambda (user, (date, clicks)): (user, ((date, clicks), bcDict.value[user]))))
# let's have a look on the output, transformed dataset
print(joined.first())
print(joined.count())
Accumulators
› accumulators are variables that can only be “added” to through an associative
operation.
› used to implement counters and sums, efficiently in parallel.
› Spark natively supports accumulators of numeric value types and standard
mutable collections, and programmers can extend for new types.
› only the driver program can read an accumulator’s value, not the tasks.
Example with Accumulators
# initialize accumulators
acc_sum = sc.accumulator(0)
acc_cnt = sc.accumulator(0)
# define auxiliary functions
def acc(size):
    acc_sum.add(size)
    acc_cnt.add(1)
# increase accumulators: values are stored on driver
(splittedRDD.
filter(lambda x: len(x) > 0).
flatMap(lambda x: x.split(" ")).
map(lambda x: len(x)).
foreach(lambda x: acc(x)))
Accumulators and Fault Tolerance
› Safe: updates inside actions will only be applied once.
› Unsafe: updates inside transformations may be applied more than once!
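A small sketch of the difference (variable names are illustrative): the update inside the map may run more than once if a partition is recomputed, while the update inside foreach is applied exactly once per element:
safe_cnt = sc.accumulator(0)
unsafe_cnt = sc.accumulator(0)
def tag(x):
    unsafe_cnt.add(1) # transformation side effect: re-executed if the partition is recomputed
    return x
data = sc.parallelize(range(100)).map(tag)
data.foreach(lambda x: safe_cnt.add(1)) # action: each update applied exactly once
print(safe_cnt.value, unsafe_cnt.value)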
Part III:
Hands-On Training
(present Core, SQL and MLlib APIs)
Basic Summary Statistics
# define auxiliary functions
def computeStats(column):
    # computeMedian is assumed to be a helper defined elsewhere in the workshop notebook
    return [round(column.count(),0),
            round(column.sum(),3),
            round(column.max(),3),
            round(column.min(),3),
            round(column.mean(),3),
            round(computeMedian(column),3),
            round(column.stdev(),3),
            round(column.variance(),3)]
Basic Summary Statistics (cont.)
# print stats about the dump columns
dat = []
idx = []
for i,h in enumerate(header):
dat.append(computeStats(dump.map(lambda r: r[i])))
idx.append(h)
col = ["count", "sum", "max", "min", "mean", "median", "stdev", "variance"]
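Presumably dat, idx and col then feed a pandas DataFrame for display; a sketch of that final step (an assumption about the notebook, not shown on the slide):
import pandas as pd
print(pd.DataFrame(dat, index=idx, columns=col))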
Correlation Between Series
# import required libraries
from pyspark.mllib.stat import Statistics
# simple example #1
ts_a = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ts_b = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
corr = Statistics.corr(ts_a, ts_b, "pearson")
print("correlation between 'a' and 'b' is: %f" % corr)
# simple example #2
ts_a = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ts_b = sc.parallelize([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
corr = Statistics.corr(ts_a, ts_b, "pearson")
print("correlation between 'a' and 'b' is: %f" % corr)
Correlation Between Series (cont.)
# advanced example (zeros comes from numpy, combinations from itertools)
from numpy import zeros
from itertools import combinations
dat = zeros((len(header), len(header)))
for ((index1, header1), (index2, header2)) in combinations(enumerate(header), 2):
    (property1, property2) = (dump.map(lambda v: v[index1]), dump.map(lambda v: v[index2]))
    dat[index1][index2] = Statistics.corr(property1, property2, "pearson")
Create SQL Context
# import required libraries
from pyspark.sql import SQLContext, Row
# create sql context
sqlContext = SQLContext(sc)
Create DataFrame
# create DataFrame from JSON file
df = sqlContext.read.json(path+"people.json")
# display the schema of the DataFrame
df.schema
# display the schema in a tree format
df.printSchema()
# display the content of the DataFrame
df.show()
DataFrame Operations
# select only the "name" column
df.select("name").show()
# select everybody but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# select people older than 21
df.filter(df['age'] > 21).show()
# count people by age
df.groupBy("age").count().show()
Infer the Schema with Reflection
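The people RDD of Row objects has to exist beforehand; a hedged sketch of how it might be built from a CSV-like text file (the file name and layout are assumptions, and the workshop notebook may construct it differently):
lines = sc.textFile(path+"people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))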
# infer the schema and register the DataFrame as a table
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
Run SQL Queries Programmatically
# run SQL over DataFrames that have been registered as a table
teenagers = (sqlContext.
    sql("SELECT name FROM people WHERE age >= 13 AND age <= 19"))
DataFrames Interoperating with RDDs
# the results of SQL queries are RDDs and support all the normal RDD operations
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
print(teenName)
Parquet Support via DataFrame Interface
# display the schema of the DataFrame
schemaPeople.schema
# display the schema in a tree format
schemaPeople.printSchema()
# display the content of the DataFrame
schemaPeople.show()
# DataFrames can be saved as Parquet files maintaining the schema information
schemaPeople.write.parquet(path+"people.parquet")
# Parquet files are self-describing so the schema is preserved; the result is also a DataFrame
parquetFile = sqlContext.read.parquet(path+"people.parquet")
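A Parquet-backed DataFrame can then be registered and queried like any other table; a short sketch (the table name is illustrative):
parquetFile.registerTempTable("parquetPeople")
sqlContext.sql("SELECT name FROM parquetPeople WHERE age >= 13 AND age <= 19").show()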
Regression
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
def prepareDump(row):
    return LabeledPoint(row[0], Vectors.dense((row[1],row[2],...,row[10],row[11])))
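# (assumption: the raw rows are mapped through prepareDump before the split below; the notebook may do this step elsewhere)
dump = dump.map(prepareDump)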
# dummy split into train and test set
trainSet = dump.filter(lambda x: x.features[9] <= 4000)
testSet = dump.filter(lambda x: x.features[9] > 4000)
# build regression model: without such a small step size, the algorithm would diverge
model = LinearRegressionWithSGD.train(data=trainSet, iterations=100, step=0.000000001)
Regression (cont.)
# evaluate regression model
valuesANDpredictions = (testSet.
    map(lambda p: (p.label, model.predict(p.features))))
# print simple statistics about the model (sqrt comes from the standard math module)
from math import sqrt
mse = (valuesANDpredictions.
map(lambda (v , p): (v - p) * (v - p)).
sum()) / float(valuesANDpredictions.count())
print("mean squared error is: %.3f" % mse)
print("root mean squared error is: %.3f" % sqrt(mse))
Classification
# import required libraries
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import DecisionTree
# prepare dump
dump = (dump.
map(lambda line: prepareDump(line)).
map(lambda line: LabeledPoint(line[0],Vectors.dense(line[1]))))
Classification (cont.)
# build classification model
categoricalFeaturesInfo = {}
model = DecisionTree.trainClassifier(
dump, # dump file
2, # number of classes
categoricalFeaturesInfo, # all features are continuous
"gini", # impurity
5, # max depth
32) # max bins
# evaluate model
actual = dump.map(lambda x: x.label)
predicted = model.predict(dump.map(lambda x: x.features))
actualANDpredicted = actual.zip(predicted)
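From the zipped (actual, predicted) pairs one can, for instance, compute the training error; a brief sketch:
trainErr = actualANDpredicted.filter(lambda ap: ap[0] != ap[1]).count() / float(dump.count())
print("training error is: %.3f" % trainErr)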
Clustering
# import required libraries
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans
# convert the original data points into dense format
dump = dump.map(lambda line: Vectors.dense(line))
clusters = 2
iterations = 20
model = KMeans.train(dump, clusters, maxIterations=iterations)
# get the centers of the 2 clusters
_2_centers = [tuple(c) for c in model.clusterCenters]
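With the trained model one can, for instance, assign every point to a cluster and count the cluster sizes; a minimal sketch:
assignments = dump.map(lambda point: model.predict(point))
print(assignments.countByValue()) # e.g. how many points fall into each of the 2 clusters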
Recommendations
# import required libraries
from pyspark.mllib.recommendation import Rating, ALS
# dummy split into three sets, namely train, validation and test
train = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) < 6))).
map(lambda x: parseRatings2(x)))
validation = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) >= 6) and ((x[3] % 10) < 8))).
map(lambda x: parseRatings2(x)))
test = (ratings.map(lambda x: parseRatings1(x)).
filter(lambda x: (((x[3] % 10) >= 8))).
map(lambda x: parseRatings2(x)))
Recommendations (cont.)
# build model
rank = 10; iterations = 20
model = ALS.train(train,rank,iterations=iterations)
# make predictions
predictions = (model.
    predictAll(validation.map(lambda (user,product,rating): (user,product))))
# join validation set with predictions
ratingsANDpredictions = ((validation.
map(lambda (user,product,rating): ((user,product),rating))).
join(predictions.map(lambda (user,product,rating): ((user,product),rating))))
# evaluate the performance of the predictor
mse = (ratingsANDpredictions.
    map(lambda ((user,product),(rating,prediction)):
        (rating - prediction) * (rating - prediction)).
    sum() / float(ratingsANDpredictions.count()))
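Finally, the MSE can be reported and the model used to produce recommendations; a short sketch (the user id and the availability of recommendProducts, added in Spark 1.4, are assumptions):
print("validation MSE is: %.3f" % mse)
print(model.recommendProducts(1, 5)) # top-5 product recommendations for user 1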
ML Libraries on Spark
› user@spark.apache.org
› usage questions, help, announcements.
› dev@spark.apache.org
› for people who want to contribute code!
Get Help and Contribute
› Introduction to Spark (edX), Apr. 14, 2016
› Big Data Analysis with Spark (edX), May 19, 2016
› Distributed Machine Learning with Spark (edX), Jun. 2016
› Adv. Distributed Machine Learning with Spark (edX), Aug. 2016
› Adv. Spark for Data Science & Data Engineering (edX), Oct. 2016
› Data Science & Engineering with Spark (edX), TBA
Courses and Certifications
Books and Tutorials
THANKS!
Any questions?
You can find me at: @eualin
Mais conteúdo relacionado

Mais procurados

Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Alluxio, Inc.
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceBlueData, Inc.
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBlueData, Inc.
 
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud EraModernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud EraAlluxio, Inc.
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMAlluxio, Inc.
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio, Inc.
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio, Inc.
 
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizonHadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizonDataWorks Summit/Hadoop Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsDataWorks Summit/Hadoop Summit
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 

Mais procurados (20)

Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 Telco
 
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud EraModernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
Modernizing Your Data Platform for Analytics and AI in the Hybrid Cloud Era
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata Services
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
 
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizonHadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Introducing Big Data
Introducing Big DataIntroducing Big Data
Introducing Big Data
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 

Destaque

Data at Pollfish
Data at PollfishData at Pollfish
Data at PollfishPollfish
 
Apache spark workshop
Apache spark workshopApache spark workshop
Apache spark workshopPawel Szulc
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 

Destaque (6)

Data at Pollfish
Data at PollfishData at Pollfish
Data at Pollfish
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Apache spark workshop
Apache spark workshopApache spark workshop
Apache spark workshop
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 

Semelhante a Apache Spark Workshop, Apr. 2016, Euangelos Linardos

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 

Semelhante a Apache Spark Workshop, Apr. 2016, Euangelos Linardos (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark 101
Spark 101Spark 101
Spark 101
 

Último

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Último (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Apache Spark Workshop, Apr. 2016, Euangelos Linardos

  • 2. HELLO! I am Euangelos Linardos Data Scientist at Pollfish
  • 3. Outline › Part I: Setup Environment › Ubuntu / Mac / Windows › Part II: Introduction to Spark › History / Features / Examples › Part III: Hands-On Training › Core / SQL / MLlib
  • 4. Part I: Setup Environment (...in seven easy steps!)
  • 5. Setting Up Docker on Ubuntu › $ apt-get update › $ apt-get -y install docker.io › $ ln -sf /usr/bin/docker.io /usr/local/bin/docker › $ sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io › $ update-rc.d docker.io defaults › $ docker pull jupyter/pyspark-notebook:latest › $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook:latest › # open browser and visit the address `localhost:8888` › # click `New` and then `Python 2` › # rename notebook, from `Untitled` to `Workshop`
  • 6. Setting Up Docker on Windows Mac › download `Docker Toolbox` › install `Docker Toolbox`, with default settings › open `Docker Quickstart Terminal` › click `Yes` on the `User Account Control` window, if it appears › write down the `IP` address (e.g. 192.168.99.100), and then type: › $ docker pull jupyter/pyspark-notebook › $ docker run -d --name workshop -p 8888:8888 jupyter/pyspark-notebook › open browser and visit the aforementioned `IP` address, e.g. `192.168.99.100:8888` › click `New` and then `Python 2` › rename notebook, from `Untitled` to `Workshop`
  • 7. Validate Setup # import required libraries from pyspark import SparkConf, SparkContext # create spark context sc = SparkContext(conf=(SparkConf().setMaster("local[*]"))) # print spark context print(sc) # print spark configuration print(sc._conf.getAll())
  • 8. Useful Links › install Docker on other platforms: › Ubuntu: https://www.youtube.com/watch?v=V9AKvZZCWLc › Mac: https://www.youtube.com/watch?v=lNkVxDSRo7M › Windows: https://www.youtube.com/watch?v=S7NVloq0EBc › download the course material: › datasets: http://bit.ly/23hdtq9 › notebooks: http://bit.ly/23hdsCO › presenations: http://bit.ly/1TFttfE › complete the course survey: › http://bit.ly/1MC4xUF › read the Apache Spark documentation: › http://bit.ly/1UQBgrP
  • 9. Simple Examples plist = sc.parallelize(range(10000)) # from python list path = "/home/jovyan/work/datasets/" # set datasets' path tfile = sc.textFile(path+"hamlet.txt") # from text file print(tfile.count()) # count lines print(plist.count()) # count elements plist.takeSample(False, 5) # sample and collect elements fv = plist.filter(lambda x: x < 10) # filter elements print(fv.count()) # count filtered elements print(fv.collect()) # collect filtered elements fv.reduce(lambda l,r: l + r) # merge filtered elements with an associative function fv.saveAsTextFile(path+"filtered-elements.txt") # write filtered elements to local file system
  • 10. Part II: Introduction to Spark (section 1: get to know spark)
  • 11. Spark in a Nutshell › general cluster computing platform: › distributed in-memory computational framework › SQL, Machine Learning, Stream Processing, etc. › easy to use, powerful, high-level API: › Scala, Java, Python and R
  • 12. Limitations of MapReduce › MapReduce use cases showed two major limitations: › difficulty of programming directly in MapReduce › performance bottlenecks, or batch not fitting the use cases › in short, MR doesn’t compose well for large applications › therefore, people built specialized systems as workarounds
  • 13. MapReduce Giraph Tez Pregel S4 Pig GraphLabImpala Dremel Drill Storm General Batch Processing Specialized Systems (iterative, interactive, streaming, graph, etc.) Limitations (cont.): Specialized Systems
  • 14. Advantages of Spark › handles batch, interactive, and real-time within a single framework › native integration with Java, Python, Scala, R › programming at a higher level of abstraction › more general: map/reduce is just one set of supported constructs
  • 15. Advantages (cont.): Generalized MapReduce › unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within same engine › two reasonably small additions are enough to express the previous models: › fast data sharing › general DAGs › this allows for an approach which is more efficient for the engine, and much simpler for the end users
  • 16. Code Size same functionality yet in the form of libraries
  • 17. Standalone YARN Mesos Spark Core Spark SQL Spark GraphXSpark MLlibSpark Streaming Unified Stack
  • 18. High Performance › in-memory cluster computing › ideal for iterative algorithms › faster than Hadoop: › 10x on disk › 100x in memory
  • 19. Brief History › originally developed in 2009, UC Berkeley AMP Lab › open-sourced in 2010 › as of 2014, Spark is a top-level Apache project › fastest open-source engine for sorting 100 TB: › won the 2014 Daytona GraySort contest › throughput: 4.27 TB/min
  • 20. End Users › Data Scientists: › analyze and model data › data transformations and prototyping › statistics and machine learning › Data Engineers: › implement production data processing systems › require a reasonable API for distributed processing › reliable, high performance, easy to monitor platform
  • 21. partitions Resilient Distributed Dataset › RDD is an immutable and partitioned collection. RDD comes from the acronym: › resilient: it can be recreated, when data in memory is lost › distributed: stored in memory across the cluster › dataset: data that comes from file or created programmatically RDD
  • 22. Resilient Distributed Dataset (cont.) › RDD feels like coding using typical Scala collections; RDD can be build: › directly from a datasource (e.g., text file, HDFS, etc.), › or by applying a transformation to another RDDs › main features: › RDDs are computed lazily › automatically rebuild on failure › persistence for reuse (RAM and/or disk)
  • 23. MappedRDD func = _.split(...) FilteredRDD func = _.contains(...) HadoopRDD path = hdfs://... messages = textFile(“file.log”).filter(_.contains(“error”)).map(_.split(‘t’)(2)) RDD Fault Tolerance › RDDs are the primary abstraction in Spark; a fault-tolerant collection of elements that can be operated on in parallel › RDDs track the series of transformations used to build them; their lineage to recompute lost data
  • 24. Loading and Saving RDDs › File Systems: Local FS, Amazon S3 and HDFS › Supported formats: Text files, JSON, Hadoop sequence files, parquet files, protocol buffers and object files › Structured data with Spark SQL: Hive, JSON, JDBC, Cassandra, HBase and ElasticSearch
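A minimal sketch of the JSON case, assuming the `people.json` file used later in the SQL section contains one JSON object per line:

import json
# JSON lines: load as text, then parse each record on the workers
jsonRDD = sc.textFile(path+"people.json").map(json.loads)
print(jsonRDD.first())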
  • 25. Part II: Introduction to Spark (section 2: spark under the hood)
  • 27. The Spark Context › first thing that a Spark program does is create a SparkContext object, which tells Spark how to access a cluster › in the shell for either Scala or Python, this is the sc variable, which is created automatically › other programs must use a constructor to instantiate a new SparkContext › then in turn SparkContext gets used to create other variables
  • 28. The Spark Master › the master parameter for a SparkContext determines which cluster to use:
master | description
local | run Spark locally with one worker thread (i.e. no parallelism at all)
local[*] | run Spark locally with as many worker threads as logical cores on your machine
spark://HOST:PORT | connect to the given Spark standalone cluster master (port 7077 by default)
mesos://HOST:PORT | connect to the given Mesos cluster (port 5050 by default)
yarn | connect to a YARN cluster in client or cluster mode (YARN_CONF_DIR variable)
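In a standalone script (outside the notebook, where no context is created for you), the master is typically set when the SparkContext is constructed. A minimal sketch; the app name and the standalone master URL below are placeholders:

from pyspark import SparkConf, SparkContext
# example values only: "Workshop" and the master URL are hypothetical
conf = (SparkConf().
    setAppName("Workshop").
    setMaster("spark://master-host:7077"))
sc = SparkContext(conf=conf)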
  • 29. The Spark Master (cont.) › [diagram: Driver Program (SparkContext) → Cluster Manager → Worker Nodes, each running an Executor with cache and tasks] › the driver connects to a cluster manager, which allocates resources across applications › acquires executors on cluster nodes: worker processes that run computations and store data › sends app code to the executors › sends tasks for the executors to run
  • 30. Word Count › What is the goal? Count how often each word appears in a collection of text documents. › Why is this so popular? A simple program that provides a good test case for parallel processing, since it: › requires a minimal amount of code › demonstrates use of both symbolic and numeric values › isn’t many steps away from search indexing › serves as a “Hello World” for big data applications › Why should I care? A distributed computing framework that can run Word Count efficiently in parallel at scale can likely handle much larger and more interesting compute problems.
  • 31. Word Count (cont.)
# calculate word frequencies
counts = (tfile.
    flatMap(lambda x: x.split(' ')).
    filter(lambda x: len(x) > 0).
    map(lambda x: (x, 1)).
    reduceByKey(lambda l,r: l + r).
    sortBy(lambda x: x[1], ascending=False))
# print (word,count) sample
print(counts.take(5))
  • 32. Mining Logs
# base RDD
logRDD = sc.textFile(path+"logs.txt")
# transformed RDDs
filteredRDD = logRDD.filter(lambda x: u' "GET ' in x)
splittedRDD = filteredRDD.map(lambda x: x.split(u' "GET ')).map(lambda x: x[1])
# count requests based on status code
print('with status "200": %d' % splittedRDD.filter(lambda x: u'" 200 ' in x).count())
print('without status "200": %d' % splittedRDD.filter(lambda x: u'" 200 ' not in x).count())
  • 33. Spark Deconstructed › Looking at the RDD transformations and actions from another perspective: RDD → transformation(s) → action → value
  • 34. Spark Deconstructed (cont.) 1/3 › base RDD:
# base RDD
logRDD = sc.textFile(path+"logs.txt")
  • 35. Spark Deconstructed (cont.) 2/3 › RDD → transformation(s):
# transformed RDDs
filteredRDD = logRDD.filter(lambda x: u' "GET ' in x)
splittedRDD = filteredRDD.map(lambda x: x.split(u' "GET ')).map(lambda x: x[1])
  • 36. Spark Deconstructed (cont.) 3/3 › action → value:
# count requests based on status code
print('with status "200": %d' % splittedRDD.filter(lambda x: u'" 200 ' in x).count())
  • 38. RDD Operations › two types of operations on RDDs: transformations and actions › transformations are lazy (not computed immediately) › by default, a transformed RDD is recomputed every time an action runs on it › however, an RDD can be persisted in memory or on disk (a minimal sketch of laziness follows)
  • 39. RDD Operations (cont.) › Transformations: define a new RDD based on the current one, e.g., filter, map, reduceByKey, groupBy, etc. › Actions: return values to the driver, e.g., count, sum, collect, etc.
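A minimal sketch of lazy evaluation, reusing the hamlet.txt file from the earlier examples: no data is read until the action runs.

# transformations only describe the computation; they return immediately
hamlet = sc.textFile(path+"hamlet.txt")
longLines = hamlet.filter(lambda line: len(line) > 40)
# the action below is what actually triggers reading and filtering the file
print(longLines.count())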
  • 40. Transformations Vs. Actions: Basic Examples
# transformation 1: create an RDD lazily
nums = sc.parallelize((1, 2, 3, 4, 5))
# transformation 2: pass each element through a function
squares = nums.map(lambda x: x * x)
# transformation 3: keep elements passing a predicate
evens = squares.filter(lambda x: x % 2 == 0)
# transformation 4: map each element to zero or more others
flats = nums.flatMap(lambda x: range(1, x+1))
# action 1: collect 'nums'
print(nums.collect())
# action 2: collect 'evens'
print(evens.collect())
  • 41. Transformations: Examples Illustrated › [diagram] nums (ParallelCollectionRDD) → nums.flatMap(...) → flats (FlatMappedRDD); nums.map(...) → squares (MappedRDD) → squares.filter(...) → evens (FilteredRDD); each collect() returns a value to the driver
  • 42. Transformations Vs. Actions: K,V Examples
# transformation 1: create an RDD lazily
petsAll = sc.parallelize((("cat", 1), ("dog", 1), ("cat", 2)))
# transformation 2: filter by key
petsCat = petsAll.filter(lambda (k,v): k == "cat")
# action 1: increase values by 1, then collect
petsAll.map(lambda (k,v): (k, v+1)).collect() # ver.1
petsAll.mapValues(lambda v: v+1).collect() # ver.2
# action 2: sum values by key, then collect
petsAll.reduceByKey(lambda l,r: l+r).collect()
# action 3: group by key, then collect
petsAll.groupByKey().map(lambda (k,v): (k, list(v))).collect()
# action 4: sort by key, then collect
print(petsAll.sortByKey().collect())
  • 43. Transformations Vs. Actions: Join Examples
# transformation 1: RDD[(date, user, clicks)]
clk = sc.textFile(path+"clk.tsv").map(lambda x: x.split("\t"))
# transformation 2: RDD[(date, user, id, lat, lon)]
reg = sc.textFile(path+"reg.tsv").map(lambda x: x.split("\t"))
# transformation 3: RDD[(user, (date, clicks))]
clk_reordered = clk.map(lambda (date, user, clicks): (user, (date, clicks)))
# transformation 4: RDD[(user, (date, id, lat, lon))]
reg_reordered = reg.map(lambda (date, user, id, lat, lon): (user, (date, id, lat, lon)))
# transformation 5: RDD[(user, ((date, clicks), (date, id, lat, lon)))]
joined = clk_reordered.join(reg_reordered)
print(joined.count()) # action 1: print total number of successful joins
print(joined.first()) # action 2: print first element of newly-joined RDD
  • 44. Units of Execution Model › Job: › the work required to compute an RDD. › Stage: › each job is divided into stages. › Task: › a unit of work within a stage. › corresponds to one RDD partition. › [diagram] Job → Stage 0 (Task 0, Task 1, ...), Stage 1 (Task 0, Task 1, ...), ...
  • 46. Lineage Graph
# calculate word frequencies
counts = (sc.textFile(path+"hamlet.txt"). # MappedRDD[1], HadoopRDD[0]
    flatMap(lambda x: x.split(' ')). # FlatMappedRDD[2]
    map(lambda x: (x, 1)). # MappedRDD[3]
    reduceByKey(lambda l,r: l + r)) # ShuffledRDD[4]
# print lineage graph representation
print(counts.toDebugString())
› lineage: HadoopRDD[0] → MappedRDD[1] → FlatMappedRDD[2] → MappedRDD[3] → ShuffledRDD[4]
  • 47. Lineage Graph (cont.) › the same word-count pipeline as above, with the lineage drawn as a chain of RDDs: [0] → [1] → [2] → [3] → [4]
  • 48. Execution Plan › the same lineage graph split into stages at the shuffle boundary: Stage 1 covers the narrow transformations [0] → [1] → [2] → [3], Stage 2 covers the reduceByKey shuffle [4]
  • 49. Part II: Introduction to Spark (section 3: advanced features)
  • 50. Persistence › when we use the same RDD multiple times: › Spark will recompute the RDD each time. › this is expensive for iterative algorithms. › Spark can persist RDDs, avoiding re-computation. › each node stores in memory any slices (partitions) it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster. › the cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
  • 51. Levels of Persistence
# how to persist an RDD
result = input.map(<ExpensiveComputation>)
result.persist(LEVEL)
LEVEL | SPACE | CPU | IN-MEMORY | ON-DISK
MEMORY_ONLY (default) | HIGH | LOW | YES | NO
MEMORY_ONLY_SER | LOW | HIGH | YES | NO
MEMORY_AND_DISK | HIGH | MEDIUM | SOME | SOME
MEMORY_AND_DISK_SER | LOW | HIGH | SOME | SOME
DISK_ONLY | LOW | HIGH | NO | YES
  • 52. Persistence Behaviour › each node will store its computed partition. › in case of a failure, Spark recomputes the missing partitions. › least recently used cache policy: › memory-only: recompute partitions. › memory-and-disk: recompute and write to disk. › manually remove from cache: unpersist()
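A minimal sketch of persisting and unpersisting, assuming the `counts` RDD from the lineage-graph example:

from pyspark import StorageLevel
# keep the RDD in memory, spilling to disk if it does not fit
counts.persist(StorageLevel.MEMORY_AND_DISK)
print(counts.count())   # first action: computes the RDD and fills the cache
print(counts.take(3))   # second action: served from the cache
counts.unpersist()      # manually remove it from the cache when done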
  • 53. Shared Variables › Accumulators: aggregate values from worker nodes back to the driver program. › Broadcast Variables: distribute values to all worker nodes.
  • 54. Broadcast Variables › closures and the variables they use are sent separately to each task; we may want to share some variable (e.g., a map) across tasks/operations, and this can be done efficiently with broadcast variables: › broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. › for example, to give every node a copy of a large input dataset efficiently. › Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
  • 55. Example Without Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
# CAUTION: regDict is sent along with every task!
joined = clk_reordered.map(
    lambda (user, (date, clicks)): (user, ((date, clicks), regDict[user])))
# let's have a look at the output, transformed dataset
print(joined.first())
print(joined.count())
  • 56. Example With Broadcast Variables
# dict(user: (date, id, lat, lon))
regDict = dict(reg_reordered.collect())
bcDict = sc.broadcast(regDict) # bcDict is a read-only variable, cached on each machine
joined = clk_reordered.map(
    lambda (user, (date, clicks)): (user, ((date, clicks), bcDict.value[user])))
# let's have a look at the output, transformed dataset
print(joined.first())
print(joined.count())
  • 57. Accumulators › accumulators are variables that can only be “added” to through an associative operation. › used to implement counters and sums, efficiently in parallel. › Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types. › only the driver program can read an accumulator’s value, not the tasks.
  • 58. Example with Accumulators
# initialize accumulators
acc_sum = sc.accumulator(0)
acc_cnt = sc.accumulator(0)
# define auxiliary functions
def acc(size):
    acc_sum.add(size)
    acc_cnt.add(1)
# increase accumulators: values are stored on driver
(splittedRDD.
    filter(lambda x: len(x) > 0).
    flatMap(lambda x: x.split(" ")).
    map(lambda x: len(x)).
    foreach(lambda x: acc(x)))
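A possible follow-up, reading the accumulated values back on the driver after the action has run:

# only the driver can read the accumulators' values
print("total characters: %d" % acc_sum.value)
print("total words: %d" % acc_cnt.value)
if acc_cnt.value > 0:
    print("average word length: %.2f" % (acc_sum.value / float(acc_cnt.value)))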
  • 59. Accumulators and Fault Tolerance › Safe: updates inside actions will be applied only once. › Unsafe: updates inside transformations may be applied more than once!
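A minimal sketch of the difference, reusing the hamlet.txt file; the `blanks` accumulator and the helper function are illustrative only:

blanks = sc.accumulator(0)
lines = sc.textFile(path+"hamlet.txt")

def count_blank(line):
    if len(line.strip()) == 0:
        blanks.add(1)
    return line

# safe: the update happens inside an action (foreach), so it is applied exactly once
lines.foreach(count_blank)
# unsafe: the update happens inside a transformation (map); if the stage is
# re-executed (failure, speculation, or the RDD being recomputed for another
# action), the counter may be incremented more than once
lines.map(count_blank).count()
print(blanks.value)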
  • 60. Part III: Hands-On Training (presenting the Core, SQL and MLlib APIs)
  • 61. Basic Summary Statistics
# define auxiliary functions
def computeStats(column):
    return [round(column.count(),0), round(column.sum(),3),
            round(column.max(),3), round(column.min(),3),
            round(column.mean(),3), round(computeMedian(column),3),
            round(column.stdev(),3), round(column.variance(),3)]
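The computeMedian helper used above is not shown on the slide; a minimal sketch, assuming each column is small enough to sort and collect on the driver:

def computeMedian(column):
    # sort the column and pick the middle element (or the mean of the two middle ones)
    values = column.sortBy(lambda x: x).collect()
    n = len(values)
    if n == 0:
        return float('nan')
    if n % 2 == 1:
        return values[n // 2]
    return (values[n // 2 - 1] + values[n // 2]) / 2.0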
  • 62. Basic Summary Statistics (cont.)
# print stats about the dump columns
dat = []
idx = []
for i,h in enumerate(header):
    dat.append(computeStats(dump.map(lambda r: r[i])))
    idx.append(h)
col = ["count", "sum", "max", "min", "mean", "median", "stdev", "variance"]
  • 63. Correlation Between Series
# import required libraries
from pyspark.mllib.stat import Statistics
# simple example #1
ts_a = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ts_b = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
corr = Statistics.corr(ts_a, ts_b, "pearson")
print("correlation between 'a' and 'b' is: %f" % corr)
# simple example #2
ts_a = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ts_b = sc.parallelize([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
corr = Statistics.corr(ts_a, ts_b, "pearson")
print("correlation between 'a' and 'b' is: %f" % corr)
  • 64. Correlation Between Series (cont.)
# import required libraries
from itertools import combinations
from numpy import zeros
# advanced example
dat = zeros((len(header), len(header)))
for ((index1, header1), (index2, header2)) in combinations(enumerate(header), 2):
    (property1, property2) = (dump.map(lambda v: v[index1]), dump.map(lambda v: v[index2]))
    dat[index1][index2] = Statistics.corr(property1, property2, "pearson")
  • 65. Create SQL Context
# import required libraries
from pyspark.sql import SQLContext, Row
# create sql context
sqlContext = SQLContext(sc)
  • 66. Create DataFrame
# create DataFrame from JSON file
df = sqlContext.read.json(path+"people.json")
# display the schema of the DataFrame
df.schema
# display the schema in a tree format
df.printSchema()
# display the content of the DataFrame
df.show()
  • 67. DataFrame Operations
# select only the "name" column
df.select("name").show()
# select everybody but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# select people older than 21
df.filter(df['age'] > 21).show()
# count people by age
df.groupBy("age").count().show()
  • 68. Infer the Schema with Reflection
# infer the schema and register the DataFrame as a table
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
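The `people` RDD above is assumed to exist already; a minimal sketch of building it, assuming a hypothetical `people.txt` file with `name,age` lines:

# build an RDD of Row objects so the schema can be inferred by reflection
lines = sc.textFile(path+"people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))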
  • 69. Run SQL Queries Programmatically
# run SQL over DataFrames that have been registered as a table
teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19")
  • 70. DataFrames Interoperating with RDDs
# the results of SQL queries are RDDs and support all the normal RDD operations
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
  • 71. Parquet Support via DataFrame Interface
# display the schema of the DataFrame
schemaPeople.schema
# display the schema in a tree format
schemaPeople.printSchema()
# display the content of the DataFrame
schemaPeople.show()
# DataFrames can be saved as Parquet files, maintaining the schema information
schemaPeople.write.parquet(path+"people.parquet")
# Parquet files are self-describing so the schema is preserved; the result is also a DataFrame
parquetFile = sqlContext.read.parquet(path+"people.parquet")
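A possible follow-up: the Parquet-backed DataFrame can be registered and queried just like the JSON one:

# register the Parquet-backed DataFrame as a temporary table and query it
parquetFile.registerTempTable("parquetPeople")
teens = sqlContext.sql("SELECT name FROM parquetPeople WHERE age >= 13 AND age <= 19")
teens.show()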
  • 72. Regression
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
def prepareDump(row):
    return LabeledPoint(row[0],Vectors.dense((row[1],row[2],...,row[10],row[11])))
# dummy split into train and test set
trainSet = dump.filter(lambda x: x.features[9] <= 4000)
testSet = dump.filter(lambda x: x.features[9] > 4000)
# build regression model: without such a small step size, the algorithm would diverge
model = LinearRegressionWithSGD.train(data=trainSet, iterations=100, step=0.000000001)
  • 73. Regression (cont.)
# import required libraries
from math import sqrt
# evaluate regression model
valuesANDpredictions = testSet.map(
    lambda p: (p.label, model.predict(p.features)))
# print simple statistics about the model
mse = (valuesANDpredictions.
    map(lambda (v , p): (v - p) * (v - p)).
    sum()) / float(valuesANDpredictions.count())
print("mean squared error is: %.3f" % mse)
print("root mean squared error is: %.3f" % sqrt(mse))
  • 74. Classification
# import required libraries
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import DecisionTree
# prepare dump
dump = (dump.
    map(lambda line: prepareDump(line)).
    map(lambda line: LabeledPoint(line[0],Vectors.dense(line[1]))))
  • 75. Classification (cont.)
# build classification model
categoricalFeaturesInfo = {}
model = DecisionTree.trainClassifier(
    dump,                     # dump file
    2,                        # number of classes
    categoricalFeaturesInfo,  # all features are continuous
    "gini",                   # impurity
    5,                        # max depth
    32)                       # max bins
# evaluate model
actual = dump.map(lambda x: x.label)
predicted = model.predict(dump.map(lambda x: x.features))
actualANDpredicted = actual.zip(predicted)
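A possible follow-up, turning the zipped (actual, predicted) pairs into a simple error rate; note that this is a training error, since the model is evaluated on the same data it was trained on:

# fraction of points whose predicted label differs from the actual one
errRate = (actualANDpredicted.
    filter(lambda (a, p): a != p).
    count()) / float(dump.count())
print("training error is: %.3f" % errRate)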
  • 76. Clustering
# import required libraries
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans
# convert original data points into dense format
dump = dump.map(lambda line: Vectors.dense(line))
clusters = 2
iterations = 20
model = KMeans.train(dump, clusters, maxIterations=iterations)
# get the centers of the 2 clusters
_2_centers = [tuple(c) for c in model.clusterCenters]
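A possible follow-up for judging cluster quality: a within-set sum of squared errors computed from the model's own centers (a sketch, not part of the original slides):

from math import sqrt

centers = model.clusterCenters

def error(point):
    # distance from a point to the center of the cluster it is assigned to
    center = centers[model.predict(point)]
    return sqrt(sum([(p - c) ** 2 for (p, c) in zip(point, center)]))

wssse = dump.map(lambda point: error(point)).reduce(lambda l, r: l + r)
print("within-set sum of squared errors: %.3f" % wssse)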
  • 77. Recommendations
# import required libraries
from pyspark.mllib.recommendation import Rating, ALS
# dummy split into three sets, namely train, validation and test
train = (ratings.map(lambda x: parseRatings1(x)).
    filter(lambda x: (((x[3] % 10) < 6))).
    map(lambda x: parseRatings2(x)))
validation = (ratings.map(lambda x: parseRatings1(x)).
    filter(lambda x: (((x[3] % 10) >= 6) and ((x[3] % 10) < 8))).
    map(lambda x: parseRatings2(x)))
test = (ratings.map(lambda x: parseRatings1(x)).
    filter(lambda x: (((x[3] % 10) >= 8))).
    map(lambda x: parseRatings2(x)))
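The ratings RDD and the parseRatings1/parseRatings2 helpers are assumed from earlier in the notebook and would need to be defined before the split above runs; a minimal sketch, assuming a hypothetical MovieLens-style `user::movie::rating::timestamp` line format:

def parseRatings1(line):
    # split a raw line into (user, movie, rating, timestamp) as numbers
    fields = line.split("::")
    return (int(fields[0]), int(fields[1]), float(fields[2]), int(fields[3]))

def parseRatings2(fields):
    # keep only the (user, product, rating) triple expected by ALS
    return Rating(fields[0], fields[1], fields[2])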
  • 78. Recommendations (cont.)
# build model
rank = 10; iterations = 20
model = ALS.train(train,rank,iterations=iterations)
# make predictions
predictions = model.predictAll(
    validation.map(lambda (user,product,rating): (user,product)))
# join validation set with predictions
ratingsANDpredictions = ((validation.
    map(lambda (user,product,rating): ((user,product),rating))).
    join(predictions.map(lambda (user,product,rating): ((user,product),rating))))
# evaluate the performance of the predictor
mse = (ratingsANDpredictions.
    map(lambda ((user,product),(rating,prediction)):
        (rating - prediction) * (rating - prediction)).
    sum()) / float(ratingsANDpredictions.count())
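A possible follow-up, reporting the validation error in the same units as the ratings:

from math import sqrt
# summarize the validation error of the recommender
print("validation mean squared error is: %.3f" % mse)
print("validation root mean squared error is: %.3f" % sqrt(mse))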
  • 80. › user@spark.apache.org › usage questions, help, announcements. › dev@spark.apache.org › for people who want to contribute code! Get Help and Contribute
  • 81. › Introduction to Spark (edX), Apr. 14, 2016 › Big Data Analysis with Spark (edX), May 19, 2016 › Distributed Machine Learning with Spark (edX), Jun. 2016 › Adv. Distributed Machine Learning with Spark (edX), Aug. 2016 › Adv. Spark for Data Science & Data Engineering (edX), Oct. 2016 › Data Science & Engineering with Spark (edX), TBA Courses and Certifications
  • 83. THANKS! Any questions? You can find me at: @eualin