Applications of Machine Learning at USC presentation by Alex Tellez
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Whiskey Recommendation Engine
d) Bordeaux Wine Vintage
4. Data Science Competition
3. 1. INTRO TO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke
4. BIG VS. SMALL DATA
When you try to open
the file in Excel, Excel
CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big
to process using conventional
methods
(e.g. Excel, Access)
5. V + V + V
Today, we have access to more data than we know what to do with!
1) Wearables (Fitbit, iWatch, etc.)
2) Click streams from web visitors
3) Sensor readings
4) Social media outlets (e.g. Twitter, Facebook, etc.)
Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must
be processed / stored
6. THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better
decisions
Enter, Machine Learning (ML)…
So how the hell do you do it?
7. MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to: make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
8. SUPERVISED LEARNING
What is it?
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)
Examples of supervised learning tasks:
1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
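To make "infer a function from labeled training data" concrete, here is a minimal sketch of a 1-nearest-neighbor classifier in plain Python. It is not an H2O algorithm from the talk; the feature choice (tumor size, patient age) and all values are made up for illustration.

```python
import math

def nearest_neighbor_predict(train_rows, train_labels, query):
    """Return the label of the training row closest to `query` (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for row, label in zip(train_rows, train_labels):
        dist = math.dist(row, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Made-up labeled training data: (tumor size in cm, patient age) -> label
train_rows = [(1.0, 35), (1.2, 40), (4.5, 62), (5.1, 58)]
train_labels = ["benign", "benign", "malignant", "malignant"]

# Predict the label of a new, unlabeled case
prediction = nearest_neighbor_predict(train_rows, train_labels, (4.8, 60))
print(prediction)
```

The labels are what make this supervised: the "function" is learned entirely from examples whose answers are already known.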
9. UNSUPERVISED LEARNING
What is it?
Methods to understand the general structure of input data where
no prediction is needed.
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heart-beats
NO CURATION NEEDED!
10. 2. WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts
13. TRY IT!
Don’t take my word for it… www.h2o.ai
Simple Instructions
1. cd to the download location
2. Unzip the h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321
Interfaces: GUI, R
15. TB + BB
Bill Belichick + Tom Brady =
15 years together
3 Super Bowls
16. PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Personnel
Yards to go (until 1st down)
Time remaining, etc., etc.
Basically, LOTS of stuff.
17. BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
Download every offensive play from the Belichick-Brady era since 2000
Extract known features to build model inputs
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
18. DATA COLLECTION
Data:
13 years of data (2002 - 2013 season)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, Opposing Team, Down, Distance,
Line of Scrimmage, NE Score, Opposing Team Score, Season,
Formation, Game Status (is NE losing / winning / tied)
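As a rough sketch of how one raw play record might be turned into model inputs plus the PASS / RUN response, the snippet below derives Game Status from the two scores. The dictionary keys and the example play are hypothetical; the talk does not show its actual data schema.

```python
def game_status(ne_score, opp_score):
    """Derive the Game Status input (is NE losing / winning / tied)."""
    if ne_score > opp_score:
        return "winning"
    if ne_score < opp_score:
        return "losing"
    return "tied"

def to_model_row(play):
    """Split one raw play record (hypothetical keys) into (inputs, response)."""
    inputs = {
        "quarter": play["quarter"],
        "minutes": play["minutes"],
        "seconds": play["seconds"],
        "opposing_team": play["opposing_team"],
        "down": play["down"],
        "distance": play["distance"],
        "line_of_scrimmage": play["line_of_scrimmage"],
        "ne_score": play["ne_score"],
        "opp_score": play["opp_score"],
        "season": play["season"],
        "formation": play["formation"],
        "game_status": game_status(play["ne_score"], play["opp_score"]),
    }
    return inputs, play["play_type"]

# Made-up example play, for illustration only
example_play = {
    "quarter": 4, "minutes": 2, "seconds": 15, "opposing_team": "NYJ",
    "down": 3, "distance": 8, "line_of_scrimmage": 45,
    "ne_score": 17, "opp_score": 20, "season": 2012,
    "formation": "shotgun", "play_type": "PASS",
}

inputs, response = to_model_row(example_play)
print(inputs["game_status"], response)
```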
23. SPARK + H2O
Data sources: Weather, Crimes, Census
Pipeline: Data munging > Spark SQL join > Deep Learning > Evaluate models
GOAL:
For a given crime, predict if an arrest is more / less likely to be made!
26. SPLIT DATA INTO TEST/TRAIN SETS
Train model on this segment, 80% of data
Validate the model on this segment (remaining 20%)
[Chart: training set arrest rate vs. test set arrest rate]
~40% of crimes lead to arrest
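The 80 / 20 split can be sketched in plain Python as below. This is an illustration only, not the Spark + H2O code used in the talk; the crime records are synthetic and built so that 40% carry an arrest.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test segments."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def arrest_rate(rows):
    """Fraction of rows where an arrest was made."""
    return sum(r["arrest"] for r in rows) / len(rows)

# Synthetic records: 2 out of every 5 have arrest = 1 (40%)
crimes = [{"id": i, "arrest": 1 if i % 5 < 2 else 0} for i in range(1000)]

train, test = train_test_split(crimes)
print(len(train), len(test))
print(arrest_rate(train), arrest_rate(test))
```

Comparing the arrest rate in both segments is a quick sanity check that the split did not skew the response variable.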
27. DEEP LEARNING
Problem:
For a given crime, is an arrest more / less likely?
Deep Learning:
A multi-layer feed-forward neural network that starts w/ an input layer
(crime + weather data) followed by multiple layers of non-linear transformations
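A forward pass through such a network can be sketched as follows. The layer sizes, weights, activation choices, and the three input values are all made up; a real model would learn its weights from the crime + weather data, and this is not H2O's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Feed the input through each layer in order."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Made-up network: 3 inputs -> 2 hidden -> 2 hidden -> 1 output
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], math.tanh),
    ([[0.7, -0.4], [-0.6, 0.9]], [0.05, -0.05], math.tanh),
    ([[1.2, -0.7]], [0.05], sigmoid),
]

x = [1.0, 0.0, 0.5]  # hypothetical scaled crime + weather features
p_arrest = forward(x, layers)[0]
print(p_arrest)
```

The sigmoid on the last layer squashes the output into (0, 1), so it can be read as a score for "arrest more likely".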
29. SINGLE-MALT SCOTCH
Single-Malt Scotch
A whiskey made at one particular distillery from a mash that only uses
malted grain (barley)
Solid Standards:
Must be aged at least 3 years in oak casks
Many famous single malts are produced in the northern regions of Scotland
30. OF COURSE, THERE’S A
DATASET FOR THAT!
THE Single Malt Dataset
85 distilleries from Northern Scotland
12 descriptor features:
E.g. Sweetness, Smoky, Tobacco, Honey, Spicy, Malty, etc.
Each descriptor rated 0 (weak) to 4 (strong)
Problem:
Can we build a whiskey recommendation engine based on whiskeys I
have tried (and liked!) already?
31. DIMENSIONALITY
REDUCTION + K-MEANS
First, let’s reduce the 12 features to a lower dimensional space using a
linear transformation (Principal Components Analysis)
7 principal components explain ~ 85% of the variance in dataset
Then let’s use a clustering algorithm to determine unique whiskeys
using the new PCA’d dataset
11 clusters are appropriate
Pipe out the cluster assignments and start buying whiskey!
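The clustering and "recommend from the same cluster" idea can be sketched in plain Python as below. To keep it short, this skips the PCA step and runs k-means directly on two made-up descriptor scores; the distillery names, profiles, and starting centroids are hypothetical, not taken from the Single Malt Dataset.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assign / update until centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def recommend(liked, names, labels):
    """Suggest every other whiskey in the same cluster as the one you liked."""
    cluster = labels[names.index(liked)]
    return [n for n, label in zip(names, labels) if label == cluster and n != liked]

# Made-up (sweetness, smoky) scores on the 0-4 scale
names = ["Distillery A", "Distillery B", "Distillery C",
         "Distillery D", "Distillery E", "Distillery F"]
profiles = [(4, 0), (3, 1), (4, 1), (0, 4), (1, 3), (1, 4)]

labels, _ = kmeans(profiles, centroids=[profiles[0], profiles[3]])
suggestions = recommend("Distillery A", names, labels)
print(suggestions)
```

In the talk's workflow the points fed to k-means would be the 7 principal components rather than raw descriptor scores, with 11 clusters instead of 2.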
36. BORDEAUX WINE
Largest wine-growing region in France
+ 700 Million bottles of wine produced / year !
Some years better than others: Great ($$$) vs. Typical ($)
Last Great years: 2010, 2009, 2005, 2000
37. GREAT VS. TYPICAL VINTAGE?
Question:
Can we study weather patterns in Bordeaux
leading up to harvest to identify ‘anomalous’ weather years >>
correlates to Great ($$$) vs. Typical ($) Vintage?
The Bordeaux Dataset (1952 - 2014 Yearly)
Amount of Winter Rain (Oct > Apr of harvest year)
Average Summer Temp (Apr > Sept of harvest year)
Rain during Harvest (Aug > Sept)
Years since last Great Vintage
38. AUTOENCODER + ANOMALY
DETECTION
ML Workflow:
1) Train autoencoder to learn ‘typical’ vintage weather pattern
2) Append ‘great’ vintage year weather data to original dataset
3) IF great vintage year weather data does NOT match learned
weather pattern, autoencoder will produce high reconstruction
error (MSE)
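The scoring step of this workflow, flagging rows by reconstruction error, can be sketched as below. Training a real autoencoder is out of scope here, so the reconstructor is a deliberately crude stand-in that always outputs the column means of the typical years. The three standardized weather values per year are made up, and the threshold rule (largest error seen on typical years) is an assumption, not the talk's method.

```python
def mse(a, b):
    """Mean squared error between a row and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fit_mean_reconstructor(rows):
    """Stand-in for a trained autoencoder: always reconstructs the column means."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return lambda row: means

# Made-up standardized weather features: (winter rain, summer temp, harvest rain)
typical = [[0.1, -0.2, 0.0], [-0.1, 0.1, 0.2], [0.0, 0.1, -0.2]]
great = [[-1.5, 1.8, -1.2]]

# 1) "Train" on typical years only
reconstruct = fit_mean_reconstructor(typical)

# Assumed rule: anything worse than the worst typical year is anomalous
threshold = max(mse(row, reconstruct(row)) for row in typical)

# 2) Append the great vintage year, 3) flag rows with high reconstruction error
flags = [mse(row, reconstruct(row)) > threshold for row in typical + great]
print(flags)
```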
Goal:
‘en primeur of en primeur’ - Can we use weather patterns to identify
anomalous years >> indicates great vintage quality?