Tools and
Methods for Big
Data Analytics
One hour of everything you need to know to
navigate the data science jungle
by Dahl Winters, RTI International
Overview
• What is Big Data Analytics
• What Tools to Use When
• Most Common Hadoop Use Cases
• Geospatial Analytics
o NoSQL and Graph Databases
o Machine Learning
• Classification
• Clustering
o Deep Learning
Resources
http://www.scoop.it/u/dahl-winters
Big Data Analytics
[Diagram. Labels: Statistics; Machine Learning; Data Science; Software Engineering; Lots of Complex Data; Analytics (Descriptive, Predictive, Prescriptive); Image Processing; Geospatial Analytics; Network Analytics; Text Analytics; Sentiment Analysis; Social Media Analytics]
What is Hadoop Good For
• Essentially, anything involving complex data and/or
multiple data sources requiring batch processing, parallel
execution, spreading data over a cluster of servers, or
taking the computation to the data because the data are
too big to move

• Text mining, index building, graph creation/analysis,
pattern recognition, collaborative filtering, prediction
models, sentiment analysis, risk assessment
• If your data are small, Hadoop will be slow – use
something else (scikit-learn, R, etc.)
What is Hadoop?
When to Use What
• Depends on whether you need real-time analysis or not
o Affects what products, tools, hardware, data sources, and data frequency you will
need to handle

• Data frequency and size
o Determine the storage mechanism, storage format, and necessary preprocessing
tools
o Examples: on-demand (social media data), continuous real-time feed (weather
data, transaction data), time-based data (time series)

• Type of data
o Structured (RDBMS)
o Unstructured (audio, video, images)
o Semi-structured
Decision Tree
• How big is your data?
o Less than 10 GB: Small Data Methods
o 10 GB < x < 200 GB
o More than 200 GB
• What size queries?
o Single element at a time: Big Storage
o One pass over all the data: Streaming
o Multiple passes over big chunks: depends on response time
• Response time?
o Less than 100 s: Impala, Drill, Titan
o Don’t care, just do it: Batch Processing
Big Data Considerations

http://www.ibm.com/developerworks/library/bd-archpatterns1/
Survey of Use Cases
9 general use cases for big data tools and methods
2 real-time analytics tools
8 MapReduce use cases – what you can use Hadoop for

1 geospatial use case
Use Cases
1. Utilities want to predict power consumption
o Use machine-generated data
o Smart meters generate huge volumes of data to analyze and power grid contains
numerous sensors monitoring voltage, current, frequency, etc.

2. Banks and insurance companies want to understand
risk
o Use machine-generated, human-generated, and transaction data from credit
card records, call recordings, chat sessions, emails, and banking activity
o Want to build a comprehensive data picture using sentiment analysis, graph
creation, and pattern recognition

3. Fraud detection
o Machine-generated, human-generated, and transaction data
o Requires real-time or near real-time transaction analysis and the generation of
recommendations for immediate action
Use Cases
4. Marketing departments want to understand customers
o Use web and social data such as Twitter feeds
o Conduct sentiment analysis to learn what users are saying about the company
and its products/services; sentiment must be integrated with customer profile
data to derive meaningful results.
o Customer feedback may vary according to demographics, which are
geographically uneven and thus have a geospatial component

5. They also want to understand customer churn
o Use web and social data, along with transaction data
o Build behavioral models including social media and transaction data to predict
and manage churn by analyzing customer activity. Graph creation/traversal and
pattern recognition may be involved.

6. They may also just want to get insights from the data
o Use Hadoop to try out different analyses on the data to find potential new
patterns/relationships that yield additional value
Use Cases
7. Recommendations
o If you bought this item, what other items might you buy?
o Collaborative filtering = using information from users to predict what similar users
might like.
o Requires batch processing across large, distributed datasets

8. Location-Based Ad Targeting
o Uses web and social data, perhaps also biometrics for facial recognition; also
machine-generated data (GPS) and transaction data
o Predictive behavioral targeting and personalized messaging – companies can
use facial recognition technology in combination with a photo from social media
to make personalized offers based on buying behavior and location
o Serious privacy concerns

9. Threat Analysis
o Pattern recognition to identify anomalies
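The collaborative filtering idea in use case 7 (using information from users to predict what similar users might like) can be sketched in a few lines. This is a minimal in-memory illustration, not the batch, distributed version the slide has in mind; the `ratings` data, the user and item names, and the choice of cosine similarity are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data (not from the slides): user -> {item: rating}.
ratings = {
    "ann":  {"book": 5, "lamp": 3, "pen": 4},
    "bob":  {"book": 4, "lamp": 3, "pen": 5, "desk": 4},
    "cara": {"book": 1, "desk": 2, "mug": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ann", ratings))
```

Here "ann" is most similar to "bob", so the item only "bob" and "cara" rated ("desk") is ranked ahead of the item only the dissimilar user "cara" rated ("mug").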
Real-Time Analytics
• Streaming data management is the only technology
available to deliver low-latency analytics at large scale
• Scale by adding more servers
• Twitter Storm – can be used with any programming
language. For online machine learning or continuous
computation. Can process more than a million tuples
per second per node.
• LinkedIn Samza – built on top of LinkedIn’s Kafka
messaging system
MapReduce Use Cases
1. Counting and Summing
o N documents, each with a set of terms and we want to calculate a total number
of occurrences of each term in all N documents

2. Collating
o A set of items each have a property and we want to save all items with that
property into one file or perform some computation requiring all property-containing items to be processed as a group (e.g. building inverted indices)

3. Filtering, Parsing, and Data Validation
o We want to collect all records that meet some condition or transform each record
into another representation (e.g. text parsing, value extraction, conversion from
one format to another)

4. Distributed Task Execution
o Any large computational problem that can be divided into multiple parts and
results from all parts can be combined into a final result
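The counting-and-summing pattern (use case 1) is the classic word count. The sketch below imitates the map, shuffle, and reduce phases in a single Python process so the data flow is visible; on a real cluster the framework performs the shuffle, and the function names here are illustrative assumptions rather than any Hadoop API.

```python
from collections import defaultdict

def map_phase(doc):
    """Mapper: emit a (term, 1) pair for every term in one document."""
    for term in doc.lower().split():
        yield term, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, counts):
    """Reducer: sum the partial counts for one term."""
    return term, sum(counts)

def word_count(docs):
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(term, counts) for term, counts in shuffle(pairs).items())

# Hypothetical input documents.
docs = ["big data tools", "big data methods", "data science"]
print(word_count(docs))
```

Collating (use case 2) follows the same shape: the mapper would emit `(term, document_id)` instead of `(term, 1)`, and the reducer would keep the list of document IDs rather than summing, which yields an inverted index.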
MapReduce Use Cases
5. Sorting
o We want to sort records by some rule or process the records in a certain order

6. Iterative Message Passing (Graph Processing)
o Given a network of entities and relationships between them, calculate each
entity’s state based on the properties of surrounding entities

7. Distinct Values (Unique Items Counting)
o A set of records contain fields A and B, and we want to count the total number of
unique values of field A, grouped by B

8. Cross-Correlation
o Given a list of items bought by customers, for each pair of items calculate the
frequency that customers bought those items together.
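Cross-correlation (use case 8) can be sketched as a pairs count: for each basket, emit every unordered pair of items, then sum per pair. The version below runs in one process; the `baskets` data are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical purchase data: one list of items per customer.
baskets = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "eggs"]]
for pair, n in pair_counts(baskets).most_common():
    print(pair, n)
```

In a MapReduce job the inner loop would be the mapper (emitting `(pair, 1)`) and the `Counter` would be replaced by a summing reducer, the same shape as word count.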
Geospatial Analytics
• Question: What defines a community?
• Tools and Methods
o Graph Databases
o Classification Algorithms to Identify Characteristics of Community Members
o Clustering Algorithms to Identify Community Boundaries

• Base Dataset
o Synthetic Population Household Viewer
o https://www.epimodels.org/midas/synthpopviewer_index.do
Graph Databases
• Think of nodes as points, edges as lines connecting the
points
• Nodes can have attributes (properties); edges can have
labels
• In the Hadoop ecosystem: Giraph, Titan, Faunus
• Giraph: in-memory, lots of Java code
• Titan: database allowing fast querying of large,
distributed graphs; choice of 3 storage backends
• Faunus: graph analytics engine performing batch
processing of large graphs; fastest with breadth-first
searches
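To make the node/edge vocabulary concrete, here is a small in-memory sketch of a property graph with a breadth-first search over it. It does not use Giraph, Titan, or Faunus; the node names, attributes, and edge labels are hypothetical, and edges are treated as undirected for simplicity.

```python
from collections import deque

# Hypothetical toy graph: nodes carry attributes, edges carry labels.
nodes = {
    "a": {"type": "household"},
    "b": {"type": "household"},
    "c": {"type": "school"},
    "d": {"type": "workplace"},
}
edges = [("a", "b", "neighbor"), ("b", "c", "attends"), ("a", "d", "works_at")]

def build_adjacency(edges):
    """Index edges by endpoint so neighbors can be looked up quickly."""
    adj = {}
    for src, dst, label in edges:
        adj.setdefault(src, []).append((dst, label))
        adj.setdefault(dst, []).append((src, label))
    return adj

def bfs_distances(adj, start):
    """Breadth-first search: number of hops from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, _label in adj.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

adj = build_adjacency(edges)
print(bfs_distances(adj, "a"))
```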
Identify This
Synthetic Population
Household Viewer
http://portaldev.rti.org/10_Midas_Docs/SynthPop/portal.html
Machine Learning
Algorithm Roadmap

http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Classification Algorithms
• kNN, Naïve Bayes, Logistic Regression, Decision Trees,
Random Forests, Support Vector Machines, Neural
Networks, oh my! How to decide?
• Look at the size of your training set
o Small: high bias/low variance classifiers like Naïve Bayes are better, since
low bias/high variance classifiers will overfit.
o Large: low bias/high variance classifiers such as kNN or logistic regression are
better because they have lower asymptotic error; high bias classifiers aren’t
powerful enough to provide accurate models.

• When to use kNN
o Personalization tasks – might employ kNN to find similar customers and base an
offer on their purchase behaviors
o Have to decide what k to use – vary k, calculate the accuracy against a holdout
set, and plot the results
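The procedure for choosing k (vary k, measure accuracy on a holdout set, compare) can be sketched with a tiny pure-Python kNN. The data points are invented for illustration, including one deliberately noisy training point, and the accuracies are printed instead of plotted to keep the example dependency-free.

```python
from collections import Counter

def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def holdout_accuracy(train, holdout, k):
    """Fraction of holdout points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

# Hypothetical 2-D data; the last training point is a noisy "b" near the "a" cluster.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"),
    ((1.6, 1.6), "b"),
]
holdout = [((1.1, 1.0), "a"), ((3.1, 3.0), "b"), ((1.5, 1.4), "a"), ((2.8, 2.7), "b")]

for k in (1, 3, 5):  # odd values avoid tied votes between two classes
    print(k, holdout_accuracy(train, holdout, k))
```

On this toy data k=1 is misled by the noisy point and scores 0.75, while k=3 and k=5 score 1.0, which is the kind of pattern the holdout comparison is meant to reveal.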
Classification Algorithms
• When to use Naïve Bayes
o When you don’t have much training data; Naïve Bayes converges quicker than
discriminative models like logistic regression
o Any time – this should be a first thing to try especially if your features are
independent (no correlation between them)

• When to use Logistic Regression
o When your features may be correlated; unlike with Naïve Bayes, you don’t have to worry much about that
o When you want a nice probabilistic interpretation, which you won’t get with
decision trees or SVMs, in order to adjust classification thresholds or get
confidence intervals
o When you want to easily update the model to take in new data (using gradient
descent), again unlike decision trees or SVMs

• When to use Decision Trees
o They are easy to interpret and explain, but easy to overfit. To solve that problem,
use random forests instead.
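Two of the logistic regression points above, the probabilistic output with an adjustable threshold and the ability to update the model with new data by gradient descent, can be illustrated with a small from-scratch sketch. The class name, learning rate, and training data are assumptions for this example, not a reference implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogisticRegression:
    """Logistic regression trained one example at a time (stochastic gradient descent)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def predict(self, x, threshold=0.5):
        # The threshold can be raised or lowered to trade precision against recall.
        return int(self.predict_proba(x) >= threshold)

    def update(self, x, y):
        """One gradient step on the log loss for a single (x, y) example."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error

# Hypothetical 1-D training data: small values are class 0, large values class 1.
data = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
model = OnlineLogisticRegression(n_features=1)
for _ in range(500):
    for x, y in data:
        model.update(x, y)

print(model.predict_proba([0.0]), model.predict_proba([4.0]))
model.update([2.5], 1)  # new data arrives: fold it in without retraining from scratch
```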
Classification Algorithms
• When to use Random Forests
o Whenever you think about using decision trees (random forests almost always
have lower classification error and better f-scores, and almost always perform as
well or better than SVMs but are far easier to understand).
o If your data are very uneven with many missing variables
o If you want to know which features in the data set are important
o If you want something that will train fast and that will be scalable
o Logistic Regression vs. Random Forests: both are fast and scalable; the latter
tends to beat the former in terms of accuracy

• When to use SVMs
o When working with text classification or any situation where high-dimensional
spaces are common
o Advantage: high accuracy, generally superior in classifying complex patterns.
Disadvantage: memory intensive. Unsuitable for large training sets.
Classification Algorithms
• When to Use Neural Networks
o Slow to converge, hard to set parameters, but good at capturing fairly complex
patterns. Slow to train but fast to use; unlike SVMs the execution speed is
independent of the size of the data it was trained on.
o MLP neural network – well-suited for complex real-world problems – on average,
superior to both SVM and Naïve Bayes. However, cannot easily understand the
model built for classifying.

• General Points
o Better data often beats better algorithms – designing good features goes a long
way.
o With a huge dataset, choice of classification algorithm might not really affect
performance much, so choose based on speed or ease of use instead.
o If accuracy is paramount, try many different classifiers and select the best one by
cross-validation, or use an ensemble method to combine them all.
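Selecting a classifier by cross-validation can be sketched as follows. The two candidate classifiers (a majority-class baseline and a 1-nearest-neighbor rule on one feature), the fold scheme, and the data are all hypothetical choices made to keep the example short.

```python
from collections import Counter

def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor_classifier(train):
    """Predict the label of the closest training point (single numeric feature)."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def cross_val_accuracy(make_classifier, data, folds=3):
    """k-fold cross-validation: each point is tested once by a model that never saw it."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        predict = make_classifier(train)
        correct += sum(predict(x) == y for x, y in test)
    return correct / len(data)

# Hypothetical labeled data: (feature, label).
data = [(0.1, "low"), (0.2, "low"), (0.3, "low"),
        (2.1, "high"), (2.2, "high"), (2.3, "high"),
        (2.4, "high"), (2.5, "high"), (2.6, "high")]

candidates = {"majority": majority_classifier, "1-NN": nearest_neighbor_classifier}
scores = {name: cross_val_accuracy(make, data) for name, make in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```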
Clustering Algorithms

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Clustering Algorithms
• Canopy clustering
o Pre-clustering algorithm, often used prior to k-means or hierarchical in order to
speed up clustering operations on large data sets and potentially improve
clustering results

• DBSCAN/OPTICS
o DBSCAN – density-based spatial clustering of applications with noise; finds
density-based clusters in spatial data
o OPTICS – ordering points to identify the clustering structure; a generalization of
DBSCAN to multiple ranges so meaningful clusters can be found in areas of varying density

• Hierarchical clustering
• K-means clustering
o Most common

• Spectral clustering
o Dimensionality reduction before clustering in fewer dimensions
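Since k-means is listed as the most common of these, a minimal sketch of the standard assign-then-recompute loop may help. Starting from the first k points is an assumption made here so the run is deterministic; practical implementations usually pick starting centroids more carefully. The points are invented.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```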
Clustering Decision Tree
• Do you want to define the number of clusters beforehand?
o No: do your points have varying densities?
• No: DBSCAN
• Yes: OPTICS
o Yes: how many clusters would you have?
• A few: Spectral clustering
• A medium number: K-means
• A large number: Hierarchical clustering
Deep Learning
• Why?
o Computers can learn without being taught
o Can adapt to experience rather than being dependent on a human programmer
o Think of the baby that learns sounds, then words, then sentences – must start at
low-level features and graduate to higher-level representations

• What?
o Essentially, layers of neural networks
o Restricted Boltzmann Machines, Deep Belief Networks, Auto-Encoders
o http://www.meetup.com/Chicago-Machine-Learning-Study-Group/files/

• Examples
o Word2vec – pre-packaged deep learning software that can recognize the
similarities among words (countries in Europe) as well as how they’re related to
other words (countries and capitals)
o AlchemyAPI – for image recognition of common objects
http://www.youtube.com/watch?v=n1ViNeWhC24
Hadoop Connectors
• R: rmr2 allows MapReduce jobs from R environment;
bridges in-memory and HDFS
o Non-Hadoop R for Big Data: pbdR (programming with big data in R) – allows R
to use large HPC platforms with thousands of cores by providing an interface to
MPI, NetCDF4, and more

• MongoDB and Hadoop: Mongo-Hadoop 1.1
• Pattern: migrating predictive models from SAS,
MicroStrategy, SQL Server, etc. to Hadoop via PMML
(Predictive Model Markup Language, an XML standard)
• .NET MapReduce API for Hadoop
• Python for Hadoop
Python-Hadoop Options

http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Python-Hadoop
Benchmarks

http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Questions?
Dahl Winters, dahlwinters@gmail.com

http://www.scoop.it/u/dahl-winters

Tools and Methods for Big Data Analytics: Classification, Clustering, Geospatial Analysis

  • 1. Tools and Methods for Big Data Analytics One hour of everything you need to know to navigate the data science jungle by Dahl Winters, RTI International
  • 2. Overview • What is Big Data Analytics • What Tools to Use When • Most Common Hadoop Use Cases • Geospatial Analytics o NoSQL and Graph Databases o Machine Learning • Classification • Clustering o Deep Learning
  • 4. Big Data Analytics Statistics Machine Learning Data Science Analytics Descriptive Predictive Image Processing Geospatial Analytics Network Analytics Software Engineering Lots of Complex Data Prescriptive Text Analytics Sentiment Analysis Social Media Analytics
  • 5. What is Hadoop Good For • Essentially, anything involving complex data and/or multiple data sources requiring batch processing, parallel execution, spreading data over a cluster of servers, or taking the computation to the data because the data are too big to move • Text mining, index building, graph creation/analysis, pattern recognition, collaborative filtering, prediction models, sentiment analysis, risk assessment • If your data are small, Hadoop will be slow – use something else (scikit-learn, R, etc.)
  • 7. When to Use What • Depends on whether you need real-time analysis or not o Affects what products, tools, hardware, data sources, and data frequency you will need to handle • Data frequency and size o Determine the storage mechanism, storage format, and necessary preprocessing tools o Examples: on-demand (social media data), continuous real-time feed (weather data, transaction data), time-based data (time series) • Type of data o Structured (RDBMS) o Unstructured (audio, video, images) o Semi-structured
  • 8. Decision Tree How big is your data? Less than 10 GB Small Data Methods 10 GB < x < 200 GB More than 200 GB What size queries? Single element at a time One pass over all the data Big Storage Streaming Multiple passes over big chunks Response time? Less than 100 s Impala, Drill, Titan Don’t care, just do it Batch Processing
  • 10. Survey of Use Cases 9 general use cases for big data tools and methods 2 real-time analytics tools 8 MapReduce use cases – what you can use Hadoop for 1 geospatial use case
  • 11. Use Cases 1. Utilities want to predict power consumption o Use machine-generated data o Smart meters generate huge volumes of data to analyze, and the power grid contains numerous sensors monitoring voltage, current, frequency, etc. 2. Banks and insurance companies want to understand risk o Use machine-generated, human-generated, and transaction data from credit card records, call recordings, chat sessions, emails, and banking activity o Want to build a comprehensive data picture using sentiment analysis, graph creation, and pattern recognition 3. Fraud detection o Machine-generated, human-generated, and transaction data o Requires real-time or near real-time transaction analysis and the generation of recommendations for immediate action
  • 12. Use Cases 4. Marketing departments want to understand customers o Use web and social data such as Twitter feeds o Conduct sentiment analysis to learn what users are saying about the company and its products/services; sentiment must be integrated with customer profile data to derive meaningful results. o Customer feedback may vary according to demographics, which are geographically uneven and thus have a geospatial component 5. They also want to understand customer churn o Use web and social data, along with transaction data o Build behavioral models including social media and transaction data to predict and manage churn by analyzing customer activity. Graph creation/traversal and pattern recognition may be involved. 6. They may also just want to get insights from the data o Use Hadoop to try out different analyses on the data to find potential new patterns/relationships that yield additional value
  • 13. Use Cases 7. Recommendations o If you bought this item, what other items might you buy? o Collaborative filtering = using information from users to predict what similar users might like. o Requires batch processing across large, distributed datasets 8. Location-Based Ad Targeting o Uses web and social data, perhaps also biometrics for facial recognition; also machine-generated data (GPS) and transaction data o Predictive behavioral targeting and personalized messaging – companies can use facial recognition technology in combination with a photo from social media to make personalized offers based on buying behavior and location o Serious privacy concerns 9. Threat Analysis o Pattern recognition to identify anomalies
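As a small illustration of use case 7, the sketch below implements user-based collaborative filtering in plain Python. The purchase data, the user names, and the helper names (`jaccard`, `recommend`) are hypothetical and chosen only for illustration; at scale the same idea would run as a batch job over distributed data rather than in a single process.

```python
from collections import defaultdict

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove", "boots"},
    "cat": {"novel", "lamp"},
}

def jaccard(a, b):
    """Similarity of two item sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, purchases, top_n=3):
    """Score items the user has not bought by the similarity of users who did."""
    mine = purchases[user]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == user:
            continue
        sim = jaccard(mine, items)
        for item in items - mine:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]
```

Calling `recommend("ann", purchases)` suggests items bought by users whose baskets overlap with Ann's. Users with no overlap contribute a score of zero and are filtered out.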
  • 14. Real-Time Analytics • Streaming data management is the only technology available to deliver low-latency analytics at large scale • Scale by adding more servers • Twitter Storm – can be used with any programming language. For online machine learning or continuous computation. Can process more than a million tuples per second per node. • LinkedIn Samza – built on top of LinkedIn’s Kafka messaging system
  • 15. MapReduce Use Cases 1. Counting and Summing o Given N documents, each with a set of terms, we want to calculate the total number of occurrences of each term across all N documents 2. Collating o A set of items each have a property, and we want to save all items with that property into one file or perform some computation requiring all property-containing items to be processed as a group (e.g. building inverted indices) 3. Filtering, Parsing, and Data Validation o We want to collect all records that meet some condition or transform each record into another representation (e.g. text parsing, value extraction, conversion from one format to another) 4. Distributed Task Execution o Any large computational problem that can be divided into multiple parts, where the results from all parts can be combined into a final result
  • 16. MapReduce Use Cases 5. Sorting o We want to sort records by some rule or process the records in a certain order 6. Iterative Message Passing (Graph Processing) o Given a network of entities and relationships between them, calculate each entity’s state based on the properties of surrounding entities 7. Distinct Values (Unique Items Counting) o A set of records contain fields A and B, and we want to count the total number of unique values of field A, grouped by B 8. Cross-Correlation o Given a list of items bought by customers, for each pair of items calculate the frequency that customers bought those items together.
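The MapReduce use cases above share one shape: a mapper emits key/value pairs, the framework groups values by key, and a reducer combines each group. The sketch below imitates that flow in a single Python process for use case 1 (counting and summing) and use case 8 (cross-correlation). The `map_reduce` helper and the sample data are hypothetical stand-ins, not the Hadoop API.

```python
from collections import defaultdict
from itertools import combinations

def map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

# Use case 1: counting and summing (term occurrences across documents).
def term_mapper(document):
    for term in document.lower().split():
        yield term, 1

def sum_reducer(key, values):
    return sum(values)

# Use case 8: cross-correlation (how often two items are bought together).
def pair_mapper(basket):
    for pair in combinations(sorted(set(basket)), 2):
        yield pair, 1

# Hypothetical sample inputs.
docs = ["big data tools", "big data methods", "data science"]
baskets = [["milk", "bread"], ["bread", "milk", "eggs"], ["eggs", "milk"]]

term_counts = map_reduce(docs, term_mapper, sum_reducer)
pair_counts = map_reduce(baskets, pair_mapper, sum_reducer)
```

Only the mapper changes between the two jobs; the grouping step and the summing reducer are reused, which is why so many of these use cases fit the same framework.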
  • 17. Geospatial Analytics • Question: What defines a community? • Tools and Methods o Graph Databases o Classification Algorithms to Identify Characteristics of Community Members o Clustering Algorithms to Identify Community Boundaries • Base Dataset o Synthetic Population Household Viewer o https://www.epimodels.org/midas/synthpopviewer_index.do
  • 18. Graph Databases • Think of nodes as points, edges as lines connecting the points • Nodes can have attributes (properties); edges can have labels • In the Hadoop ecosystem: Giraph, Titan, Faunus • Giraph: in-memory, lots of Java code • Titan: database allowing fast querying of large, distributed graphs; choice of 3 storage backends • Faunus: graph analytics engine performing batch processing of large graphs; fastest with breadth-first searches
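To make the node/edge vocabulary concrete, here is a minimal in-memory sketch of a property graph with a breadth-first search over it. It does not use Giraph, Titan, or Faunus; the node names, attributes, and edge labels are hypothetical examples loosely modeled on households in a community.

```python
from collections import deque

# Hypothetical property graph: nodes carry attributes, edges carry labels.
nodes = {
    "h1": {"type": "household", "size": 3},
    "h2": {"type": "household", "size": 2},
    "s1": {"type": "school"},
    "w1": {"type": "workplace"},
}
edges = [
    ("h1", "attends", "s1"),
    ("h2", "attends", "s1"),
    ("h2", "works_at", "w1"),
]

def neighbors(node, edges):
    """Treat edges as undirected and yield (label, other_node) pairs."""
    for src, label, dst in edges:
        if src == node:
            yield label, dst
        elif dst == node:
            yield label, src

def bfs_distances(start, edges):
    """Breadth-first search: number of hops from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for _, other in neighbors(current, edges):
            if other not in dist:
                dist[other] = dist[current] + 1
                queue.append(other)
    return dist
```

Hop counts like these are one simple way to ask which households are connected through shared places, which relates to the community question posed earlier.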
  • 25. Classification Algorithms • kNN, Naïve Bayes, Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, oh my! How to decide? • Look at the size of your training set o Small: high bias/low variance classifiers like Naïve Bayes are better, since low bias/high variance classifiers will overfit. o Large: low bias/high variance classifiers such as kNN or logistic regression are better because they have lower asymptotic error, whereas high bias classifiers aren’t powerful enough to provide accurate models. • When to use kNN o Personalization tasks – might employ kNN to find similar customers and base an offer on their purchase behaviors o Have to decide what k to use – vary k, calculate the accuracy against a holdout set, and plot the results
  • 26. Classification Algorithms • When to use Naïve Bayes o When you don’t have much training data; Naïve Bayes converges quicker than discriminative models like logistic regression o Any time – this should be a first thing to try especially if your features are independent (no correlation between them) • When to use Logistic Regression o When you don’t have to worry much about features being correlated o When you want a nice probabilistic interpretation, which you won’t get with decision trees or SVMs, in order to adjust classification thresholds or get confidence intervals o When you want to easily update the model to take in new data (using gradient descent), again unlike decision trees or SVMs • When to use Decision Trees o They are easy to interpret and explain, but easy to overfit. To solve that problem, use random forests instead.
  • 27. Classification Algorithms • When to use Random Forests o Whenever you think about using decision trees (random forests almost always have lower classification error and better F-scores, and almost always perform as well as or better than SVMs but are far easier to understand). o If your data are very uneven with many missing variables o If you want to know which features in the data set are important o If you want something that will train fast and that will be scalable o Logistic Regression vs. Random Forests: both are fast and scalable; the latter tends to beat the former in terms of accuracy • When to use SVMs o When working with text classification or any situation where high-dimensional spaces are common o Advantage: high accuracy, generally superior in classifying complex patterns. Disadvantage: memory intensive and unsuitable for large training sets.
  • 28. Classification Algorithms • When to Use Neural Networks o Slow to converge, hard to set parameters, but good at capturing fairly complex patterns. Slow to train but fast to use; unlike SVMs the execution speed is independent of the size of the data it was trained on. o MLP neural network – well-suited for complex real-world problems – on average, superior to both SVM and Naïve Bayes. However, the resulting model is hard to interpret. • General Points o Better data often beats better algorithms – designing good features goes a long way. o With a huge dataset, choice of classification algorithm might not really affect performance much, so choose based on speed or ease of use instead. o If accuracy is paramount, try many different classifiers and select the best one by cross-validation, or use an ensemble method to combine them all.
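The advice on choosing k for kNN (vary k, score against a holdout set, compare) can be sketched in a few lines of plain Python. The data points, labels, and function names below are hypothetical; in practice a library such as scikit-learn would typically replace the hand-written classifier, and the accuracies would be plotted rather than stored in a dictionary.

```python
from collections import Counter
import math

def knn_predict(train, point, k):
    """Majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def holdout_accuracy(train, holdout, k):
    """Fraction of holdout points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

# Hypothetical two-feature data: (features, label). Class "a" is the minority.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"), ((5.1, 5.2), "b")]
holdout = [((1.1, 0.9), "a"), ((5.1, 5.0), "b")]

# Vary k and record accuracy on the holdout set.
accuracy_by_k = {k: holdout_accuracy(train, holdout, k) for k in (1, 3, 5)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
```

In this toy data, a k larger than twice the size of the minority class lets the majority class outvote it, so accuracy drops at k = 5. Real data show the same trade-off less abruptly.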
  • 30. Clustering Algorithms • Canopy clustering o Pre-clustering algorithm, often used prior to k-means or hierarchical clustering in order to speed up clustering operations on large data sets and potentially improve clustering results • DBSCAN/OPTICS o DBSCAN – density-based spatial clustering of applications with noise; finds density-based clusters in spatial data o OPTICS – ordering points to identify the clustering structure; a generalization of DBSCAN to multiple ranges so meaningful clusters can be found in areas of varying density • Hierarchical clustering • K-means clustering o Most common • Spectral clustering o Dimensionality reduction before clustering in fewer dimensions
  • 31. Clustering Decision Tree • Do you want to define the number of clusters beforehand? o No: Do your points have varying densities? No: DBSCAN. Yes: OPTICS. o Yes: How many clusters would you have? A few: Spectral clustering. A medium number: K-means. A large number: Hierarchical clustering.
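Since k-means is described as the most common choice, a minimal sketch of it (Lloyd's algorithm) may help show what "defining the number of clusters beforehand" means in practice. The coordinates and the choice of initial centroids are hypothetical; real implementations usually pick starting centroids randomly or with a scheme such as k-means++ and run several restarts.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i])) for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical household coordinates forming two groups.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0), (12.0, 10.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The number of clusters is fixed by how many initial centroids are passed in, which is the "yes" branch of the decision tree. Density-based methods such as DBSCAN instead infer the number of clusters from the data.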
  • 32. Deep Learning • Why? o Computers can learn without being taught o Can adapt to experience rather than being dependent on a human programmer o Think of the baby that learns sounds, then words, then sentences – must start at low-level features and graduate to higher-level representations • What? o Essentially, layers of neural networks o Restricted Boltzmann Machines, Deep Belief Networks, Auto-Encoders o http://www.meetup.com/Chicago-Machine-Learning-Study-Group/files/ • Examples o Word2vec – pre-packaged deep learning software that can recognize the similarities among words (countries in Europe) as well as how they’re related to other words (countries and capitals) o AlchemyAPI – for image recognition of common objects
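The Word2vec example relies on words being represented as vectors, so that similarity and relationships such as country-to-capital become vector arithmetic. The sketch below shows that arithmetic on hand-made toy vectors. These are not learned embeddings and this is not the Word2vec software; the vectors, their dimensions, and the `analogy` helper are assumptions made purely for illustration.

```python
import math

# Hand-made toy vectors, NOT learned embeddings. Dimensions (assumed for
# illustration): [is_country, is_capital, france_related, germany_related].
vectors = {
    "france":  (1.0, 0.0, 1.0, 0.0),
    "paris":   (0.0, 1.0, 1.0, 0.0),
    "germany": (1.0, 0.0, 0.0, 1.0),
    "berlin":  (0.0, 1.0, 0.0, 1.0),
    "spain":   (1.0, 0.0, 0.0, 0.0),
    "madrid":  (0.0, 1.0, 0.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

With these toy vectors, `analogy("france", "paris", "germany", vectors)` picks the word closest to "paris minus france plus germany". Learned embeddings support the same query, with the dimensions discovered from text rather than assigned by hand.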
  • 35. Hadoop Connectors • R: rmr2 allows MapReduce jobs from R environment; bridges in-memory and HDFS o Non-Hadoop R for Big Data: pbdR (programming with big data in R) – allows R to use large HPC platforms with thousands of cores by providing an interface to MPI, NetCDF4, and more • MongoDB and Hadoop: Mongo-Hadoop 1.1 • Pattern: migrating predictive models from SAS, Microstrategy, SQL Server, etc. to Hadoop via PMML (XML standard for predictive model markup) • .NET MapReduce API for Hadoop • Python for Hadoop