Hortonworks & Bilot Data Driven Transformations with Hadoop
2. Page 2 © Hortonworks Inc. 2014
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
Business Value
Clickstream
Geolocation
Web Data
Internet of Things
Docs, emails
Server logs
2012
2.8 Zettabytes
2020
40 Zettabytes
LAGGARDS
INDUSTRY
LEADERS
1
2 New Data
ERP CRM SCM
New
Traditional
3. Page 3 © Hortonworks Inc. 2014
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
Clickstream
Web & Social
Geolocation
Sensor & Machine
Server Logs
Unstructured
SOURCES
Existing Systems
ERP CRM SCM
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
ANALYTICS
Applications
Business
Analytics
Visualization
& Dashboards
HDFS
(Hadoop Distributed File System)
YARN: Data Operating System
Interactive Real-Time Batch Partner ISV Batch Batch
MPP
EDW
4. Page 4 © Hortonworks Inc. 2014
Hortonworks Data Platform
powered by Apache Hadoop
Enrich
Context
Store Data
and Metadata
Internet
of Anything
Hortonworks DataFlow
powered by Apache NiFi
Perishable
Insights
Historical
Insights
Hortonworks DataFlow Adds to Hadoop Capabilities
Hortonworks DataFlow and Hortonworks Data Platform
deliver the industry’s most complete solution for Big Data management
5. Page 5 © Hortonworks Inc. 2014
Only Hortonworks Delivers Open Enterprise Hadoop
HORTONWORKS DATA PLATFORM
YARN: Data Operating System
CLICKSTREAM SENSOR SOCIAL MOBILE GEOLOCATION SERVERLOG
Batch Interactive Search Streaming Machine Learning
EXISTING
6. Page 6 © Hortonworks Inc. 2014
YARN
DATA OPERATING SYSTEM
OPERATIONS SECURITY
GOVERNANCE
STORAGE
Machine
Learning
Batch
Streaming
Interactive
Search
Centralized Platform
for operations, governance and security
Diverse Applications
run simultaneously on a single cluster
Maximum Data Ingest
including existing and new sources,
regardless of raw format
Shared Big Data Assets
across business groups, functions and users
Centralized Platform with YARN-Based Architecture
7. Page 7 © Hortonworks Inc. 2014
Offering You the Most Flexibility
ANY DATA
Existing and new datasets
ANY APPLICATION
Multiple engines for data analysis
ANYWHERE
Complete range of deployment options
Batch
Interactive
Search
Streaming
Machine Learning
Clickstream
Sensor
Social
Mobile
Geolocation
Server Log
Linux
Windows
Cloud
On-Premise
8. Page 8 © Hortonworks Inc. 2014
Hortonworks Capabilities
The Data Flow Thing
Process
and
Analyze
Collect
Store & Integrate
9. Page 9 © Hortonworks Inc. 2014
Hadoop Driver: Cost optimization
Archive Data off EDW
Move rarely used data to Hadoop as active
archive, store more data longer
Offload costly ETL process
Free your EDW to perform high-value functions like
analytics & operations, not ETL
Enrich the value of your EDW
Use Hadoop to refine new data sources, such as
web and machine data for new analytical context
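The "active archive" idea above can be pictured as a simple routing rule. The following is a minimal sketch, assuming a hypothetical catalog of tables with last-access dates and an illustrative 180-day threshold; the table names, dates, and threshold are assumptions and do not come from the slides.

```python
from datetime import date

# Hypothetical threshold: tables untouched for longer than this count as "cold".
COLD_AFTER_DAYS = 180

def plan_archive(tables, today, cold_after_days=COLD_AFTER_DAYS):
    """Split a {table_name: last_access_date} mapping into hot and cold lists.

    Hot tables stay in the EDW; cold tables are candidates for the
    Hadoop active archive. Both lists are ordered by table name.
    """
    hot, cold = [], []
    for name, last_access in sorted(tables.items()):
        age_days = (today - last_access).days
        (cold if age_days > cold_after_days else hot).append(name)
    return hot, cold

# Illustrative catalog; the table names and dates are made up.
catalog = {
    "orders_2015": date(2015, 9, 1),
    "orders_2012": date(2013, 1, 15),
    "clickstream_raw": date(2014, 2, 3),
}
hot_tables, cold_tables = plan_archive(catalog, today=date(2015, 10, 1))
```

In practice the threshold would likely be driven by query logs and storage cost rather than a fixed number of days.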
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
HDP helps you reduce costs and optimize the value associated with your EDW
DATA SYSTEMS
HDP 2.3
ELT
Cold Data,
Deeper Archive
& New Sources
Enterprise Data
Warehouse
Hot
MPP
In-Memory
Clickstream
Web & Social
Geolocation
Sensor & Machine
Server Logs
Unstructured
Existing Systems
ERP CRM SCM
SOURCES
10. Page 10 © Hortonworks Inc. 2014
Single View
Improve acquisition and retention
Predictive Analytics
Identify your next best action
Data Discovery
Uncover new findings
Financial Services
New Account Risk Screens | Trading Risk | Insurance Underwriting
Improved Customer Service | Insurance Underwriting | Aggregate Banking Data as a Service
Cross-sell & Upsell of Financial Products | Risk Analysis for Usage-Based Car Insurance | Identify Claims Errors for Reimbursement
Telecom
Unified Household View of the Customer | Searchable Data for NPTB Recommendations | Protect Customer Data from Employee Misuse
Analyze Call Center Contact Records | Network Infrastructure Capacity Planning | Call Detail Records (CDR) Analysis
Inferred Demographics for Improved Targeting | Proactive Maintenance on Transmission Equipment | Tiered Service for High-Value Customers
Retail
360° View of the Customer | Supply Chain Optimization | Website Optimization for Path to Purchase
Localized, Personalized Promotions | A/B Testing for Online Advertisements | Data-Driven Pricing, improved loyalty programs
Customer Segmentation | Personalized, Real-time Offers | In-Store Shopper Behavior
Manufacturing
Supply Chain and Logistics | Optimize Warehouse Inventory Levels | Product Insight from Electronic Usage Data
Assembly Line Quality Assurance | Proactive Equipment Maintenance | Crowdsource Quality Assurance
Single View of a Product Throughout Lifecycle | Connected Car Data for Ongoing Innovation | Improve Manufacturing Yields
Healthcare
Electronic Medical Records | Monitor Patient Vitals in Real-Time | Use Genomic Data in Medical Trials
Improving Lifelong Care for Epilepsy | Rapid Stroke Detection and Intervention | Monitor Medical Supply Chain to Reduce Waste
Reduce Patient Re-Admittance Rates | Video Analysis for Surgical Decision Support | Healthcare Analytics as a Service
Oil & Gas
Unify Exploration & Production Data | Monitor Rig Safety in Real-Time | Geographic exploration
DCA to Slow Well Decline Curves | Proactive Maintenance for Oil Field Equipment | Define Operational Set Points for Wells
Government
Single View of Entity | CBM & Autonomic Logistic Analysis | Sentiment Analysis on Program Effectiveness
Prevent Fraud, Waste and Abuse | Proactive Maintenance for Public Infrastructure | Meet Deadlines for Government Reporting
Hadoop Driver: Advanced analytic applications
11. Page 11 © Hortonworks Inc. 2014
Hortonworks Data Platform
Hortonworks Data Platform 2.3
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture
of core enterprise services, for any application and any data.
Open & Enterprise
• HDP incorporates every element
required of an enterprise data
platform: data storage, data access,
governance, security, operations
• All components are developed in
open source and then rigorously
tested, certified, and delivered as an
integrated open source platform that’s
easy to consume and use by the
enterprise and ecosystem.
YARN: Data Operating System
(Cluster Resource Management)
Apache Pig
HDFS
(Hadoop Distributed File System)
INTEGRATION
GOVERNANCE
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Apache Falcon
Apache Hive
Apache Slider
Apache HBase
Apache Accumulo
Apache Solr
Apache Spark
Apache Storm
Apache Sqoop
Apache Flume
Apache Kafka
SECURITY
Apache Ranger
Apache Knox
Apache Falcon
OPERATIONS
Apache Ambari
Apache ZooKeeper
Apache Oozie
Apache Atlas
Cloudbreak
12. Page 12 © Hortonworks Inc. 2014
HDP: Any Data, Any Application, Anywhere
Any Application
• Deep integration with ecosystem
partners to extend existing
investments and skills
• Broadest set of applications through
the stable of YARN-Ready applications
Any Data
Deploy applications fueled by clickstream, sensor,
social, mobile, geo-location, server log, and other new
paradigm datasets with existing legacy datasets.
Anywhere
Implement HDP naturally across the complete
range of deployment options
Clickstream
Web & Social
Geolocation
Internet of Things
Server Logs
Files, emails
ERP CRM SCM
commodity appliance cloud hybrid
Over 70 Hortonworks Certified YARN Apps
14. Page 14 © Hortonworks Inc. 2014
What is a Data Lake?
• It is a PLATFORM for your data (NOT a database).
• Multipurpose open PLATFORM to land all data in a single place and
interact with it in many ways (Stream, Batch, Interactive Query).
• A platform that allows the ecosystem to provide higher-level services
(SAP, SAS, Microsoft, Teradata, etc.)
• Provides first-class APIs and frameworks to enable integration
• Provides first-class data management capabilities (metadata
management, security, governance, transformation pipelines, replication,
retention, etc.)
15. Spotify Use Case
Full presentation available at:
http://www.slideshare.net/JoshBaer/how-apache-drives-music-recommendations-at-spotify?related=1
16. Page 16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Discovery and Predictive Analytics
Elefante Wine Inc. Use Case & Demo
Mats Johansson
Solutions Engineer EMEA
Hortonworks
Tweet: #hadooproadshow
17. Page 17 © Hortonworks Inc. 2014
Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine
in a highly-regulated industry with stringent transportation requirements.
The Situation
Recently a number of driver violations led to fines and increased insurance rates
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization
18. Page 18 © Hortonworks Inc. 2014
Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generating
large volumes of new data:
• Location
• Speed
• Driver Violations
Need to integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they
happen and take precautionary actions
Find predictive correlations in driver behavior
over large volumes of real-time data
Difficult to deliver timely insights to the right
people and systems to take action
Data Discovery
Uncover new findings
Predictive Analytics
Identify your next best action
Better Understanding
of the Past
Better Prediction
of the Future
19. Page 19 © Hortonworks Inc. 2014
Elefante Wine’s YARN-enabled Architecture
Distributed Storage: HDFS
Many Workloads: YARN
Stream Processing (Storm)
Inbound Messaging (Kafka)
Real-time Serving (HBase)
Alerts & Events (ActiveMQ)
Real-Time Web App
SQL
Interactive Query (Hive on Tez)
Truck Sensors
One cluster with consistent security, governance & operations
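The flow through this architecture (truck sensors, inbound messaging, stream processing, real-time serving, alerts) can be sketched in a few lines. This is a minimal in-memory illustration, not the demo's code: a `deque` stands in for the Kafka topic, a `dict` for the HBase table, and a `list` for the ActiveMQ queue. The event fields and the 90 km/h threshold are assumptions made for the example.

```python
from collections import deque

SPEED_LIMIT_KMH = 90  # hypothetical threshold, not taken from the slides

def process_stream(inbound, serving_store, alerts, speed_limit=SPEED_LIMIT_KMH):
    """Drain the inbound queue, update per-truck state, and emit alerts.

    inbound       -- deque of event dicts (stand-in for a Kafka topic)
    serving_store -- dict keyed by truck id (stand-in for an HBase table)
    alerts        -- list of alert dicts (stand-in for an ActiveMQ queue)
    """
    while inbound:
        event = inbound.popleft()
        # Keep only the latest event per truck for the real-time web app.
        serving_store[event["truck_id"]] = event
        if event["speed_kmh"] > speed_limit:
            alerts.append({"truck_id": event["truck_id"],
                           "reason": "speeding",
                           "speed_kmh": event["speed_kmh"]})

# Made-up sensor readings for two trucks.
inbound = deque([
    {"truck_id": "T1", "speed_kmh": 82, "lat": 59.33, "lon": 18.06},
    {"truck_id": "T2", "speed_kmh": 104, "lat": 57.71, "lon": 11.97},
    {"truck_id": "T1", "speed_kmh": 95, "lat": 59.40, "lon": 18.00},
])
serving_store, alerts = {}, []
process_stream(inbound, serving_store, alerts)
```

In the architecture on the slide, each stand-in would be replaced by the corresponding component, with the processing logic running as a Storm topology on YARN.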
20. Page 20 © Hortonworks Inc. 2014
Explore Enriched Events to Build a Predictive Model
Apache Zeppelin
Notebook environment that supports Spark
Agile data visualizations
Zeppelin Supports Spark Jobs on
YARN
Data Scientists
Explore and visualize events in Zeppelin
Build a machine-learning model in Spark, to
predict driver violations
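To make "build a machine-learning model to predict driver violations" concrete, here is a minimal sketch in plain Python rather than Spark, so that it runs without a cluster. It assumes a logistic-regression model and three hypothetical binary features (foggy, rainy, driver certified); the slides do not say which algorithm or features the demo uses, and the toy data below is invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic-regression weights with batch gradient descent.

    rows   -- list of feature lists (all the same length)
    labels -- list of 0/1 outcomes (1 = violation)
    Returns (weights, bias).
    """
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Return the modelled probability of a violation for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Toy data: [is_foggy, is_rainy, driver_is_certified] -> violation (0 or 1).
events = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1],
          [0, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
violations = [1, 1, 1, 0, 0, 0, 1, 0]
weights, bias = train_logistic(events, violations)
```

On this toy data the fitted weight for fog is positive and the weight for certification is negative. A Spark version would train on the full enriched event set instead of eight hand-written rows.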
21. Page 21 © Hortonworks Inc. 2014
Streaming Demo
Data Discovery Through Streaming Sensor Data from Trucks
22. Page 22 © Hortonworks Inc. 2014
Enriching Truck Events for Analysis with Pig
Sources: Raw Truck Events (HDFS), Raw Weather Data (Weather Data Sets), Payroll Data (HR & Payroll DBs)
HCatalog (Metadata)
Pipeline: Load Raw Truck Events, Clean & Filter (Cleaned Events), Transform (Transformed Events), Join with HR & weather data (Enriched Events), Store (Enriched Events)
Zeppelin
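The enrichment steps on this slide (load, clean and filter, transform, join with HR and weather data, store) are done with Pig in the demo. The sketch below mirrors the same stages in plain Python purely for illustration; the field names, the cleaning rules, and the join keys are assumptions, not details taken from the demo scripts.

```python
def clean_and_filter(raw_events):
    """Drop events that lack a driver id or report a negative speed."""
    return [e for e in raw_events
            if e.get("driver_id") is not None and e.get("speed_kmh", -1) >= 0]

def transform(events):
    """Add a derived 'day' field taken from the ISO timestamp."""
    return [{**e, "day": e["timestamp"][:10]} for e in events]

def enrich(events, drivers, weather):
    """Join each event with HR data (by driver) and weather (by city and day)."""
    enriched = []
    for e in events:
        driver = drivers.get(e["driver_id"], {})
        conditions = weather.get((e["city"], e["day"]), {})
        enriched.append({**e, **driver, **conditions})
    return enriched

# Made-up inputs standing in for the raw truck events, HR data and weather data.
raw_truck_events = [
    {"driver_id": 11, "city": "Helsinki", "timestamp": "2015-09-01T08:15:00", "speed_kmh": 88},
    {"driver_id": None, "city": "Helsinki", "timestamp": "2015-09-01T08:16:00", "speed_kmh": 70},
    {"driver_id": 12, "city": "Turku", "timestamp": "2015-09-01T09:00:00", "speed_kmh": -5},
]
drivers = {11: {"certified": True, "hours_logged": 46}}
weather = {("Helsinki", "2015-09-01"): {"foggy": True, "rainy": False}}

enriched_events = enrich(transform(clean_and_filter(raw_truck_events)), drivers, weather)
```

Only the first event survives cleaning here; it comes out carrying both the driver attributes and the weather conditions, which is the shape a notebook such as Zeppelin would then explore.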
23. Page 23 © Hortonworks Inc. 2014
Apache Zeppelin Visualization Demo
Exploring and Model Building on enriched sensor data
24. Page 24 © Hortonworks Inc. 2014
Recommendations from the CDO
Investment recommendations, in order of priority
1. Visibility sensors and auto braking systems to deal with foggy conditions
2. Slip-resistant tires for improved safety during rainy conditions
3. Driver certification to minimize violations
25. Page 25 © Hortonworks Inc. 2014
Real-time and Predictive Application Architecture
Truck sensors
Messages
Apps on YARN
SQL Stream NoSQL ML
Trucking company datasets stored in HDFS
Your BI Tool
Predictive application
Use Model
App alerts (ActiveMQ)
26. Page 26 © Hortonworks Inc. 2014
Large Scale Machine-Learning Insights for Elefante Wine
Improve Predictive Power
Algorithms on Terabytes of data
Improve confidence by testing hypotheses over
huge datasets
Accelerate Time to Market
Rapidly test out machine-learning algorithms
Integrate Predictive Models into Apps
Run models in Storm or your other apps
Run it All in a Multi-Tenant Cluster
Large scale machine learning on YARN respects
other tenants in an HDP cluster
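One way to read "integrate predictive models into apps" is to export the trained coefficients and load them wherever events are scored. The sketch below assumes a logistic model exchanged as JSON. The JSON layout, the feature names, and the coefficient values are all hypothetical, and a Storm bolt or other app would call `score` per event.

```python
import json
import math

def export_model(weights, bias, feature_names):
    """Serialize model coefficients to a JSON string."""
    return json.dumps({"features": feature_names, "weights": weights, "bias": bias})

def load_scorer(model_json):
    """Return a function that maps an event dict to a violation probability."""
    model = json.loads(model_json)

    def score(event):
        # Missing features default to 0; booleans are treated as 0.0 or 1.0.
        z = model["bias"] + sum(w * float(event.get(name, 0))
                                for name, w in zip(model["features"], model["weights"]))
        return 1.0 / (1.0 + math.exp(-z))

    return score

# Illustrative coefficients, not fitted to any real data.
model_json = export_model([2.0, 1.0, -1.5], -1.0, ["foggy", "rainy", "certified"])
score = load_scorer(model_json)
risk = score({"foggy": True, "rainy": False, "certified": False})
```

Keeping the model as plain data means the training job and the scoring app can be deployed and updated independently.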
27. Page 27 © Hortonworks Inc. 2014
Try It Yourself, Download the Sandbox:
hortonworks.com/sandbox
28. Page 28 © Hortonworks Inc. 2014
Thank you!
Mats Johansson
mjohansson@hortonworks.com
@matsjo66
https://se.linkedin.com/in/matsjo66