Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
3. DATA ORCHESTRATION SUMMIT
Data Lake supports T3GO Intelligent Transportation
Data collection and application scenarios, by domain:
Driver
• Data collection: background check, face recognition, transaction, behavior, driving, ……
• Application scenarios: safety management, driver management, UBI insurance, driving mode research, ……
Vehicle
• Data collection: vehicle condition, driving, energy consumption, accident, failure, ……
• Application scenarios: capacity scheduling, active maintenance, product improvement, car design, ……
Road
• Data collection: traffic, environmental, trajectory, POI, abnormal, ……
• Application scenarios: map drawing, real-time traffic, safety management, municipal management, ……
Cloud
• Data collection: risk control, capacity, transaction, city, user, ……
• Application scenarios: intelligent scheduling, intelligent decision, smart marketing, customer experience, ……
4. DATA ORCHESTRATION SUMMIT
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
5. DATA ORCHESTRATION SUMMIT
Shared-nothing (pros)
• Tables are horizontally partitioned across nodes
• Every node has its own local storage
• Every node is only responsible for its local table partitions
• Elegant and easy to reason about
• Scales well for star-schema queries
• Dominant architecture in data warehousing
Diagram labels: Network, CPU, Memory, Disk
6. DATA ORCHESTRATION SUMMIT
Shared-nothing (cons)
• Shared-nothing couples compute and storage resources
• Elasticity
  - Resizing the compute cluster requires redistributing (lots of) data
  - Cannot simply shut off unused compute resources → no pay-per-use
• Limited availability
  - Membership changes (failures, upgrades) significantly impact performance and may cause downtime
• Homogeneous resources vs. heterogeneous workload
  - Bulk loading, reporting, exploratory analysis
Diagram labels: Network, CPU, Memory, Disk
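The elasticity point can be made concrete with a small calculation. The sketch below is only an illustration, not taken from the talk: it assumes rows are assigned to nodes by a simple modulo on an integer key (the function names `node_for` and `fraction_moved` are hypothetical) and counts how many rows change owner when a node is added.

```python
def node_for(key: int, num_nodes: int) -> int:
    """Assign a row to a node by modulo partitioning on its integer key."""
    return key % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of rows whose owning node changes when the cluster is resized."""
    keys = list(keys)
    moved = sum(
        1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Growing a 4-node cluster to 5 nodes under modulo partitioning.
    share = fraction_moved(range(1000), old_nodes=4, new_nodes=5)
    print(f"{share:.0%} of rows change nodes")
```

Under this naive scheme most rows must be shipped to a different node on every resize, which is the cost that decoupling storage from compute avoids. Real systems may use smarter placement, so the exact share here is specific to the assumed scheme.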
7. DATA ORCHESTRATION SUMMIT
Multi-cluster, Shared-data
• No data silos
  - Storage decoupled from compute
• Any data
  - Native for structured & semi-structured
• Unlimited scalability
  - Along many dimensions
• Homogeneous resources vs. heterogeneous loads
  - Bulk loading, reporting, exploration and analysis
Diagram: Data lake storage shared by the Ad-Hoc cluster, OLAP cluster, Data Warehouse cluster, ETL cluster, BI cluster, and ML cluster
9. DATA ORCHESTRATION SUMMIT
T3GO data lake technical architecture diagram
Diagram layers: multiple calculation, resource management (YARN), orchestration acceleration, storage format, data lake storage (Aliyun OSS); grouped into Computing and Storage
10. DATA ORCHESTRATION SUMMIT
Why not a traditional Hadoop data warehouse?
Chart: order payment rate over time
Long-tail payments: some orders are paid only before the next trip!
• Long business closed-loop window
• Hot and cold data is updated randomly and cannot be told apart
• Multi-level updates, long pipelines, high cost
11. DATA ORCHESTRATION SUMMIT
High backtracking costs for order analysis
Entities: Order, Driver, Vehicle, Passenger, Trip
Order (snapshot table):
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Joined by key to other snapshot tables:
• driver_id → Driver (snapshot table)
• user_id → User (snapshot table)
• veh_id → Trip (snapshot table)
Historical snapshots from half a year ago are no longer accessible!
12. DATA ORCHESTRATION SUMMIT
Data ingestion pipeline cannot guarantee reliability
Pipeline: Business system → (data ingest) → Data warehouse → (data processing) → BI / Report
1. Of 100,000 records written, only 99,700 succeed?
2. Incorrect calculation logic produces dirty data?
3. Data is written repeatedly because of an unstable network?
15. DATA ORCHESTRATION SUMMIT
Introduction to Apache Hudi
Hudi: Hadoop Upserts Deletes and Incrementals
• Manages ultra-large-scale (hundreds of PB) analytical datasets on DFS or cloud storage
• Incremental data lake processing framework supporting insert, update, and delete
• Joined the Apache Incubator in January 2019, graduated as a top-level project (TLP) in May 2020
• Available out of the box on all cloud services (AWS / Tencent Cloud / Aliyun)
• Has been running stably at Uber for nearly 4 years
Key features: ACID, storage management, time travel, incremental
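As a rough illustration of what an upsert write looks like from Spark, the sketch below builds the option map for Hudi's Spark datasource. The configuration keys are the ones named in Hudi's documentation. The table name, field names, and the helper `hudi_upsert_options` are hypothetical, chosen to match the order table shown earlier in the deck.

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        partition_field, table_type="COPY_ON_WRITE"):
    """Build Spark datasource options for an upsert into a Hudi table."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # Record key: identifies the row to update (used for index lookups).
        "hoodie.datasource.write.recordkey.field": record_key,
        # Precombine field: when a batch holds several rows with the same
        # key, the row with the largest value in this field wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
    }


# Hypothetical order table keyed by order_id.
opts = hudi_upsert_options(
    table_name="orders",
    record_key="order_id",
    precombine_field="lastupdate_time",
    partition_field="create_date",
)

# With a SparkSession and the Hudi bundle on the classpath, the write would
# look roughly like this (assumed usage, not executed here):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Option names have changed between Hudi releases, so they should be checked against the version actually deployed.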
17. DATA ORCHESTRATION SUMMIT
Hudi storage modes and views

Storage mode: Copy On Write
• Supported query types: Snapshot Query, Incremental Query
• Features: read heavy; focus on low-latency queries; columnar Parquet data files
Storage mode: Merge On Read
• Supported query types: Snapshot Query, Incremental Query, Read Optimized Query
• Features: write heavy; focus on rapid data ingestion; columnar Parquet data files; row-based Avro incremental files

Query engine support, Copy On Write:
Query Engine        Snapshot Queries    Incremental Queries    Read Optimized Queries
Hive                Y                   Y                      -
Spark SQL           Y                   Y                      -
Spark Datasource    Y                   Y                      -
Presto              Y                   N                      -
Impala              Y                   N                      -

Query engine support, Merge On Read:
Query Engine        Snapshot Queries    Incremental Queries    Read Optimized Queries
Hive                Y                   Y                      Y
Spark SQL           Y                   Y                      Y
Spark Datasource    Y                   N                      Y
Presto              Y                   N                      Y
Impala              N                   N                      Y
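The difference between a snapshot query and a read optimized query on a Merge On Read table can be shown with a toy model. This is a simplified sketch, not Hudi's implementation: a dict stands in for the columnar base files, a list for the row-based log files, and the class name `ToyMergeOnReadTable` is hypothetical.

```python
class ToyMergeOnReadTable:
    """Toy Merge On Read table: base data plus an append-only update log."""

    def __init__(self):
        self.base = {}  # stands in for columnar base files
        self.log = []   # stands in for row-based log files: (key, record)

    def upsert(self, key, record):
        # Writes are cheap: append to the log, do not rewrite base data.
        self.log.append((key, record))

    def read_optimized(self):
        # Reads only base data: fast, but may miss recent updates.
        return dict(self.base)

    def snapshot(self):
        # Merges base data with the log at read time: always up to date.
        merged = dict(self.base)
        for key, record in self.log:
            merged[key] = record
        return merged

    def compact(self):
        # Folds the log into the base data and clears the log.
        self.base = self.snapshot()
        self.log = []


table = ToyMergeOnReadTable()
table.upsert("order-1", {"status": "created"})
table.compact()
table.upsert("order-1", {"status": "end"})

print(table.read_optimized()["order-1"]["status"])  # state as of last compaction
print(table.snapshot()["order-1"]["status"])        # latest state
```

A Copy On Write table corresponds to compacting on every write, which is why it has no separate read optimized view in the tables above.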
18. DATA ORCHESTRATION SUMMIT
Time travel queries take you "back in time"
Entities: Order, Driver, Vehicle, Passenger, Trip
Order (snapshot table):
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Joined by key to the version as of the order date:
• driver_id → Driver (v_2020-06-01)
• user_id → User (v_2020-06-01)
• veh_id → Trip (v_2020-06-01)
Take time back to the moment the order occurred!
Hudi features: time travel, data versions
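The idea of joining an order against the driver record as it stood on the order date can be sketched with a small versioned lookup. This is a toy illustration only. Hudi tracks versions through its commit timeline rather than through a structure like this, and the class name `VersionedTable` and the sample data are hypothetical.

```python
from bisect import bisect_right


class VersionedTable:
    """Toy table that keeps every version of a record, ordered by timestamp."""

    def __init__(self):
        # key -> (sorted timestamps, values in the same order)
        self._versions = {}

    def put(self, key, ts, value):
        times, values = self._versions.setdefault(key, ([], []))
        idx = bisect_right(times, ts)
        times.insert(idx, ts)
        values.insert(idx, value)

    def as_of(self, key, ts):
        """Return the latest version written at or before ts, or None."""
        times, values = self._versions.get(key, ([], []))
        idx = bisect_right(times, ts)
        return values[idx - 1] if idx else None


# ISO-formatted dates compare correctly as plain strings.
drivers = VersionedTable()
drivers.put("driver-1", "2020-01-01", {"city": "Nanjing"})
drivers.put("driver-1", "2020-09-01", {"city": "Chongqing"})

order = {"order_id": "xxx", "driver_id": "driver-1", "create_time": "2020-06-01"}
driver_then = drivers.as_of(order["driver_id"], order["create_time"])
print(driver_then)  # the driver record as it stood when the order occurred
```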
19. DATA ORCHESTRATION SUMMIT
Hudi guarantees the reliability of the data ingestion pipeline
Pipeline: Business system → (data ingest) → Data warehouse → (data processing) → BI / Report
1. Of 100,000 records written, only 99,700 succeed?
2. Incorrect calculation logic produces dirty data?
3. Data is written repeatedly because of an unstable network?
Hudi's answers:
• Invisible!
• All data commit rollback!
• Deduplication based on index!
Hudi MVCC writes updated data to versioned Parquet base files and log files!
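The three guarantees above (uncommitted data is invisible, a bad commit can be rolled back, repeated writes are deduplicated by key) can be sketched with a toy table. This is a minimal model under simplifying assumptions, not Hudi's implementation. The class `ToyTable` and its methods are hypothetical.

```python
class ToyTable:
    """Toy table: writes are staged and readers see only committed batches."""

    def __init__(self):
        self._committed = {}  # record key -> record, visible to readers
        self._staged = None   # in-flight batch, invisible to readers

    def begin(self):
        self._staged = {}

    def write(self, key, record):
        # Keyed by record key: writing the same key twice replaces the
        # earlier write instead of creating a duplicate row.
        self._staged[key] = record

    def commit(self):
        # The whole batch becomes visible at once.
        self._committed.update(self._staged)
        self._staged = None

    def rollback(self):
        # The whole batch is discarded; committed data is untouched.
        self._staged = None

    def read(self):
        return dict(self._committed)


table = ToyTable()

table.begin()
table.write("order-1", {"amount": 10})
table.write("order-1", {"amount": 10})  # retried write after a network error
visible_before_commit = table.read()    # still empty: batch is invisible
table.commit()

table.begin()
table.write("order-2", {"amount": -999})  # dirty data from faulty logic
table.rollback()

print(table.read())
```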
21. DATA ORCHESTRATION SUMMIT
Why the T3Go data lake needs Alluxio
Serious network delay when reading and writing
Multi-cluster naming is not uniform
Low cluster stability
Low memory resource utilization
High timeout tolerance
Inefficient calculation
Serious network delay
Cache miss?
T3 Trips
Store
22. DATA ORCHESTRATION SUMMIT
How the data lake benefits from Alluxio
Better read and write performance
Unified namespace
Higher cluster stability
Higher cluster resource utilization
Fewer timeouts
T3 Trips
Store
Efficient Calculation
Low Latency
Alluxio
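The read-performance benefit comes from serving repeated reads from a cache near the compute nodes instead of the remote object store. The sketch below shows the general read-through pattern only. It is not Alluxio's API or its actual eviction logic. The class `ReadThroughCache`, the LRU eviction policy, and the `oss://` paths are illustrative assumptions.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache in front of a slow remote store (LRU eviction)."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # function: path -> bytes (remote read)
        self._capacity = capacity    # maximum number of cached paths
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._blocks:
            self._blocks.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self._blocks[path]
        self.misses += 1
        data = self._fetch(path)            # slow path: go to remote storage
        self._blocks[path] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return data


remote_reads = []


def read_from_remote(path):
    """Stand-in for a remote object store read; records each call."""
    remote_reads.append(path)
    return b"data:" + path.encode()


cache = ReadThroughCache(read_from_remote, capacity=2)
cache.read("oss://bucket/a")
cache.read("oss://bucket/a")  # served from cache, no remote read
cache.read("oss://bucket/b")
cache.read("oss://bucket/c")  # cache is full, so "a" is evicted
cache.read("oss://bucket/a")  # cache miss again
```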
24. DATA ORCHESTRATION SUMMIT
How the T3GO data lake uses Alluxio & Hudi
Diagram labels:
• OSS
• Spark Cluster: Write Hudi File
• Alluxio Cluster A: Sync to OSS
• Ad-hoc cluster: Presto workers, Alluxio Cluster B, Short-Circuit Local Reads, Read Hive Table
• Kylin Cluster: Kylin, Alluxio Cluster C, Short-Circuit Local Reads, Read Hive Table
25. DATA ORCHESTRATION SUMMIT
Hudi & Alluxio case 1: near-real-time analysis on the data lake
Low-latency data ingest
• Hudi and Spark decoupling
• Support for Flink streaming writes
Efficient and fast data processing
• Commit notification on write
• Scheduling integration
Low-latency interactive query analysis
• Zeppelin and Presto integration
• Alluxio data orchestration acceleration
Diagram labels: Streaming consume, Streaming Product, Scheduling processing, Data orchestration
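Once a commit notification arrives, a downstream job only needs the records that changed since the last commit it processed. A sketch of the read options for such an incremental pull through Hudi's Spark datasource follows. The configuration keys are the ones named in Hudi's documentation. The helper name and the sample instant are hypothetical.

```python
def hudi_incremental_read_options(begin_instant: str) -> dict:
    """Build Spark datasource options to read changes after a commit instant.

    begin_instant is a Hudi commit time such as "20200601000000"
    (yyyyMMddHHmmss); only commits after it are returned.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


# Hypothetical checkpoint: the last commit this job has already processed.
read_opts = hudi_incremental_read_options("20200601000000")

# With a SparkSession and the Hudi bundle on the classpath, the read would
# look roughly like this (assumed usage, not executed here):
#   changes = spark.read.format("hudi").options(**read_opts).load(base_path)
```

Option names have changed between Hudi releases, so they should be checked against the version in use.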
26. DATA ORCHESTRATION SUMMIT
Hudi & Alluxio case 1: near-real-time analysis on the data lake
OSS
Presto workers Alluxio Cluster B Kylin
Short-Circuit Local Reads Short-Circuit Local Reads
Read Hive Table Read Hive Table
Alluxio Cluster C
Load Hudi to Kylin temp table
Load Hudi to Presto local worker
Ad-hoc Query
Self-service Report Analysis
28. DATA ORCHESTRATION SUMMIT
Alluxio stress test
Hudi on OSS performance is poor!
In the stress test, once the data volume exceeds a certain magnitude (24 million records), queries through Alluxio + OSS become faster than queries against HDFS in a hybrid deployment.
Once the data volume exceeds 100 million records, the query speed starts to double. At 600 million records, it is up to 12 times faster than querying OSS natively and 8 times faster than querying HDFS natively.
The speedup factor depends on the machine configuration.