DATA ORCHESTRATION SUMMIT
2020
High-performance data lake with Apache Hudi
and Alluxio at T3GO
Trevor Zhang | Big Data Sr. Engineer
VinoYang | Head of T3Go Big Data Platform
Agenda
1. T3GO data lake introduction
2. Why Apache Hudi
3. Hudi & Alluxio practice
Data Lake supports T3GO Intelligent Transportation
Driver
Data collection:
• Background check
• Face recognition
• Transaction
• Behavior
• Driving
• ……
Application scenarios:
• Safety management
• Driver management
• UBI insurance
• Driving mode research
• ……
Vehicle
Data collection:
• Vehicle condition
• Driving
• Energy consumption
• Accident
• Failure
• ……
Application scenarios:
• Capacity scheduling
• Active maintenance
• Product improvement
• Car design
• ……
Road
Data collection:
• Traffic
• Environmental
• Trajectory
• POI
• Abnormal
• ……
Application scenarios:
• Map drawing
• Real-time traffic
• Safety management
• Municipal management
• ……
Cloud
Data collection:
• Risk control
• Capacity
• Transaction
• City
• User
• ……
Application scenarios:
• Intelligent scheduling
• Intelligent decision
• Smart marketing
• Customer experience
• ……
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
Shared-nothing (pros)
• Tables are horizontally partitioned across nodes
• Every node has its own local storage
• Every node is only responsible for its local table partitions
• Elegant and easy to reason about
• Scales well for star-schema queries
• Dominant architecture in data warehousing
[Figure: shared-nothing nodes, each with its own CPU, memory, and disk, connected by a network]
Shared-nothing (cons)
• Shared-nothing couples compute and storage resources
• Elasticity
  • Resizing the compute cluster requires redistributing (lots of) data
  • Cannot simply shut off unused compute resources, so no pay-per-use
• Limited availability
  • Membership changes (failures, upgrades) significantly impact performance and may cause downtime
• Homogeneous resources vs. heterogeneous workload
  • Bulk loading, reporting, exploratory analysis
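The elasticity cost listed above can be made concrete with a small sketch. It assumes rows are placed on nodes by hashing a key modulo the node count, which is only one possible placement scheme and is not taken from the slides; `node_for`, `fraction_moved`, and the `order-N` keys are hypothetical names.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Map a row key to a node with a stable hash (modulo placement, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of rows whose owning node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)


# Hypothetical keys, used only to exercise the functions.
keys = [f"order-{i}" for i in range(10_000)]
print(f"4 -> 5 nodes: {fraction_moved(keys, 4, 5):.0%} of rows change owner")
```

With uniform hashing, a row keeps its owner only when its hash leaves the same remainder for both node counts, so growing from 4 to 5 nodes relocates about 80% of the rows in expectation. A shared-data design avoids this because the storage layer does not change when compute nodes are added or removed.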
[Figure: shared-nothing nodes, each with its own CPU, memory, and disk, connected by a network]
Multi-cluster, Shared-data
• No data silos
  • Storage decoupled from compute
• Any data
  • Native for structured & semi-structured
• Unlimited scalability
  • Along many dimensions
• Homogeneous resources vs. heterogeneous loads
  • Bulk loading, reporting, exploration and analysis
[Figure: Ad-Hoc, OLAP, Data Warehouse, ETL, BI, and ML clusters sharing one data lake storage]
Multi-cluster, Shared-data
• All data in one place
• Independently scale storage and compute
• No unload / reload to shut off compute
• Every virtual warehouse can access all data
T3GO data lake technical architecture diagram
[Figure: layers labeled Data Lake Storage (Aliyun OSS), Storage format, Orchestration acceleration, Resource management (YARN), and Multiple calculation, grouped into Storage and Computing]
Why not traditional Hadoop data warehouse
[Figure: order payment rate over time]
Long-tail payments: some riders pay only before their next trip!
• Long business closed-loop window
• Hot and cold data are updated randomly and cannot be told apart
• Multi-level updates, long pipelines, high cost
High backtracking costs for order analysis
[Figure: an order joins the Driver, Vehicle, Passenger, and Trip dimensions]
Order (snapshot table):
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Joined tables:
• Driver (snapshot table), joined on driver_id
• User (snapshot table), joined on user_id
• Trip (snapshot table), joined on veh_id
The historical snapshot from half a year ago is no longer accessible!
Data ingestion pipeline cannot guarantee reliability
[Figure: pipeline from Business system to Data Warehouse to BI / Report, with Data Ingest and Data Processing stages]
1. 100,000 records sent, but only 99,700 written successfully?
2. Incorrect calculation logic produces dirty data?
3. Data written repeatedly because of an unstable network?
Summary
Pain points of Hadoop data warehouse system
• Low reliability
• Small file problem
• Missing data versions
• No support for incremental processing
• High latency
2. Why Apache Hudi
Introduction to Apache Hudi
Hudi: Hadoop Upserts Deletes and Incrementals
• Manages ultra-large-scale (hundreds of PB) analytical datasets on DFS and cloud storage
• Incremental data lake processing framework supporting insert, update, and delete
• Joined the Apache Incubator in January 2019, graduated as a top-level project (TLP) in May 2020
• Available out of the box on all cloud services (AWS/Tencent Cloud/Aliyun)
• Has been operating stably at Uber for nearly 4 years
Key features: ACID, storage management, time travel, incremental processing
Hudi plug-in architecture
[Figure: Hudi plug-in architecture around the Hudi DataSet]
• Write: Java, Flink, Spark, Python
• Read: Hive, Presto, Spark, Impala
• Storage types: COW, MOR
• Queries/views: Read Optimized Query, Incremental Query, Snapshot Query
• Pluggable index (Bloom/HBase)
• Pluggable data format (Avro, Parquet)
• Timeline metadata
• Pluggable storage (HDFS, OSS, S3)
Hudi storage mode and view
Storage Mode     Supported Query Types        Features
Copy On Write    • Snapshot Query             • Read heavy
                 • Incremental Query          • Focus on low-latency queries
                                              • Columnar Parquet data files
Merge On Read    • Snapshot Query             • Write heavy
                 • Incremental Query          • Focus on rapid data ingestion
                 • Read Optimized Query       • Columnar Parquet data files
                                              • Row-based Avro incremental (log) files
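Copy On Write tables
Query Engine        Snapshot Queries   Incremental Queries   Read Optimized Queries
Hive                Y                  Y                     -
Spark SQL           Y                  Y                     -
Spark Datasource    Y                  Y                     -
Presto              Y                  N                     -
Impala              Y                  N                     -

Merge On Read tables
Query Engine        Snapshot Queries   Incremental Queries   Read Optimized Queries
Hive                Y                  Y                     Y
Spark SQL           Y                  Y                     Y
Spark Datasource    Y                  N                     Y
Presto              Y                  N                     Y
Impala              N                  N                     Y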
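To illustrate how the three query types differ on a Merge On Read table, here is a toy in-memory model. It is a sketch under simplifying assumptions, not Hudi's implementation: `MorFileGroup` and its methods are hypothetical names, commit times are plain strings, and a dict stands in for the columnar base file.

```python
from dataclasses import dataclass, field


@dataclass
class MorFileGroup:
    """Toy Merge On Read file group: a base file plus an append-only log."""
    base: dict = field(default_factory=dict)  # record key -> (commit_time, row)
    log: list = field(default_factory=list)   # appended (commit_time, key, row)

    def upsert(self, commit_time: str, key: str, row: dict) -> None:
        # Writes are cheap appends to the log; the base file is untouched.
        self.log.append((commit_time, key, row))

    def compact(self) -> None:
        # Compaction folds the log into a new version of the base file.
        for commit_time, key, row in self.log:
            self.base[key] = (commit_time, row)
        self.log.clear()

    def read_optimized(self) -> dict:
        # Base file only: fast, but may lag behind the latest writes.
        return {k: row for k, (_, row) in self.base.items()}

    def snapshot(self) -> dict:
        # Base file merged with the log at query time: the latest view.
        merged = {k: row for k, (_, row) in self.base.items()}
        for _, key, row in self.log:
            merged[key] = row
        return merged

    def incremental(self, begin_time: str) -> dict:
        # Only records changed by commits strictly after begin_time.
        entries = [(t, k, r) for k, (t, r) in self.base.items()] + self.log
        changed = {}
        for commit_time, key, row in sorted(entries, key=lambda e: e[0]):
            if commit_time > begin_time:
                changed[key] = row
        return changed


group = MorFileGroup()
group.upsert("20200601", "order-1", {"status": "created"})
group.compact()
group.upsert("20200602", "order-1", {"status": "end"})
print(group.read_optimized())         # still the compacted state
print(group.snapshot())               # includes the uncompacted update
print(group.incremental("20200601"))  # only what changed after that commit
```

In this model a Copy On Write table corresponds to compacting on every write, which is why it has no separate read optimized view: the base file is always current.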
Time travel queries take you "back in time"
[Figure: an order joins the Driver, Vehicle, Passenger, and Trip dimensions]
Order (snapshot table):
order_id driver_id user_id veh_id … status create_time lastupdate_time
… … … … … … …
xxx xxx xxx xxx xxx end 2020-06-01 xx:xx:xx
… … … … … … …
Joined tables, read as of the order date:
• Driver (v_2020-06-01), joined on driver_id
• User (v_2020-06-01), joined on user_id
• Trip (v_2020-06-01), joined on veh_id
Take time back to the moment the order occurred!
Hudi features: time travel, data versions
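The idea of joining an order against dimension data "as of" the order time can be sketched as follows. This is a toy model, not Hudi's API: `VersionedTable`, the timestamps, and the driver rows are made up for illustration, and commits are assumed to arrive in increasing time order.

```python
import bisect


class VersionedTable:
    """Toy versioned table: every commit keeps the previous row versions."""

    def __init__(self):
        self._versions = {}  # key -> (sorted commit times, rows in the same order)

    def commit(self, commit_time: str, key: str, row: dict) -> None:
        times, rows = self._versions.setdefault(key, ([], []))
        # Assumption: commits arrive in increasing commit_time order.
        times.append(commit_time)
        rows.append(row)

    def as_of(self, key: str, instant: str):
        """Return the row version that was current at `instant`, or None."""
        times, rows = self._versions.get(key, ([], []))
        idx = bisect.bisect_right(times, instant)
        return rows[idx - 1] if idx else None


drivers = VersionedTable()
drivers.commit("2020-06-01 08:00:00", "driver-7", {"rating": 4.6})
drivers.commit("2020-12-01 08:00:00", "driver-7", {"rating": 4.9})

# An order from June sees the driver as they were in June, not today.
print(drivers.as_of("driver-7", "2020-06-01 12:00:00"))
print(drivers.as_of("driver-7", "2021-01-01 00:00:00"))
```

A plain snapshot table keeps only the last row, so the June lookup would return the December state; keeping versions is what makes the backtracking analysis possible.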
Hudi guarantees the reliability of the data ingestion pipeline
[Figure: pipeline from Business system to Data Warehouse to BI / Report, with Data Ingest and Data Processing stages]
1. 100,000 records sent, but only 99,700 written successfully?
2. Incorrect calculation logic produces dirty data?
3. Data written repeatedly because of an unstable network?
Hudi's answers:
• Failed writes stay invisible: all data in the commit is rolled back!
• Deduplication based on the index!
• Hudi MVCC writes updated data to versioned Parquet base files and log files!
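The two guarantees named above, all-or-nothing commits and key-based deduplication, can be sketched with a toy table. This is an illustration under simplifying assumptions, not Hudi's write path: `ToyTable`, `write_batch`, and the `order_id` key are hypothetical, and a dict lookup stands in for the index.

```python
class ToyTable:
    """Toy table with atomic batch commits and upsert-by-key."""

    def __init__(self):
        self.rows = {}     # record key -> row, the state visible to readers
        self.commits = []  # completed commit times (a minimal timeline)

    def write_batch(self, commit_time: str, batch: list) -> None:
        staged = dict(self.rows)  # the writer works on a private copy
        for row in batch:
            if "order_id" not in row:
                # Any failure abandons the staged copy: nothing becomes visible.
                raise ValueError(f"commit {commit_time} rolled back: row without key")
            # The key lookup plays the role of the index: a retried row
            # overwrites its earlier copy instead of creating a duplicate.
            staged[row["order_id"]] = row
        # Publishing the staged state is a single step, so readers see
        # either the whole batch or none of it.
        self.rows = staged
        self.commits.append(commit_time)


table = ToyTable()
table.write_batch("001", [{"order_id": "a", "status": "created"}])
table.write_batch("002", [{"order_id": "a", "status": "created"}])  # network retry
try:
    table.write_batch("003", [{"order_id": "b", "status": "end"}, {"status": "dirty"}])
except ValueError as err:
    print(err)
print(len(table.rows), table.commits)
```

The retry in commit 002 leaves one row rather than two, and the failed commit 003 leaves no trace of order `b`, even though that row was valid on its own.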
3. Hudi & Alluxio practice
Why the T3GO data lake needs Alluxio
• Serious network delay when reading and writing
• Inconsistent naming across clusters
• Low cluster stability
• Low memory resource utilization
• High timeout tolerance
[Figure: T3 Trips and Store, labeled: inefficient calculation, serious network delay, cache miss?]
How the data lake benefits from Alluxio
• Better read and write performance
• Unified namespace
• Higher cluster stability
• Higher cluster resource utilization
• Fewer timeouts
[Figure: T3 Trips and Store with Alluxio, labeled: efficient calculation, low latency]
Hudi and Alluxio integration
Before: Spark writes Hudi with target-base-path oss://……, directly on OSS
After: Spark writes Hudi with target-base-path alluxio://……, with Alluxio in front of OSS
The change is the base path.
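A minimal sketch of the path change, assuming the OSS bucket is mounted in the Alluxio namespace. The master address `alluxio-master:19998`, the mount point `/oss`, the bucket, and the table and field names are all hypothetical; the `hoodie.*` keys are standard Hudi Spark datasource write options, but the values here are placeholders rather than T3GO's settings.

```python
from urllib.parse import urlsplit


def to_alluxio_path(oss_path: str,
                    master: str = "alluxio-master:19998",
                    mount_point: str = "/oss") -> str:
    """Rewrite an oss:// base path to the matching alluxio:// path.

    Assumes the bucket root is mounted at `mount_point` in Alluxio.
    """
    parts = urlsplit(oss_path)
    if parts.scheme != "oss":
        raise ValueError(f"expected an oss:// path, got {oss_path!r}")
    return f"alluxio://{master}{mount_point.rstrip('/')}{parts.path}"


def hudi_write_options(table_name: str, record_key: str,
                       precombine_field: str, partition_field: str) -> dict:
    """Build a dict of Hudi Spark datasource write options."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }


base_path = to_alluxio_path("oss://my-bucket/warehouse/orders")
options = hudi_write_options("orders", "order_id", "lastupdate_time", "create_date")
print(base_path)

# With a Spark session and the Hudi bundle on the classpath, the write would be:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Only the base path differs between the two setups; the write options and the job code stay the same, which is what makes the switch cheap to try.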
How T3GO data lake uses Alluxio & Hudi
[Figure: the Spark cluster writes Hudi files to Alluxio Cluster A, which syncs to OSS; the ad-hoc cluster (Presto workers, Alluxio Cluster B) and the Kylin cluster (Kylin, Alluxio Cluster C) read Hive tables via short-circuit local reads]
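The write-then-sync and cached-read pattern in this architecture can be sketched with a toy cache layer. This is a conceptual model only, not Alluxio's API: `ToyCacheLayer` is a hypothetical class, a dict stands in for OSS, and the sync to the under store is done inline rather than asynchronously.

```python
class ToyCacheLayer:
    """Toy cache in front of a slow under store (a dict standing in for OSS)."""

    def __init__(self, under_store: dict):
        self.under_store = under_store
        self.cache = {}  # stands in for memory on the compute nodes
        self.hits = 0
        self.misses = 0

    def write(self, path: str, data: bytes) -> None:
        # Keep a local copy for readers and persist to the under store.
        self.cache[path] = data
        self.under_store[path] = data

    def read(self, path: str) -> bytes:
        if path in self.cache:
            self.hits += 1  # served locally, no trip to the under store
            return self.cache[path]
        self.misses += 1    # first access pays the remote read once
        data = self.under_store[path]
        self.cache[path] = data
        return data


oss = {"/dwd/orders/part-0": b"cold data"}
layer = ToyCacheLayer(oss)
layer.write("/ods/orders/part-0", b"fresh data")
layer.read("/ods/orders/part-0")  # hit: written through the cache
layer.read("/dwd/orders/part-0")  # miss: loaded from the under store
layer.read("/dwd/orders/part-0")  # hit: now cached
print(layer.hits, layer.misses)
```

Data written by the ingestion job is already local when the query engines read it, and cold data is fetched from the under store once and then served from the cache.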
Hudi & Alluxio case 1: near-real-time analysis on the data lake
Low-latency data ingestion
• Hudi and Spark decoupling
• Support for Flink streaming writes
Efficient and fast data processing
• Commit notification on write
• Scheduling integration
Low-latency interactive query analysis
• Zeppelin and Presto integration
• Alluxio data orchestration acceleration
[Figure: streaming consume, streaming product, scheduling processing, data orchestration]
Hudi & Alluxio case 1: near-real-time analysis on the data lake (continued)
[Figure: Hudi data on OSS is loaded to a Kylin temp table and to Presto local workers through Alluxio Clusters B and C; Hive tables are read via short-circuit local reads to serve ad-hoc queries and self-service report analysis]
Hudi&Alluxio case 2 : Spark multi-layer ETL and data processing
[Figure: ODS, DWD, and DWS layers, with Load and Sync flows to and from OSS]
Alluxio pressure test
Hudi on OSS performs poorly!
In the pressure test, once the data volume exceeds a certain magnitude (24 million records), queries using Alluxio + OSS become faster than queries against HDFS in a hybrid deployment.
Beyond 100 million records, the query speed starts to double. At 600 million records, it is up to 12 times faster than querying native OSS and 8 times faster than querying native HDFS.
The speedup factor depends on the machine configuration.
Thanks

Mais conteúdo relacionado

Mais procurados

Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Bitquery GraphQL for Analytics on ClickHouse
Bitquery GraphQL for Analytics on ClickHouseBitquery GraphQL for Analytics on ClickHouse
Bitquery GraphQL for Analytics on ClickHouseAltinity Ltd
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 

Mais procurados (20)

Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Bitquery GraphQL for Analytics on ClickHouse
Bitquery GraphQL for Analytics on ClickHouseBitquery GraphQL for Analytics on ClickHouse
Bitquery GraphQL for Analytics on ClickHouse
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 

Semelhante a High Performance Data Lake with Apache Hudi and Alluxio at T3Go

Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john malloryAmazon Web Services
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Alluxio, Inc.
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio, Inc.
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsAshish Mrig
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Cloudian
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAlluxio, Inc.
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudAlluxio, Inc.
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Accelerating Analytics with EMR on your S3 Data Lake
Accelerating Analytics with EMR on your S3 Data LakeAccelerating Analytics with EMR on your S3 Data Lake
Accelerating Analytics with EMR on your S3 Data LakeAlluxio, Inc.
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 

Semelhante a High Performance Data Lake with Apache Hudi and Alluxio at T3Go (20)

Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john mallory
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Accelerating Analytics with EMR on your S3 Data Lake
Accelerating Analytics with EMR on your S3 Data LakeAccelerating Analytics with EMR on your S3 Data Lake
Accelerating Analytics with EMR on your S3 Data Lake
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 

Mais de Alluxio, Inc.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud

High Performance Data Lake with Apache Hudi and Alluxio at T3Go

  • 1. DATA ORCHESTRATION SUMMIT 2020 High-performance data lake with Apache Hudi and Alluxio at T3GO Trevor Zhang | Big Data Sr. Engineer VinoYang | Head of T3Go Big Data Platform
  • 2. Agenda: 1. T3GO data lake introduction 2. Why Apache Hudi 3. Hudi & Alluxio practice
  • 3. Data Lake supports T3GO Intelligent Transportation
       Driver. Data collection: background check, face recognition, transaction, behavior, driving, …… Application scenarios: safety management, driver management, UBI insurance, driving mode research, ……
       Vehicle. Data collection: vehicle condition, driving, energy consumption, accident, failure, …… Application scenarios: capacity scheduling, active maintenance, product improvement, car design, ……
       Road. Data collection: traffic, environmental, trajectory, POI, abnormal, …… Application scenarios: map drawing, real-time traffic, safety management, municipal management, ……
       Cloud. Data collection: risk control, capacity, transaction, city, user, …… Application scenarios: intelligent scheduling, intelligent decision, smart marketing, customer experience, ……
  • 4. What is a data lake? A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
  • 5. Shared-nothing (pros) • Tables are horizontally partitioned across nodes • Every node has its own local storage • Every node is only responsible for its local table partitions • Elegant and easy to reason about • Scales well for star-schema queries • Dominant architecture in data warehousing (Diagram labels: Network, CPU, Memory, Disk)
  • 6. Shared-nothing (cons) • Shared-nothing couples compute and storage resources • Elasticity: resizing the compute cluster requires redistributing (lots of) data; unused compute resources cannot simply be shut off, so there is no pay-per-use • Limited availability: membership changes (failures, upgrades) significantly impact performance and may cause downtime • Homogeneous resources vs. heterogeneous workload: bulk loading, reporting, exploratory analysis (Diagram labels: Network, CPU, Memory, Disk)
  • 7. Multi-cluster, Shared-data • No data silos: storage decoupled from compute • Any data: native for structured & semi-structured • Unlimited scalability: along many dimensions • Homogeneous resources vs. heterogeneous loads: bulk loading, reporting, exploration and analysis (Diagram labels: Data lake Storage, Ad-Hoc Cluster, OLAP Cluster, Data Warehouse Cluster, ETL Cluster, BI Cluster, ML Cluster)
  • 8. Multi-cluster, Shared-data • All data in one place • Independently scale storage and compute • No unload / reload to shut off compute • Every virtual warehouse can access all data
  • 9. T3GO data lake technical architecture diagram (Diagram labels, storage side: Data Lake Storage on Aliyun OSS, Storage format; computing side: Orchestration acceleration, Resource management with YARN, Multiple calculation)
  • 10. Why not a traditional Hadoop data warehouse? (Chart: order payment rate over time) Long-tail payment: an order may be paid only just before the next trip! • Long business closed-loop window • Hot and cold data are updated randomly and cannot be told apart • Multi-level updates, long pipelines, high cost
  • 11. High backtracking costs for order analysis
       Entities: Order, Driver, Vehicle, Passenger, Trip
       Order (Snapshot Table) columns: order_id, driver_id, user_id, veh_id, …, status, create_time, lastupdate_time
       Sample row: xxx, xxx, xxx, xxx, xxx, end, 2020-06-01 xx:xx:xx, …
       Lookups: driver_id → Driver (Snapshot Table); user_id → User (Snapshot Table); veh_id → Trip (Snapshot Table)
       The historical snapshot from half a year ago is no longer accessible!
  • 12. Data ingestion pipeline cannot guarantee reliability. Pipeline: Business system → (Data Ingest) → Data Warehouse → (Data Processing) → BI / Report. 1. 100,000 records are sent but only 99,700 are written successfully? 2. Incorrect calculation logic leads to dirty data? 3. Data is written repeatedly because of an unstable network?
  • 13. Summary: pain points of the Hadoop data warehouse system • Low reliability • Small file problem • Missing data versions • No support for incremental processing • High latency
  • 14. Agenda: 1. T3GO data lake introduction 2. Why Apache Hudi 3. Hudi & Alluxio practice
  • 15. Introduction to Apache Hudi (Hadoop Upserts Deletes and Incrementals) • Manages ultra-large-scale (hundreds of PB) analytical datasets on DFS/cloud storage • Incremental data lake processing framework supporting insert, update, and delete • Joined the Apache Incubator in January 2019, graduated as a TLP in May 2020 • Available out of the box on cloud services (AWS/Tencent Cloud/Aliyun) • Has been operating stably at Uber for nearly 4 years • Features: ACID, storage management, time travel, incremental processing
  • 16. Hudi plug-in architecture • Pluggable index (Bloom/HBase) • Pluggable data format (Avro, Parquet) • Pluggable storage (HDFS, OSS, S3) • Timeline metadata • Hudi DataSet storage types: COW, MOR • Queries/views: Read Optimized Query, Incremental Query, Snapshot Query • Write: Java, Flink, Spark, Python • Read: Hive, Presto, Spark, Impala
  • 17. Hudi storage mode and view
       Copy On Write. Supported query types: Snapshot Query, Incremental Query. Features: read heavy; focus on low-latency queries; columnar Parquet data files.
       Merge On Read. Supported query types: Snapshot Query, Incremental Query, Read Optimized Query. Features: write heavy; focus on rapid data ingestion; columnar Parquet data files; row-based Avro incremental files.
       Query engine support, Copy On Write tables:
         Query Engine       Snapshot  Incremental  Read Optimized
         Hive               Y         Y            -
         Spark SQL          Y         Y            -
         Spark Datasource   Y         Y            -
         Presto             Y         N            -
         Impala             Y         N            -
       Query engine support, Merge On Read tables:
         Query Engine       Snapshot  Incremental  Read Optimized
         Hive               Y         Y            Y
         Spark SQL          Y         Y            Y
         Spark Datasource   Y         N            Y
         Presto             Y         N            Y
         Impala             N         N            Y
  • 18. Time travel queries make it possible to go "back in time"
       Entities: Order, Driver, Vehicle, Passenger, Trip
       Order (Snapshot Table) columns: order_id, driver_id, user_id, veh_id, …, status, create_time, lastupdate_time
       Sample row: xxx, xxx, xxx, xxx, xxx, end, 2020-06-01 xx:xx:xx, …
       Lookups: driver_id → Driver (v_2020-06-01); user_id → User (v_2020-06-01); veh_id → Trip (v_2020-06-01)
       Take time back to the moment the order occurred! Hudi features: time travel, data versions
  • 19. Hudi guarantees the reliability of the data ingestion pipeline. Pipeline: Business system → (Data Ingest) → Data Warehouse → (Data Processing) → BI / Report. Problems: 1. 100,000 records are sent but only 99,700 are written successfully? 2. Incorrect calculation logic leads to dirty data? 3. Data is written repeatedly because of an unstable network? Hudi answers: Invisible! All data commit rollback! Deduplication based on index! With MVCC, Hudi writes updated data to versioned Parquet base files and log files!
  • 20. Agenda: 1. T3GO data lake introduction 2. Why Apache Hudi 3. Hudi & Alluxio practice
  • 21. Why the T3GO data lake needs Alluxio • Serious network delay when reading and writing • Multi-cluster naming is not uniform • Low cluster stability • Low memory resource utilization • High timeout tolerance (Diagram labels: T3 Trips Store, inefficient calculation, serious network delay, cache miss?)
  • 22. The data lake benefits from Alluxio • Better read and write performance • Unified namespace • Higher cluster stability • Higher cluster resource utilization • Fewer timeouts (Diagram labels: T3 Trips Store, Alluxio, efficient calculation, low latency)
  • 23. Hudi and Alluxio integration • Before: Spark + Hudi write directly to OSS, with target-base-path oss://…… • After the change: Spark + Hudi write to Alluxio in front of OSS, with target-base-path alluxio://……
  • 24. How the T3GO data lake uses Alluxio & Hudi (Diagram labels: Spark Cluster, Write Hudi File; Alluxio Cluster A, Sync to OSS; OSS; Ad-hoc cluster: Presto workers, Alluxio Cluster B, Short-Circuit Local Reads, Read Hive Table; Kylin Cluster: Kylin, Alluxio Cluster C, Short-Circuit Local Reads, Read Hive Table)
  • 25. Hudi & Alluxio case 1: near-real-time analysis on the data lake • Low-latency data ingest: Hudi and Spark decoupling; support for Flink streaming writes • Efficient and fast data processing: write commit notification; scheduling integration • Low-latency interactive query analysis: Zeppelin and Presto integration; Alluxio data orchestration acceleration (Diagram labels: Streaming consume, Streaming Product, Scheduling processing, Data orchestration)
  • 26. Hudi & Alluxio case 1: near-real-time analysis on the data lake (Diagram labels: OSS; Presto workers, Alluxio Cluster B, Short-Circuit Local Reads, Read Hive Table, Load Hudi to Presto local worker, Ad-hoc Query; Kylin, Alluxio Cluster C, Short-Circuit Local Reads, Read Hive Table, Load Hudi to Kylin temp table, Self-service Report Analysis)
  • 27. Hudi & Alluxio case 2: Spark multi-layer ETL and data processing (Diagram labels: ODS, DWD, DWS, OSS, Load, Sync)
  • 28. Alluxio pressure test. Hudi performs poorly directly on OSS! In the pressure test, once the data volume exceeds a certain magnitude (24 million records), queries through Alluxio + OSS become faster than queries against HDFS in a hybrid deployment. Beyond 100 million records, the query speed starts to double. At 600 million records, it is up to 12 times faster than querying OSS natively and 8 times faster than querying HDFS natively. The speedup factor depends on the machine configuration.
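
Slide 23 describes the Hudi and Alluxio integration as a change of the Hudi target base path from an oss:// location to an alluxio:// location. The following is a minimal sketch of that path switch, not T3GO's actual job code. The master address, the mount point, the bucket name, and both helper functions are hypothetical; it assumes the whole OSS bucket is mounted under one Alluxio directory. The `hoodie.*` keys are standard Hudi datasource write options, and `order_id` and `lastupdate_time` are the column names shown on slide 11.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; none of them come from the slides.
ALLUXIO_MASTER = "alluxio-master:19998"  # 19998 is Alluxio's default master RPC port
OSS_MOUNT_POINT = "/oss"                 # assumed Alluxio mount point of the OSS bucket


def to_alluxio_path(oss_path, master=ALLUXIO_MASTER, mount_point=OSS_MOUNT_POINT):
    """Rewrite an oss://bucket/key path to the corresponding alluxio:// path.

    Assumes the whole bucket is mounted at `mount_point` inside Alluxio.
    """
    parts = urlsplit(oss_path)
    if parts.scheme != "oss":
        raise ValueError(f"expected an oss:// path, got {oss_path!r}")
    key = parts.path.lstrip("/")  # object key inside the bucket
    return f"alluxio://{master}{mount_point.rstrip('/')}/{key}"


def hudi_write_options(table_name, record_key, precombine_field):
    """Build writer options from standard Hudi datasource config keys."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }


# Only the base path changes; the Hudi options stay the same.
base_path = to_alluxio_path("oss://example-bucket/hudi/orders")
options = hudi_write_options("orders", "order_id", "lastupdate_time")
print(base_path)
```

With PySpark and the Hudi bundle available, these values would typically be passed as `df.write.format("hudi").options(**options).mode("append").save(base_path)`; that step is left out so the sketch runs without a cluster.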