Alluxio Product School Webinar
Feb. 23, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Greg Palmer (Lead Solutions Engineer, Alluxio)
In February’s product school, Greg Palmer, Lead Solutions Engineer at Alluxio, will present a live demo of Transparent URI, a key feature of Alluxio Enterprise Edition that makes it easy to integrate Alluxio with your existing data stack without changing the location metadata in the Hive Metastore. Join us to learn the configuration and advanced settings for using Transparent URI to simplify the DevOps of an Alluxio deployment, so that users can access their existing storage systems without changing URIs at the application level.
3. Unprecedented Complexity of Data Platforms
Data Trend | Complex Platform
New compute and storage tech created every 3-8 years | On-premise, cloud, hybrid, and multi-cloud environments all have different environment properties
More data generated every day, and stored in data silos | Data copies, synchronization costs
More people and teams need to access and leverage these data | Multiple APIs necessitate integration and application rewrites
4. Data Silos Across Data Centers, Regions and Clouds
Manual copy-based data synchronization across storage systems spread across environments
[Diagram: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, and Hive deployments in Region A, Region B, and private data centers (Datacenter 1, Datacenter 2), connected by error-prone and network-intensive data copies.]
5. Strong Market Demand For Simplification
EFFICIENT ACCESS & DATA MANAGEMENT: Acceleration & auto-tiering of data based on policies
ENVIRONMENT AGNOSTICITY: Agility across regions for private, hybrid or multi-cloud
UNIFICATION OF DATA LAKES: Multiple APIs to serve analytics & AI with storage abstraction
7. No-copy data access across silos, agnostic to compute engine
Foundation of a heterogeneous data platform across geos
SOLUTION: Multi-Cloud Ready Analytics & AI Platform
[Diagram: Region A and Region B, with GKE, HMS, Datacenter 1, and Datacenter 2.]
8. COMMON LANDING USE CASES
CASE 01: CLOUD
Consistent SLAs, performance, and cost savings on cloud storage
[Diagram: TensorFlow, Alluxio; PUBLIC CLOUD]
CASE 02: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Diagram: Spark, Alluxio; ON PREMISE, PUBLIC CLOUD]
CASE 03: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Diagram: Trino/Presto, Alluxio; DATACENTER 1, DATACENTER 2; INGESTION]
[The slide's diagrams also show Google Cloud Storage.]
10. SEAMLESS HIVE CATALOG DEFINITIONS - PRESTO/TRINO
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Catalog points to Hive Metastore
B. Alluxio intercepts Presto/Trino calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
[Diagram: Presto Catalog (Hive Connector, Hive Metastore, table location hdfs://hive/warehouse/table); Presto/Trino with Alluxio in the Public Cloud; HDFS On-premise; steps I, II, and III marked.]
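The scenario above hinges on one idea: the table location stored in the Hive Metastore stays an hdfs:// URI, and something beneath the query engine maps that URI to the caching layer. The slides do not show how Alluxio does this internally, so the following is only a conceptual sketch of prefix-based URI translation. The mount table, the /mnt/hdfs_hive/ path, and the translate function are all hypothetical names chosen for illustration; they are not Alluxio configuration or API.

```python
# Conceptual sketch only: NOT Alluxio's implementation or API.
# Hypothetical mount table: storage URI prefix -> path in the cache namespace.
MOUNTS = {
    "hdfs://hive/": "/mnt/hdfs_hive/",
}


def translate(uri, mounts=MOUNTS):
    """Map a storage URI to a cache-namespace path by longest-prefix match."""
    for prefix in sorted(mounts, key=len, reverse=True):
        if uri.startswith(prefix):
            return mounts[prefix] + uri[len(prefix):]
    raise LookupError(f"no mount covers {uri}")


# The metastore location is passed in unchanged; only the lookup is new.
cache_path = translate("hdfs://hive/warehouse/table")
print(cache_path)  # /mnt/hdfs_hive/warehouse/table
```

The point of the sketch is that the translation lives in one place, outside the catalog, which is why no table redefinition is needed.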
11. SEAMLESS HIVE CATALOG DEFINITIONS - SPARK
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Spark points to Hive Metastore
B. Alluxio intercepts Spark calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
[Diagram: Spark Hive Integration (Hive Connector, Hive Metastore, table location hdfs://hive/warehouse/table); Spark with Alluxio in the Public Cloud; HDFS On-premise; steps I, II, and III marked.]
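Step II.B says Alluxio intercepts Spark calls to HDFS. Hadoop-compatible engines generally choose a filesystem implementation by URI scheme (Hadoop's `fs.<scheme>.impl` configuration convention), so one plausible way to picture interception is swapping the handler registered for the hdfs scheme. The slides do not state the actual settings, so the sketch below uses hypothetical stand-in classes (DirectHdfsClient, CachingShimClient) and makes no claim about real Spark or Alluxio configuration.

```python
# Conceptual sketch only: stand-in classes, not Spark, Hadoop, or Alluxio code.
class DirectHdfsClient:
    """Hypothetical client that would talk straight to HDFS."""

    def open(self, uri):
        return f"direct read of {uri}"


class CachingShimClient:
    """Hypothetical shim that would route the same URI through a cache layer."""

    def open(self, uri):
        return f"cached read of {uri}"


def make_resolver(scheme_impls):
    """Return an opener that dispatches on the URI scheme."""
    def open_uri(uri):
        scheme = uri.split(":", 1)[0]
        return scheme_impls[scheme].open(uri)
    return open_uri


TABLE_LOCATION = "hdfs://hive/warehouse/table"  # as stored in the Hive Metastore


def run_query(open_uri):
    """Application code: identical before and after the shim is installed."""
    return open_uri(TABLE_LOCATION)


before = run_query(make_resolver({"hdfs": DirectHdfsClient()}))
after = run_query(make_resolver({"hdfs": CachingShimClient()}))
print(before)
print(after)
```

Only the scheme registration differs between the two runs; run_query and the table location are untouched, which mirrors the "no application re-writes" claim.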
12. SEAMLESS HIVE CATALOG DEFINITIONS - HDFS & S3
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in on-prem HDFS and S3 in cloud
B. Hive Metastore table definitions pointing to HDFS and S3
II. Compute cluster with Alluxio
A. Spark points to Hive Metastore
B. Alluxio intercepts Spark calls to HDFS and S3
III. Query execution
A. Accesses to HDFS and S3 are served by Alluxio
B. No manual data copies or application re-writes
[Diagram: Hive Integration (Hive Connector, Hive Metastore, table locations hdfs://hive/warehouse/table and s3:/bucket/hive/warehouse/table); Trino/Spark with Alluxio in the Public Cloud alongside S3; HDFS On-premise; steps I, II, and III marked.]
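With two storage systems behind one layer, "accesses are served by Alluxio" can be pictured as a read-through cache keyed by the original URI: the first read goes to the owning store and later reads are answered locally. This is a minimal sketch of that general technique, not Alluxio's caching logic. The ReadThroughCache class and the lambda backends are hypothetical, and the S3 location is written in the usual s3:// form, which is an assumption about the slide's s3:/bucket/... text.

```python
# Conceptual sketch only: a generic read-through cache, not Alluxio's design.
class ReadThroughCache:
    def __init__(self, backends):
        self.backends = backends  # scheme -> callable(uri) returning bytes
        self.cache = {}           # original URI -> cached bytes
        self.backend_reads = 0    # how many reads reached a storage system

    def read(self, uri):
        if uri not in self.cache:
            scheme = uri.split(":", 1)[0]
            self.cache[uri] = self.backends[scheme](uri)
            self.backend_reads += 1
        return self.cache[uri]


# Hypothetical stand-ins for the on-prem and cloud stores.
backends = {
    "hdfs": lambda uri: b"rows from on-prem HDFS",
    "s3": lambda uri: b"rows from cloud S3",
}

layer = ReadThroughCache(backends)
hdfs_uri = "hdfs://hive/warehouse/table"
s3_uri = "s3://bucket/hive/warehouse/table"  # assumed double-slash form

first = layer.read(hdfs_uri)   # fetched from the HDFS stand-in
second = layer.read(hdfs_uri)  # served from the cache
cloud = layer.read(s3_uri)     # fetched from the S3 stand-in
print(layer.backend_reads)     # 2
```

Because entries are keyed by the unmodified URI, tables in either store are addressed exactly as the metastore records them.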