Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
4. 4
SuperDB: Centralized Data Platform for Rakuten
Ecosystem
43 services*1
700+ terabytes of normalized data sets
6,500+ users*2 from 40+ businesses
70+ services in ecosystem
Petabytes of data
*1 Excluding small services and common services
*2 Number of weekly active users
6. 6
Our Data Landscape
Business Generated (Data Producers)
Web / RAT, User Transaction, IoT / Device, Apps
Real-Time & Batch (Containers)
Common Schema
Business DWH / DL
SuperDB (Enterprise Repository)
Data challenges: diverse data from diverse sources, growing rapidly
Easier Data Management: based on personas; gives transparency and better cost control; standardized and automated
Faster & Better Insight & AI: start analysis with the ability to connect and run anywhere
Insights & Data Science (Data Consumers)
SINGLE VERSION OF TRUTH!
Data Projections and Feature Sets
Virtual Data Mart (On-Prem / GCP Cloud)
SuperDB on-premise (JP)
SuperDB Cloud (US, Japan)
Auto-Sync
On-Demand Scale (Cloud + Containers)
Common Schema
Cloud Bursting (Containers)
AWS, Azure, GCP
Click Stream
On-premise
Faster Business Insight
Faster Time to Analysis
Quick Experimentation
Cloud Native & Hybrid Architecture
Granular Access Control
Data Encryption (End to End)
Multi-Factor Authentication
Query Layer
Normalized
Transaction / aggregated
Auto-Sync
7. 7
Support for Different Personas
Business Analysts
• Ad hoc query capacity
• Discovery, fast and easy access, OLAP and low latency
• BI support and reporting
Data Scientists
• Ad hoc query capacity (OLAP, low latency)
• Run workloads on large-scale computing
• Data science platform and tools for ML / AI workloads
Applications
• Ability to integrate with APIs
• Support for data sync to different clusters
Data Engineers
• Query, data ingestion and transformation
• Scalable processing, long-running jobs
• Real-time and batch support
• Data QC support
Governance, Audit & Security
• Secured access layer
• Ability to create audit reports
• Data lineage and traceability
System Admin & Operators
• Maintaining the data system infrastructure
• Workload tuning
• Data pipeline maintenance
• Data QC
Sand-Box Users
• Creates, joins, ad hoc reports, KPIs
• Experiments and quick analysis
• Support for various marketing activities
8. 8
Our Challenges
System Scalability
• Compute elasticity for experiments.
• Adding capacity was a time-consuming process.
Data Availability
• Unable to address / optimize for different personas.
• Legacy code and limited processing power resulted in job delays.
Data Freshness
• Too many data copy pipelines needed to be built, delaying access to data.
• Managing data copy pipelines to different clusters became an operational overhead.
Analytics Agility
• Data had to be moved before any analysis could be done; not all data is present in the DWH for analysis.
• Quick analysis cannot be done across different businesses' data silos.
9. 9
Our Approach
System Scalability
• Compute elasticity for experiments
• Adding capacity was a time-consuming process
Data Availability
• Unable to address / optimize for different personas
• Legacy code and limited processing power resulted in job delays
Data Freshness
• Data sync cannot be done between different clusters in DCs
• Too many data copies
Analytics Agility
• Cannot join transaction and behavior data
• Needs a lot of data movement
• Quick analysis cannot be done
Hybrid & Cloud-native architecture
• On-demand compute with public cloud
• Separate storage and processing
• Containerization and cloud native
Data Sync & Orchestration (Alluxio)
• Data sync across DCs and cloud
• Data processing cache layer
Query Layer (Starburst Presto)
• Start analytics by connecting to different stores on multi-cloud and on-prem before any data movement
• Common security layer with Ranger
10. 10
One Major Challenge
[Diagram labels: Data Sources; Teradata; Legacy HDFS; New HDFS; GCS; PwC Legacy; ODIN; Python Legacy; Spark Pipeline (New); copy steps; Pipeline X; Pipeline Y; Pipeline Z]
❖ ODIN is a homebrewed data ingestion system.
❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.
11. 11
Data Sync
[Diagram labels: Source Data; Alluxio Ingest; Alluxio X; Alluxio Y; Alluxio Z; HDFS Cluster (two); GCS; Rakuten DC1; Rakuten DC2; GCP]
❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
❖ Consumption tools cache data from different DCs to improve performance and enable DR.
Released in Production
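The idea of one ingest write persisting to several destinations can be illustrated with a small sketch. This is a toy model in plain Python, not Alluxio's API or configuration: the `UnderStore` and `ReplicatingIngest` classes, the store names, and the path are all hypothetical and exist only to show the fan-out pattern.

```python
from dataclasses import dataclass, field


@dataclass
class UnderStore:
    """Hypothetical stand-in for one storage destination (e.g. an HDFS cluster or GCS)."""
    name: str
    objects: dict = field(default_factory=dict)

    def put(self, path: str, data: bytes) -> None:
        # Persist the object under the given path in this destination.
        self.objects[path] = data


class ReplicatingIngest:
    """Toy ingest layer: every write is persisted to all configured destinations."""

    def __init__(self, destinations):
        self.destinations = list(destinations)

    def write(self, path: str, data: bytes) -> list:
        persisted = []
        for store in self.destinations:
            store.put(path, data)
            persisted.append(store.name)
        # Return the names of the destinations that now hold the data.
        return persisted


# Assumed destinations, loosely mirroring the slide: two data centers and a cloud bucket.
dc1 = UnderStore("hdfs-dc1")
dc2 = UnderStore("hdfs-dc2")
gcs = UnderStore("gcs")

ingest = ReplicatingIngest([dc1, dc2, gcs])
written_to = ingest.write("/landing/events/part-0000", b"example record")
```

In this sketch the producer writes once and every destination ends up with the same object, which is the property that removes the separate copy pipelines described earlier in the deck.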
12. 12
Data Caching for Consumption
[Diagram, GKE (GCP) & AKS (Azure), 2020 Production: Alluxio master and three Alluxio workers; Presto coordinator and three Presto workers; Mem Cache (four instances)]
[Diagram, On-Prem Bare Metal (POC): three physical boxes; HDFS: DC local, HDFS: DC remote 1, HDFS: DC remote 2; Alluxio master and three Alluxio workers; Presto coordinator and three Presto workers]
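The caching layer between the query engine and remote storage behaves, conceptually, like a read-through cache. The sketch below is a simplified illustration in plain Python under that assumption; it is not how Alluxio or Presto are implemented, and `ReadThroughCache`, `read_from_remote_hdfs`, and the path are hypothetical names.

```python
class ReadThroughCache:
    """Toy read-through cache: serve from memory if present, else fetch and remember."""

    def __init__(self, remote_read):
        self._remote_read = remote_read  # function that reads from slow, remote storage
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._cache:
            self.hits += 1
            return self._cache[path]
        # Cache miss: go to remote storage once, then keep the result locally.
        self.misses += 1
        data = self._remote_read(path)
        self._cache[path] = data
        return data


remote_calls = []


def read_from_remote_hdfs(path):
    """Hypothetical remote read; records each call so the effect of caching is visible."""
    remote_calls.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(read_from_remote_hdfs)
first = cache.read("/marts/orders/part-0000")   # miss: fetched from remote storage
second = cache.read("/marts/orders/part-0000")  # hit: served from the local cache
```

Only the first read reaches the remote store; repeated queries over the same data are served locally, which is the performance benefit the slide attributes to caching data from other data centers.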
13. 13
Consumption in Production Today
[Diagram labels: three physical boxes; HDFS DC1; HDFS DC2; Alluxio master and three Alluxio workers; Presto coordinator and three Presto workers]
On-Prem Bare Metal (2019 - Early 2020)
Bare Metal (K8 Cluster): Present Production
15. 15
Our Journey with Alluxio
Started using Presto open source (On-Prem)
2017
2018
Started using Presto open source (GCP)
POC with Presto + Alluxio (GCP)
2019
2020
Presto + Alluxio (GCP, Azure)
POC: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
Data Sync with Alluxio (On-Prem)
2021
Planned: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
16. 16
Overview: Wrap-up
[Diagram labels]
RDB, NoSQL, Files, events
Pipeline Service
Hadoop
Discovery Service
Consumption Service
Transformations: Landing zone, Common Schema mapping, Common Marts
Data Orchestration Layer
Presto
BI tools, AI / ML, Data Exploring, Downstream pipelines, Spark
Schema management, Data ACL, Classification, Auditing
Changelogs
Cloud