Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

DATA ORCHESTRATION SUMMIT 2020
SuperDB
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty | Director Engineering

Topics
• Brief About Rakuten
• SuperDB Journey
• Our Data Landscape
• Challenges
• Approach
• Journey with Alluxio

70+ Services
Japan’s Largest e-commerce company
• Internet Services
• Fintech Services
• Communications & Contents

SuperDB: Centralized Data Platform for Rakuten Ecosystem
• 43 Services*1
• 700+ TeraBytes Normalized Data sets
• 6,500+ Users*2 from 40+ Businesses
• 70+ Services in Ecosystem
• PetaBytes of Data
*1 Excluding small services and common services
*2 # of weekly active users

Our Journey
2007 - 2012: Traditional EDW
• Teradata
• On-Premise
• BI Reporting, Ad-Hoc Analysis
• Services: 1 → 5
2013 - 2018: Teradata + Hadoop Big Data Stack
• Presto
• Mesos DC/OS
• On-Premise & GCP
• Hadoop Cluster, GCS
• Click Stream Data
• Recommendation, Personalization, Support ML
• Services: 8 → 25+
2019: Teradata + Hadoop Big Data Stack
• Presto + Alluxio (POC)
• Mesos DC/OS
• 30+ services
2019 - 2020: Multi-Cloud (GCP + Azure)
• Kubernetes
• Starburst Presto
• Alluxio (Prod.)
• Cloud Storage
• Hybrid Compute
• Hadoop Clusters, Object Storage
• Optimize Analytics
• Optimize AI / ML
• 40+ services

Our Data Landscape
Data challenges: Diverse data from diverse sources, growing rapidly
Business Generated (Data Producers)
• Web / RAT, User Transaction, IoT / Device, Apps, Click Stream
• Real-Time & Batch (Containers)
• Common Schema
• Business DWH / DL
SuperDB (Enterprise Repository): SINGLE VERSION OF TRUTH!
• SuperDB on-premise (JP)
• SuperDB Cloud (US, Japan)
• Auto-Sync between on-premise and cloud
• Normalized and Transaction / aggregated data
• Common Schema
• Cloud Bursting (Containers): AWS, Azure, GCP
Insights & Data Science (Data Consumers)
• Query Layer
• Data Projections and Feature Sets
• Virtual Data Mart (On-Prem / GCP Cloud)
Easier Data Management: Based on Personas, Gives Transparency & Better Cost Control, Standardized and Automated
Faster & Better Insight & AI: Start Analysis by ability to connect and Run anywhere
• Faster Business Insight
• Faster time to Analysis
• Quick Experimentation
Cloud Native & Hybrid Architecture
• On-Demand Scale (Cloud + Containers)
• Granular Access control
• Data encryption (End to End)
• Multi-Factor Authentication

Support for Different Personas
Business Analysts
• Ad-hoc Query Capacity
• Discover, Fast and Easy Access, OLAP & Low Latency
• BI Support and Reporting
Data Scientists
• Ad-hoc Query Capacity (OLAP, low latency)
• Run workloads on large-scale computing
• Data Science Platform and tools for ML / AI workloads
Applications
• Ability to integrate with APIs
• Support for Data Sync to different clusters
Data Engineers
• Query, Data Ingestion and Transformation
• Scalable processing, long-running jobs
• Real-time and Batch Support
• Data QC Support
Governance, Audit & Security
• Secured Access Layer
• Ability to create Audit Reports
• Data Lineage and traceability
System Admin & Operators
• Maintaining the data system infrastructure
• Workload Tuning
• Data Pipeline maintenance
• Data QC
Sand-Box Users
• Creates, Joins, Ad-hoc Reports, KPIs
• Experiments & Quick Analysis
• Support for various Marketing activities

Our Challenges
System Scalability
• Compute elasticity for experiments.
• Adding capacity was a time-consuming process.
Data Availability
• Unable to address / optimize for different Personas.
• Legacy code and limited processing power resulted in job delays.
Data freshness
• Too many data copy pipelines needed to be built, delaying access to data.
• Managing data copy pipelines to different clusters became an operational overhead.
Analytics Agility
• Data must be moved before any analysis can be done; not all data is present in the DWH for analysis.
• Quick analysis cannot be done across different businesses' data silos.

Our Approach
System Scalability
• Compute Elasticity for Experiments
• Adding capacity was a time-consuming process
Data Availability
• Unable to address / optimize for different Personas
• Legacy code and limited processing power resulting in job delays
Data freshness
• Data sync cannot be done between different clusters in DCs
• Too many Data Copies
Analytics Agility
• Cannot join between Transaction & behavior data
• Needs a lot of Data Movement
• Quick Analysis cannot be done
Hybrid & Cloud-native architecture
• On-Demand Compute with Public Cloud
• Separate Storage and Processing
• Containerization and Cloud Native
Data Sync & Orchestration (Alluxio)
• Data Sync across DCs and Cloud
• Data Processing Cache Layer
Query Layer (Starburst Presto)
• Start Analytics by connecting to different stores on multi-cloud and on-prem before any data movement
• Common security layer with Ranger

One Major Challenge
Components shown: Data Sources; Teradata; Legacy HDFS; New HDFS; GCS; ingestion via PwC (Legacy), ODIN, and Python (Legacy); copy steps between stores; Pipeline X, Pipeline Y, Pipeline Z; Spark Pipeline (New).
❖ ODIN is a homebrewed data ingestion system.
❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.

Data Sync
Components shown: Source Data; Alluxio Ingest; Alluxio X, Alluxio Y, Alluxio Z; two HDFS Clusters; GCS; locations Rakuten DC1, Rakuten DC2, GCP.
❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
❖ Consumption tools cache data from different DCs to improve performance and enable DR.
Released in Production
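The Under Store Replication note above describes one write being persisted to several destinations. As a rough illustration only, the sketch below fans a single write out to several local directories that stand in for the HDFS clusters and GCS. It does not use any Alluxio API; `replicate_write`, `demo_replicate`, and the directory names are hypothetical names chosen for this example.

```python
import tempfile
from pathlib import Path


def replicate_write(relative_path: str, data: bytes, under_stores: list[Path]) -> list[Path]:
    """Write the same bytes under every store root and return the written paths."""
    written = []
    for root in under_stores:
        target = root / relative_path
        # Create intermediate directories so each store mirrors the same layout.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        written.append(target)
    return written


def demo_replicate() -> list[bytes]:
    """Replicate one object to three stand-in stores and read every copy back."""
    with tempfile.TemporaryDirectory() as tmp:
        # Local directories stand in for HDFS in DC1, HDFS in DC2, and GCS (assumption).
        stores = [Path(tmp) / name for name in ("hdfs_dc1", "hdfs_dc2", "gcs")]
        paths = replicate_write("events/part-0000", b"click,1\n", stores)
        return [p.read_bytes() for p in paths]
```

Calling `demo_replicate()` returns the content read back from each stand-in store, so a caller can check that all copies match. A real deployment would also need to handle partial failures, which this sketch ignores.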

Data Caching for Consumption
GKE (GCP) & AKS (Azure): 2020 Production
• Alluxio master with three Alluxio workers
• Presto Coordinator with three Presto workers
• Mem Cache (four shown)
On-Prem Bare Metal (POC)
• Three physical boxes
• HDFS: DC local, DC remote 1, DC remote 2
• Alluxio master with three Alluxio workers
• Presto Coordinator with three Presto workers
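The caching layouts on this slide place a cache between the query workers and remote storage. A minimal sketch of that read-through idea follows, assuming a simple least-recently-used eviction policy; it is not how Alluxio or Presto implement caching, and `ReadThroughCache`, `demo_cache`, and the paths are hypothetical names used only for illustration.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve repeated reads from memory and fetch from the remote store on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch          # function that reads from the remote store
        self._capacity = capacity    # maximum number of cached objects
        self._entries: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            self.hits += 1
            self._entries.move_to_end(path)  # mark as most recently used
            return self._entries[path]
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


def demo_cache() -> tuple[int, int]:
    """Read two objects, one of them twice, and report (hits, misses)."""
    # A dict stands in for HDFS clusters in remote data centers (assumption).
    remote = {"/dc_remote_1/orders": b"o1", "/dc_remote_2/clicks": b"c1"}
    cache = ReadThroughCache(fetch=remote.__getitem__, capacity=2)
    cache.read("/dc_remote_1/orders")  # first read goes to the remote store
    cache.read("/dc_remote_1/orders")  # second read is served from memory
    cache.read("/dc_remote_2/clicks")  # a different object is fetched remotely
    return cache.hits, cache.misses
```

Only the first read of each object reaches the stand-in remote store; the repeated read is served from memory, which is the effect the slide attributes to caching data from other DCs.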

Consumption in Production Today
On-Prem Bare Metal (2019 - Early 2020)
Bare Metal (K8 Cluster): Present Production
Components shown: three physical boxes; HDFS DC1 and HDFS DC2; Alluxio master with three Alluxio workers; Presto Coordinator with three Presto workers.

Distributed Cache (Presently under POC)
• Workloads: TensorFlow / Caffe; Spark Compute (Transformation); Spark Compute (Aggregations)
• Distributed Cache: Libfuse, AlluxioFUSE, Alluxio JVM
• Platform: Kubernetes, Kubeflow; Linux; Rakuten OneCloud
• Hardware: Bare Metal, GPU, CPU
• Storage: HDFS, Object Store, NAS

Our Journey with Alluxio
2017
• Started using Presto Open source (On-Prem)
2018
• Started using Presto Open source (GCP)
2019
• POC with Presto + Alluxio (GCP)
2020
• Presto + Alluxio (GCP, Azure)
• POC: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
• Data Sync with Alluxio (On-Prem)
2021
• Planned: Distributed Cache with Alluxio for ML & Data Pipeline Jobs

Overview: Wrap-up
• Sources: RDB, NoSQL, Files, events
• Pipeline Service: Hadoop, Landing zone, Transformations, Common Schema mapping, Common Marts, Changelogs
• Discovery Service: Schema management, Data ACL, Classification, Auditing
• Data Orchestration Layer
• Consumption Service: Presto, Spark, BI tools, AI / ML, Data Exploring, Downstream pipelines
• Cloud
