Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The hidden engineering behind machine learning products at Helixa
Gianmario Spacagna (Helixa)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
2. DATA ORCHESTRATION
SUMMIT
2020
In the next 20 minutes you will learn about:
1. A real ML system powering a platform used by thousands of marketers around the globe
2. The tools and engineering practices that enabled us to build fast, cheap, and robust pipelines
3. About Helixa
Helixa is an audience intelligence platform that uses Machine Learning to provide accurate and timely consumer insights for modern market research.
4. DATA ORCHESTRATION SUMMIT
Audience: Size: 1.5M / 223M represented population
Top Influencers: François Chollet (201x), Ben Hamner (114x), George Hotz (106x)
Top Media: Cifar News (65x), The Hacker News (31x), AngelList (28x)
Top Products and Companies: Tensorflow (107x), Waymo (66x), Airbnb Engineering (55x)
Demographics: 18-40 years old, Male, U.S. and India
7. DATA ORCHESTRATION SUMMIT
Helixa end-to-end pipeline
[Figure: pipeline diagram. Labels: Insights Engine; Other Analytics Tools; Audience Projection; Real-time analytics applications; Common Data Model; Data Processing; Data Integrations; Data; Contents Embedding; Entity Resolution; Taxonomy Categorization; Users Digital DNA; Traits Classifiers; Latent Interests; Augmentation; Machine Learning jobs]
9. DATA ORCHESTRATION SUMMIT
Tech stack and tools
Batch inference
Model repository and evaluation metrics
Training and hyper-parameter tuning
Analysis and Research
ML libraries
Data Labeling
Feature Store
Feature Engineering
Data Lake
In this talk we will focus on
12. DATA ORCHESTRATION SUMMIT
Data Lake(house) using Glue and Athena
Artifacts are saved in S3 and crawled by Glue.
Athena is used to build logical views on top of them, for example to:
▪ Retrieve the latest version of an artifact
▪ Aggregate multiple partitions of the same artifact
▪ Filter and merge with other tables
▪ Export snapshots of the views as versioned Parquet datasets
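A "latest version" view of the kind listed on this slide could be generated roughly as follows. This is a minimal sketch, not the speaker's implementation: it assumes each artifact is crawled into its own table with a `timestamp` partition column, and the table and view names (`text_embedding`, `text_embedding_latest`) are hypothetical.

```python
def latest_version_view_sql(view_name: str, table_name: str,
                            partition_column: str = "timestamp") -> str:
    """Return SQL for a view that exposes only the newest partition of a table."""
    # Zero-padded values such as 2020-10-14-12-58 sort lexicographically in
    # chronological order, so max() on the string picks the latest partition.
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT * FROM {table_name}\n"
        f'WHERE "{partition_column}" = '
        f'(SELECT max("{partition_column}") FROM {table_name})'
    )


# Hypothetical names, used only for illustration.
sql = latest_version_view_sql("text_embedding_latest", "text_embedding")
print(sql)
```

The generated statement would then be submitted to Athena through whichever client the team already uses; that step is left out here.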
13. DATA ORCHESTRATION SUMMIT
Feature Store Partitions (X)
S3 bucket
❏ users
    ❏ features
        ❏ feature_family=text_embedding (partition by set of features generated by the same job)
            ❏ timestamp=2020-10-14-12-58 (creation time)
                ❏ _metadata.json (metadata containing info on how the features were created)
                ❏ part000.parquet (Parquet data indexed by user_id)
                ❏ part001.parquet
                ❏ …
            ❏ timestamp=2020-09-18-18-35
                ❏ ...
        ❏ feature_family=picture_embedding
            ❏ ...
        ❏ feature_family=category_counts
            ❏ ...
❏ items
❏ other entities
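The partition layout above maps naturally to key prefixes. The sketch below shows one way to build such a prefix and to find the newest `timestamp=` partition under it from a plain list of object keys. The helper names are hypothetical, and listing the keys from S3 is assumed to happen elsewhere.

```python
from typing import Iterable, Optional


def partition_prefix(entity: str, store: str, **partitions: str) -> str:
    """Build a prefix such as users/features/feature_family=text_embedding/."""
    parts = [entity, store] + [f"{name}={value}" for name, value in partitions.items()]
    return "/".join(parts) + "/"


def latest_timestamp(keys: Iterable[str], prefix: str) -> Optional[str]:
    """Return the newest timestamp=... value found directly under prefix, if any."""
    marker = prefix + "timestamp="
    stamps = {
        key[len(marker):].split("/", 1)[0]
        for key in keys
        if key.startswith(marker)
    }
    # Zero-padded timestamps compare correctly as plain strings.
    return max(stamps) if stamps else None


keys = [
    "users/features/feature_family=text_embedding/timestamp=2020-10-14-12-58/part000.parquet",
    "users/features/feature_family=text_embedding/timestamp=2020-09-18-18-35/part000.parquet",
    "users/features/feature_family=category_counts/timestamp=2020-11-01-09-00/part000.parquet",
]
prefix = partition_prefix("users", "features", feature_family="text_embedding")
newest = latest_timestamp(keys, prefix)
print(prefix, newest)
```

The same helper covers the other stores by changing the arguments, for example `partition_prefix("users", "labels", variable="gender", source="first_name")`.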
14. DATA ORCHESTRATION SUMMIT
Label Store Partitions (y)
S3 bucket
❏ users
    ❏ labels
        ❏ variable=gender (partition by the variable we are trying to predict)
            ❏ source=first_name (partition by the source of ground truth)
                ❏ timestamp=2020-10-14-12-58
                    ❏ _metadata.json
                    ❏ part000.parquet
                    ❏ part001.parquet
                    ❏ …
            ❏ source=public_profile
                ❏ ...
        ❏ variable=age
❏ items
❏ other entities
Label management for weak learning done via
15. DATA ORCHESTRATION SUMMIT
Prediction Store Partitions (y_pred)
S3 bucket
❏ users
    ❏ predictions
        ❏ variable=gender
            ❏ model=xgbc (partition by the identifier of the model used to predict)
                ❏ timestamp=2020-11-05-17-22
                    ❏ _metadata.json
                    ❏ part000.parquet
                    ❏ part001.parquet
                    ❏ …
            ❏ model=cnn
                ❏ ...
        ❏ variable=age
            ❏ ...
❏ items
❏ other entities
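Because labels (y) and predictions (y_pred) are both keyed by user, comparing models for one variable reduces to a join on the user identifier. The following is an illustrative sketch with made-up in-memory data; in practice the two sides would be read from the label and prediction partitions, and accuracy is only one possible metric.

```python
from typing import Dict


def accuracy(labels: Dict[str, str], predictions: Dict[str, str]) -> float:
    """Share of users present in both inputs whose prediction equals the label."""
    shared = labels.keys() & predictions.keys()
    if not shared:
        raise ValueError("no users in common between labels and predictions")
    hits = sum(labels[user] == predictions[user] for user in shared)
    return hits / len(shared)


# Made-up data: ground truth for variable=gender and two model partitions.
y = {"u1": "female", "u2": "male", "u3": "female", "u4": "male"}
y_pred_by_model = {
    "xgbc": {"u1": "female", "u2": "male", "u3": "male", "u4": "male"},
    "cnn": {"u1": "female", "u2": "female", "u3": "male", "u5": "male"},
}
scores = {model: accuracy(y, preds) for model, preds in y_pred_by_model.items()}
print(scores)
```

Users without a label (such as `u5` here) are ignored rather than counted as errors, which is a design choice of this sketch.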
17. DATA ORCHESTRATION SUMMIT
Platforms for managing the ML lifecycle
Production:
● Training
● Predictions
● Model serving
● Model repository
● Experiments tracking
● Evaluation metrics
R&D:
● Dev data versioning and linkage
● Automated evaluation reports
● Collaborative experiments
● Deep Learning computing environment
18. DATA ORCHESTRATION SUMMIT
R&D workflow
[Figure: workflow diagram. Labels: Pull; Notebooks and data stored and shared in S3; Data cache; Dev unix machine in the cloud; Notebook name matching branch ID; Install the latest version of the code; Develop code locally using professional IDEs; Feature branches matching Jira key; Gitflow branching model; Commit and push]
19. DATA ORCHESTRATION SUMMIT
Alluxio cache configuration
EC2 memory-optimized machines (r4 or r5 family)
EBS volume with 250GB of storage
Alluxio and Jupyter services start at boot time
200GB reserved for the Alluxio cache
S3 buckets mounted locally in --readonly mode using the FUSE API
Parquet data read with multi-processing using Dask directly from the local file system instead of the S3 boto API
20. DATA ORCHESTRATION SUMMIT
Alluxio benefits for the R&D
Research & Development data: ~1TB
We only focus on 15% of the data every month (~150GB)
The data is re-accessed at every kernel restart (~5 times a day)
Data science team members: ~5 people
Datasets are split into files of ~120MB each
=> roughly 1.2k files and 500k read requests every month
We observed a 3x to 5x speed-up using Alluxio
+ all of the benefits of accessing the S3 data through the POSIX API
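The file and request counts on this slide can be checked with a back-of-the-envelope calculation. The sketch below assumes decimal units (1GB = 1000MB) and that every kernel restart re-reads every file in the working set. The number of active days per month is not given on the slide; 16 is an assumed value chosen because it reproduces the ~500k figure.

```python
def files_in_working_set(working_set_gb: float, file_size_mb: float) -> int:
    """Number of files needed to hold the working set (1GB taken as 1000MB)."""
    return round(working_set_gb * 1000 / file_size_mb)


def monthly_read_requests(files: int, restarts_per_day: int,
                          people: int, active_days: int) -> int:
    """Reads per month if every restart re-reads every file for every person."""
    return files * restarts_per_day * people * active_days


files = files_in_working_set(working_set_gb=150, file_size_mb=120)
# active_days=16 is an assumption, not a number from the slide.
reads = monthly_read_requests(files, restarts_per_day=5, people=5, active_days=16)
print(files, reads)
```

With 20 active days instead of 16 the same inputs give 625,000 reads, so the slide's figure is best read as an order of magnitude.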
21. DATA ORCHESTRATION SUMMIT
Processing large datasets with EMR
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically reduce the cost of operations
+ SparkMagic
SUBMIT JOB
23. DATA ORCHESTRATION SUMMIT
Automate code with task-oriented containerized jobs
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
All of the analysis findings are moved into production-quality modules, with entry points declared in makefiles for tasks such as:
● Data preparations
● Feature extractions
● Model selection / tuning
● Evaluations
● Model Inference
● Predictions post-processing
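One way to expose tasks like these as entry points is a small command-line dispatcher that a makefile target or a container command can call. This is a hypothetical sketch: the task names, the `--run-id` option and the placeholder task bodies are all assumptions made for illustration.

```python
import argparse
from typing import Callable, Dict, List, Optional

# Registry mapping a task name to the function that implements it.
TASKS: Dict[str, Callable[[str], str]] = {}


def task(name: str):
    """Decorator that registers a function as a runnable pipeline task."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        TASKS[name] = func
        return func
    return register


@task("data-preparation")
def prepare_data(run_id: str) -> str:
    return f"prepared data for run {run_id}"  # placeholder body


@task("feature-extraction")
def extract_features(run_id: str) -> str:
    return f"extracted features for run {run_id}"  # placeholder body


def main(argv: Optional[List[str]] = None) -> str:
    """Parse the task name and options, run the task, and return its result."""
    parser = argparse.ArgumentParser(description="Run one pipeline task")
    parser.add_argument("task", choices=sorted(TASKS))
    parser.add_argument("--run-id", default="dev")
    args = parser.parse_args(argv)
    result = TASKS[args.task](args.run_id)
    print(result)
    return result


# Example invocation with explicit arguments instead of sys.argv.
output = main(["feature-extraction", "--run-id", "2020-10-14"])
```

A makefile target would then only need to call this script with the task name, which keeps the same entry point usable locally, in CI and inside a container.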
24. DATA ORCHESTRATION SUMMIT
Automate task execution using Continuous Integration (CI)
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
On commit
Code tests
Evaluation reports
Builds & Deployment
On release
25. DATA ORCHESTRATION SUMMIT
Embarrassingly parallel data processing and batch inference with AWS Batch
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
Jobs
Data batches (~ a few GB each)
Output storage
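Embarrassingly parallel processing means each job can pick its own data batch without coordinating with the others. The sketch below assumes the jobs are submitted as an AWS Batch array job, in which case each child job receives its position in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. Whether the talk's pipeline uses array jobs is not stated, and the file names are made up.

```python
import os
from typing import List, Sequence


def select_batch(items: Sequence[str], index: int, total: int) -> List[str]:
    """Return the items assigned to worker `index` out of `total` workers."""
    if not 0 <= index < total:
        raise ValueError("index must be in [0, total)")
    # Striding keeps batches balanced even when len(items) % total != 0.
    return list(items[index::total])


def batch_for_this_job(items: Sequence[str], total: int) -> List[str]:
    """Pick this job's batch from the array index set by AWS Batch (default 0)."""
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return select_batch(items, index, total)


# Made-up input files split across 3 jobs.
files = [f"part{i:03d}.parquet" for i in range(7)]
batches = [select_batch(files, i, 3) for i in range(3)]
print(batches)
```

Every file lands in exactly one batch, so the jobs can write their outputs independently to the output storage.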
26. DATA ORCHESTRATION SUMMIT
Model serving via microservices
SERVERLESS CHOICE
Cheap and simple solution for deploying containers without having to care about the infrastructure
Limits as of today: max 4 vCPUs and 30GB of RAM
OR
SERVERFUL CHOICE
Advanced, customizable, powerful, widespread solutions for container orchestration on pools of EC2 instances
Requires infrastructure management
AWS EC2
27. DATA ORCHESTRATION SUMMIT
How do containers scale for a varying real-time request load?
[Figure: chart of the number of requests per second against capacity, marking an unexpected sudden burst and the over-provisioning cost]
28. DATA ORCHESTRATION SUMMIT
Real-time serverless model serving
[Figure: architecture diagram. Labels: Training pipeline; save model; Build and deploy; Package requirements; EFS; read libraries; trigger; Update metainfo and configs; REST request; Lookup user and model info; Get users features; Get model; return predictions]
29. DATA ORCHESTRATION SUMMIT
Comparison for real-time applications
Criterion | Containers | Serverless
Horizontal scaling | Autoscaling rules based on predicted load and capacity | Elastic, based on real-time demand
Provisioning time | Minutes | Immediate, or seconds on a cold start
Burst concurrency | Depends on available resources | 3000 + additional 500 every minute
Cost efficiency | Pay for the over-provisioning | Only pay for what you use (10x cheaper in our use cases)
Vertical scaling | Limited by instance types | Limited to 3GB and 2 CPUs
Execution timeout | Unlimited | 15 minutes
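The burst concurrency figure (3000 + additional 500 every minute) can be turned into a quick estimate of how long a sudden spike takes to absorb. This is a simplified worked example using only the two numbers from the comparison; it ignores account-level limits and any other constraint not mentioned there.

```python
import math


def minutes_to_reach(target: int, initial_burst: int = 3000,
                     per_minute: int = 500) -> int:
    """Whole minutes needed before `target` concurrent executions are available."""
    if target <= initial_burst:
        return 0
    return math.ceil((target - initial_burst) / per_minute)


for target in (2000, 3100, 5000):
    print(target, minutes_to_reach(target))
```

Under these assumptions a spike to 5000 concurrent requests is fully served after 4 minutes, while anything up to 3000 is served immediately.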
31. DATA ORCHESTRATION SUMMIT
Orchestrating functions and microservices with Step Functions
Workflows are defined as finite-state machines, with plug-and-play integration with most AWS services, including:
AWS Batch, ECS, SageMaker
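A finite-state-machine workflow of this kind is written in the Amazon States Language, which is plain JSON. The sketch below builds a two-step definition as a Python dictionary: a batch job followed by a function call. The state names, job queue, job definition and function name are hypothetical, and the workflow itself is an illustration rather than the one used at Helixa.

```python
import json
from typing import Any, Dict, List


def build_state_machine(job_queue: str, job_definition: str,
                        function_name: str) -> Dict[str, Any]:
    """Return a two-state workflow: run a batch job, then invoke a function."""
    return {
        "Comment": "Batch inference followed by a post-processing function",
        "StartAt": "BatchInference",
        "States": {
            "BatchInference": {
                "Type": "Task",
                # Service integration that waits for the batch job to finish.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "batch-inference",
                    "JobQueue": job_queue,
                    "JobDefinition": job_definition,
                },
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
                "End": True,
            },
        },
    }


def undefined_targets(definition: Dict[str, Any]) -> List[str]:
    """List state names referenced by StartAt or Next that are not defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    targets |= {state["Next"] for state in states.values() if "Next" in state}
    return sorted(targets - set(states))


# Hypothetical resource names, used only for illustration.
definition = build_state_machine("inference-queue", "inference-job", "post-process")
print(json.dumps(definition, indent=2))
```

Checking for undefined transition targets before deployment catches the most common mistake in hand-written definitions.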
35. DATA ORCHESTRATION SUMMIT
Infrastructure Monitoring and Alerting
Basic Monitoring: AWS resources and custom metrics generated by your applications and services
General Infra Monitoring: cloud-scale monitoring of logs, metrics and traces from distributed, dynamic and hybrid infrastructure
Serverless Monitoring: all-in-one performance management tool, down to the single line of code, specifically designed for serverless applications
36. DATA ORCHESTRATION SUMMIT
KPIs and Metrics Dashboard
Data sanity checks
KPIs over time such as:
● Distribution shifts
● Model drift
● Utilization
● Coverage
Analytics dashboard on top of Athena SQL queries
Custom programmatic dashboards with interactive charts
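A distribution shift KPI needs a number that can be plotted over time. The slides do not say which metric is used; the sketch below uses the Population Stability Index (PSI), one common choice, on two categorical distributions. The category names and proportions are made up.

```python
import math
from typing import Dict


def population_stability_index(expected: Dict[str, float],
                               actual: Dict[str, float],
                               eps: float = 1e-6) -> float:
    """PSI between two distributions given as category -> proportion."""
    psi = 0.0
    for category in expected.keys() | actual.keys():
        # Clip to eps so that empty categories do not produce log(0).
        e = max(expected.get(category, 0.0), eps)
        a = max(actual.get(category, 0.0), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Made-up age distribution at training time and in the latest batch.
baseline = {"18-24": 0.30, "25-40": 0.50, "41+": 0.20}
current = {"18-24": 0.40, "25-40": 0.45, "41+": 0.15}
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, current)
print(psi_same, psi_shifted)
```

Each term is non-negative, so the index is zero only when the two distributions match and grows as they diverge.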
42. DATA ORCHESTRATION SUMMIT
Download the Non-Technical Guide
Topics covered:
✅ Getting started with understanding the technology
✅ Designing the right ML product
✅ Planning under uncertainty
✅ Building a balanced ML team
www.helixa.ai/machine-learning-guide-2020