The hidden engineering behind machine learning products at Helixa

DATA ORCHESTRATION SUMMIT
Gianmario Spacagna
Chief Scientist at Helixa

DATA ORCHESTRATION SUMMIT 2020
In the next 20 minutes you will learn about:
1. A real ML system powering a platform used by thousands of marketers around the globe
2. The tools and engineering practices that enabled us to build fast, cheap, and robust pipelines

About Helixa
Helixa is an audience intelligence platform that uses Machine Learning to provide accurate and timely consumer insights for modern market research

Audience size: 1.5M / 223M represented population
Top Influencers: François Chollet (201x), Ben Hamner (114x), George Hotz (106x)
Top Media: Cifar News (65x), The Hacker News (31x), AngelList (28x)
Top Products and Companies: TensorFlow (107x), Waymo (66x), Airbnb Engineering (55x)
Demographics: 18-40 years old, male, U.S. and India

Platform Requirements
● Multiple datasets
● Accurate consumer insights
● Real-time analytics quickly
● Always available
● Minimum infrastructure maintenance
● Cost effective

Helixa ML System Overview

Helixa end-to-end pipeline
[Diagram labels: Data; Data Integrations; Data Processing; Common Data Model; Machine Learning jobs (Contents Embedding, Entity Resolution, Taxonomy Categorization, Users Digital DNA, Traits Classifiers, Latent Interests Augmentation); Audience Projection; Insights Engine; Real-time analytics applications; Other Analytics Tools]

Helixa architecture
[Diagram labels: Data Ingestions; ML Cloud Services; Pre-trained models; External APIs; ML Libraries; ML pipelines; Model repository; Production DB; Microservices; Data Lake; Batch Jobs; Analytics applications]

Tech stack and tools
● Batch inference
● Model repository and evaluation metrics
● Training and hyper-parameter tuning
● Analysis and Research
● ML libraries
● Data Labeling
● Feature Store
● Feature Engineering
● Data Lake
In this talk we will focus on

The Data Lake(house)

Native Cloud Object (Data) Storage
Benefits:
● Cheaper
● Elastic
● Highly available
● Performant
Hadoop HDFS

Data Lake(house) using Glue and Athena
Artifacts are saved in S3 and crawled by Glue.
Athena is used to build logical views on top of them, for example to:
▪ Retrieve the latest version of an artifact
▪ Aggregate multiple partitions of the same artifact
▪ Filter and merge with other tables
▪ Export snapshots of the views as versioned parquet datasets
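A minimal sketch of the "latest version" view, assuming a Glue-crawled table whose partition columns mirror the S3 layout (`feature_family`, `timestamp`). The table and view names are hypothetical and not taken from the talk; the function only builds the SQL text that could then be submitted to Athena.

```python
def latest_version_view_sql(view_name: str, table: str, family: str) -> str:
    """Return Athena DDL for a view exposing only the newest partition.

    Timestamps such as 2020-10-14-12-58 sort lexicographically in
    chronological order, so max() over the string column is sufficient.
    All identifiers passed in are illustrative assumptions.
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT *\n"
        f"FROM {table}\n"
        f"WHERE feature_family = '{family}'\n"
        f'  AND "timestamp" = (\n'
        f'    SELECT max("timestamp") FROM {table}\n'
        f"    WHERE feature_family = '{family}'\n"
        f"  )"
    )


sql = latest_version_view_sql("latest_text_embedding", "features", "text_embedding")
print(sql)
```

The column name `timestamp` is double-quoted because it is a reserved word in Athena queries.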

Feature Store Partitions (X)
S3 bucket
❏ users
  ❏ features
    ❏ feature_family=text_embedding (partition by set of features generated by the same job)
      ❏ timestamp=2020-10-14-12-58 (creation time)
        ❏ _metadata.json (metadata containing info on how the features were created)
        ❏ part000.parquet (parquet data indexed by user_id)
        ❏ part001.parquet
        ❏ …
      ❏ timestamp=2020-09-18-18-35
        ❏ ...
    ❏ feature_family=picture_embedding
      ❏ ...
    ❏ feature_family=category_counts
      ❏ ...
❏ items
❏ other entities
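The layout above can be navigated with a few lines of code. The following is a sketch, not the production implementation: it assumes object keys follow the `name=value` partition convention shown on the slide, and the helper names are invented for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M"  # matches e.g. timestamp=2020-10-14-12-58


def parse_partitions(key: str) -> dict:
    """Extract name=value path segments from an object key."""
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:
            parts[name] = value
    return parts


def latest_files(keys, feature_family):
    """Return the parquet keys of the newest timestamp partition of one family."""
    candidates = []
    for key in keys:
        parts = parse_partitions(key)
        if (parts.get("feature_family") == feature_family
                and "timestamp" in parts
                and key.endswith(".parquet")):
            created = datetime.strptime(parts["timestamp"], TIMESTAMP_FORMAT)
            candidates.append((created, key))
    if not candidates:
        return []
    newest = max(created for created, _ in candidates)
    return sorted(key for created, key in candidates if created == newest)
```

Sidecar files such as `_metadata.json` are skipped because only keys ending in `.parquet` are considered.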

Label Store Partitions (y)
S3 bucket
❏ users
  ❏ labels
    ❏ variable=gender (partition by the variable we are trying to predict)
      ❏ source=first_name (partition by the source of ground truth)
        ❏ timestamp=2020-10-14-12-58
          ❏ _metadata.json
          ❏ part000.parquet
          ❏ part001.parquet
          ❏ …
      ❏ source=public_profile
        ❏ ...
    ❏ variable=age
❏ items
❏ other entities
Label management for weak learning done via

Prediction Store Partitions (y_pred)
S3 bucket
❏ users
  ❏ predictions
    ❏ variable=gender
      ❏ model=xgbc (partition by the identifier of the model used to predict)
        ❏ timestamp=2020-11-05-17-22
          ❏ _metadata.json
          ❏ part000.parquet
          ❏ part001.parquet
          ❏ …
      ❏ model=cnn
        ❏ ...
    ❏ variable=age
      ❏ ...
❏ items
❏ other entities
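The feature, label and prediction stores share one naming scheme: entity, store, then a chain of `name=value` partitions. A small sketch of a prefix builder for that scheme follows; the function name and signature are assumptions made for illustration.

```python
def partition_prefix(entity: str, store: str, **partitions: str) -> str:
    """Build an S3 key prefix such as users/predictions/variable=gender/...

    Keyword arguments keep their call order, so partitions are nested
    in the order in which they are passed.
    """
    segments = [entity, store]
    segments += [f"{name}={value}" for name, value in partitions.items()]
    return "/".join(segments) + "/"


print(partition_prefix("users", "predictions",
                       variable="gender", model="xgbc",
                       timestamp="2020-11-05-17-22"))
```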

The Development

Platforms for managing the ML lifecycle
Production
● Training
● Predictions
● Model serving
● Model repository
● Experiments tracking
● Evaluation metrics
R&D
● Dev data versioning and linkage
● Automated evaluation reports
● Collaborative experiments
● Deep Learning computing environment

R&D workflow
[Diagram labels: Pull; Notebooks and data stored and shared in S3; Data cache; Dev unix machine in the cloud; Notebook name matching branch ID; Install the latest version of the code; Develop code locally using professional IDEs; Feature branches matching Jira key; Gitflow branching model; Commit and push]

Alluxio cache configuration
● EC2 memory-optimized machines (r4 or r5 family)
● EBS volume of 250GB of storage
● Alluxio and Jupyter services to start at boot time
● 200GB reserved for the Alluxio cache
● S3 buckets mounted locally in --readonly mode using fuse API
● Read parquet data in multi-processing using Dask directly from the local file system instead of using the S3 boto API
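A sketch of reading parquet through the local mount rather than through the S3 API. The mount root `/mnt/s3` and the helper names are assumptions; only `dask.dataframe.read_parquet` is a real library call, and it is imported lazily so the path helper can be used without Dask installed.

```python
from pathlib import PurePosixPath


def local_path(s3_uri: str, mount_root: str = "/mnt/s3") -> str:
    """Map s3://bucket/key to the path under the (assumed) local mount point."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return str(PurePosixPath(mount_root) / s3_uri[len(prefix):])


def load_partition(s3_uri: str, mount_root: str = "/mnt/s3"):
    """Lazily open every parquet file of a partition through the local file system."""
    import dask.dataframe as dd  # imported here so local_path works without Dask

    # The glob skips sidecar files such as _metadata.json.
    return dd.read_parquet(local_path(s3_uri, mount_root) + "/*.parquet")
```

Calling `load_partition(...).compute(scheduler="processes")` would materialize the data using Dask's multi-processing scheduler.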

Alluxio benefits for the R&D
● Research & Development data: ~1TB
● We only focus on 15% of the data every month (~150GB)
● The data is re-accessed at every kernel restart (~5 times a day)
● Data science team members: ~5 people
● Datasets are spread into files of ~120MB each
=> roughly 1.2k files and 500k read requests every month
We observed a speed-up between 3x and 5x using Alluxio
+ all of the benefits of accessing the S3 data from the POSIX API
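The back-of-the-envelope figures above can be reproduced as follows. This is a rough check under stated assumptions: decimal units (1GB = 1000MB) and every team member re-reading the whole working set at each kernel restart. The number of active days per month is not given on the slide, so it is derived here from the reported 500k requests.

```python
MB_PER_GB = 1000  # assumption: decimal units

monthly_data_gb = 150             # ~15% of ~1TB
file_size_mb = 120
restarts_per_day = 5
team_size = 5
reported_monthly_reads = 500_000  # figure quoted on the slide

files = monthly_data_gb * MB_PER_GB / file_size_mb
reads_per_day = files * restarts_per_day * team_size
implied_active_days = reported_monthly_reads / reads_per_day

print(f"files: {files:.0f}")                              # 1250, i.e. roughly 1.2k
print(f"read requests per day: {reads_per_day:.0f}")      # 31250
print(f"implied active days: {implied_active_days:.0f}")  # 16
```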

Processing large datasets with EMR
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically reduce the cost of operations
+ SparkMagic
SUBMIT JOB

The Deployment

Automate code with task-oriented containerized jobs
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
All of the analysis findings are moved into production-quality modules, with entry points declared in makefiles for tasks such as:
● Data preparations
● Feature extractions
● Model selection / tuning
● Evaluations
● Model Inference
● Predictions post-processing
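One way such task entry points could look is a single command-line module that a makefile target or a container command invokes. The sketch below uses hypothetical task names and placeholder bodies; it illustrates the pattern rather than Helixa's actual code.

```python
import argparse

TASKS = {}


def task(func):
    """Register a function as a named task (prepare_data -> prepare-data)."""
    TASKS[func.__name__.replace("_", "-")] = func
    return func


@task
def prepare_data(input_uri, output_uri):
    # Placeholder body: a real task would read, clean and write the data.
    return f"prepared {input_uri} -> {output_uri}"


@task
def extract_features(input_uri, output_uri):
    # Placeholder body: a real task would compute and store features.
    return f"features {input_uri} -> {output_uri}"


def main(argv=None):
    """Parse the task name and its arguments, then run the task."""
    parser = argparse.ArgumentParser(description="Task-oriented entry point")
    parser.add_argument("task", choices=sorted(TASKS))
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return TASKS[args.task](args.input, args.output)
```

A makefile target would then call the module with a task name plus `--input` and `--output`, and the same command can serve as the container's entry point.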

Automate task execution using Continuous Integration (CI)
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
On commit
Code tests
Evaluation reports
Builds & Deployment
On release

Embarrassingly parallel data processing and batch inference with AWS Batch
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
Data batches (~ a few GBs each)
Jobs
Output storage
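A sketch of how input files might be grouped into batches of a few GBs and assigned to parallel jobs. The greedy grouping and the use of array jobs are assumptions; the talk does not say how batches are formed. `AWS_BATCH_JOB_ARRAY_INDEX` is the environment variable AWS Batch sets for each child of an array job.

```python
import os


def make_batches(files, max_bytes):
    """Greedily group (key, size_in_bytes) pairs into batches of at most max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


def batch_for_this_job(batches):
    """Pick the batch assigned to the current array-job child (index 0 if unset)."""
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return batches[index]
```

Because each job only reads its own batch and writes its own output, the jobs need no coordination with one another.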

Model serving via microservices
SERVERLESS CHOICE
Cheap and simple solution for deploying containers without having to care about the infrastructure
Limits as of today: max 4 vCPUs and 30GB of RAM
OR
SERVERFUL CHOICE
Advanced, customizable, powerful, widespread solutions for container orchestration on pools of EC2 instances
Requires infrastructure management
AWS EC2

How do containers scale for real-time varying request load?
[Chart labels: number of requests per second; capacity; unexpected sudden burst; over-provisioning cost]
Real-time serverless model serving
[Diagram labels: Training pipeline; save model; Update metainfo and configs; Build and deploy; Package requirements; EFS; read libraries; REST request; trigger; Lookup user and model info; Get users features; Get model; return predictions]
Comparison for real-time applications
Criterion | Containers on servers | Serverless functions
Horizontal scaling | Autoscaling rules based on predicted load and capacity | Elastic, based on real-time demand
Provisioning time | Minutes | Immediately, or seconds if cold start
Burst concurrency | Depends on available resources | 3000 + additional 500 every minute
Cost efficiency | Pay for the over-provisioning | Only pay for what you use (10x cheaper in our use cases)
Vertical scaling | Limited by instance types | Limited to 3GB and 2 CPUs
Execution timeout | Unlimited | 15 minutes
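The serverless burst figure quoted above ("3000 + additional 500 every minute") can be written as a simple formula. The function below is an illustrative sketch; the optional `account_limit` parameter is an assumption added to show a cap and is not part of the slide.

```python
def burst_capacity(minutes_since_burst, initial=3000, per_minute=500, account_limit=None):
    """Concurrent executions available a number of whole minutes into a burst."""
    capacity = initial + per_minute * int(minutes_since_burst)
    if account_limit is not None:
        capacity = min(capacity, account_limit)
    return capacity


for minutes in (0, 1, 5):
    print(minutes, burst_capacity(minutes))
```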

Pick the best of both worlds

Orchestrating functions and microservices with Step Functions
Workflows are defined as finite-state machines, with plug-and-play integration with most of the AWS services:
● AWS Batch
● ECS
● SageMaker
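A sketch of such a workflow in Amazon States Language, expressed as a Python dictionary: a Batch job that the state machine waits for, followed by a Lambda invocation. The state names, job name and function arguments are hypothetical; the two `Resource` values are the service-integration ARNs that Step Functions provides for AWS Batch and Lambda.

```python
import json


def hybrid_workflow(job_queue_arn, job_definition_arn, function_name):
    """Build a two-state machine: a Batch job, then a Lambda function."""
    return {
        "Comment": "Batch inference followed by a serverless post-processing step",
        "StartAt": "BatchInference",
        "States": {
            "BatchInference": {
                "Type": "Task",
                # .sync makes the state wait until the Batch job finishes.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "batch-inference",
                    "JobQueue": job_queue_arn,
                    "JobDefinition": job_definition_arn,
                },
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
                "End": True,
            },
        },
    }


definition = hybrid_workflow("queue-arn", "job-definition-arn", "post-process")
print(json.dumps(definition, indent=2))
```

The JSON produced by `json.dumps` is the form in which a state machine definition is submitted to Step Functions.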

Hybrid solution for:

Monitoring and Alerting

Centralized logging with the ELK stack
Generate Logs → Aggregation & Transformation → Storage & Indexing → Visualization & Analysis

Infrastructure Monitoring and Alerting
Basic Monitoring: AWS resources and custom metrics generated by your applications and services
General Infra Monitoring: cloud-scale monitoring of logs, metrics and traces from distributed, dynamic and hybrid infrastructure
Serverless Monitoring: all-in-one performance management tool, down to the single line of code, specifically designed for serverless applications

KPIs and Metrics Dashboard
Data sanity checks
KPIs over time such as:
● Distribution shifts
● Model drift
● Utilization
● Coverage
Analytics dashboard on top of Athena SQL queries
Custom programmatic dashboards with interactive charts
Final Remarks

Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
Only a small fraction of real-world ML systems is composed of the ML Code.
The required surrounding infrastructure is vast and complex.

Facebook original motto
Facebook new motto in 2014

Different tools

Embrace the serverless paradigm

Download the Non-Technical Guide
Topics covered:
✅ Getting started with understanding the technology
✅ Designing the right ML product
✅ Planning under uncertainty
✅ Building a balanced ML team
www.helixa.ai/machine-learning-guide-2020

Gianmario Spacagna
Chief Scientist at Helixa
gspacagna@helixa.ai
gm_spacagna
gmspacagna
datasciencevademecum.com
