© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Accelerate ML workloads using EC2
accelerated computing
Chetan Kapoor
Principal Product Manager – Amazon EC2
C M P 2 0 2
Amazon EC2 instance types
• General purpose: M5, T3
• Compute optimized: C5, C4
• Storage optimized: H1, I3, D2
• Memory optimized: X1e, R5
• Accelerated computing: F1, P3, G3
Choice of processors and architectures*
Right compute for each application and workload
Over 100 EC2 instances featuring:
• Intel Xeon processors
• AWS Graviton processors, based on the 64-bit Arm architecture
• AMD EPYC processors
Additional Amazon EC2 instances featuring NVIDIA GPUs and FPGAs
*Not all processors and architectures are available globally
Hardware acceleration for computationally demanding applications
• Machine learning: image recognition, natural language processing, speech recognition
• High performance computing: computational fluid dynamics, genomics, weather simulation, EDA
• Graphics-intensive: graphics workstations, video transcoding, game streaming
C5: Compute-optimized instances
• Custom 3.0 GHz Intel Xeon Scalable processors (Skylake)
• Up to 72 vCPUs and 144 GiB of memory (2:1 memory:vCPU ratio)
• 25 Gbps network bandwidth
• Support for Intel AVX-512 – great for ML inference
• C5d variant with local NVMe-based SSD storage
• Up to 50% instance savings over C4
• 25% price/performance improvement over C4
“We saw significant performance improvement on Amazon EC2 C5, with up to a 140% performance improvement in open standard CPU benchmarks over C4.”
“We are eager to migrate onto the AVX-512 enabled c5.18xlarge instance size… We expect to decrease the processing time of some of our key workloads by more than 30%.”
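The AVX-512 claim is easy to verify on a running Linux instance, because the feature bits show up in the CPU flags. A minimal stdlib-only sketch; the sample flags line is illustrative, not copied from a real C5:

```python
# Check whether a Linux instance exposes AVX-512 (as on C5) by parsing
# the "flags" line of /proc/cpuinfo. Pure stdlib; illustrative sketch.

def has_avx512(cpuinfo_text):
    """Return True if any 'flags' line lists an avx512 feature bit."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if any(f.startswith("avx512") for f in flags):
                return True
    return False

# Abbreviated, illustrative flags line:
sample = "flags\t: fpu sse sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
print(has_avx512(sample))  # -> True

# On a live instance you would read the file directly:
# with open("/proc/cpuinfo") as f:
#     print(has_avx512(f.read()))
```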
C5n: fastest networking in the cloud
• Featuring Intel Xeon Scalable processors
• 33% increased memory footprint over C5 instances
• 25 Gbps peak bandwidth on smaller instance sizes
• 100 Gbps network bandwidth on largest instance sizes
Faster analytics and big data workloads; lower costs for network-bound workloads; all of the elasticity, security, and scalability of AWS
z1d: high frequency for specialized workloads
• High-frequency instances with custom Intel Xeon Scalable processors running at sustained 4 GHz all-core turbo
• 8:1 GiB-to-vCPU ratio
• Up to 25 Gbps network bandwidth and up to 1.8 TB of local NVMe storage
• Six sizes, from z1d.large up to z1d.12xlarge (48 vCPUs, 384 GiB)
Use cases: electronic design automation, relational databases, gaming
CPUs vs. GPUs vs. FPGAs vs. ASICs for compute
CPU
• 10s–100s of processing cores
• Pre-defined instruction set and datapath widths
• Optimized for general-purpose computing
GPU
• 1,000s of processing cores
• Pre-defined instruction set and datapath widths
• Highly effective at parallel execution
FPGA
• Millions of programmable digital logic cells
• No predefined instruction set or datapath widths
• Hardware-timed execution
ASIC
• Optimized and custom designed for a particular use/function
• Predefined software experience exposed through an API
(Diagram: CPU and GPU blocks built from Control, ALU, Cache, and DRAM elements)
EC2 accelerated computing instances
P3: GPU compute instance
• Up to 8 NVIDIA V100 GPUs in a single instance, with NVLink for peer-to-peer GPU communication
• Supporting a wide variety of use cases including deep learning, HPC simulations, financial computing, and batch
rendering
G3: GPU graphics instance
• Up to 4 NVIDIA M60 GPUs, with GRID Virtual Workstation features and licenses
• Designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video
encoding, and virtual reality applications
F1: FPGA instance
• Up to 8 Xilinx Virtex UltraScale+ VU9P FPGAs in a single instance. Programmable via VHDL, Verilog, or OpenCL.
Growing marketplace of pre-built application accelerations.
• Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and
image processing
AWS Inferentia – ML Inference Chip
• High-performance machine learning inference chip, custom designed by AWS
• Designed for lower cost-per-inference across the full range of ML applications
Amazon EC2 P3 instances for
compute acceleration
Amazon EC2 P3 instances (October 2017)
One of the fastest, most powerful GPU instances in the cloud
• Up to eight NVIDIA Tesla V100 GPUs
• 1 petaflop of computational performance – up to 14x better than P2
• 300 GB/s GPU-to-GPU communication (NVLink) – 9x better than P2
• 16 GB GPU memory with 900 GB/s peak GPU memory bandwidth
Use cases for P3 instances
Machine learning/AI: natural language processing, image and video recognition, autonomous vehicle systems, recommendation systems
High performance computing: computational fluid dynamics, financial and data analytics, weather simulation, computational chemistry
The machine learning process
1. Business problem – ML problem framing
2. Data collection and integration
3. Data preparation and cleaning
4. Data visualization and analysis
5. Feature engineering
6. Model training and parameter tuning
7. Model evaluation
8. Are business goals met? If no, loop back through data augmentation, feature augmentation, and re-training; if yes, proceed
9. Model deployment
10. Monitoring and debugging – predictions (with re-training as needed)
Training machine learning models
AlexNet, 2012
• A large, deep convolutional neural network with five convolutional layers, 60
million parameters, and 650,000 neurons
• Created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
• Won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
• Used two NVIDIA GTX 580 GPUs
• Took nearly a week to train!
Source - https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
AWS P3 vs. P2 instances
GPU performance comparison
• P2 instances use the K80 accelerator (Kepler architecture)
• P3 instances use the V100 accelerator (Volta architecture)
(Charts comparing K80, P100, and V100 in TFLOPS: the V100 delivers roughly 1.7x the K80's FP32 performance, 2.6x its FP64 performance, and 14x its maximum FP32 performance when using mixed/FP16 precision)
P3 instance details

Instance size | GPUs | GPU peer-to-peer | vCPUs | Memory (GB) | Network bandwidth | Amazon EBS bandwidth | On-Demand price/hr* | 1-yr RI effective hourly* | 3-yr RI effective hourly*
P3.2xlarge | 1 | No | 8 | 61 | Up to 10 Gbps | 1.7 Gbps | $3.06 | $1.99 (35% disc.) | $1.23 (60% disc.)
P3.8xlarge | 4 | NVLink | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 (35% disc.) | $4.93 (60% disc.)
P3.16xlarge | 8 | NVLink | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 (35% disc.) | $9.87 (60% disc.)

Regional availability: P3 instances are generally available in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Seoul), Asia Pacific (Tokyo), AWS GovCloud (US), and China (Beijing) Regions.

Framework support: P3 instances and their V100 GPUs are supported across all major frameworks (such as TensorFlow, MXNet, PyTorch, Caffe2, and CNTK).
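The Reserved Instance columns are roughly the On-Demand rate with the quoted discount applied. A quick arithmetic check of the 1-yr (35%) column; the 3-yr figures round slightly differently, so they are not checked here:

```python
# Sanity-check the 1-yr RI "effective hourly" column: it is approximately
# the On-Demand rate with the quoted 35% discount applied.

def effective_hourly(on_demand, discount):
    """Approximate RI effective hourly rate, rounded to cents."""
    return round(on_demand * (1 - discount), 2)

on_demand = {"P3.2xlarge": 3.06, "P3.8xlarge": 12.24, "P3.16xlarge": 24.48}
for size, price in on_demand.items():
    print(size, effective_hourly(price, 0.35))
# P3.2xlarge 1.99, P3.8xlarge 7.96, P3.16xlarge 15.91 - matching the table
```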
P3 instance details (continued)
• P3 instances provide GPU-to-GPU data transfer over NVLink
• P2 instances provided GPU-to-GPU data transfer over PCI Express
New larger P3 size – P3dn.24xlarge
Optimized for distributed ML training
• One of the most powerful GPU instances available in the cloud
• 100 Gbps of networking throughput
• 96 vCPUs using AWS custom Skylake CPUs and 768 GB of system memory
• Based on NVIDIA's latest Tesla V100 GPU with 32 GB of memory

Instance size | GPUs | GPU memory | GPU peer-to-peer | vCPUs | CPU type | Memory (GB) | Network bandwidth | Amazon EBS bandwidth | Local instance storage
P3.2xlarge | 1 x V100 | 16 GB/GPU | No | 8 | Broadwell | 61 | Up to 10 Gbps | 1.7 Gbps | NA
P3.8xlarge | 4 x V100 | 16 GB/GPU | NVLink | 32 | Broadwell | 244 | 10 Gbps | 7 Gbps | NA
P3.16xlarge | 8 x V100 | 16 GB/GPU | NVLink | 64 | Broadwell | 488 | 25 Gbps | 14 Gbps | NA
P3dn.24xlarge | 8 x V100 | 32 GB/GPU | NVLink | 96 | Skylake | 768 | 100 Gbps | 14 Gbps | 2 TB NVMe

• Latest NVIDIA V100 GPU with 32 GB of memory for large models and higher batch sizes
• 96 Skylake vCPUs with support for AVX-512 instructions for pre-processing of training data
• 100 Gbps of networking throughput for large-scale distributed training and fast data access
Scaling performance using distributed training
(Chart: training throughput in images/second for ResNet-50 on ImageNet, scaling from 1 to 64 GPUs across P3 instances)
• Using single P3 instances with Volta GPUs, customers can cut the training times of their machine learning models from days to a few hours.
• Using distributed training across multiple P3 instances with high-performance networking and storage solutions, customers can further cut their time-to-train from hours to minutes.
• Example – we have been able to train ResNet-50 to a Top-1 validation accuracy of 76% in 14 minutes using a cluster of P3.16xlarge instances.
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
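The scaling behavior the chart summarizes can be quantified as scaling efficiency: measured throughput divided by the ideal linear throughput of N GPUs. A sketch using hypothetical throughput numbers, not measured P3 results:

```python
# Scaling efficiency for distributed training: measured throughput divided
# by ideal linear throughput (N GPUs x single-GPU rate). The images/second
# figures below are illustrative placeholders, not measured P3 results.

def scaling_efficiency(throughput, n_gpus, single_gpu_rate):
    """Fraction of ideal linear scaling achieved (1.0 = perfectly linear)."""
    return throughput / (n_gpus * single_gpu_rate)

single = 750.0  # hypothetical ResNet-50 images/sec on one V100
measured = {8: 5700.0, 64: 43000.0}  # hypothetical multi-GPU throughputs
for n, tput in measured.items():
    print(n, "GPUs:", round(scaling_efficiency(tput, n, single), 2))
# 8 GPUs: 0.95, 64 GPUs: 0.9
```

Efficiencies near 1.0 indicate that networking and input pipelines are keeping up; a drop at high GPU counts usually points at communication or storage bottlenecks, which is where the 100 Gbps networking of P3dn.24xlarge helps.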
The broadest global availability
Available AWS Regions for P3 instances include:
• US East (N. Virginia)
• US East (Ohio)
• US West (Oregon)
• Canada (Central)
• Europe (Ireland)
• Europe (Frankfurt)
• Europe (London)
• Asia Pacific (Tokyo)
• Asia Pacific (Seoul)
• Asia Pacific (Sydney)
• Asia Pacific (Singapore)
• China (Beijing)
• China (Ningxia)
• AWS GovCloud (US)
Available AWS regions for P3dn.24xlarge instances include:
• US East (N. Virginia)
• US West (Oregon)
AWS storage options

Amazon S3: secure, durable, highly scalable object storage with fast access and low cost. For long-term durable storage of data in a readily accessible get/put access format. Use as the primary durable and scalable storage for data.

Amazon S3 Glacier: secure, durable, long-term, highly cost-effective object storage. For long-term storage and archival of data that is infrequently accessed. Use for long-term, lower-cost archival of data.

EC2 + EBS: create a Single-AZ shared file system using Amazon EC2 and Amazon EBS, with third-party or open source software (e.g., ZFS, Intel Lustre). For near-line storage of files optimized for high I/O performance. Use for high-IOPS, temporary working storage.

Amazon EFS: highly available, Multi-AZ, fully managed network-attached elastic file system. For near-line, highly available storage of files in a traditional NFS format (NFSv4). Use for read-often, temporary working storage.
Amazon FSx for Lustre
• High-performance file system optimized for
fast processing of workloads such as
machine learning, HPC, video processing,
financial modeling, and electronic design
automation
• Launch and run a file system that provides
submillisecond access to your data
• Enables you to read and write data at
speeds of up to hundreds of gigabytes per
second of throughput and millions of IOPS
Learn more at aws.amazon.com/fsx/lustre
AWS Deep Learning AMI
• Get started quickly with easy-to-launch tutorials
• Hassle-free setup and configuration
• Pay only for what you use—no additional charge for the
AMI
• Accelerate your model training and deployment
• Support for popular deep learning frameworks
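Launching a GPU instance from the Deep Learning AMI can be scripted with boto3's `run_instances` call. The AMI ID and key pair name below are placeholders; look up the current Deep Learning AMI ID for your Region before launching:

```python
# Launching a P3 instance from the AWS Deep Learning AMI with boto3.
# The AMI ID and key pair name are placeholders - substitute values
# for your own account and Region.

launch_params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder Deep Learning AMI ID
    "InstanceType": "p3.2xlarge",        # 1 x V100; p3dn.24xlarge for 8 x 32 GB V100
    "MinCount": 1,
    "MaxCount": 1,
    "KeyName": "my-key-pair",            # placeholder key pair name
}

# Uncomment to actually launch (requires AWS credentials and boto3):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# response = ec2.run_instances(**launch_params)
# print(response["Instances"][0]["InstanceId"])

print(launch_params["InstanceType"])  # -> p3.2xlarge
```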
Amazon SageMaker:
Build, train, and deploy ML models at scale
(Slide series: SageMaker components, including RL Coach for reinforcement learning, with deployment targets such as Amazon EC2 C5 and AWS IoT Greengrass)
Amazon EC2 G3 instances for
graphics acceleration
AWS G3 GPU instances
• Up to four NVIDIA M60 GPUs
• Includes GRID Virtual Workstation features and licenses; supports up to four monitors with 4096x2160 (4K) resolution
• Includes NVIDIA GRID Virtual Application capabilities for application virtualization software like Citrix XenApp Essentials and VMware Horizon, supporting up to 25 concurrent users per GPU
• Hardware encoding to support up to 10 H.265 (HEVC) 1080p30 streams and up to 18 H.264 1080p30 streams per GPU
• Designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video encoding, and virtual reality applications

Instance size | GPUs | vCPUs | Memory (GiB) | Linux price per hour (IAD) | Windows price per hour (IAD)
g3s.xlarge | 1 | 4 | 30.5 | $0.75 | $0.93
g3.4xlarge | 1 | 16 | 122 | $1.14 | $1.88
g3.8xlarge | 2 | 32 | 244 | $2.28 | $3.75
g3.16xlarge | 4 | 64 | 488 | $4.56 | $7.50
Four modes of using G3 instances
Example configuration: g3.4xlarge – 1 x M60 GPU, 16 vCPUs, 122 GB memory, up to 10 Gbps network
1. EC2 instance with NVIDIA drivers and libraries – graphics rendering, simulations, video encoding
2. EC2 instance with NVIDIA GRID Virtual Workstation – professional workstation (single user)
3. EC2 instance with NVIDIA GRID Virtual Application – virtual apps (25 concurrent users)
4. EC2 instance with NVIDIA GRID for gaming – gaming services
G3 use cases
M&E – content creation; Auto – car configurators; E&P – analytics
• Seismic analysis, energy E&P, cloud GPU rendering & visualization, such as high-end car configurators, AR/VR
• Desktop and application virtualization
• Productivity and consumer apps
• Design and engineering
• Media and entertainment post-production
• Media and entertainment: video playout/broadcast, encoding/transcoding
• Cloud gaming
AWS G4 GPU instances
• Designed for machine learning inferencing,
video transcoding, remote graphics
workstation, and other demanding graphics
applications.
• Up to 8 NVIDIA T4 Tensor Core GPUs
• 2,560 CUDA cores and 320 Turing Tensor Cores per GPU, including support for ray-tracing technology
• Available in multiple sizes
• AWS-custom Intel CPUs (4–96 vCPUs)
• Available soon
Amazon EC2 F1 instances for
custom hardware acceleration
Parallel processing in FPGAs
An FPGA is effective at processing data of many types in parallel – for example, creating a complex pipeline of parallel, multistage operations on a video stream, or performing massive numbers of dependent or independent calculations for a complex financial model.
• An FPGA does not have an instruction set!
• Data can be any bit width (9-bit integer? No problem!)
• Complex control logic (such as a state machine) is easy to implement in an FPGA
Each FPGA in F1 has more than 2M of these programmable logic cells.
How FPGA acceleration works
In an accelerated application, the CPU handles the general application logic, while the FPGA handles compute-intensive, deeply pipelined, hardware-accelerated operations, described in an HDL such as Verilog:

module filter1 (clock, rst, strm_in, strm_out);
  integer i, j; // index for loops
  always @(posedge clock)
    for (i = 0; i < NUMUNITS; i = i + 1)
      tmp_kernel[j] = k[i*OFFSETX];
F1 FPGA instance types on AWS
▪Up to 8 Xilinx UltraScale+ 16 nm VU9P FPGA devices in a single instance
▪The f1.16xlarge size provides:
▪ 8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic
cells and over 5000 programmable DSP blocks
▪ Each of the 8 FPGAs has 4 DDR-4 interfaces, with each interface accessing a 16
GiB, 72-bit wide, ECC-protected memory
Instance size | FPGAs | FPGA memory (GB) | vCPUs | Instance memory (GB) | NVMe instance storage (GB) | Network bandwidth
f1.2xlarge | 1 | 64 | 8 | 122 | 1 x 470 | Up to 10 Gbps
f1.4xlarge | 2 | 128 | 16 | 244 | 1 x 940 | Up to 10 Gbps
f1.16xlarge | 8 | 512 | 64 | 976 | 4 x 940 | 25 Gbps
Three methods to use F1 instances
1. Hardware engineers/developers
• Developers who are comfortable programming FPGAs
• Use the F1 Hardware Development Kit (HDK) to develop and deploy custom FPGA accelerations using Verilog and VHDL
2. Software engineers/developers
• Developers who are not proficient in FPGA design
• Use OpenCL to create custom accelerations
3. Software engineers/developers
• Developers who are not proficient in FPGA design
• Use pre-built, ready-to-use accelerations available in AWS Marketplace
FPGA acceleration development
An F1 deployment pairs an Amazon Machine Image (AMI) containing the CPU application with an Amazon FPGA Image (AFI) loaded onto the FPGA, connected over PCIe, with DDR controllers and DDR-4 attached memory on the FPGA side.
• Launch the instance and load the AFI
• An F1 instance can have any number of AFIs
• An AFI can be loaded into the FPGA in seconds
Developing custom accelerations
The FPGA Developer AMI
Use Xilinx Vivado and a hardware description language (Verilog or VHDL for RTL) with the HDK to
describe and simulate your FPGA logic
Xilinx Vivado for custom logic development Virtual JTAG for interactive debugging
OpenCL generally available for F1
▪ Familiar development experience to accelerate C/C++
applications
▪ 50+ F1 code examples available that span multiple
domains: security, image processing, and accelerated
algorithms
▪ Already supported on the FPGA Developer AMI, no
need to upgrade/install
AWS Marketplace
Discover, procure, deploy, and manage software in the cloud
Delivering FPGA partner solutions
Partners deliver solutions to customers via Amazon EC2 FPGA deployment through AWS Marketplace: an Amazon Machine Image (AMI) containing the CPU application, paired with an Amazon FPGA Image (AFI).
The AFI is secured, encrypted, and dynamically loaded into the FPGA – it can't be copied or downloaded.
AWS Inferentia
High-performance machine learning inference chip, custom designed by AWS
• Making predictions using a trained machine learning model – a process called inference – can drive as much as 90% of the compute costs of an application.
• AWS Inferentia is a machine learning inference chip designed to deliver high performance at low cost.
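The 90% figure is a statement about cost shares. A worked illustration with hypothetical monthly spend figures, not AWS data:

```python
# Illustration of the "inference can be ~90% of compute cost" point:
# given hypothetical monthly spend on training vs. serving predictions,
# compute inference's share of total ML compute cost.

def inference_share(training_cost, inference_cost):
    """Fraction of total ML compute spend that goes to inference."""
    return inference_cost / (training_cost + inference_cost)

# Hypothetical monthly figures (not AWS data):
training, inference = 1000.0, 9000.0
share = inference_share(training, inference)
print(f"{share:.0%}")  # -> 90%
```

Because serving runs continuously while training runs in bursts, even modest per-inference savings compound, which is the motivation behind a purpose-built inference chip.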
Summary
• Pick the right compute platform for accelerating your application
• You have a choice of compute-optimized CPU platforms, or GPU- and FPGA-accelerated platforms
• We aspire to provide you with the broadest and deepest set of products and services to support your workloads
Recap:
• C5: compute-optimized instances with custom 3.0 GHz Intel Xeon Scalable processors (Skylake) and support for Intel AVX-512 – great for ML inference
• z1d: high-frequency instances with custom Intel Xeon Scalable processors running at sustained 4 GHz all-core turbo
• C5n: fastest networking in the cloud – up to 100 Gbps
Thank you!
Chetan Kapoor
Principal Product Manager
Amazon EC2
Building enterprise solutions with blockchain technology - SVC217 - New York ...Building enterprise solutions with blockchain technology - SVC217 - New York ...
Building enterprise solutions with blockchain technology - SVC217 - New York ...
 
Databases on AWS - The right tool for the right job - ADB203 - Santa Clara AW...
Databases on AWS - The right tool for the right job - ADB203 - Santa Clara AW...Databases on AWS - The right tool for the right job - ADB203 - Santa Clara AW...
Databases on AWS - The right tool for the right job - ADB203 - Santa Clara AW...
 
Train once, deploy anywhere on the cloud and at the edge with Amazon SageMake...
Train once, deploy anywhere on the cloud and at the edge with Amazon SageMake...Train once, deploy anywhere on the cloud and at the edge with Amazon SageMake...
Train once, deploy anywhere on the cloud and at the edge with Amazon SageMake...
 
Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...
 
AWS storage solutions for business-critical applications - STG301 - Chicago A...
AWS storage solutions for business-critical applications - STG301 - Chicago A...AWS storage solutions for business-critical applications - STG301 - Chicago A...
AWS storage solutions for business-critical applications - STG301 - Chicago A...
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
 
Migration to AWS: The foundation for enterprise transformation - SVC210 - New...
Migration to AWS: The foundation for enterprise transformation - SVC210 - New...Migration to AWS: The foundation for enterprise transformation - SVC210 - New...
Migration to AWS: The foundation for enterprise transformation - SVC210 - New...
 
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
 
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
Searching for patterns: Log analytics using Amazon ES - ADB205 - New York AWS...
 
How SAP customers are benefiting from machine learning and IoT with AWS - MAD...
How SAP customers are benefiting from machine learning and IoT with AWS - MAD...How SAP customers are benefiting from machine learning and IoT with AWS - MAD...
How SAP customers are benefiting from machine learning and IoT with AWS - MAD...
 
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
Developing serverless applications with .NET using AWS SDK & tools - MAD311 -...
 
Optimize deep learning training and inferencing using GPU and Amazon SageMake...
Optimize deep learning training and inferencing using GPU and Amazon SageMake...Optimize deep learning training and inferencing using GPU and Amazon SageMake...
Optimize deep learning training and inferencing using GPU and Amazon SageMake...
 
CI/CD best practices for building modern applications - MAD310 - New York AWS...
CI/CD best practices for building modern applications - MAD310 - New York AWS...CI/CD best practices for building modern applications - MAD310 - New York AWS...
CI/CD best practices for building modern applications - MAD310 - New York AWS...
 
Build a VR experience in 60 minutes - SVC222 - New York AWS Summit
Build a VR experience in 60 minutes - SVC222 - New York AWS SummitBuild a VR experience in 60 minutes - SVC222 - New York AWS Summit
Build a VR experience in 60 minutes - SVC222 - New York AWS Summit
 
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
 
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
 
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
 

Semelhante a Accelerate ML workloads using EC2 accelerated computing

AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated Computing
AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated ComputingAWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated Computing
AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated ComputingAmazon Web Services
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingAmazon Web Services
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingAmazon Web Services
 
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech Talks
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech TalksDeep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech Talks
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech TalksAmazon Web Services
 
Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Amazon Web Services
 
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...Amazon Web Services Korea
 
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...Amazon Web Services
 
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...Amazon Web Services
 
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...Amazon Web Services
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Amazon Web Services
 
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...Amazon Web Services
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitAmazon Web Services
 
AWSome Day Online 2020_Module 2: Getting started with the cloud
AWSome Day Online 2020_Module 2: Getting started with the cloudAWSome Day Online 2020_Module 2: Getting started with the cloud
AWSome Day Online 2020_Module 2: Getting started with the cloudAmazon Web Services
 
Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon Web Services
 
Amazon EC2 Foundations - SRV319 - Toronto AWS Summit
Amazon EC2 Foundations - SRV319 - Toronto AWS SummitAmazon EC2 Foundations - SRV319 - Toronto AWS Summit
Amazon EC2 Foundations - SRV319 - Toronto AWS SummitAmazon Web Services
 
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon Web Services
 

Semelhante a Accelerate ML workloads using EC2 accelerated computing (20)

AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated Computing
AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated ComputingAWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated Computing
AWS Compute Evolved Week: Deep Dive on Amazon EC2 Accelerated Computing
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated Computing
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated Computing
 
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech Talks
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech TalksDeep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech Talks
Deep Dive on Amazon EC2 Accelerated Computing - AWS Online Tech Talks
 
SRV319 Amazon EC2 Foundations
SRV319 Amazon EC2 FoundationsSRV319 Amazon EC2 Foundations
SRV319 Amazon EC2 Foundations
 
Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319
 
EC2 Foundations - Laura Thomson
EC2 Foundations - Laura ThomsonEC2 Foundations - Laura Thomson
EC2 Foundations - Laura Thomson
 
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...
[Games on AWS 2019] AWS 입문자를 위한 초단기 레벨업 트랙 | AWS 레벨업 하기! : 컴퓨팅 - 조용진 AWS 솔루션즈...
 
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...
Amazon EC2 instances: Customizable cloud computing across workloads - DEM20-S...
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM01-R - Atlanta AWS ...
 
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...
Optimizing your workloads with Amazon EC2 and AMD EPYC processors - DEM01-SR ...
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
 
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...
Optimize your workloads with Amazon EC2 and AMD EPYC - DEM03-SR - New York AW...
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
 
AWSome Day Online 2020_Module 2: Getting started with the cloud
AWSome Day Online 2020_Module 2: Getting started with the cloudAWSome Day Online 2020_Module 2: Getting started with the cloud
AWSome Day Online 2020_Module 2: Getting started with the cloud
 
Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
 
Amazon EC2 Foundations - SRV319 - Toronto AWS Summit
Amazon EC2 Foundations - SRV319 - Toronto AWS SummitAmazon EC2 Foundations - SRV319 - Toronto AWS Summit
Amazon EC2 Foundations - SRV319 - Toronto AWS Summit
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Accelerate ML workloads using EC2 accelerated computing

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. S U M M I T
Accelerate ML workloads using EC2 accelerated computing
Chetan Kapoor, Principal Product Manager – Amazon EC2
C M P 2 0 2
  • 2. Amazon EC2 instance types
General purpose: M5, T3
Compute optimized: C5, C4
Storage optimized: H1, I3, D2
Memory optimized: X1e, R5
Accelerated computing: F1, P3, G3
  • 3. Choice of processors and architectures* – the right compute for each application and workload
Over 100 EC2 instances featuring Intel Xeon processors, the AWS Graviton Processor based on the 64-bit Arm architecture, and AMD EPYC processors
Additional Amazon EC2 instances featuring NVIDIA GPUs and FPGAs
*Not all processors and architectures are available globally
  • 4. Hardware acceleration for computationally demanding applications
Machine learning: image recognition, natural language processing, speech recognition
High performance computing: computational fluid dynamics, genomics, weather simulation, EDA
Graphics intensive: graphics workstations, video transcoding, game streaming
  • 5. C5: Compute-optimized instances
Custom 3.0 GHz Intel Xeon Scalable processors (Skylake)
Up to 72 vCPUs and 144 GiB of memory (2:1 memory:vCPU ratio)
25 Gbps network bandwidth
Support for Intel AVX-512 – great for ML inference
C5d variant with local NVMe-based SSD storage
Up to 50%* savings over C4; 25% price/performance improvement over C4
"We saw significant performance improvement on Amazon EC2 C5, with up to a 140% performance improvement in open standard CPU benchmarks over C4."
"We are eager to migrate onto the AVX-512 enabled c5.18xlarge instance size… We expect to decrease the processing time of some of our key workloads by more than 30%."
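Since AVX-512 support is what makes C5 attractive for ML inference, one quick sanity check after launching an instance is to look for the AVX-512 flags in /proc/cpuinfo. A minimal sketch, assuming a Linux guest; the parsing helper and sample text below are illustrative, not an AWS-provided tool:

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags : fpu vme ... avx512f avx512dq ..."
            return set(line.split(":", 1)[1].split())
    return set()

def has_avx512(flags):
    """AVX-512F is the foundation subset every AVX-512 CPU exposes."""
    return "avx512f" in flags

if __name__ == "__main__":
    # On a real instance you would read the actual file:
    #   flags = parse_cpu_flags(open("/proc/cpuinfo").read())
    sample = "flags\t\t: fpu vme sse sse2 avx avx2 avx512f avx512dq avx512cd"
    print(has_avx512(parse_cpu_flags(sample)))
```

Inference frameworks such as TensorFlow typically pick up AVX-512 kernels automatically when built for the target CPU, so this check is mainly useful for confirming you landed on the hardware you expect.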
  • 6. C5n: fastest networking in the cloud
100 Gbps network bandwidth on the largest instance sizes; 25 Gbps peak bandwidth on smaller instance sizes
33% larger memory footprint than C5 instances
Featuring Intel Xeon Scalable processors
Faster analytics and big data workloads; lower costs for network-bound workloads
All of the elasticity, security, and scalability of AWS
  • 7. z1d: high frequency for specialized workloads
High-frequency instances with custom Intel Xeon Scalable processors running at a sustained 4 GHz all-core turbo
8:1 GiB-to-vCPU ratio
Up to 25 Gbps network bandwidth and up to 1.8 TB of local NVMe storage
Use cases: electronic design automation, relational databases, gaming
6 sizes, from z1d.large up to z1d.12xlarge (48 vCPUs, 384 GiB)
  • 8. CPUs vs. GPUs vs. FPGAs vs. ASICs for compute
CPU: 10s–100s of processing cores; pre-defined instruction set and datapath widths; optimized for general-purpose computing
GPU: 1,000s of processing cores; pre-defined instruction set and datapath widths; highly effective at parallel execution
FPGA: millions of programmable digital logic cells; no predefined instruction set or datapath widths; hardware-timed execution
ASIC: optimized, custom design for a particular use/function; predefined software experience exposed through an API
(Diagram: simplified CPU and GPU block layouts showing control logic, ALUs, cache, and DRAM)
  • 9. EC2 accelerated computing instances
P3: GPU compute instance – up to 8 NVIDIA V100 GPUs in a single instance, with NVLink for peer-to-peer GPU communication; supports a wide variety of use cases including deep learning, HPC simulations, financial computing, and batch rendering
G3: GPU graphics instance – up to 4 NVIDIA M60 GPUs, with GRID Virtual Workstation features and licenses; designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video encoding, and virtual reality applications
F1: FPGA instance – up to 8 Xilinx Virtex UltraScale+ VU9P FPGAs in a single instance, programmable via VHDL, Verilog, or OpenCL, with a growing marketplace of pre-built application accelerations; designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing
AWS Inferentia: high-performance machine learning inference chip, custom designed by AWS for lower cost-per-inference across the full range of ML applications
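Choosing among these families ultimately comes down to an instance-type string at launch time. A minimal sketch of building a launch request for a P3 instance; the AMI ID is a placeholder, and the commented-out call assumes the standard boto3 EC2 client:

```python
def p3_launch_params(ami_id, size="p3.2xlarge"):
    """Build the keyword arguments for an EC2 RunInstances call."""
    return {
        "ImageId": ami_id,      # e.g. an AWS Deep Learning AMI in your region
        "InstanceType": size,   # p3.2xlarge / p3.8xlarge / p3.16xlarge
        "MinCount": 1,
        "MaxCount": 1,
    }

params = p3_launch_params("ami-00000000000000000")
# With boto3 (not imported here), the actual launch would be:
#   ec2 = boto3.client("ec2")
#   ec2.run_instances(**params)
print(params["InstanceType"])
```

The same request shape works for G3 and F1: only the `InstanceType` string and the AMI (for example, an FPGA Developer AMI for F1) change.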
  • 10. Amazon EC2 P3 instances for compute acceleration
  • 11. Amazon EC2 P3 instances (October 2017) – one of the fastest, most powerful GPU instances in the cloud
Up to eight NVIDIA Tesla V100 GPUs
1 PetaFLOP of computational performance – up to 14x better than P2
300 GB/s GPU-to-GPU communication (NVLink) – 9x better than P2
16 GB of GPU memory with 900 GB/s peak GPU memory bandwidth
  • 12. Use cases for P3 instances
Machine learning/AI: natural language processing, image and video recognition, autonomous vehicle systems, recommendation systems
High performance computing: computational fluid dynamics, financial and data analytics, weather simulation, computational chemistry
  • 13. The machine learning process
  Business problem – ML problem framing → data collection → data integration → data preparation & cleaning → data visualization & analysis → feature engineering → model training & parameter tuning → model evaluation
  Are business goals met? If yes: model deployment, then monitoring & debugging – predictions. If no: iterate via data augmentation, feature augmentation, and re-training.
  • 14. Training machine learning models
  AlexNet, 2012
  • A large, deep convolutional neural network with five convolutional layers, 60 million parameters, and 650,000 neurons
  • Created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
  • Won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
  • Used two NVIDIA GTX 580 GPUs
  • Took nearly a week to train!
  Source - https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  • 15. AWS P3 vs. P2 instance GPU performance comparison
  • P2 instances use the K80 accelerator (Kepler architecture)
  • P3 instances use the V100 accelerator (Volta architecture)
  [Charts: FP32, FP64, and mixed/FP16 performance (TFLOPS) for K80, P100, and V100 – 1.7x FP32, 2.6x FP64, and 14x mixed-precision over the K80's max FP32 performance]
  • 16. P3 instance details
  Instance size | GPUs | GPU peer to peer | vCPUs | Memory (GB) | Network bandwidth | Amazon EBS bandwidth | On-Demand price/hr.* | 1-yr RI effective hourly* | 3-yr RI effective hourly*
  P3.2xlarge | 1 | No | 8 | 61 | Up to 10 Gbps | 1.7 Gbps | $3.06 | $1.99 (35% disc.) | $1.23 (60% disc.)
  P3.8xlarge | 4 | NVLink | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 (35% disc.) | $4.93 (60% disc.)
  P3.16xlarge | 8 | NVLink | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 (35% disc.) | $9.87 (60% disc.)
  Regional availability: P3 instances are generally available in AWS US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Seoul), Asia Pacific (Tokyo), AWS GovCloud (US), and China (Beijing) Regions
  Framework support: P3 instances and their V100 GPUs are supported across all major frameworks (such as TensorFlow, MXNet, PyTorch, Caffe2, and CNTK)
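The RI discount percentages quoted in the table can be checked directly against the On-Demand and effective hourly rates on the slide. A small sketch in Python (using only the prices from the table above):

```python
# Verify the Reserved Instance discounts implied by the P3 pricing table.
# All prices are the On-Demand and effective-hourly RI rates from the slide.
prices = {
    "p3.2xlarge":  {"od": 3.06,  "ri_1yr": 1.99,  "ri_3yr": 1.23},
    "p3.8xlarge":  {"od": 12.24, "ri_1yr": 7.96,  "ri_3yr": 4.93},
    "p3.16xlarge": {"od": 24.48, "ri_1yr": 15.91, "ri_3yr": 9.87},
}

def discount_pct(on_demand: float, effective: float) -> int:
    """Percentage discount of an effective hourly rate vs. On-Demand, rounded."""
    return round((1 - effective / on_demand) * 100)

for size, p in prices.items():
    print(size,
          f"1-yr RI: {discount_pct(p['od'], p['ri_1yr'])}% off,",
          f"3-yr RI: {discount_pct(p['od'], p['ri_3yr'])}% off")
```

Each size works out to roughly the 35% (1-yr) and 60% (3-yr) discounts stated on the slide.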
  • 17. P3 instance details (continued)
  • P3 instances provide GPU-to-GPU data transfer over NVLink
  • P2 instances provided GPU-to-GPU data transfer over PCI Express
  • 18. New larger P3 size – P3dn.24xlarge
  Optimized for distributed ML training
  • One of the most powerful GPU instances available in the cloud
  • 100 Gbps of networking throughput
  • 96 vCPUs using AWS custom Skylake CPUs and 768 GB of system memory
  • Based on NVIDIA's latest Tesla V100 GPU with 32 GB of memory
  Instance size | GPUs | GPU memory | GPU peer to peer | vCPUs | CPU type | Memory (GB) | Network bandwidth | Amazon EBS bandwidth | Local instance storage
  P3.2xlarge | 1 x V100 | 16 GB/GPU | No | 8 | Broadwell | 61 | Up to 10 Gbps | 1.7 Gbps | NA
  P3.8xlarge | 4 x V100 | 16 GB/GPU | NVLink | 32 | Broadwell | 244 | 10 Gbps | 7 Gbps | NA
  P3.16xlarge | 8 x V100 | 16 GB/GPU | NVLink | 64 | Broadwell | 488 | 25 Gbps | 14 Gbps | NA
  P3dn.24xlarge | 8 x V100 | 32 GB/GPU | NVLink | 96 | Skylake | 768 | 100 Gbps | 14 Gbps | 2 TB NVMe
  • Latest NVIDIA V100 GPU with 32 GB of memory for large models and higher batch sizes
  • 96 Skylake vCPUs with support for AVX-512 instructions for pre-processing of training data
  • 100 Gbps of networking throughput for large-scale distributed training & fast data access
  • 19. Scaling performance using distributed training
  [Chart: training throughput on P3 instances (ResNet-50, ImageNet, images/second), scaling from 1 to 64 GPUs]
  • Using a single P3 instance with Volta GPUs, customers can cut the training times of their machine learning models from days to a few hours.
  • Using distributed training across multiple P3 instances, with high performance networking and storage solutions, customers can further cut their time-to-train from hours to minutes.
  • Example – We have been able to train ResNet-50 to a Top-1 validation accuracy of 76% in 14 minutes using a cluster of P3.16xlarge instances.
  https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/
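A common way to read a chart like the one above is to compute scaling efficiency: measured throughput divided by perfect linear scaling from the single-GPU baseline. A minimal sketch, with purely hypothetical throughput figures (the real numbers depend on the model, framework, and network configuration):

```python
# Scaling efficiency of distributed training relative to linear scaling.
def scaling_efficiency(throughputs: dict) -> dict:
    """Map GPU count -> % of perfect linear scaling from the 1-GPU baseline.

    `throughputs` maps number of GPUs to measured images/second.
    """
    base = throughputs[1]  # single-GPU throughput is the baseline
    return {n: round(t / (base * n) * 100, 1) for n, t in throughputs.items()}

# Hypothetical ResNet-50 throughputs (images/second) -- illustrative only.
measured = {1: 800.0, 8: 6000.0, 64: 44000.0}
print(scaling_efficiency(measured))
```

Near-100% values at high GPU counts are what "near-linear scaling" means in practice; the high-bandwidth NVLink and 100 Gbps networking on P3/P3dn instances are what keep that number from collapsing as the cluster grows.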
  • 20. The broadest global availability
  Available AWS regions for P3 instances include: US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Singapore), China (Beijing), China (Ningxia), and AWS GovCloud (US)
  Available AWS regions for P3dn.24xlarge instances include: US East (N. Virginia) and US West (Oregon)
  • 21. AWS storage options
  Amazon S3 – Secure, durable, highly scalable object storage with fast access and low cost. For long-term durable storage of data in a readily accessible get/put access format. Primary durable and scalable storage for data.
  Amazon S3 Glacier – Secure, durable, long-term, highly cost-effective object storage. For long-term storage and archival of data that is infrequently accessed. Use for long-term, lower-cost archival of data.
  EC2+EBS – Create a Single-AZ shared file system using Amazon EC2 and Amazon EBS, with third-party or open source software (e.g., ZFS, Intel Lustre, etc.). For near-line storage of files optimized for high I/O performance. Use for high-IOPS, temporary working storage.
  Amazon EFS – Highly available, Multi-AZ, fully managed network-attached elastic file system. For near-line, highly available storage of files in a traditional NFS format (NFSv4). Use for read-often, temporary working storage.
  • 22. Amazon FSx for Lustre
  • High-performance file system optimized for fast processing of workloads such as machine learning, HPC, video processing, financial modeling, and electronic design automation
  • Launch and run a file system that provides submillisecond access to your data
  • Enables you to read and write data at speeds of up to hundreds of gigabytes per second of throughput and millions of IOPS
  Learn more at aws.amazon.com/fsx/lustre
  • 23. AWS Deep Learning AMI
  • Get started quickly with easy-to-launch tutorials
  • Hassle-free setup and configuration
  • Pay only for what you use – no additional charge for the AMI
  • Accelerate your model training and deployment
  • Support for popular deep learning frameworks
  • 24. Amazon SageMaker: Build, train, and deploy ML models at scale
  • 26. Amazon SageMaker: Build, train, and deploy ML models at scale (RL Coach)
  • 28. Amazon SageMaker: Build, train, and deploy ML models at scale (AWS IoT Greengrass, Amazon EC2 C5)
  • 30. Amazon EC2 G3 instances for graphics acceleration
  • 31. AWS G3 GPU instances
  • Up to four NVIDIA M60 GPUs
  • Includes GRID Virtual Workstation features and licenses; supports up to four monitors with 4096x2160 (4K) resolution
  • Includes NVIDIA GRID virtual application capabilities for application virtualization software like Citrix XenApp Essentials and VMware Horizon, supporting up to 25 concurrent users per GPU
  • Hardware encoding to support up to 10 H.265 (HEVC) 1080p30 streams and up to 18 H.264 1080p30 streams per GPU
  • Designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video encoding, and virtual reality applications
  Instance size | GPUs | vCPUs | Memory (GiB) | Linux price per hour (IAD) | Windows price per hour (IAD)
  g3s.xlarge | 1 | 4 | 30.5 | $0.75 | $0.93
  g3.4xlarge | 1 | 16 | 122 | $1.14 | $1.88
  g3.8xlarge | 2 | 32 | 244 | $2.28 | $3.75
  g3.16xlarge | 4 | 64 | 488 | $4.56 | $7.50
  • 32. Four modes of using G3 instances
  Example configuration: G3.4xlarge – 16 vCPUs, 1 x M60 GPU, 122 GB memory, up to 10G network
  1. EC2 instance with NVIDIA drivers & libraries – graphics rendering, simulations, video encoding
  2. EC2 instance with NVIDIA GRID virtual workstation – professional workstation (single user)
  3. EC2 instance with NVIDIA GRID virtual application – virtual apps (25 concurrent users)
  4. EC2 instance with NVIDIA GRID for gaming – gaming services
  • 33. G3 use cases
  • Seismic analysis, energy E&P, cloud GPU rendering & visualization, such as high end car configurators, AR/VR
  • Desktop and application virtualization
  • Productivity and consumer apps
  • Design and engineering
  • Media and entertainment post-production
  • Media and entertainment: video playout/broadcast, encoding/transcoding
  • Cloud gaming
  Examples: M&E – content creation; Auto – car configurators; E&P – analytics
  • 34. AWS G4 GPU instances
  • Designed for machine learning inference, video transcoding, remote graphics workstations, and other demanding graphics applications
  • Up to 8 NVIDIA T4 Tensor Core GPUs
  • 2560 CUDA cores and 320 Turing Tensor Cores per GPU, including support for ray-tracing technology
  • Available in multiple sizes
  • AWS-custom Intel CPUs (4–96 vCPUs)
  • Available soon
  • 35. Amazon EC2 F1 instances for custom hardware acceleration
  • 36. Parallel processing in FPGAs
  An FPGA is effective at processing data of many types in parallel: for example, creating a complex pipeline of parallel, multistage operations on a video stream, or performing massive numbers of dependent or independent calculations for a complex financial model.
  • An FPGA does not have an instruction set!
  • Data can be any bit-width (9-bit integer? No problem!)
  • Complex control logic (such as a state machine) is easy to implement in an FPGA
  Each FPGA in F1 has more than 2M programmable logic cells
  • 37. How FPGA acceleration works
  The FPGA handles compute-intensive, deeply pipelined, hardware-accelerated operations; the CPU handles the rest of the application.
  Illustrative Verilog fragment from the slide:
  module filter1 (clock, rst, strm_in, strm_out)
  integer i,j; //index for loops
  for (i=0; i<NUMUNITS; i=i+1)
  always@(posedge clock)
  tmp_kernel[j] = k[i*OFFSETX];
  • 38. F1 FPGA instance types on AWS
  • Up to 8 Xilinx UltraScale+ 16 nm VU9P FPGA devices in a single instance
  • The f1.16xlarge size provides:
    • 8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic cells and over 5,000 programmable DSP blocks
    • Each of the 8 FPGAs has 4 DDR-4 interfaces, with each interface accessing a 16 GiB, 72-bit wide, ECC-protected memory
  Instance size | FPGAs | FPGA memory (GB) | vCPUs | Instance memory (GB) | NVMe instance storage (GB) | Network bandwidth
  f1.2xlarge | 1 | 64 | 8 | 122 | 1 x 470 | Up to 10 Gbps
  f1.4xlarge | 2 | 128 | 16 | 244 | 1 x 940 | Up to 10 Gbps
  f1.16xlarge | 8 | 512 | 64 | 976 | 4 x 940 | 25 Gbps
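The FPGA memory column follows directly from the per-FPGA DDR-4 configuration described above (4 interfaces x 16 GiB per FPGA). A quick sketch to check the arithmetic against the table:

```python
# Each F1 FPGA has 4 DDR-4 interfaces, each attached to 16 GiB of memory,
# so per-FPGA memory is 4 * 16 = 64 GiB; instance totals scale with FPGA count.
DDR4_INTERFACES_PER_FPGA = 4
GIB_PER_INTERFACE = 16

def fpga_memory_gib(num_fpgas: int) -> int:
    """Total FPGA-attached DDR-4 memory for an F1 instance size."""
    return num_fpgas * DDR4_INTERFACES_PER_FPGA * GIB_PER_INTERFACE

for size, fpgas in {"f1.2xlarge": 1, "f1.4xlarge": 2, "f1.16xlarge": 8}.items():
    print(size, fpga_memory_gib(fpgas), "GiB")
```

This reproduces the 64/128/512 figures in the table above.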
  • 39. Three methods to use F1 instances
  1. Hardware engineers/developers
  • Developers who are comfortable programming FPGAs
  • Use the F1 Hardware Development Kit (HDK) to develop and deploy custom FPGA accelerations using Verilog and VHDL
  2. Software engineers/developers
  • Developers who are not proficient in FPGA design
  • Use OpenCL to create custom accelerations
  3. Software engineers/developers
  • Developers who are not proficient in FPGA design
  • Use pre-built, ready-to-use accelerations available in AWS Marketplace
  • 40. FPGA acceleration development
  Launch an EC2 F1 instance from an Amazon Machine Image (AMI) and load an Amazon FPGA Image (AFI); the application runs on the CPU while the FPGA (with PCIe, DDR controllers, and DDR-4 attached memory) runs the acceleration.
  • An F1 instance can have any number of AFIs
  • An AFI can be loaded into the FPGA in seconds
  • 41. Developing custom accelerations – the FPGA Developer AMI
  • Use Xilinx Vivado and a hardware description language (Verilog or VHDL for RTL) with the HDK to describe and simulate your FPGA logic
  • Xilinx Vivado for custom logic development
  • Virtual JTAG for interactive debugging
  • 42. OpenCL generally available for F1
  • Familiar development experience to accelerate C/C++ applications
  • 50+ F1 code examples available that span multiple domains: security, image processing, and accelerated algorithms
  • Already supported on the FPGA Developer AMI – no need to upgrade/install
  • 43. AWS Marketplace – Discover, procure, deploy, and manage software in the cloud
  • 44. Delivering FPGA partner solutions – Amazon EC2 FPGA deployment via AWS Marketplace
  Customers launch an Amazon Machine Image (AMI) whose application runs on the CPU alongside an Amazon FPGA Image (AFI).
  The AFI is secured, encrypted, and dynamically loaded into the FPGA – it can't be copied or downloaded.
  • 45. AWS Inferentia
  High-performance machine learning inference chip, custom designed by AWS
  • Making predictions using a trained machine learning model, a process called inference, can drive as much as 90% of the compute costs of the application.
  • AWS Inferentia is a machine learning inference chip designed to deliver high performance at low cost.
  • 46. Summary
  • Pick the right compute platform for accelerating your application
  • You have a choice of compute-optimized CPU platforms, GPU-accelerated platforms, or FPGA-accelerated platforms
  • We aspire to provide you with the broadest and deepest set of products and services to support your workload
  Recap:
  • High frequency instances with custom Intel Xeon Scalable processors running at a sustained 4 GHz all-core turbo; fastest networking in the cloud (up to 100 Gbps)
  • Compute-optimized instances with custom 3.0 GHz Intel Xeon Scalable Processors (Skylake) and support for Intel AVX-512 – great for ML inference
  • 47. EC2 accelerated computing instances (recap)
  P3: GPU compute instance
  • Up to 8 NVIDIA V100 GPUs in a single instance, with NVLink for peer-to-peer GPU communication
  • Supports a wide variety of use cases including deep learning, HPC simulations, financial computing, and batch rendering
  G3: GPU graphics instance
  • Up to 4 NVIDIA M60 GPUs, with GRID Virtual Workstation features and licenses
  • Designed for workloads such as 3D rendering, 3D visualizations, graphics-intensive remote workstations, video encoding, and virtual reality applications
  F1: FPGA instance
  • Up to 8 Xilinx Virtex UltraScale+ VU9P FPGAs in a single instance; programmable via VHDL, Verilog, or OpenCL, with a growing marketplace of pre-built application accelerations
  • Designed for hardware-accelerated applications including financial computing, genomics, accelerated search, and image processing
  AWS Inferentia: ML inference chip
  • High-performance machine learning inference chip, custom designed by AWS
  • Designed for lower cost-per-inference across the full range of ML applications
  • 48. Thank you! Chetan Kapoor, Principal Product Manager, Amazon EC2