Copyright 2020 ITRI 工業技術研究院
ITRI DLA Accelerating System
design, system, tools, and applications
工業技術研究院 Industrial Technology Research Institute (ITRI)
資訊與通訊研究所 Information and Communication Research Lab (ICL)
Copyright 2020 ITRI 工業技術研究院
CNN Models Advance Fast
2
Source: Alfredo Canziani, 2017
We need high accuracy with low computation.
There are many computer vision tasks and DNN models; classification is the most basic.
Different DNN models for the same classification task, based on the ImageNet database.
Copyright 2020 ITRI 工業技術研究院
Three Steps for a Highly Efficient Accelerator
3
1. Increase MAC PEs with high parallelism
2. Ensure the data supply to those PEs
3. Improve energy efficiency, adapting to the models
[Figure: concepts of steps 1~3, taking AlexNet as an example; throughput curves (throughput vs. computation power) given various DRAM BW. Convolution contains many independent MACs and overlapping data.]
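The throughput curves above follow the usual roofline reasoning. As a rough sketch (our formulation, not a formula taken from the slide), attainable MAC throughput is limited either by the compute peak or by DRAM bandwidth times the arithmetic intensity achieved through data reuse:

$$ T_{\text{attainable}} \;=\; \min\big(T_{\text{peak}},\; BW_{\text{DRAM}} \times I\big), \qquad I = \frac{\#\text{MACs}}{\text{bytes moved to/from DRAM}} $$

Step 1 raises $T_{\text{peak}}$, step 2 raises the effective $BW_{\text{DRAM}} \times I$ so the PEs stay fed, and step 3 tunes both to the model at hand.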
Copyright 2020 ITRI 工業技術研究院
FPS/Throughput of Various Models
-- profiled using a 256-MAC, 128 KB, INT8 DLA inference configuration
4
Copyright 2020 ITRI 工業技術研究院
C2C Ratio Preference in Classification Models
5
AlexNet prefers more memory bandwidth, due to its heavy-weight FC layers (heavy parameters in the last 3 FC layers).
Inception prefers more computation power, because of its many branches of CNN computations (concatenating small CNN layers), where the concat layer is memory-BW free.
ResNet prefers balanced memory bandwidth and computation power; the element-wise add of two activations needs bandwidth.
MobileNet prefers more memory bandwidth: depth-wise and point-wise CONV replace conventional CONV, reducing computation but increasing intermediate activations.
Copyright 2019 ITRI 工業技術研究院
Customization
6
Copyright 2020 ITRI 工業技術研究院
Customization Flow for an Accelerator
7
From NN analysis, through synthesis, to inference (NV-DLA based).
[Flow diagram. Inputs: user's AI framework & models, user's PPA SPEC. Analysis: model parser, coarse compile & PPA profiler, producing a candidate HW SPEC. Synthesis: HW assembler, producing the accelerator (HW) with its API and driver (FW). Inference: framework converter, 8-bit retrain framework, network compiler, HW library, and the APP calling the API.]
Copyright 2020 ITRI 工業技術研究院 8
DLA Architecture: Customizable and Configurable
1. Variable CONV MAC resources
• 64-MAC to 2048-MAC for the convolution processor
• Variable size of convolutional buffer
2. Configurable NN operator processors
• Options for batch normalization, PReLU, scale, bias, quantization, and element-wise operators
• Options for down-sample (e.g., pooling) operators
• Options for nonlinear LUTs
• Options for the user to add new processors
3. Custom memories and host CPUs
• Can be driven by the user's MCU or CPU
• Options for shared or private DRAM / SRAM / NVM
(a hypothetical build-time configuration sketch follows the block diagram below)
[Block diagram: the DLA IP contains a convolutional processor, element-wise processor, pool processor, nonlinear processor, an optional user's new processor, an interface unit (AXI) and a configuration unit (APB); it attaches through AXI/APB bridges, a bus, and a flow controller to a custom host system (CPU, DSP), high-speed IO, DRAM IF, custom SRAM IF, and peripherals. Option A: SoC integration; option B: board integration.]
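To make the customization options above concrete, a hypothetical build-time configuration record might look like the sketch below; the dla_config_t type and every field name are illustrative assumptions, not the actual DLA generator interface.

```c
/* Hypothetical build-time configuration record for the customizable DLA
 * described above; field names are illustrative, not the real generator
 * interface.                                                            */
#include <stdbool.h>
#include <stdint.h>

typedef enum { HOST_MCU, HOST_CPU } dla_host_t;

typedef struct {
    /* 1. Convolution MAC resources */
    uint16_t mac_count;           /* 64 .. 2048 MACs                     */
    uint32_t conv_buf_kb;         /* convolutional buffer size in KB     */

    /* 2. Optional NN operator processors */
    bool has_batch_norm, has_prelu, has_scale_bias;
    bool has_quantize, has_elementwise;
    bool has_pooling;             /* down-sample (pool) operators        */
    bool has_nonlinear_lut;       /* LUT-based nonlinear functions       */
    bool has_user_processor;      /* hook for a user-defined processor   */

    /* 3. Memories and host */
    bool shared_dram, private_sram, has_nvm;
    dla_host_t host;              /* driven by the user's MCU or CPU     */
} dla_config_t;

/* Example: a 256-MAC build with a 128 KB buffer, driven by a host CPU. */
static const dla_config_t cfg_256mac = {
    .mac_count = 256, .conv_buf_kb = 128,
    .has_batch_norm = true, .has_prelu = true, .has_scale_bias = true,
    .has_quantize = true, .has_elementwise = true, .has_pooling = true,
    .has_nonlinear_lut = true, .has_user_processor = false,
    .shared_dram = true, .private_sram = false, .has_nvm = false,
    .host = HOST_CPU,
};
```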
Copyright 2020 ITRI 工業技術研究院
DLA Reference SPECs
9
1. Atomic operation size (atomic C and K) of convolution
2. Convolutional buffer structure
3. Optional : nonlinear LUT, data reshape, weight decompression
Original NVDLA 64-MAC 256-MAC 512-MAC 1024-MAC 2048-MAC
Data type INT8 INT8 INT8 INT8 INT8
MAC for channel # 8 32 32 32 64
MAC for kernel # 8 8 16 32 32
Internal Buffer Size 128 KB 128 KB 512 KB 512 KB 512 KB
AXI (DBB) width 64 64 128 256 256
AXI (DBB) burst 1 1 4 4 4
CONV SRAM width X X X 256 256
CONV SRAM burst X X X 4 4
Status OK (64-MAC); not complete, bugs to generate (other configurations)
ITRI Version 64-MAC 256-MAC 512-MAC 1024-MAC 2048-MAC
AXI (DBB) burst up to 8 up to 8 up to 8 up to 8 up to 8
CONV SRAM width 64 64 128 256 256
CONV SRAM burst up to 8 up to 8 up to 8 up to 8 up to 8
Status OK OK OK TBA OK
Additional Functions Depth-wise convolution, Up-sampling
DEV Tools Bare-metal compiler, Performance profiler, Golden pattern generator & simulator
ITRI Version Improvements
Copyright 2020 ITRI 工業技術研究院 10
Features of DLA Hardware
[Figure: 3D CONV example with stride 1 and no padding, showing width/height, input (IN) planes, kernels, and the output (OUT); channel-first vs. plane-first ordering.]
1. Variable HW resources
• Search for an efficient resource allocation for the models
• Adaptive performance & power consumption
2. Suited for long-channel convolution
• Output-pixel-first, shared input, avoiding partial-sum storage (see the dataflow sketch below)
• Supports any kernel size (n x m) with the same data flow
3. Revision for depth-wise convolution
• Output-pixel-first, channel = 1 convolution
• Supports any kernel size (n x m) with the same data flow
4. Data reuse and hetero-layer fusion
• Input reuse or weight reuse, selected by setup
• Fuses popular layers [CONV(BN)-Quantize-PReLU-Pooling]
5. Program time hiding
• Configures layers N and N+1 simultaneously
• Hides the configuration time during layer changes
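A minimal C sketch of the output-pixel-first dataflow described above, assuming a plain NCHW/KCRS layout and illustrative names (this is reference code, not the DLA RTL): each output pixel is fully accumulated over all input channels and kernel taps before it is written, so no partial sums ever need to be stored.

```c
/* Minimal sketch of the "output-pixel-first" dataflow: each output pixel
 * is fully accumulated (all channels, all kernel taps) before moving on,
 * so no partial sums are ever written to memory. Layout and names are
 * illustrative assumptions.                                             */
#include <stdint.h>

/* in:  [C][H][W] INT8 activations, w: [K][C][R][S] INT8 weights,
 * out: [K][H-R+1][W-S+1] INT32 accumulators (stride 1, no pad).         */
void conv_output_pixel_first(const int8_t *in, const int8_t *w, int32_t *out,
                             int C, int H, int W, int K, int R, int S)
{
    int OH = H - R + 1, OW = W - S + 1;
    for (int k = 0; k < K; ++k)                 /* output channel (kernel)   */
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {   /* one output pixel at a time */
                int32_t acc = 0;
                for (int c = 0; c < C; ++c)     /* channel-first accumulation */
                    for (int r = 0; r < R; ++r)
                        for (int s = 0; s < S; ++s)
                            acc += in[(c * H + oy + r) * W + ox + s] *
                                   w[((k * C + c) * R + r) * S + s];
                out[(k * OH + oy) * OW + ox] = acc;   /* written exactly once */
            }
}
```

Because the accumulation order is channel-first per output pixel, the same loop nest serves any kernel size n x m.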
Copyright 2020 ITRI 工業技術研究院
Exclusive HW View of Depth-wise CONV
[Block diagram, NVDLA-style. Convolutional processor: Convolution DMA (CDMA), Convolution Buffer (CBUF), DW + original CSC, Convolution MAC Array (CMAC), DW + original CACC. Non-convolutional processors: DW + original SDP, Cross-Channel Data Processor (CDP), Planar Data Processor (PDP), RUBIK engine (RUBIK), BDMA. Infrastructure: Global Unit (GLB, interrupt/fault), CSB Master, CSB to APB, MCIF, CVIF, controller, DRAM (AXI), SRAM (AXI), APB.]
Fused depth-wise convolution engine
• DW data flow controller on CSC, CACC, SDP
• Fused DW-CONV with BN and PReLU
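A minimal C sketch of what the fused depth-wise engine above computes, assuming BN is folded into a per-channel integer scale/shift and PReLU into a rational negative slope; all names, layouts, and bit widths are illustrative assumptions, not the hardware interface.

```c
/* Minimal sketch of a fused DW-CONV + BN + PReLU, per output pixel,
 * channel = 1 per output channel (stride 1, no pad). Names, layouts,
 * and bit widths are illustrative assumptions.                       */
#include <stdint.h>

static inline int8_t clamp_i8(int64_t v)
{
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* in/out: [C][H][W], w: [C][R][S]; per-channel bias/scale + shift fold
 * BN and requantization; prelu_num/prelu_den give the negative slope. */
void dwconv_bn_prelu(const int8_t *in, const int8_t *w,
                     const int32_t *bias, const int32_t *scale, int shift,
                     int32_t prelu_num, int32_t prelu_den,
                     int8_t *out, int C, int H, int W, int R, int S)
{
    int OH = H - R + 1, OW = W - S + 1;
    for (int c = 0; c < C; ++c)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                int64_t acc = bias[c];
                for (int r = 0; r < R; ++r)          /* one channel only   */
                    for (int s = 0; s < S; ++s)
                        acc += in[(c * H + oy + r) * W + ox + s] *
                               w[(c * R + r) * S + s];
                acc = (acc * scale[c]) >> shift;     /* folded BN + requant */
                if (acc < 0)
                    acc = acc * prelu_num / prelu_den;   /* PReLU           */
                out[(c * OH + oy) * OW + ox] = clamp_i8(acc);
            }
}
```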
Copyright 2020 ITRI 工業技術研究院
12
NN-to-DLA Translator Flow and Verification
[Translator flow diagram: the model graph goes through model parse, layer fuse, layer partition, and HW-aware quantize insert to produce layer queue CFGs; the model weights go through direct quantize or re-train (TensorFlow) and weight convert & partition to produce quantized weights; both are emitted as API in C.]
Bare-metal Inference Example
1. Allocate free memory space
2. Capture an image
3. Call the coarse object detection API
4. Draw bounding boxes, capture each ROI
5. Call the detailed classification API
6. Post-processing, then loop back (see the C sketch below)
Libraries
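A minimal bare-metal sketch of the six-step loop above; every function name (dla_alloc, dla_detect_coarse, dla_classify, ...) is a hypothetical placeholder for the generated "API in C" calls, not the actual library interface.

```c
/* Minimal bare-metal sketch of the six-step inference loop. All extern
 * functions are hypothetical placeholders for the generated C API.     */
#include <stdint.h>
#include <stddef.h>

#define MAX_ROIS 16

typedef struct { int x, y, w, h; } roi_t;

extern void *dla_alloc(size_t bytes);                       /* step 1 */
extern void  camera_capture(uint8_t *frame);                /* step 2 */
extern int   dla_detect_coarse(const uint8_t *frame,
                               roi_t *rois, int max);       /* step 3 */
extern void  draw_box(uint8_t *frame, const roi_t *r);      /* step 4 */
extern int   dla_classify(const uint8_t *frame,
                          const roi_t *r);                  /* step 5 */
extern void  post_process_and_display(uint8_t *frame);      /* step 6 */

void inference_loop(void)
{
    uint8_t *frame = dla_alloc(640u * 480u * 3u);   /* 1. working memory */
    roi_t rois[MAX_ROIS];

    for (;;) {
        camera_capture(frame);                               /* 2 */
        int n = dla_detect_coarse(frame, rois, MAX_ROIS);    /* 3 */
        for (int i = 0; i < n; ++i) {
            draw_box(frame, &rois[i]);                       /* 4 */
            (void)dla_classify(frame, &rois[i]);             /* 5 */
        }
        post_process_and_display(frame);        /* 6. then loop back */
    }
}
```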
Copyright 2020 ITRI 工業技術研究院
HW Inference Queue Examples
13
ID name type
1 conv1 Convolution
2 bn1 BatchNorm
3 scale1 Scale
4 relu1 ReLU
5 pool1 Pooling
6 conv2 Convolution
7 bn2 BatchNorm
8 scale2 Scale
9 relu2 ReLU
10 pool2 Pooling
11 conv3 Convolution
12 bn3 BatchNorm
13 scale3 Scale
14 relu3 ReLU
15 pool3 Pooling
16 conv4 Convolution
17 bn4 BatchNorm
18 scale4 Scale
19 relu4 ReLU
20 pool4 Pooling
21 conv5 Convolution
22 bn5 BatchNorm
23 scale5 Scale
24 relu5 ReLU
25 pool5 Pooling
26 conv6 Convolution
27 bn6 BatchNorm
28 scale6 Scale
29 relu6 ReLU
30 pool6 Pooling
31 conv7 Convolution
32 bn7 BatchNorm
33 scale7 Scale
34 relu7 ReLU
35 conv8 Convolution
36 bn8 BatchNorm
37 scale8 Scale
38 relu8 ReLU
39 fc9 InnerProduct
Layer # range → Hybrid layer
0~13 Hybrid Layer 1
14~20 Hybrid Layer 2
21~23 Hybrid Layer 3
23~25 Hybrid Layer 4
26 Hybrid Layer 5
27 Hybrid Layer 6
28 Hybrid Layer 7
29 Hybrid Layer 8
30 FC9
Tiny YOLO v1
(39 DNN layers)
Tiny YOLO v1
HW Inference Queue
9 graph layers, 30 HW layers
* Detection done by host CPU
Conv1 + bn + scale + Relu + cvt(ReQ) #0 - #3
pool1(MaxPool) #4
Res2a_branch2a + bn + scale + relu + cvt(ReQ) #5 - #6
Res2a_branch2b + bn + scale + relu + cvt(ReQ) #7 - #9
Res2a_branch2c + bn + scale + cvt(ReQ) #10-#11
Res2a_branch1 + bn + scale+ReQ #12-#13
res2a(Eltwise) + relu + cvt(ReQ) #14
Res2b_branch2a + bn + scale + relu + cvt(ReQ) #15-#21
Res2b_branch2b + bn + scale + relu + cvt(ReQ) #22-#24
Res2b_branch2c + bn + scale + cvt(ReQ) #25-#26
ReQ_res2a(ReQ) + res2b(Eltwise) + relu + cvt(ReQ) #27
Res2c_branch2a + bn + scale + relu + cvt(ReQ) #28-#34
Res2c_branch2b + bn + scale + relu + cvt(ReQ) #35-#37
Res2c_branch2c + bn + scale + cvt(ReQ) #38
res2c_branch1_maxPool(MaxPool) #39
res2c_branch1_ReQ(ReQ) + res2c(Eltwise) + relu + cvt(ReQ) #40
Res3a_branch2a + bn + scale + relu + cvt(ReQ) #41-#43
Res3a_branch2b + bn + scale + relu + cvt(ReQ) #44
Res3a_branch2c + bn + scale + cvt(ReQ) #45
Res3a_branch1 + bn + scale + cvt(ReQ) #46-#47
res3a(Eltwise) + relu + cvt(ReQ) #48
Res3b_branch2a + bn + scale + relu + cvt(ReQ) #49-#55
Res3b_branch2b + bn + scale + relu + cvt(ReQ) #56
Res3b_branch2c + bn + scale + cvt(ReQ) #57
ReQ_res3a(ReQ) + res3b(Eltwise) + relu + cvt(ReQ) #58
Res3c_branch2a + bn + scale + relu + cvt(ReQ) #59-#65
Res3c_branch2b + bn + scale + relu + cvt(ReQ) #66
Res3c_branch2c + bn + scale + cvt(ReQ) #67
ReQ_res3b(ReQ) + res3c(Eltwise) + relu + cvt(ReQ) #68
Res3d_branch2a + bn + scale + relu + cvt(ReQ) #69-#75
Res3d_branch2b + bn + scale + relu + cvt(ReQ) #76
Res3d_branch2c + bn + scale + cvt(ReQ) #77
res3c_branch1_maxPool(MaxPool) #78
res3c_branch1_ReQ(ReQ) + res3d(Eltwise) + relu + cvt(ReQ) #79
Res4a_branch2a + bn + scale + relu + cvt(ReQ) #80
Res4a_branch2b + bn + scale + relu + cvt(ReQ) #81
Res4a_branch2c + bn + scale + cvt(ReQ) #82
Res4a_branch1 + bn + scale + cvt(ReQ) #83
res4a(Eltwise) + relu + cvt(ReQ) #84
Res4b_branch2a + bn + scale + relu + cvt(ReQ) #85-#86
Res4b_branch2b + bn + scale + relu + cvt(ReQ) #87
Res4b_branch2c + bn + scale + cvt(ReQ) #88
ReQ_res4a(ReQ) + res4b(Eltwise) + relu + cvt(ReQ) #89
Res4c_branch2a + bn + scale + relu + cvt(ReQ) #90-#91
Res4c_branch2b + bn + scale + relu + cvt(ReQ) #92
Res4c_branch2c + bn + scale + cvt(ReQ) #93
ReQ_res4b(ReQ) + res4c(Eltwise) + relu + cvt(ReQ) #94
Res4d_branch2a + bn + scale + relu + cvt(ReQ) #95-#96
Res4d_branch2b + bn + scale + relu + cvt(ReQ) #97
Res4d_branch2c + bn + scale + cvt(ReQ) #98
ReQ_res4c(ReQ) + res4d(Eltwise) + relu + cvt(ReQ) #99
Res4e_branch2a + bn + scale + relu + cvt(ReQ) #100-#101
Res4e_branch2b + bn + scale + relu + cvt(ReQ) #102
Res4e_branch2c + bn + scale + cvt(ReQ) #103
ReQ_res4d(ReQ) + res4e(Eltwise) + relu + cvt(ReQ) #104
Res4f_branch2a + bn + scale + relu + cvt(ReQ) #105-#106
Res4f_branch2b + bn + scale + relu + cvt(ReQ) #107
Res4f_branch2c + bn + scale + cvt(ReQ) #108
res4e_branch1_maxPool(MaxPool) #109
res4e_branch1_ReQ(ReQ) + res4f(Eltwise) + relu + cvt(ReQ) #110
Res5a_branch2a + bn + scale + relu + cvt(ReQ) #111
Res5a_branch2b + bn + scale + relu + cvt(ReQ) #112
Res5a_branch2c + bn + scale + cvt(ReQ) #113
Res5a_branch1 + bn + scale + ReQ #114
res5a(Eltwise) + relu + cvt(ReQ) #115
Res5b_branch2a + bn + scale + relu + cvt(ReQ) #116
Res5b_branch2b + bn + scale + relu + cvt(ReQ) #117
Res5b_branch2c + bn + scale + cvt(ReQ) #118
ReQ_res5a(ReQ) + res5b(Eltwise) + relu + cvt(ReQ) #119
Res5c_branch2a + bn + scale + relu + cvt(ReQ) #120
Res5c_branch2b + bn + scale + relu + cvt(ReQ) #121
Res5c_branch2c + bn + scale + cvt(ReQ) #122
ReQ_res5b(ReQ) + res5c(Eltwise) + relu + cvt(ReQ) + pool5 #123
fc1000 + fc_bias + cvt(ReQ) #124
Resnet50
HW Inference
Queue
74 graph layers,
125 HW layers
Copyright 2020 ITRI 工業技術研究院
Quantize and Rescale in Classical NNs
14
mAP measured on the VOC-2007 dataset; ACC: Top-1 accuracy on ImageNet
VGG-like (Tiny YOLO), mAP INT8 / FP: v1: 38.08 / 40.86*, v2: 48.03 / 49.92*
Element-Add (Resnet-v1-50), ACC INT8 / FP: 67.59 / 68.96+
Concat (Inception-v3), ACC INT8 / FP: 75.64 / 78.91+
Depth-wise (MobileNet-v1), ACC INT8 / FP: 61.02 / 60.93+
[Figure: quantized network graphs from input to output, with requantize (ReQ) nodes inserted around each operator, e.g. element-wise add (Add_E), the object detection layer, depth-wise conv (DW-C), point-wise conv (PW-C), and concat.]
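Each ReQ node performs an INT8 requantization. As a rough illustration (a standard per-tensor rescaling, not necessarily the exact DLA arithmetic), the INT32 accumulator is rescaled by the input, weight, and output scales and clipped back to INT8:

$$ y_{\mathrm{int8}} \;=\; \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{s_x\, s_w}{s_y}\,\mathrm{acc}_{\mathrm{int32}}\right),\,-128,\,127\right) $$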
Copyright 2020 ITRI 工業技術研究院
Support Operators
15
Operator (Applied Network)
Convolution
• Standard kernels R x S, can be dilated: all CNNs
• Depthwise (ITRI exclusive): Xception, MobileNet
• Pointwise, kernel = 1*1: Inception, ShuffleNet, MobileNet
Normalization
• Local Response Norm. (LRN): AlexNet, Inception
• Batch Norm.: ResNet, DenseNet, MobileNet, YOLO
Activation
• Tanh, Sigmoid: RNN-LSTM
• ReLU, PReLU: AlexNet, VGG, Inception, YOLO
Pooling / Up-sampling
• Max Pooling, kernel = 2*2, 3*3: all CNNs
• Avg Pooling, Global Avg Pooling: ResNet, DenseNet, NiN, SSD
• Up-sample: Segmentation / SuperRes / YOLOv3
Elementwise
• Concat, Split, Slice: Inception, DenseNet, SSD
• Add, Scale: ResNet, RNN-LSTM
Fully Connected
• Tensor Mult., Add: all CNNs, RNN-LSTM
Copyright 2019 ITRI 工業技術研究院
Simulation & Visualization
16
Copyright 2020 ITRI 工業技術研究院
NN-to-DLA Model Translation Tools
for Profiling and Bare-metal Compilation
Netron supports ONNX (.onnx, .pb, .pbtxt), Keras (.h5, .keras), Core ML (.mlmodel), Caffe (.caffemodel, .prototxt), Caffe2 (predict_net.pb, predict_net.pbtxt), MXNet (.model, -symbol.json), TorchScript (.pt, .pth), NCNN (.param), and TensorFlow Lite (.tflite).
Intermediate Format
Caffe-based, considering
• asymmetric padding
• quantized layers
[Tool-chain diagram: NN models are compiled/translated into an NN graph and real parameters; a pattern generator, parameter formatter, and HW config generator feed, through a MUX, the DLA system and the GUI profiler.]
17
Copyright 2020 ITRI 工業技術研究院
Integrated Netron Executable Version
18
DNN model
DLA configuration
https://github.com/SCLUO/Open-DLA-Performance-Profiler
1. MAC Utilization: average MAC utilization under the aggressive FPS
2. Roofline Factor: the ratio of memory access cycles to total cycles
3. Conservative FPS: assumes memory access and computation are fully overlapped
4. Aggressive FPS: assumes memory access and computation are fully interleaved
(a cycle-model reading of the two FPS bounds is sketched below)
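As a minimal cycle-model sketch of how such bounds can be derived from per-layer memory and computation cycles (our own formulation; the exact definitions used by the Open-DLA-Performance-Profiler may differ), a fully serialized estimate and a fully overlapped estimate bracket the reported conservative and aggressive FPS, and the roofline factor follows directly:

$$ \mathrm{FPS}_{\text{serialized}} = \frac{f_{clk}}{\sum_{l}\big(C^{\text{mem}}_{l} + C^{\text{comp}}_{l}\big)}, \qquad \mathrm{FPS}_{\text{overlapped}} = \frac{f_{clk}}{\sum_{l}\max\big(C^{\text{mem}}_{l},\, C^{\text{comp}}_{l}\big)} $$

$$ \text{Roofline Factor} = \frac{\sum_{l} C^{\text{mem}}_{l}}{\text{total cycles}} $$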
Copyright 2020 ITRI 工業技術研究院
Equation-based Profiler
Columns per row: Network (input size); GMAC; MAC Util. and Est. FPS at peak 1 GBps DRAM BW; MAC Util. and Est. FPS at peak 1 GBps DRAM + 9.6 GBps SRAM for ACTs; SRAM size (MB); MAC Util. and Est. FPS at peak 2 GBps DRAM.
Alexnet(224) 0.73 12% 12.3 12% 12.4 0.6 18% 19.1
InceptionResnetV2(224) 9.13 77% 6.5 79% 6.6 2.0 91% 7.7
InceptionV1(224) 1.73 49% 22.0 56% 25.0 1.1 56% 24.7
InceptionV2(231) 2.25 66% 22.5 78% 26.8 1.1 77% 26.3
InceptionV3(299) 5.75 75% 10.0 86% 11.5 2.0 88% 11.7
InceptionV4(299) 12.47 88% 5.4 96% 5.9 2.0 95% 5.8
MobileNetV1(224) 0.54 45% 60.4 62% 83.1 1.1 63% 84.5
MobileNetV2(224) 0.43 25% 44.9 48% 85.0 1.4 44% 78.5
MobileV1-SSD(416) 2.13 50% 18.2 69% 24.8 4.0 67% 24.1
MobileV2-SSD(416) 1.13 26% 17.3 45% 30.6 5.0 42% 28.5
Resnet50(224) 3.86 57% 11.3 66% 13.1 1.9 71% 14.1
TinyYOLOv1(448) 1.61 38% 18.1 38% 18.2 1.1 48% 22.9
TinyYOLOv2(416) 3.28 64% 15.1 65% 15.2 1.0 78% 18.3
TinyYOLOv3(416) 2.79 74% 20.4 75% 20.7 1.0 75% 20.6
example: 256 MAC & 128 KB @ 300MHz with
1GBps DRAM / 1GBps DRAM+SRAM for ACTs / 2GBps DRAM
19
Copyright 2020 ITRI 工業技術研究院
Reduction of Activation (feature map)
Columns per row: Network (input size); GMAC; MAC Util. and Est. FPS at peak 2 GBps DRAM BW; total ACT traffic per frame (MB), Original vs. DLA; Weight (MB, same for both); average DRAM BW = (ACT + Weight) x FPS in MB/s, Original vs. DLA.
Alexnet(224) 0.73 18% 19.1 4.1 2.8 60 1224 1199
InceptionResnetV2(224) 9.13 91% 7.7 189 71 30 1686 778
InceptionV1(224) 1.73 56% 24.7 20 12 6.9 664 467
InceptionV2(231) 2.25 77% 26.3 36 18 11 1236 763
InceptionV3(299) 5.75 88% 11.7 83 35 23 1240 679
InceptionV4(299) 12.47 95% 5.8 145 58 41 1079 574
MobileNetV1(224) 0.54 63% 84.5 40 10 4.1 3726 1191
MobileNetV2(224) 0.43 44% 78.5 70 18 3.3 5754 1672
MobileV1-SSD(416) 2.13 67% 24.1 140 42 5.6 3509 1147
MobileV2-SSD(416) 1.13 42% 28.5 102 60 3.0 2993 1796
Resnet50(224) 3.86 71% 14.1 93 32 25 1664 804
TinyYOLOv1(448) 1.61 48% 22.9 54 4.8 26 1832 705
TinyYOLOv2(416) 3.28 78% 18.3 48 4.7 16 1171 379
TinyYOLOv3(416) 2.79 75% 20.6 52 6.4 8.6 1248 309
Example: 256 MAC & 128 KB @ 300 MHz
Note the reduction in ACT traffic and in average DRAM BW (a worked check follows below).
20
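Applying the table's own formula to the ResNet50 row (DLA column) as a sanity check:

$$ (32\ \mathrm{MB} + 25\ \mathrm{MB}) \times 14.1\ \mathrm{fps} \approx 804\ \mathrm{MB/s} $$

which matches the listed average DRAM bandwidth.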
Copyright 2020 ITRI 工業技術研究院
Comparison between RTL Simulation and the Equation-based Profiler
21
@400MHz, 0.5 GB/s DRAM model
Tiny YOLO v1 for DLA_64
• RTL = 46.6 M cycles
• Profiler = 41 M cycles
Tiny YOLO v1 for DLA_256
• RTL = 23.9M cycles
• Profiler = 28.4 M cycles
Tiny YOLO v1 for DLA_2048
• RTL = 7.3 M cycles
• Profiler = 8.0 M cycles
Tiny YOLO v3 for DLA_256
• RTL = 17.0 M cycles
• Profiler = 17.1 M cycles
@400MHz, 0.5 GB/s DRAM model
Resnet50 for DLA_64
• RTL = 74M cycles
• Profiler = 74M cycles
Resnet50 for DLA_256
• RTL = 41.9M cycles
• Profiler = 50.1 M cycles
MobileNet_v1 for DLA_256
• RTL = 9.2M cycles
• Profiler = 11.1 M cycles
Inception_v3 for DLA_256
• RTL = 55.4 M cycles
• Profiler = 54.1M cycles
Copyright 2019 ITRI 工業技術研究院
Reference System Design
22
Demos
https://sites.google.com/view/itri-icl-dla/demonstrations
Copyright 2020 ITRI 工業技術研究院 23
Implementation of USB Accelerator
Host Linux machine
1. Load RISC-V INIT, NN CFGs, and weights
2. Capture an image + preprocessing
3. Call object detection (YOLO): Start
4. Return output; send the next image
5. Draw bounding boxes, display
(see the host-side sketch below)
[Host-device handshake: INIT, Ready, Send Image + Start, Done, Read output, Ready for next image + Start.]
[Address map of the USB stick (from 0x0): RV INIT + NN CFGs, image, weights, swap, and output regions.]
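A host-side sketch of the handshake above. The transport calls (usb_write, usb_read, wait_status, send_start), status codes, and address-map offsets are hypothetical placeholders, not the real driver API.

```c
/* Host-side sketch of the USB accelerator handshake. All externs and
 * offsets are hypothetical placeholders.                              */
#include <stdint.h>
#include <stddef.h>

enum { ST_READY = 1, ST_DONE = 2 };

extern void usb_write(uint32_t addr, const void *buf, size_t len);
extern void usb_read(uint32_t addr, void *buf, size_t len);
extern int  wait_status(int expected);          /* blocks until status  */
extern void send_start(void);

void run_detection(const uint8_t *rv_init, size_t init_len,
                   const uint8_t *cfgs, size_t cfg_len,
                   const uint8_t *weights, size_t w_len,
                   const uint8_t *image, size_t img_len,
                   uint8_t *result, size_t out_len)
{
    /* Assumed offsets into the stick's address map (illustrative). */
    const uint32_t OFF_INIT = 0x0, OFF_CFG = 0x10000,
                   OFF_IMG = 0x100000, OFF_WGT = 0x200000,
                   OFF_OUT = 0x800000;

    usb_write(OFF_INIT, rv_init, init_len);      /* 1. RV INIT, NN CFGs */
    usb_write(OFF_CFG, cfgs, cfg_len);
    usb_write(OFF_WGT, weights, w_len);          /*    and weights      */
    wait_status(ST_READY);                       /* device: Ready       */

    usb_write(OFF_IMG, image, img_len);          /* Send Image + Start  */
    send_start();
    wait_status(ST_DONE);                        /* device: Done        */
    usb_read(OFF_OUT, result, out_len);          /* Read output, then   */
                                                 /* ready for next image */
}
```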
Copyright 2020 ITRI 工業技術研究院
USB Acceleration System
• RV32-IM RISC-V & 64-MAC DLA on CESYS EFM-03 (Xilinx Artix-7) @ 100 MHz, achieving 3 inferences per second (3 fps) on Tiny YOLO v1
• RV32-IM RISC-V & 256-MAC DLA on Xilinx ZCU102 @ 150 MHz, achieving 9 inferences per second (9 fps) on Tiny YOLO v1
• RV32-IM RISC-V & 2048-MAC DLA on Xilinx VCU118 @ 150 MHz, achieving 21 inferences per second (21 fps) on Tiny YOLO v1
[Photos: Linux mini PC with the USB FPGA accelerator and a USB live camera; screen of the Linux mini PC; test figures from a Windows PC; USB accelerator FPGA prototype (VCU118) with a USB interface to a notebook host.]
24
Copyright 2020 ITRI 工業技術研究院
ASIC Implementation
25
Layout View
• RV32-IM RISC-V & 64-MAC DLA
• Clock network optimization
• Register reduction
• Data path pipeline retiming
• Coarse & fine-grained clock gating
[Block view: USB GPIF, DRAM IF, RISC-V with cache, DLA, AXI, APB, peripherals, PLL. Photo: SoC EVB with input/output connections.]
Demo video
https://www.youtube.com/watch?v=qKF82386Wf4
Copyright 2020 ITRI 工業技術研究院
ZCU102 FPGA Object Detection Setup
26
[System diagram: ARM CPU (processing system) with DP/USB and a DRAM controller, DLA in the FPGA fabric; DRAM (1 GB) split into OS-controlled space and a region reserved for the DLA (~64 MB) holding the input image, model weights, temp activations, and output data.]
Flow: Program INIT → set parameters → load weights → image capture (YUV) → re-format to RGB → activate DLA → DLA finished → post-processing → display.
Copyright 2020 ITRI 工業技術研究院
Standalone ZCU102 FPGA Demonstration
27
Tiny YOLO v3, object detection: DLA256 @ 200 MHz, 12 fps DNN only, 9 fps including mp4 decoding
MobileNet v1, classification: DLA256 @ 200 MHz, 32 fps DNN only, 27 fps including image resize
Copyright 2019 ITRI 工業技術研究院
Summary
28
Copyright 2020 ITRI 工業技術研究院 29
Features of ITRI’s Solutions
• Support from profiling to implementation
▪ Profiler, NN-to-DLA translator, SoC/FPGA references
• Support complete inference on RTL simulation
▪ Accurate, straightforward for conventional IC design
• Support of various DLA SPECs, from 64 to 2048 MAC cores
▪ Successful ASIC and FPGA implementation references
▪ Exclusive operator support (DW CONV, up-sample)
• Collaboration with compiler and software partner Skymizer
• Complete HW-aware integer training flow
▪ Transparent model compression and quantization
Copyright 2020 ITRI 工業技術研究院
Our Services
Design Reference / License
DLA series with verification tool kits
Exclusive architecture of NN operator (DW-CONV, up-sample…)
Design Consultant / Service
System performance analysis and consulting
Customization of efficient & exclusive HW
HW-aware model compression
Design & Application Service
DNN model profiling, analysis, and NN-to-DLA translation
HW-aware quantization and re-training
30
Copyright 2019 ITRI 工業技術研究院
THANK YOU!
QUESTIONS AND COMMENTS?
Introduction
https://sites.google.com/view/itri-icl-dla/
31
ITRI-OpenDLA
https://github.com/SCLUO/ITRI-OpenDLA
DLA Perf Profiler
https://github.com/SCLUO/Open-DLA-Performance-Profiler