White Paper: Benefits of PCIe NVMe SSDs for Client Workloads
Benchmarking Performance Against Real-World Workloads
With the release of the 950 PRO, Samsung is taking client storage to a new level by switching from the Serial ATA (SATA) interface to Peripheral Component Interconnect Express (PCIe) and adopting the Non-Volatile Memory Express (NVMe) protocol, which was designed specifically for Solid State Drives (SSDs). The drive's faster interface and lower-latency protocol make the V-NAND-equipped 950 PRO the biggest advancement in the client SSD space since the release of the first client-oriented SSDs more than five years ago.
This whitepaper discusses the benefits that PCIe NVMe SSDs, such as the 950 PRO, bring to client PC users. Client PC workloads are not always well understood in the industry, since common benchmarking utilities tend to focus on measuring maximum performance rather than performance under typical PC usage. More specifically, benchmarking utilities often use very high queue depths to produce high performance numbers, whereas in the real world most IO activity is low in queue depth. This whitepaper provides actual IO traces of PC workloads to better understand how client SSDs should be benchmarked, and also tests the 950 PRO against other Samsung SSDs to show how PCIe and NVMe improve IO performance in tests that represent real-world IO activity.
INTRODUCTION: Benefits of PCIe NVMe SSDs

What Are PCIe & NVMe?

SATA and PCIe are both electrical interfaces used to transfer data between an SSD and the rest of the system. Traditionally, storage devices have used the SATA interface, which connects to the CPU through the Platform Controller Hub (PCH). However, due to the limits of the SATA interface, the SSD industry has shifted towards using the PCIe interface. PCIe offers substantially more bandwidth than the SATA interface, and since PCIe SSDs can connect directly to the CPU, they provide lower latency than SATA SSDs.

In addition to the electrical interface, the operating system and applications also need a software interface to interact with a storage device. For the past decade, SSDs and HDDs have utilized the Advanced Host Controller Interface (AHCI), which became a bottleneck for SSDs since it was originally designed for SATA and HDDs. SSDs utilize NAND flash memory rather than rotating platters, so SSDs are inherently capable of much higher transfer speeds and lower latencies. Without an optimized software interface, though, SSDs cannot reach their full potential.

[Diagram: the SSD connects directly to the CPU over PCIe 3.0 x4 (32Gbps), while SATA 3.0 (6Gbps) devices connect through the PCH, which links to the CPU over DMI.]
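As a rough sanity check on the interface numbers above, the raw link rates can be converted into usable payload bandwidth once line-encoding overhead is accounted for (PCIe 3.0 uses 128b/130b encoding at 8 GT/s per lane; SATA 3.0 uses 8b/10b encoding at 6 Gbps). A minimal sketch, not tied to any particular drive:

```python
def pcie3_bandwidth_mb_s(lanes: int) -> float:
    """Theoretical PCIe 3.0 payload bandwidth in MB/s.

    Each PCIe 3.0 lane runs at 8 GT/s with 128b/130b line encoding,
    so 128 of every 130 bits carry payload.
    """
    per_lane_bytes = 8e9 * 128 / 130 / 8  # payload bytes per second per lane
    return lanes * per_lane_bytes / 1e6

def sata3_bandwidth_mb_s() -> float:
    """Theoretical SATA 3.0 payload bandwidth in MB/s.

    SATA 3.0 signals at 6 Gbps with 8b/10b line encoding,
    so 8 of every 10 bits carry payload.
    """
    return 6e9 * 8 / 10 / 8 / 1e6

print(round(pcie3_bandwidth_mb_s(4)))  # ~3938 MB/s usable from a 32Gbps PCIe 3.0 x4 link
print(round(sata3_bandwidth_mb_s()))   # 600 MB/s usable from a 6Gbps SATA 3.0 link
```

This is why a PCIe 3.0 x4 SSD has roughly 6.5 times the usable bandwidth of a SATA drive, even though 32Gbps is only about 5.3 times 6Gbps.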
The NVMe Interface: NVMe is a new software interface that replaces AHCI and was built from the ground up for SSDs and NAND flash. It utilizes a simplified, low-latency stack between the application and the SSD, which reduces IO overhead by nearly 70%. With less overhead, NVMe SSDs are able to provide higher performance and better power efficiency than AHCI-based SSDs. Furthermore, NVMe includes a vastly improved queueing system with support for thousands of queues, each supporting up to 65,536 outstanding commands. In comparison, AHCI only supports one queue with up to 32 outstanding commands.

[Diagram: the Linux NVMe stack (user app, file system, block driver, device driver, plus OS scheduling and context switching) has roughly 3x less overhead than the Linux AHCI stack, which adds a SCSI/SATA translation layer.]
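The gap between the two queueing models is easy to quantify. The sketch below multiplies queues by per-queue depth; the 65,535-queue figure is the maximum allowed by the NVMe specification (the paper's "thousands of queues"), while AHCI's single 32-command queue is fixed by its design:

```python
# Outstanding-command capacity of each software interface.
AHCI_QUEUES, AHCI_QUEUE_DEPTH = 1, 32            # AHCI: one queue of 32 commands
NVME_QUEUES, NVME_QUEUE_DEPTH = 65_535, 65_536   # NVMe spec maximums

ahci_commands = AHCI_QUEUES * AHCI_QUEUE_DEPTH
nvme_commands = NVME_QUEUES * NVME_QUEUE_DEPTH

print(f"AHCI: {ahci_commands} outstanding commands")
print(f"NVMe: {nvme_commands:,} outstanding commands")  # over four billion
```

Client workloads never come close to either ceiling, but the deeper queueing lets NVMe spread work across CPU cores without lock contention, which is part of where its lower overhead comes from.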
Defining a Client Workload

When benchmarking an SSD, one of the first and most critical steps is to understand the workload intended for the product. Without an understanding of the workload, tests may measure metrics that are irrelevant to the intended use case, resulting in inaccurate conclusions about the product. Most of the commercially available, easy-to-use SSD benchmarking tools, such as CrystalDiskMark and AS SSD, are primarily focused on measuring maximum performance. Maximum transfer rates can be relevant in tasks like large file transfers, but they don't illustrate performance under typical PC usage.

The best way to investigate and understand PC workloads is to trace IO activity for a period of time and then perform statistical analysis on the collected data. In Windows, IO tracing can be done using Xperf, which is included free with the Windows Performance Toolkit. AnandTech has extensively studied client PC workloads and built three traces to illustrate different workloads, ranging from a power user to very basic light usage. The details of all three traces are publicly available and provide great insight into the IO activity of typical client PC workloads.

• The Destroyer is the most intensive workload and includes tasks such as virtualization and application development, along with more general gaming and photo editing usage. It best describes a power user workload.

• The Heavy workload is a more typical enthusiast workload consisting of gaming, photo editing and content creation in Dreamweaver. It also includes general productivity tasks such as web browsing, email management, application installing and virus scanning.

• The Light workload illustrates basic PC usage and is focused on general productivity tasks like web browsing, email management and application installing.
The first variable that needs to be understood before benchmarking an SSD is the IO size. AnandTech's IO traces show a clear pattern: most of the IOs are concentrated at IO sizes of 4KB and 64-128KB, regardless of workload intensity. This is actually in line with what most benchmarking applications measure, since benchmarks usually consist of 4KB random read/write tests and sequential tests with a large (64KB or greater) IO size.

The second variable is queue depth, meaning the number of outstanding IOs. What AnandTech's traces show is that the majority of IOs happen at a queue depth of one, with 75-90% of IOs happening at a queue depth of three or below. There are some differences between the workloads. For instance, the more IO-intensive "The Destroyer" and "Heavy" workloads have a higher average queue depth, but even in the heavier PC workloads only a small portion of total IOs are high queue depth, and only a fraction are above a queue depth of 32.

Benchmarking applications tend to use high queue depths to produce better performance numbers, but as AnandTech's real-world IO trace data shows, high queue depths do not illustrate a typical PC workload. In enterprise workloads, queue depths are often high because dozens of people may access one drive at the same time, whereas in PC environments, drive access is limited to a single user, thus lowering the queue depths.
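The bucketing used in the queue-depth breakdown charts can be reproduced in a few lines. The trace data itself is not reproduced here; the `sample` list below is a hypothetical stand-in chosen only to mirror the low-queue-depth skew described above:

```python
from collections import Counter

def queue_depth_histogram(queue_depths):
    """Bucket per-IO queue depths into the chart's ranges and return
    each bucket's share of total IOs as a percentage."""
    buckets = [(1, "1"), (2, "2"), (3, "3"), (5, "4-5"),
               (10, "6-10"), (20, "11-20"), (32, "21-32")]
    counts = Counter()
    for qd in queue_depths:
        for upper, label in buckets:
            if qd <= upper:
                counts[label] += 1
                break
        else:  # no bucket matched, so the IO was deeper than QD 32
            counts[">32"] += 1
    total = len(queue_depths)
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical trace: 8 of 10 IOs at QD 3 or below, matching the
# 75-90% low-queue-depth share observed in the real traces.
sample = [1, 1, 1, 1, 1, 2, 2, 3, 4, 40]
print(queue_depth_histogram(sample))
```

The same bucketing applied to a real Xperf trace would reproduce the percentage breakdowns charted below.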
UNDERSTANDING SSD BENCHMARKING VARIABLES

[Chart: AnandTech Storage Bench - Queue Depth Breakdown. Percentage of total IOs by queue depth (1, 2, 3, 4-5, 6-10, 11-20, 21-32, >32) for The Destroyer, Heavy and Light workloads.]

[Chart: AnandTech Storage Bench - IO Size Breakdown. Percentage of total IOs by IO size (<4KB, 4KB, 8KB, 16KB, 32KB, 64KB, 128KB) for The Destroyer, Heavy and Light workloads.]
Test System

Hardware
Motherboard: AsRock Z170 Extreme7+
Chipset: Intel Z170
Processor: Intel Core i5-6600K
Graphics: Intel HD Graphics 530
Memory: 16GB (2x8GB) DDR4-2400
Boot Drive: Samsung 850 PRO 1TB

Software
Operating System: Windows 10 Pro x64
Test tool: Iometer 1.1.0
NVMe Driver (950 PRO): Samsung NVMe Driver 1.0
AHCI Driver: Intel Rapid Storage Technology 14.6.0.1029
Chipset Driver: 10.0.27
Graphics Driver: 15.40.7.4279

Drives
950 PRO: PCIe 3.0 x4 (32Gbps) electrical interface, NVMe software interface, 128Gbit 32-layer MLC V-NAND
850 PRO: SATA 3.0 (6Gbps) electrical interface, AHCI software interface, 128Gbit 32-layer MLC V-NAND
840 PRO: SATA 3.0 (6Gbps) electrical interface, AHCI software interface, 64Gbit 21nm planar MLC NAND
Based on AnandTech's Storage Bench data, a basic test suite can be built to measure performance in PC workloads. With IO sizes mostly split between 4KB and 64-128KB, there are essentially four tests needed to determine performance: 4KB random read, 4KB random write, 128KB sequential read and 128KB sequential write. Small IO patterns, such as log file updates, are typically random by nature, whereas large IOs, such as application loads, tend to be sequential. Therefore, it is logical to test small IOs with random patterns and large IOs with sequential patterns.

Since queue depths in PC usage are typically very low, as shown by AnandTech's Storage Bench IO traces, running benchmarks at low queue depths is necessary to produce results that reflect actual usage. A queue depth of one is the most relevant, but for more accurate results and conclusions, it is recommended to test some higher queue depths as well. For this whitepaper, we have chosen queue depths of 1, 2, 4 and 8 to show performance scaling with higher, but still relatively low, queue depths to ensure relevancy to real-world performance.

In these tests, the 950 PRO is compared against its predecessors, the 850 PRO and 840 PRO, to show the benefits of NVMe and PCIe over SATA 6Gbps. All drives are 256GB in capacity.
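The resulting test plan is small enough to enumerate directly. The sketch below builds the full matrix of runs; the field names are illustrative, not an Iometer configuration format:

```python
from itertools import product

# The four access patterns derived from the IO-size breakdown.
PATTERNS = [
    {"name": "4KB random read",        "size_kb": 4,   "access": "random",     "op": "read"},
    {"name": "4KB random write",       "size_kb": 4,   "access": "random",     "op": "write"},
    {"name": "128KB sequential read",  "size_kb": 128, "access": "sequential", "op": "read"},
    {"name": "128KB sequential write", "size_kb": 128, "access": "sequential", "op": "write"},
]

# The low queue depths seen in client traces.
QUEUE_DEPTHS = [1, 2, 4, 8]

# Cross patterns with queue depths to get every run in the suite.
test_matrix = [dict(p, queue_depth=qd) for p, qd in product(PATTERNS, QUEUE_DEPTHS)]
print(len(test_matrix))  # 16 runs: 4 patterns x 4 queue depths
```

Each of the 16 runs would then be executed per drive in Iometer, giving the curves shown in the charts that follow.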
Benchmarking 950 PRO and NVMe
Random Read: NVMe and the 950 PRO show substantial gains in random read performance. At a queue depth of one, the 950 PRO is more than 40% faster than the SATA and AHCI based 850 PRO. At higher queue depths, the performance differences are even greater, with the 950 PRO performing up to 60% faster than the 850 PRO.

Historically, there has been very little improvement in 4KB random performance at low queue depths due to SATA and AHCI latencies. While PCIe reduces electrical latency through a direct connection to the CPU, NVMe reduces latency overhead even further with its simplified storage stack, resulting in unprecedented SSD performance.

Random Write: In random write, the performance gains at a queue depth of one are even more significant, with the 950 PRO performing more than 70% faster than the 850 PRO. At a queue depth of two, the performance difference grows to 87% in favor of the 950 PRO, although at even higher queue depths the performance delta decreases. Given the rarity of queue depths over four, the 950 PRO provides substantially higher 4KB random write performance under real-world usage.
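The link between per-IO latency and low-queue-depth performance can be made concrete: at a queue depth of one the drive services a single IO at a time, so throughput is simply IO size divided by per-IO latency. The latencies below are hypothetical round numbers chosen for illustration, not measured values for any of these drives:

```python
def qd1_throughput_mb_s(io_size_kb: float, latency_us: float) -> float:
    """At queue depth 1 only one IO is ever in flight, so throughput
    is the IO size divided by the end-to-end per-IO latency."""
    return (io_size_kb / 1024) / (latency_us / 1e6)  # MB per second

# Hypothetical example: cutting per-IO latency from 125 us to 80 us
# lifts 4KB QD1 throughput from about 31 MB/s to about 49 MB/s,
# without any change to the NAND itself.
print(round(qd1_throughput_mb_s(4, 125), 1))
print(round(qd1_throughput_mb_s(4, 80), 1))
```

This is why shaving stack latency, which is exactly what NVMe does, shows up most clearly in the QD1 numbers.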
NVMe SSD PERFORMANCE GAINS

[Chart: 4KB Random Read - transfer rate in MB/s vs queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO, with a detail view at QD1.]

[Chart: 4KB Random Write - transfer rate in MB/s vs queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO, with a detail view at QD1.]
[Chart: 128KB Sequential Read - transfer rate in MB/s (0-2500) vs queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]

[Chart: 128KB Sequential Write - transfer rate in MB/s (0-1200) vs queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]
Sequential Read and Write: In sequential read performance, the 950 PRO is more than three times faster than the 850 PRO at a queue depth of one. At higher queue depths, the difference grows to fourfold in favor of the 950 PRO. Part of the performance gain is due to the higher bandwidth of PCIe 3.0 x4 compared to the SATA 6Gbps interface. However, the lower latency of the NVMe stack is also a crucial contributor to sequential performance.

Similarly, the 950 PRO is more than twice as fast as the 850 PRO in sequential write at a queue depth of one, and the performance benefit is sustained at higher queue depths as well.