Presentation of profiling methods based on instrumented and sampling profilers, given at the Perfug in Paris on 17 September 2015
Source code is available on GitHub: https://github.com/FabienArcellier/Perfug-Deep-into-your-application
3. WHAT'S ON THE MENU
What does profiling an application mean?
How does it work?
Applying it to a real-world application: memcached
4. PROFILING IN A FEW WORDS ...
Software profiling is a form of dynamic program analysis that measures, for example:
the space or time complexity of a program
the usage of particular instructions
the frequency and duration of function calls, ...
6. TO GET A BETTER VIEW OF WHAT'S HAPPENING ON YOUR HARDWARE, ...
(image © highscalability)
7. TO IMPROVE YOUR APPLICATION PERFORMANCE, ...
(image © macifcourseaularge)
You need measurements to continuously improve your application's performance.
9. TO MONITOR YOUR SERVER, ...
[Figure: CPU flame graph; frames: app, __libc_start_main, main, dot, mat_mul]
You want to understand what your CPUs are doing.
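10. IN THE BEGINNING, THERE IS A PROGRAM ...
/* app.c: a minimal program with two functions */
int main(void)
{
    return 0;
}

/* Never called, but it still gets an entry in the symbol table */
int func1(void)
{
    return 0;
}
Compile it with gcc:
gcc app.c -o app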
11. WITH A SIMPLE SYMBOL TABLE ...
readelf: displays information about ELF files
readelf -s app
45: 0000000000400580 2 FUNC GLOBAL DEFAULT 13 __libc_csu_fini
46: 00000000004004f8 11 FUNC GLOBAL DEFAULT 13 func1
...
57: 0000000000601040 0 NOTYPE GLOBAL DEFAULT 25 _end
58: 0000000000400400 0 FUNC GLOBAL DEFAULT 13 _start
59: 0000000000601038 0 NOTYPE GLOBAL DEFAULT 25 __bss_start
60: 00000000004004ed 11 FUNC GLOBAL DEFAULT 13 main
...
00000000004004ed: virtual address of the symbol
11: size of the symbol in bytes
FUNC: type of the symbol (here, a function)
main: name of the symbol
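To make the columns concrete, here is a small C sketch that splits one readelf -s row into its eight fields (Num, Value, Size, Type, Bind, Vis, Ndx, Name). It assumes the Size column is printed in decimal, as in the listings above. struct elf_sym_row and parse_readelf_row are hypothetical names and are not part of binutils.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One row of `readelf -s` output (hypothetical helper type). */
struct elf_sym_row {
    int num;                 /* Num: index in the symbol table */
    unsigned long long addr; /* Value: virtual address of the symbol */
    unsigned long size;      /* Size: size in bytes (decimal assumed) */
    char type[16];           /* Type: FUNC, OBJECT, NOTYPE, ... */
    char bind[16];           /* Bind: GLOBAL, LOCAL, WEAK */
    char vis[16];            /* Vis: DEFAULT, HIDDEN, ... */
    char ndx[16];            /* Ndx: section index, or UND for undefined */
    char name[128];          /* Name (readelf may truncate long names) */
};

/* Returns 1 when all eight columns were read, 0 otherwise. */
int parse_readelf_row(const char *line, struct elf_sym_row *row)
{
    assert(line != NULL && row != NULL);
    memset(row, 0, sizeof *row);
    int fields = sscanf(line, " %d: %llx %lu %15s %15s %15s %15s %127s",
                        &row->num, &row->addr, &row->size, row->type,
                        row->bind, row->vis, row->ndx, row->name);
    return fields == 8;
}
```

For example, the main row above yields address 0x4004ed, size 11, type FUNC and name main. Real profilers read the ELF symbol table directly instead of parsing readelf output.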
12. HOW DOES IT WORK?
60: 00000000004004ed 11 FUNC GLOBAL DEFAULT 13 main
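A profiler sample is essentially an instruction address. To attribute it to a function, the profiler looks for the FUNC symbol whose range [Value, Value + Size) contains it. The C sketch below does this with a binary search over symbols sorted by address. It assumes a non-PIE executable like the one above, so the addresses need no load offset. resolve_pc and struct func_sym are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A function symbol taken from `readelf -s`: start address, size, name. */
struct func_sym {
    uint64_t addr;
    uint64_t size;
    const char *name;
};

/* Return the name of the symbol whose range [addr, addr + size) contains pc,
 * or NULL. syms must be sorted by ascending addr. */
const char *resolve_pc(const struct func_sym *syms, size_t n, uint64_t pc)
{
    assert(syms != NULL || n == 0);
    /* Binary search for the first symbol starting after pc. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].addr <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                          /* pc is below every symbol */
    const struct func_sym *s = &syms[lo - 1]; /* last symbol starting at or before pc */
    return pc < s->addr + s->size ? s->name : NULL;
}
```

With a table built from the readelf output above (_start, main, func1, __libc_csu_fini), address 0x4004f0 resolves to main and 0x4004f8 resolves to func1. Address 0x400503, which is just past func1 (0x4004f8 + 11), matches nothing.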
13. CAPTURE EVENTS AND ASSOCIATE THEM WITH SYMBOLS
Profilers generally fall into three types:
Instrumented profiling
Sampling profiling
Event-based profiling (Java, .NET, ...)
14. INSTRUMENTED PROFILING
Gprof, Callgrind, ...
Pros
Captures every event
Fine granularity
Cons
Slower than raw execution (about 20 times slower for Callgrind)
Intrusive (modifies the assembly code or emulates a virtual processor)
What they capture and what they show can differ
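To see where both the overhead and the "capture every event" property come from, here is a hand-written C sketch of instrumentation. Probes around a function's body count every call and add up the time spent inside it, using clock_gettime(CLOCK_MONOTONIC). The PROBE_ENTER/PROBE_EXIT macros and square() are hypothetical. gprof gets comparable hooks from the compiler (gcc -pg), while Callgrind adds its instrumentation at run time, so it needs no recompilation.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Per-function statistics collected by the probes. */
struct probe_stats {
    unsigned long calls;  /* how many times the function was entered */
    uint64_t total_ns;    /* accumulated time spent inside it */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Hypothetical probes: every call is counted, none is missed,
 * but each call pays for two clock reads. */
#define PROBE_ENTER(st) uint64_t probe_t0_ = now_ns(); (st).calls++
#define PROBE_EXIT(st)  ((st).total_ns += now_ns() - probe_t0_)

struct probe_stats square_stats;

/* An instrumented function: the probes are inserted by hand here. */
long square(long x)
{
    PROBE_ENTER(square_stats);
    long r = x * x;
    PROBE_EXIT(square_stats);
    return r;
}
```

After 1000 calls to square(), square_stats.calls is exactly 1000. For a function this small, however, the two clock reads cost far more than the body itself, which is the slowdown listed above.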
15. TOOLING - CALLGRIND
Callgrind is a callgraph analyzer that comes with Valgrind.
Valgrind is a virtual machine using just-in-time (JIT)
compilation techniques.
16. EXAMPLE WITH A MATRIX COMPUTATION
You can instrument the execution with Callgrind and explore the results in KCachegrind:
valgrind --tool=callgrind ./app
kcachegrind callgrind.out.@pid@
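The program itself is in the GitHub repository linked at the start, and only its function names (mat_new, mat_mul, dot) appear in these slides. The following C sketch is a hypothetical reconstruction that uses those names and is enough to reproduce the experiment. mat_new allocates a square matrix, and mat_mul fills each cell of the result with dot(), the row-by-column inner product.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reconstruction: only the function names come from the slides. */
typedef struct {
    size_t n;   /* square matrix of size n x n */
    double *v;  /* row-major storage */
} mat;

mat *mat_new(size_t n)
{
    mat *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->n = n;
    m->v = calloc(n * n, sizeof *m->v);  /* zero-initialised */
    if (m->v == NULL) {
        free(m);
        return NULL;
    }
    return m;
}

void mat_free(mat *m)
{
    if (m != NULL) {
        free(m->v);
        free(m);
    }
}

/* Inner product of row i of a and column j of b: the hot loop. */
double dot(const mat *a, const mat *b, size_t i, size_t j)
{
    double s = 0.0;
    for (size_t k = 0; k < a->n; k++)
        s += a->v[i * a->n + k] * b->v[k * b->n + j];
    return s;
}

/* c = a * b, calling dot() once per cell of c. */
void mat_mul(const mat *a, const mat *b, mat *c)
{
    assert(a->n == b->n && b->n == c->n);
    for (size_t i = 0; i < a->n; i++)
        for (size_t j = 0; j < a->n; j++)
            c->v[i * c->n + j] = dot(a, b, i, j);
}
```

Consider a main that calls mat_new three times (for A, B and the result C) and then runs mat_mul on large matrices. It should spend almost all of its time in dot, because mat_mul performs n × n inner products of length n.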
17. SAMPLING PROFILING
Perf, Oprofile, Intel Vtune, ...
Pros
About 5 to 10% slower than raw execution
Runs on any code
Cons
Some events are invisible (a function that runs entirely between two samples is never seen)
18. SANDBOX - WRITE YOUR OWN SAMPLING PROFILER
To see how simple a sampling profiler is, write your own thread-dump tool using gdb:
# Print the stack of every thread of the process whose PID is $1
gstack() {
  tmp=$(mktemp)
  echo "thread apply all bt" >"$tmp"
  gdb -batch -nx -q -x "$tmp" -p "$1"
  rm -f "$tmp"
}
Run it periodically to see where your program is spending its time:
while sleep 1; do gstack @pid@; done
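The gdb loop stops the whole process once per second to dump its stacks. A profiler running inside the process can do the same with a CPU-time timer. setitimer(ITIMER_PROF) makes the kernel deliver SIGPROF at a fixed interval of consumed CPU time, and the signal handler records what the program is doing at that moment. In this C sketch, the handler only reads a hypothetical current_phase marker set by the program. A real profiler would read the interrupted instruction pointer (or a full backtrace) and resolve it against the symbol table.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical phases of the profiled program. */
enum phase { PHASE_IDLE, PHASE_COMPUTE, PHASE_COUNT };

static volatile sig_atomic_t current_phase = PHASE_IDLE;
static volatile sig_atomic_t samples[PHASE_COUNT];

/* The "sample": note where the program is right now. */
static void on_sigprof(int sig)
{
    (void)sig;
    samples[current_phase]++;
}

/* Ask for SIGPROF every interval_us microseconds of CPU time. */
int start_sampling(long interval_us)
{
    assert(interval_us > 0 && interval_us < 1000000);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;
    struct itimerval it = {
        .it_interval = { .tv_sec = 0, .tv_usec = interval_us },
        .it_value    = { .tv_sec = 0, .tv_usec = interval_us },
    };
    return setitimer(ITIMER_PROF, &it, NULL);
}

void stop_sampling(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn roughly `seconds` of CPU time while tagged as phase p. */
void busy(enum phase p, double seconds)
{
    current_phase = p;
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    volatile unsigned long x = 0;
    while (clock() < end)
        x++;
    current_phase = PHASE_IDLE;
}
```

Calling start_sampling(1000), then busy(PHASE_COMPUTE, 0.3), then stop_sampling() leaves a non-zero count in samples[PHASE_COMPUTE]. The exact number depends on the kernel's timer resolution.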
19. TOOLING - PERF & FLAMEGRAPH
perf is available since Linux 2.6 (for example on Ubuntu 11.10 and Red Hat 6)
A common interface for hardware counters
Flamegraph is actively developed by Brendan Gregg
20. EXAMPLE WITH A MATRIX COMPUTATION
[Figure: flame graph; frames: app, __libc_start_main, main, dot, mat_mul]
We don't have any time recorded for mat_new, even though it's called 3 times.
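mat_new is invisible because of how sampling works: a function only shows up if a sample happens to land while it is running. The C sketch below replays a hypothetical timeline (three 20 µs calls to mat_new, then 50 ms in mat_mul; these durations are made up for illustration). It takes a sample every 250 µs, which is 4000 samples per second, an assumed rate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stretch of execution: which function ran, and for how long (µs). */
struct slice {
    const char *func;
    long duration_us;
};

/* Take a sample every period_us along the timeline and count how many
 * samples land while `func` is running. */
long samples_for(const struct slice *timeline, size_t n, long period_us,
                 const char *func)
{
    assert(period_us > 0);
    long count = 0;
    long start = 0;                 /* start time of the current slice */
    long next_sample = period_us;   /* first sample after one period */
    for (size_t i = 0; i < n; i++) {
        long end = start + timeline[i].duration_us;
        while (next_sample < end) { /* samples falling in [start, end) */
            if (strcmp(timeline[i].func, func) == 0)
                count++;
            next_sample += period_us;
        }
        start = end;
    }
    return count;
}
```

With these numbers, mat_new receives 0 samples and mat_mul receives 200. On average a function gets about duration × frequency samples: 3 × 20 µs × 4000 Hz = 0.24 for mat_new. It is therefore usually absent from the flame graph, while an instrumented profiler such as Callgrind would still count its 3 calls.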
24. WHAT'S HIDDEN INSIDE THE MEMCACHED BINARY?
readelf -s ./memcached
...
434: 000000000040edf0 10 FUNC GLOBAL DEFAULT 13 slabs_rebalancer_res
435: 0000000000000000 0 FUNC GLOBAL DEFAULT UND setuid@@GLIBC_2
436: 0000000000000000 0 FUNC GLOBAL DEFAULT UND event_base_loop
437: 0000000000412fd0 315 FUNC GLOBAL DEFAULT 13 pause_threads
438: 00000000004135e0 10 FUNC GLOBAL DEFAULT 13 STATS_LOCK
439: 0000000000000000 0 FUNC GLOBAL DEFAULT UND getaddrinfo@@GLIBC_2
440: 0000000000000000 0 FUNC GLOBAL DEFAULT UND strerror@@GLIBC_2
441: 000000000040f550 201 FUNC GLOBAL DEFAULT 13 do_item_unlink
442: 0000000000000000 0 FUNC GLOBAL DEFAULT UND event_init
443: 0000000000000000 0 FUNC GLOBAL DEFAULT UND sleep@@GLIBC_2
444: 0000000000412b40 247 FUNC GLOBAL DEFAULT 13 assoc_delete
...
25. WHAT HAPPENS WHEN I WRITE 100 RECORDS TO MEMCACHED?
Testing with Valgrind (not production-friendly)
Capturing CPU usage with gdb
Capturing CPU usage with perf_event
Capturing cache misses with perf_event
26. MEMCACHED - PROFILING WITH CALLGRIND
Understand what happens internally by following the execution trace.
valgrind --tool=callgrind --instr-atstart=no ./memcached
In another terminal:
callgrind_control -i on
php memcacheset.php
callgrind_control -i off
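callgrind_control switches instrumentation on and off from outside the process. Valgrind also provides client-request macros in <valgrind/callgrind.h> that do the same from inside the code, which narrows the trace to exactly the region of interest. In this C sketch, sum_of_squares is a hypothetical stand-in for the code under study. When the header is not installed, the sketch defines empty fallbacks so it still compiles. When the program does not run under Valgrind, the real macros do nothing.

```c
#include <assert.h>

/* Use Valgrind's header when available, otherwise no-op fallbacks. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION do { } while (0)
#  define CALLGRIND_STOP_INSTRUMENTATION  do { } while (0)
#endif

/* Only the loop between START and STOP is simulated in detail when the
 * program runs under valgrind --tool=callgrind --instr-atstart=no. */
long sum_of_squares(long n)
{
    assert(n >= 0);
    long s = 0;
    CALLGRIND_START_INSTRUMENTATION;
    for (long i = 1; i <= n; i++)
        s += i * i;
    CALLGRIND_STOP_INSTRUMENTATION;
    return s;
}
```

Outside the instrumented region, Callgrind collects nothing, and the run only pays Valgrind's base slowdown instead of the full simulation cost.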
28. MEMCACHED - PROFILING WITH GDB
./memcached &
while sleep 0.1; do gstack @pid@; done > stack.txt
In another terminal:
php memcacheset.php
Then stop the loop (Ctrl-C) and build the flame graph:
cat stack.txt | stackcollapse-gdb.pl | flamegraph.pl > gdb_graph.svg
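stackcollapse-gdb.pl turns the gdb dumps into the "folded" format that flamegraph.pl reads. Each line holds one distinct stack, with frames from the root to the leaf separated by ';', followed by the number of samples that had that stack. Here is a small C sketch of that step. It assumes each backtrace has already been reduced to its function names in gdb order, with frame #0 (the innermost) first. fold_stack, folded_count and write_folded are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join a backtrace given in gdb order (frame #0, the innermost, first)
 * into one folded stack, root first, e.g. "main;mat_mul;dot".
 * Returns the length written, or -1 if out is too small. */
int fold_stack(const char *const *frames, size_t n, char *out, size_t outsz)
{
    assert(out != NULL && outsz > 0);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = n; i-- > 0;) {          /* outermost frame first */
        int w = snprintf(out + used, outsz - used, "%s%s",
                         frames[i], i > 0 ? ";" : "");
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}

/* Number of samples equal to folded[i], or 0 if the same stack already
 * appeared earlier (so that each stack is reported once). */
size_t folded_count(const char *const *folded, size_t n, size_t i)
{
    for (size_t j = 0; j < i; j++)
        if (strcmp(folded[j], folded[i]) == 0)
            return 0;
    size_t count = 0;
    for (size_t j = i; j < n; j++)
        if (strcmp(folded[j], folded[i]) == 0)
            count++;
    return count;
}

/* Write the input expected by flamegraph.pl: "stack count" per line. */
void write_folded(FILE *fp, const char *const *folded, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t count = folded_count(folded, n, i);
        if (count > 0)
            fprintf(fp, "%s %zu\n", folded[i], count);
    }
}
```

For the backtrace dot, mat_mul, main, fold_stack produces main;mat_mul;dot. write_folded then prints lines such as "main;mat_mul;dot 2" that can be piped into flamegraph.pl.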
29. MEMCACHED - PROFILING WITH PERF
We capture events to build a call graph:
perf record -g ./memcached
In another terminal:
php memcacheset.php
To show an interactive report:
perf report
Or a plain-text report:
perf report --stdio
30. MEMCACHED - PROFILING CPU CYCLES WITH PERF
perf script | stackcollapse-perf.pl | flamegraph.pl > graph_stack_missing.sv
[Figure: CPU flame graph]
Some information from the kernel is missing.
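One likely reason for the missing kernel frames is the kernel.perf_event_paranoid setting (this is an assumption, since the slide does not say how perf was started). From level 2 upwards, unprivileged users cannot profile the kernel, which would explain why the next slide runs perf with sudo. This C sketch reads the current level. The path is passed as a parameter so the function can be tested; on Linux it is /proc/sys/kernel/perf_event_paranoid. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read the perf_event_paranoid level from path.
 * Returns 1 and stores the level on success, 0 otherwise. */
int read_perf_paranoid(const char *path, int *level)
{
    assert(path != NULL && level != NULL);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0;                /* not Linux, or /proc unavailable */
    int ok = fscanf(fp, "%d", level) == 1;
    fclose(fp);
    return ok;
}

/* From level 2 upwards, unprivileged users cannot profile the kernel. */
int kernel_profiling_allowed_for_users(int level)
{
    return level < 2;
}
```

Calling read_perf_paranoid("/proc/sys/kernel/perf_event_paranoid", &level) followed by kernel_profiling_allowed_for_users(level) tells whether a non-root perf record can sample kernel code.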
31. MEMCACHED - PROFILING CPU CYCLES WITH PERF - WITH KERNEL STACK TRACES
./memcached &
sudo perf record -a -g -p @pid@
In another terminal:
php memcacheset.php
Generate the flame graph (perf.data is now owned by root):
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > graph.svg
[Figure: CPU flame graph including kernel frames]
32. MEMCACHED - PROFILING CACHE MISSES WITH PERF
./memcached &
sudo perf record -e cache-misses -a -g -p @pid@
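perf record samples an event over the whole run. The same kernel interface, perf_event_open(2), can also count a hardware event around a single piece of code. This C sketch follows the counting pattern from the perf_event_open man page. count_cache_misses opens a PERF_COUNT_HW_CACHE_MISSES counter for the calling thread, enables it around fn(arg), and reads the total. touch_buffer and struct buf are a hypothetical workload that strides through memory 64 bytes at a time (an assumed cache-line size). In virtual machines or containers the counter is often unavailable, so the function then returns -1 without running fn.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open, so call it via syscall(2). */
static int perf_open(struct perf_event_attr *attr)
{
    return (int)syscall(SYS_perf_event_open, attr, 0 /* this thread */,
                        -1 /* any CPU */, -1 /* no group */, 0UL);
}

/* Count user-space cache misses while running fn(arg), or return -1. */
long long count_cache_misses(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof pe);
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof pe;
    pe.config = PERF_COUNT_HW_CACHE_MISSES;
    pe.disabled = 1;        /* start stopped, enable around fn only */
    pe.exclude_kernel = 1;  /* count user-space misses only */
    pe.exclude_hv = 1;

    int fd = perf_open(&pe);
    if (fd == -1)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        count = -1;
    close(fd);
    return count;
}

/* Hypothetical workload: touch one byte every 64 bytes. */
struct buf { volatile char *p; size_t len; };

void touch_buffer(void *arg)
{
    struct buf *b = arg;
    for (size_t i = 0; i < b->len; i += 64)
        b->p[i]++;
}
```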
33. SYSTEM - WHAT IS YOUR SYSTEM DOING?
sudo perf record -a -g
34. USING FLAMEGRAPH WITH JAVA
You can build a flame graph from jstack output
[Figure: Logstash contention flame graph]
36. IN SUMMARY
Prefer:
perf when you are looking for a bottleneck or want to watch what's happening on a machine
Callgrind when you want to understand what happens in the code and performance is not a requirement