1. Hotspot Detection in a Service Oriented
Architecture
Pranay Anchuri, anchupa@cs.rpi.edu,
http://cs.rpi.edu/~anchupa
Rensselaer Polytechnic Institute, Troy, NY
Roshan Sumbaly, roshan@coursera.org
Coursera, Mountain View, CA
Sam Shah, samshah@linkedin.com
LinkedIn, Mountain View, CA
6. www.rpi.edu
What is a Hotspot
Hotspot: a service responsible for suboptimal
performance of a user-facing functionality.
Performance measures:
Latency
Cost to serve
Error rate
8. Who uses hotspot detection?
Engineering teams :
Minimize latency for the user.
Increase the throughput of the servers.
Operations teams :
Reduce the cost of serving user requests.
10. Data - Service Call Graphs
Service call metrics are logged into a central
system.
Call graph structure is reconstructed from a
random trace ID.
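A minimal sketch of how call graphs could be rebuilt from centrally logged records that share a trace ID. The record fields (`trace_id`, `caller`, `callee`, `latency_ms`) and the service names are hypothetical, chosen for illustration only; the slides do not specify the log schema.

```python
from collections import defaultdict

def build_call_graphs(records):
    """Group logged call records by trace id and link each call to its caller."""
    graphs = defaultdict(lambda: defaultdict(list))
    for rec in records:
        # Assumed (hypothetical) record fields: trace_id, caller, callee, latency_ms.
        graphs[rec["trace_id"]][rec["caller"]].append(
            (rec["callee"], rec["latency_ms"])
        )
    # Convert nested defaultdicts to plain dicts: trace id -> {caller: [(callee, latency)]}.
    return {tid: dict(edges) for tid, edges in graphs.items()}

records = [
    {"trace_id": "t1", "caller": "frontend", "callee": "profile", "latency_ms": 12},
    {"trace_id": "t1", "caller": "profile", "callee": "content", "latency_ms": 7},
    {"trace_id": "t2", "caller": "frontend", "callee": "feed", "latency_ms": 20},
]
graphs = build_call_graphs(records)
```

Each trace ID yields one small adjacency map, so every request is analyzed as its own graph.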
11. Example of Service Call Graph
[Figure: service call graph for a "Read profile" request. Nodes: Content Service, Context Service, Content Service, Entitlements, Visibility. Numeric labels in the figure: 3, 7, 12, 10, 11.]
17. Structure of call graphs
The structure of call graphs changes rapidly
across requests.
Depends on the member's attributes.
A/B testing.
Changes to the code base.
Over 90% of structures are unique for the most
requested services.
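One way such a uniqueness figure could be measured is sketched below. It assumes two call graphs have the same structure when their trees match after ignoring the order of sibling calls; the slides do not state the exact definition, and the `(name, children)` tuple encoding is hypothetical.

```python
def signature(node):
    """Canonical form of a call tree: sibling order is ignored (an assumption)."""
    name, children = node
    return (name, tuple(sorted(signature(child) for child in children)))

def unique_fraction(graphs):
    """Fraction of call graphs whose structure differs from all others seen."""
    return len({signature(g) for g in graphs}) / len(graphs)

# Hypothetical call trees encoded as (service name, list of child trees).
g1 = ("A", [("B", []), ("C", [])])
g2 = ("A", [("C", []), ("B", [])])  # same structure as g1, different sibling order
g3 = ("A", [("B", [])])
fraction = unique_fraction([g1, g2, g3])
```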
18. Asynchronous service calls
Calls A→B and A→C are either:
Serial: C is called after B returns to A.
Parallel: B and C are called at the same time or
within a brief time span.
Parallel service calls are particularly difficult
to handle.
Degree of parallelism is ~20 for some
services.
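The serial/parallel distinction can be sketched from call timestamps. The `(callee, start_ms, end_ms)` tuple layout is a hypothetical encoding, and the rule used here (a call is serial only if it starts after the other has returned) is a simplification of the "brief time span" criterion on the slide.

```python
def relation(first, second):
    """Classify two sibling calls, each given as (callee, start_ms, end_ms)."""
    # Order the two calls by start time so the comparison is symmetric.
    (_, _, earlier_end), (_, later_start, _) = sorted(
        [first, second], key=lambda call: call[1]
    )
    # Serial: the later call starts only after the earlier one has returned.
    return "serial" if later_start >= earlier_end else "parallel"

overlapping = relation(("B", 0, 10), ("C", 1, 8))
back_to_back = relation(("B", 0, 10), ("C", 10, 15))
```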
19. Related Work
Hu et al. [SIGCOMM 04, INFOCOM 05]
Tools to detect bottlenecks along network paths.
Mann et al. [USENIX 11]
Models to estimate latency as a function of RPC
latencies.
20. Why don't existing methods work?
The metric cannot be controlled, as it is in bottleneck
detection algorithms.
Analyzing millions of small networks.
Parallel service calls.
24. Optimize and summarize approach
● Given call graphs
● Hotspots in each call graph
● Ranking hotspots
25. What are the top-k hotspots in a call graph?
Hotspots in a specific call graph, irrespective of
other call graphs for the same type of request.
26. Key Idea
What are the k services that, if already optimized,
would have led to the maximum reduction in the latency
of the request?
(Specific to a particular call graph)
28. Quantifying impact of a service
What if a service had been optimized by a factor
θ? (Think after the fact.)
Its internal computations are θ times faster.
No effect on the overall latency if its parent is
waiting on another service call to return.
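A small sketch of this "what if" calculation under a simplified latency model that is an assumption here, not the paper's exact formulation: a service's latency is its own computation time plus, for each serial stage, the slowest of the calls issued in parallel in that stage. The `Node` class, `stages` field, and service names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    self_time: float                            # time spent in internal computation
    stages: list = field(default_factory=list)  # serial stages, each a list of parallel children

def latency(node, optimized=frozenset(), theta=2.0):
    """End-to-end latency when the services named in `optimized` run theta times faster."""
    own = node.self_time / theta if node.name in optimized else node.self_time
    # A parent waits for the slowest child in each parallel stage.
    return own + sum(
        max(latency(child, optimized, theta) for child in stage)
        for stage in node.stages
    )

def impact(root, service, theta=2.0):
    """Reduction in root latency if `service` alone had been optimized by theta."""
    return latency(root) - latency(root, {service}, theta)

# A calls B and C in parallel; B is the slower of the two.
root = Node("A", 2, [[Node("B", 10), Node("C", 4)]])
impact_b = impact(root, "B")  # B is on the critical path
impact_c = impact(root, "C")  # A is still waiting on B, so C's speedup is hidden
```

In this toy graph, speeding up C changes nothing because A is still waiting on B, which is the case described in the last bullet.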
39. Under the propagation assumption
Computing the optimal 𝑘 services is NP-hard.
Reduction from a variation of the subset sum
problem.
Construction and proof in the paper.
40. Relaxation
Variation of the propagation assumption
that allows for a service to propagate
fractional effects to its parent.
Leads to a greedy algorithm.
41. Greedy algorithm to compute top-k hotspots
Given an optimization factor θ:
Repeatedly select the service that has the maximum impact
on the frontend service.
Update the times after each selection.
Stop after k iterations.
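The greedy loop above can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: it recomputes latency exactly under a simple model (own time plus the slowest child of each serial stage) instead of using the fractional propagation relaxation, and the `(name, self_time, stages)` tuple encoding and service names are hypothetical.

```python
def latency(node, optimized, theta):
    """Latency of a (name, self_time, stages) tree with some services sped up."""
    name, self_time, stages = node
    own = self_time / theta if name in optimized else self_time
    return own + sum(
        max(latency(child, optimized, theta) for child in stage) for stage in stages
    )

def services(node):
    """All service names that appear in the call tree."""
    name, _, stages = node
    found = {name}
    for stage in stages:
        for child in stage:
            found |= services(child)
    return found

def greedy_top_k(root, k, theta=2.0):
    """Pick k services, each time taking the one that cuts root latency the most."""
    chosen = []
    candidates = services(root)
    for _ in range(k):
        remaining = sorted(candidates - set(chosen))
        if not remaining:
            break
        # "Update the times": latency is recomputed with all earlier picks applied.
        current = latency(root, set(chosen), theta)
        best = max(
            remaining,
            key=lambda s: current - latency(root, set(chosen) | {s}, theta),
        )
        chosen.append(best)
    return chosen

# A calls B and C in parallel, then D.
root = ("A", 2, [[("B", 10, []), ("C", 4, [])], [("D", 6, [])]])
top2 = greedy_top_k(root, 2)
```

Here B is picked first (it saves 5 time units), then D (3 units); C is never useful because it stays hidden behind B.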
42. Ranking hotspots
The top 𝑘 services change
significantly across
different call graphs.
Rank hotspots on:
Frequency (itemset
mining)
Impact on the frontend
service.
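A rough sketch of the frequency-based ranking: count how often each service, or each set of services, appears among the per-graph hotspots. Plain counting of fixed-size itemsets is used as a stand-in for a full itemset mining algorithm, and the input lists are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def rank_by_frequency(top_k_per_graph, size=1):
    """Count how many call graphs contain each hotspot itemset of the given size."""
    counts = Counter()
    for hotspots in top_k_per_graph:
        # Sort so that the same set of services always yields the same tuple.
        for itemset in combinations(sorted(set(hotspots)), size):
            counts[itemset] += 1
    return counts.most_common()

# Hypothetical top-2 hotspots found in three call graphs.
top_k_per_graph = [["B", "D"], ["B", "C"], ["D", "B"]]
singles = rank_by_frequency(top_k_per_graph)
pairs = rank_by_frequency(top_k_per_graph, size=2)
```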
43. Rest of the paper
Similar approach applied to cost of request
metric.
Generalized framework for optimizing
arbitrary metrics.
Other ranking schemes.
45. Dataset
Request type   Avg # of call graphs per day*   Avg # of service calls per request   Avg # of subcalls per service   Max # of parallel subcalls
Home           10.2 M                          16.90                                1.88                            9.02
Mailbox        3.33 M                          23.31                                1.9                             8.88
Profile        3.14 M                          17.31                                1.86                            11.04
Feed           1.75 M                          16.29                                1.87                            8.97
* Scaled down by a constant factor
50. Conclusions
Defined hotspots in service-oriented
architectures.
Framework to mine hotspots w.r.t. various
performance metrics.
Experiments on real-world, large-scale
datasets.