1. Improve Presto architectural decisions with shadow cache
Zhenyu Song (Princeton University)
Ke Wang (Facebook)
October 12, 2021
2. Introduction
● Zhenyu Song
○ Ph.D. candidate at Princeton University
○ Interested in caching systems
● Ke Wang
○ Engineer at Facebook
○ Focuses on low-latency queries on the Presto team
3. Motivation: cache operation decisions
Shadow cache: a lightweight Alluxio component to track the working set size & the infinite-size cache hit ratio
Cache operator questions:
● How do I size my cache for each tenant?
● What is the potential hit ratio improvement?
4. Motivation: cache operation decisions
● How do I size my cache for each tenant? → Shadow cache reports the total unique bytes (pages) accessed in the past 24 h
● What is the potential hit ratio improvement? → Shadow cache reports the total #hits/#misses if the cache could hold all pages requested in the past 24 h
5. Shadow cache design challenges
● Goal: track the working set size & the infinite-size hit ratio
● Challenges:
○ Small memory & CPU overhead
○ Accuracy
○ Dynamic updates
6. Solution to the overhead & accuracy challenges: Bloom filter
● A space-efficient probabilistic data structure for membership testing
● Intuition: each object is represented by only a few bits
● May return false positives, but never false negatives
● It uses k hash functions
○ To add an element, apply each hash function and set the corresponding bit to 1
○ To query an element, apply each hash function and AND the corresponding bits
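The two operations above can be sketched as follows. This is an illustrative toy class, not Alluxio's implementation; the double-hashing trick (deriving k indices from two base hashes) is one common way to implement the k hash functions.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash functions over an m-bit array.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the filter
    private final int k; // number of hash functions

    public TinyBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing).
    private int index(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Add: set the bit chosen by each of the k hash functions.
    public void add(Object key) {
        for (int i = 0; i < k; i++) bits.set(index(key, i));
    }

    // Query: AND the bits chosen by each hash function.
    public boolean mightContain(Object key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) return false; // definitely absent
        }
        return true; // present, or a false positive
    }

    // Number of bits set to one (the X used in the size estimate later).
    public int bitsSet() {
        return bits.cardinality();
    }
}
```

Note how a negative answer is always exact, while a positive answer can occasionally be a false positive; this asymmetry is what makes the structure safe for hit-ratio tracking.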
7. Why does the Bloom filter help?
● To get the infinite-size hit ratio, we can query each get(key) to check whether the key is in the Bloom filter.
● To measure the working set size, we leverage the approximation
n* ≈ -(m / k) · ln(1 - X / m)
where n* is an estimate of the number of items in the filter, m is the length (size in bits) of the filter, k is the number of hash functions, and X is the number of bits set to one.
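In code, this count estimator (the Swamidass–Baldi approximation) is a one-liner; the class and method names below are illustrative:

```java
// Estimate the number of distinct items inserted into a Bloom filter:
//   n* ≈ -(m / k) * ln(1 - X / m)
// m = filter size in bits, k = number of hash functions,
// X = number of bits set to one.
public class WorkingSetEstimate {
    public static double estimateItems(long m, int k, long bitsSetToOne) {
        return -((double) m / k) * Math.log(1.0 - (double) bitsSetToOne / m);
    }
}
```

When the filter is nearly empty the estimate reduces to roughly X / k (each item sets about k fresh bits); as the filter fills up, the logarithm corrects for bits shared by multiple items.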
8. Solution to dynamic update: Bloom filter chain
● The shadow cache is implemented as a chain of Bloom filters; each one tracks the unique objects in one period
(Figure: four Bloom filters in a chain, each covering a 6 h segment of the window.)
12. Bloom filter chain: estimate_working_set_size()
(Figure: at time t, OR all bits across the Bloom filters in the chain to form one union filter over the whole window.)
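Putting the two chain slides together, the structure could be sketched as below. This is an assumption-laden illustration, not Alluxio's CacheManagerWithShadowCache: segment rotation is modeled as an explicit method call, and sizes are arbitrary.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Sketch of a Bloom filter chain: each filter covers one time segment
// (e.g. 6 h of a 24 h window). On each segment boundary the oldest
// filter is dropped and a fresh one is appended.
public class BloomFilterChain {
    private final int m, k;
    private final Deque<BitSet> chain = new ArrayDeque<>();

    public BloomFilterChain(int m, int k, int segments) {
        this.m = m;
        this.k = k;
        for (int i = 0; i < segments; i++) chain.addLast(new BitSet(m));
    }

    private int index(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Record an access in the newest (current) segment.
    public void put(Object key) {
        BitSet current = chain.peekLast();
        for (int i = 0; i < k; i++) current.set(index(key, i));
    }

    // Infinite-size hit check: was the key seen in ANY segment of the window?
    public boolean mightContain(Object key) {
        for (BitSet f : chain) {
            boolean hit = true;
            for (int i = 0; i < k; i++) {
                if (!f.get(index(key, i))) { hit = false; break; }
            }
            if (hit) return true;
        }
        return false;
    }

    // Called once per segment boundary (e.g. every 6 h): evict the oldest.
    public void advanceSegment() {
        chain.pollFirst();
        chain.addLast(new BitSet(m));
    }

    // Working set over the whole window: OR all filters' bits, then apply
    // n* ≈ -(m/k) * ln(1 - X/m) to the union.
    public double estimateWorkingSetSize() {
        BitSet union = new BitSet(m);
        for (BitSet f : chain) union.or(f);
        double bitsSet = union.cardinality();
        return -((double) m / k) * Math.log(1.0 - bitsSet / m);
    }
}
```

The chain gives the "dynamic update" property from slide 5: stale accesses age out one segment at a time instead of requiring the whole filter to be rebuilt.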
13. Memory overhead estimation
● Example: tracking 27 M pages (a 27 TB working set) uses 125 MB of memory, with only 3% error
○ Assuming four Bloom filters and 1 MB pages
○ The memory overhead is independent of the page key type (currently {string, long})
● Could be reduced further with HyperLogLog, but that would not support infinite-size hit ratio estimation
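A quick back-of-the-envelope check of these numbers (assuming the 125 MB budget is the total across all four filters, interpreted here as 125 MiB):

```java
// Bits of Bloom filter space spent per tracked page,
// given a total memory budget and a page count.
public class OverheadCheck {
    public static double bitsPerPage(long totalBytes, long pages) {
        return totalBytes * 8.0 / pages;
    }
}
```

With 125 MiB and 27 M pages this comes out to roughly 39 bits per tracked page, which is in the range where Bloom filter count estimates stay within a few percent error, consistent with the slide's 3% figure.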
14. Implementation
● Built on the Guava BloomFilter library
● Automatically selects the Bloom filter configuration (bits, #hash functions) from the user-defined memory overhead budget and shadow cache window
● Supports working set size in terms of #pages and #bytes
● Supports the infinite-size byte hit ratio and object hit ratio
15. Usage
# The past window that defines the working set
alluxio.user.client.cache.shadow.window=24h
# The total memory overhead for the Bloom filters used for tracking
alluxio.user.client.cache.shadow.memory.overhead=125MB
# The number of Bloom filters used for tracking; each tracks a segment of the window
alluxio.user.client.cache.shadow.bloomfilter.num=4
16. Conclusion
● We designed Shadow Cache: a lightweight Alluxio component that tracks the working set size & the infinite-size cache hit ratio
● Code merged: https://github.com/Alluxio/alluxio/blob/master/core/client/fs/src/main/java/alluxio/client/file/cache/CacheManagerWithShadowCache.java
● Many optimization opportunities remain
19. Motivation
1. We want to understand whether a cluster is bounded by cache storage: would adding more storage help the cache hit rate, and thus query latency?
2. It would also be useful to explore the potential improvement from better caching algorithms
3. We want to optimize the routing algorithm for better balance and efficiency
20. Presto routing for RaptorX
● We shard the cache by table name among clusters
● Queries that access the same table always go to the same target cluster to maximize cache reuse
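The table-affinity rule above can be sketched as a simple hash-based router. This is an illustrative stand-in, not RaptorX's production routing code, which (as the next slide discusses) must also cope with load imbalance:

```java
// Table-affinity routing sketch: hash the table name to pick a target
// cluster, so queries on the same table land on the same cluster and
// reuse its cache.
public class TableRouter {
    public static int clusterFor(String tableName, int numClusters) {
        // floorMod keeps the result in [0, numClusters) even for
        // negative hash codes.
        return Math.floorMod(tableName.hashCode(), numClusters);
    }
}
```

The key property is determinism: the same table name always maps to the same cluster, which is what keeps per-table cache contents warm.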
22. Options for optimizing the routing logic
● Secondary cluster
○ When the primary cluster is busy, route to a designated secondary cluster that also has the cache turned on for those queries
○ Requires storing additional tables' cache on each cluster
● Two clusters both serving as designated primaries, with load balancing between them
○ Cache disk usage → ×2
● Shuffle tables between clusters to even out the CPU distribution based on the query pattern
○ Could make the cache storage distribution uneven and requires extra cache space
23. Key metrics on shadow cache
● Shadow cache gives us insight into the cache working set and what the cache hit rate would look like with infinite cache space.
● C1: real cache usage at a given point in time
● C2: shadow cache working set over a time window (1 day / 1 week)
● H1: real cache hit rate
● H2: shadow cache hit rate
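One illustrative way these four metrics could feed the storage-bound question from slide 19. The thresholds and method name here are assumptions for illustration, not the deck's stated decision logic:

```java
// If the shadow (infinite-size) hit rate H2 is well above the real hit
// rate H1 while the working set C2 exceeds the real cache usage C1, the
// cluster is likely bounded by cache storage, so adding space should help.
public class CacheInsight {
    public static boolean likelyStorageBound(double c1RealUsage,
                                             double c2WorkingSet,
                                             double h1RealHitRate,
                                             double h2ShadowHitRate) {
        // 0.10 is an arbitrary example threshold for "much higher".
        return c2WorkingSet > c1RealUsage
            && (h2ShadowHitRate - h1RealHitRate) > 0.10;
    }
}
```

Conversely, if H1 is already close to H2, more cache space cannot help much, and attention should shift to the caching algorithm or the routing logic instead.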