1. How We Scaled Freshdesk to
Handle 150M Requests/Week
Kiran Darisi
Director, Technical Operations at Freshdesk
2. Our customer base grew by 400%, and the number of requests
per week boomed from 10 million to 65 million in a year (2013).
3. Cool for a 3-year-old startup?
Not from an engineering perspective.
4. We used a bunch of methods to scale vertically in a really
short amount of time.
Sure, we eventually had to shard our databases, but some
of these techniques helped us stay afloat for quite a while.
5. MOORE'S WAY
Increasing the RAM, CPU and I/O
We upgraded from an Amazon EC2 First Generation Medium
instance to a High-Memory Quadruple Extra Large instance
(thus increasing our RAM from 3.75 GB to 64 GB).
But the RAM and CPU cycles we added did not correlate with
the workload we got out of the instance. So we stayed put
at 64 GB.
6. THE READ/WRITE SPLIT
Using MySQL replication and distributing the reads between
the master and its slaves
The R/W split increased the number of I/Os we could perform
on our databases, but it didn't do much for write performance.
We assigned dedicated roles to each slave, because using a
round-robin algorithm to send different queries to different
slaves proved ineffective.
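The role-dedicated split can be sketched roughly like this (a minimal Python stand-in, not Freshdesk's actual Rails code; the connection names and query roles are illustrative):

```python
# Sketch of a read/write split with role-dedicated slaves: all writes go
# to the master, and each kind of read goes to the slave dedicated to it.
class ConnectionRouter:
    def __init__(self, master, slaves_by_role):
        self.master = master          # handles all writes
        self.slaves = slaves_by_role  # e.g. {"reports": "slave1", ...}

    def connection_for(self, query_kind: str):
        if query_kind == "write":
            return self.master
        # A dedicated role per slave, instead of round-robin across slaves;
        # unknown read roles fall back to the master.
        return self.slaves.get(query_kind, self.master)

router = ConnectionRouter("master-db", {"reports": "slave1", "search": "slave2"})
assert router.connection_for("write") == "master-db"
assert router.connection_for("reports") == "slave1"
assert router.connection_for("search") == "slave2"
```

Pinning a role to a slave keeps each slave's working set hot in its buffer pool, which is the usual reason round-robin distribution underperforms.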
7. MYSQL PARTITIONING
Using the MySQL 5 built-in partitioning capability
We chose the partition key and the number of partitions,
and the table was partitioned automatically.
Post-partitioning, our read performance increased
dramatically but, again, the write performance was a problem.
8. Things to keep in mind while performing MySQL partitioning
1. Choose the partition key carefully, or alter the current schema to
follow the MySQL partitioning rules.
2. The number of partitions you start with directly affects the I/O
operations on the disk.
3. If you use a hash-based algorithm with hash-based keys, you
cannot control which tenant goes where. This means you'll be in
trouble if two or more noisy customers fall within the same partition.
4. Make sure that every query contains the MySQL partition key. A
query without the partition key ends up scanning all the
partitions, and performance is sure to take a dive.
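Point 3 is easy to see with a toy model of hash partitioning (MySQL's HASH partitioning is essentially a modulo over the key; the tenant IDs here are made up):

```python
# Toy model of MySQL HASH partitioning: the partition is MOD(key, n),
# so you cannot choose which partition a given tenant lands in.
NUM_PARTITIONS = 4

def partition_for(tenant_id: int) -> int:
    return tenant_id % NUM_PARTITIONS

# Two heavy ("noisy") tenants whose IDs happen to collide modulo 4
# end up fighting for I/O on the same partition:
assert partition_for(8) == partition_for(12) == 0
```

With directory-based placement (covered later in the deck) you control the mapping explicitly, which is exactly what hash-based placement takes away.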
9. CACHING
Caching objects that rarely change in their lifetime
We cached ActiveRecord objects as well as HTML partials
(bits and pieces of HTML) using Memcached.
We chose Memcached because it scales well with multiple
clusters. The Memcached client you use makes a big
difference in response time, so we went with dalli.
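A minimal read-through cache sketch in the spirit of the above; a plain dict stands in for Memcached/dalli so the example stays self-contained, and the partial name is made up:

```python
# Read-through caching: return the cached value on a hit, compute and
# store it on a miss. Real deployments add TTLs and invalidation.
cache = {}

def fetch(key, compute):
    if key not in cache:
        cache[key] = compute()
    return cache[key]

calls = 0
def render_sidebar_partial():
    global calls
    calls += 1
    return "<div>sidebar</div>"   # stand-in for rendering an HTML partial

assert fetch("partial:sidebar", render_sidebar_partial) == "<div>sidebar</div>"
assert fetch("partial:sidebar", render_sidebar_partial) == "<div>sidebar</div>"
assert calls == 1                 # second call was served from the cache
```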
10. DISTRIBUTED FUNCTIONS
Keeping response time low by
using different storage engines for
different purposes
We started using Amazon Redshift for
analytics and data mining, and Redis to
store state information and background
jobs for Resque.
But because Redis can't scale out or fall
back, we don't use it for atomic operations.
11. But scaling vertically can only get you so far.
We decided that scaling horizontally by sharding was
the only cost-effective way to increase write scalability
beyond the instance size.
12. Two main concerns we had before we took the final call
on sharding:
1. No distributed transactions – We wanted all tenant
details to be in one shard.
2. Rebalancing the shards should be easy – We wanted
control over which tenant sits in which shard and to
be able to move them around when needed.
A little research showed us that directory-based
sharding was the only way to go.
13. REASONS FOR CHOOSING DIRECTORY-BASED SHARDING
It is simpler than hash-key-based or range-based sharding.
Rebalancing shards is easier here than in other methods.
14. A typical directory entry looks like this:

tenant_info        shard_details   shard_status
Stark Industries   shard1          Read & Write

• tenant_info - unique key referring to the DB entry
• shard_details - the shard in which that tenant exists
• shard_status - what kind of activity the tenant is ready for (we have
multiple shard statuses, like Not Ready, Only Reads, Read & Write, etc.)
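The directory can be pictured as a simple map keyed by tenant_info (an in-memory sketch using the slide's field names; the real directory is a database table):

```python
# In-memory stand-in for the directory table.
DIRECTORY = {
    "Stark Industries": {"shard_details": "shard1",
                         "shard_status": "Read & Write"},
}

def shard_for(tenant_info: str) -> str:
    """Resolve a tenant to its shard, refusing tenants not ready for traffic."""
    entry = DIRECTORY[tenant_info]
    if entry["shard_status"] == "Not Ready":
        raise RuntimeError(tenant_info + " is not ready for traffic")
    return entry["shard_details"]

assert shard_for("Stark Industries") == "shard1"
```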
15. How directory lookups work
The API wrapper is tuned to accept the
tenant information in multiple forms,
like tenant URL, tenant ID etc.
The sharding API even acts as a unique
ID generator, so that the tenant IDs it
generates are unique across shards.
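A rough sketch of such an API wrapper; the method names (register_tenant, lookup) and tenant URLs are hypothetical, and a single global counter stands in for whatever ID scheme the real service uses:

```python
import itertools

class ShardingAPI:
    """Sketch: resolves tenants given in several forms and hands out
    tenant IDs that are unique across all shards."""
    def __init__(self):
        self._ids = itertools.count(1)   # one global sequence, never per-shard
        self._by_url = {}                # tenant URL -> tenant ID
        self._shard_of = {}              # tenant ID  -> shard name

    def register_tenant(self, url: str, shard: str) -> int:
        tenant_id = next(self._ids)      # unique across shards by construction
        self._by_url[url] = tenant_id
        self._shard_of[tenant_id] = shard
        return tenant_id

    def lookup(self, tenant) -> str:
        # Accept either a tenant URL or a tenant ID, as the slide describes.
        tenant_id = self._by_url[tenant] if isinstance(tenant, str) else tenant
        return self._shard_of[tenant_id]

api = ShardingAPI()
a = api.register_tenant("wayne.example.com", "shard1")
b = api.register_tenant("stark.example.com", "shard1")
assert a != b
assert api.lookup("wayne.example.com") == api.lookup(a) == "shard1"
```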
16. Why we care about rebalancing
Sometimes a customer grows from processing 1,000 tickets per day
to 10,000 tickets per day, which affects the performance of the
whole shard.
We can't solve this by splitting the customer's data across multiple
shards, because we didn't want the mess of distributed transactions.
So, in these cases, we'd move the noisy customer to a shard of its
own. That way, everybody wins.
18. Every shard will have its own slave to scale the reads.
For example, say Wayne Enterprises and Stark Industries
are in shard1. The directory entries look like this:

Wayne Enterprises   shard1   Read & Write
Stark Industries    shard1   Read & Write
19. If Wayne Enterprises grows at a breakneck
pace, we would decide to move it to another shard
(averting the danger of Bruce Wayne and Tony Stark
being mad at us at the same time).
20. So we would boot up a new slave of shard1 and call it
shard2. Then, we'd attach a read replica to the new
slave and wait for it to sync with the master.
21. We would then stop the writes for Wayne Enterprises
by changing the shard status in the directory:

Wayne Enterprises   shard1   Read Only
Stark Industries    shard1   Read & Write
22. Then we would stop the replication of master data to
shard2 and promote it to master.
Now the directory entry is updated accordingly:

Wayne Enterprises   shard2   Read & Write
Stark Industries    shard1   Read & Write
23. This effectively moves Wayne Enterprises to its own shard.
Batman is happy and so is Iron Man.
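The whole rebalancing walkthrough condenses into a small sketch; the dict stands in for the directory database, and the replication steps are only noted in comments since they happen at the MySQL level:

```python
# Directory-driven shard rebalancing, following slides 18-23.
directory = {
    "Wayne Enterprises": {"shard": "shard1", "status": "Read & Write"},
    "Stark Industries":  {"shard": "shard1", "status": "Read & Write"},
}

def move_tenant(tenant: str, new_shard: str) -> None:
    # (At the MySQL level: boot new_shard as a replica of the tenant's
    # current shard and wait for it to sync with the master.)
    # Stop writes for the tenant by flipping its directory status:
    directory[tenant]["status"] = "Read Only"
    # (At the MySQL level: stop replication and promote new_shard to master.)
    # Point the tenant at the new shard and re-enable writes:
    directory[tenant] = {"shard": new_shard, "status": "Read & Write"}

move_tenant("Wayne Enterprises", "shard2")
assert directory["Wayne Enterprises"] == {"shard": "shard2",
                                          "status": "Read & Write"}
assert directory["Stark Industries"]["shard"] == "shard1"   # untouched
```

Because every lookup goes through the directory, the tenant's brief Read Only window is the only visible downtime, and no other tenant is affected.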
24. Word of caution
1. Don't shard unless it's absolutely necessary. You will have to
rewrite code for your whole app, and maintain it.
2. If writes are not a problem, you could use functional partitioning
(moving an over-sized table to another DB altogether) to avoid
sharding completely.
3. Choosing the right sharding algorithm is a bit tricky, as each has
its own benefits and drawbacks. Make a thorough study of all your
requirements while picking one.
4. You will have to take care of unique ID generation across shards.
25. What's next for Freshdesk
We get 250,000 tickets across Freshdesk every day, and 100 million
queries during the same time (with a peak of 3-4k QPS). We now have
a separate shard for all new sign-ups, and each shard can carry
roughly 20,000 tenants.
In the future, we'd like to explore a multi-pod architecture, and
also look at a proxy architecture using MySQL Fabric, ScaleBase, etc.
26. “Behind every slideshare is a
great blogpost”
Read more about scaling Freshdesk here:
http://blog.freshdesk.com/how-freshdesk-scaled-using-sharding/