Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1awkL99.
Details on Pinterest's architeture, its systems -Pinball, Frontdoor-, and stack - MongoDB, Cassandra, Memcache, Redis, Flume, Kafka, EMR, Qubole, Redshift, Python, Java, Go, Nutcracker, Puppet, etc. Filmed at qconsf.com.
Yash Nelapati is an infrastructure engineer at Pinterest where he focusses on scalability, capacity planning and architecture. Prior to Pinterest he was into web development and rapidly prototyping UI. Marty Weiner joined Pinterest in early 2011 as the 2nd engineer. Previously worked at Azul Systems as a VM engineer focused on building/improving the JIT compilers in HotSpot.
2. Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/scaling-pinterest
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
3. Presented at QCon San Francisco
www.qconsf.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
5. Growth
March 2010
Page views per day
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
6. Growth
March 2010
Page views per day
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
7. Growth
March 2010
Page views per day
·
·
·
·
RackSpace
1 small Web Engine
1 small MySQL DB
1 Engineer + 2 Founders
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
20. Growth
April 2012
Page views per day
Mar 2010
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
21. Growth
April 2012
Page views per day
Mar 2010
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
22. Growth
April 2012
Page views per day
·
·
·
·
Amazon EC2 + S3 + Edge Cast
135 Web Engines + 75 API Engines
10 Service Instances
80 MySQL DBs (m1.xlarge) + 1 slave
each
·
·
·
110 Redis Instances
60 Memcache Instances
2 Redis Task Manager + 60 Task
Mar 2010
Processors
·
3rd party sharded Solr
Scaling Pinterest
Monday, November 11, 13
Mar 2010
Jan 2011
Jan 2012
May 2012
23. Growth
April 2012
Page views per day
·
12 Engineers
·
·
·
·
·
1 Data Infrastructure
1 Ops
2 Mobile
8 Generalists
10 Non-Engineers
Mar 2010
Mar 2010
Scaling Pinterest
Monday, November 11, 13
Jan 2011
Jan 2012
May 2012
27. Growth
April 2013
·
·
Page views per day
Amazon EC2 + S3 + Edge Cast
400+ Web Engines + 400+ API
Engines
·
70+ MySQL DBs (hi.4xlarge on SSDs)
+ 1 slave each
·
·
·
100+ Redis Instances
230+ Memcache Instances
10 Redis Task Manager + 500 Task
Processors
·
65+ Engineers (130+ total)
April 2012
Scaling Pinterest
Monday, November 11, 13
April 2013
28. Growth
April 2013
·
·
Page views per day
Amazon EC2 + S3 + Edge Cast
400+ Web Engines + 400+ API
Engines
·
70+ MySQL DBs (hi.4xlarge on SSDs)
+ 1 slave each
·
·
·
100+ Redis Instances
230+ Memcache Instances
10 Redis Task Manager + 500 Task
Processors
·
·
·
·
·
·
·
65+ Engineers (130+ total)
8 services (80 instances)
Sharded Solr
20 HBase
12 Kafka + Azkabhan
8 Zookeeper Instances
12 Varnish
Scaling Pinterest
Monday, November 11, 13
April 2012
April 2013
29. Growth
April 2013
·
65+ Engineers
·
·
·
·
·
·
·
·
·
·
Page views per day
7 Data Infrastructure + Science
7 Search and Discovery
9 Business and Platform
6 Spam, Abuse, Security
9 Web
9 Mobile
2 growth
10 Infrastructure
6 Ops
65+ Non-Engineers
Scaling Pinterest
Monday, November 11, 13
April 2012
April 2013
36. Choosing
Your
Tech
Questions to ask
• Does it meet your needs?
• How mature is the product?
• Is it commonly used? Can you hire people who have used it?
• Is the community active?
• How robust is it to failure?
• How well does it scale? Will you be the biggest user?
• Does it have a good debugging tools? Profiler? Backup
software?
• Is the cost justified?
Scaling Pinterest
Monday, November 11, 13
37. Hosting
Why Amazon Web Services (AWS)?
• Variety of servers running Linux
• Very good peripherals: load balancing, DNS, map
reduce, basic security, and more
• Good reliability
• Very active dev community
• Not cheap, but...
Scaling Pinterest
Monday, November 11, 13
38. Hosting
Why Amazon Web Services (AWS)?
• Variety of servers running Linux
• Very good peripherals: load balancing, DNS, map
reduce, basic security, and more
• Good reliability
• Very active dev community
• Not cheap, but...
• New instances ready in seconds
Scaling Pinterest
Monday, November 11, 13
39. Hosting
AWS Usage
• Route 53 for DNS
• ELB for 1st tier load balance
• EC2 Ubuntu Linux
• Varnish layer
• All web, API, background appliances
• All services
• All databases and caches
• S3 for images, logs
Scaling Pinterest
Monday, November 11, 13
40. Code
Why Python?
• Extremely mature
• Well known and well liked
• Solid active community
• Very good libraries specifically targeted to web
development
• Effective rapid prototyping
• Open Source
Scaling Pinterest
Monday, November 11, 13
41. Code
Why Python?
• Extremely mature
• Well known and well liked
• Solid active community
• Very good libraries specifically targeted to web
development
• Effective rapid prototyping
• Open Source
Some Java and Go...
• Faster, lower variance response time
Scaling Pinterest
Monday, November 11, 13
42. Code
Python Usage
• All web backend, API, and related business logic
• Most services
Scaling Pinterest
Monday, November 11, 13
43. Code
Python Usage
• All web backend, API, and related business logic
• Most services
Java and Go Usage
• Varnish plugins
• Search indexers
• High frequency services (e.g., MySQL service)
Scaling Pinterest
Monday, November 11, 13
44. Production
Data
Why MySQL and Memcache?
• Extremely mature
• Well known and well liked
• (MySQL) Rarely catastrophic loss of data
• Response time to request rate increases linearly
• Very good software support: XtraBackup, Innotop, Maatkit
• Solid active community
• Open Source
Scaling Pinterest
Monday, November 11, 13
45. Production
Data
MySQL and Memcache Usage
• Storage / Caching of core data
• Users, boards, pins, comments, domains
• Mappings (e.g., users to boards, user likes, repin info)
• Legal compliance data
Scaling Pinterest
Monday, November 11, 13
46. Production
Data
Why Redis?
• Well known and well liked
• Active community
• Consistently good performance
• Variety of convenient and efficient data structures
• 3 Flavors of Persistence: Now, Snapshot, Never
• Open Source
Scaling Pinterest
Monday, November 11, 13
47. Production
Data
Redis Usage
• Follower data
• Configurations
• Public feed pin IDs
• Caching of various core mappings (e.g., board to pins)
Scaling Pinterest
Monday, November 11, 13
48. Production
Data
Why HBase?
• Small, but growing loyal community
• Difficult to hire for, but...
• Non-volatile, O(1), extremely fast and efficient storage
• Strong Hadoop integration
• Consistently good performance
• Used by Facebook (bigger than us)
• Seems to work well
• Open Source
Scaling Pinterest
Monday, November 11, 13
49. Production
Data
HBase Usage
• User feeds (pin IDs are pushed to feeds)
• Rich pin details
• Spam features
• User relationships to pins
Scaling Pinterest
Monday, November 11, 13
50. Production
Data
What happened to Cassandra,
Mongo, ES, and Membase?
• Does it meet your needs?
• How mature is the product?
• Is it commonly used? Can you hire people who have used it?
• Is the community active? Can you get help?
• How robust is it to failure?
• How well does it scale? Will you be the biggest user?
• Does it have a good debugging tools? Profiler? Backup
software?
• Is the cost justified?
Scaling Pinterest
Monday, November 11, 13
52. A 2nd
Chance
Stuff we could have done better
• Logging on day 1 (StatsD, Kafka, Map Reduce)
• Log every request, event, signup
• Basic analytics
• Recovery from data corruption or failure
• Alerting on day 1
Scaling Pinterest
Monday, November 11, 13
53. A 2nd
Chance
Stuff we could have done better
• Shard our MySQL storage much earlier
• Once you start relying on read slaves, start the
timebomb countdown
• We also fell into the NoSQL trap (Membase,
Cassandra, Mongo, etc)
• Pyres for background tasks day 1
• Hire technical operations eng earlier
• Chef / Puppet earlier
• Unit testing earlier (Jenkins for builds)
Scaling Pinterest
Monday, November 11, 13
54. A 2nd
Chance
Stuff we could have done better
• A/B testing earlier
• Decider on top of Zookeeper WATCH
• Progressive roll out
• Kill switches
Scaling Pinterest
Monday, November 11, 13
55. What’s
next?
Looking Forward
• Continually improve Pinner experience
• Help Pinners discover more of the things they love
• Better uptime and lower latency
• Faster development times
• Reduce spam and abuse
• Continually improve collaboration and build bigger,
better, faster products
• 180 Pinployees and beyond
Scaling Pinterest
Monday, November 11, 13
59. My 2nd
Chance
If I could do it all over again...
• Stronger ACID transactional guarantees across multiple
systems
• Currently have: sometimes A, best effort C, I, D,
no silent failure
• Want: sometimes A, eventual C, I, D, no silent
completion
Scaling Pinterest
Monday, November 11, 13
60. My 2nd
Chance
Transactional tasks
• All tasks become a dependency tree of repeatable
synchronous or asynchronous actions
• All actions must be repeatable
• Otherwise, must add repeatability
• All tasks get a unique transaction number
• Counters are tricky
Scaling Pinterest
Monday, November 11, 13
61. My 2nd
Chance
Transactional tasks
• All tasks become a dependency tree of repeatable
synchronous or asynchronous actions
• Sync actions are executed in order
• Async actions are executed in any order
• Repeat until successful or too many failures
• Too many failures -> put in per task failure queue
• Gives eventual C, I, D
• No silent completion and A require extra effort
Scaling Pinterest
Monday, November 11, 13
62. My 2nd
Chance
Transactional tasks example
• Pin create sync
• Write empty pin object
• Write pin ID to board, likes, user’s pins, clear caches
• Write pin object
• Pin not shown until pin object created -> Atomicity!
Scaling Pinterest
Monday, November 11, 13
63. My 2nd
Chance
Transactional tasks example
• Pin create async
• Write pin to required user feeds and public feeds
• Feeds are sorted sets. Reinsertion is okay.
• Send emails, Facebook Likes, Twitter Tweets
• Before send, check / record in temporary storage
-> Gives (temporary) repeatability
Scaling Pinterest
Monday, November 11, 13
64. Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/scalingpinterest