3. How do sites with a social networking angle figure globally?
As ranked by Alexa global site ranking:
- Facebook: 2
- YouTube: 3
- Yahoo: 4
- Windows Live: 5
- Blogger: 7
- Wikipedia: 8
- Twitter: 10
4. 3 Principles
3 common principles:
- Fast feature delivery is key
- Cache everything everywhere
- Relational data is dead
5. Interesting stats
- Facebook: serves 120 million queries per second without a single join
- 37signals: developed a production application serving over 4 million items using only 579 lines of code
- Flickr: 2 billion photos served without using relational databases
6. How did they do it?
- Nobody thought this was possible
- Unencumbered by history or restrictive rules
- Had to be creative in solving problems that nobody had experienced, using very little capital outlay
7. 3 Principles
- Fast feature delivery is key
- Cache everything everywhere
- Relational data is dead
8. Fast feature delivery is key
- Choose an appropriate language
- Speed of development is more important than speed of execution
- Languages like PHP and Ruby are commonly used for rapid development and deployment
10. 3 Principles
- Fast feature delivery is key
- Cache everything everywhere
- Relational data is dead
11. Cache everything everywhere
- You need a really good reason not to cache data for reading
- Local caching is a good start, but with more than one server it means:
  - duplicating the cache
  - no group invalidation
  - memory limited to the spare RAM on each server
- Most social networks use a distributed cache like memcached
12. Cache everything everywhere
- Check if the information is in the cache. If so, use it
- If not, query the database and put the result in the cache
- On update, delete from the cache. The next user goes to the database

    function get_foo(int userid) {
        result = memcached_fetch("userrow:" + userid);
        if (!result) {
            result = db_select("SELECT * FROM users WHERE userid = ?", userid);
            memcached_add("userrow:" + userid, result);
        }
        return result;
    }
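The three steps on this slide (read from the cache, fall back to the database, delete on update) can be sketched in Python. This is a minimal illustration under stated assumptions, not how any of the sites mentioned implement it: a plain dict stands in for a memcached client, an in-memory SQLite database stands in for the real database, and the names `get_user`, `rename_user` and `stats` are hypothetical.

```python
import sqlite3

# Assumption: a plain dict stands in for a memcached client.
cache = {}

# Assumption: an in-memory SQLite database stands in for the real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")
db.commit()

# Counts how often a read actually reaches the database.
stats = {"db_reads": 0}


def get_user(userid):
    """Return the user row, querying the database only on a cache miss."""
    key = f"userrow:{userid}"
    row = cache.get(key)
    if row is None:
        stats["db_reads"] += 1
        row = db.execute(
            "SELECT userid, name FROM users WHERE userid = ?", (userid,)
        ).fetchone()
        if row is not None:
            cache[key] = row  # populate the cache for the next reader
    return row


def rename_user(userid, name):
    """Update the database, then delete the cached copy."""
    db.execute("UPDATE users SET name = ? WHERE userid = ?", (name, userid))
    db.commit()
    cache.pop(f"userrow:{userid}", None)  # next read goes to the database
```

Calling `get_user(1)` twice reaches the database only once; after `rename_user(1, "bob")` the next `get_user(1)` misses the cache and reloads the fresh row.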
16. Relational issue No 1 - Normalisation
Relational databases do not scale well because of normalisation
Why normalise?
- reduce storage space
- reduce anomalies
Today
- storage is cheap
- as data gets larger, joins are expensive
17. Relational issue No 2 - Transactions
- ACID principles govern transactions
- Relational databases do not scale well because of transactions
18. After relational
- Use BASE (basically available, soft state, eventually consistent)
- Shard data
- Favour name-value pair stores over relational databases
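As a rough illustration of sharding a name-value store, the sketch below spreads keys across several shards by hashing the key. It is an assumption-laden toy: each dict stands in for a separate key-value server, the `ShardedStore` class is hypothetical, and simple modulo placement is used only because it is the shortest scheme to show.

```python
import hashlib


class ShardedStore:
    """Name-value store spread across several shards by hashing the key."""

    def __init__(self, shard_count):
        # Assumption: each dict stands in for one key-value server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedStore(shard_count=4)
for userid in range(100):
    store.put(f"user:{userid}", {"userid": userid})
```

With modulo placement, changing the number of shards moves most keys to a different shard; consistent hashing is a common alternative when shards are added or removed often.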
19. Lessons for enterprise
- The answer to any software design question should always be "it depends"
- Test your most basic assumptions
- Dynamic languages and frameworks may be suitable to deliver a feature quickly
- You don't need an RDBMS for everything, especially if you need huge scale
- You should always cache data for reads (unless you shouldn't)
Looked at the top 10 sites on the web and found 7 with social networking aspects.
Other: Google 1, Baidu 6, QQ.com 9
Decided to look at the traffic and found some very interesting stats:
- Facebook: 200 million active users and 50 billion page views per month
- YouTube: over 1 billion views per day
- Basecamp: 2 million active accounts and 1.3 million projects managed
- Twitter: 1 million+ users and 3 million tweets per day
It should be noted that neither PHP nor Ruby is among the most efficient languages, as they are not compiled (both are interpreted: they are executed by an interpreter rather than directly by the CPU).
Sites like Twitter and Yellowpages.com are written using Ruby on Rails.
Ta-da List: so much is built into the framework that a full production app can be developed with very little code.
Some treat language as a religion. It's OK to try something different; it doesn't define you as a person.
Duplicating the cache is a waste of memory.
No group invalidation means you either need to notify all of your servers that they need to refresh their cache or rely solely on cache timeouts.
Memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is used by: Facebook, YouTube, Wikipedia, LiveJournal, Digg, Twitter, SourceForge.
Most site founders said that the biggest gain was from implementing a caching layer.
There is a significant penalty in going to disk for every read as opposed to reading from the cache.
Implementing a cache is extremely easy, as shown by the code above.
Great for reading data, but you still have to write data.
All about responsiveness.
Users won't tolerate long waits on social networks.
They are now expecting this behaviour from all software.
To prevent anomalies we don't duplicate data. We split everything up so it is stored once. The price of normalisation is that when we want a person's address we have to go find the person and their address and bring the data together again. This is called a join. Joins are relatively slow, especially over very large data sets. This affects not just reads (caching takes care of those) but also create, update and delete (CUD) operations.
Flickr decided to denormalise because it took 13 selects for each insert, delete or update.
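The person-and-address example in this note can be made concrete with a small sketch. It assumes hypothetical `person`, `address` and `person_flat` tables in an in-memory SQLite database, and only contrasts the shape of the two reads; it says nothing about their relative speed at scale.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Normalised: each fact is stored once, in its own table.
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (person_id INTEGER, city TEXT);
    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO address VALUES (1, 'Leeds');

    -- Denormalised: the city is copied into the person row.
    CREATE TABLE person_flat (
        person_id INTEGER PRIMARY KEY, name TEXT, city TEXT
    );
    INSERT INTO person_flat VALUES (1, 'Ann', 'Leeds');
""")

# Normalised read: bring the two tables back together with a join.
joined = db.execute(
    "SELECT p.name, a.city FROM person p "
    "JOIN address a ON a.person_id = p.person_id "
    "WHERE p.person_id = ?",
    (1,),
).fetchone()

# Denormalised read: a single-table lookup, no join needed.
flat = db.execute(
    "SELECT name, city FROM person_flat WHERE person_id = ?", (1,)
).fetchone()
```

Both reads return the same data. The trade-off is that the denormalised copy of the city must be kept in step by application code whenever the address changes.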
eBay do not use transactions; they have so much data that distributed transactions would harm responsiveness. Referential integrity and sorting are done in application code.
- Atomicity: all parts of a transaction succeed or none of them succeed.
- Consistency: the database will be in a consistent state when the transaction begins and ends.
- Isolation: the transaction will behave as if it is the only operation being performed upon the database.
- Durability: upon completion of the transaction, the operation will not be reversed.
Facebook has 4500 database servers.
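To illustrate what atomicity buys you (and what application code has to handle once transactions are dropped), here is a minimal sketch using Python's built-in `sqlite3` module. The `account` table and the `transfer` and `balances` helpers are hypothetical; a single local SQLite database is assumed, which avoids the distributed-transaction cost discussed above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
db.execute("INSERT INTO account VALUES ('a', 100)")
db.execute("INSERT INTO account VALUES ('b', 0)")
db.commit()


def transfer(src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            db.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone.
        return False


def balances():
    """Return the current balances as a name -> balance dict."""
    return dict(db.execute("SELECT name, balance FROM account"))
```

An overdrawing call such as `transfer("a", "b", 150)` credits `b` first, fails on the debit, and leaves both balances unchanged because the whole transaction is rolled back.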
All solutions are slightly different.
The same challenge in 5 years may have a totally different solution (hardware/software changes).
Need fresh ideas, otherwise we'll copy the mistakes of others.