The Boston Riak meetup featured Sean Kelly from Tapjoy, digging into message queue infrastructure at the company. Tapjoy processes billions of requests a day, and queuing is an important element of that scale.
To kick us off, we discussed the basics of message queues, distributed systems and why dual writes are evil. Here is that talk with a few links to get you started.
I find this personally and professionally interesting.
I’m going to make sure we’re all starting from the same assumptions by discussing common factors in the state of data management.
And then we’ll work through a disturbingly common pattern that our systems end up in. From this pain point, we’ll look at some of the structural considerations for your application.
IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010.[3] Computer World states that unstructured information might account for more than 70%–80% of all data in organizations.[4]
I’ve implemented exactly zero of what I’m talking about. What I do offer is the good fortune of speaking to people who build these systems, basically non-stop. There is a lot to learn from just listening.
I’ve spoken to hundreds of developers from companies of every shape and size. I’ve argued with ops engineers; I’ve listened to data scientists. I’ve read eight years of posts, going back to Amazon’s 2007 Dynamo paper, which Basho designed Riak after.
And I have the good fortune to listen in to a ton of conversations.
Our database at Basho, Riak, is used by many companies to store everything from session data to log aggregation. In these conversations, I always pivot to asking about their architecture - the how, the why, and what could be better.
You’ll also see some hand-drawn slides, courtesy of Martin Kleppmann. He gave me permission to reuse his work after I tweeted at him, and I want to give back by letting you know about his book: Designing Data-Intensive Applications is a must-read.
“The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.”
What I’m going to talk about today isn’t really new — some people have known about these ideas for a long time. However, they aren’t as widely known as they should be. If you work on a non-trivial application, something with more than just one database, you’ll probably find these ideas very useful.
We start with a simple web app. It has multiple clients for HTTP and native mobile.
This is all successfully stored on our familiar RDBMS.
And we’re successful!
But success comes with more demand. Demand means we need to speed things up.
Let’s assume that you’re working on a web application. In the simplest case, it probably has the stereotypical three-tier architecture: you have some clients (which may be web browsers, or mobile apps, or both), which make requests to a web application running on your servers. The web application is where your application code or “business logic” lives.
So we add a cache. We see performance improve for our users and all is well again. Then another need arrives.
Perhaps you get more users, making more requests, your database gets slow, and you add a cache to speed it up – perhaps memcached or Redis, for example.
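The cache-aside pattern described above can be sketched in a few lines. This is a minimal illustration, not production code: a plain dict stands in for memcached or Redis, and `query_database` is a hypothetical placeholder for your real (slow) database call.

```python
cache = {}

def query_database(user_id):
    # Stand-in for an expensive SQL query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    if user_id in cache:            # cache hit: skip the database entirely
        return cache[user_id]
    row = query_database(user_id)   # cache miss: fall through to the database
    cache[user_id] = row            # populate the cache for next time
    return row

print(get_user(42))  # first call misses and hits the database
print(get_user(42))  # second call is served from the cache
```

Note that the application is now responsible for keeping the cache consistent with the database - a responsibility that becomes a problem later in this talk.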
We need search, which our RDBMS was not scoped to handle or does not give us the semantics we want, so we add a search solution like Apache Solr or Elasticsearch.
Perhaps you need to add full-text search to your application, and the basic search facility built into your database is not good enough, so you end up setting up a separate indexing service such as Elasticsearch or Solr.
Perhaps you need to move some expensive operations out of the web request flow, and into an asynchronous background process, so you add a message queue which lets you send jobs to your background workers.
ActiveMQ, RabbitMQ, or something home-grown on top of Redis…
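The shape of that background-worker pattern is simple, whatever broker you pick. Here's a toy in-process sketch - a `deque` stands in for RabbitMQ or a Redis list, and in production the web process and the worker would of course be separate processes on separate machines:

```python
from collections import deque

jobs = deque()  # stand-in for the real message broker

def enqueue(job):
    # Called from the web request path; returns immediately.
    jobs.append(job)

def worker_loop():
    # The background worker drains the queue outside the request flow.
    results = []
    while jobs:
        job = jobs.popleft()
        results.append(f"sent email to {job['to']}")  # the expensive work
    return results

enqueue({"to": "alice@example.com"})
enqueue({"to": "bob@example.com"})
print(worker_loop())
```

The web request only pays the cost of the append; the slow work happens asynchronously.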
Now that your business analytics are working, you find that your search system is no longer keeping up… but you realise that since you have all the data in HDFS anyway, you could actually build your search indexes in Hadoop and push them out to the search servers, and the system just keeps getting more and more complicated…
…and the result is complete and utter insanity.
We’re left with an incoherent jumble of services that all communicate with essentially the same data. Updates are terrifying because we fear the complexity we’ve relied on.
How did we get into that state? How did we end up with such complexity, where everything is calling everything else, and nobody understands what is going on?
It’s not that any particular decision we made along the way was bad. There is no one database or tool that can do everything that our application requires – we use the best tool for the job, and for an application with a variety of features that implies using a variety of tools.
Also, as a system grows, you need a way of decomposing it into smaller components in order to keep it manageable. That’s what microservices are all about. But if your system becomes a tangled mess of interdependent components, that’s not manageable either.
So how do we keep these different data systems in sync? There are a few different techniques.
A popular approach is so-called dual writes:
Dual writes is simple: it’s your application code’s responsibility to update data in all the right places. For example, if a user submits some data to your web app, there’s some code in the web app that first writes the data to your database, then invalidates or refreshes the appropriate cache entries, then re-indexes the document in your full-text search index, and so on. (Or maybe it does those things in parallel – doesn’t matter for our purposes.)
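In code, the dual-write approach looks something like this sketch, where two dicts stand in for the real database and search index. The point is only the shape: the application code itself performs each write, one after another, with nothing coordinating them.

```python
db = {}            # stand-in for the relational database
search_index = {}  # stand-in for Elasticsearch/Solr

def handle_submit(key, value):
    db[key] = value            # write 1: the database
    search_index[key] = value  # write 2: the search index
    # ...then invalidate caches, and so on. If the process dies between
    # these writes, or another request interleaves, the stores diverge.

handle_submit("X", "A")
```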
The dual writes approach is popular because it’s easy to build, and it more or less works at first. But I’d like to argue that it’s a really bad idea, because it has some fundamental problems. The first problem is race conditions.
The following diagram shows two clients making dual writes to two datastores. Time flows from left to right, following the black arrows:
Here, the first client (teal) is setting the key X to be some value A. They first make a request to the first datastore – perhaps that’s the database, for example – and set X=A. The datastore responds saying the write was successful. Then the client makes a request to the second datastore – perhaps that’s the search index – and also sets X=A.
At the same time as this is happening, another client (red) is also active. It wants to write to the same key X, but it wants to set the key to a different value B. The client proceeds in the same way: it first sends a request X=B to the first datastore, and then sends a request X=B to the second datastore.
All these writes are successful. However, look at what value is stored in each database over time:
In the first datastore, the value is first set to A by the teal client, and then set to B by the red client, so the final value is B.
In the second datastore, the requests arrive in a different order: the value is first set to B, and then set to A, so the final value is A. Now the two datastores are inconsistent with each other, and they will permanently remain inconsistent until sometime later someone comes and overwrites X again.
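We can replay the diagram's interleaving directly. Dicts stand in for the two datastores; the only thing that differs between them is the order in which the two clients' requests arrive:

```python
database = {}      # first datastore
search_index = {}  # second datastore

# Arrival order at the database: teal (X=A) first, then red (X=B).
database["X"] = "A"
database["X"] = "B"

# Arrival order at the search index: red (X=B) first, then teal (X=A).
search_index["X"] = "B"
search_index["X"] = "A"

# Every individual write "succeeded", yet the stores now disagree:
print(database["X"])      # B
print(search_index["X"])  # A
```

No write failed and no error was raised, which is exactly why this kind of divergence goes unnoticed.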
And the worst thing: you probably won’t even notice that your database and your search indexes have gone out of sync, because no errors occurred. You’ll probably only realize six months later, while you’re doing something completely different, that your database and your indexes don’t match up, and you’ll have no idea how that could have happened.
In this case, the most straightforward approach is quite fundamentally flawed.
We need to balance the availability of information, what queryable state it is in, and whether or not we can afford the complexity.
The same information in many places.
The basics of all these choices is an ability to move out of the synchronous, low-latency flow of an application request.
We have many choices and many angles to our balancing act to keep in mind. So let’s walk through a few that are incredibly important in the choice of your database.
Message queue is the general name for what acts as a data highway from your applications to your database services, keeping data synchronized and avoiding the insanity architecture above.
NoSQL tells you nothing about what’s important. We’ll get into that further.
Hadoop is actually a collection of tools, not a single solution in and of itself.
The Hadoop Distributed Filesystem is a multi-server filesystem designed for high throughput and high latency. It tolerates some failure scenarios and falls into the Data Warehouse world.
Unlike NoSQL, there are no low latency applications that write and then read from HDFS! That’s not what it’s intended to do.
Map/Reduce is fundamentally a querying system designed for parallel computation. It’s loved for getting people off of multi-million-dollar systems and allowing them to scale out, and it comes at the cost of its own mapper and reducer design.
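The mapper/reducer shape is easiest to see in the classic word-count example, collapsed here into a single process. A real Hadoop job shards the map and reduce phases across machines; this sketch only shows the programming model.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Sum all the counts emitted for a given word.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]

shuffled = defaultdict(list)
for line in lines:
    for word, count in mapper(line):  # map phase
        shuffled[word].append(count)  # shuffle: group values by key

result = dict(reducer(w, c) for w, c in shuffled.items())  # reduce phase
print(result["the"])  # 2
```

Everything must be expressed as these two functions - that's the "cost of its own mapper and reducer design" mentioned above.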
Spark, another Apache project, is largely recognized as the successor to Map/Reduce. It provides backwards compatibility with map/reduce-style jobs while also exposing the data science processing available in its clients, Python and Scala. Data is pulled from disk and manipulated in memory.
YARN = framework for job parallelization
All of these are often misapplied to the same problem set.
http://java.dzone.com/articles/exploring-message-brokers
Apache ActiveMQ is the most popular and powerful open source messaging and Integration Patterns server.
Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License
Supported by Pivotal.
Robust messaging for applications
Easy to use
Runs on all major operating systems
Supports a huge number of developer platforms
Open source and commercially supported
Supported by Confluent.io - founded at LinkedIn.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
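Kafka's core abstraction - the one that makes it different from a traditional queue - is a partitioned, append-only log that consumers read at their own pace by tracking an offset. This is a toy conceptual model, not the Kafka client API:

```python
log = []  # one partition of one topic: an append-only list of messages

def produce(message):
    # Producers only ever append to the end of the log.
    log.append(message)

def consume(offset, max_messages=10):
    # Consumers pull from a given offset. Reading does not remove messages,
    # so many independent consumers can replay the same stream.
    batch = log[offset:offset + max_messages]
    return batch, offset + len(batch)

produce("X=A")
produce("X=B")

search_batch, search_offset = consume(0)  # the search indexer reads from 0
cache_batch, cache_offset = consume(0)    # the cache warmer reads the same data
print(search_batch)  # ['X=A', 'X=B'], and every consumer sees this same order
```

Because every consumer sees the writes in the same order, the race condition from the dual-writes example cannot produce diverging datastores.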
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
Not a question so much as a challenge for you: Get Hands-on with technology, right away.
Sometimes you just have to experience a system first-hand to see its value. Don’t be scared of it. Whether you have the benefit of choosing an open source solution or simply need to spin up a server to test something, go use it right now. Don’t wait to architect the perfect solution, because so much of what you’ll need will come from using it.