The document discusses LinkedIn's use of Apache Kafka as a central data pipeline to integrate their variety of real-time user data streams. Some key points:
- LinkedIn uses Kafka to ingest over 28 billion messages per day from various data sources like user activity and system metrics.
- Kafka provides a scalable central data pipeline that supports high throughput rates of hundreds of thousands to millions of messages per second.
- LinkedIn standardizes on the Avro data format for schemas and pushes data cleaning upstream by producers.
- They ensure correctness through an audit trail and evidence-based approach of validating that all messages reach consumers.
HADOOP SUMMIT 2013
Hadoop data load (Camus)
Open sourced:
– https://github.com/linkedin/camus
One job loads all events
~10 minute ETA on average from producer to HDFS
Hive registration done automatically
Schema evolution handled transparently
Talk about our data pipeline, the motivations behind it and how we built it out using Apache Kafka.
LinkedIn, like most web companies, derives a lot of value from tracking user activity – page views, clicks, ad impressions and so on. In fact some of this activity data is visible directly, in one form or another, in your own network update stream. People in your network may add a new connection or share a URL – and you ideally want to see these updates as soon as possible, ideally in real time. This user activity feed is a useful user-facing product in and of itself, but it is much more important than that. LinkedIn is a data-centric company, and this activity data is also a hugely valuable ingredient in other data-driven products.
We use data to provide a richer, more relevant experience to members; that engages our users more, which in turn generates more activity data, and we get into a perpetual self-feeding cycle of successive refinement.
That principle manifests in products such as PYMK (People You May Know), which is only as engaging or as useful as it is relevant. If you only see people who are unrelated to you or whom you don’t care about, it is pointless. A relevant suggestion leads to connecting with that person, triggering a connect event; you may click on their profile or company page, triggering more activity events. So each page view can directly or indirectly result in additional front-end reads and considerable downstream activity – i.e., calls into backend services. In other words, a simple page view (intuitively a read operation), when tracked as user activity, results in a bunch of writes within your activity data pipeline. The data pipeline is not solely for activity tracking, though.
It is also important to have a metrics feed for tracking your system and application metrics and logs. This is critical for monitoring the health trends of your production services, both low-level and application-level; log and service-call data also feeds tools such as service call graphs. User activity and system metrics are just two kinds of data that you might want in your data pipeline – there are several others – and there is a whole bunch of data-driven systems that need to feed off these streams.
So that is really the key problem we want to solve: integrating these different data streams and making them easily available, preferably in real time, to each data-driven system. What happens at most companies, including LinkedIn, is that we end up building specialized data systems to handle each type of data, and very soon you end up with an architecture that looks something like this:
… where there is a different solution or pipeline for each type of data. In this picture, the data sources are above and the data-driven systems (the consumers) are below. For example, for operational metrics we used JMX feeding into Zenoss, we had a separate user activity tracking system (which I’ll talk more about a little later), and we used Splunk for scraping and searching logs. There are a number of data-driven systems that are directly user-facing, some mid-tier, and some more backend. For many of these systems it is important to have access to the data in near real time. Take security systems, for example, which need to consume user activity events from the user tracking system, detect anomalous or malicious patterns, and react quickly. Likewise, for search systems to provide more relevant results, the indexed content should be as fresh as possible. Recommendation systems do a better job of providing relevant results if signals from activity data are incorporated early on. So to fulfill these use cases we ended up with tight coupling between the sources of the various types of data and the specialized data-driven systems that feed off that data. The cons: universal access to the data requires O(n²) point-to-point pipelines, sources and systems are tightly coupled, and every pair of endpoints needs to know how to talk to each other.
To drive home these points a little more clearly, I’ll provide some details on our previous (specialized) user activity data pipeline. Front-end applications would post XML blobs containing activity data to an HTTP-based logging service; activity logs were scraped from this service and periodically rsync’d over to staging servers in the offline data centers, where the ETL process took place. This pipeline had a number of limitations:
– The logging service did not provide real-time access to the data sent to it; it just served as a point of aggregation. Other data systems in the live data center could not feed off this activity data directly – as this diagram shows, the only consumers of the user activity data were ultimately the offline systems.
– The data flowing through the pipeline was raw XML. Producers used whatever structure they wanted, and a single data warehouse team had the job of sucking in all this raw activity data and cleaning it into something beautiful that represents everything about the business. That team was cleaning data produced by tons of producers without being well versed in it, while the producers did not know what constitutes clean data amenable to treatment at the ETL stage.
– The data flow was fragile with respect to schema changes, labor intensive, and unscalable at the human layer: people at the batch layer build data-rich products that depend on a number of data sources – say, 50+ – so things were highly likely to break if the data format changed in any one of those flows. A new application meant filing a ticket and talking to the consumers.
– It was hard to verify correctness. Does it work? Is all the activity data getting collected?
– It was an inherently batch-oriented process: the rsyncs were periodic and the ETL jobs were periodic, meaning multi-hour delays in getting the cleaned data.
Furthermore, the DWH was the only source of clean data. The irony is that this data is very important to a data-centric company.
That is, we don’t want to stop at reports; we want this clean data made available to production services as soon as possible in order to power insight-driven products. This is part of the reason we built out our Hadoop cluster. Even though it is unscalable to move this data around, just making it available to these systems is hugely useful. For example, after setting up Hadoop a lot of possibilities were unlocked: new computation became possible on data that would have been hard to work with before, and many new products and analyses came simply from putting together multiple pieces of data that had previously been locked up in specialized systems. People really wanted the data. But this forces setting up solutions that are relatively heavyweight, clumsy, and not particularly effective. For example, our recommendation systems match jobs, people you may know, groups you may want to join, and so on. The way it works is that user events end up in Hadoop, some offline processing and enrichment takes place, and we generate pre-built read-only stores. For various reasons it is not possible to update a read-write store in the live data center from the offline data center, so these pre-built read-only stores are shipped to the production data center intermittently. If those jobs aren’t running frequently enough, your recommendation system is stuck with signals from activity data as of the last run.
So that was just the user-activity pipeline. Similar issues plague other specialized pipelines.
Simple recipe: take all the organization’s data and put it into a central pipeline for real-time consumption. Multiple benefits:
– Data is integrated and made available, and if the pipeline supports persistence it remains available for a period of time.
– It decouples producers and consumers – each only needs to know how to talk to the central pipeline.
– Adding a new data source or sink is simple and organizationally scalable.
I should point out that we have had, and still have, a separate pipeline for database update streams – that’s Databus. I won’t be going into it in this talk.
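The O(n²) wiring argument can be made concrete with a bit of arithmetic. This is my own illustrative sketch, not a slide from the talk; the function names are hypothetical:

```python
# With N data sources and M data-driven systems, point-to-point
# integration needs N * M pipelines, while a central pipeline
# needs only N + M connections (each endpoint wired once, to the hub).

def point_to_point(n_sources, m_consumers):
    # every source wired directly to every consumer
    return n_sources * m_consumers

def central_pipeline(n_sources, m_consumers):
    # every endpoint connects once, to the central pipeline
    return n_sources + m_consumers
```

At even modest scale the difference is stark: 10 sources and 10 consumers means 100 pipelines to build and maintain point-to-point, versus 20 connections through a hub.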
Since we were already using ActiveMQ at the time for ad hoc messaging purposes, we wondered if we could use it for the central data pipeline as well – i.e., the approach was to hook it up to the activity feed and just see what happens. So we tried ActiveMQ and RabbitMQ.
Problems with JMS messaging systems:
– Not designed for high-volume data, especially with large backlogs of unconsumed data. In other words, persistence is not an ingrained concept.
– Difficult to scale out – no inherent support for distribution.
– Featuritis: transactions (exactly-once semantics).
This borrows from the traditional database log concept: take all changes – updates to tables, indexes, materialized views and so on – and, in a way that is correct in the presence of failures, write a log of everything that happens. Someone else can read this log and apply those updates. So that’s a log: an append-only, totally ordered sequence of records indexed by time. It is not too different from a file, but the purpose of a log is specific: it captures what happened and when, and provides a persistent, replayable record of history.
Apart from the decoupling the central log provides, subscribers can consume at their own pace. This is important, e.g., for Hadoop, which may be on an hourly schedule or down for maintenance. Because the log is persistent, a consumer can resume from where it left off when it comes back up, and even a large backlog won’t impact consumer (or broker) performance, thanks to the linear access pattern.
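The log abstraction above can be sketched in a few lines. This is a toy model to illustrate the idea, not Kafka’s actual implementation: an append-only sequence plus per-consumer offsets, so each subscriber reads at its own pace and resumes wherever it left off:

```python
class Log:
    """Append-only, totally ordered sequence of records."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        # Sequential reads from any offset: a linear access pattern,
        # cheap even when a consumer is far behind.
        return self._records[offset:offset + max_records]

class Consumer:
    """Each consumer tracks its own offset, so it can consume at its
    own pace and resume after downtime without affecting the log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the consumer, not the broker, owns the offset, a backlog is just a larger gap between the consumer’s offset and the end of the log – no per-message delivery state piles up server-side.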
Brokers, producers, consumers, topics.
Engineering for high throughput:
– Batching at producer and consumer
– Compression
– Reliance on pagecache
Horizontally scalable:
– Can add more producers.
– Can add more brokers -> more partitions. Don’t need that many (16 brokers per DC).
– Can add more consumer instances. E.g., you can have one consumer reading three partitions, or three consumers sharing that load.
Guarantees:
– Successfully published messages must not be lost and must be available for delivery even in the presence of broker failures.
– Consumer: at least once, most of the time exactly once.
– End-to-end latency is generally under a second.
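The “one consumer reading three partitions, or three consumers sharing that load” point is just partition assignment. A minimal sketch of spreading partitions across consumer instances (my own illustration; the function and consumer names are hypothetical, and real Kafka assignment strategies are more involved):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions as evenly as possible across the consumer
    instances of a group, round-robin over sorted partition ids."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With one consumer it owns all three partitions; add two more instances and each ends up with one – consumption scales out without any change to producers or brokers.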
Cluster sizes vary – some are big, some are small; the tracking cluster is 16 brokers. Writes and reads are batched, so requests per second will be lower than the message rate. Currently at 700 topics – an ever-growing number.
This is an approximation of our topology – each large box is a data center. The main point I want to show is that we can mirror clusters to other DCs efficiently (with minutes of delay), and in doing so the data pipeline can almost seamlessly cut across data center boundaries. So the Hadoop clusters in the offline DC have near real-time access to the activity feed and can push back enriched data for consumption by production services.

This is a powerful use case. It allows live services to take large amounts of data that has been cleaned and enriched in one way or another by a batch-oriented system (Hadoop in this case) and consume it in a stream-oriented manner. Before we had this data pipeline, we were left with few options but to push the batch onto production services that needed it, which may not always be feasible: in a push model, services may not be able to deal with sudden bursts of data being pushed from Hadoop. Instead, the Hadoop jobs can now shove data into the data pipeline, which is mirrored back to the live DC, and each service can consume this data in a stream-oriented fashion at its own pace.

For example, this is used by a system called Faust (under development by the Voldemort team, also part of Data Infrastructure at LinkedIn) to improve the turnaround time of incorporating recent activity data for use by recommendation systems. I mentioned earlier that we had jobs that shipped out entire pre-built read-only stores – now the jobs can just write out updates to the data pipeline, which are read and applied by Faust in the live DC.
The second idea is to pre-emptively prevent data fragility due to schema changes, and to ensure that only clean, well-structured data gets into the pipeline in the first place.
Picture: show sample schema.
– Highlight that the compatibility check is automatic.
– Schema review ensures best practices: well-named fields, required header information, amenability to future evolution.
– A compile-time check ensures that an updated schema is compatible with the previous version – we have a central repository of all schemas to aid in that verification.
– It may seem restrictive to have a compatibility model, but if you have 50+ services consuming a given data source it makes sense, especially if you intend to evolve schemas over time.
– Ensure that the producer cannot send an event with a schema that is invalid or incompatible with a previous version.
– A reference to the schema (a hash of the canonicalized version) is embedded in each message, so a reader always uses the same schema as the writer.
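The last point – embedding a hash of the canonicalized schema in each message – can be sketched as follows. This is a simplified illustration of the idea, not LinkedIn’s actual implementation or the Avro wire format; the registry, function names, and the JSON-over-pipe framing are all my own assumptions:

```python
import hashlib
import json

SCHEMA_REGISTRY = {}  # fingerprint -> schema; a central repository in practice

def canonical_fingerprint(schema):
    """Hash a canonicalized rendering of the schema (sorted keys, no
    whitespace), so equivalent schemas yield the same fingerprint."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def register(schema):
    fp = canonical_fingerprint(schema)
    SCHEMA_REGISTRY[fp] = schema
    return fp

def encode(schema, payload):
    """Prefix each message with the writer-schema fingerprint."""
    return canonical_fingerprint(schema) + "|" + json.dumps(payload)

def decode(message):
    """The reader resolves the writer's schema from the embedded
    fingerprint, so it always decodes with the schema used to write."""
    fp, _, body = message.partition("|")
    return SCHEMA_REGISTRY[fp], json.loads(body)
```

The key property is that the reader never has to guess which schema version a message was written with: the fingerprint travels with the message, and the registry maps it back to the exact writer schema.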
The first two ideas facilitate a much more streamlined O(1) approach to ETL – by O(1) I mean with regard to human effort. Previously new event types would probably need some custom parsing work to be ETL’d. However, since we now have a central pipeline, and because all data in that pipeline uses backwards compatible schemas with a standardized encoding – avro in our case, a new event type is just that – a new event type. You just send it and it gets ETL’d. The ETL process knows how to read avro records and the effort to get your new data into Hadoop is literally zero.
Bunch of mappers that read from Kafka and write to HDFS.
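One piece of that job is dividing the topic-partition offset ranges among the mappers. A rough sketch of one way to balance that split (my own illustration, not Camus’s actual work-allocation code; names are hypothetical):

```python
def split_work(offset_ranges, num_mappers):
    """Greedy balancing sketch: assign each topic-partition's
    (start, end) offset range to the mapper with the least work so far.
    offset_ranges: dict of topic-partition -> (start_offset, end_offset).
    Returns a list of topic-partition lists, one per mapper."""
    loads = [0] * num_mappers
    plan = [[] for _ in range(num_mappers)]
    # largest ranges first, so the greedy split stays balanced
    for tp, (start, end) in sorted(offset_ranges.items(),
                                   key=lambda kv: kv[1][0] - kv[1][1]):
        i = loads.index(min(loads))
        plan[i].append(tp)
        loads[i] += end - start
    return plan
```

Each mapper then pulls its assigned offset ranges from Kafka and writes the records out to HDFS.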
Does it work? In order to answer that, we need a metric to measure.
Every message should be received by every consumer, quickly. So we want to measure event loss, and we want to measure lag from producer to the various consumers.
Producers keep track of how many messages they sent for each topic in every 10-minute time window; the time is taken from the event header of each message. Each producer sends an audit event every 10 minutes saying… Likewise, the Kafka cluster and the ETL report their counts, and we have an application that reconciles these counts every few minutes. This leads into 0.8: discrepancies are typically due to producer failures, e.g. when we upgrade the cluster, causing unavailability for producers and consumers alike. With no ack, the producer doesn’t know whether the data made it or not.
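The reconciliation step can be sketched roughly as follows. This is my own simplified illustration of the audit-trail idea, not LinkedIn’s audit application; the tier names and the tuple format of audit events are assumptions:

```python
from collections import defaultdict

WINDOW = 600  # seconds; counts are bucketed into 10-minute windows

def window_of(event_ts):
    """Bucket by the timestamp in the event header, not arrival time."""
    return event_ts - (event_ts % WINDOW)

def reconcile(audit_events):
    """audit_events: (tier, topic, window, count) tuples reported by
    producers, the Kafka cluster, and the ETL. Returns the windows where
    some tier's count disagrees with the producer's, i.e. suspected loss."""
    counts = defaultdict(lambda: defaultdict(int))
    for tier, topic, window, count in audit_events:
        counts[(topic, window)][tier] += count
    discrepancies = {}
    for key, per_tier in counts.items():
        expected = per_tier.get("producer", 0)
        bad = {t: c for t, c in per_tier.items()
               if t != "producer" and c != expected}
        if bad:
            discrepancies[key] = (expected, bad)
    return discrepancies
```

If every tier reports the same count for a window, that window checks out; any shortfall pinpoints both where in the pipeline messages went missing and in which 10-minute window.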