The document discusses LinkedIn's use of Apache Kafka as a central data pipeline to integrate their variety of real-time user data streams. Some key points:
- LinkedIn uses Kafka to ingest over 28 billion messages per day from various data sources like user activity and system metrics.
- Kafka provides a scalable central data pipeline that supports high throughput rates of hundreds of thousands to millions of messages per second.
- LinkedIn standardizes on the Avro data format for schemas and pushes data cleaning upstream by producers.
- They ensure correctness through an audit trail and evidence-based approach of validating that all messages reach consumers.
HADOOP SUMMIT 2013
Hadoop data load (Camus)
Open sourced:
– https://github.com/linkedin/camus
One job loads all events
~10 minute ETA on average from producer to HDFS
Hive registration done automatically
Schema evolution handled transparently
Talk about our data pipeline, the motivations behind it and how we built it out using Apache Kafka.
LinkedIn, like most web companies, derives a lot of value from tracking user activity – page views, clicks, ad impressions and so on. In fact some of this activity data is visible directly, in one form or another, in your own network update stream. People in your network may add a new connection or share a URL – and you ideally want to see these updates as soon as possible, ideally in real time. This user activity feed is a useful user-facing product in and of itself, but it is much more important than that. LinkedIn is a data-centric company, and this activity data is also a hugely valuable ingredient in other data-driven products.
We use data to provide a richer, more relevant experience to members; that engages our users more, which in turn generates more activity data, and we get into a perpetual self-feeding cycle of successive refinement.
That principle manifests in products such as PYMK (People You May Know), which is only as engaging or as useful as it is relevant. If you only see people who are unrelated to you or whom you don’t care about, it is pointless. A relevant suggestion leads to connecting with that person, triggering a connect event; you may click on their profile or company page, triggering more activity events. So each page view can directly or indirectly result in additional front-end reads and considerable downstream activity – i.e., calls into backend services. In other words, a simple page view (intuitively a read operation), when tracked as user activity, results in a bunch of writes within your activity data pipeline. The data pipeline is not solely for activity tracking, though.
It is also important to have a metrics feed for tracking your system and application metrics and logs. This is critical for monitoring the health trends of your production services, both low-level and application-level; log and service-call data also feeds tools such as service call graphs. User activity and system metrics are just two kinds of data that you might want in your data pipeline – there are several others – and there is a whole bunch of data-driven systems that need to feed off these streams.
So that is really the key problem we want to solve: integrating these different data streams and making them easily available, preferably in real time, to each data-driven system. What happens at most companies, including LinkedIn, is that we end up building specialized data systems to handle each type of data, and very soon you end up with an architecture that looks something like this:
… where there is a different solution or pipeline for each type of data. In this picture, the data sources are above and the data-driven systems (the consumers) are below. For example, for operational metrics we used JMX feeding into Zenoss, we had a separate user activity tracking system (which I’ll talk more about a little later), and we used Splunk for scraping and searching logs. There are a number of data-driven systems that are directly user-facing, some mid-tier, and some more backend. For many of these systems it is important to have access to the data in near real time. Take security systems, for example, which need to consume user activity events from the user tracking system, detect anomalous or malicious patterns, and react quickly. Likewise, for search systems to provide more relevant results, the indexed content should be as fresh as possible. Recommendation systems do a better job of providing relevant results if signals from activity data are incorporated early on. So to fulfill these use cases we ended up with tight coupling between the sources of the various types of data and the specialized data-driven systems that feed off that data. The cons: universal access to the data requires O(n²) point-to-point pipelines, sources and systems are tightly coupled, and every pair of endpoints needs to know how to talk to each other.
To drive home these points a little more clearly, I’ll provide some details on our previous (specialized) user activity data pipeline. Front-end applications would post XML blobs containing activity data to an HTTP-based logging service; activity logs were scraped from this service and periodically rsync’d over to staging servers in the offline data centers, where the ETL process took place. This pipeline had a number of limitations:
– The logging service did not provide real-time access to the data sent to it; it just served as a point of aggregation. Other data systems in the live data center could not feed off this activity data directly – as this diagram shows, the only consumers of the user activity data were ultimately the offline systems.
– The data flowing through the pipeline was raw XML. Producers used whatever structure they wanted, and a single data warehouse team had the job of sucking in all this raw activity data and cleaning it into something beautiful that represents everything about the business. That team was cleaning data produced by tons of producers without being well versed in it, while the producers did not know what constitutes clean data amenable to treatment at the ETL stage.
– The data flow was fragile with respect to schema changes, labor intensive, and unscalable at the human layer: people at the batch layer build data-rich products that depend on a number of data sources – say, 50+ – so things were highly likely to break if the data format changed in any one of those flows. A new application meant filing a ticket and talking to the consumers.
– It was hard to verify correctness. Does it work? Is all the activity data getting collected?
– It was an inherently batch-oriented process: the rsyncs were periodic and the ETL jobs were periodic, meaning multi-hour delays in getting the cleaned data.
Furthermore, the DWH was the only source of clean data. The irony is that this data is very important to a data-centric company.
That is, we don’t want to stop at reports; we want this clean data made available to production services as soon as possible in order to power insight-driven products. This is part of the reason we built out our Hadoop cluster. Even though it is unscalable to move this data around, just making it available to these systems is hugely useful. For example, after setting up Hadoop a lot of possibilities were unlocked: new computation became possible on data that would have been hard to work with before, and many new products and analyses came simply from putting together multiple pieces of data that had previously been locked up in specialized systems. People really wanted the data. But this forces setting up solutions that are relatively heavyweight, clumsy, and not particularly effective. For example, our recommendation systems match jobs, people you may know, groups you may want to join, and so on. The way it works is that user events end up in Hadoop, some offline processing and enrichment takes place, and we generate pre-built read-only stores. For various reasons it is not possible to update a read-write store in the live data center from the offline data center, so these pre-built read-only stores are shipped to the production data center intermittently. If those jobs aren’t running frequently enough, your recommendation system is stuck with signals from activity data as of the last run.
So that was just the user-activity pipeline. Similar issues plague other specialized pipelines.
Simple recipe: take all the organization’s data and put it into a central pipeline for real-time consumption. Multiple benefits:
– Data is integrated and made available, and if the pipeline supports persistence it remains available for a period of time.
– It decouples producers and consumers – each only needs to know how to talk to the central pipeline.
– Adding a new data source or sink is simple and organizationally scalable.
I should point out that we have had, and still have, a separate pipeline for database update streams – that’s Databus. I won’t be going into it in this talk.
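The O(n²) wiring argument can be made concrete with a bit of arithmetic. This is my own illustrative sketch, not a slide from the talk; the function names are hypothetical:

```python
# With N data sources and M data-driven systems, point-to-point
# integration needs N * M pipelines, while a central pipeline
# needs only N + M connections (each endpoint wired once, to the hub).

def point_to_point(n_sources, m_consumers):
    # every source wired directly to every consumer
    return n_sources * m_consumers

def central_pipeline(n_sources, m_consumers):
    # every endpoint connects once, to the central pipeline
    return n_sources + m_consumers
```

At even modest scale the difference is stark: 10 sources and 10 consumers means 100 pipelines to build and maintain point-to-point, versus 20 connections through a hub.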
Since we were already using ActiveMQ at the time for ad hoc messaging purposes, we wondered if we could use it for the central data pipeline as well – i.e., the approach was to hook it up to the activity feed and just see what happens. So we tried ActiveMQ and RabbitMQ.
Problems with JMS messaging systems:
– Not designed for high-volume data, especially with large backlogs of unconsumed data. In other words, persistence is not an ingrained concept.
– Difficult to scale out – no inherent support for distribution.
– Featuritis: transactions (exactly-once semantics).
This borrows from the traditional database log concept: take all changes – updates to tables, indexes, materialized views and so on – and, in a way that is correct in the presence of failures, write a log of everything that happens. Someone else can read this log and apply those updates. So that’s a log: an append-only, totally ordered sequence of records indexed by time. It is not too different from a file, but the purpose of a log is specific: it captures what happened and when, and provides a persistent, replayable record of history.
Apart from the decoupling the central log provides, subscribers can consume at their own pace. This is important, e.g., for Hadoop, which may be on an hourly schedule or down for maintenance. Because the log is persistent, a consumer can resume from where it left off when it comes back up, and even a large backlog won’t impact consumer (or broker) performance, thanks to the linear access pattern.
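The log abstraction above can be sketched in a few lines. This is a toy model to illustrate the idea, not Kafka’s actual implementation: an append-only sequence plus per-consumer offsets, so each subscriber reads at its own pace and resumes wherever it left off:

```python
class Log:
    """Append-only, totally ordered sequence of records."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        # Sequential reads from any offset: a linear access pattern,
        # cheap even when a consumer is far behind.
        return self._records[offset:offset + max_records]

class Consumer:
    """Each consumer tracks its own offset, so it can consume at its
    own pace and resume after downtime without affecting the log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the consumer, not the broker, owns the offset, a backlog is just a larger gap between the consumer’s offset and the end of the log – no per-message delivery state piles up server-side.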
Brokers, producers, consumers, topics.
Engineering for high throughput:
– Batching at producer and consumer
– Compression
– Reliance on pagecache
Horizontally scalable:
– Can add more producers.
– Can add more brokers -> more partitions. Don’t need that many (16 brokers per DC).
– Can add more consumer instances. E.g., you can have one consumer reading three partitions, or three consumers sharing that load.
Guarantees:
– Successfully published messages must not be lost and must be available for delivery even in the presence of broker failures.
– Consumer: at least once, most of the time exactly once.
– End-to-end latency is generally under a second.
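The “one consumer reading three partitions, or three consumers sharing that load” point is just partition assignment. A minimal sketch of spreading partitions across consumer instances (my own illustration; the function and consumer names are hypothetical, and real Kafka assignment strategies are more involved):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions as evenly as possible across the consumer
    instances of a group, round-robin over sorted partition ids."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With one consumer it owns all three partitions; add two more instances and each ends up with one – consumption scales out without any change to producers or brokers.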
Cluster sizes vary – some are big, some are small; the tracking cluster is 16 brokers. Writes and reads are batched, so requests per second will be lower than the message rate. Currently at 700 topics – an ever-growing number.
This is an approximation of our topology – each large box is a data center. The main point I want to show is that we can mirror clusters to other DCs efficiently (with minutes of delay), and in doing so the data pipeline can almost seamlessly cut across data center boundaries. So the Hadoop clusters in the offline DC have near real-time access to the activity feed and can push back enriched data for consumption by production services.

This is a powerful use case. It allows live services to take large amounts of data that has been cleaned and enriched in one way or another by a batch-oriented system (Hadoop in this case) and consume it in a stream-oriented manner. Before we had this data pipeline, we were left with few options but to push the batch onto production services that needed it, which may not always be feasible: in a push model, services may not be able to deal with sudden bursts of data being pushed from Hadoop. Instead, the Hadoop jobs can now shove data into the data pipeline, which is mirrored back to the live DC, and each service can consume this data in a stream-oriented fashion at its own pace.

For example, this is used by a system called Faust (under development by the Voldemort team, also part of Data Infrastructure at LinkedIn) to improve the turnaround time of incorporating recent activity data for use by recommendation systems. I mentioned earlier that we had jobs that shipped out entire pre-built read-only stores – now the jobs can just write out updates to the data pipeline, which are read and applied by Faust in the live DC.
The second idea is to pre-emptively prevent data fragility due to schema changes, and to ensure that only clean, well-structured data gets into the pipeline in the first place.
Picture: show sample schema.
– Highlight that the compatibility check is automatic.
– Schema review ensures best practices: well-named fields, required header information, amenability to future evolution.
– A compile-time check ensures that an updated schema is compatible with the previous version – we have a central repository of all schemas to aid in that verification.
– It may seem restrictive to have a compatibility model, but if you have 50+ services consuming a given data source it makes sense, especially if you intend to evolve schemas over time.
– Ensure that the producer cannot send an event with a schema that is invalid or incompatible with a previous version.
– A reference to the schema (a hash of the canonicalized version) is embedded in each message, so a reader always uses the same schema as the writer.
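The last point – embedding a hash of the canonicalized schema in each message – can be sketched as follows. This is a simplified illustration of the idea, not LinkedIn’s actual implementation or the Avro wire format; the registry, function names, and the JSON-over-pipe framing are all my own assumptions:

```python
import hashlib
import json

SCHEMA_REGISTRY = {}  # fingerprint -> schema; a central repository in practice

def canonical_fingerprint(schema):
    """Hash a canonicalized rendering of the schema (sorted keys, no
    whitespace), so equivalent schemas yield the same fingerprint."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def register(schema):
    fp = canonical_fingerprint(schema)
    SCHEMA_REGISTRY[fp] = schema
    return fp

def encode(schema, payload):
    """Prefix each message with the writer-schema fingerprint."""
    return canonical_fingerprint(schema) + "|" + json.dumps(payload)

def decode(message):
    """The reader resolves the writer's schema from the embedded
    fingerprint, so it always decodes with the schema used to write."""
    fp, _, body = message.partition("|")
    return SCHEMA_REGISTRY[fp], json.loads(body)
```

The key property is that the reader never has to guess which schema version a message was written with: the fingerprint travels with the message, and the registry maps it back to the exact writer schema.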
The first two ideas facilitate a much more streamlined O(1) approach to ETL – by O(1) I mean with regard to human effort. Previously new event types would probably need some custom parsing work to be ETL’d. However, since we now have a central pipeline, and because all data in that pipeline uses backwards compatible schemas with a standardized encoding – avro in our case, a new event type is just that – a new event type. You just send it and it gets ETL’d. The ETL process knows how to read avro records and the effort to get your new data into Hadoop is literally zero.
Bunch of mappers that read from Kafka and write to HDFS.
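One piece of that job is dividing the topic-partition offset ranges among the mappers. A rough sketch of one way to balance that split (my own illustration, not Camus’s actual work-allocation code; names are hypothetical):

```python
def split_work(offset_ranges, num_mappers):
    """Greedy balancing sketch: assign each topic-partition's
    (start, end) offset range to the mapper with the least work so far.
    offset_ranges: dict of topic-partition -> (start_offset, end_offset).
    Returns a list of topic-partition lists, one per mapper."""
    loads = [0] * num_mappers
    plan = [[] for _ in range(num_mappers)]
    # largest ranges first, so the greedy split stays balanced
    for tp, (start, end) in sorted(offset_ranges.items(),
                                   key=lambda kv: kv[1][0] - kv[1][1]):
        i = loads.index(min(loads))
        plan[i].append(tp)
        loads[i] += end - start
    return plan
```

Each mapper then pulls its assigned offset ranges from Kafka and writes the records out to HDFS.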
Does it work? In order to answer that, we need a metric to measure.
Every message should be received by every consumer, quickly. So we want to measure event loss, and we want to measure lag from producer to the various consumers.
Producers keep track of how many messages they sent for each topic in every 10-minute time window; the time is taken from the event header of each message. Each producer sends an audit event every 10 minutes saying… Likewise, the Kafka cluster and the ETL report their counts, and we have an application that reconciles these counts every few minutes. This leads into 0.8: discrepancies are typically due to producer failures, e.g. when we upgrade the cluster, causing unavailability for producers and consumers alike. With no ack, the producer doesn’t know whether the data made it or not.
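The reconciliation step can be sketched roughly as follows. This is my own simplified illustration of the audit-trail idea, not LinkedIn’s audit application; the tier names and the tuple format of audit events are assumptions:

```python
from collections import defaultdict

WINDOW = 600  # seconds; counts are bucketed into 10-minute windows

def window_of(event_ts):
    """Bucket by the timestamp in the event header, not arrival time."""
    return event_ts - (event_ts % WINDOW)

def reconcile(audit_events):
    """audit_events: (tier, topic, window, count) tuples reported by
    producers, the Kafka cluster, and the ETL. Returns the windows where
    some tier's count disagrees with the producer's, i.e. suspected loss."""
    counts = defaultdict(lambda: defaultdict(int))
    for tier, topic, window, count in audit_events:
        counts[(topic, window)][tier] += count
    discrepancies = {}
    for key, per_tier in counts.items():
        expected = per_tier.get("producer", 0)
        bad = {t: c for t, c in per_tier.items()
               if t != "producer" and c != expected}
        if bad:
            discrepancies[key] = (expected, bad)
    return discrepancies
```

If every tier reports the same count for a window, that window checks out; any shortfall pinpoints both where in the pipeline messages went missing and in which 10-minute window.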