Realtime Dashboards on Data Streams

1 © Hortonworks Inc. 2011–2018. All rights reserved
Interactive Realtime Dashboards on Data
Streams
Nishant Bangarwa
Hortonworks
Druid Committer, PMC

Sample Data Stream : Wikipedia Edits

Step by Step Breakdown
Consume Events
Enrich / Transform
(Add Geolocation
from IP Address)
Store Events
Visualize Events
Sample Event : [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact

Required Components
 Event Flow
 Event Processing
 Data Store
 Visualization Layer

Event Flow

Event Flow : Requirements
Event
Producers
Queue
Event
Consumers
 Low latency
 High Throughput
 Failure Handling
 Message delivery guarantees –
 Atleast Once, Exactly once, Atmost Once
 Varies based on use-case, Exactly once being the holy grail and
the most difficult to achieve
 Scalability

Apache Kafka

Apache Kafka
 Low Latency
 High Throughput
 Message Delivery guarantees
 At-least once
 Exactly Once (Fully introduced in apache kafka v0.11.0 June 2017)
 Reliable design to Handle Failures
 Message Acks between producers and brokers
 Data Replication on brokers
 Consumers can Read from any desired offset
 Handle multiple producers/consumers
 Scalable

Event Processing

Event Processing : Requirements
 Consume-Process-Produce Pattern
 Enrich and Transform event streams
 Windowing
 Apply business logic
 Consume and Join multiple streams into single
 Failure Handling
 Scalability
Consume Process Produce

Kafka Streams
 Rich Lightweight Stream processing library
 Event-at-a-time
 Stateful processing : windowing, joining, aggregation operators
 Local state using RocksDb
 Backed by changelog in kafka
 Highly scalable, distributed, fault tolerant
 Compared to a standard Kafka consumer:
 Higher level: faster to build a sophisticated app
 Less control for very fine-grained consumption

Kafka Streams : Wikipedia Data Enrichment

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
1
Data Store

Data Store

Data Store : Requirements
Processed
Events
Data Store Queries
 Ability to ingest Streaming data
 Power Interactive dashboards
 Sub-Second Query Response time
 Ad-hoc arbitrary slicing and dicing of data
 Data Freshness
 Summarized/aggregated data is queried
 Scalability
 High Availability

1
Druid
• Column-oriented distributed datastore
• Sub-Second query times
• Realtime streaming ingestion
• Arbitrary slicing and dicing of data
• Automatic Data Summarization
• Approximate algorithms (hyperLogLog, theta)
• Scalable to petabytes of data
• Highly available

Suitable Use Cases
• Powering Interactive user facing applications
• Arbitrary slicing and dicing of large datasets
• User behavior analysis
• measuring distinct counts
• retention analysis
• funnel analysis
• A/B testing
• Exploratory analytics/root cause analysis
• Not interested in dumping entire dataset

Druid: Segments
• Data in Druid is stored in Segment Files.
• Partitioned by time
• Ideally, segment files are each smaller than 1GB.
• If files are large, smaller time partitions are needed.
Time
Segment 1:
Monday
Segment 2:
Tuesday
Segment 3:
Wednesday
Segment 4:
Thursday
Segment 5_2:
Friday
Segment 5_1:
Friday

1
Example Wikipedia Edit Dataset
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
Timestamp Dimensions Metrics

2
Data Rollup
timestamp page language city country count sum_added sum_deleted min_added max_added ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12
Rollup by hour

2
Dictionary Encoding
• Create and store Ids for each value
• e.g. page column
⬢ Values - Justin Bieber, Ke$ha, Selena Gomes
⬢ Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
⬢ Column Data - [0 0 0 1 1 2]
• city column - [0 0 0 1 1 1]

2
Bitmap Indices
• Store Bitmap Indices for each value
⬢ Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
⬢ Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
⬢ Selena Gomes -> [5] -> [0 0 0 0 0 1]
• Queries
⬢ Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
⬢ language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
• Indexes compressed with Concise or Roaring encoding

2
Approximate Sketch Columns
timestamp page userid language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65
2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53
timestamp page language city country count sum_added sum_delete
d
min_added Userid_sket
ch
….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch}
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch}
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch}
Rollup by hour

Approximate Algorithms
• Store Sketch objects, instead of raw column values
• Better rollup for high cardinality columns e.g userid
• Reduced storage size
• Use Cases
• Fast approximate distinct counts
• Approximate histograms
• Funnel/retention analysis
• Limitation
• Not possible to do exact counts
• filter on individual row values

Realtime
Nodes
Historical
Nodes
2
Druid Architecture
Batch Data
Event
Historical
Nodes
Broker
Nodes
Realtime
Index Tasks
Streaming
Data
Historical
Nodes
Handoff

Performance and Scalability : Fast Facts
Most Events per Day
300 Billion Events / Day
(Metamarkets)
Most Computed Metrics
1 Billion Metrics / Min
(Jolata)
Largest Cluster
200 Nodes
(Metamarkets)
Largest Hourly Ingestion
2TB per Hour
(Netflix)

2
Companies Using Druid

Visualization Layer

Visualization Layer : Requirements
• Rich dashboarding capabilities
• Work with multiple datasoucres
• Security/Access control
• Allow for extension
• Add custom visualizations
Data Store Visualization
Layer
User
Dashboards

Superset
• Python backend
• Flask app builder
• Authentication
• Pandas for rich analytics
• SqlAlchemy for SQL toolkit
• Javascript frontend
• React, NVD3
• Deep integration with Druid

Superset Rich Dashboarding Capabilities: Treemaps

Superset Rich Dashboarding Capabilities: Sunburst

Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!

Wikipedia Real-Time Dashboard
Kafka
Connect
IP-to-
Geolocation
Processor
wikipedia-raw
topic
wikipedia-raw
topic
wikipedia-enriched
topic
wikipedia-enriched
topic

Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)

Project Websites
 Kafka - http://kafka.apache.org
 Druid - http://druid.io
 Superset - http://superset.incubator.apache.org

Thank you
Twitter - @NishantBangarwa
Email - nbangarwa@hortonworks.com
Linkedin - https://www.linkedin.com/in/nishant-bangarwa

Questions?

Realtime Dashboards on Data Streams

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Realtime Dashboards on Data Streams

Semelhante a Realtime Dashboards on Data Streams (20)

Mais de DataWorks Summit

Mais de DataWorks Summit (20)

Último

Último (20)

Realtime Dashboards on Data Streams

Notas do Editor