Practical Introduction to Hydra for Big Data Processing
1. Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013
3. Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together
4. Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
  • Command and Control
  • Alerting
  • Resource Management
  • Data/Task Rebalancing
  • Data replication and Backup
5. What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source
6. Hydra - Terms
• Job: a process for processing data
• Task: a processing component of a job. A job can have one to n tasks
• Node: a logical unit of processing capacity available to a cluster
• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
• Spawn: cluster management controller and UI
8. Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
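To make the later questions concrete, here is a minimal Python sketch that loads the five sample rows. The field names (`time`, `uid`, `ip`, `terms`) are labels assumed for illustration, chosen to match how the later slides refer to query time, UIDs, IPs, and search terms; they are not taken from Log-Synth or Hydra.

```python
import csv
import io
from typing import List, NamedTuple


class SearchRecord(NamedTuple):
    # Field names are assumed labels, not part of the original data.
    time: float
    uid: str
    ip: str
    terms: str


SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
'''


def parse(text: str) -> List[SearchRecord]:
    """Parse comma-separated rows; skipinitialspace handles the ', "' pattern."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [SearchRecord(float(t), uid, ip, terms) for t, uid, ip, terms in reader]


records = parse(SAMPLE)
```

Each record then exposes the four values by name, for example `records[0].ip`.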
9. What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What is the daily number of unique searches, unique users, and unique IP addresses, and what is the distribution of response times (all approximate)?
18. BundleFilters
• Return true or false
• Operate on entire rows
• Add/Remove columns
• Edit Column Values
• May include a call to ValueFilter
ValueFilters
• Operate on single column values
• Return a value or null
• No visibility to full row
• Often take input from BundleFilter
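The division of labor between the two filter types can be sketched in Python. This is an illustration of the concept only: the names `trim_or_null` and `on_column` are hypothetical and are not Hydra's API, and treating a null result as a failed row is an assumption.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, str]
ValueFilter = Callable[[Optional[str]], Optional[str]]
BundleFilter = Callable[[Row], bool]


def trim_or_null(value: Optional[str]) -> Optional[str]:
    """Value filter: sees a single value, returns a value or None."""
    if value is None:
        return None
    trimmed = value.strip()
    return trimmed or None


def on_column(column: str, value_filter: ValueFilter) -> BundleFilter:
    """Bundle filter that hands one column of the row to a value filter."""
    def bundle_filter(row: Row) -> bool:
        result = value_filter(row.get(column))
        if result is None:
            return False  # assumption: a null result fails the row
        row[column] = result  # edit the column value in place
        return True
    return bundle_filter


clean_terms = on_column("terms", trim_or_null)
row = {"ip": "88.120.153.226", "terms": "  seven "}
kept = clean_terms(row)
```

The bundle filter sees the whole row and returns true or false, while the value filter only ever receives the one value passed to it.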
19. BundleFilter - Chain
// chain of bundle filters
{"op":"chain", "filter":[
    // LIST OF BUNDLE FILTERS
    ….
]}
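As a rough model of what a chain does, the Python sketch below runs a list of row-level filters in order. It assumes the chain stops at the first filter that returns false; the function names are hypothetical and this is not Hydra's implementation.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
BundleFilter = Callable[[Row], bool]


def chain(filters: List[BundleFilter]) -> BundleFilter:
    """Combine filters into one; assumed to stop at the first failure."""
    def run(row: Row) -> bool:
        for bundle_filter in filters:
            if not bundle_filter(row):
                return False
        return True
    return run


def has_ip(row: Row) -> bool:
    """Pass only rows that carry an 'ip' column."""
    return "ip" in row


def tag_source(row: Row) -> bool:
    """Add a column to the row and always pass."""
    row["source"] = "log-synth"
    return True


pipeline = chain([has_ip, tag_source])
```

With this ordering, a row without an `ip` column is rejected before `tag_source` ever runs, so later filters can rely on what earlier ones checked.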
33. Data Attachment Example
A single node that tracks the top 1000 unique search terms and the distinct count of UIDs, and provides quantile estimation for the query time
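The idea of attaching several summaries to one node can be sketched in Python. This sketch uses exact structures (a counter, a set, a sorted list) where a system built for scale would presumably use bounded-memory approximations; the class and method names are hypothetical, and counting each whole search string as one term is an assumption.

```python
import math
from collections import Counter
from typing import List, Tuple


class NodeAttachments:
    """Exact stand-ins for top-N, distinct-count, and quantile summaries."""

    def __init__(self, top_n: int = 1000) -> None:
        self.top_n = top_n
        self.term_counts: Counter = Counter()
        self.uids: set = set()
        self.times: List[float] = []

    def update(self, time: float, uid: str, terms: str) -> None:
        self.term_counts[terms] += 1
        self.uids.add(uid)
        self.times.append(time)

    def top_terms(self) -> List[Tuple[str, int]]:
        return self.term_counts.most_common(self.top_n)

    def uid_count(self) -> int:
        return len(self.uids)

    def quantile(self, q: float) -> float:
        """Nearest-rank quantile over every time seen so far."""
        ordered = sorted(self.times)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


node = NodeAttachments()
for time, uid, terms in [
    (3.535, "5214d63bab95687d", "the then good"),
    (3.568, "5dbd9451948ad895", "know boys"),
    (4.206, "5dbd9451948ad895", "to"),
    (4.673, "b967d99cad0b3e60", "seven"),
    (4.900, "bd0d760fbb338955", "did local it"),
]:
    node.update(time, uid, terms)
```

After the five sample rows, `node.uid_count()` is 4 because one UID appears twice.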
38. Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File Output Targets
  • File System
  • Cassandra
  • HDFS
43. What are the top IP Addresses By Record Count?
• Exact
  • path: root/byip/+:+hits
  • ops: gather=ks;sort=1:n:d;limit=100
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
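One reading of the ops string, offered as an assumption rather than a statement of Hydra's semantics: `gather=ks` treats column 0 as a key and sums column 1, `sort=1:n:d` sorts on column 1 numerically in descending order, and `limit=100` keeps the first 100 rows. Under that reading, the pipeline behaves like this Python sketch (function names are hypothetical):

```python
from collections import defaultdict
from typing import List, Tuple

Pair = Tuple[str, int]


def gather_key_sum(rows: List[Pair]) -> List[Pair]:
    """Group rows by the key in column 0 and sum column 1."""
    totals = defaultdict(int)
    for key, hits in rows:
        totals[key] += hits
    return list(totals.items())


def sort_numeric_desc(rows: List[Pair], column: int) -> List[Pair]:
    """Sort rows by the given column, largest value first."""
    return sorted(rows, key=lambda r: r[column], reverse=True)


# One (ip, hits) row per sample record.
rows = [
    ("166.144.203.186", 1),
    ("88.120.153.226", 1),
    ("88.120.153.226", 1),
    ("88.120.153.226", 1),
    ("166.144.203.186", 1),
]

top_ips = sort_numeric_desc(gather_key_sum(rows), 1)[:100]
```

On the sample data this ranks 88.120.153.226 first with 3 hits, followed by 166.144.203.186 with 2.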
44. What are the top IPs by unique user count?
• Exact
  • path: root/byip/+/+
  • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
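The exact answer this query is after can be computed in plain Python by keeping a set of user IDs per IP. This sketch does not model the individual ops; it only shows the result they aim for, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def unique_users_by_ip(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UIDs seen for each IP address."""
    uids_by_ip = defaultdict(set)
    for ip, uid in pairs:
        uids_by_ip[ip].add(uid)
    return {ip: len(uids) for ip, uids in uids_by_ip.items()}


# (ip, uid) pairs from the sample rows.
pairs = [
    ("166.144.203.186", "5214d63bab95687d"),
    ("88.120.153.226", "5dbd9451948ad895"),
    ("88.120.153.226", "5dbd9451948ad895"),
    ("88.120.153.226", "b967d99cad0b3e60"),
    ("166.144.203.186", "bd0d760fbb338955"),
]

counts = unique_users_by_ip(pairs)
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Exact sets grow with the number of users, which is the cost the approximate `uidcount` attachment is meant to avoid.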
45. What are the search terms for the slowest 5%?
• First get the 95th percentile query time
  • path: /root$+timeDigest=quantile(.95)
  • ops: num=c0,toint,v0,set;gather=a
• Now find all queries at or above the 95th percentile
  • path: /root/bytime/+/+:+hits
  • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
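The same two-step approach can be sketched in Python: compute the threshold, then keep only queries at or above it. The nearest-rank percentile used here is an exact stand-in chosen for illustration and is not the estimator behind `timeDigest`; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def slow_terms(records: List[Tuple[float, str]], q: float = 0.95):
    """Step 1: find the threshold. Step 2: count terms at or above it."""
    threshold = percentile([time for time, _ in records], q)
    counts = Counter(terms for time, terms in records if time >= threshold)
    return threshold, counts.most_common()


# (time, terms) pairs from the sample rows.
records = [
    (3.535, "the then good"),
    (3.568, "know boys"),
    (4.206, "to"),
    (4.673, "seven"),
    (4.900, "did local it"),
]

threshold, slowest = slow_terms(records)
```

With only five sample rows the threshold is the largest time, so a single query qualifies.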
46. Daily Unique Searches, Users, IPs and distribution of response times?
• Query Path:
  • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
  • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
  • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;