Practical Introduction to Hydra for Big Data Processing

Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013

Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together

Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
  • Command and Control
  • Alerting
  • Resource Management
  • Data/Task Rebalancing
  • Data replication and Backup

What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source

Hydra - Terms
• Job: a process for processing data
• Task: a processing component of a job. A job can have one to n tasks
• Node: a logical unit of processing capacity available to a cluster
• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
• Spawn: cluster management controller and UI

Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"

What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What is the daily number of unique searches, unique users, and unique IP addresses, and what is the distribution of response times (all approximate)?
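As a point of reference for the first three questions, here is a minimal Python sketch that answers them exactly on the five sample records. It is not Hydra code. The field names (`time`, `uid`, `ip`, `terms`) are assumptions made for illustration; the deck does not name the columns.

```python
import csv
import io
from collections import Counter, defaultdict

# The five sample records from the deck, one per line.
SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
'''


def parse(text):
    """Yield one dict per record. Field names are hypothetical."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for time, uid, ip, terms in reader:
        yield {"time": float(time), "uid": uid, "ip": ip, "terms": terms.split()}


records = list(parse(SAMPLE))

# Top IP addresses by request count.
hits_by_ip = Counter(r["ip"] for r in records)
top_by_hits = hits_by_ip.most_common()

# Top IP addresses by unique user count (ties broken by IP string).
uids_by_ip = defaultdict(set)
for r in records:
    uids_by_ip[r["ip"]].add(r["uid"])
top_by_users = sorted(
    ((ip, len(uids)) for ip, uids in uids_by_ip.items()),
    key=lambda pair: (-pair[1], pair[0]),
)

# Most common search terms.
term_counts = Counter(term for r in records for term in r["terms"])

print(top_by_hits)
print(top_by_users)
print(term_counts.most_common(3))
```

Exact counting like this keeps every key in memory, which is the cost the approximate data attachments later in the deck are meant to avoid.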

Setting up Hydra’s Local Stack

Vagrant
• $ vagrant init precise32 http://files.vagrantup.com/precise32.box
• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
• $ vagrant up
• $ vagrant ssh

Java7
• $ sudo apt-get update
• $ sudo apt-get install python-software-properties
• $ sudo add-apt-repository ppa:webupd8team/java
• $ sudo apt-get update
• $ sudo apt-get install oracle-java7-installer

RabbitMQ, Maven, Git, Make
• $ sudo apt-get install rabbitmq-server
• $ sudo apt-get install maven
• $ sudo apt-get install git
• $ sudo apt-get install make

Copy on Write
• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
• $ tar zxvf fl-cow-0.10.tar.gz
• $ cd fl-cow-0.10
• $ ./configure --prefix=/usr
• $ make; make check
• $ sudo make install
• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD

Hydra
• $ git clone https://github.com/addthis/hydra.git
• $ cd hydra; mvn clean -Pbdbje package
• $ ./hydra-uber/bin/local-stack.sh start
• $ start
• $ seed

Stage Sample Data in Stream Directory
• $ mkdir ~/hydra/hydra-local/streams/log-synth
• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth

BundleFilters
• Return true or false
• Operate on entire rows
• Add/Remove columns
• Edit Column Values
• May include a call to ValueFilter

ValueFilters
• Operate on single column values
• Return a value or null
• No visibility to full row
• Often take input from BundleFilter

BundleFilter - Chain
// chain of bundle filters
{"op":"chain", "filter":[
//LIST OF BUNDLE
//FILTERS
….
]}

BundleFilter - Existence
// false if UID column is null
{"op":"field", "from":"UID"},

BundleFilter - Concatenation
// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},

BundleFilter - Equality Testing
// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
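The bundle filters above can be read as functions that take a whole row, possibly modify it, and return true or false. The Python sketch below models that reading for `field`, `concat`, `equals`, and `chain`. It only illustrates the semantics described on the slides and is not Hydra's implementation; the short-circuit behaviour of `chain` and the treatment of missing columns in `concat` are assumptions.

```python
def field_exists(column):
    """Models {"op":"field", "from":column}: false if the column is null."""
    def apply(row):
        return row.get(column) is not None
    return apply


def concat(inputs, out):
    """Models {"op":"concat"}: joins the input columns into a new column.

    Assumption: missing or null inputs are skipped.
    """
    def apply(row):
        row[out] = "".join(str(row[c]) for c in inputs if row.get(c) is not None)
        return True
    return apply


def equals(left, right):
    """Models {"op":"equals"}: true when both columns hold the same value."""
    def apply(row):
        return row.get(left) == row.get(right)
    return apply


def chain(filters):
    """Models {"op":"chain"}: runs filters in order.

    Assumption: the chain stops at the first filter that returns false.
    """
    def apply(row):
        return all(flt(row) for flt in filters)
    return apply


pipeline = chain([
    field_exists("UID"),
    concat(["FOO", "BAR"], "OUTPUT"),
    equals("FIELD_ONE", "FIELD_TWO"),
])

kept = {"UID": "u1", "FOO": "a", "BAR": "b", "FIELD_ONE": 1, "FIELD_TWO": 1}
dropped = {"FOO": "a", "BAR": "b"}  # no UID, so the chain fails on the first filter

print(pipeline(kept), kept["OUTPUT"])
print(pipeline(dropped), "OUTPUT" in dropped)
```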

BundleFilter - Math!
// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"],
"define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}

Stack Math - Sample Data
C0,START_TIME    C1,END_TIME
100,234          200,468

Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set
Stack: 200,468 | 100,234
sub: 200,468 - 100,234 = 100,234

Stack Math
Stack: 1000 | 100,234
ddiv: 100,234 / 1000 = 100.234

Stack Math
Stack: 100.234
toint: 100

Stack Math - Sample Result
C0,START_TIME    C1,END_TIME    C2,DURATION
100,234          200,468        100
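The stack walk-through can be reproduced with a small evaluator. The sketch below is a hypothetical Python model of the `define` string, not Hydra's code. It assumes the column order from the `num` filter (`END`, `START`, `DUR`), that `sub` and `ddiv` apply the operator to the lower stack item and the top item in that order, that `toint` truncates, and that `set` pops a column index and then the value to store. Numbers are written without thousands separators.

```python
def run_stack_math(define, row):
    """Evaluate a comma-separated stack program against a list of column values.

    cN pushes column N, vN pushes the constant N; the rest are operators.
    Operator semantics here are assumptions made for illustration.
    """
    stack = []
    for token in define.split(","):
        if token.startswith("c") and token[1:].isdigit():
            stack.append(row[int(token[1:])])
        elif token.startswith("v") and token[1:].isdigit():
            stack.append(int(token[1:]))
        elif token == "sub":
            top = stack.pop()
            below = stack.pop()
            stack.append(below - top)
        elif token == "ddiv":
            top = stack.pop()
            below = stack.pop()
            stack.append(below / top)  # floating-point division
        elif token == "toint":
            stack.append(int(stack.pop()))  # truncates toward zero
        elif token == "set":
            column = stack.pop()
            value = stack.pop()
            row[column] = value
        else:
            raise ValueError(f"unsupported token: {token}")
    return row


# Columns: END, START, DUR (DUR is empty before the filter runs).
row = [200468, 100234, None]
run_stack_math("c0,c1,sub,v1000,ddiv,toint,v2,set", row)
print(row)
```

With these inputs the intermediate values match the slides: 100234 after `sub`, 100.234 after `ddiv`, and 100 after `toint`, which `set` stores in column 2.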

ValueFilter - Glob
// outer object: BundleFilter; inner "filter": ValueFilter
{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}

ValueFilter - Chain, Split, Index
// the chain is a ValueFilter; its "filter" list holds ValueFilter(s)
{op:"field", from:"LIST", filter: {op:"chain", filter:[
{op:"split", split:"="},
{op:"index", index:0}
]}},
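A value filter can be modelled as a function from one value to a value or null. The following Python sketch mirrors the `glob`, `split`, `index`, and `chain` examples above and shows how a bundle filter hands a single column to them. It is an illustration only: stopping the chain on null and writing the result back to the source column are assumptions, and Python's `fnmatch` stands in for whatever glob matching Hydra uses.

```python
import fnmatch


def glob(pattern):
    """Return the value if it matches the glob pattern, else None."""
    def apply(value):
        if value is not None and fnmatch.fnmatchcase(value, pattern):
            return value
        return None
    return apply


def split(separator):
    """Split a string value into a list."""
    def apply(value):
        return None if value is None else value.split(separator)
    return apply


def index(position):
    """Pick one element of a list value, or None if out of range."""
    def apply(value):
        if value is None or not -len(value) <= position < len(value):
            return None
        return value[position]
    return apply


def chain(filters):
    """Apply value filters in order (assumption: stop once a filter yields None)."""
    def apply(value):
        for flt in filters:
            value = flt(value)
            if value is None:
                break
        return value
    return apply


def field(source, value_filter):
    """Bundle filter: run a value filter on one column of the row.

    Assumption: the result replaces the column, and a null result means false.
    """
    def apply(row):
        row[source] = value_filter(row.get(source))
        return row[source] is not None
    return apply


first_key = field("LIST", chain([split("="), index(0)]))
is_log = field("SOURCE", glob("Log_[0-9]*"))

row = {"LIST": "key=value", "SOURCE": "Log_7"}
print(first_key(row), row["LIST"])
print(is_log(row), row["SOURCE"])
```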

Data Attachments are Hydra’s Secret Weapon
• Top-K Estimator
• Cardinality Estimation (HyperLogLog Plus)
• Quantile Estimation (Q-Digest, T-Digest)
• Bloom Filters
• Multiset streaming summarization (CountMin Sketch)
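To give a feel for what one of these summaries does, here is a minimal Count-Min Sketch in Python. It is a textbook-style sketch written for this illustration, not the implementation Hydra uses; the class name, the default `width` and `depth`, and the SHA-256 based hashing are all assumptions. The structure keeps `depth` rows of `width` counters, increments one counter per row on each update, and answers a frequency query with the minimum of those counters, so it can overestimate but never underestimate.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary: estimates are always >= the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
terms = ["the", "then", "good", "know", "boys", "to", "seven", "did", "local", "it", "the"]
for term in terms:
    sketch.add(term)

print(sketch.estimate("the"))      # at least 2
print(sketch.estimate("missing"))  # usually 0, never negative
```

Memory use is fixed at `width * depth` counters regardless of how many distinct terms are seen, which is the trade-off that makes this kind of summary practical on large streams.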

Data Attachment Example
A single node that tracks the top 1000 unique search terms, the distinct count of
UIDs, and provides quantile estimation for the query time

Job Structure
• Jobs have three sections
  • Source
  • Map
  • Output

Source
• Defines the properties of the input data set
• Several built in source types:
  • Mesh
  • Local File System
  • Kafka

Map
• Select fields from input record to process
• Apply filters to rows and columns
• Drop or expand rows

Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File output targets
  • File System
  • Cassandra
  • HDFS

What are the top IP Addresses By Record Count?
• Exact
  • path: root/byip/+:+hits
  • ops: gather=ks;sort=1:n:d;limit=100
• Approximate
  • path: root/byip/+$+uidcount

What are the top IPs by unique user count?
• Exact
  • path: root/byip/+/+
  • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
  • path: root/byip/+$+uidcount

What are the search terms for the slowest 5%?
• First get the 95th percentile query time
  • path: /root$+timeDigest=quantile(.95)
  • ops: num=c0,toint,v0,set;gather=a
• Now find all queries slower than the 95th percentile
  • path: /root/bytime/+/+:+hits
  • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
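The same two-step approach (find a threshold, then filter on it) can be sketched exactly in Python on the five sample records. This is an illustration rather than Hydra code. It assumes the first field of each record is the query time and uses the nearest-rank definition of a percentile, which is only one of several common definitions; the T-Digest used in the query above returns an estimate instead.

```python
import math
from collections import Counter

# (query time, search terms) for the five sample records.
# Treating the first field as the query time is an assumption.
RECORDS = [
    (3.535, ["the", "then", "good"]),
    (3.568, ["know", "boys"]),
    (4.206, ["to"]),
    (4.673, ["seven"]),
    (4.900, ["did", "local", "it"]),
]


def nearest_rank(values, q):
    """Return the smallest value with at least a fraction q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Step 1: the 95th percentile query time.
threshold = nearest_rank([time for time, _ in RECORDS], 0.95)

# Step 2: count the terms of every query at or above that threshold.
slow_terms = Counter(
    term for time, terms in RECORDS if time >= threshold for term in terms
)

print(threshold)
print(slow_terms.most_common())
```

On a data set this small the slowest 5% is a single record; the point is the shape of the computation, not the result.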

Daily Unique Searches, Users, IPs and distribution of response times?
• Query Path:
  • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
  • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
  • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;

But yeah, I could do that with CLI!

Related Open Source Projects
• Meshy - https://github.com/addthis/meshy
• Codec - https://github.com/addthis/codec
• Muxy - https://github.com/addthis/muxy
• Bundle - https://github.com/addthis/bundle
• Basis - https://github.com/addthis/basis
• Column Compressor - https://github.com/addthis/columncompressor
• Cluster Boot Service - https://github.com/stewartoallen/cbs

Helpful Resources
• Hydra - https://github.com/addthis/hydra
• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/
• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
• IRC - #hydra
• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss
