Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013
Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together
Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
  • Command and Control
  • Alerting
  • Resource Management
  • Data/Task Rebalancing
  • Data replication and Backup
What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source
Hydra - Terms
• Job: a process for processing data
• Task: a processing component of a job. A job can have one to n tasks
• Node: a logical unit of processing capacity available to a cluster
• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
• Spawn: cluster management controller and UI
Hydra Cluster
Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What are the daily numbers of unique searches, unique users, and unique IP addresses, and the distribution of response times (all approximate)?
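To make the questions concrete, here is a minimal Python sketch that answers the first three exactly on the five sample records. It is not Hydra code. The column meanings (time, user ID, IP address, search terms) are an assumption inferred from the questions, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# The five sample records from the slide.
# Assumed column order: time, user ID, IP address, search terms.
RECORDS = [
    (3.535, "5214d63bab95687d", "166.144.203.186", "the then good"),
    (3.568, "5dbd9451948ad895", "88.120.153.226", "know boys"),
    (4.206, "5dbd9451948ad895", "88.120.153.226", "to"),
    (4.673, "b967d99cad0b3e60", "88.120.153.226", "seven"),
    (4.900, "bd0d760fbb338955", "166.144.203.186", "did local it"),
]


def top_ips_by_requests(records):
    """Count one hit per record for each IP, most frequent first."""
    return Counter(ip for _, _, ip, _ in records).most_common()


def top_ips_by_unique_users(records):
    """Count the distinct user IDs seen for each IP, largest first."""
    users = defaultdict(set)
    for _, uid, ip, _ in records:
        users[ip].add(uid)
    counts = [(ip, len(uids)) for ip, uids in users.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


def top_search_terms(records):
    """Count each whitespace-separated search term, most frequent first."""
    return Counter(
        word for _, _, _, terms in records for word in terms.split()
    ).most_common()
```

Exact counting like this keeps every distinct key in memory, which is the cost the approximate estimators in the later slides are meant to avoid.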
Setting up Hydra’s Local Stack
Vagrant
• $ vagrant init precise32 http://files.vagrantup.com/precise32.box
• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
• $ vagrant up
• $ vagrant ssh
Java7
• $ sudo apt-get update
• $ sudo apt-get install python-software-properties
• $ sudo add-apt-repository ppa:webupd8team/java
• $ sudo apt-get update
• $ sudo apt-get install oracle-java7-installer
RabbitMQ, Maven, Git, Make
• $ sudo apt-get install rabbitmq-server
• $ sudo apt-get install maven
• $ sudo apt-get install git
• $ sudo apt-get install make
Copy on Write
• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
• $ tar zxvf fl-cow-0.10.tar.gz
• $ cd fl-cow-0.10
• $ ./configure --prefix=/usr
• $ make; make check
• $ sudo make install
• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
Hydra
• $ git clone https://github.com/addthis/hydra.git
• $ cd hydra; mvn clean -Pbdbje package
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh seed
Stage Sample Data in Stream Directory
• $ mkdir ~/hydra/hydra-local/streams/log-synth
• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
Pipes and Filters
BundleFilters
• Return true or false
• Operate on entire rows
• Add/Remove columns
• Edit Column Values
• May include a call to ValueFilter

ValueFilters
• Operate on single column values
• Return a value or null
• No visibility to full row
• Often take input from BundleFilter
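The split between the two filter kinds can be sketched in a few lines of Python. This is an illustration of the idea only, not Hydra's implementation. The helper names (`field`, `concat`, `chain`, `lowercase`) are hypothetical, and stopping a chain at the first filter that returns false is an assumption.

```python
def lowercase(value):
    """Value filter: sees one value only, returns a value or None (null)."""
    return None if value is None else value.lower()


def field(column, value_filter=None):
    """Bundle filter: sees the whole row, may edit a column, returns True/False."""
    def apply(bundle):
        value = bundle.get(column)
        if value_filter is not None:
            value = value_filter(value)
        if value is None:
            return False          # a null column fails the filter
        bundle[column] = value    # write the (possibly edited) value back
        return True
    return apply


def concat(inputs, out):
    """Bundle filter: joins several columns into a new column."""
    def apply(bundle):
        bundle[out] = "".join(bundle.get(name) or "" for name in inputs)
        return True
    return apply


def chain(*filters):
    """Bundle filter: runs filters in order, stopping at the first False."""
    def apply(bundle):
        return all(f(bundle) for f in filters)
    return apply
```

For example, `chain(field("UID", lowercase), concat(["FOO", "BAR"], "OUTPUT"))` lower-cases `UID`, then adds `OUTPUT`, and rejects any row whose `UID` is null before `OUTPUT` is written.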
BundleFilter - Chain
// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE FILTERS
  ….
]}
BundleFilter - Existence

// false if UID column is null
{"op":"field", "from":"UID"},
BundleFilter - Concatenation

// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
BundleFilter - Equality Testing

// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
BundleFilter - Math!

// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"],
 "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
Stack Math - Sample Data
C0 (START_TIME)    C1 (END_TIME)
100,234            200,468
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack (top first):
200,468
100,234

sub: 200,468 - 100,234 = 100,234
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack (top first):
1000
100,234

ddiv: 100,234 / 1000 = 100.234
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Stack (top first):
100.234

toint: 100
Stack Math - Sample Result
C0 (START_TIME)    C1 (END_TIME)    C2 (DURATION)
100,234            200,468          100
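The stack walk-through can be reproduced with a small evaluator. This is a minimal sketch in Python, not Hydra's implementation, and it supports only the tokens used in this example. It assumes `cN` pushes column N, `vN` pushes the constant N, `sub` and `ddiv` apply to the two top values in push order, `toint` truncates, and `set` pops a column index and then a value. The function name `run_stack_math` is hypothetical. The row follows the column order of the filter definition (`END`, `START`, `DUR`).

```python
def run_stack_math(define, row):
    """Evaluate a comma-separated stack program against a list of column values."""
    stack = []
    for token in define.split(","):
        if token == "sub":
            right, left = stack.pop(), stack.pop()
            stack.append(left - right)
        elif token == "ddiv":
            right, left = stack.pop(), stack.pop()
            stack.append(left / right)          # floating-point division
        elif token == "toint":
            stack.append(int(stack.pop()))      # truncates toward zero
        elif token == "set":
            index, value = stack.pop(), stack.pop()
            row[index] = value                  # store the result in a column
        elif token.startswith("c"):
            stack.append(row[int(token[1:])])   # push a column value
        elif token.startswith("v"):
            stack.append(int(token[1:]))        # push a constant
        else:
            raise ValueError(f"unsupported token: {token}")
    return row


# Columns: END, START, DUR (times in milliseconds, DUR not yet set).
result = run_stack_math("c0,c1,sub,v1000,ddiv,toint,v2,set", [200468, 100234, None])
```

With the sample values, the intermediate results are 100234, then 100.234, then 100, which ends up in the third column.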
ValueFilter - Glob
ValueFilter

{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}

BundleFilter
ValueFilter - Chain, Split, Index
ValueFilter

{op:"field", from:"LIST", filter:{op:"chain", filter:[
  {op:"split", split:"="},
  {op:"index", index:0}
]}},
ValueFilter(s)
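What the chained value filters do to a single value can be sketched as function composition. This is a minimal Python illustration, not Hydra code. The helper names mirror the `op` names on the slide, and returning null for an out-of-range index is an assumption.

```python
def split(separator):
    """Value filter: turn one string into a list of parts."""
    return lambda value: None if value is None else value.split(separator)


def index(position):
    """Value filter: pick one element from a list, or None if it is missing."""
    def apply(values):
        if values is None or position >= len(values):
            return None
        return values[position]
    return apply


def chain(*filters):
    """Value filter: feed the output of each filter into the next one."""
    def apply(value):
        for f in filters:
            value = f(value)
        return value
    return apply


# Equivalent of the slide's chain: split on "=" and keep element 0.
key_of = chain(split("="), index(0))
```

Calling `key_of("color=blue")` splits the value into `["color", "blue"]` and keeps the first element.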
Data Attachments
Data Attachments are Hydra’s Secret Weapon
• Top-K Estimator
• Cardinality Estimation (HyperLogLog Plus)
• Quantile Estimation (Q-Digest, T-Digest)
• Bloom Filters
• Multiset streaming summarization (CountMin Sketch)
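To give a feel for how such estimators trade accuracy for fixed memory, here is a minimal Count-Min Sketch in Python. It is a teaching sketch, not the implementation Hydra uses. The class name, the default `width` and `depth`, and the SHA-256 based hashing are all assumptions chosen for clarity.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates can over-count but never under-count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, each derived from a differently salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

Memory stays at `width * depth` counters no matter how many distinct search terms are added, which is what makes this kind of summary practical to attach to tree nodes.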
Data Attachment Example
A single node that tracks the top 1000 unique search terms, the distinct count of
UIDs, and provides quantile estimation for the query time
Putting it All Together
Job Structure
• Jobs have three sections
  • Source
  • Map
  • Output
Source
• Defines the properties of the input data set
• Several built-in source types:
  • Mesh
  • Local File System
  • Kafka
Map
• Select fields from input record to process
• Apply filters to rows and columns
• Drop or expand rows
Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File Output Targets
  • File System
  • Cassandra
  • HDFS
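The three job sections can be pictured as a small pipeline. The following Python sketch is an analogy only, since a real Hydra job is a configuration and not Python code. The function names, the column names, and the choice of an IP -> term -> hits tree are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def source(text):
    """Source: read raw input and yield one row (dict) per record."""
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        time, uid, ip, terms = fields
        yield {"TIME": float(time), "UID": uid, "IP": ip, "TERMS": terms}


def map_rows(rows):
    """Map: select fields, drop rows without a UID, expand one row per term."""
    for row in rows:
        if not row["UID"]:
            continue
        for term in row["TERMS"].split():
            yield {"IP": row["IP"], "TERM": term}


def output(rows):
    """Output: aggregate hits into a two-level tree, IP -> term -> hits."""
    tree = {}
    for row in rows:
        tree.setdefault(row["IP"], Counter())[row["TERM"]] += 1
    return tree


SAMPLE = (
    '3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"\n'
    '4.206, 5dbd9451948ad895, 88.120.153.226, "to"\n'
)
tree = output(map_rows(source(SAMPLE)))
```

The resulting `tree` is the kind of aggregation that can be queried afterwards, for example by walking one IP's children and sorting them by hit count.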
Let's Put It All Together
Create Hydra Job
Run Job
Query
What are the top IP Addresses By Record Count?
• Exact
  • path: root/byip/+:+hits
  • ops: gather=ks;sort=1:n:d;limit=100
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the top IPs by unique user count?
• Exact
  • path: root/byip/+/+
  • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the search terms for the slowest 5%?
• First get the 95th percentile query time
  • path: /root$+timeDigest=quantile(.95)
  • ops: num=c0,toint,v0,set;gather=a
• Now find all queries at or above the 95th percentile
  • path: /root/bytime/+/+:+hits
  • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
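The same two-step logic, computed exactly instead of with a quantile digest, might look like this in Python. It is a sketch for comparison only, not Hydra query code. The function names are hypothetical, the nearest-rank percentile definition is one choice among several, and records are assumed to be `(query time, search terms)` pairs.

```python
import math
from collections import Counter


def percentile(values, fraction):
    """Nearest-rank percentile: the value at rank ceil(fraction * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def slowest_terms(records, fraction=0.95):
    """Step 1: find the threshold. Step 2: count terms in queries at or above it."""
    threshold = percentile([time for time, _ in records], fraction)
    hits = Counter()
    for time, terms in records:
        if time >= threshold:      # same comparison as gteq in the query ops
            hits.update(terms.split())
    return threshold, hits.most_common()
```

An exact percentile needs every value sorted in memory, whereas the quantile data attachment answers the first step from a compact summary.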
Daily Unique Searches, Users, IPs and distribution of response times?
• Query Path:
  • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
  • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
  • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
But yeah, I could do that with CLI!
Related Open Source Projects
• Meshy - https://github.com/addthis/meshy
• Codec - https://github.com/addthis/codec
• Muxy - https://github.com/addthis/muxy
• Bundle - https://github.com/addthis/bundle
• Basis - https://github.com/addthis/basis
• Column Compressor - https://github.com/addthis/columncompressor
• Cluster Boot Service - https://github.com/stewartoallen/cbs
Helpful Resources
• Hydra - https://github.com/addthis/hydra
• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/
• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
• IRC - #hydra
• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss

Practical Introduction to Hydra for Big Data Processing

  • 1. Hydra - A Practical Introduction Big Data DC - @bigdatadc Matt Abrams - @abramsm March 4th 2013
  • 2.
  • 3. Agenda • What is Hydra? • Sample Data and Analysis Questions • Getting started with a local Hydra dev environment • Hydra’s Key Concepts • Creating your first Hydra job • Putting it all together
  • 4. Hydra’s Goals • Support Streaming and Batch Processing • Massive Scalability • Fault tolerant by design (bend but do not break) • Incremental Data Processing • Full stack operational support • Command and Control • Alerting • Resource Management • Data/Task Rebalancing • Data replication and Backup
  • 5. What Exactly is Hydra? • File System • Data Processing • Query System • Job/Cluster Management • Operational Alerting • Open Source
  • 6. Hydra - Terms • Job: a process for processing data • Task: a processing component of a job. A job can have one to n tasks • Node: A logic unit of processing capacity available to a cluster • Minion: Management process that runs on cluster nodes. Acts as gate keeper for controlling task processes • Spawn: Cluster management controller and UI
  • 8. Our Sample Data (Log-Synth) 3.535,  5214d63bab95687d,  166.144.203.186,  "the  then  good"   3.568,  5dbd9451948ad895,  88.120.153.226,  "know  boys"   4.206,  5dbd9451948ad895,  88.120.153.226,  "to"   4.673,  b967d99cad0b3e60,  88.120.153.226,  "seven"   4.900,  bd0d760fbb338955,  166.144.203.186,  "did  local  it"
  • 9. What do we want to know? • What are the top IP addresses by request count? • What are the top IP address by unique user count? • What are the most common search terms? • What are the most common search terms in the slowest 5% of queries? • What are the daily number of unique searches, unique users, unique IP addresses, and distribution of response times (all approximates)?
  • 10. Setting up Hydra’s Local Stack
  • 11. Vagrant • $  vagrant  init  precise32  http:// files.vagrantup.com/precise32.box   • //  add:  config.vm.network  :forwarded_port,   guest:  5052,  host:  5052  to  your  Vagrantfile   • $  vagrant  up   • $  vagrant  ssh
  • 12. Java7 • $  sudo  apt-­‐get  update     • $  sudo  apt-­‐get  install  python-­‐software-­‐ properties   • $  sudo  add-­‐apt-­‐repository  ppa:webupd8team/java   • $  sudo  apt-­‐get  update   • $  sudo  apt-­‐get  install  oracle-­‐java7-­‐installer
  • 13. RabbitMQ, Maven, Git, Make • $  sudo  apt-­‐get  install  rabbitmq-­‐server   • $  sudo  apt-­‐get  install  maven   • $  sudo  apt-­‐get  install  git   • $  sudo  apt-­‐get  install  make
  • 14. Copy on Write • $  wget  http://xmailserver.org/fl-­‐cow-­‐0.10.tar.gz   • $  tar  zxvf  fl-­‐cow-­‐0.10.tar.gz   • $  cd  fl-­‐cow-­‐0.10   • $  ./configure  —prefix=/usr   • $  make;  make  check   • $  sudo  make  install   • $  export  LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
  • 15. Hydra • $  git  clone  https://github.com/addthis/ hydra.git   • $  cd  hydra;  mvn  clean  -­‐Pbdbje  package   • $  ./hydra-­‐uber/bin/local-­‐stack.sh  start   • $  ./hydra-­‐uber/bin/local-­‐stack.sh  start   • $  ./hydra-­‐uber/bin/local-­‐stack.sh  seed
  • 16. Stage Sample Data in Stream Directory • $  mkdir  ~/hydra/hydra-­‐local/streams/log-­‐synth   • $  cp  $YOUR_SAMPLE_DATA_DIR  ~/hydra/hydra-­‐ local/streams/log-­‐synth
  • 18. BundleFilters • Return true or false • Operate on entire rows • Add/Remove columns • Edit column values • May include a call to a ValueFilter
      ValueFilters • Operate on single column values • Return a value or null • No visibility to full row • Often take input from a BundleFilter
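The division of labour between the two filter kinds can be sketched in plain Python. This is a conceptual model only, not Hydra's implementation: the function names, the early exit of the chain, and the separator-free concatenation are all assumptions made for illustration.

```python
def exists(column):
    """Bundle filter: passes only if the row has a non-null value in column."""
    def bundle_filter(row):
        return row.get(column) is not None
    return bundle_filter


def concat(inputs, output):
    """Bundle filter: adds a new column built from several input columns."""
    def bundle_filter(row):
        row[output] = "".join(str(row[name]) for name in inputs)
        return True
    return bundle_filter


def field(column, value_filter):
    """Bundle filter that hands one column's value to a value filter."""
    def bundle_filter(row):
        result = value_filter(row.get(column))
        if result is None:  # assumed: a null result fails the row
            return False
        row[column] = result
        return True
    return bundle_filter


def chain(filters):
    """Bundle filter: run filters in order, stopping at the first failure."""
    def bundle_filter(row):
        return all(f(row) for f in filters)
    return bundle_filter


def lowercase(value):
    """Value filter (hypothetical): sees one value, returns a value or None."""
    return None if value is None else value.lower()


pipeline = chain([
    exists("UID"),
    field("FOO", lowercase),
    concat(["FOO", "BAR"], "OUTPUT"),
])

row = {"UID": "u1", "FOO": "Hello", "BAR": "World"}
passed = pipeline(row)
print(passed, row)
```

Note that `lowercase` never sees the row, only the single value handed to it by `field`, which mirrors the "no visibility to full row" point on the slide.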
  • 19. BundleFilter - Chain
      // chain of bundle filters
      {"op":"chain", "filter":[
        // LIST OF BUNDLE FILTERS
        ….
      ]}
  • 20. BundleFilter - Existence
      // false if UID column is null
      {"op":"field", "from":"UID"},
  • 21. Bundle Filter - Concatenation
      // joins FOO and BAR
      // Stores output in new column "OUTPUT"
      {"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
  • 22. BundleFilter - Equality Testing
      // FIELD_ONE == FIELD_TWO
      {"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
  • 23. BundleFilter - Math!
      // DUR = Math.round((end-start)/1000)
      {"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
  • 24. Stack Math - Sample Data
      C0 (START_TIME) | C1 (END_TIME)
      100             | 200
      234             | 468
  • 28. Stack Math - Sample Result
      C0 (START_TIME) | C1 (END_TIME) | C2 (DURATION)
      100             | 200           | 100
      234             | 468           |
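The "define" string reads as a small postfix (stack) program. The sketch below is a toy evaluator under assumed semantics, not Hydra's code: cN pushes column N, vN pushes the constant N, sub and ddiv pop two operands, toint truncates (the slide's comment says round, so the real operator may differ), and set pops a column index and then the value to store there. The define used for the sample table is hypothetical; the second call uses the define from the Math slide with made-up millisecond values.

```python
def run_stack_math(define, values):
    """Evaluate a comma-separated stack program against one row of values."""
    values = list(values)  # work on a copy so the caller's row is untouched
    stack = []
    for token in define.split(","):
        if token == "sub":
            right, left = stack.pop(), stack.pop()
            stack.append(left - right)
        elif token == "ddiv":  # assumed: floating-point division
            right, left = stack.pop(), stack.pop()
            stack.append(float(left) / float(right))
        elif token == "toint":  # assumed: truncation toward zero
            stack.append(int(stack.pop()))
        elif token == "set":  # assumed: top of stack is the target column
            index = stack.pop()
            values[index] = stack.pop()
        elif token[0] == "c" and token[1:].isdigit():
            stack.append(values[int(token[1:])])
        elif token[0] == "v" and token[1:].isdigit():
            stack.append(int(token[1:]))
        else:
            raise ValueError("unsupported token: " + token)
    return values


# First sample row: C0 = START_TIME, C1 = END_TIME, C2 = DURATION (empty).
sample = run_stack_math("c1,c0,sub,v2,set", [100, 200, None])
print(sample)

# Define from the Math slide, columns END, START, DUR, hypothetical ms values.
seconds = run_stack_math("c0,c1,sub,v1000,ddiv,toint,v2,set", [468000, 234000, None])
print(seconds)
```

Reading the program left to right and tracking the stack by hand is usually the quickest way to check a define string before putting it in a job.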
  • 29. ValueFilter - Glob
      {from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}
      (slide labels: the inner filter object is the ValueFilter; the enclosing object is the BundleFilter)
  • 30. ValueFilter - Chain, Split, Index
      {op:"field", from:"LIST", filter:
        {op:"chain", filter:[
          {op:"split", split:"="},
          {op:"index", index:0}
        ]}},
      (slide labels: the chain is a ValueFilter whose filter list holds further ValueFilters)
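The value filters on the last two slides can be modelled as functions from one value to a value or None. This is a rough Python analogue under assumptions, not Hydra's behaviour: here glob returns the value when it matches and None otherwise, and the chain stops as soon as a step yields None.

```python
from fnmatch import fnmatchcase


def glob(pattern):
    """Value filter: keep the value only if it matches the glob pattern."""
    def value_filter(value):
        if value is not None and fnmatchcase(value, pattern):
            return value
        return None
    return value_filter


def split(separator):
    """Value filter: turn a string into a list of pieces."""
    def value_filter(value):
        return None if value is None else value.split(separator)
    return value_filter


def index(position):
    """Value filter: pick one element of a list, or None if out of range."""
    def value_filter(value):
        if value is None or position >= len(value):
            return None
        return value[position]
    return value_filter


def chain(filters):
    """Value filter: feed the output of each filter into the next one."""
    def value_filter(value):
        for f in filters:
            value = f(value)
            if value is None:  # assumed: a null short-circuits the chain
                return None
        return value
    return value_filter


first_piece = chain([split("="), index(0)])
print(first_piece("name=value"))
print(glob("Log_[0-9]*")("Log_42"))
```

Wrapped in a field-style bundle filter, `first_piece` would replace the LIST column with the text before the first "=".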
  • 32. Data Attachments are Hydra’s Secret Weapon • Top-K Estimator • Cardinality Estimation (HyperLogLog Plus) • Quantile Estimation (Q-Digest, T-Digest) • Bloom Filters • Multiset streaming summarization (CountMin Sketch)
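To give a feel for the last item, here is a minimal Count-Min Sketch in Python. It is a teaching sketch, not the implementation Hydra uses; the class name, the default width and depth, and the SHA-256 based hashing are choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by salting with the row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for term in ["to", "seven", "to", "know", "boys", "to"]:
    sketch.add(term)

print(sketch.estimate("to"), sketch.estimate("seven"))
```

Memory use depends only on width and depth, not on the number of distinct items, which is the property that makes this family of structures attractive as per-node attachments.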
  • 33. Data Attachment Example A single node that tracks the top 1000 unique search terms and the distinct count of UIDs, and provides quantile estimation for the query time
  • 34. Putting it All Together
  • 35. Job Structure • Jobs have three sections • Source • Map • Output
  • 36. Source • Defines the properties of the input data set • Several built-in source types: • Mesh • Local File System • Kafka
  • 37. Map • Select fields from input record to process • Apply filters to rows and columns • Drop or expand rows
  • 38. Output - Tree • Output(s) can be trees or data files • Trees represent data aggregations that can be queried • File output targets: • File System • Cassandra • HDFS
  • 39. Let's put it all Together
  • 42. Query
  • 43. What are the top IP Addresses By Record Count?
      Exact
        path: root/byip/+:+hits
        ops: gather=ks;sort=1:n:d;limit=100
      Approximate
        path: root/byip/+$+uidcount
        ops: gather=ks;sort=1:n:d;limit=100
  • 44. What are the top IPs by unique user count?
      Exact
        path: root/byip/+/+
        ops: gather=kk;sort=0;gather=ku;sort=1:n:d
      Approximate
        path: root/byip/+$+uidcount
        ops: gather=ks;sort=1:n:d;limit=100
  • 45. What are the search terms for the slowest 5%?
      First get the 95th percentile query time
        path: /root$+timeDigest=quantile(.95)
        ops: num=c0,toint,v0,set;gather=a
      Now find all queries at or above the 95th percentile
        path: /root/bytime/+/+:+hits
        ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
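The same two-step idea, threshold first and filter second, can be sketched exactly in Python. The records below are hypothetical (they are not from the deck), and the nearest-rank percentile here is exact, whereas the timeDigest attachment returns an estimate.

```python
import math
from collections import Counter

# Hypothetical (query_time, search_term) records: times 10, 20, ..., 1000.
RECORDS = [(10 * i, "term%d" % i) for i in range(1, 101)]


def percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least percent% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Step 1: find the 95th percentile query time.
threshold = percentile([time for time, _ in RECORDS], 95)

# Step 2: count the search terms of every query at or above that time.
slow_terms = Counter(term for time, term in RECORDS if time >= threshold)

print(threshold)
print(slow_terms.most_common(3))
```

In the Hydra version the result of the first query (950 in the slide's example) is pasted into the second query's ops as the constant v950.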
  • 46. Daily Unique Searches, Users, IPs and distribution of response times?
      Query Path: root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
      Ops: gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
      Remote Ops: num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
  • 47. But yeah, I could do that with CLI!
  • 48. Related Open Source Projects • Meshy - https://github.com/addthis/meshy • Codec - https://github.com/addthis/codec • Muxy - https://github.com/addthis/muxy • Bundle - https://github.com/addthis/bundle • Basis - https://github.com/addthis/basis • Column Compressor - https://github.com/addthis/columncompressor • Cluster Boot Service - https://github.com/stewartoallen/cbs
  • 49. Helpful Resources • Hydra - https://github.com/addthis/hydra • Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/ • Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/ • IRC - #hydra • Mailing List - https://groups.google.com/forum/#!forum/hydra-oss