Cardinality Estimation for
    Very Large Data Sets
                                      
                                      
    Matt Abrams, VP Data and Operations
                         March 25, 2013
THANKS FOR
COMING!
I build large scale distributed systems and work on
algorithms that make sense of the data stored in
them


Contributor to the open source project Stream-Lib,
a Java library for summarizing data streams
(https://github.com/clearspring/stream-lib)


Ask me questions: @abramsm
HOW CAN WE COUNT
THE NUMBER OF
DISTINCT ELEMENTS
IN LARGE DATA
SETS?
HOW CAN WE COUNT
THE NUMBER OF
DISTINCT ELEMENTS
IN VERY LARGE DATA
SETS?
GOALS FOR
COUNTING SOLUTION
Support high-throughput data streams (up
to many hundreds of thousands of events per second)
Estimate cardinality with known error
thresholds in sets of up to around 1 billion (or
even 1 trillion when needed) elements
Support set operations (unions and
intersections)
Support data streams with a large number of
dimensions
1 UID = 128 bits
513a71b843e54b73
In one month AddThis
    logs 5B+ UIDs

          2,500,000 * 2000
          = 5,000,000,000
That’s 596GB of
  just UIDs
NAÏVE SOLUTIONS

•  SELECT COUNT(DISTINCT UID)
   FROM table WHERE dimension = 'foo'
•  HashSet<K>
•  Run a batch job for each
   new query request
WE ARE NOT A BANK




    This means an estimate rather
    than an exact value is acceptable.

THREE INTUITIONS
•  It is possible to estimate the cardinality of a set
   by understanding the probability of a sequence
   of events occurring in a random variable (e.g.
   how many coins were flipped if I saw n heads in
   a row?)
•  Averaging the results of multiple
   observations can reduce the variance
   associated with random variables
•  Applying a good hash function effectively de-
   duplicates the input stream
INTUITION




   What is the probability
   that a binary string
   starts with ’01’?
INTUITION




  (1/2)² = 25%
INTUITION




(1/2)³ = 12.5%
INTUITION




Crude analysis: if a stream
has 8 unique values, the hash
of at least one of them should
start with ‘001’ (each does so
with probability 1/8)
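This intuition is easy to poke at with a quick sketch (illustrative Python; MD5 stands in for any well-mixed hash function, and the names are hypothetical):

```python
import hashlib

def hash_bits(value: str, width: int = 32) -> str:
    """Hash a value and return its first `width` bits as a '0'/'1' string."""
    h = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return format(h, "0128b")[:width]

# Each hash starts with '001' with probability 1/8, so among 8 distinct
# inputs we expect roughly one such prefix, matching the crude analysis.
prefixes = [hash_bits(f"user-{i}")[:3] for i in range(8)]
print(prefixes)
```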
INTUITION




Given the variability of a single
random value, we cannot use
a single variable for accurate
cardinality estimation
MULTIPLE OBSERVATIONS HELP
REDUCE VARIANCE

By averaging the observations of multiple
independent random variables we can make
the error as small as desired by controlling
the size of m (the number of random variables):


    error = σ / √m
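As a quick sanity check on the formula (illustrative Python; the function name is hypothetical):

```python
import math

def stderr_of_mean(sigma: float, m: int) -> float:
    """Standard error of the mean of m independent observations: sigma / sqrt(m)."""
    return sigma / math.sqrt(m)

# Quadrupling the number of observations halves the error.
print(stderr_of_mean(1.0, 16))   # 0.25
print(stderr_of_mean(1.0, 64))   # 0.125
```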
THE PROBLEM WITH
MULTIPLE HASH
FUNCTIONS

•  It is too costly from a
   computational perspective to
   apply m hash functions to
   each data point
•  It is not clear that it is possible
   to generate m good hash
   functions that are independent
STOCHASTIC
AVERAGING
• Emulate the effect of m experiments
  with a single hash function
• Divide the input stream h(M) into m
  substreams:

      {1/m, 2/m, ..., (m−1)/m, 1}

• An average of the observable values for
  each substream will yield a cardinality
  estimate that improves in proportion to
  1/√m as m increases
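A minimal sketch of the substream split, assuming one 32-bit hash whose top bits pick the bucket (illustrative Python; MD5 and the names are stand-ins):

```python
import hashlib

B = 4               # bits used for the bucket index
M = 1 << B          # m = 16 substreams

def split(value: str) -> tuple:
    """Use the top B bits of a 32-bit hash as the substream index;
    the remaining 28 bits act as that substream's random observation."""
    h = int(hashlib.md5(value.encode()).hexdigest(), 16) & 0xFFFFFFFF
    return h >> (32 - B), h & ((1 << (32 - B)) - 1)

# One hash call per element; the bucket assignment emulates m experiments.
buckets = {split(f"user-{i}")[0] for i in range(1000)}
print(len(buckets))
```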
HASH FUNCTIONS
Elements hashed before a collision becomes likely:

32-bit Hash   64-bit Hash    160-bit Hash    Odds of a Collision
77,163        5.06 billion   1.42 × 10^24    1 in 2
30,084        1.97 billion   5.55 × 10^23    1 in 10
9,292         609 million    1.71 × 10^23    1 in 100
2,932         192 million    5.41 × 10^22    1 in 1000

         http://preshing.com/20110504/hash-collision-probabilities
HYPERLOGLOG
      (2007)
Counts up to 1 Billion in 1.5KB of space




            Philippe Flajolet (1948-2011)
HYPERLOGLOG (HLL)
•  Operates with a single pass
   over the input data set
•  Produces a typical error of
            1.04 / √m
•  Error decreases as m
   increases; error is not a
   function of the number of
   elements in the set
HLL SUBSTREAMS

 HLL uses a single hash
 function and splits the result
 into m buckets
[Diagram: input values flow through a single
hash function whose output is split across
Bucket 1, Bucket 2, ..., Bucket m]
HLL ALGORITHM
BASICS
•  Each substream maintains an observable
 •  The observable is the largest value ρ(x), where ρ(x) is
    the position of the leftmost 1-bit in the binary string x



•  32 bit hashing function with 5 bit “short bytes”
•  Harmonic mean
 •  Increases quality of estimates by reducing variance
WHAT ARE “SHORT BYTES”?
•  We know a priori that the value of a given
   substream register of the multiset M is in the
   range

           0..(L + 1 − log₂ m)

•  Assuming L = 32 we only need 5 bits to
   store the value of the register
•  That is roughly 85% less memory than a
   standard Java int (32 bits)
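The arithmetic behind the 5-bit claim can be sketched as follows (illustrative Python; the choice m = 2048 is an assumption for the example):

```python
import math

L = 32        # hash width in bits
m = 2048      # number of substreams, m = 2^11

# A register holds a value in 0..(L + 1 - log2(m)) = 0..22,
# which fits in ceil(log2(23)) = 5 bits.
max_register = L + 1 - int(math.log2(m))
bits_needed = math.ceil(math.log2(max_register + 1))
savings = 1 - bits_needed / 32
print(bits_needed, f"{savings:.1%}")   # 5 84.4%
```

84.4% rounds to the "85% less memory" figure on the slide.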
ADDING VALUES TO
HLL



        index = 1 + ⟨x₁x₂···x_b⟩₂        ρ(x_{b+1}x_{b+2}···)



•  The first b bits of the new value define the
   index into the multiset M that may be
   updated when the new value is added
•  The bits b+1 to L are used to determine
   the position of the leftmost 1-bit, ρ
ADDING VALUES TO
HLL
                   Observations




{M[1], M[2], ..., M[m]}

The multiset is updated using the equation:

   M[j] := max(M[j], ρ(ω))

where ρ(ω) is the number of leading zeros + 1
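Putting the last two slides together, the add path might look like this sketch (Python; MD5 stands in for a real 32-bit hash, and the register array name is hypothetical):

```python
import hashlib

B = 11                      # index bits, so m = 2^11 = 2048 registers
REGISTERS = [0] * (1 << B)  # the multiset M[1..m], 0-indexed here

def rho(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit word."""
    return width - w.bit_length() + 1

def hll_add(value: str) -> None:
    h = int(hashlib.md5(value.encode()).hexdigest(), 16) & 0xFFFFFFFF
    j = h >> (32 - B)                # first b bits -> register index
    w = h & ((1 << (32 - B)) - 1)    # remaining bits -> observation
    REGISTERS[j] = max(REGISTERS[j], rho(w, 32 - B))  # M[j] := max(M[j], rho(w))

for i in range(10000):
    hll_add(f"user-{i}")
print(max(REGISTERS))
```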
INTUITION ON
EXTRACTING
CARDINALITY FROM HLL
•  If we add n unique elements to a stream then
   each substream will contain roughly n/m
   elements
•  The MAX value in each substream should be
   about log₂(n/m) (from the earlier intuition about
   random variables)
•  The harmonic mean (mZ) of the 2^MAX values is
   on the order of n/m
•  So m²Z is on the order of n ← that’s the
   cardinality!
HLL CARDINALITY
ESTIMATE
                          m
   E := α_m · m² · ( ∑ 2^(−M[j]) )^(−1)
                         j=1

   (here m = 2^p, so m² = (2^p)², and the
   parenthesized term gives the harmonic mean)


•  m²Z has a systematic multiplicative bias that needs to be
   corrected; this is done by multiplying by the constant α_m
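In code the estimate reduces to a few lines (a sketch; the closed-form constant 0.7213 / (1 + 1.079/m) for α_m comes from the HLL paper and is valid for m ≥ 128):

```python
def hll_raw_estimate(registers):
    """E := alpha_m * m^2 * (sum_j 2^(-M[j]))^(-1)."""
    m = len(registers)
    alpha_m = 0.7213 / (1 + 1.079 / m)           # bias-correction constant
    z = 1.0 / sum(2.0 ** -r for r in registers)  # harmonic-mean term
    return alpha_m * m * m * z

# Toy registers: every one of m = 1024 registers saw a max rho of 5.
print(round(hll_raw_estimate([5] * 1024)))
```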
A NOTE ON LONG
RANGE CORRECTIONS
•  The paper says to apply a long-range
   correction function when the estimate is
   greater than:

       E > (1/30) · 2³²

•  The correction function is:

       E* := −2³² log(1 − E / 2³²)

•  DON’T DO THIS! It doesn’t work and
   increases error. A better approach is to
   use a bigger/better hash function
DEMO TIME!
Let’s look at HLL in action.


            http://www.aggregateknowledge.com/science/blog/hll.html
HLL UNIONS

•  Merging two or more HLL
   data structures is a
   similar process to adding
   a new value to a single
   HLL
•  For each register in the
   HLL take the max value of
   the HLLs you are merging;
   the resulting register
   set can be used to
   estimate the cardinality of
   the combined sets

[Diagram: daily HLLs (MON..FRI) merging into a Root HLL]
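The register-wise max described above is a one-liner (illustrative Python with toy register arrays):

```python
def hll_union(a, b):
    """Merge two HLL register arrays by register-wise max; the result
    estimates the cardinality of the union of the underlying sets."""
    assert len(a) == len(b), "HLLs must use the same m"
    return [max(x, y) for x, y in zip(a, b)]

mon = [3, 1, 4, 1]
tue = [2, 5, 3, 1]
print(hll_union(mon, tue))   # [3, 5, 4, 1]
```

Because max is associative and commutative, a week's worth of daily HLLs can be merged in any order and the estimate is identical to one HLL built over all the data.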
HLL INTERSECTION
         C = |A| + |B| − |A ∪ B|

[Venn diagram: overlapping sets A and B with intersection C]

     You must understand the properties
     of your sets to know if you can trust
     the resulting intersection
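The inclusion-exclusion step itself is trivial; the danger is in the error bars (illustrative Python with made-up toy numbers):

```python
def intersection_estimate(card_a, card_b, card_union):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return card_a + card_b - card_union

# The errors of all three estimates add up, so a small intersection of
# two large sets can be swamped by noise, hence the caveat above.
print(intersection_estimate(1000, 800, 1500))   # 300
```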
HYPERLOGLOG++
•  Google researchers have recently released an
   update to the HLL algorithm
•  Uses clever encoding/decoding techniques to
   create a single data structure that is very
   accurate for small cardinality sets and can
   estimate sets that have over a trillion elements
   in them
•  Empirical bias correction. Observations show
   that most of the error in HLL comes from the
   bias function. Using empirically derived values
   significantly reduces error
HLL++ DELTA
 ENCODING


{1024,1027,1028,1030,1033,1035}

                {0, 3,1, 2, 3, 2}
 By using delta encoding, fewer bits are required to
 represent the array, making it easier to fit larger
 sets in memory
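The encoding on this slide can be reproduced with a short sketch (illustrative Python; the base value 1024 is taken from the example above):

```python
def delta_encode(sorted_values, base):
    """Replace each value with its gap from the previous one
    (the first gap is taken from `base`)."""
    gaps, prev = [], base
    for v in sorted_values:
        gaps.append(v - prev)
        prev = v
    return gaps

print(delta_encode([1024, 1027, 1028, 1030, 1033, 1035], base=1024))
# -> [0, 3, 1, 2, 3, 2]
```

Small gaps need far fewer bits than full values, which is what makes the encoded array compact.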
OTHER PROBABILISTIC
DATA STRUCTURES
•  Bloom Filters – set membership
   detection
•  CountMinSketch – estimate number
   of occurrences for a given element
•  TopK Estimators – estimate the
   frequency and top elements from a
   stream
REFERENCES
•  Stream-Lib -
   https://github.com/clearspring/stream-lib
•  HyperLogLog -
   http://citeseerx.ist.psu.edu/viewdoc/summary?
   doi=10.1.1.142.9475
•  HyperLogLog In Practice -
   http://research.google.com/pubs/pub40671.html
•  Aggregate Knowledge HLL Blog Posts -
   http://blog.aggregateknowledge.com/tag/
   hyperloglog/
THANKS!


     AddThis is hiring!


Editor's Notes

  1. 2.5M people
  2. Given that a good hash function produces a uniformly random
     mix of 0s and 1s, we can make observations about the probability
     of certain conditions appearing in the hashed value.