Chandan Rajpurohit

An Artist With Technical Skills

Let us build a web analytics tool where one data point is number of unique users that visited URL. Problem that we face implementing this is web-scale, as you may have million of users.

A naive MapReduce implementation of aggregation will be to use hashtable to store and count number of unique users. Hashtable for million users will exhaust JVM heap.

To solve such problem we can use probabilistic algorithm such as HyperLogLog as it takes significantly smaller memory compared to heap. Only disadvantage of using such algorithm is accuracy that can be tuned.

HyperLogLog is somewhat similar to BloomFilter with key difference that HyperLogLog estimates count, whereas BloomFilter provides membership capabilities.

Potential application of HyperLogLog are link-based spam detection on web and mining of large datasets.

HyperLogLog is a probabilistic cardinality estimator. It relaxes constraint of exactly calculating the number of elements in set and instead estimates the number of elements.

Data structures that support exact set cardinality calculations needs storage which is proportional to the number of elements, which may not be optimal while working with large datasets. Probabilistic cardinality structures take less memory than their exact cardinality counterparts, and are applicable in times where the cardinality can be off by few percentage points.

It can perform cardinality estimation for counts beyond 10^9 using 1.5 KB of memory with error rate of 2%. It works by counting the maximum number of consecutive zeros in has and using probabilities to predict the count of all the unique items.

There are two parameters of HyperLogLog :

1:- The number of buckets, usually expressed by number, b, which is used to determine number of buckets by 2^b. Therefore, each increment in b doubles the number of buckets. The lower bond of b is set to 4.

2:- The number of bits used to represent the maximum number of consecutive zeros in a bucket.

As a result, The size of HyoperLogLog is calculated by 2^b * bits-per-bucket.

for b=11 and number of bits per bucket is 5 then size will be 10240 bits or 1.25 KB.

Calculating unique counts using HyperLogLog

Java implementation of HyperLogLog is called is java-hll and is available in GitHub.

For example We use simple case of where data consists of an array of numbers, and Google’s Guava library is used to create a hash for each number and add it to HyperLogLog.

CODE:

HashFunction hasher = Hashing.murmur3_128(); #Use guava 128 bit murmur hash algorithm

final Integer []data = new Integer[]{1,2,3,4,5,1,2,3,1,4,5,6}; #data to calculate distinct count

final HLL hll = new HLL(

13, #number of bucket b

5 #number of bits per bucket

);

for(int item:data){

final long hashedValue = hasher.newHasher().putInt(item).hash().asLong(); #calculate hash of an item