The count-distinct problem (also known as cardinality estimation) is the problem of finding the number of distinct elements in a data stream that contains repeated elements.

The naive solution would require keeping track of all distinct elements seen so far and comparing each element of the stream against this set. With a hash set, the time complexity per element would be O(1), but the space complexity would be O(N), where N is the number of distinct elements.

HyperLogLog (HLL) is an algorithm that answers such queries in O(1) time and O(log(log(N))) space. The idea behind the algorithm is based on the probability of seeing N leading 0s in a bit string: that probability is 1/2^{N}. Put differently, in a uniformly distributed data set, about 2^{N} elements need to be encountered before one with N leading 0s is found.

**Flajolet-Martin Algorithm**

If there are *N* elements in a data set and a hash function *H()*, then for each element *e* in the data set, *H(e)* is computed. To track the number of distinct elements, store the largest number of leading zeroes encountered across all values of *H(e)*. If the hash function is uniformly distributed, the number of distinct elements encountered will be approximately 2^{L}, where L is the largest number of leading 0s encountered. This is called the Flajolet-Martin algorithm.
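As a rough sketch, the idea looks like this in code (the MD5-based 32-bit hash here is just an illustrative stand-in for a uniform hash function):

```python
import hashlib

def leading_zeros(h: int, bits: int = 32) -> int:
    """Count leading zeros in the bits-wide binary form of h."""
    return bits if h == 0 else bits - h.bit_length()

def fm_estimate(stream) -> int:
    """Flajolet-Martin: estimate the distinct count as 2^L, where L is
    the largest number of leading zeros seen in any hashed element."""
    max_zeros = 0
    for item in stream:
        # 32-bit hash derived from MD5 (assumed roughly uniform)
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h))
    return 2 ** max_zeros
```

Note that the estimate is always a power of two, which is one source of the algorithm's coarseness.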

As can be seen in the above diagram, the output of the algorithm was close to the actual number of distinct elements. But it is also evident that if the only element encountered in the stream were *helena*, the output would still have been 8, which would be far from the accurate value.

**LogLog and SuperLogLog Algorithm**

The approximation above is biased for small data sets. It can be improved by using multiple hash functions and averaging the value of L across them. However, computing multiple hash functions is computationally expensive, so instead of using multiple hash functions, the first few bits of the hashed output can be used to divide the values into buckets, as shown in the diagram below.

This is the LogLog algorithm. But as seen in the above example, the estimate is still high compared to the actual number of elements. To further reduce the bias, the largest 30% of the bucket values can be discarded, keeping only the smallest 70%. This is known as the SuperLogLog algorithm. The output, in this case, would be 11, which is much closer to the actual value than the LogLog estimate.
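The bucketing idea can be sketched as follows (the bias-correction constant that the real LogLog algorithm multiplies in is omitted for clarity, so this version overestimates; the bucket count of 16 is just an example):

```python
import hashlib

def loglog_estimate(stream, b: int = 4) -> float:
    """LogLog sketch: the first b bits of the hash pick one of m = 2^b
    buckets; each bucket keeps the max leading-zero count of the
    remaining bits. The estimate uses the arithmetic mean of the
    bucket values (bias-correction constant omitted)."""
    m = 2 ** b                              # number of buckets
    buckets = [0] * m
    for item in stream:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        idx = h >> (32 - b)                 # first b bits -> bucket index
        rest = h & ((1 << (32 - b)) - 1)    # remaining 32 - b bits
        zeros = (32 - b) - rest.bit_length()
        buckets[idx] = max(buckets[idx], zeros)
    return m * 2 ** (sum(buckets) / m)      # arithmetic mean of bucket values
```

Note that the sketch itself is just the small array `buckets`, regardless of how many elements flow through the stream.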

**Finally, HyperLogLog**

Finally, HyperLogLog attempts to reduce the bias even further. It simply replaces the arithmetic mean with the harmonic mean of the bucket values, because the harmonic mean is far less sensitive to outliers, as can be seen in the diagram below.
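A small illustration of why the harmonic mean helps (the sample values here are made up; in HyperLogLog proper, the harmonic mean is taken over per-bucket estimates rather than raw numbers):

```python
def harmonic_mean(values):
    """Harmonic mean: dominated by the smallest values, so a single
    unusually large value barely moves it."""
    return len(values) / sum(1.0 / v for v in values)

# One outlier (40) among otherwise small values:
samples = [2, 3, 3, 4, 40]
arithmetic = sum(samples) / len(samples)   # dragged up to 10.4
harmonic = harmonic_mean(samples)          # stays near the typical values (~3.47)
```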

One of the benefits of using HLL is how easily it can be parallelized. If there are multiple hosts, each host can use the same hash function and the same number of buckets. To merge the results, take the maximum value of each bucket across hosts and apply the algorithm above. The result is the same as if everything had been computed on a single host.
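The merge step is just an element-wise maximum over the bucket arrays (assuming, as noted, that both sketches were built with the same hash function and bucket count):

```python
def merge_sketches(buckets_a, buckets_b):
    """Merge two HLL-style bucket arrays by taking the per-bucket
    maximum. Valid only if both were built with the same hash
    function and the same number of buckets."""
    assert len(buckets_a) == len(buckets_b)
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]
```

Because `max` is associative and commutative, hosts can merge in any order, which is what makes the distributed setup so convenient.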

Redis supports HyperLogLog as a native data structure that can be used to count distinct elements; more details can be found in the Redis documentation. There is also a Java library called stream-lib which supports HLL operations.

A 'sketch' is a data structure that supports a pre-defined set of queries and updates on the data stored in it while consuming substantially less space (often exponentially less) than would be needed to store every piece of data that has been observed. Sketching algorithms answer such queries using sketches.

These algorithms are commonly used in distributed systems where real-time stream processing is needed. Examples of such use cases are counting the number of views on a video, or checking whether a video has already been watched by a user. Evaluating such queries exactly would require storing all historic data and paying a high computation cost. To avoid this, sketching algorithms trade away some accuracy and provide an approximate answer while consuming less space and, often, less time.

Let's now jump into one such data structure: *Bloom Filter*

A Bloom filter is a probabilistic data structure that is used to answer the following types of queries:

- Has the video been already seen by the user?
- Is the username available?
- Is this a malicious URL?

Considering the probabilistic nature of the data structure, when asked whether an element is present in a set, it can answer either *definitely not present* (a true negative) or *maybe present* (a true positive or a false positive).

Now let's dive into the details of storage and computation to understand its behaviour, using the following problem as a running example.

**Problem Statement**: We have a system that stores the usernames of all users. We need to build a component that can quickly check whether a particular username is already taken by another user.

If we were to achieve this without a Bloom filter, we would simply store the usernames in a database and query the DB to check if a username is present. If yes, it is taken; otherwise it is available. This is a fairly simple approach, but the major issue is that the memory required for an index that can answer such queries efficiently grows in direct proportion to the number of usernames.

This is where Bloom filters showcase their value. They allow systems to answer such queries in constant time using a fixed, comparatively small amount of memory.

The Bloom filter is a k-bit bitmap with all bits initially set to 0. Additionally, n hash functions are needed, each with a range of 0 to k-1. When a new string is added to the Bloom filter, each hash function is applied to it and the corresponding bit in the bitmap is set to 1. This is done for every string provided as input to the Bloom filter.

Here's a diagram that shows how this would work :

Now, if a candidate c needs to be tested for presence, the same n hash functions are applied to it. If all the corresponding bits are already set to 1, the filter reports that c might be present in the set. If any one of the bits is 0, c is definitely not present.
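A minimal sketch of both operations (the n salted MD5 hashes stand in for n independent uniform hash functions, and the sizes are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a k-bit bitmap with n derived hash functions."""

    def __init__(self, k_bits: int = 1024, n_hashes: int = 5):
        self.k = k_bits
        self.n = n_hashes
        self.bits = [0] * k_bits

    def _positions(self, item: str):
        # Derive n hash positions by salting one base hash with an index.
        for i in range(self.n):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.k

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> maybe present.
        return all(self.bits[pos] for pos in self._positions(item))
```

For the username problem, `add` would be called on every registration and `might_contain` on every availability check, with a definite "no" falling through to the database.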

This approach requires significantly less storage than the database index discussed earlier. For example, to store 1 million items with 5 hash functions and a false-positive probability of 0.01, approximately 1 MB of storage is needed. A footprint this small also makes it practical to cache the entire Bloom filter in memory.

There are also extensions of the Bloom filter, such as the *Counting Bloom Filter*. The explanation above covered only set and check operations. If the use case also requires a delete operation, a Counting Bloom Filter can be used: instead of single bits, each position holds an integer counter. The set and delete operations increment or decrement the counters by 1, and the check operation returns true only if all the counters at the hashed positions are greater than 0.
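The counting variant differs from the plain filter only in what each slot stores; a sketch under the same illustrative hashing scheme:

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: each slot holds a counter instead of a
    bit, so deletes are supported by decrementing rather than
    clearing a bit that other items may share."""

    def __init__(self, k_slots: int = 1024, n_hashes: int = 5):
        self.k, self.n = k_slots, n_hashes
        self.counts = [0] * k_slots

    def _positions(self, item: str):
        for i in range(self.n):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.k

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item: str) -> None:
        # Caller must only remove items that were previously added,
        # otherwise counters can go negative and corrupt the filter.
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))
```

The cost of delete support is a larger footprint, since every slot is now an integer rather than a single bit.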

One of the questions that arises now is why the filter cannot determine accurately whether a candidate is present in the set. This is because of a well-known issue with hashing: *collisions*. There can be cases where two candidates c1 and c2 get the same bits set to 1, and in such scenarios, if c1 was added to the set and c2 is tested for presence, the Bloom filter will return a false positive. This is what gives the Bloom filter its false-positive rate.

To reduce the probability of such false positives, the Bloom filter size should be chosen carefully based on the expected number of entries and the acceptable false-positive probability. This is a trade-off to consider while designing the system. There are online calculators that compute the bitmap size from the expected number of items and the desired false-positive probability.
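These calculators apply the standard sizing formulas, which are easy to compute directly: m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, for n expected items and false-positive probability p.

```python
import math

def bloom_parameters(n_items: int, fp_rate: float):
    """Optimal Bloom filter sizing: bitmap size m in bits and
    number of hash functions k, for n_items and target fp_rate."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = round(m / n_items * math.log(2))
    return m, k

bits, hashes = bloom_parameters(1_000_000, 0.01)
# roughly 9.6 million bits (~1.2 MB) and 7 hash functions
```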

*NOTE*: Rebalancing is not possible in Bloom filters because the history is not stored. The expected number of future entries and the acceptable false-positive probability need to be estimated properly up front.

As discussed above, the storage requirements of a Bloom filter are very low, and the overall logic is easy to implement. However, there are battle-tested implementations available, such as the Bloom filter in Google Guava, which can be used for a host-level Bloom filter. Alternatively, if the use case requires a distributed instance, Redis supports one via the RedisBloom module.

Let's first discuss some of the use cases where such a component is needed.

- Rate Limiting: Each client is assigned a quota per time unit. For effective rate limiting, each host must communicate the usage metrics of each client to all other hosts. Hosts use this information to update their local quota for the client.
- Membership Management: Each host might need to know about all the other hosts that are part of the service. Membership can be validated via heartbeats. We need efficient ways to ensure all hosts have up-to-date membership information.
- Consensus: Multiple hosts might need to make a decision collectively. To achieve this, they need to exchange messages with each other.

Now let's discuss some of the architectures which can be used to solve this problem.

In the Full Mesh pattern, each host in the fleet communicates/broadcasts information to every other host in the fleet.

To achieve this, we would need a service registry where all hosts are registered when they join the fleet. The service registry can be implemented with existing technologies like ZooKeeper and can be used by the hosts to identify the rest of the hosts in the fleet. Alternatively, if supported, we could query the load balancer to get the list of hosts registered with it.

This approach is straightforward to implement and a viable option to start with. But as the fleet grows, it becomes a scalability bottleneck: the number of connections grows quadratically with the number of hosts. Something similar happened during the Kinesis outage of November 2020. An alternative that avoids such failures is cell-based architecture, but that's a discussion for another day.

The Gossip protocol spreads messages to all hosts in a way similar to how an epidemic spreads. Each host randomly selects a few of its peers and shares updates with them at a configured frequency. This way, every peer eventually receives the information.
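A toy simulation shows how quickly this converges; the fanout of 3 and host counts are illustrative, and real implementations add failure detection and anti-entropy on top:

```python
import random

def gossip_rounds(n_hosts: int, fanout: int = 3, seed: int = 42) -> int:
    """Simulate push-style gossip: each round, every informed host
    forwards the update to `fanout` random peers. Returns the number
    of rounds until every host has the update."""
    rng = random.Random(seed)
    informed = {0}                      # host 0 originates the update
    rounds = 0
    while len(informed) < n_hosts:
        for host in list(informed):
            for peer in rng.sample(range(n_hosts), fanout):
                informed.add(peer)
        rounds += 1
    return rounds
```

The number of informed hosts roughly multiplies each round, so convergence takes on the order of log(n) rounds rather than requiring n-to-n connections.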

Cassandra uses this protocol to discover the location and state of other nodes in the cluster. WeaveNet is a framework that internally uses a gossip protocol to transmit membership information to all hosts; it can be integrated with ECS and used to broadcast information across the fleet.

Caches are a commonly used component of distributed systems. For this use case too, a distributed cache like Redis can be used to store the data that needs to be shared among the hosts. All hosts can update and read the information from this cache cluster. Since most distributed systems already have a cache cluster, it can easily be extended to this use case. AWS ElastiCache for Redis can be used to implement such a pattern.

Another alternative is to assign a leader to the cluster. The leader is responsible for fetching data from all hosts and then relaying the updates back to them. This raises another question: how do we select a leader? Leader election is a well-known problem and can be solved with algorithms like Paxos and Raft, but implementing such algorithms with proper failure handling is very complicated. Alternatively, Apache ZooKeeper can be used to elect the leader.

Leader election algorithms ensure that exactly one leader is selected in the cluster at a time. But some use cases, like the rate-limiting problem, can still work when more than one leader is selected. Since the coordination-service solution above sacrifices availability for consistency, an alternative where a random node is selected as the leader can be used when availability matters more. In such cases, there may be more than one leader in the cluster at a time.
