Learned bloom filters

30 minute read

Learned data structures lie at the intersection of two domains: machine learning and data structures. They offer improved performance over their classical counterparts by exploiting trends in the input data. We will explore Learned Bloom Filters, which build on classical Bloom Filters and offer trade-offs that can be tuned to achieve optimum performance in some applications.

Bloom Filters

Motivation

A membership query returns a yes/no answer to a question of the form “Is this element \(x\) present in a given set \(S\)?”

Searching algorithms like linear search and binary search can answer this question by traversing the set \(S\) and looking for the element \(x\) in \(O(|S|)\) or \(O(\log|S|)\) time respectively. However, this does not scale well as the size of the set, \(|S|\), increases: linear search takes too much time, while binary search requires the set to be sorted, which generally costs at least \(O(|S|\log|S|)\).

An alternative is to use more space and less time with the help of hashing and structures like HashSets and HashMaps. Hashing maps elements/keys to table positions, giving expected constant-time access to stored key-value pairs as in a HashMap. But for a large set \(S\), this space-time trade-off may not always be desirable. As a result, in cases where some error in answering a membership query is acceptable, an approximate membership query is used.

The approximate membership query answers the same question "Is this element \(x\) present in a given set \(S\)?" but the yes/no answer may not always be correct. For all elements \(x\) which are present in set \(S\), the answer will always be yes and thus correct. However, for some elements \(x\) which are not present in set \(S\), the answer may be yes and thus incorrect, i.e. \(x\) may be a false positive. The answer to an approximate membership query can thus be "definitely not present" or "maybe present". Bloom filters answer such approximate membership queries. They are commonly used for tasks like malicious URL detection, virus scanning, etc.

Design

A simple bloom filter is characterised by the following:

  • Size of the bloom filter (\(m\))
  • The number of hash functions (\(k\))
  • A false positive rate (\(fpr\))

A simple bloom filter consists of the following:

  • An m-bit Array: To store \(0/1\) bits at each of the m indices. The indices can be taken to be from \(1\) to \(m\) or \(0\) to \(m-1\). We will use \(0\) to \(m-1\).
  • k Hash Functions: To map an element to some value. The hash functions can either be forced to have codomain \([0, m-1]\), or \((v\%m)\) can be used as the final value for any integer value \(v\) generated by a hash function. These hash functions should be independent and uniformly distributed. There are different types of hash functions like MurmurHash, Jenkins hashes and xxHash, and the choice depends on the application/dataset/use case. In general, a good hash function for this setting has many collisions among keys and many collisions among non-keys, but few collisions between keys and non-keys [1].

Given a set \(S\) with elements from some universe \(U\), a simple bloom filter of size \(m\) having $k$ hash functions works in the following way:

set all m bits in m-bit array arr to 0
for each element x ∈ S
    for each hash function hi, i = 1, 2, …, k
        find hi(x)
        set arr[hi(x)] to 1, if not already 1

For all elements in set \(S\), a bloom filter sets the bit at the value given by each of the $k$ hash functions to 1.

For answering approximate membership queries, say for given set $Q$, the above bloom filter does the following:

for each query q ∈ Q
    for each hash function hi, i = 1, 2, …, k
        find hi(q)
        check value at arr[hi(q)]
        if arr[hi(q)] = 0
            return NOT PRESENT
        if arr[hi(q)] = 1
            continue
    return PRESENT

To check whether an element is present in set \(S\), the bloom filter checks all the bits at values given by each of the $k$ hash functions. If all the bits are 1, the bloom filter returns that the element is present in \(S\), otherwise, the element is said to not be present in set \(S\).

Both the insertion and querying take \(O(k)\) time where $k$ is a pre-defined constant defining the number of hash functions in the bloom filter, thus, leading to constant insertion and querying time.
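To make the two routines above concrete, here is a minimal Python sketch (not from the original post); the \(k\) hash functions are simulated by salting SHA-256, whereas a real deployment would pick independent hashes such as MurmurHash or xxHash:

```python
import hashlib

class BloomFilter:
    """Simple bloom filter: an m-bit array and k (simulated) hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m           # the m-bit array, indices 0 .. m-1

    def _hashes(self, x):
        # Simulate k independent hash functions by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, x):
        # Set the bit at each of the k hash values to 1.
        for idx in self._hashes(x):
            self.bits[idx] = 1

    def query(self, q):
        # "Maybe present" only if all k bits are set; otherwise definitely absent.
        return all(self.bits[idx] for idx in self._hashes(q))

bf = BloomFilter(m=1000, k=3)
for word in ["cat", "bat", "rat"]:
    bf.add(word)
print(bf.query("cat"))   # True
print(bf.query("emu"))   # almost certainly False
```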

False Positives

Let us see how we get false positives in a bloom filter.

Consider a bloom filter of size $m = 5$ which stores English language words with $k = 3$ hash functions, each having codomain as $[0, 4]$. Let the independent and uniformly distributed hash functions be chosen such that they produce the following values for these inputs:

| Input | h1 | h2 | h3 |
|-------|----|----|----|
| cat   | 0  | 1  | 2  |
| bat   | 1  | 3  | 1  |
| rat   | 2  | 3  | 0  |
| dog   | 3  | 1  | 2  |
| hat   | 0  | 1  | 4  |

Let \(S\) = {cat, bat, rat} and $Q$ = {cat, dog, hat}.

The bloom filter insertion is described below:

Initial state, all bits are set to 0

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |

On adding cat, bits 0, 1 and 2 are set to 1

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 |

On adding bat, bit 1 is already set to 1 and bit 3 is set to 1

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |

On adding rat, bits 2, 3 and 0 are already set to 1

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |

The bloom filter querying is described below:

Initial state of the bloom filter after being built using \(S\) = {cat, bat, rat}

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |

We look at the hash function values given in the table above corresponding to each query.

Querying for cat, bits 0, 1, 2 are set, thus returns “Cat is PRESENT”

Querying for dog, bits 1, 2, 3 are set, thus returns “Dog is PRESENT”

On checking whether dog is in set \(S\), the bloom filter will return yes even though it is not present in set \(S\). This is because all bits corresponding to dog are set to 1.

Querying for hat, bits 0 and 1 are set but bit 4 is not set, thus returns “Hat is NOT PRESENT”

Thus dog was a false positive i.e. an element not in \(S\) yet said to be in \(S\) by the bloom filter.
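The worked example can be replayed with a few lines of Python; the hash values from the table above are hard-coded as a dictionary, standing in for real hash functions:

```python
# Hash values from the table above (purely illustrative; not real hash functions).
H = {
    "cat": (0, 1, 2),
    "bat": (1, 3, 1),
    "rat": (2, 3, 0),
    "dog": (3, 1, 2),
    "hat": (0, 1, 4),
}

arr = [0] * 5                      # the 5-bit array
for key in ["cat", "bat", "rat"]:  # build the filter from S
    for idx in H[key]:
        arr[idx] = 1

for q in ["cat", "dog", "hat"]:    # answer the queries in Q
    present = all(arr[idx] for idx in H[q])
    print(q, "PRESENT" if present else "NOT PRESENT")
# cat PRESENT, dog PRESENT (the false positive), hat NOT PRESENT
```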

The False Positive Rate (FPR)

For a bloom filter with \(m\) bits and $k$ hash functions, the false positive rate is given by:

\[FPR = \left(1 - \left(1- \frac{1}{m}\right)^{nk}\right)^k\]

Let us try to understand this, where \(n = |S|\) denotes the number of inserted elements. While inserting an element, the probability that a hash function maps to a particular bit is $\frac{1}{m}$. After one insertion, the probability that a given bit is still not set is $(1- \frac{1}{m})^{k}$ since we use $k$ independent hash functions per element. After $n$ insertions, the probability that a bit is still not set is $(1- \frac{1}{m})^{nk}$. For a false positive, all $k$ bits to which the queried element maps must be set. Thus, the false positive rate becomes $(1 - (1- \frac{1}{m})^{nk})^k$. The FPR can be approximated as:

\[FPR \approx \left(1 - e^{-\frac{kn}{m}}\right)^k\]

The above approximation follows from \(\left(1- \frac{1}{m}\right)^{m} \approx \frac{1}{e}\) for very large \(m\), which can be derived as follows. Let \(p = \left(1- \frac{1}{m}\right)^{m}\), so that \(\ln p = m\ln\left(1- \frac{1}{m}\right) = \frac{\ln\left(1- \frac{1}{m}\right)}{\frac{1}{m}}\). By L'Hôpital's rule, \(\lim_{m \to \infty} \frac{\ln (1- \frac{1}{m})}{\frac{1}{m}} = \lim_{m \to \infty} \frac{-1}{1- \frac{1}{m}} = -1\), and therefore \(\lim_{m \to \infty} \left(1- \frac{1}{m}\right)^{m} = e^{-1} = \frac{1}{e}\).

We note that the expression above is not the exact FPR, which is strictly greater. The difference stems from the incorrect assumption that the events of bits being set to 1 in the \(m\)-bit array are independent. However, for large \(m\) and small $k$, the difference between this expression and the exact FPR is negligible [4].

Optimal \(m\) and $k$

The false positive above was caused by hash collisions. We can reduce hash collisions by simply increasing \(m\), the size of the bit array, but this means increased space usage. In other words, bloom filters offer a size-accuracy trade-off. Choosing \(m\) and $k$ is thus an important decision for minimising space usage while keeping hash collisions, and hence the FPR, at a desirable level. Let us see what the optimal \(m\) and $k$ look like.

Keeping other things constant, minimising the FPR with respect to \(k\) is equivalent to minimising its logarithm: \[\underset{k}{\operatorname{min}} \ \ln FPR = \underset{k}{\operatorname{min}} \ k \ln \left(1 - e^{-\frac{kn}{m}}\right)\] Setting the derivative with respect to \(k\) to zero gives the optimal number of hash functions for a simple bloom filter: \[k = \frac{m}{n} \ln 2\]

The FPR at this optimal value of $k$ is given by: \[FPR = \left(1 - e^{-\frac{kn}{m}}\right)^{k} = \left(1 - e^{-\ln 2}\right)^{\frac{m}{n}\ln 2} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2}\] \[\ln FPR = -\frac{m}{n} (\ln 2)^2\]

The optimal value of m obtained at this optimal value of $k$ is given by rearranging the above equation for \(m\):

\[m = -\frac{n \ln \text{FPR}}{(\ln 2)^2}\]
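As a quick sanity check of these formulas, the following sketch (using the approximate FPR expression) computes the optimal \(m\) and \(k\) for a given \(n\) and target FPR; the numbers are illustrative:

```python
import math

def optimal_params(n, target_fpr):
    """Optimal bit-array size m and hash count k for n keys and a target FPR."""
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

def approx_fpr(n, m, k):
    """Approximate FPR = (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k

n = 1_000_000
m, k = optimal_params(n, target_fpr=0.01)
print(m, k, approx_fpr(n, m, k))   # roughly 9.6 bits per key, k = 7, FPR close to 1%
```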

A bloom filter does not store the elements of the set but only information about their existence in the form of the m-bit array and k hash functions.

Learned Bloom Filters

Motivation

To use trends in input data to train a machine learning model for membership queries on a given set \(S\).

This is modelled as a classification task. Based on a score provided by the learned model \( f \), an element \( x \) is considered to be in set \( S \) if and only if its score \( f(x) \) surpasses the threshold \( \tau \). However, as is typical with machine learning algorithms, there may be non-zero error in this classification. This error can include not only false positives but also false negatives. To address the false negatives, a classical backup bloom filter is employed. This filter checks the approximate membership of the elements that the learned model deemed negatives.

Design

Learned Bloom Filter

For any queried element \( q \), it is termed a key if it resides within the set \( S \) and a non-key if it is absent from \( S \).

Upon receiving a query \( q \), the learned model \( f \) performs a binary classification task. The classification of \( q \) as a key or a non-key depends on its score \( f(q) \): specifically, \( q \) is identified as a key if \( f(q) \) exceeds the threshold \( \tau \). All queries \( q \) for which the score \( f(q) \) is less than \( \tau \) (classified as non-keys) are forwarded to the classical bloom filter, which then uses its \( m \)-bit array to decide the status of \( q \) as either a key or a non-key.

The model \( f \) is trained on a dataset drawn from the same distribution as the query set. The choice of the threshold \( \tau \) is guided by heuristics, with the aim of attaining the targeted accuracy. The classical backup bloom filter is populated with all the keys (elements of \( S \)) which the model \( f \) labels as non-keys.

For answering approximate membership queries, say for a given set \( Q \), the learned bloom filter does the following:

for each query q ∈ Q
    compute f(q) using model f
    if f(q) >= τ
        return PRESENT
    else
        for each hash function hi in backup filter B, i = 1, 2, …, k
            if arr[hi(q)] = 0
                return NOT PRESENT
        return PRESENT
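A minimal Python sketch of this construction and query path, assuming `model` exposes a hypothetical `predict(x)` method returning a score in \([0, 1]\) and `backup_filter` is an initially empty classical bloom filter such as the `BloomFilter` class sketched earlier:

```python
class LearnedBloomFilter:
    def __init__(self, model, tau, keys, backup_filter):
        self.model = model           # learned scorer f: predict(x) -> score in [0, 1]
        self.tau = tau               # classification threshold
        self.backup = backup_filter  # an initially empty classical bloom filter
        # Populate the backup filter with the keys the model misclassifies as non-keys.
        for x in keys:
            if self.model.predict(x) < self.tau:
                self.backup.add(x)

    def query(self, q):
        if self.model.predict(q) >= self.tau:
            return True                  # model says key (may be a false positive)
        return self.backup.query(q)      # otherwise defer to the backup filter

# Toy usage with a stand-in "model" that scores by string length (illustrative only).
class LengthModel:
    def predict(self, x):
        return min(len(x) / 10, 1.0)

keys = ["gryffindor", "slytherin", "owl"]
lbf = LearnedBloomFilter(LengthModel(), tau=0.8, keys=keys,
                         backup_filter=BloomFilter(64, 3))
print(lbf.query("gryffindor"), lbf.query("owl"), lbf.query("cat"))
```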

The False Positive Rate

For a learned bloom filter with the following considerations:

  • a set of keys \( K \), where \( |K| = m \) (note that in this section \( m \) denotes the number of keys, not the size of a bit array)
  • false positive probability \( F_p \) for non-keys
  • false negative probability \( F_n \) for keys in \( K \) leading to \( F_nm \) false negatives in \( K \)
  • size of the backup bloom filter \( B = bm \) bits
  • size of the learned function \( |f| = \zeta \) bits
  • number of bits per key stored in the backup bloom filter \( B \): \( \frac{bm}{F_n m} = \frac{b}{F_n} \)
  • assuming the FPR of \( B \) decreases as \( \alpha^j \) where \( j \) is the number of bits per stored key

The false positive rate (FPR) of a learned bloom filter combines the contributions of the learned model \( f \) and the backup bloom filter \( B \), where \( FPR_f = F_p \) as defined above. The backup bloom filter holds \( \frac{b}{F_n} \) bits per stored key and thus has an FPR of \( \alpha^\frac{b}{F_n} \); a query can only become one of its false positives if it first reaches \( B \), which happens with probability \( 1-F_p \).

\[FPR_B = (1 - F_p) \alpha^\frac{b}{F_n}\]

Thus, for a learned bloom filter:

\[FPR = F_p + (1 - F_p) \alpha^\frac{b}{F_n} \]

Size of the Learned Bloom Filter

The learned bloom filter presents a tradeoff between size and accuracy.

Comparing the learned bloom filter (LBF) defined above with a classical counterpart using the same total space, i.e. a simple bloom filter (BF) of size \( bm + \zeta \) bits with \( (b + \frac{\zeta}{m}) \) bits per stored key, we have:

\[ FPR_{BF} = \alpha^{(b + \frac{\zeta}{m})} \] \[ FPR_{LBF} = F_p + (1 - F_p) \alpha^\frac{b}{F_n} \]

We can see an improvement in performance (i.e., a reduced FPR) for the same space in the LBF compared to a BF if:

\[ F_p + (1 - F_p) \alpha^\frac{b}{F_n} \le \alpha^{(b + \frac{\zeta}{m})} \] \[ \frac{\zeta}{m} \le \log_\alpha (F_p + (1 - F_p) \alpha^\frac{b}{F_n}) - b \]
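As a numerical illustration of this condition (not from the referenced papers), the sketch below compares the two FPRs for the same total space, assuming a space-optimal classical filter so that \(\alpha = 0.5^{\ln 2} \approx 0.6185\); all parameter values are made up:

```python
import math

ALPHA = 0.5 ** math.log(2)   # FPR per bit-per-key for a space-optimal classical filter

def bf_fpr(bits_per_key):
    return ALPHA ** bits_per_key

def lbf_fpr(F_p, F_n, b):
    # FPR = F_p + (1 - F_p) * alpha^(b / F_n)
    return F_p + (1 - F_p) * ALPHA ** (b / F_n)

# Illustrative numbers: m keys, a model of zeta bits, b bits/key left for the backup filter.
m, zeta, b = 1_000_000, 200_000, 8
F_p, F_n = 0.01, 0.5
print(bf_fpr(b + zeta / m))   # classical BF using the same total space
print(lbf_fpr(F_p, F_n, b))   # learned bloom filter
```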

Practically, one can test different \( F_p \) and \( F_n \) values for different thresholds and distinct learned functions of varying sizes. Using these equations, one can determine whether a learned bloom filter provides better performance than a simple classical bloom filter. Several variants of the learned bloom filter exist and are discussed below.

Sandwiched Learned Bloom Filter

Motivation

To reduce the workload and hence size of the backup filter.

The initial filter in a sandwiched bloom filter rejects a large fraction of non-keys and passes only its positives to the learned model. Each non-key rejected by this initial filter is one less non-key passed on to the learned model, so most potential false positives have already been removed by the time queries reach the backup filter, which can therefore be small. Here we exploit the fact that a bloom filter's negative answers are always correct.

Design

Sandwiched Learned Bloom Filter

A sandwiched learned bloom filter has an initial bloom filter which rejects negatives; everything it reports as positive is passed to the learned model. The learned model then declares its positives directly and passes its negatives to the backup filter. This backup filter finally rejects negatives and declares positives, as in a learned bloom filter.

The following algorithm is followed by a sandwiched bloom filter for a query set \( Q \):

for each query q ∈ Q
    check if q is in S using initial filter
    if yes
        find score f(q) obtained using learned model f
        if f(q) >= τ
            return PRESENT        ## q may be a false positive
        else
            check if q is in S using backup filter
            if yes
                return PRESENT
            else
                return NOT PRESENT
    else
        return NOT PRESENT

In a sandwiched learned bloom filter, the learned model is sandwiched between two classical bloom filters.

The False Positive Rate

Continuing with the notations used above for a learned bloom filter, in a sandwiched learned bloom filter (SLBF), we essentially divide the \( bm \) bits used for the backup bloom filter into \( b_1m \) bits for the initial bloom filter and \( b_2m \) bits for the backup filter. As a result, the FPR for a SLBF is obtained from the initial bloom filter, learned model, and backup filter. Consider the part of the SLBF consisting of the learned model and backup filter only. This is a learned bloom filter with the backup filter having \( b_2 \) bits per stored key. The FPR for this part of the SLBF is thus \( F_p + (1 − F_p) \alpha ^\frac {b_2}{F_n} \). The initial bloom filter has an FPR of \( \alpha^{b_1} \) and it is with this probability that elements reach the learned model \( f \) and then the backup filter. Thus, for a sandwiched bloom filter,

\[ FPR = \alpha^{b_1}(F_p + (1 − F_p) \alpha ^\frac {b_2}{F_n}) \]

It is interesting to note that, keeping other things constant and minimizing the FPR with respect to \( b_1 \) (with the total budget \( b = b_1 + b_2 \) fixed), the optimal number of bits per key for the backup filter, \( b_2 \), is given by:

\[ b_2 = F_n \log_\alpha \frac{F_p}{(1-F_p)(\frac{1}{F_n} - 1)} \]

This is a fixed constant independent of the total number of bits \( b \). Hence, once the backup filter reaches an appropriate size, we can safely increase the size of the initial bloom filter using the rest of the space. A sandwiched learned bloom filter is more robust when compared to a learned bloom filter due to the presence of the initial filter.
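Continuing the illustrative numbers from before, the optimal \(b_2\) and the resulting sandwiched FPR can be evaluated directly from these expressions (again assuming \(\alpha = 0.5^{\ln 2}\); the \(F_p\), \(F_n\) and \(b\) values are made up):

```python
import math

ALPHA = 0.5 ** math.log(2)   # as before, assuming space-optimal classical filters

def optimal_b2(F_p, F_n):
    # b2 = F_n * log_alpha( F_p / ((1 - F_p) * (1/F_n - 1)) )
    return F_n * math.log(F_p / ((1 - F_p) * (1 / F_n - 1)), ALPHA)

def slbf_fpr(F_p, F_n, b1, b2):
    # FPR = alpha^b1 * (F_p + (1 - F_p) * alpha^(b2 / F_n))
    return ALPHA ** b1 * (F_p + (1 - F_p) * ALPHA ** (b2 / F_n))

F_p, F_n, b = 0.01, 0.5, 8
b2 = min(optimal_b2(F_p, F_n), b)   # backup filter never gets more than the total budget
b1 = b - b2                         # remaining bits go to the initial filter
print(b2, slbf_fpr(F_p, F_n, b1, b2))
```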

From another perspective

In a sandwiched learned bloom filter, we are essentially dividing some \( m' = bm \) bits between the initial filter and the backup filter as compared to allotting all the \( m' \) bits to the backup filter as in the case of a learned bloom filter.

Suppose we assign \( b_1m \) bits to the initial filter and \( b_2m \) bits to the backup filter such that \( b_1 + b_2 = b \). The false positive rate is then given as:

\[ FPR = \alpha^{b_1}(F_p + (1-F_p)\alpha^{b_2/F_n}) \]

We see that bits assigned to the initial filter reduce the false positives arising from the learned function (the \( F_p \) term) as well as those arising after it (the \( (1-F_p)\alpha^{b_2/F_n} \) term), while bits assigned to the backup filter only reduce the latter. Thus, once the backup filter has its optimal share, additional bits are better spent on the initial filter, giving a lower FPR.

Adaptive Learned Bloom Filters

Motivation

To divide elements into buckets based on scores assigned by the learned model and to thus use optimized false positive rates for different buckets.

Let the learned function in a simple learned bloom filter assign a score \(f(x)\) to an element \(x\) and the condition for classifying \(x\) as a key be \(f(x) \ge \tau\). We note that keys will generally have higher scores than non-keys. The score \(f(x)\) can thus be used to partition all elements \(x\) into different regions, each with a different number of keys and non-keys. For regions with a larger proportion of keys we can allow higher false positive rates, and for regions with more non-keys we want lower false positive rates. This can be achieved by using a different number of hash functions for each region.

Design

Adaptive Learned Bloom Filter

An adaptive learned bloom filter (Ada-BF) is characterised by the following:

  • A learned model \(f\)
  • \(g\) number of regions
  • A simple bloom filter of size \(m\)
  • \(k\) independent hash functions
  • Thresholds \(\tau_{j}\ \forall j = 0, 1, 2, .. , g\)

An adaptive learned bloom filter divides elements into \(g\) regions based on the score \(f(x)\) assigned to each element by the learned model \(f\). The \(j^{th}\) region has elements \(x\) with scores \(f(x) \in [\tau_{j-1}, \tau_{j}]\) and uses only \(k_j\) of the \(k\) available hash functions to test their membership in the bloom filter. This ensures that each region can be tuned to a different FPR based on its key/non-key distribution.
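A minimal sketch of this design, assuming a scorer with a hypothetical `predict(x)` method, thresholds \(\tau_0 < \tau_1 < \dots < \tau_g\), and per-region hash counts \(k_j\); the shared bit array and salted hashes mirror the simple filter sketched earlier:

```python
import bisect
import hashlib

class AdaptiveLearnedBF:
    def __init__(self, model, taus, ks, m):
        self.model = model    # learned scorer f (assumed interface: predict(x) -> score)
        self.taus = taus      # thresholds tau_0 < tau_1 < ... < tau_g (length g + 1)
        self.ks = ks          # ks[j] = number of hash functions used by region j + 1
        self.m = m
        self.bits = [0] * m   # single bit array shared by all regions

    def _region(self, x):
        # Index of the region whose score interval contains f(x), clipped to [0, g - 1].
        j = bisect.bisect_right(self.taus, self.model.predict(x)) - 1
        return min(max(j, 0), len(self.ks) - 1)

    def _hashes(self, x, k):
        # First k of the salted hash functions, so regions share a common prefix.
        for i in range(k):
            yield int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % self.m

    def add(self, x):
        for idx in self._hashes(x, self.ks[self._region(x)]):
            self.bits[idx] = 1

    def query(self, q):
        k = self.ks[self._region(q)]
        # A region with k = 0 hash functions always answers PRESENT.
        return all(self.bits[idx] for idx in self._hashes(q, k))
```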

The False Positive Rate

Let us say there are \(n_j\) keys in region \(j\). Empirically, \(n_j = \sum_{i = 1}^{n} I(\tau_{j-1} \le f(x_i) < \tau_j)\) for \(x_i \in S\), where \(I\) is the indicator function. Lower values of \(k_j\) are used for regions containing a larger number of keys (and hence fewer non-keys), i.e. larger \(n_j\).

The expected FPR of a region \(i\) is given by:

\[ E(FPR_i) = \left(1 - \left(1-\frac{1}{m}\right)^{\sum_{j = 1}^{g}n_jk_j}\right)^{k_i} = \alpha^{k_i} \]

The actual FPR is, with high probability, very close to its easily calculable expectation, and the two are thus not strictly differentiated in this discussion [3].

The expected FPR of the Ada-BF is given by:

\[ E(FPR) = \sum_{i = 1}^{g} p_i E(FPR_i) = \sum_{i = 1}^{g} p_i\alpha^{k_i} \]

Empirically, \(p_i = \frac{\text{number of non-keys in region } i}{\text{total number of non-keys in the training set}}\).

For simplifying tuning the many parameters in Ada-BF, we can fix the following:

  • \(\tau_0 = 0\)
  • \(\tau_g = 1\)
  • the ratio of non-keys in regions as \(\frac{p_j}{p_{j+1}} = c\ \forall j = 1, 2, ..., g-1\)
  • the difference between the number of hash functions used in adjacent regions as \(k_j - k_{j+1} = 1 \ \forall j = 1, 2, ..., g-1\)

leaving only 3 parameters: \(c\), \(K_g = K_{min}\) and \(K_1 = K_{max}\). For a large enough \(g\), if the number of keys per region increases with \(j\) while \(p_j\) decreases exponentially with \(j\), which is satisfied for most real-world datasets, the Ada-BF has a smaller FPR than the LBF for the same space used.

An LBF is an Ada-BF with \(g = 2\). The first region has elements with scores \(f(x) \in [\tau_0 = 0, \tau_1 = \tau]\) and uses \(k_1 = k\) hash functions, while the second region has \(f(x) \in [\tau_1 = \tau, \tau_2 = 1]\) and \(k_2 = 0\) hash functions.

Disjoint Adaptive Learned Bloom Filters

Disjoint Adaptive Learned Bloom Filter

The motivation for the disjoint adaptive learned bloom filter is the same as that for the adaptive learned bloom filter, i.e. to lower the FPR of the filter by lowering the FPRs of regions with more non-keys. The difference is in the design: instead of using a different number of hash functions to tune the FPR of a region, we use an independent classical bloom filter for each region. The total \(m\) bits are divided into \(m_i\) bits which form the arrays of the bloom filters for the regions \(i = 1, 2, ..., g\). For a large enough \(g\), if the number of keys per region increases with \(j\) while \(p_j\) decreases exponentially with \(j\), a Disjoint Ada-BF offers a lower FPR than an LBF while using the same space. Parameter tuning uses methods similar to those of the Ada-BF.

Partitioned Learned Bloom Filters

Motivation

The Partitioned Learned Bloom Filter (PLBF) also builds around the same motivation as the Ada-BF and Disjoint Ada-BF. We use information conveyed by the score \(f(x)\) obtained from the learned model \(f\) for an element \(x\) to divide elements into regions.

Design

Partitioned Learned Bloom Filter

Its design is similar to that of Disjoint Ada-BF as a PLBF also has specialised bloom filters for different score based regions of elements. Region boundaries are given by \(\tau_i \ \forall \ i = 1,2, ..., g\).

Given a set \(S\) with elements from some universe \(U\), insertion in a partitioned learned bloom filter works in the following way:

    for each x ∈ S
        find score f(x) obtained using learned model f
        find interval j s.t. τ_j−1 <= f(x) <= τ_j
        insert x in BF_j as follows
            for each hash function hji, i = 1, 2, …, kj
                find hji(x)
                set arr_j[hji(x)] to 1, if not already 1

The score from the learned model \(f(x)\) is used to insert element \(x\) to \(BF_j\) if \(f(x) \in [\tau_{j−1}, \tau_j ]\).

For answering approximate membership queries, say for given set Q, the PLBF does the following:

    for each q ∈ Q
        find score f(q) obtained using learned model f
        find interval j s.t. τ_j−1 <= f(q) <= τ_j
        query BF_j for q as follows
            for each hash function hji, i = 1, 2, …, kj
                find hji(q)
                check value at arr_j[hji(q)]
                if arr_j[hji(q)] = 0
                    return NOT PRESENT
                if arr_j[hji(q)] = 1
                    continue
            return PRESENT

To test the membership of an element, the BF in the region to which it is mapped is queried.
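A minimal sketch of this partitioned design, reusing the `BloomFilter` class from the earlier sketch as the per-region filter and assuming the model exposes a hypothetical `predict(x)` returning scores in \([0, 1]\):

```python
import bisect

class PartitionedLBF:
    def __init__(self, model, taus, region_filters):
        self.model = model             # learned scorer f: predict(x) -> score in [0, 1]
        self.taus = taus               # thresholds tau_0 < tau_1 < ... < tau_g
        self.filters = region_filters  # one independently tuned BloomFilter per region

    def _region(self, x):
        # Index of the region whose interval contains f(x), clipped to [0, g - 1].
        j = bisect.bisect_right(self.taus, self.model.predict(x)) - 1
        return min(max(j, 0), len(self.filters) - 1)

    def add(self, x):
        # Insert x only into the filter of the region its score falls in.
        self.filters[self._region(x)].add(x)

    def query(self, q):
        # Query only the filter of the region q's score falls in.
        return self.filters[self._region(q)].query(q)
```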

The False Positive Rate

Consider the following for a PLBF with false positive rate FPR:

  • Fraction of keys with scores \(f(x) < \tau\) as \(G(\tau)\) having density function \(g(\tau)\)
  • Probability that non-key has score \(f(x) \le \tau\) as \(H(\tau)\)

If non-key queries are chosen uniformly at random, the non-key query distribution is the same as the non-key distribution. Thus, \(H(\tau)\) is also the fraction of non-keys with scores \(f(x) \le \tau\), with \(h(\tau)\) the corresponding density. [6], which describes the PLBF, uses the following optimization problem for parameter tuning:

\[ \underset{\tau_i, \ FPR_i}{\min} \left(\sum_{i = 1}^{g} |S| (G(\tau_i) - G(\tau_{i-1})) \, c \log_2\frac{1}{FPR_i}\right) + \text{Size of Learned Model} \] \[ \text{subject to} \] \[ \sum_{i = 1}^{g} (H(\tau_i) - H(\tau_{i-1})) FPR_i \le FPR \] \[ FPR_i \le FPR \ \ \ \forall \ i = 1, 2, ..., g \] \[ (\tau_i - \tau_{i-1}) \ge 0 \ \ \ \forall \ i = 1, 2, ..., g; \ \tau_0 = 0, \tau_g = 1 \]

The objective is to minimise the space used by the PLBF such that the overall false positive rate is at most the desired FPR, the FPR of each region is at most this desired FPR, and the region boundaries \(\tau_i\), \(i = 1, 2, ..., g\), are non-decreasing between \(0\) and \(1\).

The reduction in space used by a PLBF compared to a BF with the same FPR is proportional to the divergence between the key and non-key score distributions of the regions, \(g(\hat{t})\) and \(h(\hat{t})\) respectively, and is given by:

\[ c \cdot |S| \cdot D_{KL}\left(g(\hat{t}), h(\hat{t})\right) - \text{Size of Learned Model} \]

Thus, the PLBF may offer improved performance when compared to a simple bloom filter.
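For discretised regions, this expected saving can be estimated from the per-region key and non-key fractions; the sketch below does so for made-up values, with \(c\) standing for the bits-per-key constant from the optimisation above:

```python
import math

def space_saving_bits(key_fracs, nonkey_fracs, c, num_keys, model_size_bits):
    # KL divergence between the discretised key and non-key score distributions (base 2).
    d_kl = sum(g * math.log2(g / h) for g, h in zip(key_fracs, nonkey_fracs) if g > 0)
    return c * num_keys * d_kl - model_size_bits

# Illustrative values: keys concentrate in high-score regions, non-keys in low-score ones.
print(space_saving_bits(key_fracs=[0.05, 0.15, 0.80],
                        nonkey_fracs=[0.70, 0.20, 0.10],
                        c=1.44, num_keys=1_000_000, model_size_bits=200_000))
```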

Sandwiched Learned Bloom Filter and its equivalent PLBF

Sandwiched Learned Bloom Filter and its equivalent PLBF

Consider a sandwiched learned bloom filter consisting of an initial bloom filter with false positive rate \( fpr_0 \), a backup filter with \( fpr_1 \) and a learned model \( f \) which classifies based on threshold \( \tau \). An equivalent PLBF has \( g = 2 \) regions \( \tau_0 = 0, \tau_1 = \tau \) and \( \tau_2 = 1 \). The first region has a BF with \( FPR_1 = fpr_0 \times fpr_1 \) while the second region has a BF with \( FPR_2 = fpr_0 \).

Stable Bloom Filters

Motivation

The above discussed Bloom Filters were designed for a static set \(S\). In the case of an ever-growing set \(S\) as in the case of a data stream, the FPR of a bloom filter built using \(S\) would be \(FPR = \alpha^\frac{m}{|S|}\). As the size of set \(S\) increases, \(\lim_{|S| \to \infty} FPR = \lim_{|S| \to \infty} \alpha^\frac{m}{|S|} = 1\) i.e. the false positive rate for the bloom filter approaches 1. For a non-trivial false positive rate, we use Stable Bloom Filters which can handle any number of insertions in set \(S\) efficiently by introducing some false negatives.

Design

A stable bloom filter (SBF) consists of an array \(arr\) of \(m\) counters instead of an array of \(m\) bits as in a simple bloom filter. Each counter has \(d\) bits. As a result, each counter in a SBF can store values between \(0\) and \(Max = 2^{d}-1\). The other components and characteristics of the stable bloom filter are the same as that for a simple bloom filter.

Given a set \(S\) with elements from some universe \(U\), insertion in a stable bloom filter having \(m\) counters with \(d\) bits each and \(k\) hash functions and some value \(P \in [1, m]\) works in the following way:

  for each x ∈ S
      for p = 1, 2, 3, ..., P
          idx = random(1, m)
          if arr[idx] > 0
              arr[idx] = arr[idx] - 1
      for each hash function hi, i = 1, 2, …, k
          find hi(x)
          set all bits in arr[hi(x)] to 1, i.e. set arr[hi(x)] to Max, if not already Max

For each element in set \(S\), a stable bloom filter randomly selects \(P\) counters and decrements each non-zero one of them by 1. Then, similar to a simple bloom filter, the counter at the value given by each of the \(k\) hash functions is set to \(Max\), which is equivalent to setting each of the \(d\) bits of that counter to 1.

For answering approximate membership queries, say for given set \(Q\), the above bloom filter does the following:

 for each query q in Q
    for each hash function hi for i = 1, 2, …, k
        find hi(q)
        check value at arr[hi(q)]
        if arr[hi(q)] = 0
            return NOT PRESENT
        else
            continue
    return PRESENT

To check whether an element is present in set \(S\), the stable bloom filter checks all the counters at values given by each of the \(k\) hash functions. If all the counters are non-zero, the bloom filter returns that the element is in \(S\), otherwise, the element is said to not be present in set \(S\).
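A minimal Python sketch of the insertion and query paths above (salted SHA-256 again stands in for \(k\) independent hash functions; decrement indices are drawn with replacement for simplicity):

```python
import hashlib
import random

class StableBloomFilter:
    def __init__(self, m, k, d, P):
        self.m, self.k, self.P = m, k, P
        self.max_val = 2 ** d - 1      # Max: each counter holds d bits
        self.counters = [0] * m

    def _hashes(self, x):
        for i in range(self.k):
            yield int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % self.m

    def add(self, x):
        # Decrement P randomly chosen non-zero counters ...
        for _ in range(self.P):
            idx = random.randrange(self.m)
            if self.counters[idx] > 0:
                self.counters[idx] -= 1
        # ... then set the k counters for x to Max.
        for idx in self._hashes(x):
            self.counters[idx] = self.max_val

    def query(self, q):
        # "Maybe present" only if all k counters are still non-zero.
        return all(self.counters[idx] > 0 for idx in self._hashes(q))

sbf = StableBloomFilter(m=1000, k=3, d=2, P=10)
sbf.add("cat")
print(sbf.query("cat"))   # True (unless later insertions clear its counters)
```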

The False Positive Rate

For the \(m\)-counter array \(arr\) of a stable bloom filter, every insertion decrements \(P\) randomly chosen counters and sets the \(k\) counters selected by the hash functions to \(Max\). The probability \(p_0^{(n)}\) that a particular counter is \(0\) after \(n\) insertions satisfies:

\[ \lim_{n \to \infty} p_0^{(n)} = \left(\frac{1}{1 + \frac{1}{P(\frac{1}{k} - \frac{1}{m})}}\right) ^ {Max} \]

Thus the false positive rate for a stable learned bloom filter is given by:

\[ \lim_{n \to \infty} FPR_{SBF} = \lim_{n \to \infty} \left(1 - p_0^{(n)}\right)^k = \left(1 - \left(\frac{1}{1 + \frac{1}{P(\frac{1}{k} - \frac{1}{m})}}\right)^{Max}\right)^k \overset{(m \gg k)}{\approx} \left(1 - \left(\frac{1}{1 + \frac{k}{P}}\right)^{Max}\right)^k \]

We may note the following here:

  • As \(P\) increases, i.e. as we clear more and more counters in an iteration, more counters become \(0\). Since more counters are now zero, the chances of finding all the \(k\) counters to be non-zero for a query becomes less. As a result, we have a smaller FPR.
  • In contrast, as \(Max\) increases i.e. the number of bits, \(d\), allotted to each counter increases, the FPR increases.
  • As the number of hash functions \(k\) increases, the ability to distinguish queries from one another increases leading to a decrease in FPR but at the same time, we also fill the counter array \(arr\) faster leading to fewer zero counters and thus an increase in the FPR.
  • For \(m \gg k\), the upper bound on the FPR does not depend on \(m\), i.e. the number of counters in the SBF; the space used by the SBF therefore does not greatly impact its FPR.

The above holds once the SBF has become stable. An SBF is stable when the expected fraction of \(0\)s reaches a constant as the number of inserted elements approaches infinity [8].
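The limiting FPR above can be evaluated numerically; the sketch below does so for illustrative parameter values, with and without the \(m \gg k\) simplification:

```python
def stable_fpr(k, P, Max, m=None):
    """Limiting FPR of a stable bloom filter; uses the m >> k simplification if m is None."""
    inner = 1 / k - 1 / m if m is not None else 1 / k
    p0 = (1 / (1 + 1 / (P * inner))) ** Max
    return (1 - p0) ** k

print(stable_fpr(k=3, P=10, Max=3))            # m >> k approximation
print(stable_fpr(k=3, P=10, Max=3, m=10**6))   # with a finite m
```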

The False Negative Rate

An important difference in the design of the SBF as compared to a simple bloom filter is the clearing of counters i.e. setting all the \(d\) bits of a counter to \(0\) which may take place when decrementing counters during a new insertion. This leads to loss of information about keys in the SBF. As a result, we have a non-zero false negative rate i.e. an SBF may return NOT PRESENT for a key with some non-zero probability.

We see a false negative when one or more of the \(k\) counters corresponding to a key have become zero due to new insertions in the SBF. The False Negative Rate for a given query \(q_i\), \(FNR_i\), is given by:

\[ FNR_i = 1 - \prod_{j = 1}^{k} (1 - Pr(arr[h_j(x_i)] = 0 | \delta_i)) \]

where \(Pr(arr[h_j(x_i)] = 0 | \delta_i)\) is the probability that the counter at \(arr[h_j(x_i)]\) becomes \(0\) after \(\delta_i\) number of new insertions. Thus, the false negative rate of a SBF also depends on the insertion and query order.

Single Stable Learned Bloom Filters

Motivation

To use trends in input data to train a machine learning model for membership queries on an ever-growing set \(S\) as in the case of a data stream.

Design

Single Stable Learned Bloom Filter

A Single Stable Learned Bloom Filter (s-SLBF) is the learned variant of a stable bloom filter. It consists of a learned model \(f\) which classifies an element \(x\) as a key when the score \(f(x) \ge \tau\) and as a non-key otherwise. All elements classified as non-keys are then sent to a backup stable bloom filter which makes the final decision on their membership.

The False Positive Rate

For a learned model \(f\) with false positive rate \(F_p\), the expected value of the False Positive Rate (FPR) of a s-SLBF is given by:

\[ E(FPR) = F_p + (1 - F_p)(1 - (\frac{1}{1 + \frac{k}{P} }) ^ {Max})^k \]

Here \((1 - (\frac{1}{1 + \frac{k}{P} }) ^ {Max})^k\) is the probability with which the SBF observes a false positive and the elements which reach the SBF with probability \((1-F_p)\) are "non-keys" as classified by \(f\).

Similar to the LBF, a s-SLBF may offer improved performance for the same space used when compared to a SBF. The convergence rate of a s-SLBF defines how fast the filter approaches its stable FPR. With the same FPR and convergence rate, a s-SLBF uses less space than a SBF. Also, keeping other things constant, a higher False Negative Rate of the s-SLBF needs more elements and more counters for the same convergence rate.

Grouping Stable Learned Bloom Filters

Motivation

A Grouping Stable Learned Bloom Filter (g-SLBF) is the PLBF variant for a SBF.

Design

Grouping Stable Learned Bloom Filter

A g-SLBF has independent bloom filters for different score-based regions. It consists of a classifier \(f\) and \(g\) SBFs, where the \(j^{th}\) SBF, \(SBF_j\), is described by its size \(m_j\), number of hash functions \(k_j\), number of counters cleared on every insertion \(P_j\), and maximum counter value \(Max_j\).

The False Positive Rate

Assuming a decreasing distribution of keys and an increasing distribution of non-keys across regions from \(i = 1\) to \(i = g\), the expected FPR of a g-SLBF is given as:

\[ E(FPR) = \sum_{i = 1}^{g} p_i\left(1 - \left(\frac{1}{1 + \frac{k_i}{P_i}}\right)^{Max_i}\right)^{k_i} = \sum_{i = 1}^{g} p_i\alpha_i \le \frac{1}{g} \sum_{i = 1}^{g} \alpha_i \]

where \(\alpha_i\) denotes the stable FPR of \(SBF_i\).

The FPR of a g-SLBF is bounded by the mean FPR of the SBFs across its regions. This upper bound is independent of the FPR of the learned model. As a result, the g-SLBF is more robust and offers an improvement in performance compared to an s-SLBF.

The false negative rate (FNR) of a g-SLBF also depends on the order of insertion and querying of elements. A g-SLBF is stable only when the SBFs in all regions are stable, and the convergence rate of a g-SLBF is equal to that of its slowest SBF. Both insertion and querying take constant time for all variants of the SBF.

Note: An s-SLBF is a g-SLBF with \(g = 2\). The first region has elements with scores \(f(x) \in [\tau_0 = 0, \tau_1 = \tau]\) handled by \(SBF_1\), and the second region has \(f(x) \in [\tau_1 = \tau, \tau_2 = 1]\) and an \(SBF_2\) which always returns PRESENT.

References

  1. Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 489–504. DOI:10.1145/3183713.3196909
  2. Bloom filters - introduction and implementation. GeeksforGeeks. (2022, April 21).
  3. Michael Mitzenmacher. 2018. A model for learned bloom filters, and optimizing by sandwiching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 462–471.
  4. Prosenjit Bose, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, Yihui Tang. 2008. On the false-positive rate of Bloom filters. Information Processing Letters, Volume 108, Issue 4, 2008, Pages 210-213, ISSN 0020-0190. DOI: 10.1016/j.ipl.2008.05.018
  5. Zhenwei Dai, Anshumali Shrivastava. 2019. Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier. Computing Research Repository (CoRR '20). arXiv.org. DOI:10.48550/arXiv.1910.09131
  6. Kapil Vaidya, Eric Knorr, Tim Kraska, Michael Mitzenmacher. 2020. Partitioned Learned Bloom Filter. Computing Research Repository (CoRR '20). arXiv.org. DOI:10.48550/arXiv.2006.03176
  7. Qiyu Liu, Libin Zheng, Yanyan Shen, and Lei Chen. 2020. Stable learned bloom filters for data streams. Proc. VLDB Endow. 13, 12 (August 2020), 2355–2367. DOI:10.14778/3407790.3407830
  8. Fan Deng and Davood Rafiei. 2006. Approximately detecting duplicates for streaming data using stable bloom filters. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (SIGMOD '06). Association for Computing Machinery, New York, NY, USA, 25–36. DOI:10.1145/1142473.1142477

Aditi


Hello, I am Aditi, a B.Tech. student at DCLL.
