Implementing Bloom-Filter with Redis

Background

One recent project was real-time computation over click logs (1 billion per day). The architecture is simple: flunted collects the raw logs from the front-end machines and sends them to Kafka; Spark consumes the logs, computes the results, and saves them to Redis.

Kafka's Producer and Consumer are both configured to be asynchronous while guaranteeing that no messages are lost. As a result, when a timeout occurs, messages may be re-sent or consumed twice, so idempotency has to be guaranteed on the consuming side. The Spark consumption logic is mainly counting along several dimensions; therefore, we need to de-duplicate before counting to make sure nothing is counted twice.

The volume of data to de-duplicate is large, on the order of 1 billion, and our business scenario tolerates FP (false positives, i.e. data that is actually not a duplicate being misclassified as a duplicate). It is therefore natural to consider a Bloom filter: an algorithm that is extremely space-efficient, has very low time complexity, and has a certain (controllable) false-positive rate.

Bloom-Filter

Introduction

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector plus a series of random mapping functions.

The idea behind the Bloom filter is simple and elegant. Assume a filter with k hash functions and an m-bit vector. An input is added as follows.

1. Compute k hash values using the k hash functions.

2. Take each hash value modulo m to obtain k positions in the filter.

3. Set the bits at these k positions to 1.

Checking whether an input is in the filter works as follows.

1. Compute k hash values using the k hash functions.

2. Take each hash value modulo m to obtain k positions in the filter.

3. Check whether the bits at all k positions are 1; if so, return true, otherwise return false.

[Figure: adding an input to the filter and checking whether an input is in the filter]
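The add and lookup steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the article's Redis implementation: the class name `BloomFilter` and the use of salted SHA-256 to obtain k hash values are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter with an m-bit vector and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item):
        # Assumption: k hash values come from SHA-256 salted with the index i.
        # Any k reasonably independent hash functions would do.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bit at each of the k positions to 1.
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # Present only if all k positions are 1.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


bf = BloomFilter(m=1 << 20, k=8)
bf.add("click-123")
print("click-123" in bf)  # True: an added item is always reported as present
print("click-456" in bf)  # almost certainly False with so few bits set
```

A lookup can return true for an item that was never added (a false positive), but it never returns false for an item that was added.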

False-positive rate calculation

We do not work through the mathematical analysis of false positives in detail here; only the conclusion is given.

$$P \approx \left(1 - e^{-kn/m}\right)^{k}$$

where n is the number of elements added to the filter.

When m/n is fixed, choosing k to be an integer near

$$k = \frac{m}{n}\ln 2$$

minimizes P (the false-positive probability). [1]
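As a rough illustration of this rule, the following Python sketch evaluates the approximation P ≈ (1 − e^(−kn/m))^k and picks the integer next to (m/n)·ln 2 that gives the smaller P. The function names are hypothetical and exist only for this example.

```python
import math


def false_positive_rate(bits_per_item, k):
    """P ~= (1 - e^(-k*n/m))^k, written in terms of bits_per_item = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item):
    """Integer k next to (m/n) * ln 2 that yields the smaller P."""
    ideal = bits_per_item * math.log(2)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda k: false_positive_rate(bits_per_item, k))


for ratio in (10, 30, 50):
    k = optimal_k(ratio)
    print(f"m/n={ratio}: k={k}, P~{false_positive_rate(ratio, k):.3e}")
```

The optimal k minimizes P but also means more hash computations and more bit operations per element, so a smaller k is often chosen once P is already low enough.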

Application Scenarios

Given a set S (a set in the traditional sense, whose elements are all distinct; multisets are not considered in this article) and an element e, determine whether e ∈ S holds. (In academia this is generally called the membership problem.)

Crawlers: has a URL already been crawled? (There are massive numbers of URLs, and false positives are acceptable: the only harm is crawling one page fewer.)

Spam filtering: there are at least several billion spam addresses worldwide, and we have all had the experience of legitimate mail being misjudged as spam. [2]

In practice we need to choose the parameters m and k according to the business's data volume and the acceptable false-positive rate.

Below, we look at the false-positive rate for different values of m/n and k.

False-Positive-Ratio table (with memory space occupation)

Let n be 1 billion, let m/n be 30 and 50 (that is, m is 30 billion and 50 billion bits), and let k be 8 and 16. The results are in the table below:

One billion is the data volume for one day. Assuming the data is evenly distributed over 24 hours, that is about 7 million per 10 minutes. Again letting m/n be 30 and 50 and k be 8 and 16 gives the following table.

From the tables above, taking m/n = 50 and k = 16 meets the business requirement (false-positive rate: 1e-9).
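The figures in this section can be recomputed from the approximation P ≈ (1 − e^(−kn/m))^k. The sketch below does so for n = 1 billion; the helper names are illustrative, and the memory column simply converts m bits to GiB.

```python
import math

N = 1_000_000_000  # n: elements per day, as stated in the article


def false_positive_rate(bits_per_item, k):
    """Approximate false-positive rate for a given m/n and k."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def bitmap_gib(n, bits_per_item):
    """Memory needed for m = n * bits_per_item bits, in GiB."""
    return n * bits_per_item / 8 / 2**30


rows = [(r, k, false_positive_rate(r, k), bitmap_gib(N, r))
        for r in (30, 50) for k in (8, 16)]

print(f"{'m/n':>4} {'k':>3} {'P (approx.)':>12} {'memory (GiB)':>13}")
for r, k, p, gib in rows:
    print(f"{r:>4} {k:>3} {p:>12.2e} {gib:>13.2f}")
```

For m/n = 50 and k = 16 this gives a false-positive rate of roughly 1e-9 and about 5.82 GiB of bitmap, in line with the numbers quoted in the article.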

That completes the theoretical groundwork. Next comes a generic Redis-based implementation. First we need to understand Redis's SETBIT command.

The SETBIT command of the Redis String data type

SETBIT key offset value

For the string value stored at key, sets or clears the bit at the specified offset.

The bit is set or cleared depending on the value parameter, which can be either 0 or 1.

When key does not exist, a new string value is created automatically.

The string is grown as needed to make sure it can hold a value at the specified offset. When the string is grown, the added positions are filled with zeros.

The offset parameter must be greater than or equal to 0 and less than 2^32 (bitmaps are limited to 512 MB).

Available since: 2.2.0

Time complexity: O(1)

Return value: the bit originally stored at the specified offset.

The above is excerpted from the Redis manual. Clearly, SETBIT operates on a string value at the bit level, and a Bloom filter also operates on bits, so we can use SETBIT to implement a Bloom filter.
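To make these semantics concrete, here is a small Python emulation of the documented SETBIT behaviour on an in-memory dictionary. It is only a model for illustration (the `setbit` function and `store` dictionary are hypothetical), not a Redis client call.

```python
def setbit(store, key, offset, value):
    """Emulate SETBIT on a dict mapping keys to bytearrays; return the old bit."""
    if not 0 <= offset < 2**32:
        raise ValueError("offset must satisfy 0 <= offset < 2^32")
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    buf = store.setdefault(key, bytearray())   # missing key -> new empty string
    byte_index = offset >> 3
    bit_index = 7 - (offset & 7)               # offset 0 is the most significant bit
    if byte_index >= len(buf):
        buf.extend(bytes(byte_index + 1 - len(buf)))  # grow, zero-filled
    old = (buf[byte_index] >> bit_index) & 1
    if value:
        buf[byte_index] |= 1 << bit_index
    else:
        buf[byte_index] &= ~(1 << bit_index) & 0xFF
    return old


store = {}
print(setbit(store, "bf", 7, 1))    # 0: the bit was previously unset
print(setbit(store, "bf", 7, 0))    # 1: returns the old value, then clears the bit
print(setbit(store, "bf", 100, 1))  # 0: the string grows to hold bit 100
print(len(store["bf"]))             # 13 bytes
```

Because the command returns the previous bit, a single call both updates the filter and reports whether that position was already set.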

Here we'll implement a generic Bloom filter based on PHP, step by step.

Demo based on phpredis

BKDRHash

BKDRHash is a hash function that is both easy to remember and performs very well [3]. Its C-language description is as follows:

The Bloom filter algorithm needs multiple hash functions. We can give BKDRHash different seeds to obtain multiple hash functions, as shown in the following PHP code.

BKDRHash implementation for PHP

The getBKDRHashSeed function produces the different seeds: n runs from 0 to k-1, yielding k seeds, which are passed to BKDRHash to compute the k hash codes.
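The seeded-hash idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the article's PHP code: the seed sequence 31, 131, 1313, 13131, ... is the one conventionally used with BKDRHash, the hash is masked to 31 bits like the common C version, and the function names are hypothetical.

```python
def bkdr_seed(n):
    """Return the n-th seed of the assumed sequence 31, 131, 1313, 13131, ..."""
    if n == 0:
        return 31
    return int("".join("1" if i % 2 == 0 else "3" for i in range(n + 2)))


def bkdr_hash(text, seed):
    """BKDR hash: h = h * seed + byte, with 32-bit wraparound."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0xFFFFFFFF  # emulate unsigned 32-bit overflow
    return h & 0x7FFFFFFF                   # keep 31 bits, as in the common C version


def positions(text, k, m):
    """k filter positions for text, one per seed, each reduced modulo m."""
    return [bkdr_hash(text, bkdr_seed(i)) % m for i in range(k)]


print([bkdr_seed(i) for i in range(4)])  # [31, 131, 1313, 13131]
print(positions("hello", 8, 1 << 20))
```

With a 31-bit hash the positions never reach 2^31 or above, so a filter with a larger m would need a wider hash value to use all of its bits.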

Implementation code

The above code is the class implementation of Bloom filter.

For testing, we generated 10 million random strings (length in [4, 12], all lowercase letters) with the code above and wrote them to the file sample.txt.

Check how many duplicates it contains.
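A comparable sample can be produced and checked for duplicates with a short Python sketch. The function name, the fixed random seed, and the reduced sample size are assumptions chosen to keep the example quick; the article itself used 10 million strings written to sample.txt.

```python
import random
import string
from collections import Counter


def random_strings(count, min_len=4, max_len=12, seed=0):
    """Random lowercase strings with lengths in [min_len, max_len]."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(min_len, max_len)))
        for _ in range(count)
    ]


sample = random_strings(100_000)        # the article used 10 million
counts = Counter(sample)
duplicates = len(sample) - len(counts)  # entries that repeat an earlier string
print(f"{len(sample)} strings, {len(counts)} distinct, {duplicates} duplicates")
```

Knowing the true number of duplicates in the sample is what lets the test distinguish real repeats from false positives reported by the filter.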

The test script is as follows.

Test parameters.

m=2^32=4294967296(m/n = 4294967296/10000000 ≈ 429.50)

k=8

Test results

Total time: 1h4m45s

Redis QPS: 20592/s (one add operation requires k (8) requests to Redis)

Correctness.

Number of false positives: 0

Optimization

In the code above, every time an item is added to the Bloom filter, Redis is requested k times, and the performance is lost to network I/O, so let's optimize this part first.

Introduction to Redis pipelining

Redis Pipelining saves RTT (Round Trip Time) by sending multiple commands at once and executing them in order, returning the results.

Each SETBIT is independent of the others, and there is no need to guarantee atomicity across them, so MULTI is unnecessary. According to the material we consulted, pipelining improves efficiency by about 10 times, whereas MULTI actually reduces it.
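The effect on round trips can be illustrated without a Redis server. In the Python sketch below, `FakeRedis` and `FakePipeline` are hypothetical in-memory stand-ins that only count round trips; the method names `pipeline(transaction=False)`, `setbit` and `execute` mirror those of the redis-py client, so the two `add_*` functions should look the same against a real connection.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client that counts network round trips."""

    def __init__(self):
        self.bits = {}         # (key, offset) -> 0 or 1
        self.round_trips = 0

    def _apply(self, key, offset, value):
        old = self.bits.get((key, offset), 0)
        self.bits[(key, offset)] = value
        return old

    def setbit(self, key, offset, value):
        self.round_trips += 1  # a command sent on its own costs one round trip
        return self._apply(key, offset, value)

    def pipeline(self, transaction=False):
        return FakePipeline(self)


class FakePipeline:
    def __init__(self, client):
        self.client = client
        self.queued = []

    def setbit(self, key, offset, value):
        self.queued.append((key, offset, value))  # buffered, nothing sent yet
        return self

    def execute(self):
        self.client.round_trips += 1              # the whole batch is one round trip
        results = [self.client._apply(*cmd) for cmd in self.queued]
        self.queued = []
        return results


def add_naive(client, key, offsets):
    return [client.setbit(key, off, 1) for off in offsets]


def add_pipelined(client, key, offsets):
    pipe = client.pipeline(transaction=False)     # plain pipeline, no MULTI/EXEC
    for off in offsets:
        pipe.setbit(key, off, 1)
    return pipe.execute()                         # old bit values, in order


offsets = [3, 17, 42, 99, 256, 1024, 4096, 65536]  # k = 8 positions
a, b = FakeRedis(), FakeRedis()
add_naive(a, "bf", offsets)
add_pipelined(b, "bf", offsets)
print(a.round_trips, b.round_trips)  # 8 1
```

Since every SETBIT returns the previous bit, the list returned by the pipelined add also tells the caller whether all k positions were already set, that is, whether the item was probably seen before.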

The class after optimization

Test results after optimization

Total elapsed time: 13m21s

Redis QPS: 12000/s

Correctness.

Number of false positives: 0

The speed has increased by about 5 times!

Further optimization

The description of SETBIT in the official Redis documentation, quoted earlier, contains this sentence.

bitmaps are limited to 512 MB

Now look back at the earlier section

False-Positive-Ratio table (with memory space occupation)

You can see that if m is 50 billion, the Bloom filter takes up about 5.82 GB of memory, clearly exceeding Redis's bitmap size limit.

Therefore we need to make this Bloom filter implementation distributed: build multiple bit tables according to the size of m, and shard different inputs to the corresponding bit table.

Distributed Bloom-Filter

Considering that the memory of a single redis instance is capped, we can design two levels of sharding.

The first level shards different inputs to the corresponding Redis instance.

The second level shards the input to the corresponding key (different keys represent different Bloom filters).
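One possible way to express the two levels is sketched below in Python. Everything here is an assumption for illustration: the instance addresses, the key prefix, the `locate` function, and the use of CRC32 (salted differently for each level) are not taken from the article's implementation.

```python
import zlib

BITS_PER_KEY = 2**32  # SETBIT offsets must be below 2^32 (512 MB per key)


def min_keys(m):
    """Smallest number of keys needed to hold m bits in total."""
    return -(-m // BITS_PER_KEY)  # ceiling division


def locate(item, instances, keys_per_instance, key_prefix="bf"):
    """Map an item to (Redis instance, key); each key is its own Bloom filter."""
    data = item.encode("utf-8")
    # Level 1: choose the Redis instance.
    instance = instances[zlib.crc32(b"instance:" + data) % len(instances)]
    # Level 2: choose the key inside that instance, using a different salt
    # so the two choices are not tied to each other.
    key_index = zlib.crc32(b"key:" + data) % keys_per_instance
    return instance, f"{key_prefix}:{key_index}"


instances = ["redis-a:6379", "redis-b:6379"]  # hypothetical addresses
print(min_keys(50_000_000_000))               # 12 keys for m = 50 billion bits
print(locate("click-123", instances, keys_per_instance=6))
```

The same item always maps to the same instance and key, so an add and a later lookup touch the same bitmap, and the k bit operations for one item can still be sent in a single pipeline.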

Optimized demo (full code)

In summary, we have implemented a generic Bloom filter based on Redis.

Project repository

Github: https://github.com/0x5446/BloomFilter-redis

Author: Tian Feng (Mr. Tian was formerly the technical manager of 360 Encyclopedia, now in business)
