Implementing Bloom-Filter with Redis
background
One recent project was a click log(10 100 million/ day) real-time computing, The architecture is simply to use theflunted Go and collect the raw logs from the front-end machine, Then send it toKafka,Spark Consume the log and calculate the results of the save toRedis。
Kafka ofProducer harmonyConsumer The configuration on the end is asynchronous and guarantees that no messages are lost, So when a timeout occurs, It could lead to re-sending or double consumption of messages, Need to guarantee power at the point of consumption, etc.。Spark The consumption logic is mainly based on several dimensions for counting calculations, therefore, We need to de-duplicate the count before we calculate it to ensure we don't double count。
Considering the large size of the de-duplicated data, because of10 billion in magnitude, and our business scenario allowsFP(False-Positive, false positive, i.e., actually non-duplicate data, Misclassified as duplicate data), It is therefore natural to consider the use ofBloom-Filter( Blum filter) This is extremely space efficient, and with very low time complexity as well, There is a certain amount of misjudgment( controllable) of the algorithm。
Bloom-Filter
presentation
The Bloom filter is made by Barton . proposed by Blume in 1970. It is actually a very long binary vector and a series of random mapping functions.
The idea of Bloom filter is simple and elegant. We assume a vector filter with k hash functions and m-bit bits: the input is processed as follows.
Calculation of hash values using k hash functions.
(a) Take the remainder of each hash value over m to obtain k positions in the filter.
will thisk six locationsbit place as1
The operation to determine whether an input is in the filter is as follows.
Calculation of hash values using k hash functions.
(a) Take the remainder of each hash value over m to obtain k positions in the filter.
See if all positions are 1, if so return true, otherwise return false
The following figure illustrates.
False alarm rate calculation
Without developing the mathematical analysis of False positives in detail here, only the conclusion is given.
P approx ( 1 - e ^ {- frac n}) ^ k
P≈(1−e
−
n
mk
)
k
When m/n is fixed, choose
k = frac nm ln2
k=
m
n
ln2
An integer in the vicinity will minimize P (False positive possibility). [1]
Application Scenarios
Given a set S (note that the set here is a set in the traditional sense: the elements are different from each other. (Multiset is not considered in this paper), given an element e, it is necessary to determine whether e ∈ Se ∈ S holds. (generally referred to in academic circles as the issue of MEMBERSHIP)
Crawler: whether the URL has been crawled (massive url, False Positive allowed - what's the harm in one less crawl)
spam message: There are at least a few billion spam addresses worldwide, We've all had the experience of being misjudged as spam[2]
In practice we need to select the parameters m and k for the data level of the business and the requirement of the amount of false positives.
Below, we look at the performance of the misclassification rate for different conditions of m/n,k.
False-Positive-Ratio table (with memory space occupation)
set upn because of10 100 million, set upm separately30、50,k separately8、16, Results in the table below:
1 billion is the amount of data for 1 day, and assuming that the data is evenly distributed over 24 hours, that's about 7 million for 10 minutes, let m be 30 and 50 respectively, and k be 8 and 16 respectively, resulting in the following table.
From the above table, desirabilitym/n because of50, k because of16, Ability to meet operational requirements( miscarriage of justice:1e-9)。
The above, theoretical preparation is sufficient, followed by a generic Redis-based implementation solution. First we need to understand the SETBIT method of Redis first.
SETBIT method for the Redis data structure String
SETBIT key offset value
right key The stored string value, Sets or clears the bit at the specified offset(bit)。
The bit is set or cleared depending on the value parameter, which can be either 0 or 1.
proper key non-existent, Automatically generates a new string value。
The string will be stretched (grown) to ensure that it can save the value at the specified offset. When the string value is stretched, the blank positions are filled with zeros.
offset The parameter must be greater than or equal to 0 , less than, < 2^32 ( bit mapping is limited to 512 MB)。
Available versions:>= 2.2.0
time complexity: O(1)
Return value: the bit where the specified offset was originally stored.
The above is taken fromRedis brochure, it is (thus) clear that,SETBIT The method can be used forstring typevalue act asbit Level of operation, but (not)Bloom filter It is also forbit operate, So we can useSETBIT come trueBloom filter。
Here we'll implement a generic Bloom filter based on PHP, step by step.
Demo based on phpredis
BKDRHash
BKDRHash is a hash function that is both good to remember and outstanding[3],C The language description is as follows:
Bloom filter The algorithm requires multipleHash function, We can giveBKRDHash Set the differentseed to complete multipleHash count, as followsPHP Code shown。
BRDKHash implementation for php
The getBKDRHashSeed function is used to get the different seeds, n is taken in order from 0 to k-1, resulting in k seeds, which are passed into BKDRHash to calculate the k hashCode.
Implementation code
The above code is the class implementation of Bloom filter.
For testing purposes, we generated 1000w random strings (length [4,12], all lowercase letters) by the above code and wrote them to the sample.txt file.
Look how many duplicates there are of.
The test script is as follows.
Test parameters.
m=2^32=4294967296(m/n = 4294967296/10000000 ≈ 429.50)
k=8
Test results
total time spent:1h4m45s
Bloom-Filter Add QPS: 2574/s
Redis QPS: 20592/s (one add operation requires k (8) requests to redis)
Correctness.
Number of miscarriages of justice: 0
optimisation
In the above code, each time you add a piece of data to Bloom filter, you need to request redis k times, and the performance is lost in network IO, so let's optimize this part first.
redis ofpipelining presentation
Redis Pipelining saves RTT (Round Trip Time) by sending multiple commands at once and executing them in order, returning the results.
Each SETBIT is independent and there is no connection between them, there is no need to ensure their atomicity, so there is no need to use the multi approach, from the relevant information to check, the efficiency of using pipelining is improved by about 10 times, while multi will reduce the efficiency instead.
optimisation posterior class
optimisation post Test results
Total elapsed time: 13m21s
Bloom-Filter Add QPS: 12000/s
Redis QPS:12000/s
Correctness.
Number of miscarriages of justice: 0
The speed has increased.5 times (multiplier)!
again optimisation
Just now the official Redis documentation inside the introduction to SETBIT has this sentence.
bit mapping is limited to 512 MB
Scroll back to the above article
False-Positive-Ratio table (with memory space occupation)
You can see that if m is 50 billion, the Bloom filter will take up about 5.82 GB of memory space, significantly checking past Redis' bit mapping range limit.
Therefore we need to do a distributed adaptation of this Bloom filter implementation to build multiple bit tables depending on the size of m. Different inputs will be sharding to the corresponding bit tables.
Distributed Bloom-Filter
Considering that the memory of a single redis instance is capped, we can design two levels of sharding.
The first level shards the different inputs to the corresponding redis instances
The second level sharding the input to the corresponding key (different keys represent different Bloom filters)
Optimized demo (full code)
In summary, we have implemented a generic Bloom filter based on Redis.
Project repository
Github: https://github.com/0x5446/BloomFilter-redis
Author: Tian Feng (Mr. Tian was formerly the technical manager of 360 Encyclopedia, now in business)