If you are here, you probably don’t need to be reminded about the nature of a Bloom filter. If you do, the Wikipedia page is a good place to start. This module implements a Bloom filter in Python that’s fast and uses mmap files for better scalability. Did I mention that it’s fast?
Here’s a quick example:
from pybloomfilter import BloomFilter
bf = BloomFilter(10000000, 0.01, 'filter.bloom')
with open("/usr/share/dict/words") as f:
    for word in f:
        bf.add(word.rstrip())
print 'apple' in bf
#outputs True
That wasn’t so hard, was it? Now, there are a lot of other things we can do. For instance, let’s say we want to create a similar filter with just a few pieces of fruit:
fruitbf = bf.copy_template("fruit.bloom")
fruitbf.update(("apple", "banana", "orange", "pear"))
print fruitbf.to_base64()
"eJzt2k13ojAUBuA9f8WFyofF5TWChlTHaPzqrlqFCtj6gQi/frqZM2N7aq3Gis59d2ye85KTRbhk"
"0lyu1NRmsQrgRda0I+wZCfXIaxuWv+jqDxA8vdaf21HIOSn1u6LRE0VL9Z/qghfbBmxZoHsqM3k8"
"N5XyPAxH2p22TJJoqwU9Q0y0dNDYrOHBIa3BwuznapG+KZZq69JUG0zu1tqI5weJKdpGq7PNJ6tB"
"GKmzcGWWy8o0FeNNYNZAQpSdJwajt7eRhJ2YM2NOkTnSsBOCGGKIIYbY2TA663GgWWyWfUwn3oIc"
"fyLYxeQwiF07RqBg9NgHrG5ba3jba5yl4zS2LtEMMcQQQwwxmRiBhPGOJOywIPafYhUwqnTvZOfY"
"Zu40HH/YxDexZojJwsx6ObDcT7D8vVOtJBxiAhD/AjMmjeF2Wnqd+5RrHdo4azPEzoANabiUhh0b"
"xBBDDDHEENsf8twlrizswEjDhnTbzWazbGKpQ5k07E9Ox2iFvXBZ2D9B7DawyqLFu5lshhhiiGUK"
"a4nUloa9yxkwR7XhgPPXYdhRIa77uDtnyvqaIXalGK02ufv3J36GmsnG4lquPnN9gJo1VNxqgYbt"
"ji/EC8s1PWG5fuVizW4Jox6/3o9XxBBDDLFbwcg9v/AwjrPHtTRsX34O01mxLw37bhCTjJk0+PLK"
"08HYd4MYYojdKmYnBfjsktEpySY2tGGZzWaIIfYDGB271Yaieaat/AaOkNKb"
All of the reference information is available below:
As I already mentioned, there are a couple of reasons to use this module:
- It natively uses mmapped files.
- It natively supports the set operations you want a Bloom filter to do.
- It is Fast (see Benchmarks).
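To make the "set things" point a little more concrete, here is a minimal pure-Python sketch of why a template copy matters: two Bloom filters can only be merged when they share the same size and hash scheme, and in that case a union is just a bitwise OR of the bit arrays. `TinyBloom` and its methods are hypothetical names made up for illustration. This is not how the module is implemented, and its `copy_template` takes no filename, unlike the call in the quick example.

```python
import hashlib


class TinyBloom(object):
    """Illustrative in-memory Bloom filter; NOT this module's implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash (an assumption).
        for seed in range(self.num_hashes):
            salted = ("%d:%s" % (seed, item)).encode("utf-8")
            yield int(hashlib.sha256(salted).hexdigest(), 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def copy_template(self):
        # Same shape and hash scheme as this filter, but with no bits set.
        return TinyBloom(self.num_bits, self.num_hashes)

    def union(self, other):
        # Merging is only meaningful between identically shaped filters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        for i, byte in enumerate(other.bits):
            self.bits[i] |= byte
        return self


words = TinyBloom(num_bits=4096, num_hashes=3)
words.add("zebra")

fruit = words.copy_template()
for name in ("apple", "banana", "orange", "pear"):
    fruit.add(name)

words.union(fruit)
print("apple" in words, "zebra" in words)  # True True
```

Both lookups are guaranteed to succeed because a Bloom filter never produces false negatives. A lookup for something that was never added may still come back `True`, which is the false-positive rate you trade for the compact size.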
I have a simple benchmark in test/speedtest.py which compares this module to the good pybloom module:
(pybloom module)
pybloom load took 0.76436 s/run
pybloom tests took 0.16205 s/run
Errors: 0.25% positive 0.00% negative
(this module)
pybloomfilter load took 0.05423 s/run
pybloomfilter tests took 0.00659 s/run
Errors: 0.26% positive 0.00% negative
In this test we just looked at adding words from a dictionary file, then testing to see if each word of another file was in the dictionary.
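The shape of that test can be sketched roughly as follows. This is not the contents of test/speedtest.py, just a guess at the general idea: `run_benchmark` is a hypothetical helper, and reporting errors as a percentage of probed words is my assumption about how the numbers are computed. It accepts anything with `add` and `in` support, so a plain `set` stands in for a filter here.

```python
from timeit import default_timer


def run_benchmark(make_filter, load_words, probe_words):
    """Time loading and probing, and compare answers with an exact set.

    Hypothetical helper: make_filter is a zero-argument callable returning
    an object that supports .add(word) and `word in obj`.
    """
    truth = set(load_words)

    start = default_timer()
    bf = make_filter()
    for word in load_words:
        bf.add(word)
    load_seconds = default_timer() - start

    start = default_timer()
    answers = [word in bf for word in probe_words]
    test_seconds = default_timer() - start

    # A false positive is a "yes" for a word that was never loaded;
    # a false negative is a "no" for a word that was loaded.
    false_pos = sum(1 for w, hit in zip(probe_words, answers)
                    if hit and w not in truth)
    false_neg = sum(1 for w, hit in zip(probe_words, answers)
                    if not hit and w in truth)
    total = float(len(probe_words))
    return (load_seconds, test_seconds,
            100.0 * false_pos / total, 100.0 * false_neg / total)


load = ["apple", "banana", "pear"]
probe = ["apple", "kiwi", "pear", "plum"]
load_s, test_s, fp, fn = run_benchmark(set, load, probe)
print("Errors: %.2f%% positive %.2f%% negative" % (fp, fn))
```

An exact `set` never errs in either direction, so the stand-in prints zero for both. Passing a callable that builds a real filter instead would show a nonzero positive rate and, as in the figures above, a negative rate of zero.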
Since this package natively uses mmap files, no serialization is needed. Therefore, if you have to do a lot of moving between disks, etc., this module is an obvious win.
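Here is a minimal sketch of why an mmap-backed bit array needs no serialization step, using only the standard library. The sizes, the hashing scheme and the helper names (`open_bits`, `add`, `contains`) are all assumptions for illustration and say nothing about this module's actual file format. It is written for Python 3, where mmap objects index as integers.

```python
import hashlib
import mmap
import os
import tempfile

NUM_BITS = 8192   # illustrative size, not a tuned value
NUM_HASHES = 3    # illustrative hash count


def positions(item):
    # Derive bit positions by salting a SHA-256 hash (an assumption).
    for seed in range(NUM_HASHES):
        salted = ("%d:%s" % (seed, item)).encode("utf-8")
        yield int(hashlib.sha256(salted).hexdigest(), 16) % NUM_BITS


def open_bits(path):
    """Map a fixed-size bit array file, creating it zero-filled if missing."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(b"\x00" * (NUM_BITS // 8))
    f = open(path, "r+b")
    return f, mmap.mmap(f.fileno(), NUM_BITS // 8)


def add(bits, item):
    for pos in positions(item):
        bits[pos // 8] |= 1 << (pos % 8)


def contains(bits, item):
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(item))


path = os.path.join(tempfile.mkdtemp(), "sketch.bloom")

# First session: set some bits, then flush and close the mapping.
f, bits = open_bits(path)
add(bits, "apple")
bits.flush()
bits.close()
f.close()

# Second session: no load or deserialize step, the file already is the bit array.
f, bits = open_bits(path)
print(contains(bits, "apple"))  # True
bits.close()
f.close()
```

Because the file on disk is the data structure itself, moving a filter somewhere else amounts to copying one file.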
You do not need Cython to install from source, since I keep a cached version of the C output in the source distribution. Thus, to install you should only need to run:
$ sudo pip install pybloomfiltermmap
You can also download the latest tar file from the GitHub tags. Once you download it, you should only have to run:
$ sudo python setup.py install
to build and install the module.
To develop you will need Cython. The setup.py script should automatically build from Cython source if the Cython module is available.