Welcome to Python BloomFilter’s documentation!

If you are here, you probably don’t need to be reminded about the nature of a Bloom filter. If you do, the Wikipedia page is a good place to start. This module implements a Bloom filter in Python that’s fast and uses mmap files for better scalability. Did I mention that it’s fast?

Here’s a quick example:

from pybloomfilter import BloomFilter

# Arguments: capacity, target false-positive rate, backing file name
bf = BloomFilter(10000000, 0.01, 'filter.bloom')

with open("/usr/share/dict/words") as f:
    for word in f:
        bf.add(word.rstrip())

print('apple' in bf)
# outputs True

That wasn’t so hard, was it? Now, there are a lot of other things we can do. For instance, let’s say we want to create a similar filter with just a few pieces of fruit:

fruitbf = bf.copy_template("fruit.bloom")
fruitbf.update(("apple", "banana", "orange", "pear"))
print(fruitbf.to_base64())
"eJzt2k13ojAUBuA9f8WFyofF5TWChlTHaPzqrlqFCtj6gQi/frqZM2N7aq3Gis59d2ye85KTRbhk"
"0lyu1NRmsQrgRda0I+wZCfXIaxuWv+jqDxA8vdaf21HIOSn1u6LRE0VL9Z/qghfbBmxZoHsqM3k8"
"N5XyPAxH2p22TJJoqwU9Q0y0dNDYrOHBIa3BwuznapG+KZZq69JUG0zu1tqI5weJKdpGq7PNJ6tB"
"GKmzcGWWy8o0FeNNYNZAQpSdJwajt7eRhJ2YM2NOkTnSsBOCGGKIIYbY2TA663GgWWyWfUwn3oIc"
"fyLYxeQwiF07RqBg9NgHrG5ba3jba5yl4zS2LtEMMcQQQwwxmRiBhPGOJOywIPafYhUwqnTvZOfY"
"Zu40HH/YxDexZojJwsx6ObDcT7D8vVOtJBxiAhD/AjMmjeF2Wnqd+5RrHdo4azPEzoANabiUhh0b"
"xBBDDDHEENsf8twlrizswEjDhnTbzWazbGKpQ5k07E9Ox2iFvXBZ2D9B7DawyqLFu5lshhhiiGUK"
"a4nUloa9yxkwR7XhgPPXYdhRIa77uDtnyvqaIXalGK02ufv3J36GmsnG4lquPnN9gJo1VNxqgYbt"
"ji/EC8s1PWG5fuVizW4Jox6/3o9XxBBDDLFbwcg9v/AwjrPHtTRsX34O01mxLw37bhCTjJk0+PLK"
"08HYd4MYYojdKmYnBfjsktEpySY2tGGZzWaIIfYDGB271Yaieaat/AaOkNKb"

Reference

All of the reference information is available below:

Why pybloomfilter

As I already mentioned, there are a couple of reasons to use this module:

  • It natively uses mmapped files.
  • It natively supports the set operations you want from a Bloom filter.
  • It is fast (see Benchmarks).
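To make the set-style operations in the list above concrete: for two Bloom filters that share the same size and hash functions, a union is just a bitwise OR of their bit arrays. The following is a minimal pure-Python sketch of that idea, assuming Python 3. It is not this module's implementation, and NUM_BITS, NUM_HASHES, positions, make_bits and might_contain are hypothetical names chosen for illustration.

```python
import hashlib

# Hypothetical parameters for this sketch only.
NUM_BITS = 1024
NUM_HASHES = 3


def positions(item):
    """Yield NUM_HASHES bit positions for item, using salted SHA-256."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(("%d:%s" % (i, item)).encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def make_bits(items):
    """Build a bit array (stored as a Python int) with every item added."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits, item):
    """Return True if every bit for item is set (false positives possible)."""
    return all((bits >> pos) & 1 for pos in positions(item))


citrus = make_bits(["orange", "lemon"])
pomes = make_bits(["apple", "pear"])

# The union of two compatible filters is the bitwise OR of their bits.
union_bits = citrus | pomes
```

The OR-ed bit array is identical to the one you would get by adding all four fruits to a single filter, which is why a union never needs the original items.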

Benchmarks

Simple load and add speed

I have a simple benchmark in test/speedtest.py which compares this module to the good pybloom module:

(pybloom module)
pybloom load took 0.76436 s/run
pybloom tests took 0.16205 s/run
Errors: 0.25% positive 0.00% negative

(this module)
pybloomfilter load took 0.05423 s/run
pybloomfilter tests took 0.00659 s/run
Errors: 0.26% positive 0.00% negative

This test adds every word from a dictionary file to the filter, then checks whether each word from a second file is in the filter.
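For readers who want to run a comparison of their own, here is a rough sketch of that kind of measurement, assuming Python 3. It is not the contents of test/speedtest.py. The function name time_load_and_test is hypothetical, the error rates are computed as a fraction of the probe words (an assumption on my part), and a plain set stands in for a filter so the sketch runs without any third-party module.

```python
import time


def time_load_and_test(make_filter, train_words, probe_words):
    """Time adding train_words, then time membership tests on probe_words.

    make_filter is any zero-argument callable returning an object that
    supports .add(item) and the `in` operator.
    """
    start = time.perf_counter()
    bf = make_filter()
    for word in train_words:
        bf.add(word)
    load_seconds = time.perf_counter() - start

    # Time only the lookups; error counting happens afterwards.
    start = time.perf_counter()
    hits = [word in bf for word in probe_words]
    test_seconds = time.perf_counter() - start

    truth = set(train_words)
    false_pos = sum(1 for w, hit in zip(probe_words, hits) if hit and w not in truth)
    false_neg = sum(1 for w, hit in zip(probe_words, hits) if not hit and w in truth)
    total = float(len(probe_words))
    return load_seconds, test_seconds, false_pos / total, false_neg / total


train = ["apple", "banana", "orange"]
probe = ["apple", "pear", "kiwi", "banana"]

# An exact set is used as a stand-in; pass a Bloom filter factory instead.
load_s, test_s, fp_rate, fn_rate = time_load_and_test(set, train, probe)
print("load took %.5f s, tests took %.5f s" % (load_s, test_s))
print("Errors: %.2f%% positive %.2f%% negative" % (100 * fp_rate, 100 * fn_rate))
```

With an exact set both error rates are zero. A real Bloom filter should still show zero false negatives and a small false-positive rate.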

Serialization

Since this package natively uses mmap files, no serialization is needed. Therefore, if you have to move filters between disks a lot, this module is an obvious win.
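To show why an mmap-backed filter needs no separate serialization step, here is a toy sketch using only the standard library, assuming Python 3. It is not how this module is implemented. TinyMmapBloom, its parameters and the salted SHA-256 hashing are all assumptions made for illustration. The point is only that the bits live in a file, so reopening the file brings the filter back.

```python
import hashlib
import mmap
import os
import tempfile


class TinyMmapBloom(object):
    """Toy mmap-backed Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, path, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        num_bytes = num_bits // 8
        # Create a zero-filled backing file unless a matching one exists.
        if not os.path.exists(path) or os.path.getsize(path) != num_bytes:
            with open(path, "wb") as f:
                f.write(b"\x00" * num_bytes)
        self._file = open(path, "r+b")
        self._bits = mmap.mmap(self._file.fileno(), num_bytes)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] = self._bits[pos // 8] | (1 << (pos % 8))

    def __contains__(self, item):
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def close(self):
        # Flush the mapped pages to the file and release the handles.
        self._bits.flush()
        self._bits.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "tiny.bloom")

bf = TinyMmapBloom(path)
bf.add("apple")
bf.close()

# Reopening the same file restores the filter; nothing was serialized.
reopened = TinyMmapBloom(path)
found_after_reopen = "apple" in reopened
reopened.close()
```

Moving such a filter to another disk is just a file copy, provided both sides agree on the bit count and hash scheme.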

Install

You do not need Cython to install from source, since I keep a cached version of the C output in the source distribution. Thus, to install you should only need to run:

$ sudo pip install pybloomfiltermmap

You can also download the latest tar file from the GitHub tags. Once you download it, you should only have to run:

$ sudo python setup.py install

to build and install the module.

Develop

To develop you will need Cython. The setup.py script should automatically build from Cython source if the Cython module is available.