[http://en.wikipedia.org/wiki/Set_(abstract_data_type) Set Data Types] store collections of distinct objects without any particular ordering; the basic operations are insertion, removal, and membership testing.
[http://en.wikipedia.org/wiki/Bloom_filter Bloom Filters] are very space-efficient set structures, but their membership tests can return false positives. They use $O(n)$ bits, i.e. $O\left(\frac{n}{w}\right)$ words of space where $w$ is the machine word size. They also cannot handle intersection or element removal.
They are useful as quick-but-inaccurate whitelists placed in front of more rigorous filters. They are also useful for deciding what to log when a large number of events carry some kind of identifier: logging an identifier only when it is not already in the filter keeps the overall size of the log files down.
Structure:
An array of N cells, initially all 0, together with k hash functions. To add an element, hash it with each of the k functions and set the k resulting cells to 1. To test membership, check whether all k of those cells are 1; this can report false positives but never false negatives. (A code sketch appears below.)
Increasing N reduces the false-positive rate, but uses more space (operations still take only $O(k)$ time, independent of N).
Increasing k means a false-positive query must collide in more cells, but it also fills in more of the total array, making each individual collision more likely. The false-positive rate is therefore minimized at an intermediate value of k (approximately $\frac{N}{n}\ln 2$).
The nk hash values are distributed among the N cells as in a [http://en.wikipedia.org/wiki/Multinomial_distribution Multinomial distribution], and the probability that a particular cell is set to 1 after storing n elements is $1 - \left( 1 - \frac{1}{N}\right)^{nk}$. So the expected number of cells filled in the final array is $N\left( 1 - \left( 1 - \frac{1}{N}\right)^{nk}\right)$.
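A minimal sketch of the structure in Python. The SHA-256 double-hashing scheme and all names here are illustrative assumptions, not part of these notes:

```python
import hashlib

class BloomFilter:
    """Array of N cells with k hash functions per element."""

    def __init__(self, N, k):
        self.N, self.k = N, k
        self.bits = [0] * N

    def _cells(self, item):
        # Double hashing: derive k indices from two 64-bit halves of a digest.
        d = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1   # force an odd stride so probes vary
        return [(h1 + i * h2) % self.N for i in range(self.k)]

    def add(self, item):
        for c in self._cells(item):
            self.bits[c] = 1

    def __contains__(self, item):
        # No false negatives; a false positive occurs only when all k
        # of the item's cells happen to have been set by other elements.
        return all(self.bits[c] for c in self._cells(item))

f = BloomFilter(N=1000, k=5)
f.add("alice")
print("alice" in f)   # True
print("bob" in f)     # almost certainly False at this low load
```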
Counting Bloom Filters:
A variant that allows removals. Each cell stores a count of the number of set elements mapped to it instead of just a 1 or a 0. Adding an element increments all k counts and removing it decrements them; membership tests are the same, except that we check that all k counts are greater than zero. False positives are still possible.
With high probability, all counts are $O\left(\frac{\log(nk)}{\log\log(nk)}\right)$ (the standard balls-in-bins maximum load bound). So $O(\log\log n)$ bits per cell suffice to keep the counts from overflowing.
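Continuing the earlier sketch, a counting variant only changes how cells are updated (again illustrative; the BloomFilter base class is from the hypothetical example above, not from these notes):

```python
class CountingBloomFilter(BloomFilter):
    """Cells hold small counters instead of bits, enabling removal."""

    def add(self, item):
        for c in self._cells(item):
            self.bits[c] += 1

    def remove(self, item):
        # Only valid for items that were actually added; removing an
        # absent item drives counts negative and corrupts later queries.
        for c in self._cells(item):
            self.bits[c] -= 1

    # The membership test is inherited unchanged: all k counters
    # must be nonzero, so false positives remain possible.
```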
Invertible Bloom Filters:
NOT space-efficient relative to a standard Bloom filter, but, building on the counting idea, able to “read back” the members of a small set with high probability. Has useful applications in streaming data structures, e.g. the Straggler Detection Problem.
As in the counting Bloom filter, keys map to multiple cells of a table. Each cell stores three values: the number of keys that hit it, the sum of those keys, and the sum of a hash of each of those keys (using a hash function independent of the ones used to map keys to cells).
A cell is 'pure' if $hash\left(\frac{\sum keys}{\#keys}\right) = \frac{\sum hash(key)}{\#keys}$, i.e. if its contents are consistent with being hit by copies of a single key, which can then be read off as $\frac{\sum keys}{\#keys}$. A pure cell has a high probability of having been hit by only one distinct key. With high probability, if the number of distinct keys stored in the IBF is less than a constant fraction of the table size, the entire contents can be correctly decoded: repeatedly find a pure cell, extract its key, and delete that key from the filter.
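A sketch of such a structure in Python, assuming nonnegative integer keys. The hash construction, parameter choices, and the signed decoding (which also anticipates the set-difference use below) are illustrative, not from these notes:

```python
import hashlib

def _h(x, salt):
    # Illustrative hash: 64-bit digest of (salt, x).
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class IBF:
    """Invertible Bloom filter: each cell keeps (count, keySum, hashSum)."""

    def __init__(self, size, k=3):
        self.size, self.k = size, k
        self.count = [0] * size
        self.key_sum = [0] * size
        self.hash_sum = [0] * size

    def _cells(self, key):
        return [_h(key, i) % self.size for i in range(self.k)]

    def _update(self, key, sign):
        for c in self._cells(key):
            self.count[c] += sign
            self.key_sum[c] += sign * key
            self.hash_sum[c] += sign * _h(key, "chk")

    def insert(self, key):
        self._update(key, +1)

    def delete(self, key):
        self._update(key, -1)

    def _pure(self, c):
        # Pure-cell test: the cell is consistent with holding
        # count copies of the single key keySum / count.
        n = self.count[c]
        if n == 0 or self.key_sum[c] % n != 0:
            return False
        key = self.key_sum[c] // n
        return key >= 0 and n * _h(key, "chk") == self.hash_sum[c]

    def decode(self):
        # Peeling: extract the key of a pure cell, cancel it, repeat.
        # Returns (pos, neg): keys with net positive / negative counts
        # (neg is nonempty only when decoding a subtracted pair of IBFs).
        pos, neg = [], []
        queue = [c for c in range(self.size) if self._pure(c)]
        while queue:
            c = queue.pop()
            if not self._pure(c):
                continue
            n = self.count[c]
            key = self.key_sum[c] // n
            (pos if n > 0 else neg).append(key)
            self._update(key, -1 if n > 0 else +1)
            queue.extend(self._cells(key))   # re-check the cells we changed
        return pos, neg

f = IBF(size=20)
for x in (3, 14, 15, 92, 65):
    f.insert(x)
print(sorted(f.decode()[0]))  # [3, 14, 15, 65, 92] with high probability
```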
As with the sketches used in streaming data structures, IBFs can serve as a concise representation of certain large objects.
Represent large sets by their invertible Bloom filters. Given sets A and B, compute decode(IBF(A) - IBF(B)), where the minus is a cell-wise subtraction of the counts and sums. If the difference between the sets is small, the subtraction cancels all of the shared elements and leaves an IBF containing few keys, which can be decoded to recover exactly the elements on which A and B differ. This has applications in version control systems, for identifying changes between locations using an amount of data proportional to the size of the change.
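Continuing the IBF sketch above, the cell-wise subtraction might look like this (a hypothetical helper; both filters must be built with the same size and hash functions):

```python
def ibf_subtract(a, b):
    # Cell-wise difference; requires identical size, k, and hash functions.
    d = IBF(a.size, a.k)
    for c in range(a.size):
        d.count[c] = a.count[c] - b.count[c]
        d.key_sum[c] = a.key_sum[c] - b.key_sum[c]
        d.hash_sum[c] = a.hash_sum[c] - b.hash_sum[c]
    return d

A, B = IBF(size=50), IBF(size=50)
for x in range(10000):          # large shared content cancels out
    A.insert(x)
    B.insert(x)
A.insert(10001)                 # small symmetric difference survives
B.insert(20002)
B.insert(20003)

only_in_a, only_in_b = ibf_subtract(A, B).decode()
print(sorted(only_in_a), sorted(only_in_b))
# [10001] [20002, 20003] with high probability
```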
Min Hash:
A sketch for sets that allows estimation of set similarity, regardless of the size of the difference. Originally developed as a way for AltaVista to identify duplicate web pages.
Uses the [http://en.wikipedia.org/wiki/Jaccard_index Jaccard Index]: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. $J = 0$ if A and B are disjoint, and $J = 1$ if they are identical.
single-element sketch: hash every element and keep only the minimum hash value, $\min h(S)$. For a random hash function $h$, $\Pr[\min h(A) = \min h(B)] = J(A,B)$: the element achieving the minimum over $A \cup B$ is equally likely to be any of its elements, and the two minima coincide exactly when that element lies in $A \cap B$.
[http://en.wikipedia.org/wiki/MinHash Min Hash] extends this idea with a slightly larger sketch (the k smallest hash values rather than just one) for finer-grained Jaccard estimation.
The estimate is $\hat{J}(A,B) = \frac{|sketch(A \cup B) \cap sketch(A) \cap sketch(B)|}{|sketch(A \cup B)|}$, where $sketch(S)$ is the set of the k smallest hash values of elements of S; note that $sketch(A \cup B)$ can itself be computed as the k smallest values in $sketch(A) \cup sketch(B)$.
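A small sketch of this bottom-k estimator in Python (the function names and the SHA-256-based hash are illustrative assumptions):

```python
import hashlib
import heapq

def _h64(x):
    # Illustrative 64-bit hash standing in for the random hash function.
    return int.from_bytes(hashlib.sha256(repr(x).encode()).digest()[:8], "big")

def minhash_sketch(items, k=100):
    """Bottom-k MinHash sketch: the k smallest hash values of the set."""
    return set(heapq.nsmallest(k, {_h64(x) for x in items}))

def jaccard_estimate(sketch_a, sketch_b, k=100):
    # sketch(A ∪ B) is recoverable as the k smallest of sketch(A) ∪ sketch(B).
    union = set(heapq.nsmallest(k, sketch_a | sketch_b))
    return len(union & sketch_a & sketch_b) / len(union)

A = set(range(0, 1000))
B = set(range(250, 1250))
est = jaccard_estimate(minhash_sketch(A), minhash_sketch(B))
print(est)  # close to the true J(A, B) = 750 / 1250 = 0.6
```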