# Why Simple Hash Functions Work:

Exploiting the Entropy in a Data Stream

Michael Mitzenmacher and Salil
Vadhan

### Abstract

Hashing is fundamental to many algorithms and data
structures widely used in practice. For
theoretical analysis of hashing, there have been two main approaches. First,
one can assume that the hash function is truly random, mapping each data item
independently and uniformly to the range.
This idealized model is unrealistic because a truly random hash function
requires an exponential number of bits to describe. Alternatively, one can provide rigorous
bounds on performance when explicit families of hash functions are used, such
as 2-universal or O(1)-wise independent families. For such families, performance guarantees are
often noticeably weaker than for ideal hashing.

In practice, however, it is commonly observed that simple
hash functions, including 2-universal hash functions, perform as predicted by
the idealized analysis for truly random hash functions. In this paper, we try
to explain this phenomenon. We
demonstrate that the strong performance of universal hash functions in practice
can arise naturally from a combination of the randomness of the hash function
and the data. Specifically, following
the large body of literature on random sources and randomness extraction, we
model the data as coming from a “block source,” whereby each new
data item has some “entropy” given the previous ones. As long as the (Renyi) entropy per data item
is sufficiently large, it turns out that the performance when choosing a hash
function from a 2-universal family is essentially the same as for a truly
random hash function. We describe results for several sample applications,
including linear probing, balanced allocations, and Bloom filters.

### Versions

- Manuscript, July 2007. [ps][pdf]
- Revised full version, October
2007. [ps][pdf]
- Extended abstract in
*Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA),* 20-22 January 2008. [pdf][official
ACM page]
- Revised full version, June 2010. [pdf]
- Slides from half-hour talk [pps]

[ back to
Salil Vadhan's research]