Hashing is the name given to a function that chops up, scrambles,
and compresses or expands its argument until it is unrecognizable.
The original idea was to create a memoryless function that would
output a uniformly distributed random number for each input,
so that each input key would end up with a characteristic number.
In some sense encryption performs the same function as hashing: with hashing one gets a fixed number of `random bits', while encryption produces `random bits' the same length as the data. Both hashing and encryption always produce the same `random bits' given the same input data. If you hash `an apple' one day you will get the same random integer as if you hash it the next day. With encryption, if you encrypt `my salary is 30,000' one day you will get the same encrypted string as if you do it the next day (assuming you use the same key, of course).
The difference is only that hashing produces a single integer, while encryption produces something as long as or longer than the input. That difference is important because it means that several keys can end up with the same hash value, but you would be in an impossible situation if several input messages ended up with the same encrypted form, because no one would be able to decrypt them.
For hashing, the set of keys that constitutes the domain should end up close to uniformly distributed across the hash range. Traditionally this goal has been viewed as impossible to achieve in practice, although everyone has always recognized that a good specialized hash function can be created for each particular case. As we will see, this tailoring of the hash function is unnecessary if you use a good one to start with.
For any hash function, including the good ones presented here, a worst case can be invented where all the keys fall into some particular bin. But if the hash function is independent of the keys, and the function introduces enough independent information to saturate the output range, then the function will have an exceedingly small chance of giving a bad distribution for any reasonably sized set of keys drawn from any source. That is, the hash function will work in general.
Hashing poses the problem of writing a function that changes a key (any information) into a random number. The traditional mathematical view is that there are no single random numbers; there are only sequences of numbers that are random. But hashing is just a function; it provides no sequence, so this statement of the problem does not fit into a traditional mathematical view of randomness. It would be useful if mathematics could shed some light on this process, because hashing is the most powerful (and valuable) technique for data retrieval known.
Most of the algorithms you will learn over the next three years are academic games: things that will broaden your view, teach you what is possible, and teach you to analyze and appreciate clever algorithms. But they are only of academic interest. Hashing, on the other hand, has little theory to recommend it, but it is the fastest, most practical, and most general technique you will learn in computer studies for retrieving data. Learning to use it and to bend it to your requirements will reward you by making most of the programs you write for the rest of your life easier to build.
These notes cover three reasonable hashing algorithms. The first, rcshash, fails to introduce sufficient noise to saturate its output, so it will fail occasionally, but it is fast and easy to code, and it works for keys consisting of short ASCII character strings, a common key type for retrieval. The other two functions work in general on any data, but they both require digital function tables to be created for them. This can be done before execution and compiled into the program if the language you use allows specification of static integer arrays; for Pascal, which does not allow initialized static arrays, the digital tables must be created during execution before hashing the first key.
Don't use the hash functions you find in most standard textbooks; they tend to mislead you. Often these published functions will work well in one program and fail drastically in another, because they are sensitive to their parameters, especially the final division that creates the proper range for the output, the b in the example below. Often the textbook shows an example of an uneven distribution created by a few keys and claims that the distribution will `even out' as more keys are added. Don't believe these statements unless you have tested the function for your application and you are sure of what you are doing. Just stick with the hash functions given here. These have been tested and retested; they are known to work in general and they will produce good results for any independent keys drawn from any domain.
The three algorithms given here range from fast to moderate in speed. Two of the algorithms are easy to break (in an encryption sense): that is, it is reasonably easy to go backwards through the function and determine a key that generates the same hash. The third is also a good one-way function that could be used for public-key authentication.
Because of its lack of a mathematical foundation, hashing has a few
firmly entrenched myths common in the computer science community.
One is that there is no good widely applicable hashing algorithm,
and that you therefore have to hand-craft one for each job.
If you stick to the guidelines and algorithms given below, your
programs will shine and provide predictable performance, often
orders of magnitude faster than the programs of those who follow
standard textbook advice.