I have a lot of material related to hashing.
A hash function for hash table lookup should be fast, and it shouldcause as few collisions as possible. If you know the keys you will behashing before you choose the hash function, it is possible to getzero collisions -- this is called perfecthashing. Otherwise, the best you can do is to map an equal numberof keys to each possible hash value and make sure that similar keysare not unusually likely to map to the same value.
The standard reference for this is Knuth's "The Art of ComputerProgramming", volume 3 "Sorting and Searching", chapter 6.4. Herecommends the hash
for (hash=len; len--;) { hash = ((hash<<5)^(hash>>27))^*key++; } hash = hash % prime;Unfortunately, that hash is only mediocre. The problem is theper-character mixing: it only rotates bits, it doesn't really mixthem. Every input bit affects only 1 bit of hash until the final %.If two input bits land on the same hash bit, they cancel each otherout. Also, % can be extremely slow (230 times slower than addition ona Sparc).
I had an article in Dr. Dobb's Journal on hash functions forhash table lookup. A generally good, fast hash function is LOOKUP3.C. Simon Stapleton provided thumb2 assembly for lookup3.c. More recentlyI published SpookyHash, which is many times faster than lookup3 on 64-bit platforms, even for short keys.
If you are using a language with no shift or xor, like BASIC, tryPearson's hash.
I have a short spiel on 1-bit errorcorrection codes prompted by a discussion on the A-Phi-O alumnimailing list.
Error Correction Codes are checksums of length m+2d+1 thatassume that no more than d of the m+2d+1 values will change. If d orless values change, the checksum and the modified text can be used todeduce what the original m values were. Values in the checksum can beamong the d values changed. If more than d values changed, then youlose. It is easy to produce two documents with the same checksum (ifyou try).
Error correction codes, I hear, have recently become much better.Look at McEliece's pageon Turbo Codes and David MacKay's competing research on Gallager codes. These easily beat the oldconvolution codes, Goppa codes, Viterbi codes and Reed Solomon codes,I am told.
I just noticed that you could use Gallager codes as public keyencryption, by sending something random as the message then puttingyour real message in the noise. This is similar to McEliece's publickey encryption scheme. It's also an example of a public key schemethat could be broken by finding the smallest basis for a lattice.Algorithms for reducing bases of lattices are here.
Here's the problem. You've got two documents that are supposed to bethe same, but you're not sure. One of them is somewhere inaccessible,either a remote version or a previous version of the local document,so you can't compare them directly. So you want to find a checksumfor both and compare them, and if the checksums are the same then youwill be convinced the documents are the same. It should beincomprehensibly rare for two documents to have the same checksum bychance (that works out to probability of 2-128 or less). Weassume that nobody is maliciously trying to produce documents with thesame checksum. That same assumption holds forerror correction codes, which are often moreuseful than simple checksums.
Cryptographic hashes produce hash values of 256 bits or greater, and are more than good enough quality for checksums. The current recommendationis SHA-3. Their main problem is they're slow, 12.5 cycles per byte.
Noncryptographic checksums are SpookyHash, and Google's CityHash.They produce 128-bit results. They're good if you have under 2n keys and can tolerate a probability of 22n-128 of getting at least one collision. SpookyHash is .3 cycles per byte on 64-bit platforms. CityHash is .16 cycles per byte, but only on 64-bit platforms with the SSE CRC32 instruction. Only use a noncryptographic checksum if you don't have an opponent. If you need the probability of getting any collisions at all to be less than 2-32, um, I'd say the chance of you having an unknown opponent are greater than that.
CRCs have a place too. A CRC has some quality issues that are tolerablein many circumstances, but it has this amazing property that you can calculate CRC(A+B) (the CRC of A with B concatenated to the end of it) straight from CRC(A), CRC(B), length(A), and length(B), without actually looking at A, B. CRCs come in 32-bit, 64-bit, 128-bit variants.
These are like checksums, but they aredesigned to even thwart malicious people. A one-way function isconsidered broken if you can find two keys that have the same hashvalue. (If hash values have n bits, it should take 2ntries to find a key mapping to a given hash value, andsqrt(2n) tries to find two keys that map to some commonhash value.)
The current recommendation is SHA-3 (Keccak).
Go to the newsgroup sci.crypt todiscuss one-way functions.
A block cipher is a reversible function g:KxB->C, whichmaps a key in K and a block in B into a block in C. Usually B and Care the same set, so the block cipher permutes B in a key-specificway. There should be no way to deduce the key given any number ofpairs (b,g(b)) in (B,C), and no way to deduce g(b) from b,or b from g(b), without the key. No efficient way, that is.
Go to the newsgroup sci.crypt to discuss block ciphers. You canfind newsgroups through google groups.
See Wei Dai's page for source code or benchmarks for mostpopular block ciphers. Also see Ron Rivest's security page for more cryptographic pointers.
I tried designing one ofthese. It has a mixing function that is called several times,then wraps a key around that. Calling the mixing function 100 timesis secure. Calling the mixing function 1 time isn't secure.Somewhere in between is the smallest number of iterations that issecure. My current guess is 12. Designing block ciphers is likethat. Sufficient security is easy, it's just a question ofperformance, and of proving security (as in, unbreakable under allknown attacks) at that level of performance.
I also wrote code to findcharacteristics in block ciphers, choosemagic constants, and test forbias in supposedly random sequences. Go here to see how to add a key to apseudorandom permutation, making it a block cipher.
Pseudorandom number generators mix up a state like hashfunctions, but they don't take any input. They have a state of theirown, and they just keep churning it. They produce random-lookingsequences, which can be used as fake data.
The standard reference for this is Knuth's "The Art of ComputerProgramming", volume 2 "Seminumerical Algorithms", chapter 3. Herecommends the random number generator
for (i=0; i<55; ++i) rsl[i] = rsl[i] + rsl[(i+24) % 55];although not quite as succinctly as that. It's really quite good, andit's hard to beat it for speed. Properly optimized, it takes about 8instructions to produce each 32 bit word. The whole 32 bit word canbe used, although the low-order bit is less random than the other 31bits. Any set of 55 consecutive results has no statisticallysignificant patterns, because every such set of results is possible.
If you have an application that is sensitive to obscure biases (likeevery result being the sum of the results 31 and 55 before it) then abetter generator to use is ISAAC,which takes 19 instructions to produce each 32 bit word, and has noknown biases. I also have a smallgenerator, which initializes and runs faster than ISAAC and theMersenne Twister and has only a four-word internal state, but stillpasses all known tests.
Here are some tests fordetecting bias in sequences that are supposed to be random. Thestandard test suite is Marsaglia's DIEHARD (I need to add a link), butthat's pretty easy to pass because it doesn't test a long enoughsequence.
A cryptographic pseudorandom number generator is a randomnumber generator where it is intractible to deduce the generator'sinternal state given any number of its results. To use such a randomnumber generator as a stream cipher, give Alice and Bob theinternal state of the generator. Alice generates the pseudorandomsequence and adds it to a message which she sends. Bob generates thepseudorandom sequence and subtracts it from the message he receives.Eve, who is watching the stuff being sent, can't read the message.
Go to the newsgroup sci.crypt to discuss stream ciphers.
RC4 is a popular stream cipher. It is used in Lotus Notes andNetscape and (its name at least) is the property of RSA. ISAAC is one I wrote. I have code for ISAAC and a paper comparing ISAAC and RC4.There's also an externalcollection of analysis of RC4 on the web.
Why there are no perpetualmotion machines
Oracle SQL tricks
Noncolliding orbits that fill 3-space
A rather odd color chooser
Table of Contents
联系客服