Tuesday, November 19, 2013

An overview of hash function





A hash function is any algorithm that maps data of variable length to data of a fixed length.

 

Methodologies:


Division Method - 

An integer x is to divide x by M and then to use the remainder modulo M. In this case, the hash function is

 

Middle Square Method

Since integer division is usually slower than integer multiplication, avoiding division can potentially 
improve the running time of the hashing algorithm. Division can be avoided by making use of the fact
 that a computer does finite-precision integer arithmetic. 

 
unsigned int const k = 10; // M==1024
unsigned int const w = bitsizeof (unsigned int);

unsigned int h (unsigned int x)
    { return (x * x) >> (w - k); }

Since x is an unsigned int, the product x * x is also an unsigned int. If 32-bit integers are used, the product is also a 32-bit integer. The final result is obtained by shifting the product w-k bits to the right, where w is the number of bits in an integer. By definition, the right shift inserts zeroes on the left. Therefore, the result always falls between 0 and M-1

Multiplication Method

A very simple variation on the middle-square. Instead of multiplying the key x by itself, the key is multiplied by a carefully chosen constant a, and then extract the middle k bits from the result. In this case, the hashing function is




A variant also exist called Fibonacci hashing, where relation between Fibonacci number and golden ratio is used.


Properties of Good Hashing Function:


A general purpose good hash function must satisfy some properties:

-          Must be deterministic. A given input value it must always generate the same hash value.
Exclude:  hash function depends on external variable parameter or memory address of the object


-          The hash function uses all the input data. If the hash function doesn't use all the input data, then slight variations to the input data would cause an inappropriate number of similar hash values resulting in too many collisions.


-          The hash function "uniformly" distributes the data across the entire set of possible hash values. Every hash value in the output range should be generated with roughly the same probability. Again, if the hash function does not uniformly distribute the data across the entire set of possible hash values, a large number of collisions will result, cutting down on the efficiency of the hash table.
The uniformity of the distribution of hash values can be evaluated by the chi-squared test.

-          The hash function generates very different hash values for similar strings. In real world applications, many data sets contain very similar data elements. Hash function should able to distribute these over a hash table properly

 

Application:


·         Cyclic redundancy check (CRC) : data integrity error checking ; checksum

Ex. BSD checksum, CRC64, Luhn Algorithm etc.
  • Database Indexing, searching ,comparison, deleting duplicates  
  • Caching
  • Sets
  • Data randomization , used statistics, data sampling, card/roulette/chess etc. games
  • Cryptography :
-          Constructing message authentication code (MAC)
-          Password verification
-          File or data identification
-          Pseudorandom generation ,key derivation and key updating
-          Digital signature
-          Time stamping
Ex. Sha 3, MD 5, Tiger, Whirlpool etc.
·         Geometric Hashing - used to solve proximity problem in computer graphics, computational geometry and related fields. Used in image search, facial recognition etc.


Hash Keeper:   In 1996 NDIC of USA, created a database of hash values labeling ‘good ‘and ‘bad’. The data base is free to use by law enforcement agencies around the world.


MD5

The MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. Specified in RFC 1321, MD5 has been utilized in a wide variety of security applications, and is also commonly used to check data integrity. MD5 was designed by Ron Rivest in 1991 to replace an earlier hash function, MD4. An MD5 hash value is typically expressed as a hexadecimal number, 32 digits long.
The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the MD5 algorithm does not require any large substitution tables; the algorithm can be coded quite compactly.

MD5 is intended for use with digital signature applications, which require that large files must be compressed by a secure method before being encrypted with a secret key, under a public key cryptosystem. MD5 became a standard, Internet Engineering Task Force (IETF) Request for Comments (RFC) 1321. According to the standard, it is "computationally infeasible" that any two messages that have been input to the MD5 algorithm could have as the output the same message digest, or that a false message could be created through apprehension of the message digest. MD5 is the third message digest algorithm created by Rivest. All three (the others are MD2 and MD4) have similar structures, but MD2 was optimized for 8-bit machines, in comparison with the two later formulas, which are optimized for 32-bit machines. The MD5 algorithm is an extension of MD4, which the critical review found to be fast, but possibly not absolutely secure. In comparison, MD5 is not quite as fast as the MD4 algorithm, but offers much more assurance of data security.

 

MD5 is widely used in checksum purpose.  Protocols that’s uses md5 as authenticator:

  • SNMP V2
  • IPv6 * / IPng*
  • IPng ESH* (uses DES)
  • IPv4
  • SIPP* (progenitor of IPng)
  • OSPF*
  • RIP-II*
  • RIPng*
  • TCP
  • SOCKS V5
  • WWW's Secure HyperText Transfer Protocol
  • WWW's SimpleMD5
*uses md5 explicitly

 

Algorithm Overview:

 

MD5 is a block-chained digest algorithm, computed over the data in phases of 512-byte blocks organized as little-endian 32-bit words. The first block is processed with an initial seed, resulting in a digest that becomes the seed for the next block. When the last block is computed, its digest is the digest for the entire stream. This chained seeding prohibits parallel processing of the blocks.
Each 512-byte block is digested in 4 phases. Each phase consists of 16 basic steps, for a total of 64 basic steps. Each step updates one word of a 4-word accumulated digest, using the entire intermediate digest as well as block data and constants. In general, each basic step depends on the output of the prior step, defeating simple parallelization of the steps. The basic structure of the steps is shown below (<<< denotes rotate). The accumulated digest is denoted by {A,B,C,D}, as in RFC-1321 :
        A =     B+((A+ F(B,C,D) + X[i++] + k1)<<<k2)  
        D =     A+((D+ F(A,B,C) + X[i++] + k1)<<<k2)  
        C =     D+((C+ F(D,A,B) + X[i++] + k1)<<<k2) 
        B =     C+((B+ F(C,D,A) + X[i++] + k1)<<<k2)
There are 16 steps based on each of 4 logical functions; 4 based on F are shown here. The constants k1 and k2 are not necessarily identical in basic steps, and are not relevant to this analysis. The logical functions (^ denotes xor) are:
        F(x, y, z) = (((x) & (y)) | ((~x) & (z)))
        G(x, y, z) = (((x) & (z)) | ((y) & (~z)))
        H(x, y, z) = ((x) ^ (y) ^ (z))
        I(x, y, z) = ((y) ^ ((x) | (~z)))

MD5 hashes

The 128-bit (16-byte) MD5 hashes (also termed message digests) are typically represented as a sequence of 32 hexadecimal digits.
MD5("")
= d41d8cd98f00b204e9800998ecf8427e
 
 
MD5("The quick brown fox jumps over the lazy dog")
= 9e107d9d372bb6826bd81d3542a419d6
Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the avalanche effect. For example, adding a period to the end of the sentence:
MD5("The quick brown fox jumps over the lazy dog.")
= e4d909c290d0fb1ca068ffaddf22cbd0
 

 

Performance:

 
 
 

 


Security:


Collision:
 In 1993, Den Boer and Bosselaers gave an early, although limited, result of finding a "pseudo-collision" of the MD5 compression function; that is, two different initialization vectors which produce an identical digest.
In 1996, Dobbertin announced a collision of the compression function of MD5 (Dobbertin, 1996). While this was not an attack on the full MD5 hash function, it was close enough for cryptographers to recommend switching to a replacement, such as SHA-1 or RIPEMD-160.
The size of the hash value (128 bits) is small enough to contemplate a birthday attack. MD5CRK was a distributed project started in March 2004 with the aim of demonstrating that MD5 is practically insecure by finding a collision using a birthday attack.
MD5CRK ended shortly after 17 August 2004, when collisions for the full MD5 were announced .The analytical attack was reported to take only one hour on an IBM p690 cluster.
On 1 March 2005, researchers demonstrated construction of two X.509 certificates with different public keys and the same MD5 hash value, a demonstrably practical collision. The construction included private keys for both public keys. A few days later, Vlastimil Klima described an improved algorithm, able to construct MD5 collisions in a few hours on a single notebook computer. On 18 March 2006, Klima published an algorithm that can find a collision within one minute on a single notebook computer, using a method he calls tunneling.

In 2009, the United States Cyber Command used an MD5 hash value of their mission statement as a part of their official emblem.
On 24 December 2010, Tao Xie and Dengguo Feng announced the first published single-block (512 bit) MD5 collision. Previous collision discoveries relied on multi-block attacks. For "security reasons", Xie and Feng did not disclose the new attack method. They have issued a challenge to the cryptographic community, offering a US$ 10,000 reward to the first finder of a different 64-byte collision before 1 January 2013. Marc Stevens responded to the challenge and published colliding single-block messages as well as the construction algorithm and sources.
Preimage vulnerability
In April 2009, a preimage attack against MD5 was published that breaks MD5's preimage resistance. This attack is only theoretical, with a computational complexity of 2123.4 for full preimage.
Other vulnerabilities
A number of projects have published MD5 rainbow tables online, which can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.
The use of MD5 in some websites' URLs means that search engines such as Google can also sometimes function as a limited tool for reverse lookup of MD5 hashes.
 Both these techniques are rendered ineffective by the use of a sufficiently long salt.




 



Conclusion:

In 2011, an informational RFC was approved to update the security considerations in RFC 1321 (MD5) and RFC 2104 (HMAC-MD5)
CMU Software Engineering Institute now says that MD5 "should be considered cryptographically broken and unsuitable for further use"
Performance analysis shows MD5 algorithm, whether in hardware or software, cannot keep pace with existing IPv4 implementations. As a result, it violates the requirements of IPv6. MD5 cannot be implemented in existing technology at rates in excess of 100 Mbps, and cannot be implemented in special-purpose CMOS hardware feasibly at rates in excess of 175 Mbps. MD5 cannot be used to support IP authentication in existing networks at existing rates.

Md6, an upgraded version of Md5, is currently under development, which was first introduced in SHA-3 competition in 2008. Sha 2 (256/512) is generally considered better algorithm than md5, though collision and preimage attacks are theoretically possible. Newly introduced Sha -3 (Keccak) is currently industry top and regarded almost bulletproof.






References

1.       http://en.wikipedia.org/wiki/Hash_function
2.       http://mathworld.wolfram.com/HashFunction.html
3.       http://www.sparknotes.com/cs/searching/hashtables/section2.rhtml
4.       http://www.brpreiss.com/books/opus4/html/page9.html
5.       http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html
6.       http://www.partow.net/programming/hashfunctions/
7.       http://www.unixwiz.net/techtips/iguide-crypto-hashes.html
8.       http://en.wikipedia.org/wiki/MD5
9.       http://marc-stevens.nl/research/md5-1block-collision/
10.   http://www.ietf.org/rfc/rfc1321.txt
11.   http://www.not-implemented.com/comparing-hash-algorithms-md5-sha1-sha2/
12.   http://searchsecurity.techtarget.com/definition/MD5