Creating a very fast "hash"

Arokh

Centurion
Joined
Apr 11, 2006
Messages
124
Hi

I need to identify video files and that as fast as possible.

Since I didn't find any hash which can generate an id within a second,
I only hashed a very small part of the file.

Basically I did this:
Code:
Divide the data into 64 equal large sections
At the beginning of each section, read the first 8bytes and concat them together.
Take a MD5 hash from those 512bytes

Is there something I can do to optimize the "hash"?
Or maybe there is already such a hash which is generated within a second.
 
Depending on how happy you are about false positives or not then a simple check like a crc-32 may be good enough http://groups.google.co.uk/group/mi...ages.csharp/msg/cd16c628369cfa4e?dmode=source has some example code that might be worth investigating.

One possible solution is to compare crcs and if you get a match then use a more accurate hashing routine like md5 or sha.

Alternatively you could generate hashes over a small part of the file (take a particular amount of bytes from the start) and if you get a match then generate a hash over the next chunk from the file.

For a file of any size however you are always going to get a potential delay when doing this kind of thing.
 
Mhh sounds a bit tricky to implement but it is a good idea to eliminate false positives.
 
How large are the files you are looking at hashing? Could the hash / comparison not be done in a background thread while the UI is doing something useful?

Given any reasonably large file size you are going to struggle to get a hash under 1 second, disk IO alone could account for more than that.
 
The sizes range from 100MB to 4GB.
But mostly sizes are around 120MB, 240MB, 350MB, 700MB, 1GB, 2GB, 4GB.
Where the bigger ones are less frequent.

Since I only read 64 * 8 bytes regardless of the filesize, the hash is done quickly even for big files.

I'm already using a seperate thread so it is not that bad if it takes a bit longer to get the hash.
But the faster the hash is the sooner the application can complete its job.

But I guess you're right, the ~600ms it takes to hash with my current version,
most of the time is spend reading the file, so there is nothing much left to optimize.
 
Back
Top