Arokh Posted July 25, 2008

Hi, I need to identify video files, and as fast as possible. Since I didn't find any hash that can generate an ID within a second, I only hashed a very small part of the file. Basically I did this:

1. Divide the data into 64 equally sized sections.
2. At the beginning of each section, read the first 8 bytes and concatenate them together.
3. Take an MD5 hash of those 512 bytes.

Is there something I can do to optimize the "hash"? Or maybe there already is such a hash that can be generated within a second.
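The sampling scheme above can be sketched roughly as follows. This is a minimal Python illustration (the original application is presumably .NET, but the idea is identical); the function name and the seek arithmetic are my own assumptions, not the poster's actual code.

```python
import hashlib
import os

def sampled_fingerprint(path, sections=64, chunk=8):
    """Fingerprint a file by MD5-hashing the first `chunk` bytes of each
    of `sections` equally spaced sections (64 * 8 = 512 bytes total).
    Illustrative sketch, not the poster's actual implementation."""
    size = os.path.getsize(path)
    step = max(size // sections, chunk)  # length of one section in bytes
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for i in range(sections):
            # Seek to the start of section i, clamped so the final read
            # never runs past the end of the file.
            f.seek(min(i * step, max(size - chunk, 0)))
            md5.update(f.read(chunk))
    return md5.hexdigest()
```

Because only 512 bytes are ever read, the cost is dominated by the 64 seeks rather than by the file size, which is why the approach stays fast even on multi-gigabyte files.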
Administrators PlausiblyDamp Posted July 28, 2008

Depending on how tolerant you are of false positives, a simple check like a CRC-32 may be good enough. http://groups.google.co.uk/group/microsoft.public.dotnet.languages.csharp/msg/cd16c628369cfa4e?dmode=source has some example code that might be worth investigating. One possible solution is to compare CRCs and, if you get a match, then use a more accurate hashing routine like MD5 or SHA. Alternatively, you could generate a hash over a small part of the file (take a particular number of bytes from the start) and, if you get a match, generate a hash over the next chunk of the file. For a file of any size, however, you are always going to get a potential delay when doing this kind of thing.

Intellectuals solve problems; geniuses prevent them. -- Albert Einstein
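The two-stage idea (cheap CRC-32 first, accurate hash only on a CRC match) can be sketched like this. Again a Python sketch rather than the C# code at the linked thread; the function names and the 64 KB read size are assumptions for illustration.

```python
import binascii
import hashlib

def crc32_of(path):
    """Streaming CRC-32 of a whole file (cheap first-pass check)."""
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            crc = binascii.crc32(block, crc)
    return crc & 0xFFFFFFFF

def md5_of(path):
    """Full MD5, only computed when the cheap check matches."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            md5.update(block)
    return md5.digest()

def probably_same(path_a, path_b):
    """CRC-32 mismatch rules the files out immediately; only a CRC
    match pays for the more accurate MD5 comparison."""
    if crc32_of(path_a) != crc32_of(path_b):
        return False
    return md5_of(path_a) == md5_of(path_b)
```

The CRC pass is much cheaper per byte than MD5, so in the common case (files differ) the expensive hash is never computed at all.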
Arokh (Author) Posted July 28, 2008

Mhh, sounds a bit tricky to implement, but it is a good idea for eliminating false positives.
Administrators PlausiblyDamp Posted July 29, 2008

How large are the files you are looking at hashing? Could the hash/comparison not be done in a background thread while the UI is doing something useful? Given any reasonably large file, you are going to struggle to get a full hash done in under a second; disk IO alone could account for more than that.
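Running the hash off the UI thread might look something like this. A hedged Python sketch (the poster's application would use .NET threading); the function name and callback shape are invented for illustration.

```python
import hashlib
import threading

def hash_in_background(path, on_done):
    """Compute a full MD5 of `path` on a worker thread so the UI thread
    stays responsive; deliver the hex digest via the `on_done` callback.
    Illustrative sketch only."""
    def worker():
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                md5.update(block)
        on_done(md5.hexdigest())
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t  # caller can join() or just let the callback fire
```

Because hashing a large file is IO-bound, the worker thread spends most of its time waiting on the disk, so it adds little CPU contention with the UI.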
Arokh (Author) Posted July 29, 2008

The sizes range from 100 MB to 4 GB, but mostly they are around 120 MB, 240 MB, 350 MB, 700 MB, 1 GB, 2 GB or 4 GB, with the bigger ones less frequent. Since I only read 64 * 8 bytes regardless of the file size, the hash is computed quickly even for big files. I'm already using a separate thread, so it is not that bad if it takes a bit longer to get the hash; but the faster the hash is, the sooner the application can complete its job. I guess you're right, though: of the ~600 ms my current version takes to hash, most of the time is spent reading the file, so there is not much left to optimize.