Xtreme .Net Talk

Recommended Posts

Posted

Hi

 

I need to identify video files, and as fast as possible.

 

Since I didn't find any hash that can generate an ID within a second,

I only hashed a very small part of the file.

 

Basically I did this:

Divide the file into 64 equally sized sections
At the beginning of each section, read the first 8 bytes and concatenate them together
Take an MD5 hash of those 512 bytes
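
The steps above can be sketched like this (a minimal Python sketch of the sampling idea for illustration; the original code is presumably .NET, and the function name `sampled_id` is my own):

```python
import hashlib
import os

def sampled_id(path, sections=64, sample=8):
    """Hash 64 evenly spaced 8-byte samples instead of the whole file."""
    size = os.path.getsize(path)
    step = size // sections
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for i in range(sections):
            f.seek(i * step)            # jump to the start of each section
            md5.update(f.read(sample))  # read the first 8 bytes there
    return md5.hexdigest()
```

Because only 512 bytes are read regardless of file size, the cost is dominated by the 64 seeks, not by the hashing itself.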

 

Is there something I can do to optimize the "hash"?

Or maybe there already is such a hash that can be generated within a second.

  • Administrators
Posted

Depending on how tolerant you are of false positives, a simple check like a CRC-32 may be good enough. http://groups.google.co.uk/group/microsoft.public.dotnet.languages.csharp/msg/cd16c628369cfa4e?dmode=source has some example code that might be worth investigating.

 

One possible solution is to compare CRCs and, if you get a match, use a more accurate hashing routine like MD5 or SHA.
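
That two-stage check could look roughly like this (a Python sketch for illustration only; the helper names are mine, not from the linked example):

```python
import hashlib
import zlib

def crc32_of(data: bytes) -> int:
    """Cheap first-pass check: CRC-32 is fast but collision-prone."""
    return zlib.crc32(data) & 0xFFFFFFFF

def same_content(a: bytes, b: bytes) -> bool:
    # Cheap CRC-32 first; only fall back to the slower MD5 on a CRC match.
    if crc32_of(a) != crc32_of(b):
        return False  # CRC differs: the data is definitely different
    return hashlib.md5(a).digest() == hashlib.md5(b).digest()
```

The CRC acts as a fast rejector: most non-matching pairs are discarded without ever paying for the cryptographic hash.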

 

Alternatively, you could generate a hash over a small part of the file (take a fixed number of bytes from the start) and, if you get a match, generate a hash over the next chunk of the file.
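
That progressive scheme might be sketched as follows (Python for brevity; the 1 MB chunk size is an arbitrary assumption):

```python
import hashlib

CHUNK = 1 << 20  # 1 MB; an arbitrary choice, tune to taste

def files_match(path_a, path_b, chunk=CHUNK):
    """Compare files chunk by chunk, hashing the next chunk only on a match."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if hashlib.md5(a).digest() != hashlib.md5(b).digest():
                return False   # mismatch found: stop without reading the rest
            if not a:          # both files exhausted at the same point
                return True
```

Files that differ early are rejected after reading only their first chunk; the full read cost is paid only for files that actually match.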

 

For a file of any size, however, you are always going to face a potential delay when doing this kind of thing.


 

Intellectuals solve problems; geniuses prevent them.

-- Albert Einstein

  • Administrators
Posted

How large are the files you are looking at hashing? Could the hash / comparison not be done in a background thread while the UI is doing something useful?
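
Pushing the work onto a background thread could look something like this (a minimal Python sketch of the idea; in .NET you would use a Thread or BackgroundWorker instead, and `hash_in_background`/`on_done` are hypothetical names):

```python
import hashlib
import threading

def hash_in_background(path, on_done):
    """Run the (slow) full-file hash off the UI thread; report via callback."""
    def worker():
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                md5.update(block)           # stream the file in 1 MB blocks
        on_done(md5.hexdigest())            # deliver the result when finished
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```

The UI stays responsive while the hash runs; the callback fires once the digest is ready.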

 

Given any reasonably large file, you are going to struggle to get a hash in under 1 second; disk I/O alone could account for more than that.


Posted

The sizes range from 100 MB to 4 GB.

Mostly the sizes are around 120 MB, 240 MB, 350 MB, 700 MB, 1 GB, 2 GB and 4 GB,

with the bigger ones being less frequent.

 

Since I only read 64 × 8 bytes regardless of the file size, the hash is computed quickly even for big files.

 

I'm already using a separate thread, so it is not that bad if it takes a bit longer to get the hash.

But the faster the hash is, the sooner the application can complete its job.

 

But I guess you're right: of the ~600 ms it takes to hash with my current version,

most of the time is spent reading the file, so there is not much left to optimize.
