Arokh Posted July 25, 2008

Hi, I need to identify video files, and as fast as possible. Since I didn't find any hash that can generate an ID within a second, I only hashed a very small part of the file. Basically I did this:

1. Divide the data into 64 equally sized sections.
2. At the beginning of each section, read the first 8 bytes and concatenate them together.
3. Take an MD5 hash of those 512 bytes.

Is there something I can do to optimize the "hash"? Or maybe there already is such a hash that can be generated within a second.
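The sampling scheme above can be sketched roughly as follows. This is a minimal Python illustration (the original application is presumably .NET, but the idea is identical); the function name and the seek arithmetic are my own assumptions, not the poster's actual code.

```python
import hashlib
import os

def sampled_fingerprint(path, sections=64, chunk=8):
    """Fingerprint a file by MD5-hashing the first `chunk` bytes of each
    of `sections` equally spaced sections (64 * 8 = 512 bytes total).
    Illustrative sketch, not the poster's actual implementation."""
    size = os.path.getsize(path)
    step = max(size // sections, chunk)  # length of one section in bytes
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for i in range(sections):
            # Seek to the start of section i, clamped so the final read
            # never runs past the end of the file.
            f.seek(min(i * step, max(size - chunk, 0)))
            md5.update(f.read(chunk))
    return md5.hexdigest()
```

Because only 512 bytes are ever read, the cost is dominated by the 64 seeks rather than by the file size, which is why the approach stays fast even on multi-gigabyte files.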
Administrators PlausiblyDamp Posted July 28, 2008

Depending on how tolerant you are of false positives, a simple check like a CRC-32 may be good enough. http://groups.google.co.uk/group/microsoft.public.dotnet.languages.csharp/msg/cd16c628369cfa4e?dmode=source has some example code that might be worth investigating. One possible solution is to compare CRCs and, if you get a match, then use a more accurate hashing routine like MD5 or SHA. Alternatively, you could generate a hash over a small part of the file (take a particular number of bytes from the start) and, if you get a match, generate a hash over the next chunk of the file. For a file of any size, however, you are always going to get a potential delay when doing this kind of thing.

Intellectuals solve problems; geniuses prevent them. -- Albert Einstein
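The two-stage idea (cheap CRC-32 first, accurate hash only on a CRC match) can be sketched like this. Again a Python sketch rather than the C# code at the linked thread; the function names and the 64 KB read size are assumptions for illustration.

```python
import binascii
import hashlib

def crc32_of(path):
    """Streaming CRC-32 of a whole file (cheap first-pass check)."""
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            crc = binascii.crc32(block, crc)
    return crc & 0xFFFFFFFF

def md5_of(path):
    """Full MD5, only computed when the cheap check matches."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            md5.update(block)
    return md5.digest()

def probably_same(path_a, path_b):
    """CRC-32 mismatch rules the files out immediately; only a CRC
    match pays for the more accurate MD5 comparison."""
    if crc32_of(path_a) != crc32_of(path_b):
        return False
    return md5_of(path_a) == md5_of(path_b)
```

The CRC pass is much cheaper per byte than MD5, so in the common case (files differ) the expensive hash is never computed at all.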
Arokh (Author) Posted July 28, 2008

Mhh, sounds a bit tricky to implement, but it is a good idea for eliminating false positives.
Administrators PlausiblyDamp Posted July 29, 2008

How large are the files you are looking at hashing? Could the hash/comparison not be done in a background thread while the UI is doing something useful? Given any reasonably large file, you are going to struggle to get a full hash done in under a second; disk IO alone could account for more than that.
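Running the hash off the UI thread might look something like this. A hedged Python sketch (the poster's application would use .NET threading); the function name and callback shape are invented for illustration.

```python
import hashlib
import threading

def hash_in_background(path, on_done):
    """Compute a full MD5 of `path` on a worker thread so the UI thread
    stays responsive; deliver the hex digest via the `on_done` callback.
    Illustrative sketch only."""
    def worker():
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                md5.update(block)
        on_done(md5.hexdigest())
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t  # caller can join() or just let the callback fire
```

Because hashing a large file is IO-bound, the worker thread spends most of its time waiting on the disk, so it adds little CPU contention with the UI.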
Arokh (Author) Posted July 29, 2008

The sizes range from 100 MB to 4 GB, but mostly they are around 120 MB, 240 MB, 350 MB, 700 MB, 1 GB, 2 GB or 4 GB, with the bigger ones less frequent. Since I only read 64 * 8 bytes regardless of the file size, the hash is computed quickly even for big files. I'm already using a separate thread, so it is not that bad if it takes a bit longer to get the hash; but the faster the hash is, the sooner the application can complete its job. I guess you're right, though: of the ~600 ms my current version takes to hash, most of the time is spent reading the file, so there is not much left to optimize.