Reply content:
First, there is at least no need to read the entire file into memory. In PHP, for example, md5(file_get_contents($big_file_name)) is really inappropriate.
MD5 is computed over the input in fixed-size blocks (512 bits each), so you can read a portion of the content (at least one block; a multiple of st_blksize is more appropriate), fold that chunk into the calculation, then read the next portion and continue. The MD5 algorithm itself is block-based, and many similar algorithms such as SHA-1 are too, so they all support streaming computation: read one piece at a time and generate the full hash once at the end, with no possibility of blowing up memory.
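As a minimal sketch of that streaming idea (shown in Node.js here; PHP's hash_init/hash_update/hash_final family works the same way):

// Streaming MD5: read the file in small chunks and fold each one into
// the running hash state, so memory use stays constant for any file size.
const crypto = require('crypto');
const fs = require('fs');

function md5File(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    const stream = fs.createReadStream(path); // reads ~64 KB at a time by default
    stream.on('data', (chunk) => hash.update(chunk)); // fold in one chunk
    stream.on('end', () => resolve(hash.digest('hex'))); // finalize once at the end
    stream.on('error', reject);
  });
}

md5File('big_file.bin').then(console.log);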
Most languages provide a streaming HashAlgorithm API; PHP also provides md5_file(), and you can check the documentation to see whether it streams internally.

My previous practice was to take 1 MB out of every 100 MB and MD5 that, then run one more MD5 over all of it together. It felt rather cute. It cost a lot of time, and I haven't hit a collision yet.

First of all, MD5 is a standard with a ready-made algorithm (the canonical reference is RFC 1321, "The MD5 Message-Digest Algorithm"); we only need to translate it into C, Java, Python, JS, or other code.
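A rough sketch of that sampling trick, under my own interpretation (hash 1 MB at every 100 MB boundary into a single running MD5; mixing in the file length is my addition, and the original poster's exact scheme may differ). Note the result is only a fast fingerprint, not the file's true MD5:

// Sampled fingerprint: hash 1 MB taken at every 100 MB boundary.
// Fast on huge files, but NOT the real MD5 -- only a heuristic identity.
const crypto = require('crypto');
const fs = require('fs');

const SAMPLE = 1024 * 1024;        // 1 MB per sample
const STRIDE = 100 * 1024 * 1024;  // one sample every 100 MB

function sampledMd5(path) {
  const { size } = fs.statSync(path);
  const fd = fs.openSync(path, 'r');
  const hash = crypto.createHash('md5');
  const buf = Buffer.alloc(SAMPLE);
  for (let pos = 0; pos < size; pos += STRIDE) {
    const n = fs.readSync(fd, buf, 0, SAMPLE, pos); // read 1 MB at this offset
    hash.update(buf.subarray(0, n));
  }
  fs.closeSync(fd);
  hash.update(String(size)); // my addition: include the length so a truncated copy differs
  return hash.digest('hex');
}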
For the code, I suggest finding an existing implementation on the Internet; there is no need to reinvent the wheel.
Also, download a validator and test the code's correctness first, to save yourself from the usual pitfalls.
http://www.freesoft.org/CIE/RFC/1321/
My guess is that the big network disks all handle TB-level MD5 like this. As several replies above said, a file's MD5 is computed over the file stream in blocks, so to get the MD5 of a TB-level file the network disk would have to read the entire file stream, which is inefficient; the computation time is a problem. But everyone is ignoring one thing: during upload the file is also uploaded in chunks, and those uploaded fragments are exactly the file stream. So you can spread the MD5 computation across the fragments: each time a fragment is uploaded, fold it into the calculation, and by the time the upload completes, the file's MD5 falls out as well. With that, TB-level MD5 is no problem: finish the upload and the MD5 comes out naturally. I don't know whether anyone has a different opinion on my guess.
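A sketch of that idea on the server side (my own illustration, assuming chunks arrive in order and a single process keeps the per-upload hash state in memory; a real service would have to persist that state between requests):

// Fold each uploaded chunk into a per-upload MD5 context, so the
// whole-file hash is ready the moment the last chunk arrives.
const crypto = require('crypto');

const uploads = new Map(); // uploadId -> running MD5 context

function onChunk(uploadId, chunk) {
  if (!uploads.has(uploadId)) uploads.set(uploadId, crypto.createHash('md5'));
  uploads.get(uploadId).update(chunk); // hashing cost is amortized over the upload
}

function onUploadComplete(uploadId) {
  const digest = uploads.get(uploadId).digest('hex'); // full-file MD5, no re-read needed
  uploads.delete(uploadId);
  return digest;
}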
Just now someone brought up how "instant upload" works. The most basic form of instant upload is to compute the MD5 first and send it to the back end (the back end may want several kinds of hash). I studied this for a long time: the front end used to have no way to MD5 a super-large file, but now with the HTML5 File API you can compute the MD5 of a file of any size, although it takes quite a long time, and I have no solution for that. I also can't figure out how those network disks get the MD5 so quickly on the front end. If you have ideas or are interested, add my QQ 1221382991 and we can study it together. I am currently doing a resumable-upload and instant-upload project. One approach: take a 1 MB section each from the head, middle, and tail and MD5 those.
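A browser-side sketch of that head/middle/tail idea, using the HTML5 File API plus the open-source SparkMD5 library for incremental MD5 (the exact offsets, the length-mixing, and the helper name are my own choices). Again, this produces a quick fingerprint, not the file's real MD5:

// Quick fingerprint: MD5 over 1 MB slices from the head, middle, and tail.
// Reads only ~3 MB regardless of file size, so it is fast but NOT the true MD5.
async function quickFingerprint(file) {
  const MB = 1024 * 1024;
  const spark = new SparkMD5.ArrayBuffer(); // incremental MD5 from the SparkMD5 library
  const offsets = [
    0,                                            // head
    Math.max(0, Math.floor(file.size / 2) - MB),  // middle
    Math.max(0, file.size - MB),                  // tail
  ];
  for (const off of offsets) {
    const buf = await file.slice(off, off + MB).arrayBuffer(); // read just this slice
    spark.append(buf);
  }
  spark.append(new TextEncoder().encode(String(file.size)).buffer); // mix in length (my addition)
  return spark.end(); // hex digest
}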
That scheme is a joke; the real answer is that there is no good way, just regular MD5 plus caching. Think about what order of magnitude of time it takes to MD5 a 1 GB file. http://www.atool.org/file_hash.php
That page hashes files in JS; read its code and you will know how to compute the MD5 of a large file...
You must not read the whole thing in at once, or the browser will choke... For this problem you can also refer to how the major network disks do it.
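For completeness, here is the usual pattern I would expect such code to follow (a hedged sketch, again assuming the SparkMD5 library): slice the File and hash it piece by piece, so the browser never holds more than one chunk at a time.

// Full-file MD5 in the browser without reading everything at once:
// slice the File into 2 MB pieces and feed each into an incremental MD5.
async function md5OfFile(file, chunkSize = 2 * 1024 * 1024) {
  const spark = new SparkMD5.ArrayBuffer();
  for (let pos = 0; pos < file.size; pos += chunkSize) {
    const buf = await file.slice(pos, pos + chunkSize).arrayBuffer();
    spark.append(buf); // only one chunk is in memory at a time
  }
  return spark.end(); // hex MD5 of the entire file
}

// Usage: const hex = await md5OfFile(fileInput.files[0]);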