How does one calculate MD5 for a large file? - PHP Tutorial
Source: Internet
Author: User
Reply content: First of all, there is no need to read the entire file into memory. In PHP, for example, md5(file_get_contents($big_file_name)) is clearly inappropriate for a large file.
This works because MD5 processes its input in 512-bit (64-byte) blocks. You can therefore read part of the file at a time (at least 512 bits; st_blksize is a more suitable size), feed that part to the hash, then read the next part and continue. The MD5 algorithm itself works block by block, and many similar algorithms, such as SHA-1, support the same streaming computation: read a piece, hash it, and produce the complete hash only at the end, so memory use cannot blow up.
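The chunked approach described above can be sketched as follows. This is a minimal illustration in Python using the standard hashlib module; the function name md5_of_file and the 1 MiB chunk size are assumptions made for the example, not something from the thread. In PHP, hash_init, hash_update, and hash_final follow the same pattern.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it chunk by chunk.

    chunk_size is an illustrative default (1 MiB); any positive size
    gives the same digest, only the number of read calls changes.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read(chunk_size) repeatedly
        # and stops when it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Only one chunk is held in memory at a time, so the file size does not affect memory use.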
Most languages provide a streaming API for hash algorithms. PHP also provides md5_file, which can be regarded as streaming the file in.
My previous approach was to take the first 1 MB of every M for MD5, and then run MD5 once more over the whole result. It seemed neat to me; it has been in use for a long time and has not produced a collision yet.
To put it simply, the MD5 algorithm is standardized and a ready-made reference is available (the standard is RFC 1321, The MD5 Message-Digest Algorithm); we only need to translate it into C, Java, Python, JS, or other code.
It is best to search for existing code online; there is no need to reinvent the wheel.
In addition, download a checksum tool to verify that your code is correct.
http://www.freesoft.org/CIE/RFC/1321/
I guess the major network disks handle MD5 for TB-sized files like this. Several people above said that a file's MD5 is computed from the file stream block by block, so to get the MD5 of a TB-sized file the network disk has to read the stream of the entire file, which is slow, and the computing time becomes a problem. But we have overlooked one thing: files are also uploaded in parts, and those uploaded parts are themselves the file stream. The MD5 computation time can be spread across the parts: as each part is uploaded, the file's MD5 is updated with it. That way MD5 for a TB-sized file is not a problem, and the MD5 is ready as soon as the upload completes. I don't know whether anyone has other ideas.
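The idea of spreading the MD5 work across uploaded parts might look like the sketch below, written in Python. The class UploadHasher and its methods are hypothetical names, and this is a guess at the general technique, not how any particular network disk implements it. It assumes the parts arrive in file order; parts arriving out of order would have to be buffered first, because MD5 state can only be advanced sequentially.

```python
import hashlib


class UploadHasher:
    """Accumulates the MD5 of a file as its upload parts arrive in order."""

    def __init__(self):
        self._digest = hashlib.md5()
        self.bytes_seen = 0

    def add_part(self, part):
        # Each part advances the running hash; nothing is kept in
        # memory apart from the small internal MD5 state.
        self._digest.update(part)
        self.bytes_seen += len(part)

    def finish(self):
        # Called once the last part has been received.
        return self._digest.hexdigest()
```

The result equals the MD5 of the whole file, because feeding consecutive parts to update() is equivalent to hashing their concatenation.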
Just now a friend asked how instant upload ("transfer in seconds") can work in that case. For instant upload, the most basic step is to compute the MD5 on the front end first and then pass it to the back end (more hash values may be required). I have studied this for a long time: the front end cannot compute the MD5 of a large file within seconds. We can now use the HTML5 API to compute the MD5 of a file of any size, but it takes a long time, and I have no solution. I have not figured out how those network disks obtain the MD5 so quickly on the front end. If you know a method or are interested, you can join QQ 122138299 to study it; I have recently been working on resumable upload and instant upload.
Take 1 MB of data each from the head, the middle, and the tail of the file and compute the MD5 of that. The above is a joke. The correct answer is: there is no good way to do it. Instead, use regular MD5 plus a cache.
To see roughly how long MD5 takes on a 1 GB file, try http://www.atool.org/file_hash.php
That file hash tool is written in JS; read its code and you will see how to calculate the MD5 of a large file...
It certainly does not read the file in all at once, or the browser would crash... For this question, you can look at how the major network disks do it.
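For reference, the sampling shortcut mentioned in the replies (1 MB each from the head, middle, and tail) could be sketched as below in Python. The name sampled_fingerprint is hypothetical, and mixing the file size into the hash is an assumption added here, not something the thread proposes. The result is not the file's MD5: two files that differ only outside the sampled regions get the same fingerprint, which is why the reply itself calls the idea a joke.

```python
import hashlib
import os


def sampled_fingerprint(path, sample_size=1024 * 1024):
    """Hash the file size plus three samples: head, middle, and tail.

    This is a quick fingerprint, NOT the MD5 of the file. Changes
    outside the three sampled regions go undetected.
    """
    size = os.path.getsize(path)
    digest = hashlib.md5()
    # Assumption for this sketch: include the size so that files of
    # different lengths are unlikely to share a fingerprint.
    digest.update(str(size).encode("ascii"))
    with open(path, "rb") as f:
        if size <= 3 * sample_size:
            # Small file: the samples would overlap, so hash it all.
            for chunk in iter(lambda: f.read(sample_size), b""):
                digest.update(chunk)
        else:
            middle = (size - sample_size) // 2
            for offset in (0, middle, size - sample_size):
                f.seek(offset)
                digest.update(f.read(sample_size))
    return digest.hexdigest()
```

A fingerprint like this can serve as a cheap pre-check before computing the full MD5, but it cannot replace it when integrity matters.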