The Beauty of Mathematics, Series 13: Information Fingerprints and Their Applications

Published August 3, 2006, 11:17:00, by Wu Jun, Google researcher

Any piece of information, such as a text, can be mapped to a random number of modest length that serves as its fingerprint and distinguishes it from other information. As long as the algorithm is well designed, the fingerprints of any two pieces of information are very unlikely to coincide, just like human fingerprints. Information fingerprints are widely used in encryption, data compression, and information processing.

As mentioned in the article on graph theory and web crawlers, to avoid downloading the same web page repeatedly we need to record the URLs that have already been visited in a hash table. But storing URLs in the hash table directly as strings wastes both memory and lookup time. URLs today are generally long. For example, if you search for "beauty of mathematics" on Google or Baidu, the resulting URL is 100 characters or more. Here is one such Baidu link:

http://www.baidu.com/s?ie=gb2312&bs=%CA%FD%D1%A7%D6%AE%C3%C0&sr=&z=&cl=3&f=8&wd=%ce%e2%be%fc+%ca%fd%d1%a7%d6%ae%c3%c0&ct=0

Assuming an average URL length of 100 characters, storing 20 billion URLs takes at least 2 TB, that is, 2,000 GB. Since a hash table is typically only about 50% space-efficient, the actual requirement exceeds 4 TB of memory. Even if all these URLs could be held in memory, lookups would be inefficient because the URLs vary in length and must be compared as strings. Suppose instead that we can find a function that randomly maps the 20 billion URLs into the space of 128-bit (16-byte) integers. The long string above would then correspond to a random number such as:

893249432984398432980545454543

So each URL occupies only 16 bytes instead of the original 100, which cuts the memory needed to store the URLs to roughly one sixth. This 16-byte random number is called the information fingerprint of the URL. It can be shown that, as long as the algorithm producing the random numbers is good enough, it is almost impossible for two strings to share a fingerprint, just as it is almost impossible for two people to have the same fingerprints. Because a fingerprint is a fixed-size 128-bit integer, a lookup takes far less computation than a string comparison. When a web crawler downloads a page, it converts the page's URL into an information fingerprint and stores it in the hash table. Whenever it encounters a new URL, it computes the fingerprint and checks whether it is already in the hash table to decide whether to download the page. This integer lookup is tens of times faster than the original string lookup.
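
The crawler bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: MD5 from Python's hashlib stands in for the unnamed 128-bit fingerprint function, and the names fingerprint, SeenUrls, and collision_probability are hypothetical helpers, not part of any crawler discussed in the article. The last function uses the standard birthday-bound approximation to estimate how likely any collision is among n ideal random fingerprints.

```python
import hashlib


def fingerprint(url: str) -> int:
    """Map a URL of any length to a 128-bit integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # always 16 bytes
    return int.from_bytes(digest, "big")


class SeenUrls:
    """Hypothetical helper: remembers visited URLs by fingerprint only."""

    def __init__(self):
        self._fingerprints = set()  # Python sets are hash tables

    def should_download(self, url: str) -> bool:
        """Return True the first time a URL is seen, False afterwards."""
        fp = fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True


def collision_probability(n_items: int, bits: int = 128) -> float:
    """Birthday-bound approximation n^2 / 2^(bits + 1) for ideal random fingerprints."""
    return n_items * n_items / 2 ** (bits + 1)
```

Under this approximation, collision_probability(20_000_000_000) comes out at roughly 6e-19, which is the sense in which a collision among 20 billion URLs is "almost impossible" for a well-behaved 128-bit fingerprint.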

The key algorithm for generating information fingerprints is the pseudo-random number generator (PRNG). The earliest PRNG algorithm was proposed by von Neumann, the father of the computer. His method is very simple: square a number, cut off the head and the tail, and keep the middle digits. For example, take the four-bit binary number 1001 (decimal 9). Its square is 01010001 (decimal 81), and cutting off two bits at each end leaves the middle four bits, 0100. Of course, the numbers produced this way are not very random, which means two different pieces of information may well end up with the same fingerprint. The Mersenne Twister algorithm commonly used today is much better.
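
The square-and-keep-the-middle step from the example above can be written out in a few lines. This is a minimal sketch, assuming an even bit width so that the same number of bits is cut from each end; the function names are chosen here for illustration.

```python
def middle_square(x: int, bits: int = 4) -> int:
    """Square a `bits`-wide number and keep the middle `bits` bits of the result."""
    square = x * x                  # fits in 2 * bits bits
    drop = bits // 2                # bits cut from the tail; the mask below cuts the head
    return (square >> drop) & ((1 << bits) - 1)


def middle_square_sequence(seed: int, count: int, bits: int = 4) -> list:
    """Feed each output back in as the next input."""
    values = []
    x = seed
    for _ in range(count):
        x = middle_square(x, bits)
        values.append(x)
    return values


print(format(middle_square(0b1001), "04b"))  # 0100, matching the example in the text
print(middle_square_sequence(0b1001, 5))     # [4, 4, 4, 4, 4]
```

Starting from 1001, the four-bit version gets stuck on 0100 immediately, which is a small illustration of why numbers produced this way are "not very random".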

The uses of information fingerprints go far beyond URL deduplication. The twin brother of the information fingerprint is the cipher. One characteristic of an information fingerprint is its irreversibility: the original information cannot be derived from the fingerprint. This property is exactly what encrypted network transmission requires. For example, a website can tell users apart by their cookies, and a cookie is an information fingerprint. The website cannot, however, learn a user's identity from the fingerprint, so the user's privacy is protected. On the Internet, the reliability of encryption depends on how hard it is to deliberately produce information with a given fingerprint, for example whether a hacker can generate a particular user's cookie. From the standpoint of encryption, the Mersenne Twister is not good enough, because the random numbers it produces are correlated.
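
To make the concern about the Mersenne Twister concrete, here is a small sketch in Python, whose random module is built on the Mersenne Twister. The cookie functions are hypothetical and the shared seed is a deliberate simplification: the point is only that anyone who knows or reconstructs the generator's internal state can reproduce every "random" cookie it will emit, whereas the secrets module draws from the operating system's secure random source.

```python
import random
import secrets


def cookie_from_mersenne_twister(rng: random.Random) -> str:
    """Build a 128-bit token from Python's Mersenne Twister generator (not for real use)."""
    return format(rng.getrandbits(128), "032x")


def cookie_from_os_entropy() -> str:
    """Build a 128-bit token from the operating system's secure random source."""
    return secrets.token_hex(16)  # 16 bytes -> 32 hex characters


# Two generators in the same internal state emit the same "secret" cookies.
server = random.Random(2006)
attacker = random.Random(2006)
print(cookie_from_mersenne_twister(server) == cookie_from_mersenne_twister(attacker))  # True
```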

Encryption on the Internet uses cryptographically secure pseudo-random number generators (CSPRNG). Commonly used algorithms include MD5 and SHA-1, which turn information of arbitrary length into fixed-length 128-bit or 160-bit random numbers. It is worth mentioning that SHA-1, once believed to be free of weaknesses, has been shown to be flawed by Professor Wang Xiaoyun of China. There is no need to panic, though: this is still a long way from hackers actually being able to crack your login information.
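
The fixed output lengths mentioned above are easy to check with Python's hashlib. This is just a quick sketch: digest_bits is a helper name introduced here, and the sample inputs are arbitrary.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes) -> int:
    """Return the output length, in bits, of the named hash applied to data."""
    return len(hashlib.new(algorithm, data).digest()) * 8


short_input = b"a"
long_input = b"a" * 100_000  # input length does not affect output length

print(digest_bits("md5", short_input), digest_bits("md5", long_input))    # 128 128
print(digest_bits("sha1", short_input), digest_bits("sha1", long_input))  # 160 160
```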

Although information fingerprints have a long history, they only came into truly widespread use after the rise of the Internet, and have become popular only in the last few years.

From: http://www.google.com.hk/ggblog/googlechinablog/2006/08/blog-post_8115.html
