If you need to store a large number of URLs and need to search by URL. If you use B-tree to store URLs, the content stored is very large because the URLs themselves are very long. Under normal circumstances, the following query will be available:
Select id from URL WHERE url= "http://www.baidu.com";
If you delete the index on the original URL, and add an indexed URL_CRC column, using CRC32 to do the hash, you can query in the following way:
SELECT ID from URL WHERE url= ' http://www.baidu.com ' and Rul_crc=crc32 (' http://www.baidu.com ');
This performance is very high because the MySQL optimizer uses this highly selective, small, URL_CRC-based index to complete the lookup. Even if there are multiple identical index values, it is very fast to find the index entry by making a fast integer comparison based on the hash value, and then returning the corresponding row. Another way is to index the full URL string, which can be very slow.
The flaw in this implementation is the need to maintain the hash value. Can be maintained manually and can be implemented by triggers. In this way, remember not to use SHA1 () and MD5 () as the hash function. Because these two functions calculate the hash value of the very long string, it will waste more space, the comparison will be slower. SHA1 () and MD5 () are strong cryptographic functions, designed to maximize the elimination of conflicts, eggs here do not need to engage in such requirements. The conflict of simple hash functions in an acceptable range, colleagues have the ability to provide better performance.
If the data table is very large, CRC32 () will have a lot of hash conflicts, you can consider implementing a simple 64-bit hash function. This custom function returns an integer, not a string. A simple way to use the MD5 () function to return part of a value as a custom hash function. This can be worse than writing a hash algorithm, but it's the easiest to implement.
SELECT CONV (Right (MD5 (' http://www.baidu.com '), +), 16,10) as HASH64.
Handle hash conflicts. When querying using a hash index, you must include a constant value in the WHERE clause:
SELECT ID from URL WHERE url=crc32 (' http://www.baidu.com ') and url= ' http://www.baidu.com ';
Once a hash conflict occurs and the hash value of another string is exactly the same, the following statement does not work correctly:
SELECT ID from URL WHERE url=crc32 (' http://www.baidu.com ');
Because the so-called ' birthday paradox ', the probability of a hash conflict may increase more quickly than imagined, CRC32 () returns a 32-bit integer, and when the index has 9.3W records, the probability of a conflict is 1%. For example, we will use the word "/usr/share/dic/words" in the Countdown data table and perform a CRC32 () calculation, and finally there will be 98569 rows. This has already appeared a hash conflict. To avoid hash collisions, you must bring the Hahs value and the corresponding column value into the Where condition. If you do not want to query the specific value, for example, just the number of records (imprecise), you can use the hash value of CRC32 () without the into row value to query directly. You can also use the FNV64 () function as a hash function, with a hash value of 64 bits, very fast, and much less conflict than CRC32 ().
Basics of Mysql indexing (bottom)