If you need to store a large number of URLs and search for them based on the URLs. If B-Tree is used to store the URL, the stored content will be large because the URL itself is very long. Under normal circumstances, there will be the following query: SELECTidFROMurlWHEREurlwww.baidu.com; if you need to store a large number of URLs, you need to search for them based on the URL. If B-Tree is used to store the URL, the stored content will be large because the URL itself is very long. Under normal circumstances, the following query will be performed:
SELECT id FROM url WHERE url = "http://www.baidu.com ";
If you delete the index on the original URL and add an indexed url_crc column and use CRC32 for hash, you can query it in the following way:
SELECT id FROM url WHERE url = 'http: // www.baidu.com 'AND rul_crc = CRC32 ('http: // www.baidu.com ');
The performance is very high, because the MySQL Optimizer uses the url_crc column-based index with high selectivity and small size to complete the search. Even if there are multiple identical index values, the search is fast. you only need to make a quick integer comparison based on the hash value to find the index entries, and then return the corresponding rows one by one. Another way is to index the complete URL string, which will be very slow.
The defect of this implementation is that the hash value needs to be maintained. Manual maintenance and trigger implementation are supported. If this method is used, remember not to use SHA1 () and MD5 () as hash functions. Because the hash value calculated by these two functions is a long string, it will waste more space and lead to slower comparison. SHA1 () and MD5 () are strong encryption functions, and the design goal is to eliminate conflicts to the maximum extent. this is not required here. Conflicts between simple hash functions are acceptable. colleagues can provide better performance.
If the data table is very large and CRC32 () has a large number of hash conflicts, you can implement a simple 64-bit hash function by yourself. This custom function returns an integer instead of a string. A simple way is to use part of the MD5 () function's return value as a custom hash function. This is probably worse than writing a hash algorithm, but this is the easiest way to implement it.
Select conv (RIGHT (MD5 ('http: // www.baidu.com '), 16), 16, 10) AS HASH64.
Handle hash conflicts. When you use a hash index for query, you must include a constant value in the WHERE clause:
SELECT id from url WHERE url = crc32 ('http: // www.baidu.com ') AND url = 'http: // www.baidu.com ';
Once a hash conflict occurs, the hash value of the other string is exactly the same, the following statement cannot work correctly:
SELECT id from url WHERE url = crc32 ('http: // www.baidu.com ');
Because the probability of hash conflicts in the so-called 'birthday Paradox 'may increase much faster than imagined, CRC32 () returns a 32-bit integer. when the index has million records, the probability of a conflict is 1%. For example, we calculate the word reciprocal data table in '/usr/share/dic/word' and perform crc32 () calculation. There are 98569 rows at the end. This has already seen a hash conflict. To avoid hash conflicts, the hahs value and the corresponding column value must be included in the WHERE condition. If you do not want to query specific values, for example, just count the number of records (inaccurate), you can directly use the crc32 () hash value to query without bringing in the column value. You can also use the FNV64 () function as a hash function. the hash value is 64-bit, which is fast and has fewer conflicts than crc32.