100 million 32-character MD5 password values: how can I check whether a given MD5 value exists as efficiently as possible?

Source: Internet
Author: User
What kind of storage is best? For example, you only need to check whether "9d971_dfc685f9b10d8d1b944330c09" exists in the database, and return true or false.

Reply content:

Your requirements are unclear, so it is hard to give one best answer. Here are a few methods with their advantages and disadvantages.

  1. Sort the data and write it to disk in order, as consecutive fixed-width records of 32 characters each. Search with a plain binary search (90% of programmers can't write a correct binary search in an interview, but you can find a ready-made one). Time complexity is O(log n), which is extremely fast.
    Advantage: fast, small disk footprint, essentially no memory usage
    Disadvantage: adding new values to a fully sorted file is hard (fine if you build it once and for all). Every addition means re-sorting and, worse, rewriting the file on disk.
    Conclusion: suitable for data that will not change

  2. Bloom filter. Search for it; ready-made implementations exist, for example in Python.
    Advantage: fast lookups and good insertion performance
    Disadvantage: it has an error rate (false positives), although the rate is controllable
    Conclusion: suitable when 100% accuracy is not required and when the data changes frequently

  3. Partition (similar to the method in the first reply). Hash the key first, then take it modulo some number such as 5000, splitting the data into 5000 sub-files. Each sub-file is sorted separately. To search, hash the key and take the modulo to find the corresponding sub-file, then binary search within it. MD5 can generally be treated as a uniform hash already, so there is no need to hash again; just take the modulo directly.
    Advantage: lookups are fast and insertion performs well (a single insert only needs an insertion sort within one partition)
    Disadvantage: none obvious; it is a compromise between the other methods

  4. Needless to say, a plain hash table. Read all the data into memory and build a hash table (for 100 million entries the table is not large, just a few GB).
    Advantage: O(1) lookup time, and O(1) insertion as well
    Disadvantage: memory usage...
    Conclusion: apart from cost, it is No. 1 on every performance measure
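
Method 1 above could be sketched in Python roughly as follows. This is only an illustration: the fixed 32-character record layout, lowercase hex digests, and the function names `build_sorted_file` and `contains` are assumptions, and the in-memory sort used here would have to be replaced by an external sort for 100 million values.

```python
import os

RECORD_LEN = 32  # assumed layout: one MD5 as 32 hex characters, no separators


def build_sorted_file(path, digests):
    """Write the digests sorted and back to back as fixed-width ASCII records."""
    with open(path, "wb") as f:
        for d in sorted(set(digests)):
            f.write(d.encode("ascii"))


def contains(path, digest):
    """Binary search directly on the file: O(log n) seeks, constant memory."""
    target = digest.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)
            rec = f.read(RECORD_LEN)
            if rec == target:
                return True
            if rec < target:
                lo = mid + 1   # target can only be in the upper half
            else:
                hi = mid       # target can only be in the lower half
    return False
```

Because every record has the same width, record number `mid` always starts at byte offset `mid * RECORD_LEN`, which is what makes the on-disk binary search possible without an index.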

Conclusion:

  1. If the source data will never change, use the first solution
  2. If 100% accuracy is not required, use the second solution
  3. If you have money to buy memory, use the fourth solution
  4. For ordinary cases, use solution 3

The logic of the above methods is very simple and can be implemented quickly. If you have time, you can test the performance.
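
As an illustration of how little code is involved, method 3 (partition by modulo, then binary search inside the partition) might look like the sketch below. To keep it short, each sorted Python list stands in for one sorted sub-file; the class name `PartitionedStore` and the helper `partition_of` are made up for this example.

```python
import bisect

NUM_PARTITIONS = 5000  # the figure used in method 3


def partition_of(digest):
    # MD5 output is treated as uniform, so take the value modulo N directly.
    return int(digest, 16) % NUM_PARTITIONS


class PartitionedStore:
    def __init__(self):
        # Assumption: each sorted list stands in for one sorted sub-file.
        self.parts = [[] for _ in range(NUM_PARTITIONS)]

    def add(self, digest):
        part = self.parts[partition_of(digest)]
        i = bisect.bisect_left(part, digest)
        if i == len(part) or part[i] != digest:
            part.insert(i, digest)  # only this one partition is touched

    def __contains__(self, digest):
        part = self.parts[partition_of(digest)]
        i = bisect.bisect_left(part, digest)  # binary search in the partition
        return i < len(part) and part[i] == digest
```

With 100 million values and 5000 partitions, each partition holds about 20,000 entries, so an insert only has to shift data inside one small partition.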

As an aside, here is how fast a hash table is.

Generate 10 million random strings (32 bytes per line) in a file named 1000w

$ head -1 1000w
bCxshZTroH6OukITgLsCccK9SlBd7CHL

Take the last 100 strings as queries (the worst case for grep)

$ tail -100 1000w > q100
$ time (cat q100 | while read line; do grep -Fx $line 1000w >/dev/null; done)
6.87s user 7.36s system 99% cpu 14.322 total

We can see that grep's worst-case throughput is about 7 req/s (100 queries in 14.322 s), and its time complexity is O(n)

Use awk to evaluate hash table performance (awk's associative arrays are implemented as hash tables)

$ time awk 'ARGIND==1{a[$0]}' 1000w
14.24s user 0.61s system 99% cpu 14.861 total

Loading the hash table takes about 15 s. If this is written as a service, the table only needs to be loaded once, so the loading time is not counted.

Query performance.

$ time awk 'ARGIND==1{a[$0]} ARGIND>1&&($0 in a){print $0}' 1000w 1000w >/dev/null
27.88s user 0.73s system 99% cpu 28.734 total

Hash table throughput is 10000000 / (28.734 - 14.861) ≈ 720,000 req/s, roughly 100,000 times that of grep, and the time complexity is O(1)
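
The same experiment could be repeated in Python, whose built-in `set` is also a hash table. The sketch below is not the benchmark above, just an equivalent shape; the function names are invented for illustration, and the file is assumed to hold one string per line like the 1000w file.

```python
def load_digests(path):
    """Read one string per line into a set (loaded once, like the awk array)."""
    with open(path, "r", encoding="ascii") as f:
        return {line.rstrip("\n") for line in f}


def count_hits(digests, queries):
    """Each membership test is an average O(1) hash lookup."""
    return sum(1 for q in queries if q in digests)
```

Note that Python's per-object overhead would likely push memory use well above the raw size of the data, so the "a few GB" estimate for a hash table may not hold for this particular implementation.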

I'm not strong on algorithms, so here is a simple idea.
Anyone who has used git knows that a git object's file name is the last 38 digits of a hash value (SHA-1).
Each subdirectory of git's objects directory is named after the first two digits of the hash of the objects it holds.
To look up an object, first find the directory from the first two digits, then find the specific file in that directory.
The following is part of a git objects directory:

00  06  0c  12  18  1e  24  2a  30  36  3c  42  48  4e  54  5a  60  66  6c  72  78  

This ensures that the data is evenly distributed in different directories.

Similarly, you can split your database into tables by prefix.
For example, md5_3e, md5_06, ...
Suppose you want to check whether 3eabecb5ff177ebadd305fe52e278d92df3754 exists.
First check whether the table md5_3e exists. If it does, check whether the value abecb5ff177ebadd305fe52e278d92df3754 exists in md5_3e.

With this scheme, each table stores a bit more than 300,000 rows (100 million values spread over 256 two-digit prefixes is about 390,000 each). Add an index on the field and queries will certainly be fast.
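
A minimal sketch of this table-per-prefix idea, using Python's standard `sqlite3` module purely as a stand-in for whatever database is actually in use. The column name `rest` and the helper names are assumptions. Unlike the description above, this sketch creates all 256 tables up front instead of checking whether a table exists.

```python
import sqlite3

HEX_DIGITS = "0123456789abcdef"


def table_for(digest):
    """Map a digest to its table name, e.g. '3eab...' -> 'md5_3e'."""
    prefix = digest[:2].lower()
    if len(prefix) != 2 or any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("expected a hex digest")
    return "md5_" + prefix


def create_tables(conn):
    for i in range(256):
        # In SQLite, a TEXT PRIMARY KEY is backed by an index automatically.
        conn.execute(f"CREATE TABLE IF NOT EXISTS md5_{i:02x} (rest TEXT PRIMARY KEY)")


def add(conn, digest):
    conn.execute(
        f"INSERT OR IGNORE INTO {table_for(digest)} (rest) VALUES (?)",
        (digest[2:].lower(),),
    )


def exists(conn, digest):
    row = conn.execute(
        f"SELECT 1 FROM {table_for(digest)} WHERE rest = ? LIMIT 1",
        (digest[2:].lower(),),
    ).fetchone()
    return row is not None
```

The table name is built from a validated two-digit hex prefix, so interpolating it into the SQL text is safe here; the remainder of the digest is always passed as a bound parameter.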

This is my idea, and it should be the simplest way to implement the requirements.

The original poster could look into bitmaps, which are very relevant to this problem.

You can try a Bloom filter. When testing whether an element belongs to a set, a Bloom filter may mistakenly report that an element is in the set when it is not (a false positive), but it never reports that an element is absent when it is actually in the set. It is therefore not suitable where zero errors are required.
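
A minimal Bloom filter sketch in Python shows where that asymmetry comes from. The class name, the SHA-256 based double hashing, and the default error rate are assumptions chosen for illustration; a maintained library would be the better choice in practice.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices of SHA-256.
        h = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # All k bits set: "probably present". Any bit clear: definitely absent.
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))
```

Adding an element only ever sets bits, so an element that was added always finds all of its bits set; a false positive happens when other elements happen to have set the same bits. By the sizing formulas in the comment, 100 million entries at a 0.1% error rate need about 1.44 billion bits, roughly 180 MB.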

This comes down to a storage structure built on binary search.

For example, a B-tree.

See http://blog.csdn.net/v_july_v/article/details/6530142
