100 million 32-character MD5 password hashes: how can I check as efficiently as possible whether a given MD5 value is among them?

Last Update:2018-05-24 Source: Internet

Author: User


What kind of storage is best? For example, I only need to check whether "9d971_dfc685f9b10d8d1b944330c09" exists in the database and return true or false.

Reply content:


You have not described your use case, so there is no single best answer. Here are a few approaches with their advantages and disadvantages.

1. Sorted file. Sort the data and write it to disk as fixed-width records packed back to back (32 characters, 32 characters, 32 characters, ...). Look values up with a plain binary search (90% of programmers cannot write a correct binary search in an interview, but you can use a ready-made one). The time complexity is O(log n), which is extremely fast.
Advantages: fast, modest disk usage, almost no memory usage.
Disadvantage: adding new values is hard because the file is fully sorted. Every addition means re-sorting and, crucially, rewriting the file on disk. (Fine if you only build it once.)
Conclusion: suitable for data that will not change.
2. Bloom filter. Search for an existing implementation; there is a Python version, for example.
Advantages: fast lookups and good insert performance.
Disadvantage: it has an error rate, although the rate is controllable.
Conclusion: suitable when 100% accuracy is not required and when the data changes frequently.
3. Partitioning (similar to the approach in the first reply). Hash each key, then take the result modulo some number, say 5000, to split the data into 5000 sub-files. Sort each sub-file separately. To look up a key, hash it and take the modulo to find the right sub-file, then binary-search that file. Since MD5 output can generally be treated as uniformly distributed, there is no need to hash again; just take the modulo of the value directly.
Advantage: good insert performance (a single insert only needs an insertion sort within one partition).
Disadvantages: none that are obvious. It is a compromise.
4. Pure hash table, needless to say. Read all the data into memory and build a hash table (for 100 million entries the table is not large, only a few GB).
Advantages: O(1) lookup and, well, O(1) insertion too.
Disadvantage: memory usage...
Conclusion: cost aside, it is No. 1 in every aspect of performance.
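
To make the first method concrete, here is a minimal Python sketch of a binary search over a file of sorted, fixed-width records. It assumes lowercase hex digests stored with no separator; the names `RECORD_SIZE`, `build_sorted_file` and `contains` are made up for this illustration, and the demo data is generated on the spot rather than taken from the question.

```python
import hashlib
import os
import tempfile

RECORD_SIZE = 32  # one hex-encoded MD5 digest per record, no separator


def build_sorted_file(path, digests):
    """Write the digests to `path` as sorted, back-to-back 32-byte records."""
    with open(path, "wb") as f:
        for d in sorted(set(digests)):
            f.write(d.encode("ascii"))


def contains(path, digest):
    """Binary-search the record file for `digest` using O(log n) seeks."""
    target = digest.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)
            record = f.read(RECORD_SIZE)
            if record == target:
                return True
            if record < target:
                lo = mid + 1
            else:
                hi = mid
    return False


# Demo with 1000 generated digests (stand-ins for the real data set).
digests = [hashlib.md5(str(i).encode()).hexdigest() for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "md5.sorted")
build_sorted_file(path, digests)
found = contains(path, hashlib.md5(b"42").hexdigest())
missing = contains(path, hashlib.md5(b"not there").hexdigest())
```

Only one record is held in memory at a time, which is where the "almost no memory usage" advantage comes from. Storing the 16 raw bytes instead of 32 hex characters would halve the file size at the cost of a conversion on each lookup.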

Conclusion:

If the source data will never change, use solution 1.
If 100% accuracy is not required, use solution 2.
If you have the money to buy memory, use solution 4.
For everyone else, solution 3.

The logic of the above methods is very simple and can be implemented quickly. If you have time, you can test the performance.
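
As an illustration of how little code is involved, solution 3 could be sketched in Python roughly as follows. This is a simplification under stated assumptions: each in-memory list stands in for one sorted sub-file on disk, and `PartitionedIndex` and `partition_of` are hypothetical names, not an existing library.

```python
import bisect
import hashlib

NUM_PARTITIONS = 5000  # the example figure used in the answer


def partition_of(digest):
    """MD5 is treated as uniform, so take the value itself modulo N."""
    return int(digest, 16) % NUM_PARTITIONS


class PartitionedIndex:
    def __init__(self):
        # Each bucket stands in for one sorted sub-file on disk.
        self.buckets = [[] for _ in range(NUM_PARTITIONS)]

    def add(self, digest):
        bucket = self.buckets[partition_of(digest)]
        i = bisect.bisect_left(bucket, digest)
        if i == len(bucket) or bucket[i] != digest:
            bucket.insert(i, digest)  # only this one partition is touched

    def __contains__(self, digest):
        bucket = self.buckets[partition_of(digest)]
        i = bisect.bisect_left(bucket, digest)
        return i < len(bucket) and bucket[i] == digest


# Demo with generated digests (stand-ins for the real data set).
index = PartitionedIndex()
for n in range(1000):
    index.add(hashlib.md5(str(n).encode()).hexdigest())
present = hashlib.md5(b"7").hexdigest() in index
absent = hashlib.md5(b"no such password").hexdigest() in index
```

With 100 million entries and 5000 partitions, each partition holds about 20,000 values, so an insert rewrites a small file instead of the whole data set.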

As an aside, here is how fast a hash table actually is.

Generate 10 million random strings (32 bytes per line):

$ head -1 1000w
bCxshZTroH6OukITgLsCccK9SlBd7CHL

Take the last 100 strings as queries (the worst case for grep, since it has to scan the whole file):

$ tail -100 1000w > q100
$ time (cat q100 | while read line; do grep -Fx $line 1000w >/dev/null; done)
6.87s user 7.36s system 99% cpu 14.322 total

So in the worst case grep manages about 7 req/s (100 queries in 14.322 s), and each query has O(n) time complexity.

Use awk to evaluate the performance of a hash table (awk's associative arrays are implemented as hash tables; note that ARGIND is a gawk extension):

$ time awk 'ARGIND==1{a[$0]}' 1000w
14.24s user 0.61s system 99% cpu 14.861 total

Loading the hash table takes about 15 s. If this is written as a service, the table only has to be loaded once, so the load time is not counted.

Query performance.

$ time awk 'ARGIND==1{a[$0]} ARGIND>1&&($0 in a){print $0}' 1000w 1000w >/dev/null
27.88s user 0.73s system 99% cpu 28.734 total

The hash table handles 10000000/(28.734-14.861) ≈ 720,000 req/s, roughly 100,000 times faster than grep, and the time complexity of a lookup is O(1).
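
For comparison, the same load-once-then-query pattern can be sketched in Python with a built-in `set`, which is also a hash table. This is only an illustration: the `load` and `exists` helpers are hypothetical names, the data is 100,000 generated digests rather than the 10 million strings above, and the throughput printed will differ from the awk figures.

```python
import hashlib
import time


def load(lines):
    """Store each 32-char hex digest as 16 raw bytes to halve the payload."""
    return {bytes.fromhex(line.strip()) for line in lines}


def exists(table, digest_hex):
    """O(1) average-case membership test against the in-memory table."""
    return bytes.fromhex(digest_hex) in table


# Generated stand-in data; a real service would read the file once at startup.
lines = [hashlib.md5(str(i).encode()).hexdigest() + "\n" for i in range(100_000)]
table = load(lines)

start = time.perf_counter()
hits = sum(exists(table, line.strip()) for line in lines)
elapsed = time.perf_counter() - start
print(f"{hits} hits, about {hits / elapsed:.0f} lookups/s")
```

The raw digests of 100 million entries come to 1.6 GB (100,000,000 x 16 bytes); the per-object overhead of the hash table comes on top of that, which is the memory cost mentioned for solution 4.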

I am not strong on algorithms, so here is a simple idea.
Anyone who has used git knows that the file name of a git object is the last 38 digits of its hash value (SHA-1).
Each subdirectory of git's objects directory is named after the first two digits of the object's hash value.
To find an object, git first locates the directory from the first two digits, and then finds the specific file inside that directory.
The following is part of a git objects directory:

00  06  0c  12  18  1e  24  2a  30  36  3c  42  48  4e  54  5a  60  66  6c  72  78

This ensures that the data is evenly distributed in different directories.

Similarly, you can split the data across tables in your database, one per two-digit prefix.
For example: md5_3e, md5_06, ...
Suppose you want to check whether 3eabecb5ff177ebadd305fe52e278d92df3754 exists.
First check whether the table md5_3e exists. If it does, check whether the value abecb5ff177ebadd305fe52e278d92df3754 exists in md5_3e.

In this solution there are 256 tables, so each one stores about 390,000 rows (100 million / 256). Add an index to the field and queries will certainly be fast.
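
The routing described above could look roughly like this in Python, using the standard-library `sqlite3` module as a stand-in for whatever database is actually used. The column name `rest` and the helper names `route`, `add` and `exists` are assumptions made for the sketch.

```python
import sqlite3

HEX_DIGITS = "0123456789abcdef"


def route(md5_hex):
    """Split a hash into (table name, remainder stored in that table)."""
    md5_hex = md5_hex.lower()
    prefix, rest = md5_hex[:2], md5_hex[2:]
    # Validate the prefix because it becomes part of a table name.
    if len(prefix) != 2 or any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("expected a hex string")
    return "md5_" + prefix, rest


def add(conn, md5_hex):
    table, rest = route(md5_hex)
    # PRIMARY KEY gives the indexed lookup the answer asks for.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (rest TEXT PRIMARY KEY)")
    conn.execute(f"INSERT OR IGNORE INTO {table} (rest) VALUES (?)", (rest,))


def exists(conn, md5_hex):
    table, rest = route(md5_hex)
    known = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if known is None:  # no such table, so the value cannot exist
        return False
    row = conn.execute(f"SELECT 1 FROM {table} WHERE rest = ?", (rest,)).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
add(conn, "3eabecb5ff177ebadd305fe52e278d92df3754")
rows_per_table = 100_000_000 // 16 ** 2  # 390625 rows if evenly spread
```

Calling `exists(conn, "3eabecb5ff177ebadd305fe52e278d92df3754")` first resolves the table `md5_3e` and then does a single indexed lookup on the remaining digits.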

This is my idea, and it should be the simplest way to implement the requirements.

The original poster could look into bitmaps, which would be very helpful for this problem.

You can try a Bloom filter. When testing whether an element belongs to a set, a Bloom filter may mistakenly report that an element outside the set belongs to it (a false positive), but it never reports an element of the set as absent. It is therefore not suitable when zero errors are required.
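
To show where the one-sided error comes from, here is a minimal Bloom filter sketch in Python. It is not a production implementation: the `BloomFilter` class is written for this illustration, the sizing uses the usual textbook formulas, and deriving the bit positions from one SHA-256 digest by double hashing is an assumption of this sketch rather than something the replies specify.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, error_rate=0.01):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


# Demo with generated digests (stand-ins for the real data set).
members = [hashlib.md5(str(i).encode()).hexdigest() for i in range(1000)]
bloom = BloomFilter(capacity=1000, error_rate=0.01)
for m in members:
    bloom.add(m)
outsiders = [hashlib.md5(f"x{i}".encode()).hexdigest() for i in range(1000)]
false_positives = sum(o in bloom for o in outsiders)
```

An added element always finds all of its bits set, so it can never be reported as absent; an outsider is reported as present only when other elements happen to have set all of its bits. With these formulas, 100 million entries at a 1% error rate need about 9.6 bits per entry, roughly 120 MB in total.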

This comes down to a storage structure built on binary search.

For example, a B-tree.

See http://blog.csdn.net/v_july_v/article/details/6530142

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
