Bloom filter and dawgdic (A trie tree)

Source: Internet
Author: User

I have a friend who made a mobile browser.

He has the following requirement: when a user enters a URL, the mobile browser needs to determine whether it is malicious. He already has a library of known malicious URLs.

There may be a variety of solutions.

One of them is to store the malicious URL library locally. When the mobile browser receives a URL, it matches it against each address in the library and decides from the result whether the URL is malicious.

Oh, I forgot to mention that the URL library holds 1.5 million entries and is 23 MB after compression. If a browser ships such a large library just to identify malicious URLs, it will have no users.

The solution I proposed first was the Bloom filter. I will not go into its mechanism in detail; I only give the parameter values here, following the discussion in Mr. Wu Jun's "The Beauty of Mathematics": the array size is 1500000*20/8 B (that is, the bitset has 20 bits per data item), the number of hash functions is 13, and the false-positive rate is roughly one in ten thousand (the standard estimate (1 - e^(-13/20))^13 gives about 7 in 100,000). I implemented this algorithm in both C++ and Java, and the test results were satisfactory. The array came to a little over 4 MB, and only 2.8 MB after Zip compression. In the 4G era, a mobile browser that ships with a library of about 3 MB is acceptable.
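To make the parameters above concrete, here is a minimal Python sketch of a Bloom filter with 20 bits per item and 13 hash positions. It is not the C++ or Java code mentioned above: the class name, the SHA-256 based double hashing (one digest split into two values instead of 13 separate hash functions), and the example URL are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter; the k positions are derived by double hashing."""

    def __init__(self, expected_items, bits_per_item=20, num_hashes=13):
        self.num_bits = expected_items * bits_per_item
        self.num_hashes = num_hashes
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _indexes(self, item):
        # Assumption: split one SHA-256 digest into two 64-bit values and
        # combine them, rather than running 13 independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))


def false_positive_rate(bits_per_item, num_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k for a full filter."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


if __name__ == "__main__":
    bf = BloomFilter(expected_items=1000)
    bf.add("http://malicious.example/")          # hypothetical URL
    print("http://malicious.example/" in bf)     # an added item is always found
    print(len(BloomFilter(1_500_000).bits))      # 1500000*20/8 = 3750000 bytes
    print(false_positive_rate(20, 13))           # about 6.8e-05
```

A lookup that returns False is certain, so the browser never misses a URL that is in the library; a lookup that returns True can occasionally be wrong, which is the false-positive rate estimated by the last function.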

That should have been the end of it. But there was another requirement: when a user types the first part of a website address, the browser should suggest up to ten related websites.

Of course, a full website library for this would be larger still and would need constant updating, so it cannot be stored locally. However, the number of websites each user browses generally does not exceed one hundred. The local database can start empty. As the user browses more, it is enough to count visits to the websites recorded locally, so there is no need to pull a large URL library from the server. Besides, it doesn't matter if nothing matches.

The algorithm I thought of was the trie. Of course, it would be foolish to implement a trie myself. I searched the Internet and found a pointer on stackoverflow: dawgdic. It claims to be the best trie, with the fastest search speed, and claims that its dictionary is more space-efficient than a trie implemented with two-dimensional arrays. After downloading the code from code.google.com (the latest release is dawgdic-0.4.5.tar.gz, from 2011), I read its examples and found that it offers the following features:

1. From sorted input data, it can build a space-saving DAWG dictionary;

2. Each entry in the DAWG dictionary can be a bare key, or a key inserted together with a value, that is, a key-value pair;

3. Key-value queries can be run against the built dictionary: given a key, it returns the value;

4. Given only a prefix of a key, it can return all keys that share that prefix. The results can be returned in alphabetical order or sorted by value;

5. Given only a suffix of a key, it can return all keys that share that suffix. The results can be returned in alphabetical order or sorted by value.
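Features 2 to 4 can be illustrated with a small Python sketch. Note that this is a plain trie, not a DAWG (it does not merge shared suffixes, and it does not need sorted input), and it does not use dawgdic's actual API; the class names, method names, and example URLs are hypothetical.

```python
import heapq


class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}
        self.value = None  # set only on nodes that end a stored key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def _find(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def lookup(self, key):
        """Key-value query: return the stored value, or None if absent."""
        node = self._find(key)
        return node.value if node else None

    def complete(self, prefix):
        """Yield (key, value) for keys starting with prefix, alphabetically."""
        start = self._find(prefix)
        if start is None:
            return
        stack = [(prefix, start)]
        while stack:
            key, node = stack.pop()
            if node.value is not None:
                yield key, node.value
            # Push children in reverse so the smallest character pops first.
            for ch in sorted(node.children, reverse=True):
                stack.append((key + ch, node.children[ch]))

    def top_completions(self, prefix, limit=10):
        """Completions sorted by value, largest first."""
        return heapq.nlargest(limit, self.complete(prefix), key=lambda kv: kv[1])


if __name__ == "__main__":
    trie = Trie()
    for url, visits in [("news.example.com", 5), ("news.example.org", 12),
                        ("mail.example.com", 3)]:
        trie.insert(url, visits)
    print(trie.lookup("mail.example.com"))
    print(list(trie.complete("news")))
    print(trie.top_completions("news", limit=1))
```

The `limit=10` default mirrors the "up to ten related websites" requirement; sorting by value is what lets the most visited sites appear first.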

With these features, the requirement above was solved at once (^_^). The features we need are 1, 2, and 4. The key in the DAWG dictionary is of course the website URL, and its value (weight) is of course the number of visits. Since the DAWG dictionary cannot be modified after it has been built, while the visit count of each website keeps changing, the dictionary has to be rebuilt at regular intervals.
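The rebuild cycle can be sketched as follows. This is only an illustration under stated assumptions: the read-only index here is a sorted list searched with `bisect`, standing in for the DAWG dictionary, and the rebuild is triggered after a fixed number of recorded visits rather than on a timer. All names (`SuggestionIndex`, `History`, `rebuild_every`) are hypothetical.

```python
import bisect
from collections import Counter


class SuggestionIndex:
    """Read-only snapshot: sorted keys plus their visit counts."""

    def __init__(self, counts):
        self._keys = sorted(counts)
        self._counts = dict(counts)

    def suggest(self, prefix, limit=10):
        # Keys sharing a prefix form one contiguous run in sorted order.
        lo = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[lo:]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        # Most visited first; ties stay in alphabetical order (stable sort).
        matches.sort(key=lambda k: -self._counts[k])
        return matches[:limit]


class History:
    """Mutable visit counts that periodically produce a fresh snapshot."""

    def __init__(self, rebuild_every=20):
        self._visits = Counter()
        self._pending = 0
        self._rebuild_every = rebuild_every
        self.index = SuggestionIndex({})  # the database starts empty

    def record_visit(self, url):
        self._visits[url] += 1
        self._pending += 1
        if self._pending >= self._rebuild_every:
            # The old snapshot is never modified; it is replaced wholesale.
            self.index = SuggestionIndex(self._visits)
            self._pending = 0


if __name__ == "__main__":
    history = History(rebuild_every=3)
    for url in ["a.example", "b.example", "a.example"]:
        history.record_visit(url)
    print(history.index.suggest(""))
```

Between rebuilds the suggestions are slightly stale, which fits the remark above that it doesn't matter if nothing matches.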

The above only describes one application scenario for each of the two algorithms. In fact, both are widely used. Leaving the Bloom filter aside, the DAWG can be used, for example, for hot-search suggestions in search engines, word lookup in English-Chinese dictionaries, and personalized suggestions in input methods.

I wrote this note after dinner to summarize my recent spare-time study; now it is back to working overtime.

Note: junk scraper websites such as www.tuicool.com may not repost my blog without my permission.
