Simple URL deduplication

I have noticed that several blogs carry the prefix "not simple", which neatly captures the situation here: a task that looks simple turns out not to be once you actually try it. Many things are like this; if you never do them yourself and never study them carefully, you stay in a fog.
This reminds me of another story. When I graduated, I was interviewed by the CTO of a company. He told me something I still remember: technology is actually very simple (a guru said something similar to me a few years later). I kept pondering what he meant, and now I understand it: no matter how difficult a technology is, you can always learn it and master it if you put your heart into it.

Whether something is simple or not is less about the technology itself and more about the attitude you bring to it.

Now, back to the main topic.
So-called URL deduplication (I have not found a standard English term for it; perhaps "URL filtering"?) means that a crawler avoids fetching the same webpage more than once. A crawler normally keeps the URLs to be fetched in a queue; before the new URLs extracted from a fetched page are added to that queue, it must first check that they have not been fetched already. URLs that have already been fetched are not enqueued.
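Throughout this post, that check can be thought of as a single operation: record the URL and report whether it was new. A tiny hypothetical interface (my naming, purely illustrative) captures the contract that every scheme below tries to implement:

    /** Hypothetical contract for the deduplication check (names are illustrative). */
    public interface UrlSeen {
        /**
         * Records the URL and reports whether it was new.
         * @return true if the URL has not been seen before (enqueue it),
         *         false if it was already recorded (skip it).
         */
        boolean add(String url);
    }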

The most intuitive way: a hash table

To get the whole crawler up and running as quickly as possible, the first URL deduplication solution was an in-memory HashSet, which is the most intuitive approach anyone would think of: put the URL strings into a HashSet, and look every new URL up there first. If it is not present, insert it into the HashSet and add the URL to the queue of pages to be crawled.
The advantage of this solution is that deduplication is exact and no duplicate URL slips through. The disadvantage is that my crawler crashed the next morning with an out-of-memory error. As the number of crawled pages grows, the HashSet grows without bound. On top of that, many URLs on the web are very long, plenty of them running to hundreds of characters. Of course, my crawler runs on a small server without much JVM heap; with more memory it would probably have lasted another day or two.
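For reference, here is a minimal sketch of this first version, assuming a simple single-threaded crawl loop (the class and field names are mine, not my actual crawler code):

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class InMemoryUrlSeen {
        // Every URL string ever enqueued stays here, so memory grows without bound.
        private final Set<String> seen = new HashSet<>();
        private final Queue<String> toCrawl = new ArrayDeque<>();

        /** Enqueues the URL only if it has not been seen before. */
        public void offer(String url) {
            if (seen.add(url)) {   // HashSet.add returns false if the URL is already present
                toCrawl.add(url);
            }
        }

        /** Next URL to fetch, or null if the queue is empty. */
        public String next() {
            return toCrawl.poll();
        }
    }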
As a rough estimate, suppose the average URL is 100 bytes long (which I think is conservative) and you want to crawl 10 million URLs:
100 bytes * 10,000,000 = 1 GB
And 10 million URLs is only a drop in the ocean of the Internet, which gives you an idea of how much memory a HashSet holding all URLs would need.

Compressed URL

To let my crawler run for a few more days without changing much code, the second version added a small tweak: instead of storing the raw URL in the HashSet, the URL is compressed first and the compressed form is stored. Quite a few papers seem to have discussed URL compression, and Sina Weibo's short-URL scheme is actually a good approach, but I was not familiar with any of these methods. Being lazy, I simply hashed each URL with MD5.
An MD5 digest is 128 bits, i.e. 16 bytes. Compared with the estimated average URL length above, that shrinks each entry by several times, enough to keep going for quite a few more days.
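A minimal sketch of this second version, assuming the digests are kept in an in-memory set (ByteBuffer is used here only because it gives the 16-byte digest value-based equals/hashCode; the names are mine):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    public class Md5UrlSeen {
        // 16 bytes per URL instead of the full string, but still entirely in memory.
        private final Set<ByteBuffer> seen = new HashSet<>();

        /** Records the URL's MD5 digest; returns true if the URL was new. */
        public boolean add(String url) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
                // ByteBuffer.wrap gives content-based equals/hashCode for the digest.
                return seen.add(ByteBuffer.wrap(digest));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }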
Of course, even with an algorithm that compresses URLs to the absolute minimum, memory will still run out one day as the number of URLs keeps growing, so this scheme does not solve the essential problem.
Another problem with MD5 is that two different URLs may map to the same MD5 value, in which case one of them will never be crawled. I was not sure how likely this is; if the probability is tiny, this small error has no real impact.
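For what it is worth, a rough birthday-bound estimate suggests the risk really is negligible: among n URLs hashed to 128 bits, the probability of any collision is about n^2/2 divided by 2^128. For n = 10,000,000 that is roughly 5*10^13 / 3.4*10^38, on the order of 10^-25, so MD5 collisions can safely be ignored here.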

Bloom Filter

The in-memory HashSet approach has an essential problem: its memory consumption grows with the number of URLs. Unless memory is large enough to hold every URL that will ever be crawled, this scheme will hit a wall sooner or later.
At that point I wanted something that behaves like a HashSet but whose memory consumption stays more or less fixed instead of growing, which naturally led me to the Bloom filter. I will not explain what a Bloom filter is here; descriptions are easy to find online. I tried it briefly but soon gave up, for several reasons:
The first is theoretical: a Bloom filter rejects some normal samples (in my case, URLs that have not actually been fetched yet), i.e. it produces false positives. The probability depends on how the Bloom filter's parameters are set, which leads to the next problem;
The second is practical: how should the parameters be set? How large should m, k and n be? I had no experience with this, and it would probably take repeated experiments to settle on values (the standard sizing formulas are recalled in the sketch below);
Neither of these was the real reason I gave up on the Bloom filter. The real reason is that I am building a crawler framework that can run many crawl tasks, each task fetching its own set of URLs, with the tasks independent of one another. That means each task needs its own Bloom filter. Although a single task's Bloom filter uses a fixed amount of memory, more tasks mean more Bloom filters and therefore more memory, so the risk of running out of memory is still there.
That said, if there is only a single crawl task, a Bloom filter is a very good choice.
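For a single task, a sketch along the following lines would do. I assume Guava's BloomFilter here purely for illustration (my original attempt did not necessarily use it), and the expected-insertions and false-positive-rate figures are made-up examples. The standard sizing formulas are m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions; Guava derives m and k internally from n and p.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class BloomUrlSeen {
        // n = expected number of URLs for this task, p = acceptable false-positive rate.
        // Both numbers below are illustrative guesses, not tuned values.
        private final BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10_000_000,   // n
                0.001);       // p

        /**
         * Returns true if the URL was definitely new; false means it was
         * probably seen already (with a small chance of a false positive).
         */
        public boolean add(String url) {
            // put() returns true only if the filter's bits changed, i.e. the URL is definitely new.
            return seen.put(url);
        }
    }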

BerkeleyDB

I finally realized that what I really need is a deduplication scheme that lives on disk, so that running out of memory simply cannot happen. I had known about BerkeleyDB for a long time, but the first time I truly took note of it was in Amazon's Dynamo paper, where it is used as the underlying storage on a single node. At the time that struck me as odd: here was something calling itself a "DB" that did not support SQL. Back then the term NoSQL was not yet in use; such things were simply called non-relational databases.
BerkeleyDB is a key-value database, in essence a hash table on disk, which is exactly why it can be used for URL deduplication. Another unusual aspect is that it runs inside the same process as your program, rather than as a separate server process the way an ordinary database does.
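For readers who have not used it, here is a minimal sketch of opening a BerkeleyDB Java Edition store (the directory and database names are arbitrary; Heritrix's own setup, skipped below, is considerably more involved):

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseException;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import java.io.File;

    public class BdbOpenExample {
        /** Opens (or creates) an on-disk key-value store that lives inside our own process. */
        public static Database openAlreadySeen(File dir) throws DatabaseException {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(dir, envConfig);   // dir is just a directory on disk

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            // A named key-value store: effectively a hash table on disk.
            return env.openDatabase(null, "alreadySeen", dbConfig);
        }
    }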
To see how this works in practice, here is the code Heritrix uses for URL deduplication with BerkeleyDB (it lives in org.archive.crawler.util.BdbUriUniqFilter in the Heritrix source):

Skipping over a pile of initialization and configuration functions, only two functions really matter here:

    /**
     * Create fingerprint.
     * Public access so test code can access createKey.
     * @param uri URI to fingerprint.
     * @return Fingerprint of passed <code>url</code>.
     */
    public static long createKey(CharSequence uri) {
        String url = uri.toString();
        int index = url.indexOf(COLON_SLASH_SLASH);
        if (index > 0) {
            index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
        }
        CharSequence hostPlusScheme = (index == -1)? url: url.subSequence(0, index);
        long tmp = FPGenerator.std24.fp(hostPlusScheme);
        return tmp | (FPGenerator.std40.fp(url) >>> 24);
    }

    /**
     * value: only 1 byte
     */
    private static DatabaseEntry ZERO_LENGTH_ENTRY = new DatabaseEntry(
            new byte[0]);

    protected boolean setAdd(CharSequence uri) {
        DatabaseEntry key = new DatabaseEntry();
        LongBinding.longToEntry(createKey(uri), key);
        long started = 0;

        OperationStatus status = null;
        try {
            if (logger.isLoggable(Level.INFO)) {
                started = System.currentTimeMillis();
            }
            status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
            if (logger.isLoggable(Level.INFO)) {
                aggregatedLookupTime +=
                    (System.currentTimeMillis() - started);
            }
        } catch (DatabaseException e) {
            logger.severe(e.getMessage());
        }
        if (status == OperationStatus.SUCCESS) {
            count++;
            if (logger.isLoggable(Level.INFO)) {
                final int logAt = 10000;
                if (count > 0 && ((count % logAt) == 0)) {
                    logger.info("Average lookup " +
                        (aggregatedLookupTime / logAt) + "ms.");
                    aggregatedLookupTime = 0;
                }
            }
        }
        if (status == OperationStatus.KEYEXIST) {
            return false; // not added
        } else {
            return true;
        }
    }

A brief explanation:

The first function, createKey, compresses the URL: it turns a URL of arbitrary length into a long. Since a long has 2^64 possible values, the probability that two different URLs map to the same long is quite low. I have not studied this function very closely, though, so I am not sure exactly how well it behaves.
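To put a rough number on "quite low": by the same birthday bound as before, the chance of any collision among n 64-bit fingerprints is about n^2 / 2^65. For 10 million URLs that is roughly 10^14 / 3.7*10^19, i.e. around 3*10^-6, which a crawler can normally live with.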

The second function, setAdd, writes the compressed URL into BerkeleyDB. As mentioned above, BerkeleyDB is a key-value database, and each record consists of a key and a value. In URL deduplication, however, the value does not matter (just as we used a HashSet rather than a HashMap in memory), so an essentially empty placeholder is used as the value: the static field ZERO_LENGTH_ENTRY, a DatabaseEntry wrapping a zero-length byte array.

Although setAdd has quite a few lines, only this one really does the work:

status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);

The compressed long value is inserted into BerkeleyDB as the key, with ZERO_LENGTH_ENTRY as the value. If the long value already exists in the database, OperationStatus.KEYEXIST is returned, which means the corresponding URL has been fetched before, so that URL is not placed in the queue to be crawled.

Finally

Unfortunately, I have not yet found the time to benchmark the BerkeleyDB solution, so I do not know how many setAdd operations it can handle per second, or whether that is enough to meet our performance requirements. I will fill this in later.

Also, although I have no inside knowledge, I suspect that the URL deduplication schemes used by the crawlers of professional search engines such as Baidu are far more sophisticated than the ones listed here; after all, the requirements on every front must be much higher.
