Thoughts on URL deduplication

Source: Internet
Author: User
Tags: hash, md5

So-called URL deduplication (I have not found a standard English term for it; URL filtering, perhaps?) means removing duplicates from the URLs a crawler encounters, so that the same page is not crawled more than once. A crawler typically keeps the URLs to be crawled in a queue and extracts new URLs from the pages it fetches. Before the new URLs are placed in the queue, the crawler must first check whether they have been crawled before; a URL that has already been crawled is not queued again.

The most intuitive approach – a hash table

To get the whole crawler running as quickly as possible, the first URL deduplication scheme I used was an in-memory HashSet. This is the most intuitive approach, and the one everyone can think of: the string form of each URL is put into the HashSet, and any new URL is first looked up in the HashSet; if it is not there, the new URL is inserted into the HashSet and also placed into the queue to be crawled.
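A minimal sketch of this first scheme (the class and method names are mine, purely for illustration):

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class InMemoryUrlSeen {
        // Every URL ever seen is kept in memory, which is exactly the
        // weakness discussed below.
        private final Set<String> seen = new HashSet<>();
        private final Queue<String> toCrawl = new ArrayDeque<>();

        /** Queue the URL only if it has never been seen before. */
        public void offer(String url) {
            if (seen.add(url)) {   // add() returns false if the URL was already present
                toCrawl.add(url);
            }
        }

        public String next() {
            return toCrawl.poll();
        }
    }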
The benefit of this scheme is that its deduplication is exact and will not miss a single duplicate URL. Its drawback is that my crawler was dead by the next morning with an OutOfMemoryError. As the number of crawled pages grows, the HashSet keeps growing without limit. In addition, many URLs on the web are actually very long, with large numbers of URLs running to hundreds of characters. Of course, since my crawler runs on a small server, the JVM does not have much memory; otherwise it should have been able to hold out for another one or two days.
For a simple estimate, assume the average length of a single URL is 100 bytes (which I think is quite conservative). Crawling 10 million URLs then requires:
* 100 bytes × 10,000,000 URLs = 1,000,000,000 bytes ≈ 1 GB
And 10 million URLs is no more than a drop in the bucket compared with the entire Internet. You can imagine how much memory it would take to hold a HashSet of all URLs.

Compress URLs

So that my crawler could hold out for a few more days, and without wanting to change too much code, the second version added one small feature: the HashSet no longer stores the original URLs, but stores the URLs after compression. There seem to be quite a few papers discussing how to compress URLs, and Sina Weibo's short-URL scheme is actually a good solution, but I don't know these methods. To be lazy, I simply used MD5 to encode each URL.

The result of MD5 is 16 bytes long. Compared with the estimated average URL length of about 100 bytes, that is a reduction of several times, and it let the crawler hold out for many more days.
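A sketch of what that second version amounts to; java.security.MessageDigest is one standard way to compute an MD5 digest in Java, though not necessarily the exact code the crawler used:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    public class Md5UrlSeen {
        // The set holds 16-byte MD5 digests instead of full URL strings,
        // which is where the memory saving comes from.
        private final Set<ByteBuffer> seenDigests = new HashSet<>();

        /** Returns true if the URL's digest was not present before (and records it). */
        public boolean addIfAbsent(String url) {
            return seenDigests.add(ByteBuffer.wrap(md5(url)));
        }

        private static byte[] md5(String url) {
            try {
                return MessageDigest.getInstance("MD5")
                        .digest(url.getBytes(StandardCharsets.UTF_8));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }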
Of course, even if you find an algorithm that compresses URLs to the extreme, as the number of URLs keeps growing you will still run out of memory one day. So this scheme does not solve the essential problem.
Another problem with MD5 is that two different URLs could be mapped to the same MD5 value, in which case one of them would never be crawled. I'm not sure how large the odds of that are; if they are very small, this small error won't have much effect.

Bloom Filter

An essential problem with the memory-based HashSet approach is that the memory it consumes keeps growing as the number of URLs grows. Unless you can guarantee that memory is large enough to hold all the URLs that ever need to be crawled, this solution will one day hit a bottleneck.
At this point, I wanted to find something similar to a HashSet but whose memory consumption is relatively fixed and does not keep growing, so I naturally thought of the Bloom filter. The concept of the Bloom filter is not discussed here; explanations can be found everywhere on the Internet. I briefly tried the Bloom filter, but soon gave it up. The Bloom-filter-based solution has several problems:
The first is theoretical: the Bloom filter will also filter out some normal samples (in my case, URLs that have not actually been crawled), the so-called false positives. How likely that is depends, of course, on the Bloom filter's parameter settings; but this leads to the next question;
The second is practical: how should the Bloom filter's parameters m, k, and n be set? I have no experience with what values are appropriate, and it would probably take repeated experiments and tests to settle on good ones (a sizing sketch follows at the end of this section);
But these two issues were not the root cause of my abandoning the Bloom filter. The real reason is that I am building a crawler framework, on top of which many crawler tasks can be started, each task crawling its own specific URLs, with the tasks independent of one another. This would require one Bloom filter per task, and although the memory consumed by a single task's Bloom filter is fixed, more tasks means more Bloom filters and therefore more memory. A memory overflow is still possible.
But if there is only a single crawl task, the Bloom filter should be a great choice.
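If you do go that route, the filter has to be sized up front: given the expected number of URLs n and an acceptable false-positive rate p, the usual formulas are m = -n·ln(p)/(ln 2)^2 bits and k = (m/n)·ln 2 hash functions. For example, n = 10 million and p = 0.1% works out to about 14.4 bits per URL, roughly 17 MB in total. A minimal sketch using Guava's BloomFilter (my choice of library here, not something mentioned in the original post), which derives m and k from (n, p) in essentially this way:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class BloomUrlSeen {
        // Both numbers are assumptions to be tuned for the actual crawl:
        // n = expected number of URLs, p = acceptable false-positive rate.
        private static final int EXPECTED_URLS = 10_000_000;
        private static final double FALSE_POSITIVE_RATE = 0.001;

        // Guava computes the bit-array size m and hash count k from (n, p).
        private final BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                EXPECTED_URLS,
                FALSE_POSITIVE_RATE);

        /** Returns true if the URL looks new and records it; false if it looks seen. */
        public boolean addIfAbsent(String url) {
            if (seen.mightContain(url)) {
                return false;   // already seen, or a false positive
            }
            seen.put(url);
            return true;
        }
    }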
BerkeleyDB
I finally realized that what I needed was a deduplication solution that could live on disk, so that a memory overflow would never be possible. I had known for a long time that there was such a thing as BerkeleyDB, but the first time I really understood it was from Amazon's Dynamo paper, where BerkeleyDB is used as the underlying storage on a single machine. At the time it struck me as a rather odd thing: something calling itself a "DB" that didn't support SQL. Back then the term NoSQL didn't exist yet; this kind of system was called a non-relational database.
BerkeleyDB is a key-value database; simply put, it is a hash table on disk, which is why it can be used for URL deduplication. Another characteristic is that it runs in the same process space as the application, unlike a typical DB, which runs as a separate program.
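To make that concrete, and to show the initialization that the Heritrix excerpt below leaves out, here is a minimal sketch of opening a Berkeley DB Java Edition database in-process and using putNoOverwrite as a "have I seen this key before" test. The class and database names are mine; only the com.sleepycat.je calls follow the library's API.

    import com.sleepycat.je.*;
    import java.io.File;

    public class DiskSeenStore {
        private final Environment env;
        private final Database alreadySeen;

        public DiskSeenStore(File dir) throws DatabaseException {
            // The environment is an existing directory on disk; the database
            // lives inside it and is accessed in-process, with no separate server.
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            env = new Environment(dir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            alreadySeen = env.openDatabase(null, "alreadySeen", dbConfig);
        }

        /** Returns true if the key was not present before (and stores it). */
        public boolean addIfAbsent(byte[] keyBytes) throws DatabaseException {
            DatabaseEntry key = new DatabaseEntry(keyBytes);
            DatabaseEntry value = new DatabaseEntry(new byte[0]); // value is irrelevant
            OperationStatus status = alreadySeen.putNoOverwrite(null, key, value);
            return status == OperationStatus.SUCCESS; // KEYEXIST means already seen
        }

        public void close() throws DatabaseException {
            alreadySeen.close();
            env.close();
        }
    }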
Attached below, for a closer look, is the code Heritrix uses to do URL deduplication with BerkeleyDB. (The code is in the Heritrix source, class org.archive.crawler.util.BdbUriUniqFilter.)

There are a bunch of functions that do initialization and configuration, which are ignored here; only the two really relevant functions are shown:

public static long createKey(CharSequence uri) {
    String url = uri.toString();
    int index = url.indexOf(COLON_SLASH_SLASH);
    if (index > 0) {
        index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
    }
    CharSequence hostPlusScheme = (index == -1) ? url : url.subSequence(0, index);
    long tmp = FPGenerator.std24.fp(hostPlusScheme);
    return tmp | (FPGenerator.std40.fp(url) >>> 24);
}

/** Value: only 1 byte */
private static DatabaseEntry ZERO_LENGTH_ENTRY = new DatabaseEntry(new byte[0]);

protected boolean setAdd(CharSequence uri) {
    DatabaseEntry key = new DatabaseEntry();
    LongBinding.longToEntry(createKey(uri), key);
    long started = 0;

    OperationStatus status = null;
    try {
        if (logger.isLoggable(Level.INFO)) {
            started = System.currentTimeMillis();
        }
        status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
        if (logger.isLoggable(Level.INFO)) {
            aggregatedLookupTime += (System.currentTimeMillis() - started);
        }
    } catch (DatabaseException e) {
        logger.severe(e.getMessage());
    }
    if (status == OperationStatus.SUCCESS) {
        count++;
        if (logger.isLoggable(Level.INFO)) {
            final int logAt = 10000;
            if (count > 0 && ((count % logAt) == 0)) {
                logger.info("Average lookup " + (aggregatedLookupTime / logAt) + "ms.");
                aggregatedLookupTime = 0;
            }
        }
    }
    if (status == OperationStatus.KEYEXIST) {
        return false; // not added
    } else {
        return true;
    }
}

A brief explanation:

The first function, createKey, compresses the URL: it converts a URL of arbitrary length into a long value. The value range of a long is 2^64, so the probability of two URLs being mapped to the same long value should be quite low. But I haven't looked at the function very closely, so I'm not sure exactly how it works.

The second function, setAdd, writes the compressed URL into BerkeleyDB. As mentioned earlier, BerkeleyDB is a key-value database, and each of its records consists of a key and a value. But for URL deduplication the value doesn't matter (just as we used a HashSet rather than a HashMap earlier), so the same zero-length value, the static variable ZERO_LENGTH_ENTRY, is used for every record.

Don't be put off by how many lines setAdd has; the only line that really does the work is this one:

    status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
It inserts the compressed long value as the key and ZERO_LENGTH_ENTRY as the value into BerkeleyDB. If the long value is already in the DB, putNoOverwrite returns OperationStatus.KEYEXIST, which means the corresponding URL has been crawled before, and the URL is therefore not placed into the queue to be crawled.
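To make the control flow around that return value concrete, here is a small hypothetical sketch; UrlSeen is a made-up interface standing for any of the deduplication schemes discussed above, not a Heritrix type:

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class Frontier {
        /** Stands in for any of the schemes above (HashSet, MD5 set, Bloom filter, BerkeleyDB). */
        public interface UrlSeen {
            boolean setAdd(CharSequence url);   // true = never seen before
        }

        private final Queue<String> toCrawl = new ArrayDeque<>();
        private final UrlSeen seen;

        public Frontier(UrlSeen seen) {
            this.seen = seen;
        }

        /** A newly extracted URL is queued only if the filter has not seen it before. */
        public void offer(String url) {
            if (seen.setAdd(url)) {
                toCrawl.add(url);
            }
            // otherwise the URL was crawled or queued before and is skipped
        }
    }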
Finally
Unfortunately, I haven't yet taken the time to run a performance test against BerkeleyDB, so I'm not sure how many setAdd operations it can perform per second, or whether that is enough to meet our performance requirements. That will have to wait.

In addition, although I don't know the details, I imagine that the crawler of a professional search engine like Baidu uses a URL deduplication scheme more sophisticated than the ones listed here; after all, its requirements in every respect are higher.

The above content is reproduced from: URL deduplication is not that simple
