Source code analysis of the open-source Java crawler crawler4j (part 4): URL management and the URL queue


During a crawl, a large number of URLs have to be stored and handed out to worker threads. Managing these URLs efficiently is a top priority for any crawler system.

Out of the box, crawler4j fetches at most a few thousand URLs per hour; after some tuning (described in a follow-up article) it can reach hundreds of thousands per hour. How should that many URLs be managed?

crawler4j uses the embedded database Berkeley DB JE for temporary URL storage and allocation management. I gave a brief introduction to Berkeley DB JE in another article:

Don't want to use SQL for massive amounts of simple data? Try the efficient embedded database Berkeley DB JE!

WebURL:

Let's start with the main() method of BasicCrawlController and see how the program adds the seed (entry) URLs:

controller.addSeed("http://www.ics.uci.edu/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/~welling/");

Let's look at the addSeed() method of the crawl controller:

public void addSeed(String pageUrl) {
    addSeed(pageUrl, -1);
}

public void addSeed(String pageUrl, int docId) {
    String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
    if (canonicalUrl == null) {
        logger.error("Invalid seed URL: " + pageUrl);
        return;
    }
    if (docId < 0) {
        docId = docIdServer.getDocId(canonicalUrl);
        if (docId > 0) {
            // This URL is already seen.
            return;
        }
        docId = docIdServer.getNewDocID(canonicalUrl);
    } else {
        try {
            docIdServer.addUrlAndDocId(canonicalUrl, docId);
        } catch (Exception e) {
            logger.error("Could not add seed: " + e.getMessage());
        }
    }

    WebURL webUrl = new WebURL();
    webUrl.setURL(canonicalUrl);
    webUrl.setDocid(docId);
    webUrl.setDepth((short) 0);
    if (!robotstxtServer.allows(webUrl)) {
        logger.info("Robots.txt does not allow this seed: " + pageUrl);
    } else {
        frontier.schedule(webUrl);
    }
}

Here, WebURL is the URL model class. It stores the URL's own attributes, such as which part of the address is the domain and which is the sub-domain (the split is not trivial because suffixes differ: .cn, .com.cn, .gov, .gov.cn, and so on). It also stores crawl-related attributes that are filled in while the crawler works: the assigned doc id, the parent doc id, the parent URL, the depth, and the priority. The parent URL is the page on which this address was found. The depth is the level at which it was found: seed URLs are depth 0, URLs found on a seed page are depth 1, URLs found on those pages are depth 2, and so on. URLs with a higher priority (a smaller number) are handed to the crawler threads first.
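To make that concrete, here is a condensed sketch of the model class with just the fields discussed above. The field names follow the getters and setters used elsewhere in this article; the real WebURL class also contains the domain/sub-domain handling logic, so treat this as an outline rather than the library's code:

public class WebURL {
    private String url;        // the canonical URL string itself
    private int docid;         // unique id assigned by DocIDServer
    private int parentDocid;   // doc id of the page on which this URL was found
    private String parentUrl;  // URL of that parent page
    private short depth;       // 0 for seeds, parent depth + 1 otherwise
    private byte priority;     // smaller value = handed to a crawler thread earlier

    // getters and setters omitted
}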


DocIDServer:

In addSeed(), setDocid() assigns the URL its unique id. By default ids auto-increment from 1: 1, 2, 3, 4, 5... These ids could be managed with the collection classes that ship with Java, but keeping them unique and efficient once they grow into the tens of millions is another matter, so crawler4j stores them in the BDB JE database mentioned earlier. There is a second reason as well: recoverability, i.e. the crawl can resume after the system restarts. I am not going to discuss that mode here, because crawler4j runs considerably slower with it enabled.

docIdServer.getDocId() checks whether a URL has already been stored; if it has not, docId = docIdServer.getNewDocID(canonicalUrl) obtains a new id. To see how docIdServer works, note that it is created in the crawl controller's constructor and handed the BDB JE Environment (for details on Environment, see the BDB JE link at the beginning of the article):

docIdServer = new DocIDServer(env, config);

The DocIDServer class is only responsible for managing URL ids. Its constructor:

public DocIDServer(Environment env, CrawlConfig config) throws DatabaseException {
    super(config);
    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(config.isResumableCrawling());
    dbConfig.setDeferredWrite(!config.isResumableCrawling());
    docIDsDB = env.openDatabase(null, "DocIDs", dbConfig);
    if (config.isResumableCrawling()) {
        int docCount = getDocCount();
        if (docCount > 0) {
            logger.info("Loaded " + docCount + " URLs that had been detected in previous crawl.");
            lastDocID = docCount;
        }
    } else {
        lastDocID = 0;
    }
}

This simply creates a database named DocIDs (as stated above, resumable crawling is not discussed here, so assume resumable is false here and below). The database stores the URL as the key and the id as the value. The uniqueness of the key guarantees that no URL is stored twice, and it also makes looking up an id by URL convenient.

Let's look at getDocId():

public int getDocId(String url) {
    synchronized (mutex) {
        if (docIDsDB == null) {
            return -1;
        }
        OperationStatus result;
        DatabaseEntry value = new DatabaseEntry();
        try {
            DatabaseEntry key = new DatabaseEntry(url.getBytes());
            result = docIDsDB.get(null, key, value, null);
            if (result == OperationStatus.SUCCESS && value.getData().length > 0) {
                return Util.byteArray2Int(value.getData());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return -1;
    }
}

Because the database is accessed from multiple threads, synchronized (mutex) is used to ensure thread safety. If the given URL is found as a key in the database, the corresponding id value is returned; otherwise -1 is returned, meaning the URL has not been seen.

public int getNewDocID(String url) {
    synchronized (mutex) {
        try {
            // Make sure that we have not already assigned a docid for this URL
            int docid = getDocId(url);
            if (docid > 0) {
                return docid;
            }
            lastDocID++;
            docIDsDB.put(null, new DatabaseEntry(url.getBytes()),
                         new DatabaseEntry(Util.int2ByteArray(lastDocID)));
            return lastDocID;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return -1;
    }
}

getNewDocID() generates a new id and stores it in the database together with the URL.
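The value stored for each URL is just the four-byte encoding of its id, so the Util helpers used in the two methods above are plain int/byte[] converters. A minimal sketch of what they presumably look like (crawler4j's actual Util class may be written differently, but the contract is the same):

import java.nio.ByteBuffer;

public final class Util {
    // Pack an int into four big-endian bytes, as stored in the DocIDs database.
    public static byte[] int2ByteArray(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Read the four bytes back into an int.
    public static int byteArray2Int(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }
}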

addUrlAndDocId() is used when you do not want an automatically generated id and want to specify one yourself. It is generally not recommended, unless you are re-running a crawl and want to reuse the same ids as before; but in that case you first have to look up the previous ids, which is inefficient and usually unnecessary.

DocIDServer mainly consists of these two methods. Its logic is simple and its responsibility is narrow.


Frontier

Back in the addSeed() method, the URL is added to the work queue; only after a URL is in the queue can a crawler thread pick it up and parse it.

Frontier has two important attributes: the counter Counters and the URL queue WorkQueues:

protected WorkQueues workQueues = new WorkQueues(env, "PendingURLsDB", config.isResumableCrawling());
protected Counters counters = new Counters(env, config);

Counters is easy to implement: it keeps its values in a HashMap. Currently only two counters are kept: the number of URLs added to the queue and the number of URLs already crawled.
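A minimal sketch of that counter idea (the names are illustrative; it is not the library's actual Counters class, which can additionally persist its values to JE when resumable crawling is enabled):

import java.util.HashMap;
import java.util.Map;

public class SimpleCounters {
    public static final String SCHEDULED_PAGES = "Scheduled-Pages";
    public static final String PROCESSED_PAGES = "Processed-Pages";

    private final Map<String, Long> counters = new HashMap<>();

    // Add 'addition' to the named counter, creating it on first use.
    public synchronized void increment(String name, long addition) {
        counters.merge(name, addition, Long::sum);
    }

    public synchronized long getValue(String name) {
        return counters.getOrDefault(name, 0L);
    }
}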

The URL queue WorkQueues stores the WebURLs that have been discovered but not yet assigned to a crawler thread. They are kept in BDB JE, in a database named PendingURLsDB:

public WorkQueues(Environment env, String dbName, boolean resumable) throws DatabaseException {
    this.env = env;
    this.resumable = resumable;
    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(resumable);
    dbConfig.setDeferredWrite(!resumable);
    urlsDB = env.openDatabase(null, dbName, dbConfig);
    webURLBinding = new WebURLTupleBinding();
}

A custom WebURLTupleBinding is used to serialize the attributes of a WebURL into JE. If you add an attribute to WebURL, for example the tag name of the element the link was found in (a, img, or iframe), you must also modify WebURLTupleBinding; otherwise the new attribute will not be stored in the database and will come back empty when a thread later pulls the URL out of the queue.
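As a sketch of the binding idea, assume we want to persist a hypothetical extra tag attribute (with matching getTag()/setTag() added to WebURL): every field written in objectToEntry() must be read back, in the same order, in entryToObject(). The exact field list of the real binding may differ, but the pattern is the standard BDB JE TupleBinding one:

import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.bind.tuple.TupleInput;
import com.sleepycat.bind.tuple.TupleOutput;

public class WebURLTupleBinding extends TupleBinding<WebURL> {

    @Override
    public WebURL entryToObject(TupleInput input) {
        WebURL webURL = new WebURL();
        webURL.setURL(input.readString());
        webURL.setDocid(input.readInt());
        webURL.setParentDocid(input.readInt());
        webURL.setParentUrl(input.readString());
        webURL.setDepth(input.readShort());
        webURL.setPriority(input.readByte());
        webURL.setTag(input.readString());   // hypothetical extra field
        return webURL;
    }

    @Override
    public void objectToEntry(WebURL url, TupleOutput output) {
        output.writeString(url.getURL());
        output.writeInt(url.getDocid());
        output.writeInt(url.getParentDocid());
        output.writeString(url.getParentUrl());
        output.writeShort(url.getDepth());
        output.writeByte(url.getPriority());
        output.writeString(url.getTag());    // hypothetical extra field
    }
}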

WorkQueues offers put, delete, and get methods to add, remove, and read entries. It uses a six-byte array as the key: the first byte is the WebURL's priority, the second byte is its depth, and the remaining four bytes are the WebURL's doc id converted to bytes; the value is whatever WebURLTupleBinding writes. Because BDB JE keeps records sorted by key, URLs with a smaller priority value and, within the same priority, a smaller depth are returned first.
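A sketch of how such a key can be assembled, reusing the Util.int2ByteArray helper seen in getNewDocID above (the actual key-building code in WorkQueues may differ in detail):

import com.sleepycat.je.DatabaseEntry;

// byte 0: priority, byte 1: depth, bytes 2-5: doc id
public static DatabaseEntry getKeyFor(WebURL url) {
    byte[] keyData = new byte[6];
    keyData[0] = url.getPriority();
    keyData[1] = (url.getDepth() > Byte.MAX_VALUE) ? Byte.MAX_VALUE : (byte) url.getDepth();
    System.arraycopy(Util.int2ByteArray(url.getDocid()), 0, keyData, 2, 4);
    return new DatabaseEntry(keyData);
}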

Regarding priority, crawler4j has a small bug: the default value of WebURL's priority field is 0, which makes it impossible to give any URL a higher priority than the default so that it is crawled first. The fix is to assign a sensible non-zero default to priority in WebURL's constructor or in setURL(); which value to use is up to you.
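One possible fix, sketched here rather than taken from upstream: give priority a non-zero default in the constructor (or in setURL()), so that a URL you explicitly set to a smaller value really is picked up first.

public WebURL() {
    // Hypothetical default: anything greater than 0 leaves room for
    // "more important" URLs to be scheduled ahead of ordinary ones.
    this.priority = Byte.MAX_VALUE;
}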


Frontier provides two methods for adding URLs to the queue:

public void scheduleAll(List<WebURL> urls) {
    int maxPagesToFetch = config.getMaxPagesToFetch();
    synchronized (mutex) {
        int newScheduledPage = 0;
        for (WebURL url : urls) {
            if (maxPagesToFetch > 0 && (scheduledPages + newScheduledPage) >= maxPagesToFetch) {
                break;
            }
            try {
                workQueues.put(url);
                newScheduledPage++;
            } catch (DatabaseException e) {
                logger.error("Error while puting the url in the work queue.");
            }
        }
        if (newScheduledPage > 0) {
            scheduledPages += newScheduledPage;
            counters.increment(Counters.ReservedCounterNames.SCHEDULED_PAGES, newScheduledPage);
        }
        synchronized (waitingList) {
            waitingList.notifyAll();
        }
    }
}

public void schedule(WebURL url) {
    int maxPagesToFetch = config.getMaxPagesToFetch();
    synchronized (mutex) {
        try {
            if (maxPagesToFetch < 0 || scheduledPages < maxPagesToFetch) {
                workQueues.put(url);
                scheduledPages++;
                counters.increment(Counters.ReservedCounterNames.SCHEDULED_PAGES);
            }
        } catch (DatabaseException e) {
            logger.error("Error while puting the url in the work queue.");
        }
    }
}
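A hypothetical call site for the two methods above (the names page and extractLinks() are illustrative; in crawler4j the real batch call happens while a fetched page is being processed):

// Single URL, e.g. a seed coming from addSeed():
frontier.schedule(webUrl);

// A whole batch, e.g. the out-links discovered on a fetched page:
List<WebURL> outgoingUrls = extractLinks(page);   // hypothetical link extractor
frontier.scheduleAll(outgoingUrls);               // enqueue + counter update in one call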

These two methods let you add either a single URL or a whole batch, and the counter is updated at the same time the URLs are enqueued. Each piece of logic has its own implementation class to keep responsibilities separate; Frontier just combines them, so external code only ever needs to call Frontier. Frontier's other important method pulls data out of the queue; several URLs can be fetched in one call:

public void getNextURLs(int max, List<WebURL> result) {
    while (true) {
        synchronized (mutex) {
            if (isFinished) {
                return;
            }
            try {
                List<WebURL> curResults = workQueues.get(max);
                workQueues.delete(curResults.size());
                if (inProcessPages != null) {
                    for (WebURL curPage : curResults) {
                        inProcessPages.put(curPage);
                    }
                }
                result.addAll(curResults);
            } catch (DatabaseException e) {
                logger.error("Error while getting next urls: " + e.getMessage());
                e.printStackTrace();
            }
            if (result.size() > 0) {
                return;
            }
        }
        try {
            synchronized (waitingList) {
                waitingList.wait();
            }
        } catch (InterruptedException ignored) {
            // Do nothing
        }
        if (isFinished) {
            return;
        }
    }
}

Each crawler thread calls this method to fetch up to 50 URLs at a time. Once fetched, the URLs are removed from the queue (the workQueues.delete call above) and the thread starts parsing them, calling the method again when it is done. If the queue is empty, the thread waits on waitingList inside this method, and other threads line up at the synchronized block, until scheduleAll() is called and its notifyAll() wakes them up again.
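A sketch of the consumer side just described (the real loop lives in crawler4j's WebCrawler; processPage() and the isFinished() accessor are assumed names here, not confirmed API):

List<WebURL> assignedURLs = new ArrayList<>(50);
while (true) {
    assignedURLs.clear();
    // Blocks on waitingList inside getNextURLs() when the queue is empty.
    frontier.getNextURLs(50, assignedURLs);
    if (assignedURLs.isEmpty()) {
        if (frontier.isFinished()) {   // assumed accessor for the isFinished flag
            break;                     // crawl is over, let the thread exit
        }
    } else {
        for (WebURL curURL : assignedURLs) {
            processPage(curURL);       // hypothetical per-page handler
        }
    }
}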


That concludes the code walk-through of how crawler4j stores and allocates URLs. All the classes involved live in the edu.uci.ics.crawler4j.frontier package, which also contains an InProcessPagesDB class used for resumable crawling.
