Implementing a "Web Spider" in Java


Brief introduction

"Web Spider" or "web crawler", is a kind of access to the site and track links to the program, through it, can quickly draw a Web site contains information on the page map. This article mainly describes how to use Java programming to build a "spider", we will first in a reusable spider class wrapper a basic "spider", and in the sample program to show how to create a specific "spider" to scan the relevant sites and find dead links.

Java is well suited to building a spider. It has built-in support for HTTP, the protocol that carries most web page traffic, and it also includes an HTML parser. These two features make Java the language of choice for the spider built in this article.

Use "Spider"

The example program shown in Example 1 scans a web site and looks for dead links. To use it, enter a URL and click the "Begin" button; once the program starts, the "Begin" button becomes a "Cancel" button. While the program scans the site, its progress is displayed below the "Cancel" button: the page currently being checked, together with running counts of good links and dead links. Dead links are listed in the scrolling text box at the bottom of the window. Clicking "Cancel" stops the scan so that a new URL can be entered; if you do not click "Cancel", the program runs until it has examined every page, after which the button reverts to "Begin" to indicate that the program has stopped.

The following shows how the sample program interacts with the reusable Spider class. The example is contained in the CheckLinks class of Example 1, which implements the ISpiderReportable interface shown in Example 2; it is through this interface that the spider communicates with the program that created it. The interface defines three methods. The first, spiderFoundURL, is called each time the spider locates a URL; if it returns true, the spider should follow the link and look for further links. The second, spiderURLError, is called whenever a URL produces an error (such as "404 Page Not Found"). The third, spiderFoundEMail, is called each time an e-mail address is found. Through these three methods the Spider class reports its findings back to the program that created it.
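As a rough sketch of what that contract looks like (the exact parameter lists in Example 2 may differ; the signatures below are assumptions consistent with the description above), the interface could be declared like this:

import java.net.URL;

// Callback interface through which the spider reports back to the program that created it.
public interface ISpiderReportable {
  // Called for each URL found; return true if the spider should follow the link.
  boolean spiderFoundURL(URL base, URL url);

  // Called when a URL produces an error, for example "404 Page Not Found".
  void spiderURLError(URL url);

  // Called for each e-mail address found in a page.
  void spiderFoundEMail(String email);
}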

The spider starts working when its begin method is called. Because the spider runs on a separate thread, the program can keep redrawing its user interface while the scan is in progress. Clicking the "Begin" button starts this background thread, which invokes the run method of the CheckLinks class. The run method begins by instantiating a Spider object, as follows:

spider = new Spider(this);
spider.clear();
base = new URL(url.getText());
spider.addURL(base);
spider.begin();

First, a new Spider object is instantiated. The Spider constructor takes an ISpiderReportable object; because the CheckLinks class implements the ISpiderReportable interface, the current object (referred to by the keyword this) is simply passed to the constructor. Second, the spider maintains a list of the URLs it has visited; the clear method is called to make sure this list is empty when the program starts. Before the spider can run, a URL must be added to its to-do list, and the URL entered by the user is the first one added: the program starts by scanning that page and finds other pages linked from it. Finally, the begin method is called to start the spider; this method does not return until the spider has finished its work or the user has cancelled it.
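For context, one way the surrounding run method could be fleshed out is sketched below. The Spider class is the one described in Example 3, while setStatus and the exact error handling are illustrative assumptions rather than the article's code:

// Sketch of CheckLinks.run(); needs java.net.URL and java.net.MalformedURLException.
public void run() {
  try {
    spider = new Spider(this);        // CheckLinks implements ISpiderReportable
    spider.clear();                   // start with an empty URL list
    base = new URL(url.getText());    // the URL typed by the user
    spider.addURL(base);              // seed the to-do list
    spider.begin();                   // blocks until finished or cancelled
  } catch ( MalformedURLException e ) {
    setStatus("Bad address.");        // hypothetical status-line helper
  }
}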

When "Spider" Runs, you can invoke the three methods implemented by the "Ispiderreportable" interface to report the current state of the program, most of the work of the program is done by the "Spiderfoundurl" method, when the "spider" discovers a new URL, It first checks if it works, and if the URL causes an error, it will be treated as a dead link, and if the link is valid, it will continue to check if it is on a different server, and if the link is on the same server, "Spiderfoundurl" returns True, indicating "spider" You should keep track of this URL and find other links, and if the link is on another server, you won't be able to scan for any other links, because it will cause the spider to constantly surf the internet for more and more sites, so the sample program will only look up links on the user-specified Web site.

Constructing the Spider class

The previous section showed how to use the Spider class; its code is given in Example 3. With the Spider class and the ISpiderReportable interface, spider functionality can easily be added to a program. This section explains how the Spider class works.

The Spider class must keep track of the URLs it has visited, so that it never visits the same URL more than once. To do this, it divides URLs into three groups. The first group is stored in the workloadWaiting attribute and contains the list of unprocessed URLs; the first URL the spider visits also starts here. The second group is stored in workloadProcessed and holds the URLs the spider has already handled and need not visit again. The third group is stored in workloadError and contains the URLs that produced errors.
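A minimal sketch of how these three groups might be declared inside the Spider class is shown below; the choice of HashSet and generics is an assumption, since any collection that supports isEmpty, toArray, add and remove would work with the loop shown in the next section:

import java.net.URL;
import java.util.Collection;
import java.util.HashSet;

// Skeleton of the Spider class showing only the three URL workloads and their accessors.
public class Spider {
  protected Collection<URL> workloadWaiting   = new HashSet<URL>(); // found but not yet processed
  protected Collection<URL> workloadProcessed = new HashSet<URL>(); // already visited, do not revisit
  protected Collection<URL> workloadError     = new HashSet<URL>(); // produced an error

  public Collection<URL> getWorkloadWaiting()   { return workloadWaiting; }
  public Collection<URL> getWorkloadProcessed() { return workloadProcessed; }
  public Collection<URL> getWorkloadError()     { return workloadError; }
}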

The begin method contains the main loop of the Spider class. It repeatedly iterates over workloadWaiting and processes each page in it. Of course, processing those pages is likely to add further URLs to workloadWaiting, so the begin method continues until either the Spider class's cancel method is called or no URLs are left in workloadWaiting. The loop looks like this:

cancel = false;
while ( !getWorkloadWaiting().isEmpty() && !cancel ) {
  Object list[] = getWorkloadWaiting().toArray();
  for ( int i=0; (i<list.length) && !cancel; i++ )
    processURL((URL)list[i]);
}

As this code iterates over workloadWaiting, it hands each URL that needs processing to the processURL method, which is the method that actually reads and parses the HTML found at that URL.

Reading and parsing HTML

Java has built-in support both for reading the content of a URL and for parsing HTML, and this is exactly what the processURL method does. Reading URL content in Java is relatively straightforward; the following excerpt from processURL opens a connection and skips any content that is not text (such as images), since only text pages can be parsed as HTML:

URLConnection connection = url.openConnection();
// Only text content (such as HTML) is worth parsing; skip images and other binary data.
if ( (connection.getContentType()!=null) &&
     !connection.getContentType().toLowerCase()
       .startsWith("text/") ) {
  getWorkloadWaiting().remove(url);
  getWorkloadProcessed().add(url);
  log("Not processing because content type is: " +
      connection.getContentType() );
  return;
}
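The excerpt above only filters out non-text content; the actual parsing of the returned HTML is not shown here. As a self-contained sketch of how the link-extraction step could be done with the HTML parser that ships with Java (Swing's HTMLEditorKit.ParserCallback, driven here through ParserDelegator rather than whatever parser wrapper Example 3 uses), consider:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Minimal sketch: read one page and print every href found in its <a> tags.
public class ParseSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.example.com/");                  // placeholder starting page
    Reader reader = new InputStreamReader(url.openStream());

    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if ( tag == HTML.Tag.A ) {
          Object href = attrs.getAttribute(HTML.Attribute.HREF);   // the link target, if any
          if ( href != null )
            System.out.println("Found link: " + href);
        }
      }
    };

    new ParserDelegator().parse(reader, callback, true);           // true: ignore charset changes
    reader.close();
  }
}

Inside the Spider class, the callback would presumably resolve each href against the current page with new URL(base, href) and pass the result to spiderFoundURL before adding it to workloadWaiting.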
