The basic operation of a Web crawler is to capture web pages. So how do we fetch exactly the page we want? This section explains how to capture a web page given its URL and provides an example of doing so in Java. Finally, we discuss an important issue in the capture process: how to handle HTTP status codes.
1.1.1 A Deep Understanding of URLs
In fact, capturing a web page is the same process as browsing one in the IE browser. For example, you open a browser and enter the address of the Lietu search site, as shown in Figure 1.1.
Figure 1.1 Browsing a web page with a browser
The act of "opening" the page really means that the browser, acting as a client, sends a request to the server, "fetches" the server's file to the local machine, and then interprets and renders it. You can also view the source code of the file the browser fetched: select the "View" | "Source" command, and the source code of the file "crawled" from the server appears, as shown in Figure 1.2.
Figure 1.2 Source code of the page in the browser
In the example above, the string entered in the browser's address bar is called a URL. So what is a URL? Intuitively, it is a string such as http://www.lietu.com typed into the browser. Next we look at URL-related concepts in more depth.
Before understanding URLs, you must first understand the concept of a URI. What is a URI? Every kind of resource available on the Web, such as an HTML document, an image, a video clip, or a program, is located by a Universal Resource Identifier (URI).
A URI generally consists of three parts: ① the naming mechanism (scheme) for accessing the resource; ② the host name of the machine storing the resource; ③ the name of the resource itself, given as a path. Consider the following URI: http://www.webmonkey.com.cn/html/html40/
We can read it as follows: this is a resource that can be accessed through the HTTP protocol; it is located on the host www.webmonkey.com.cn and is reached through the path "/html/html40/".
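The three URI parts just described can be read back programmatically. The following is a minimal sketch using the standard java.net.URI class on the example URI from the text (the class name UriParts is ours, not from any library):

```java
import java.net.URI;

public class UriParts {
    // Split a URI string into the three parts described above:
    // scheme (naming mechanism), host, and path
    public static String[] parts(String s) throws Exception {
        URI uri = new URI(s);
        return new String[] { uri.getScheme(), uri.getHost(), uri.getPath() };
    }

    public static void main(String[] args) throws Exception {
        String[] p = parts("http://www.webmonkey.com.cn/html/html40/");
        System.out.println("scheme = " + p[0]); // the naming mechanism, "http"
        System.out.println("host   = " + p[1]); // the host storing the resource
        System.out.println("path   = " + p[2]); // the path naming the resource
    }
}
```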
A URL is a subset of URI; the term is the abbreviation of Uniform Resource Locator. Generally speaking, a URL is a string that describes an information resource on the Internet. URLs are used mainly by WWW client and server programs, most famously Mosaic. A URL describes files, server addresses, directories, and other information resources in a unified format. The URL format consists of three parts:
The first part is the protocol (or service scheme).
The second part is the host name or IP address (sometimes including the port number) of the machine holding the resource.
The third part is the specific address of the resource on the host, such as the directory and file name.
The first and second parts are separated by the "://" symbol, and the second and third parts are separated by the "/" symbol. The first and second parts are required; the third part can sometimes be omitted.
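The same three parts can be recovered from a URL with the standard java.net.URL class. A hedged sketch, using the People's Daily address as an example:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://www.peopledaily.com.cn/channel/welcome.htm");
        System.out.println("protocol = " + url.getProtocol()); // first part
        System.out.println("host     = " + url.getHost());     // second part
        System.out.println("port     = " + url.getPort());     // -1 when the port is omitted
        System.out.println("file     = " + url.getFile());     // third part
    }
}
```

Note that getPort() returns -1 when no explicit port is present in the URL, reflecting that the port is optional.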
Based on the URL definition, here are examples of two common URL schemes for reference.
1. HTTP URL examples
The Hypertext Transfer Protocol (HTTP) provides access to hypertext information services. Example: http://www.peopledaily.com.cn/channel/welcome.htm
Its host domain name is www.peopledaily.com.cn, and the hypertext file (of type .htm) is welcome.htm in the directory /channel. This is a server of China's People's Daily.
Example: http://www.rol.cn.net/talk/talk1.htm
The host domain name is www.rol.cn.net, and the hypertext file (of type .htm) is talk1.htm in the directory /talk. This is the address of the Ruide chat room; it takes you into room 1 of the chat room.
2. File URLs
When a file is represented as a URL, the scheme file is used, followed by the host name, the access path to the file (i.e., the directory), and the file name. Sometimes the directory and file name can be omitted, but the "/" symbols cannot. Example: file://ftp.yoyodyne.com/pub/files/foobar.txt
The URL above refers to the file foobar.txt in the directory /pub/files/ on the host ftp.yoyodyne.com. Example: file://ftp.yoyodyne.com/pub
This refers to the directory /pub on the host ftp.yoyodyne.com. Example: file://ftp.yoyodyne.com/
This refers to the root directory of the host ftp.yoyodyne.com.
The URL is the most important object a crawler processes: the crawler obtains file content based on URLs and processes it further. An accurate understanding of URLs is therefore essential to understanding web crawlers. Starting in the next section, we describe in detail how to obtain web page content from a URL address.
1.1.2 Capturing Web Page Content from a Specified URL
The previous section described the composition of a URL in detail. This section describes how to capture a web page given a URL.
Capturing a web page means reading the network resource specified by the URL from a network stream and saving it locally. It is similar to using a program to simulate the behavior of the IE browser: the URL is sent to the server as the content of an HTTP request, and the server's response resource is then read.
Java is a network-oriented programming language. It treats a network resource like a file, so accessing network resources is as convenient as accessing local files: requests and responses are encapsulated as streams. We can therefore obtain a response stream for a resource and then read data from the stream byte by byte. For example, the java.net.URL class can send a request to the corresponding Web server and obtain the response document. The class has a constructor that takes a URL address string as its parameter and constructs a URL object:
URL pageURL = new URL(path);
With the URL object obtained, you can then open a network stream and operate on the network resource as if it were a local file:
InputStream stream = pageURL.openStream();
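Putting the two statements together, here is a minimal, runnable sketch that reads a resource through java.net.URL byte by byte. A file: URL is used so the example works without network access; an http: URL is handled the same way. The class and method names here are ours, for illustration only:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.net.URL;

public class UrlFetch {
    // Read the entire resource behind a URL into a String, byte by byte
    public static String fetch(URL pageURL) throws IOException {
        InputStream stream = pageURL.openStream();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = stream.read()) != -1) {
            buffer.write(b);
        }
        stream.close();
        return buffer.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Write a small local file and fetch it back through its file: URL
        File tmp = File.createTempFile("page", ".html");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) {
            w.write("<html>hello</html>");
        }
        System.out.println(fetch(tmp.toURI().toURL()));
    }
}
```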
In actual projects, the network environment is complex, so simulating the work of an IE client with only the APIs in the java.net package takes a very large amount of code: you need to process the status codes returned by HTTP, set up HTTP proxies, handle HTTPS, and so on. To simplify application development, the open-source Apache HTTP client HttpClient is often used in practice. It handles the various issues of HTTP connections thoroughly and is very convenient to use: you only need to add the HttpClient jar to your project to fetch web page content the way IE does. For example:
// Create a client, similar to opening a browser
HttpClient httpClient = new HttpClient();
// Create a GET method, similar to typing an address into the browser's address bar
GetMethod getMethod = new GetMethod("http://www.blablabla.com");
// Execute the request, similar to pressing Enter; the response status code is returned
int statusCode = httpClient.executeMethod(getMethod);
// Print the response body; headers, cookies, and more are also available
System.out.println("response = " + getMethod.getResponseBodyAsString());
// Release the connection
getMethod.releaseConnection();
The sample code above shows a request/response round trip with HttpClient. The first statement creates a client, which is equivalent to opening a browser. The second creates a GET method for http://www.blablabla.com, like typing the address into the address bar. The third executes the request and obtains the response status. The call getMethod.getResponseBodyAsString() retrieves the returned content as a string; this is the content a web page capture needs. In this example the content is simply printed, but in actual projects it is usually written to a local file and saved. Finally, the connection is released to avoid tying up resources.
This example accesses the Web resource with the GET method. A GET request generally passes its parameters to the server as part of the URL. However, the HTTP protocol limits the length of the URL string, so only a limited number of parameters can be passed this way. To avoid this problem, the POST method is often used for HTTP requests instead. The HttpClient package also supports POST. For example:
// A client as in the previous example
HttpClient httpClient = new HttpClient();
// Create a POST method
PostMethod postMethod = new PostMethod("http://www.saybot.com/postme");
// Pass the parameters as an array of name/value pairs
NameValuePair[] postData = new NameValuePair[2];
// Set the parameters
postData[0] = new NameValuePair("Weapon", "gun");
postData[1] = new NameValuePair("What gun", "shengun");
postMethod.addParameters(postData);
// Execute the request; the response status code is returned
int statusCode = httpClient.executeMethod(postMethod);
// Print the response body; headers, cookies, and more are also available
System.out.println("response = " + postMethod.getResponseBodyAsString());
// Release the connection
postMethod.releaseConnection();
The preceding example accesses a Web resource with the POST method. Unlike GET, POST sets its parameters through NameValuePair objects, so there is effectively no limit on the number of parameters. GET writes the parameters into the URL, and because the URL has a length limit, the total length of the transmitted parameters is limited too.
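To see concretely why GET parameters are bound to the URL: they must be URL-encoded and appended as a query string. A small sketch with the standard java.net.URLEncoder, reusing the parameter names from the POST example above (the helper class GetQuery is ours):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class GetQuery {
    // Build a URL-encoded query string from two name/value pairs,
    // the form they would take in a GET request URL
    public static String query(String n1, String v1, String n2, String v2)
            throws UnsupportedEncodingException {
        return URLEncoder.encode(n1, "UTF-8") + "=" + URLEncoder.encode(v1, "UTF-8")
                + "&" + URLEncoder.encode(n2, "UTF-8") + "=" + URLEncoder.encode(v2, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Spaces become '+' in the encoded form
        System.out.println("http://www.saybot.com/postme?"
                + query("Weapon", "gun", "What gun", "shengun"));
        // prints http://www.saybot.com/postme?Weapon=gun&What+gun=shengun
    }
}
```

Everything after the "?" counts toward the URL's length, which is why large or numerous parameters are better sent by POST.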
Sometimes the machine running the crawler cannot access Web resources directly and must go through an HTTP proxy server. HttpClient has good support for proxy servers as well. For example:
// Create an HttpClient, equivalent to opening a browser
HttpClient httpClient = new HttpClient();
// Specify the IP address and port of the proxy server
httpClient.getHostConfiguration().setProxy("192.168.0.1", 9527);
// Tell HttpClient to use preemptive authentication; otherwise the request
// will be rejected as unauthorized
httpClient.getParams().setAuthenticationPreemptive(true);
// MyProxyCredentialsProvider returns the proxy's credentials (username/password)
httpClient.getParams().setParameter(CredentialsProvider.PROVIDER,
        new MyProxyCredentialsProvider());
// Set the username and password for the proxy server
httpClient.getState().setProxyCredentials(
        new AuthScope("192.168.0.1", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
        new UsernamePasswordCredentials("username", "password"));
The example above shows in detail how to configure a proxy server with HttpClient. If your LAN requires a proxy server to access Web resources, you can adapt the settings in the code above.
This section described how to use HttpClient to capture the content of a web page. Next, a complete example illustrates how to obtain a web page.
1.1.3 A Java Web Page Capture Example
In this section we write a working web page capture example based on the material covered so far; it summarizes the content of the previous section. The code is as follows:
public class RetrivePage {
    private static HttpClient httpClient = new HttpClient();

    // Set the proxy server
    static {
        // Set the proxy server's IP address and port
        httpClient.getHostConfiguration().setProxy("172.17.18.84", 8080);
    }

    public static boolean downloadPage(String path) throws HttpException, IOException {
        InputStream input = null;
        OutputStream output = null;
        // Create a POST method
        PostMethod postMethod = new PostMethod(path);
        // Set the parameters
        NameValuePair[] postData = new NameValuePair[2];
        postData[0] = new NameValuePair("name", "lietu");
        postData[1] = new NameValuePair("password", "*****");
        postMethod.addParameters(postData);
        // Execute the request; the status code is returned
        int statusCode = httpClient.executeMethod(postMethod);
        // Process the status code (for simplicity, only 200 is handled here)
        if (statusCode == HttpStatus.SC_OK) {
            input = postMethod.getResponseBodyAsStream();
            // Derive the file name from the URL
            String filename = path.substring(path.lastIndexOf('/') + 1);
            // Open an output stream to the file
            output = new FileOutputStream(filename);
            // Copy the response to the file byte by byte
            int tempByte = -1;
            while ((tempByte = input.read()) != -1) {
                output.write(tempByte);
            }
            // Close the input/output streams
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
            return true;
        }
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Capture the Lietu home page and save it locally
        try {
            RetrivePage.downloadPage("http://www.lietu.com/");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The example above crawls the Lietu home page; it is a simple page capture example. Because the Internet is complex, a real page capture program must consider many more issues, such as the resource name, the resource type, and the status code. The most important of these is handling the various returned status codes, which the next section describes.
1.1.4 Processing HTTP Status Codes
The previous section showed HttpClient accessing Web resources and obtaining an HTTP status code, for example in the following statement:
int statusCode = httpClient.executeMethod(getMethod); // press Enter: the response status code is returned
The HTTP status code indicates the status of the response returned under the HTTP protocol. For example, when a client sends a request to the server and the requested resource is obtained successfully, the status code is 200, indicating a successful response. If the requested resource does not exist, error 404 is usually returned.
HTTP status codes fall into 5 classes. Each code is a three-digit integer whose first digit, 1 through 5, identifies the class. 1XX codes are generally experimental. This section describes common status codes in the 2XX, 3XX, 4XX, and 5XX classes, as shown in Table 1.1.
Table 1.1 Common HTTP status codes and how a crawler handles them

| Status code | Description | Handling |
| --- | --- | --- |
| 200 | Request successful | Obtain the response content and process it |
| 201 | The request succeeded and a new resource was created; the URI of the new resource can be read from the response object | Not normally encountered by crawlers |
| 202 | Request accepted, but processing not yet completed | Block and wait |
| 204 | The server fulfilled the request but has no new information to return; if the client is a user agent, it need not update its document view | Discard |
| 300 | Not used directly by HTTP/1.0 applications; serves as the default meaning for 3XX responses. Multiple versions of the requested resource are available | Process in the program if possible; otherwise discard |
| 301 | The requested resource has been assigned a permanent URL, through which it can be accessed in the future | Redirect to the assigned URL |
| 302 | The requested resource temporarily resides at a different URL | Redirect to the temporary URL |
| 304 | The requested resource has not been modified | Discard |
| 400 | Illegal request | Discard |
| 401 | Unauthorized | Discard |
| 403 | Forbidden | Discard |
| 404 | Not found | Discard |
| 5XX | A status code starting with "5" means the server encountered an error and cannot continue processing the request | Discard |
When the returned status code is 5XX, an error has occurred on the application server; the crawler can handle this simply by discarding the request.
When the returned status code is 3XX, a redirect is usually required. The following is a redirection code snippet; you can integrate it with the code from the previous section yourself:
// If the status code indicates a redirect, perform the redirection
if ((statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
        || (statusCode == HttpStatus.SC_MOVED_PERMANENTLY)
        || (statusCode == HttpStatus.SC_SEE_OTHER)
        || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
    // Read the new URL from the Location header
    Header header = postMethod.getResponseHeader("location");
    if (header != null) {
        String newUrl = header.getValue();
        if ((newUrl == null) || newUrl.equals("")) {
            newUrl = "/";
        }
        // Redirect with a new POST request
        PostMethod redirect = new PostMethod(newUrl);
        // Send the request and continue processing
        // ......
    }
}
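One detail worth noting: the Location header may contain a relative URL. The standard java.net.URL class can resolve it against the URL of the original request. A small sketch (the host names are illustrative only):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveRedirect {
    // Resolve a possibly relative redirect target against the request URL
    public static String resolve(String requestUrl, String location)
            throws MalformedURLException {
        URL base = new URL(requestUrl);
        return new URL(base, location).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // An absolute path replaces the old path on the same host
        System.out.println(resolve("http://www.lietu.com/talk/talk1.htm", "/index.html"));
        // prints http://www.lietu.com/index.html
        // An absolute URL is taken as-is
        System.out.println(resolve("http://www.lietu.com/", "http://www.example.com/a.htm"));
        // prints http://www.example.com/a.htm
    }
}
```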
When the response status code is 2XX, according to Table 1.1 we only need to handle 200 and 202; the other values require no further processing. Status code 200 means success, and the page can be captured directly, for example:
// Process a 200 status code
if (statusCode == HttpStatus.SC_OK) {
    input = postMethod.getResponseBodyAsStream();
    // Derive the file name from the URL
    String filename = path.substring(path.lastIndexOf('/') + 1);
    // Open an output stream to the file
    output = new FileOutputStream(filename);
    // Copy the response to the file byte by byte
    int tempByte = -1;
    while ((tempByte = input.read()) != -1) {
        output.write(tempByte);
    }
}
A 202 response status code means that the request has been accepted but the server has not finished processing it; the client can wait for the server to process it further.
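Putting the rules of this section together, the crawler's status-code handling can be sketched as one small decision function. This is a simplified illustration in plain Java, not part of HttpClient: 202's "block and wait" is folded into DISCARD here for brevity, and the class and enum names are ours:

```java
public class StatusHandler {
    enum Action { PROCESS, REDIRECT, DISCARD }

    // Simplified decision following Table 1.1: fetch a 200 response,
    // follow 3XX redirects (except 304, which is discarded), drop the rest
    public static Action decide(int statusCode) {
        if (statusCode == 200) {
            return Action.PROCESS;
        }
        if (statusCode >= 300 && statusCode < 400 && statusCode != 304) {
            return Action.REDIRECT;
        }
        return Action.DISCARD;
    }

    public static void main(String[] args) {
        System.out.println(decide(200)); // PROCESS
        System.out.println(decide(301)); // REDIRECT
        System.out.println(decide(304)); // DISCARD
        System.out.println(decide(404)); // DISCARD
        System.out.println(decide(500)); // DISCARD
    }
}
```

A fuller version would also wait and retry on 202, as the table suggests.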