Java Crawler Technology: HttpClient Learning Notes

Last Update:2018-04-20 Source: Internet

Author: User



Section one, HttpClient

I. Introduction to HttpClient

The Hypertext Transfer Protocol (HTTP) is arguably the most important protocol used on the internet today, and more and more Java applications need to access network resources directly over HTTP.

Although the java.net package of the JDK provides basic support for HTTP, it lacks the functionality and flexibility that most applications need.

HttpClient started as a sub-project of Apache Jakarta Commons and is now part of the Apache HttpComponents project. It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it supports the latest versions and recommendations of the protocol.

Official site: http://hc.apache.org/

Latest version: http://hc.apache.org/httpcomponents-client-4.5.x/

Official documentation: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

II. Maven dependency

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <!-- add a <version> element for the 4.5.x release in use -->
</dependency>

III. A HelloWorld implementation with HttpClient

package com.guo.httpclient;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld {

    public static void main(String args[]) {
        // Create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create an HttpGet instance
        HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");
        CloseableHttpResponse response = null; // holds the returned message
        try {
            // Execute the HTTP GET request
            response = httpClient.execute(httpGet);
        } catch (ClientProtocolException e) { // HTTP protocol exception
            e.printStackTrace();
        } catch (IOException e) { // IO exception
            e.printStackTrace();
        }
        if (response != null) { // stays null if the request failed above
            // Get the returned entity
            HttpEntity entity = response.getEntity();
            try {
                // Get the web page content
                System.out.println("Get Web Content" + EntityUtils.toString(entity, "utf-8"));
            } catch (ParseException e) { // parsing exception
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        try {
            httpClient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This section sends a direct request without simulating a browser, so some sites cannot be crawled this way.

The following section addresses this issue.

If you are crawling a domestic web site, you can remove the try/catch blocks from the code above and simply let the method throw the exception.

Section two, simulating a browser when crawling (using Firefox as the example)

Some sites have anti-crawling measures in place, so the direct request from the previous section runs into problems.

The request then has to simulate a browser.

One: Set the User-Agent request header to simulate a browser

1. Request header information

Open a website (www.tuicool.com is used as the example here), press F12, and click the Network tab.

Look at the request header information: this is what the browser sends to the target server, and the server uses it to identify the client.

How do I simulate a browser?

Call the setHeader method on the HttpGet instance to set the header information that is sent to the target server.
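A minimal sketch of this step, assuming HttpClient 4.5.x. The class name BrowserRequest, its helper methods, and the Firefox User-Agent string are illustrative assumptions rather than the original author's code; the real User-Agent value should be copied from your own browser's Network tab.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

class BrowserRequest {

    // Example Firefox User-Agent (an assumption; copy the value your own browser sends).
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0";

    // Build a GET request that carries a browser-like User-Agent header.
    static HttpGet newBrowserGet(String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", USER_AGENT);
        return httpGet;
    }

    // Send the request and return the page content.
    // try-with-resources closes the response first, then the client.
    static String fetch(String url) throws IOException {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(newBrowserGet(url))) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }
}
```

Calling BrowserRequest.fetch("http://www.tuicool.com/") would send the request with the browser-like header. The sketch declares the exception with throws instead of catching it, which keeps the example short.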

Two: Get the response content type (Content-Type)

Call the getContentType method on the HttpEntity to get the content type of the response. It comes back as a key-value pair (a header), so call getValue on it to get the value.

Here the line that prints the page content is commented out, so only the response content type is output.

The result shows the content type together with its encoding.

In this case the encoding is utf-8; some responses carry no encoding at all, depending on how the server is set up.

Why do we want the type of the response content?

Because a crawl collects a great many links, and the content type lets us filter out the irrelevant ones.
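As a sketch of that filtering step, the value returned by entity.getContentType().getValue() can be checked before a link is processed any further. The class ContentTypeFilter and its method names are hypothetical, and keeping only text/html pages is an assumption chosen for illustration.

```java
import java.util.Locale;

class ContentTypeFilter {

    // Return the MIME type part of a Content-Type value,
    // e.g. "text/html; charset=utf-8" becomes "text/html".
    static String mimeType(String contentTypeValue) {
        if (contentTypeValue == null) {
            return "";
        }
        int semicolon = contentTypeValue.indexOf(';');
        String mime = semicolon >= 0
                ? contentTypeValue.substring(0, semicolon)
                : contentTypeValue;
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    // Keep only HTML pages; pictures, scripts, and other resources are filtered out.
    static boolean isHtml(String contentTypeValue) {
        return "text/html".equals(mimeType(contentTypeValue));
    }
}
```

The encoding part after the semicolon is ignored here, which matches the observation that some servers send it and some do not.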

Three: Get the response status

200: normal; 403: request rejected (forbidden); 500: server error; 404: page not found

The previous responses all went through normally, with a response status of 200.

To get the response status, call the getStatusLine method on the CloseableHttpResponse instance.

Here we only need to know whether the status is 200, so chain a getStatusCode call to get just the status code.
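A minimal sketch of the status check, assuming HttpClient 4.5.x. The class StatusCheck and its method names are hypothetical; the HttpClient pieces used are getStatusLine, getStatusCode, and the HttpStatus.SC_OK constant.

```java
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;

class StatusCheck {

    // Extract just the numeric code from the status line (for "HTTP/1.1 200 OK" this is 200).
    static int statusCode(HttpResponse response) {
        return response.getStatusLine().getStatusCode();
    }

    // Only a 200 response is worth parsing; 403, 404, 500 and the rest are skipped.
    static boolean isOk(HttpResponse response) {
        return statusCode(response) == HttpStatus.SC_OK;
    }
}
```

The response returned by httpClient.execute(httpGet) can be passed straight in, because CloseableHttpResponse is an HttpResponse.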

Section three, grabbing pictures with HttpClient

First use getContentType to check the type of the response.

For a picture, the result is an image type such as image/jpeg.

Now save this picture locally (it could also be stored on a server).

The HttpEntity has a getContent method that returns an InputStream. First check that the entity is not null, then get the input stream from it.

Once we have the InputStream instance, how do we copy the picture to a local file?

The traditional way, copying the stream by hand with plain java.io code, is omitted here.

The simpler way is to use Apache Commons IO, which wraps this up.

First add the jar to the project through Maven, then write the copy code.
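A sketch of the Commons IO approach, assuming the commons-io artifact is on the classpath. The class ImageSaver and its method name are hypothetical; FileUtils.copyInputStreamToFile is the Commons IO call that performs the copy.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;

class ImageSaver {

    // Copy the response stream into the target file.
    // copyInputStreamToFile creates missing parent directories, overwrites an
    // existing file, and closes the input stream when it finishes.
    static void save(InputStream inputStream, File target) throws IOException {
        if (inputStream == null) {
            throw new IllegalArgumentException("no content to save");
        }
        FileUtils.copyInputStreamToFile(inputStream, target);
    }
}
```

With an entity that is not null, the call would look like ImageSaver.save(entity.getContent(), new File("picture.jpg")), where the file name is a placeholder.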

In real development, though, how do you know the suffix is .jpg? Take the part after the last dot of the address (the xx in http://xxx.com/xxx.xx) and append it to the name of the file being saved.
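The suffix extraction can be sketched with the JDK alone. The class UrlSuffix and its method names are hypothetical; parsing with java.net.URI is one way to ignore query strings and dots in the host name, not necessarily what the original project did.

```java
import java.net.URI;

class UrlSuffix {

    // Return the text after the last dot of the URL's final path segment,
    // or "" when the path has no suffix.
    static String suffixOf(String url) {
        String path = URI.create(url).getPath(); // excludes host, query, and fragment
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot must sit in the last segment and must not be the final character.
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot + 1);
        }
        return "";
    }

    // Append the suffix to a base name chosen by the caller.
    static String fileNameFor(String baseName, String url) {
        String suffix = suffixOf(url);
        return suffix.isEmpty() ? baseName : baseName + "." + suffix;
    }
}
```

When the address carries no usable suffix, the Content-Type value (for example image/jpeg) is a reasonable fallback for deciding the file type.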

Section four, proxy IP

Use a high-anonymity proxy.

Search Baidu for "proxy IP" to find one,

then call the setConfig method on the HttpGet instance to apply it.

In a real project, write a small crawler that scrapes a proxy IP site, take just the first 10 proxy IPs, and put them in a queue.
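A sketch of how the queue and setConfig might fit together, assuming HttpClient 4.5.x. The class ProxyPool and its methods are hypothetical, and rotating through the proxies by re-queuing each one after use is an assumption; the source only says the proxies are put in a queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;

class ProxyPool {

    private final Queue<HttpHost> proxies = new ArrayDeque<>();

    // Add a crawled proxy (IP and port) to the end of the queue.
    void add(String ip, int port) {
        proxies.add(new HttpHost(ip, port));
    }

    // Take the next proxy and put it back at the end, so proxies are used in rotation.
    HttpHost next() {
        HttpHost proxy = proxies.remove(); // throws NoSuchElementException if the pool is empty
        proxies.add(proxy);
        return proxy;
    }

    // Build a GET request that is routed through the next proxy in the queue.
    HttpGet newProxiedGet(String url) {
        RequestConfig config = RequestConfig.custom().setProxy(next()).build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }
}
```

A proxy that stops responding would normally be dropped from the queue instead of being re-added; that bookkeeping is left out of the sketch.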

Section five, connection and read timeouts

First, HttpClient connection time

This is the time HttpClient takes to establish a connection to the host of the destination URL. In theory, the shorter the distance, the faster the connection.

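By default HttpClient 4.x does not set a connection timeout of its own (the RequestConfig default of -1 means "use the system default"), so an attempt to reach an unresponsive host can keep trying for a long time, around a minute or more depending on the operating system.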

If a URL never connects, it can hold up the other threads of the crawler, so we need to set a limit.

For example, with a limit of 10 seconds, an error is reported if the connection has not been made within 10 seconds. Use log4j to record the relevant information.

Second, HttpClient read time

This is the time taken to fetch the content data after HttpClient has connected to the target server. Reading data is normally very fast,

but if the amount of data is large, or the target server itself has problems (such as slow database reads or heavy concurrency), the read time suffers.

So this also needs to be set: for example, with a limit of 10 seconds, an error is reported if reading has not finished within 10 seconds.
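Both limits can be set through RequestConfig. The sketch below assumes HttpClient 4.5.x; the class TimeoutConfig and its method name are hypothetical, and the 10-second values follow the example in the text.

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;

class TimeoutConfig {

    static final int TEN_SECONDS_MS = 10_000;

    // Build a GET request with a 10-second connection limit and a 10-second read limit.
    static HttpGet newGetWithTimeouts(String url) {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(TEN_SECONDS_MS) // time allowed to establish the connection, in ms
                .setSocketTimeout(TEN_SECONDS_MS)  // longest wait for data on the open connection, in ms
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }
}
```

When a limit is exceeded, httpClient.execute or the read of the entity throws an IOException (ConnectTimeoutException for the connection, java.net.SocketTimeoutException for the read), and that catch block is the place to log the failure. Note that the socket timeout measures the wait between data packets, so a very large download that keeps delivering data is not cut off by it.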


