Section 1: HttpClient
I. Introduction to HttpClient
The Hypertext Transfer Protocol (HTTP) is arguably the most important protocol used on the internet today,
and more and more Java applications need to access network resources directly over HTTP.
The java.net package of the JDK already provides basic HTTP access, but for most applications it is not rich or flexible enough.
HttpClient started as a sub-project of Apache Jakarta Commons (it now lives in the Apache HttpComponents project). It provides an efficient, up-to-date, feature-rich client-side toolkit for HTTP and supports the latest versions and recommendations of the protocol.
Official site: http://hc.apache.org/
Latest version: http://hc.apache.org/httpcomponents-client-4.5.x/
Official documentation: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html
II. Maven dependency
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
III. A HelloWorld implementation with HttpClient
package com.guo.httpclient;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld {

    public static void main(String[] args) {
        // Create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create an HttpGet instance
        HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");
        CloseableHttpResponse response = null; // holds the returned response
        try {
            // Execute the HTTP GET request
            response = httpClient.execute(httpGet);
            // Get the response entity
            HttpEntity entity = response.getEntity();
            // Print the page content
            System.out.println("Web content: " + EntityUtils.toString(entity, "utf-8"));
        } catch (ClientProtocolException e) { // HTTP protocol exception
            e.printStackTrace();
        } catch (ParseException e) { // parsing exception
            e.printStackTrace();
        } catch (IOException e) { // IO exception
            e.printStackTrace();
        } finally {
            // Close the response (it is null if the request failed) and the client
            try {
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
This section makes direct requests without imitating a browser, so some sites cannot be crawled this way.
The next section addresses that issue.
To shorten the code above, you can remove the try/catch blocks and simply declare the exceptions with throws on the method.
Section 2: Simulating a browser (using Firefox as the example)
Some sites have anti-crawler protection, so the direct request from the previous section runs into problems on them.
To get around this, the request has to imitate a browser.
I. Set the User-Agent request header to simulate a browser
1. Request headers
Open a website (www.tuicool.com is used as the example here), press F12, and click the Network tab.
Look at the request headers: this is what the browser sends to the target server, and the server uses it to identify the client.
How do we simulate a browser?
Call the setHeader method on the HttpGet instance to set the header information sent to the target server.
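The setHeader call described above might look like the following minimal sketch. The User-Agent string is only an example value for Firefox (an assumption, not taken from the original notes); copy the real one from the Network tab of your own browser. The class and method names are made up for illustration.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

class UserAgentExample {

    // Example Firefox User-Agent (assumed value; use the one your browser actually sends).
    static final String FIREFOX_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

    // Builds a GET request that identifies itself to the server as a browser.
    static HttpGet buildBrowserLikeGet(String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", FIREFOX_USER_AGENT);
        return httpGet;
    }

    public static void main(String[] args) throws IOException {
        HttpGet httpGet = buildBrowserLikeGet("http://www.tuicool.com/");
        // try-with-resources closes the response and the client automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
        }
    }
}
```

Some sites check more than the User-Agent, so this header alone may not be enough everywhere.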
II. Get the response content type (Content-Type)
Call the getContentType method on the HttpEntity. The content type comes back as a name/value pair (a header), so call getValue on it to get the value.
Here the line that prints the page content is commented out, so only the content type is output.
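A minimal sketch of that call, with the page-content line commented out as described. The helper name contentTypeOf and the User-Agent value are assumptions added for illustration; getContentType may return null when the server sends no Content-Type, so the helper checks for that.

```java
import java.io.IOException;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class ContentTypeExample {

    // Returns the Content-Type value of the entity, or null if none is available.
    static String contentTypeOf(HttpEntity entity) {
        if (entity == null) {
            return null;
        }
        Header contentType = entity.getContentType();
        return contentType == null ? null : contentType.getValue();
    }

    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/");
        // Example browser User-Agent (assumed value)
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            System.out.println("Content-Type: " + contentTypeOf(entity));
            // System.out.println(EntityUtils.toString(entity, "utf-8")); // page content, not needed here
        }
    }
}
```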
The result of running it:
This response specifies the utf-8 encoding. Some responses specify no encoding at all; it depends on how the server is configured.
Why get the response content type?
Because a crawl collects a lot of links, and the content type lets us filter out the ones that are irrelevant.
III. Get the response status
200: OK; 403: forbidden; 404: page not found; 500: server error
The earlier requests went through smoothly, so their response status was 200.
To get the response status, call the getStatusLine method on the CloseableHttpResponse instance.
Here we only need the code itself (200), so chain a getStatusCode call to get just the status code.
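As a rough sketch, the getStatusLine().getStatusCode() chain can be wrapped as below. The describe helper is hypothetical and only mirrors the four codes listed above.

```java
import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class StatusCodeExample {

    // The status line holds protocol version, code and reason phrase; we only want the code.
    static int statusCodeOf(HttpResponse response) {
        return response.getStatusLine().getStatusCode();
    }

    // Hypothetical helper: short description for the codes mentioned in these notes.
    static String describe(int statusCode) {
        switch (statusCode) {
            case 200: return "OK";
            case 403: return "Forbidden";
            case 404: return "Not Found";
            case 500: return "Internal Server Error";
            default:  return "Other";
        }
    }

    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            int code = statusCodeOf(response);
            System.out.println("Status: " + code + " " + describe(code));
        }
    }
}
```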
Section 3: Downloading images with HttpClient
First use getContentType to check the type of the response.
The result is image/jpeg, which is an image type.
Now save this image locally (it could also be stored on a server).
The HttpEntity has a getContent method that returns an InputStream. First check that the entity is not null,
then get the InputStream. How do we copy the image to a local file?
The traditional way (copying the stream by hand)
is omitted here.
The simple way is to use Apache Commons IO, which wraps this up for us.
First add the jar through Maven, then write the code.
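A minimal sketch of the Commons IO approach, assuming the commons-io artifact is on the classpath. The image URL and the target file name are placeholders (the original address is not given in these notes), and saveEntity is a name made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class ImageDownloadExample {

    // Copies the entity's content stream to the target file.
    // Returns false (and writes nothing) when there is no entity.
    static boolean saveEntity(HttpEntity entity, File target) throws IOException {
        if (entity == null) {
            return false;
        }
        try (InputStream input = entity.getContent()) {
            FileUtils.copyInputStreamToFile(input, target);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and file name (assumptions)
        HttpGet httpGet = new HttpGet("http://example.com/picture.jpg");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            boolean saved = saveEntity(response.getEntity(), new File("picture.jpg"));
            System.out.println(saved ? "Image saved" : "No content to save");
        }
    }
}
```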
But in real development, how do you know the suffix is .jpg? Take the part after the last dot of an address such as http://xxx.com/xxx.xx, then append it to the name of the file being saved.
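One possible way to pull the suffix out of the address, sketched with java.net.URI so that a query string or a dot in the host name is not mistaken for the extension. The class name, method name, and the "picture" base name are hypothetical.

```java
import java.net.URI;

class UrlExtension {

    // Returns the text after the last dot of the URL's path, or "" if the path has no extension.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot must be inside the last path segment and must not be the final character.
        if (dot <= slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1);
    }

    public static void main(String[] args) {
        String ext = extensionOf("http://xxx.com/xxx.jpg");
        System.out.println(ext);              // prints "jpg"
        System.out.println("picture." + ext); // prints "picture.jpg"
    }
}
```

If the address has no suffix at all, the Content-Type from the response is an alternative hint for the file type.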
Section 4: Proxy IPs
Use high-anonymity proxies.
Search Baidu for "proxy IP" to find some,
then call the setConfig method on the HttpGet instance to apply the proxy.
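A minimal sketch of the setConfig call, using RequestConfig with an HttpHost as the proxy. The proxy address 127.0.0.1:8080 is only a placeholder; substitute a working high-anonymity proxy. The class and method names are made up for illustration.

```java
import java.io.IOException;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class ProxyExample {

    // Builds a GET request that is routed through the given proxy.
    static HttpGet buildProxiedGet(String url, String proxyHost, int proxyPort) {
        HttpHost proxy = new HttpHost(proxyHost, proxyPort);
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder proxy address (assumption)
        HttpGet httpGet = buildProxiedGet("http://www.tuicool.com/", "127.0.0.1", 8080);
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println(response.getStatusLine().getStatusCode());
        }
    }
}
```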
In a real project, write a small crawler that scrapes a proxy-IP site, takes just the first 10 proxy IPs, and puts them in a queue.
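The queue part could be sketched as follows. The crawler that fills the queue is not shown, the "host:port" string format and the sample addresses are assumptions, and reusing proxies in rotation is one possible policy rather than something these notes prescribe.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ProxyQueue {

    private static final int MAX_PROXIES = 10; // keep only the first 10
    private final Queue<String> proxies = new ArrayDeque<>();

    // Adds a "host:port" entry; returns false once 10 proxies are already stored.
    boolean add(String hostAndPort) {
        if (proxies.size() >= MAX_PROXIES) {
            return false;
        }
        return proxies.offer(hostAndPort);
    }

    // Takes the proxy at the head and puts it back at the tail, so proxies are used in rotation.
    // Returns null when the queue is empty.
    String next() {
        String proxy = proxies.poll();
        if (proxy != null) {
            proxies.offer(proxy);
        }
        return proxy;
    }

    int size() {
        return proxies.size();
    }

    public static void main(String[] args) {
        ProxyQueue queue = new ProxyQueue();
        queue.add("10.0.0.1:8080"); // placeholder addresses
        queue.add("10.0.0.2:8080");
        System.out.println(queue.next()); // prints "10.0.0.1:8080"
        System.out.println(queue.next()); // prints "10.0.0.2:8080"
        System.out.println(queue.next()); // prints "10.0.0.1:8080"
    }
}
```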
Section 5: Connect and read timeouts
I. HttpClient connect timeout
This is the time HttpClient takes to establish a connection to the host of the target URL. In theory, the shorter the distance, the faster the connection.
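With no timeout configured, a connection attempt can hang for a long time (about 1 minute in these notes) before it gives up. HttpClient 4.5 itself defaults the connect timeout to -1, which means the system default is used.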
If a URL never connects, the blocked thread can hold up other threads, so a connect timeout needs to be set.
For example, with a 10-second timeout, a connection that is not established within 10 seconds raises an error. Use log4j to record the relevant information.
II. HttpClient read timeout
This applies once HttpClient has connected to the target server and starts fetching the content. Reading data is usually very fast,
but a large amount of data, or a problem on the target server itself (a slow database, heavy concurrency, and so on), can stretch the read time.
So a read timeout should be set as well: for example, with a 10-second timeout, a read that has not finished within 10 seconds raises an error.
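Both timeouts can be set through RequestConfig; the sketch below uses the 10-second values from the text. setConnectTimeout covers establishing the connection and setSocketTimeout covers waiting for data. Printing to System.err stands in for the log4j logging mentioned above, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class TimeoutExample {

    static final int TEN_SECONDS_MS = 10_000; // timeouts are given in milliseconds

    // Builds a GET request with a 10-second connect timeout and a 10-second read timeout.
    static HttpGet buildGetWithTimeouts(String url) {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(TEN_SECONDS_MS) // time allowed to establish the connection
                .setSocketTimeout(TEN_SECONDS_MS)  // time allowed to wait for data
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }

    public static void main(String[] args) {
        HttpGet httpGet = buildGetWithTimeouts("https://www.cnblogs.com/");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println(response.getStatusLine().getStatusCode());
        } catch (ConnectTimeoutException e) {
            System.err.println("Connect timed out: " + e.getMessage());
        } catch (SocketTimeoutException e) {
            System.err.println("Read timed out: " + e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```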
Java Crawler Technology HttpClient Learning notes