HttpClient (II): Using a Proxy IP and Handling Connection Timeouts

Source: Internet
Author: User

Objective

The previous post only scratched the surface. In fact, HttpClient has many powerful features:

(1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
(2) Supports automatic redirects
(3) Supports the HTTPS protocol
(4) Supports proxy servers, etc.

Using a proxy IP with HttpClient

1.1. Preface

When crawling web pages, you will find that some target sites have anti-crawler mechanisms: if a client visits too frequently or in too regular a pattern, the site blocks its IP.
This is where proxy IPs come in handy: when one IP is blocked, switch to a different one.
Proxies fall into several categories: transparent, anonymous, distorting, and high-anonymity (elite) proxies. High-anonymity proxies are the usual choice.

1.2. Types of proxy IP

1) Transparent proxy

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
Although a transparent proxy replaces your IP address in REMOTE_ADDR, the server can still find out who you are from HTTP_X_FORWARDED_FOR.

2) Anonymous proxy

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy is a step up from a transparent proxy: the server can tell that you are using a proxy, but it cannot tell who you are.
One step up from a plain anonymous proxy is the distorting proxy.

3) Distorting proxy

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
As with an anonymous proxy, the server can still tell that you are using a proxy, but it sees a fake IP address, which makes the disguise more convincing.

4) High-anonymity proxy (elite proxy)

REMOTE_ADDR = Proxy IP
HTTP_VIA = Not determined
HTTP_X_FORWARDED_FOR = Not determined
As you can see, a high-anonymity proxy prevents the server from detecting that you are using a proxy at all, which makes it the best choice.
In general, we crawl with high-anonymity proxy IPs.
Where do proxy IPs come from? A quick Baidu search turns up many proxy IP sites. They usually offer some proxies for free, but paying a little for an API is more convenient.
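
The four header patterns in 1.2 can be condensed into a small decision routine. The following is only a sketch: the class ProxyAnonymity, its Level enum, and the classify method are hypothetical names, and it assumes you know your own real IP (for instance when testing a proxy against a server you control). Real proxies may set other headers that this sketch ignores.

```java
/** Hypothetical helper: classifies a proxy from the values the target server sees. */
final class ProxyAnonymity {

    enum Level { NO_PROXY, TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    /**
     * @param remoteAddr   REMOTE_ADDR as seen by the server (never null)
     * @param via          HTTP_VIA, or null if the header is absent
     * @param forwardedFor HTTP_X_FORWARDED_FOR, or null if the header is absent
     * @param realClientIp the client's real IP, known to us while testing
     */
    static Level classify(String remoteAddr, String via, String forwardedFor, String realClientIp) {
        if (remoteAddr.equals(realClientIp)) {
            return Level.NO_PROXY;       // the server sees our real address directly
        }
        if (via == null && forwardedFor == null) {
            return Level.ELITE;          // no proxy headers at all
        }
        if (forwardedFor == null || forwardedFor.equals(remoteAddr)) {
            return Level.ANONYMOUS;      // proxy is visible, client address is not
        }
        if (forwardedFor.equals(realClientIp)) {
            return Level.TRANSPARENT;    // real IP leaks through X-Forwarded-For
        }
        return Level.DISTORTING;         // header carries some other (fake) address
    }

    public static void main(String[] args) {
        // Addresses taken from the reserved documentation ranges.
        String me = "203.0.113.7";
        String proxy = "198.51.100.20";
        System.out.println(classify(proxy, proxy, me, me));        // TRANSPARENT
        System.out.println(classify(proxy, proxy, proxy, me));     // ANONYMOUS
        System.out.println(classify(proxy, proxy, "192.0.2.99", me)); // DISTORTING
        System.out.println(classify(proxy, null, null, me));       // ELITE
    }
}
```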

1.3. Example: using a proxy IP

Use RequestConfig.custom().setProxy(proxy).build() to set the proxy IP:

package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class UseProxy {
    public static void main(String[] args) throws IOException {
        // Create an HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");

        // Set the proxy IP, the connect timeout, the socket (read) timeout,
        // and the timeout for obtaining a connection from the connection manager
        HttpHost proxy = new HttpHost("58.60.255.82", 8118);
        RequestConfig requestConfig = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .setConnectionRequestTimeout(10000) // assumed value; adjust as needed
                .build();
        httpGet.setConfig(requestConfig);

        // Set the request header
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");

        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // get the response entity
            if (entity != null) {
                System.out.println("The content of the web page is: " + EntityUtils.toString(entity, "UTF-8"));
            }
        }
    }
}

1.4. Obtaining proxy IPs in real-world development

We can use HttpClient to crawl the 20 most recent high-anonymity proxy IPs from http://www.xicidaili.com/ and save them in a linked list. When an IP is blocked or a connection times out, take the next IP from the list, and so on. When fewer than 5 IPs remain in the list, crawl again and add the new proxy IPs to the list.
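
The refill strategy described above might look roughly like the following. This is a sketch, not a finished implementation: the class ProxyPool and its fetcher parameter are hypothetical, proxies are plain "host:port" strings, and the actual crawling of a proxy-list site is left to whatever Supplier you pass in.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical proxy pool: hands out proxies and refills itself when it runs low. */
final class ProxyPool {
    private static final int REFILL_THRESHOLD = 5; // "fewer than 5" from the text

    private final Deque<String> proxies = new ArrayDeque<>();
    private final Supplier<List<String>> fetcher;  // e.g. a crawler returning "host:port" strings

    ProxyPool(Supplier<List<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Call this when the current proxy is blocked or times out. */
    synchronized String next() {
        if (proxies.size() < REFILL_THRESHOLD) {
            proxies.addAll(fetcher.get());         // crawl again and append the results
        }
        String proxy = proxies.pollFirst();
        if (proxy == null) {
            throw new IllegalStateException("no proxies available");
        }
        return proxy;
    }

    synchronized int size() {
        return proxies.size();
    }

    public static void main(String[] args) {
        // Stand-in fetcher; a real one would crawl a proxy-list site.
        ProxyPool pool = new ProxyPool(
                () -> Arrays.asList("198.51.100.1:8080", "198.51.100.2:8080"));
        System.out.println(pool.next());
    }
}
```

Each returned string can then be split into host and port and passed to new HttpHost(host, port) as in the example in 1.3.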

1.5. HttpClient connect timeout and read timeout

When HttpClient executes an HTTP request, two durations matter: the time to connect and the time to read the content.

1) HttpClient connect time

The connect time is how long HttpClient takes to establish a connection to the host of the target URL. In theory, the shorter the distance and the clearer the line, the faster the connection. In practice, routing is complex, so the connect time varies, and with bad luck the connection is never established. In my tests, HttpClient's default connect timeout was about 1 minute, after which it tried to connect again. (HttpClient itself sets no connect timeout by default, so the effective limit depends on the platform.)

This causes a problem: a URL that never connects ties up a thread and holds up other work. To put it bluntly, it hogs a resource without doing anything useful. So we need an explicit setting, for example 10 seconds: if the connection is not established within 10 seconds, an error is raised and we can handle it in business logic, for example by retrying the connection. The problematic URL can also be written to the log4j log so that administrators can review it.
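
The retry-and-log handling mentioned above could be sketched as follows. The names RetryingFetcher, Request, and fetchWithRetry are hypothetical, and java.util.logging stands in for log4j so that the example needs no extra dependency. It retries on IOException only, since java.net.SocketTimeoutException and HttpClient's connect timeout exception both belong to that family.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.logging.Logger;

/** Hypothetical helper: retries a request a fixed number of times and logs each failure. */
final class RetryingFetcher {
    private static final Logger LOG = Logger.getLogger(RetryingFetcher.class.getName());

    /** A request that may fail with an I/O error such as a timeout. */
    @FunctionalInterface
    interface Request<T> {
        T execute() throws IOException;
    }

    static <T> T fetchWithRetry(String url, int maxAttempts, Request<T> request) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.execute();
            } catch (IOException e) {
                lastFailure = e;
                // Record the problematic URL so an administrator can review it later.
                LOG.warning("Attempt " + attempt + "/" + maxAttempts
                        + " failed for " + url + ": " + e.getMessage());
            }
        }
        throw lastFailure; // every attempt failed
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated request: times out twice, then succeeds.
        String body = fetchWithRetry("http://example.invalid/", 3, () -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "page body";
        });
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

In real use, the lambda would wrap the httpClient.execute(httpGet) call; switching to a different proxy between attempts is a natural extension.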

2) HttpClient read time

The read time is how long HttpClient takes to fetch the content once it has connected to the target server. Reading is usually fast, but a large response or a problem on the target server (slow database reads, heavy concurrency, and so on) can slow it down.

As with the connect time, we need an explicit setting, for example 10 seconds: if the read stalls for more than 10 seconds, an error is raised and we can handle it in business logic in the same way.
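
To see a read timeout in isolation, the sketch below uses plain java.net sockets rather than HttpClient, so it runs without any dependency or network access. The class ReadTimeoutDemo and its method readTimesOut are hypothetical. It assumes the loopback interface is available, and it relies on the operating system completing the TCP handshake for a listening socket even though the server never accepts or answers. Socket.setSoTimeout is the plain-socket counterpart of the read timeout discussed here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: the connection succeeds, but reading from a silent server times out. */
final class ReadTimeoutDemo {

    /** Returns true if the read gave up with a timeout after readTimeoutMillis. */
    static boolean readTimesOut(int readTimeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the server never sends any data.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Connect timeout: how long to wait for the connection to be established.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()), 2000);
            // Read timeout: how long a single read may wait for data.
            client.setSoTimeout(readTimeoutMillis);
            try {
                client.getInputStream().read(); // blocks, because no data ever arrives
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```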

For example, take the address http://central.maven.org/maven2/. It is hosted abroad, so the connect time is relatively long, and the page has a lot of content to read. Connect timeouts and read timeouts are easy to observe with it.

How do we implement this in code?

HttpClient provides the RequestConfig class for configuring parameters such as the connect timeout, the read timeout, and the proxy IP described above.

Example:

package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class TimeSetting {
    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");

        // Connect timeout and socket (read) timeout in milliseconds;
        // 10 seconds each, matching the example in the text
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .build();
        httpGet.setConfig(config);

        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");

        // try-with-resources closes the response and the client even if a timeout is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                System.out.println("The content of the web page is: " + EntityUtils.toString(entity, "UTF-8"));
            }
        }
    }
}

  
