1. Using connection pooling
Although HTTP is nominally a connectionless protocol, it runs on top of TCP, so a connection to the server must still be established underneath. A program that crawls a large number of pages from the same site should use a connection pool; otherwise every fetch connects to the site, sends the request, reads the response, and releases the connection, which is inefficient and makes it easy to carelessly leak resources, to the point that the site starts refusing connections (many sites reject large numbers of connections from the same IP to prevent DoS attacks).
The connection pool routines are as follows:
    SchemeRegistry schemeRegistry = new SchemeRegistry();
    schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
    schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));

    PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
    cm.setMaxTotal(200);
    cm.setDefaultMaxPerRoute(2);

    HttpHost googleResearch = new HttpHost("research.google.com", 80);
    HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
    cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);
    cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);
SchemeRegistry registers each protocol together with its default port. PoolingClientConnectionManager is the pooling connection manager, i.e. the connection pool: setMaxTotal sets the maximum total number of connections in the pool, setDefaultMaxPerRoute sets the default number of connections per route (http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum number of connections for an individual site.
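To make those limits concrete, here is a JDK-only sketch (a hypothetical class, not part of HttpClient) of what setDefaultMaxPerRoute and setMaxPerRoute enforce: each route gets a bounded number of connection leases, with optional per-site overrides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-route connection limiting, JDK only.
class RouteLimiter {
    private final int defaultMaxPerRoute;
    private final Map<String, Semaphore> routes = new HashMap<>();
    private final Map<String, Integer> overrides = new HashMap<>();

    RouteLimiter(int defaultMaxPerRoute) {
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    // Analogous to cm.setMaxPerRoute(route, max): raise the cap for one host.
    void setMaxPerRoute(String host, int max) {
        overrides.put(host, max);
    }

    // Try to lease a connection to the host; false when the route is saturated.
    synchronized boolean tryLease(String host) {
        Semaphore s = routes.computeIfAbsent(host,
                h -> new Semaphore(overrides.getOrDefault(h, defaultMaxPerRoute)));
        return s.tryAcquire();
    }

    // Return a leased connection to the pool.
    synchronized void release(String host) {
        Semaphore s = routes.get(host);
        if (s != null) {
            s.release();
        }
    }
}
```

The real pool additionally caps the total across all routes (setMaxTotal) and reuses idle connections; this sketch only shows the per-route bound.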
Obtaining the HTTP client from the connection pool is also very important:
    DefaultHttpClient client = new DefaultHttpClient(cm);
2. Set HttpClient parameters
HttpClient needs appropriate parameters to work well. The defaults can handle a small amount of crawling, but a well-chosen set of parameters often improves results in a particular situation. A routine for setting parameters follows:
    DefaultHttpClient client = new DefaultHttpClient(cm);
    Integer socketTimeout = 10000;
    Integer connectionTimeout = 10000;
    final int retryTime = 3;
    client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
    client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
    client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
    client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);

    HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler()
    {
        @Override
        public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
        {
            if (executionCount >= retryTime)
            {
                // Do not retry if over max retry count
                return false;
            }
            if (exception instanceof InterruptedIOException)
            {
                // Timeout
                return false;
            }
            if (exception instanceof UnknownHostException)
            {
                // Unknown host
                return false;
            }
            if (exception instanceof ConnectException)
            {
                // Connection refused
                return false;
            }
            if (exception instanceof SSLException)
            {
                // SSL handshake exception
                return false;
            }
            HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
            boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
            if (idempotent)
            {
                // Retry if the request is considered idempotent
                return true;
            }
            return false;
        }
    };
    client.setHttpRequestRetryHandler(myRetryHandler);
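The decision table inside that handler can be exercised in isolation. Below is a simplified, JDK-only sketch of the same logic (the class and method names are hypothetical, and the SSLException branch is omitted to keep it dependency-free):

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;

// Hypothetical standalone version of the retry decision, JDK only.
class RetryPolicy {
    static final int MAX_RETRIES = 3;

    // idempotent: true when the request carries no body and is safe to replay.
    static boolean shouldRetry(IOException exception, int executionCount, boolean idempotent) {
        if (executionCount >= MAX_RETRIES) {
            return false;          // give up after the maximum number of attempts
        }
        if (exception instanceof InterruptedIOException) {
            return false;          // timeout: the site is too slow, do not hammer it
        }
        if (exception instanceof UnknownHostException) {
            return false;          // DNS failure will not fix itself on retry
        }
        if (exception instanceof ConnectException) {
            return false;          // connection refused
        }
        return idempotent;         // replay only requests without a body
    }
}
```

Note that UnknownHostException and ConnectException both extend IOException, so the instanceof checks must run before the idempotency fallback, exactly as in the handler above.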
The socketTimeout and connectionTimeout parameters set the maximum socket wait time and the maximum connection wait time, in milliseconds. The socket wait time is the longest allowed interval between two packets while downloading data from the site; if it is exceeded, HttpClient considers the connection broken. The connection wait time is the longest time to wait while establishing a connection to the site; if the site has not responded by then, it is considered unreachable. Setting TCP_NODELAY to false tells HttpClient not to use the no-delay policy. With the no-delay policy enabled, data in the send buffer is transmitted as promptly as possible regardless of bandwidth utilization, which suits scenarios with strict real-time requirements; with it disabled, data is sent using Nagle's algorithm, which trades real-time delivery for better bandwidth utilization. SOCKET_BUFFER_SIZE sets the socket buffer size in bytes; the default is 8 KB. HttpRequestRetryHandler is the interface responsible for request retries: its retryRequest method, implemented here in an anonymous inner class, is called whenever an exception occurs after HttpClient sends a request. Based on the number of attempts so far, the request content, and the exception, it returns true to retry and false to give up.

3. Setting request headers

Setting request headers is also important. For example, setting User-Agent can disguise the crawler as a browser and get it past sites that check for crawlers, and setting Accept-Encoding to gzip suggests that the site transmit data in compressed form to save bandwidth. The routine is as follows:
    HttpResponse response = null;
    HttpGet get = new HttpGet(url);
    get.addHeader("Accept", "text/html");
    get.addHeader("Accept-Charset", "utf-8");
    get.addHeader("Accept-Encoding", "gzip");
    get.addHeader("Accept-Language", "en-US,en");
    get.addHeader("User-Agent",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
    response = client.execute(get);
    HttpEntity entity = response.getEntity();
    Header header = entity.getContentEncoding();
    if (header != null)
    {
        HeaderElement[] codecs = header.getElements();
        for (int i = 0; i < codecs.length; i++)
        {
            if (codecs[i].getName().equalsIgnoreCase("gzip"))
            {
                response.setEntity(new GzipDecompressingEntity(entity));
            }
        }
    }
    return response;
For the meaning of each header, see http://kb.cnblogs.com/page/92320/. If you need many different User-Agent strings to rotate through (the same User-Agent is easily flagged as a crawler by some sites), you can find lists online, look the value up in your Chrome browser, or capture it with a packet sniffer. Note that after setting Accept-Encoding to gzip, the response must be checked for compression and decompressed if necessary, as the code following client.execute(get) in the routine above shows. [When citing, please credit the source: http://blog.csdn.net/bhq2010/article/details/9210007]
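What GzipDecompressingEntity does under the hood is wrap the raw response stream in a java.util.zip.GZIPInputStream. Here is a JDK-only sketch of that decompression step (the helper class and method names are hypothetical; the compress method exists only to make the demo self-contained):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper showing the gzip decompression step, JDK only.
class GzipBody {
    // Read a gzip-compressed stream (e.g. a response body sent with
    // Content-Encoding: gzip) and return the decompressed bytes.
    static byte[] decompress(InputStream raw) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(raw);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Demo-only helper: gzip-compress a byte array in memory.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}
```

The routine in section 3 performs the equivalent wrapping lazily, so the entity is only decompressed when its content is actually read.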
Java HttpClient Usage Summary