Java HttpClient Usage Summary

Source: Internet
Author: User

1. Using connection pooling

Although HTTP is a connectionless protocol, it runs over TCP, so the client still has to establish a connection to the server underneath. A program that crawls a large number of pages from the same site should use a connection pool. Otherwise every fetch has to connect to the site, send the request, read the response, and release the connection. This is inefficient, and it also makes it easy to forget to release some resources, which can lead the site to refuse further connections (many sites reject a large number of connections from the same IP address to guard against DoS attacks).

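A connection pool is set up as follows:

  import org.apache.http.HttpHost;
  import org.apache.http.conn.routing.HttpRoute;
  import org.apache.http.conn.scheme.PlainSocketFactory;
  import org.apache.http.conn.scheme.Scheme;
  import org.apache.http.conn.scheme.SchemeRegistry;
  import org.apache.http.conn.ssl.SSLSocketFactory;
  import org.apache.http.impl.conn.PoolingClientConnectionManager;

  // Register each scheme with its default port
  SchemeRegistry schemeRegistry = new SchemeRegistry();
  schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
  schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
  PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
  // Pool-wide cap and default per-route cap
  cm.setMaxTotal(200);
  cm.setDefaultMaxPerRoute(2);
  // Raise the cap for two specific routes
  HttpHost googleResearch = new HttpHost("research.google.com", 80);
  HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
  cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);
  cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);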


SchemeRegistry registers each protocol scheme together with its default port. PoolingClientConnectionManager is the pooling connection manager, that is, the connection pool itself. setMaxTotal sets the maximum number of connections in the whole pool, setDefaultMaxPerRoute sets the default maximum number of connections per route (see http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum for one specific route, overriding the default.
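
To make the interplay of the three limits concrete, here is a minimal sketch using only the JDK. The class RouteLimits and its methods are hypothetical names invented for this illustration; they are not part of HttpClient. It also simplifies one thing: as far as I know the real connection manager makes a caller wait for a free connection, while this sketch simply refuses the lease.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical illustration of pool-wide and per-route caps; not HttpClient code. */
public class RouteLimits {
    private final Semaphore total;
    private final int defaultMaxPerRoute;
    private final Map<String, Semaphore> perRoute = new ConcurrentHashMap<>();

    public RouteLimits(int maxTotal, int defaultMaxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    /** Overrides the cap for one route. Call before leasing on that route. */
    public void setMaxPerRoute(String route, int max) {
        perRoute.put(route, new Semaphore(max));
    }

    /** Tries to lease one slot; returns false when either cap is exhausted. */
    public boolean tryLease(String route) {
        Semaphore routeCap =
                perRoute.computeIfAbsent(route, r -> new Semaphore(defaultMaxPerRoute));
        if (!routeCap.tryAcquire()) {
            return false; // per-route cap reached
        }
        if (!total.tryAcquire()) {
            routeCap.release(); // undo the route permit, pool-wide cap reached
            return false;
        }
        return true;
    }

    /** Returns a slot previously obtained with tryLease. */
    public void release(String route) {
        Semaphore routeCap = perRoute.get(route);
        if (routeCap != null) {
            routeCap.release();
        }
        total.release();
    }

    public static void main(String[] args) {
        RouteLimits limits = new RouteLimits(200, 2);
        limits.setMaxPerRoute("en.wikipedia.org:80", 50);
        System.out.println(limits.tryLease("example.org:80")); // true
        System.out.println(limits.tryLease("example.org:80")); // true
        System.out.println(limits.tryLease("example.org:80")); // false, default cap is 2
    }
}
```

A route without its own setting falls back to the default cap, so a crawler that hits one site heavily usually wants an explicit per-route value for that site.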

It is equally important to create the HTTP client on top of the connection pool:

  DefaultHttpClient client = new DefaultHttpClient(cm);

2. Set HttpClient parameters

HttpClient needs suitable parameters to work well. The defaults can handle a small amount of crawling, but a well-chosen set of parameters often improves crawl results in a particular situation. Parameters are set as follows:

   1. DefaultHttpClient client = new DefaultHttpClient(cm);
   2. Integer socketTimeout = 10000;
   3. Integer connectionTimeout = 10000;
   4. final int retryTime = 3;
   5. client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
   6. client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
   7. client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
   8. client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);
   9. HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler()
  10. {
  11.     @Override
  12.     public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
  13.     {
  14.         if (executionCount >= retryTime)
  15.         {
  16.             // Do not retry if over max retry count
  17.             return false;
  18.         }
  19.         if (exception instanceof InterruptedIOException)
  20.         {
  21.             // Timeout
  22.             return false;
  23.         }
  24.         if (exception instanceof UnknownHostException)
  25.         {
  26.             // Unknown host
  27.             return false;
  28.         }
  29.         if (exception instanceof ConnectException)
  30.         {
  31.             // Connection refused
  32.             return false;
  33.         }
  34.         if (exception instanceof SSLException)
  35.         {
  36.             // SSL handshake exception
  37.             return false;
  38.         }
  39.         HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
  40.         boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
  41.         if (idempotent)
  42.         {
  43.             // Retry if the request is considered idempotent
  44.             return true;
  45.         }
  46.         return false;
  47.     }
  48. };
  49. client.setHttpRequestRetryHandler(myRetryHandler);
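
The retry rules in the handler above can also be tried out without a network or the HttpClient library. The sketch below restates the same decision as a plain JDK function. RetryDecision, shouldRetry, and the hasBody flag are hypothetical names for this illustration; hasBody stands in for the check `request instanceof HttpEntityEnclosingRequest`.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Hypothetical standalone restatement of the retry decision; JDK only. */
public class RetryDecision {
    public static final int MAX_RETRIES = 3;

    /**
     * @param exception      the I/O failure that interrupted the request
     * @param executionCount how many times the request has been executed so far
     * @param hasBody        true if the request carries a body (e.g. POST or PUT)
     */
    public static boolean shouldRetry(IOException exception, int executionCount, boolean hasBody) {
        if (executionCount >= MAX_RETRIES) {
            return false; // retry budget used up
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeout
        }
        if (exception instanceof UnknownHostException) {
            return false; // unknown host
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake failure
        }
        // Requests without a body are treated as idempotent and safe to repeat.
        return !hasBody;
    }

    public static void main(String[] args) {
        IOException reset = new IOException("connection reset");
        System.out.println(shouldRetry(reset, 1, false)); // true
        System.out.println(shouldRetry(reset, 3, false)); // false, budget used up
        System.out.println(shouldRetry(new SocketTimeoutException("read timed out"), 1, false)); // false
        System.out.println(shouldRetry(reset, 1, true));  // false, request has a body
    }
}
```

SocketTimeoutException is a subclass of InterruptedIOException, which is why a read timeout falls under the timeout rule.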

In the parameter-setting routine above, lines 5 and 6 set the socket timeout (SO_TIMEOUT) and the connection timeout (CONNECTION_TIMEOUT), both in milliseconds. The socket timeout is the maximum interval allowed between two consecutive data packets while a page or data is being downloaded from the site; if it is exceeded, HttpClient treats the connection as faulty. The connection timeout is the maximum time to wait while establishing a connection to the site; if the site has not responded within this time, it is considered unreachable. Line 7 tells HttpClient not to use the TCP_NODELAY policy. With TCP_NODELAY enabled, data in the send buffer is sent as promptly as possible regardless of bandwidth utilization, which suits scenarios with strict real-time requirements. With it disabled, data is sent using Nagle's algorithm, which favors bandwidth utilization over latency. Line 8 sets the socket buffer size in bytes; the default is 8 KB.

HttpRequestRetryHandler is the interface responsible for handling request retries. The routine implements its retryRequest method in an anonymous inner class. HttpClient calls this method when an exception occurs after it has sent a request. The method decides from the execution count, the request, and the exception whether to retry: it returns true to retry and false otherwise.

3. Set request headers

Setting request headers is also important. For example, setting User-Agent can disguise the crawler as a browser and get past the crawler checks of some sites, and setting Accept-Encoding to gzip suggests that the site send data in compressed form, which saves bandwidth. The routine is as follows:
   1. HttpResponse response = null;
   2. HttpGet get = new HttpGet(url);
   3. get.addHeader("Accept", "text/html");
   4. get.addHeader("Accept-Charset", "utf-8");
   5. get.addHeader("Accept-Encoding", "gzip");
   6. get.addHeader("Accept-Language", "en-US,en");
   7. get.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
   8. response = client.execute(get);
   9. HttpEntity entity = response.getEntity();
  10. Header header = entity.getContentEncoding();
  11. if (header != null)
  12. {
  13.     HeaderElement[] codecs = header.getElements();
  14.     for (int i = 0; i < codecs.length; i++)
  15.     {
  16.         if (codecs[i].getName().equalsIgnoreCase("gzip"))
  17.         {
  18.             response.setEntity(new GzipDecompressingEntity(entity));
  19.         }
  20.     }
  21. }
  22. return response;
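
The idea behind the Content-Encoding check in the routine above can be reproduced with the JDK alone. The following is a minimal sketch, not HttpClient code: GzipBody, decode, and gzip are hypothetical names, the body is assumed to be UTF-8 text, and the compressed bytes are produced locally to stand in for a server response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: inflates a response body only when it is gzip-encoded. */
public class GzipBody {

    /** Returns the body as UTF-8 text, decompressing first if contentEncoding is "gzip". */
    public static String decode(byte[] body, String contentEncoding) {
        try {
            InputStream in = new ByteArrayInputStream(body);
            if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
                in = new GZIPInputStream(in);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text with gzip, standing in for what a server would send. */
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html>hello</html>";
        System.out.println(decode(gzip(html), "gzip"));                       // <html>hello</html>
        System.out.println(decode(html.getBytes(StandardCharsets.UTF_8), null)); // <html>hello</html>
    }
}
```

The check matters because a server is free to ignore Accept-Encoding and reply uncompressed, so the client has to look at the response header rather than assume compression.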

For the meaning of each header, see http://kb.cnblogs.com/page/92320/ and set only the ones you need. If you need many different User-Agent values to rotate through (a site can easily identify a crawler that always uses the same User-Agent), you can find them online, look them up in Chrome, or capture them with a packet-capture tool. Note that once Accept-Encoding is set to gzip, you must check whether the site's response is compressed and decompress it if so, as the header-setting routine above does from line 9 onward.

[When reposting, please cite the source: http://blog.csdn.net/bhq2010/article/details/9210007]
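
Taking turns through several User-Agent values could look like the minimal sketch below. UserAgentRotator is a hypothetical class invented for this illustration, and the second and third strings in main are placeholders, not real browser identifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin source of User-Agent strings; safe to share between threads. */
public class UserAgentRotator {
    private final List<String> agents;
    private final AtomicInteger counter = new AtomicInteger();

    public UserAgentRotator(List<String> agents) {
        if (agents.isEmpty()) {
            throw new IllegalArgumentException("need at least one User-Agent");
        }
        this.agents = new ArrayList<>(agents); // defensive copy
    }

    /** Returns the next User-Agent, wrapping around at the end of the list. */
    public String next() {
        // floorMod keeps the index valid even after the counter overflows
        int index = Math.floorMod(counter.getAndIncrement(), agents.size());
        return agents.get(index);
    }

    public static void main(String[] args) {
        UserAgentRotator rotator = new UserAgentRotator(Arrays.asList(
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22",
                "PlaceholderBrowser/1.0", // placeholder value
                "PlaceholderBrowser/2.0"  // placeholder value
        ));
        for (int i = 0; i < 4; i++) {
            System.out.println(rotator.next());
        }
    }
}
```

In the header-setting routine, the fixed string on line 7 would then be replaced by a call such as rotator.next().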
