Reposted from http://www.buaaer.com/bbs/blog.php?tid=39937
HttpClient 4.0 is still very new, so there are few example tutorials online. Most of what a search for HttpClient turns up is based on the original Commons HttpClient 3.1 (legacy) package. The 4.0 packages can be downloaded from the official website.
This example shows how to capture a web page's encoding, content, and other information.
By default, a server chooses the response encoding from the encodings it supports, based on the client's request headers. google.cn, for example, supports both UTF-8 and gb2312. If the request carries no such header information, it returns gb2312 by default. If you visit google.cn directly in a browser and inspect the response headers with httplook or the Firebug plug-in for Firefox, you will see that it returns UTF-8.
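To make the idea of header-driven encoding selection concrete, here is a minimal sketch of how a server might weigh an Accept-Charset header against the encodings it supports. It uses only the JDK. The class name AcceptCharsetDemo, the method names, and the fallback rule are assumptions made for illustration; this is a simplified reading of q-values, not how google.cn or any particular server actually decides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: simplified Accept-Charset negotiation.
public class AcceptCharsetDemo {

    /** Parses an Accept-Charset value into charset name -> q-value (q defaults to 1.0). */
    static Map<String, Double> parse(String headerValue) {
        Map<String, Double> prefs = new LinkedHashMap<String, Double>();
        for (String part : headerValue.split(",")) {
            String[] pieces = part.trim().split(";");
            String name = pieces[0].trim().toLowerCase();
            if (name.isEmpty()) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim().toLowerCase();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2).trim());
                }
            }
            prefs.put(name, q);
        }
        return prefs;
    }

    /**
     * Returns the supported charset with the highest q-value.
     * Ties go to the earlier entry in `supported`; with no usable
     * header the (assumed) server default `fallback` is returned.
     */
    static String choose(String headerValue, List<String> supported, String fallback) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return fallback;
        }
        Map<String, Double> prefs = parse(headerValue);
        String best = null;
        double bestQ = 0.0;
        for (String cs : supported) {
            Double q = prefs.get(cs.toLowerCase());
            if (q == null) {
                q = prefs.get("*"); // wildcard covers charsets not listed by name
            }
            if (q != null && q > bestQ) {
                best = cs;
                bestQ = q;
            }
        }
        return best != null ? best : fallback;
    }

    public static void main(String[] args) {
        List<String> supported = Arrays.asList("UTF-8", "GB2312");
        System.out.println(choose("GB2312,utf-8;q=0.7,*;q=0.7", supported, "GB2312"));
        System.out.println(choose("utf-8", supported, "GB2312"));
        System.out.println(choose(null, supported, "GB2312"));
    }
}
```

With the header value used later in this post, gb2312 carries the implicit q of 1.0 and so outranks utf-8 at 0.7; a request with no header at all falls through to the assumed server default.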
Let's walk through the example. The explanations are in the code comments, and the complete code is included so that beginners can follow it.
HttpClient-related packages used in this example:
httpclient-4.0.jar
httpcore-4.0.1.jar
httpmime-4.0.jar
commons-logging-1.0.4.jar and other related packages
// HttpClientTest.java
package com.alibabahuo.crawler.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

class HttpClientTest {

    public final static void main(String[] args) throws Exception {
        // Initialization. The constructor here is different from that in 3.1.
        HttpClient httpclient = new DefaultHttpClient();
        HttpHost targetHost = new HttpHost("www.google.cn");
        // HttpGet httpget = new HttpGet("http://www.apache.org/");
        HttpGet httpget = new HttpGet("/");
        // View the default request header information (nothing is set yet, so this prints null)
        System.out.println("Accept-Charset:" + httpget.getFirstHeader("Accept-Charset"));
        // Without this User-Agent header, the response is gb2312 by default no matter whether
        // Accept-Charset is set to GBK or UTF-8 (this example is for google.cn)
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2)");
        // Multiple encodings can be accepted at the same time
        httpget.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        // Verify that the header settings took effect
        System.out.println("Accept-Charset:" + httpget.getFirstHeader("Accept-Charset").getValue());

        // Execute the HTTP request
        System.out.println("executing request " + httpget.getURI());
        HttpResponse response = httpclient.execute(targetHost, httpget);
        // HttpResponse response = httpclient.execute(httpget);

        System.out.println("----------------------------------------");
        System.out.println("Location: " + response.getLastHeader("Location"));
        System.out.println(response.getStatusLine().getStatusCode());
        System.out.println(response.getLastHeader("Content-Type"));
        System.out.println(response.getLastHeader("Content-Length"));
        System.out.println("----------------------------------------");

        // Check the returned status code to decide whether to follow a redirect to a new link
        int statusCode = response.getStatusLine().getStatusCode();
        if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY) ||
            (statusCode == HttpStatus.SC_MOVED_TEMPORARILY) ||
            (statusCode == HttpStatus.SC_SEE_OTHER) ||
            (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
            // This redirect handling has not been verified
            String newUri = response.getLastHeader("Location").getValue();
            // Release the old client's connections before replacing it
            httpclient.getConnectionManager().shutdown();
            httpclient = new DefaultHttpClient();
            httpget = new HttpGet(newUri);
            response = httpclient.execute(httpget);
        }

        // Get hold of the response entity
        HttpEntity entity = response.getEntity();

        // View all returned header information
        Header[] headers = response.getAllHeaders();
        int ii = 0;
        while (ii < headers.length) {
            System.out.println(headers[ii].getName() + ": " + headers[ii].getValue());
            ++ii;
        }

        // If the response does not enclose an entity, there is no need
        // to bother about connection release
        if (entity != null) {
            // Save the page source in a byte array, because the content may be needed twice
            byte[] bytes = EntityUtils.toByteArray(entity);
            // If the Content-Type header carries the encoding, it can be read directly here
            // (getContentCharSet returns null when the header has no charset)
            String charset = EntityUtils.getContentCharSet(entity);
            System.out.println("In header: " + charset);
            // If the header has no encoding, look in the page source instead. This is not
            // fully reliable, because some carelessly written pages declare no encoding at all.
            if (charset == null || charset.length() == 0) {
                String regEx = "<meta[^>]*?charset=[\"']?([a-zA-Z0-9-]+)";
                Pattern p = Pattern.compile(regEx, Pattern.CASE_INSENSITIVE);
                // Decode with the platform default encoding. The pattern contains no Chinese
                // characters, so garbled characters in the string do not affect the match.
                Matcher m = p.matcher(new String(bytes));
                if (m.find()) {
                    charset = m.group(1);
                } else {
                    charset = "";
                }
            }
            System.out.println("Last get: " + charset);
            // Now the original byte array can be decoded with the right encoding (if one was found)
            if (charset.length() > 0) {
                System.out.println("Encoding string is: " + new String(bytes, charset));
            }
        }

        httpclient.getConnectionManager().shutdown();
    }
}
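The encoding lookup in the listing above (Content-Type header first, then the page's meta tag, then give up) can also be tried in isolation without any network access. The following is a minimal JDK-only sketch of that fallback order. The class name CharsetSniffer, the detect method, and the fallback parameter are hypothetical names introduced for illustration and are not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: header charset first, then <meta> tag, then a caller-supplied fallback.
public class CharsetSniffer {

    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    static String detect(String contentType, byte[] body, String fallback) {
        String name = firstGroup(HEADER_CHARSET, contentType);
        if (name == null) {
            // ISO-8859-1 maps every byte to exactly one char, so the ASCII
            // markup we search for survives whatever the real encoding is.
            String markup = new String(body, Charset.forName("ISO-8859-1"));
            name = firstGroup(META_CHARSET, markup);
        }
        if (name != null) {
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    private static String firstGroup(Pattern p, String input) {
        if (input == null) {
            return null;
        }
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=gb2312\"></head></html>")
                .getBytes(Charset.forName("US-ASCII"));
        // Header declares a charset: it takes precedence over the page.
        System.out.println(detect("text/html; charset=UTF-8", page, "ISO-8859-1"));
        // Header has no charset: the meta tag is consulted.
        System.out.println(detect("text/html", page, "ISO-8859-1"));
        // Nothing declared anywhere: the fallback is returned.
        System.out.println(detect(null, new byte[0], "ISO-8859-1"));
    }
}
```

Decoding the bytes as ISO-8859-1 for the search, rather than with the platform default, is a design choice of this sketch: it keeps the result independent of the machine the code runs on. Returning a fallback instead of an empty string also means the caller can always pass the result to new String(bytes, charset).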
[This post was last edited by darkness, September 9]
HttpClient is an open-source Java client library that implements the HTTP protocol. It lets programs send HTTP requests. HttpClient now lives under the HttpComponents name, and HttpClient 4.0 is an almost complete redesign that rewrites all of the HttpClient 3.x code. HttpClient 4.0 fixes some issues left over from HttpClient 1.0. These issues could not be solved without changing the core API code, so this time the HttpClient development team completely changed the underlying code.
Changes in the HttpClient 4.0 architecture:
1. The HttpClient 4.0 API architecture was redesigned, resolving all known architectural shortcomings of HttpClient 3.x.
2. HttpClient 4.0 provides a more concise, flexible, and clear API.
3. HttpClient 4.0 has a more modular structure.
4. The performance of HttpClient 4.0 is greatly improved, including lower memory usage. HTTP transport is more efficient because it uses the HttpCore module.
5. By using protocol interceptors, HttpClient 4.0 implements cross-cutting HTTP protocol aspects.
6. HttpClient 4.0 has improved connection management that handles persistent connections better. It also supports stateful connections.
7. HttpClient 4.0 adds pluggable redirect and authentication handling.
8. HttpClient 4.0 supports sending requests through a proxy or a chain of proxies.
9. HttpClient 4.0 allows more flexible SSL context customization.
10. HttpClient 4.0 reduces the intermediate garbage produced while generating HTTP requests and parsing HTTP responses.
11. The HttpClient team encourages all projects to upgrade to HttpClient 4.0.
For more information about HttpClient, see the release notes:
http://www.apache.org/dist/httpc... t/release_notes.txt
The HttpClient 4.0 API guide is available at:
http://hc.apache.org/httpcomponents-client/tutorial/html/
Sample code for HttpClient 4.0 can be found at:
http://hc.apache.org/httpcomponents-client/examples.html
From: http://hi.baidu.com/czqaiyss/item/660918d47e3a07c51b72b44e