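Reposted from: http://babyjoycry.javaeye.com/blog/587527, with thanks to the original author \(^o^)/~

While recently working on crawling web content, I needed to find out the encoding of each page, and discovered that Java has no ready-made method for this. Someone on CSDN did write an article with accompanying code, but unfortunately I could not find the package it depended on, so I had to write my own. The class below tries three sources in order: the charset in the HTTP Content-Type header, the charset in a <meta> tag of the page source, and finally byte-level detection with cpdetector. If all three fail, it falls back to GBK.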
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import cpdetector.io.CodepageDetectorProxy;
import cpdetector.io.HTMLCodepageDetector;
import cpdetector.io.JChardetFacade;

public class PageEncodeDetector {

    private static CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

    static {
        detector.add(new HTMLCodepageDetector(false));
        detector.add(JChardetFacade.getInstance());
    }

    /**
     * Test case
     *
     * @param args
     */
    public static void main(String[] args) {
        PageEncodeDetector web = new PageEncodeDetector();
        try {
            System.out.println(web.getCharset("http://www.baidu.com/"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * @param strUrl
     *            page URL, which must start with http://, e.g. http://www.pujia.com
     * @return the page encoding
     * @throws IOException
     */
    public String getCharset(String strUrl) throws IOException {
        // Define the URL object
        URL url = new URL(strUrl);
        // Get the HTTP connection object
        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
        urlConnection.connect();
        // Page encoding
        String strEncoding = null;

        /**
         * First, determine the page encoding from the response headers
         */
        // map holds the response headers of the page
        Map<String, List<String>> map = urlConnection.getHeaderFields();
        Set<String> keys = map.keySet();
        Iterator<String> iterator = keys.iterator();

        // Traverse the headers, looking for the character encoding
        String key = null;
        String tmp = null;
        while (iterator.hasNext()) {
            key = iterator.next();
            tmp = map.get(key).toString().toLowerCase();
            // Get the charset from Content-Type (header names are case-insensitive)
            if (key != null && key.equalsIgnoreCase("Content-Type")) {
                int m = tmp.indexOf("charset=");
                if (m != -1) {
                    // tmp is the List's toString(), so strip the trailing "]"
                    strEncoding = tmp.substring(m + 8).replace("]", "").trim();
                    return strEncoding;
                }
            }
        }

        /**
         * Second, get the page encoding by parsing the <meta> tags
         */
        // Get the page source (ASCII letters and digits are not garbled,
        // so the <meta/> area can be read correctly whatever the encoding)
        StringBuffer sb = new StringBuffer();
        String line;
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
            in.close();
        } catch (Exception e) {
            // Report any error that arises and continue with what was read
            System.err.println("Failed to read page source: " + e);
        }
        // Lower-case the source so that <META ...> and CHARSET are found too
        String htmlCode = sb.toString().toLowerCase();

        // Parse the HTML source, take each <meta ...> area and extract the charset
        String strBegin = "<meta";
        String strEnd = ">";
        int begin = htmlCode.indexOf(strBegin);
        while (begin > -1) {
            int end = htmlCode.indexOf(strEnd, begin);
            if (end == -1) {
                break; // unterminated tag, nothing more to parse
            }
            String strTmp = htmlCode.substring(begin, end);
            int intTmp = strTmp.indexOf("charset");
            if (intTmp > -1) {
                strEncoding = strTmp.substring(intTmp + 7)
                        .replace("=", "").replace("/", "")
                        .replace("\"", "").replace("'", "").replace(" ", "");
                return strEncoding;
            }
            // Continue after this tag. The original code searched again from the
            // start of the same tag and looped forever when the first <meta>
            // tag had no charset.
            begin = htmlCode.indexOf(strBegin, end);
        }

        /**
         * Third, detect the page encoding from its bytes
         */
        strEncoding = getFileEncoding(url);

        // Fall back to the default page encoding
        if (strEncoding == null) {
            strEncoding = "GBK";
        }
        return strEncoding;
    }

    /**
     * Identifies the page encoding from the page content.
     *
     * @param url
     *            the page URL
     * @return the page encoding, or null if detection failed
     */
    public static String getFileEncoding(URL url) {
        java.nio.charset.Charset charset = null;
        try {
            charset = detector.detectCodepage(url);
        } catch (Exception e) {
            System.out.println(e.getClass() + ": failed to detect encoding");
        }
        if (charset != null) {
            return charset.name();
        }
        return null;
    }
}
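The header and <meta> steps above both pull the charset out with substring() and a chain of replace() calls, which can leave stray characters behind when other attributes follow the charset. As a rough alternative sketch (not the original author's method), a single regular expression can handle both cases. The class name CharsetSniffer and the method findCharset are made up for this illustration; only java.util.regex and java.nio.charset.Charset from the JDK are used, and the pattern is an assumption about what typical Content-Type values and <meta> tags look like.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original class or of cpdetector.
class CharsetSniffer {

    // Matches charset=UTF-8, charset="utf-8", charset = 'gbk' and so on, ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:+-]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the canonical name of the first charset declared in the given text
     * that this JVM supports, or null if there is none. The text is expected to be
     * a Content-Type header value or a single <meta> tag.
     */
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore it and keep looking
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findCharset("text/html; charset=utf-8"));
        System.out.println(findCharset("<meta charset=\"ISO-8859-1\" name=\"x\">"));
    }
}
```

Checking the name with Charset.isSupported means a misspelled or unknown charset in the page is treated as "not found", so the caller can still fall through to byte-level detection instead of returning a name that later fails in new String(bytes, name).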
To compile and run the class, download cpdetector_1.0.5.jar and chardet.jar and add both to the classpath.