Preface
Recently, a search project required crawling many websites to gather information. When crawling a page, you first need to determine the page's character encoding; otherwise you will find that many of the crawled pages come out garbled.
Analysis
Generally, the encoding is specified in the page's header: you can parse the HTTP header or the HTML meta tag to obtain the charset. Sometimes, however, the page specifies no encoding at all, and you have to detect the encoding from the page content itself. After some research, cpdetector turned out to be the best fit for this.
Cpdetector automatically detects the encoding of text; a non-empty result is the detected character set. It ships with several common detector implementations, and instances of them are registered through the add method. The proxy follows a "first non-empty result wins" principle: whichever registered detector first returns a non-empty result determines the character set encoding that is reported.
1. First, the charset can be parsed from the HTTP header.
The Content-Type field in the response header usually specifies the encoding, so you can scan the header fields for the character encoding.
    Map<String, List<String>> map = urlConnection.getHeaderFields();
    Set<String> keys = map.keySet();
    Iterator<String> iterator = keys.iterator();
    // Walk the headers looking for a character encoding
    String key = null;
    String tmp = null;
    while (iterator.hasNext()) {
        key = iterator.next();
        tmp = map.get(key).toString().toLowerCase();
        // The Content-Type header may carry "charset=..."
        if (key != null && key.equals("Content-Type")) {
            int m = tmp.indexOf("charset=");
            if (m != -1) {
                strencoding = tmp.substring(m + 8).replace("]", "");
                return strencoding;
            }
        }
    }
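The snippet above assumes that urlConnection has already been opened from the target URL. A minimal setup might look like the following (the URL is a placeholder for illustration; note that openConnection() does not transfer any data until a method such as getHeaderFields() is called):

```java
import java.net.URL;
import java.net.URLConnection;

public class HeaderDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; no network I/O happens at openConnection() time
        URL url = new URL("http://www.example.com/");
        URLConnection urlConnection = url.openConnection();
        System.out.println(urlConnection != null);
    }
}
```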
2. Second, the charset can be parsed from the webpage's meta tag.
Under normal circumstances, the page author declares the encoding in a meta tag, so it can be read from there. First fetch the page as a stream; since ASCII letters and digits are never garbled regardless of the actual encoding, the meta tag can be parsed safely to extract the charset.
    StringBuffer sb = new StringBuffer();
    String line;
    try {
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        in.close();
    } catch (Exception e) {
        // Report any errors that arise
        System.err.println(e);
        System.err.println("Usage: java HttpClient <URL> [<filename>]");
    }
    String htmlcode = sb.toString();
    // Parse the HTML source, locate each <meta ...> block, and extract the charset
    String strbegin = "<meta";
    String strend = ">";
    String strtmp;
    int begin = htmlcode.indexOf(strbegin);
    int end = -1;
    int inttmp;
    while (begin > -1) {
        end = htmlcode.substring(begin).indexOf(strend);
        if (begin > -1 && end > -1) {
            strtmp = htmlcode.substring(begin, begin + end).toLowerCase();
            inttmp = strtmp.indexOf("charset");
            if (inttmp > -1) {
                strencoding = strtmp.substring(inttmp + 7, end)
                        .replace("=", "").replace("/", "")
                        .replace("\"", "").replace("'", "").replace(" ", "");
                return strencoding;
            }
        }
        // Advance past this <meta tag, otherwise indexOf finds it again forever
        htmlcode = htmlcode.substring(begin + strbegin.length());
        begin = htmlcode.indexOf(strbegin);
    }
3. When neither method 1 nor method 2 yields an encoding, use cpdetector to detect the encoding from the page content.
Multiple detector instances can be registered:
    public static void getFileEncoding(URL url) throws MalformedURLException, IOException {
        CodepageDetectorProxy codepageDetectorProxy = CodepageDetectorProxy.getInstance();
        codepageDetectorProxy.add(JChardetFacade.getInstance());
        codepageDetectorProxy.add(ASCIIDetector.getInstance());
        codepageDetectorProxy.add(UnicodeDetector.getInstance());
        codepageDetectorProxy.add(new ParsingDetector(false));
        codepageDetectorProxy.add(new ByteOrderMarkDetector());
        Charset charset = codepageDetectorProxy.detectCodepage(url);
        System.out.println(charset.name());
    }
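To see how the two lightweight checks (steps 1 and 2) fit together before falling back to cpdetector, here is a sketch using plain string parsing. The class and method names are hypothetical, not part of cpdetector, and the meta regex is a simplification of the substring scan shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers illustrating the header and meta checks
public class CharsetSniffer {

    // Step 1: pull "charset=..." out of a Content-Type header value
    public static String fromContentType(String contentType) {
        if (contentType == null) return null;
        int m = contentType.toLowerCase().indexOf("charset=");
        if (m == -1) return null;
        return contentType.substring(m + 8).replace("]", "").trim();
    }

    // Step 2: scan the HTML source for a <meta ... charset=...> declaration
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    public static String fromMeta(String html) {
        Matcher matcher = META_CHARSET.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8"));
        System.out.println(fromMeta("<html><head><meta charset=\"gbk\"></head></html>"));
    }
}
```

If both helpers return null, hand the URL to cpdetector's detectCodepage as in step 3.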