How does a Java spider determine webpage encoding?

Source: Internet
Author: User

Preface

Recently, a search project required crawling many websites to obtain information. When crawling a webpage, you need to determine its character encoding. Otherwise, many of the crawled pages come out garbled.

 

Analysis

Generally, the encoding is specified in the HTTP response header or in a meta tag of the page, so you can parse either one to obtain the charset. However, some webpages do not specify an encoding at all. In that case, you need to detect the encoding from the webpage content itself. After some research, cpdetector turned out to be the best fit for this.

Cpdetector automatically detects the encoding of text. A non-null result is the detected character encoding. Several common probe (detector) implementations are built in, and instances of them can be registered through the add method. The detector proxy follows the principle that whoever first returns a non-null result wins: the probes are consulted in turn, and the first non-null result is returned as the detected character set.
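
The "first non-null result wins" principle can be sketched in a few lines of plain Java. This is only an illustration of the idea, not cpdetector's actual code: CharsetProbe and FirstMatchProxy are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a probe: returns null when it cannot decide.
interface CharsetProbe {
    Charset detect(byte[] content);
}

// Hypothetical proxy that asks each probe in the order it was added.
final class FirstMatchProxy {
    private final List<CharsetProbe> probes = new ArrayList<>();

    void add(CharsetProbe probe) {
        probes.add(probe);
    }

    // Returns the first non-null probe result, or null if no probe can decide.
    Charset detect(byte[] content) {
        for (CharsetProbe probe : probes) {
            Charset result = probe.detect(content);
            if (result != null) {
                return result;
            }
        }
        return null;
    }
}
```

Because the first answer wins, the order in which probes are added matters: a probe added earlier takes priority over every probe added after it.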

 

1. First, the charset can be parsed from the header.

The Content-Type field in the HTTP response header specifies the encoding, for example "Content-Type: text/html; charset=utf-8".

  

You can analyze the header to find the character encoding.

    Map<String, List<String>> map = urlConnection.getHeaderFields();
    Set<String> keys = map.keySet();
    Iterator<String> iterator = keys.iterator();
    // Traverse the headers and search for the character encoding
    String key = null;
    String tmp = null;
    while (iterator.hasNext()) {
        key = iterator.next();
        tmp = map.get(key).toString().toLowerCase();
        // Get the charset from Content-Type (header names are case-insensitive;
        // the status line is stored under a null key)
        if (key != null && key.equalsIgnoreCase("Content-Type")) {
            int m = tmp.indexOf("charset=");
            if (m != -1) {
                // The list's toString() appends "]", so strip it
                strencoding = tmp.substring(m + 8).replace("]", "");
                return strencoding;
            }
        }
    }
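
If you already have the Content-Type value as a single string (for example from URLConnection.getContentType()), the charset parameter can also be extracted without walking all headers. The following is a minimal sketch under that assumption; ContentTypeCharset and parse are hypothetical names, and the code only handles the simple "charset=value" form with optional double quotes.

```java
import java.util.Locale;

final class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if there is none.
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters are separated by ';', e.g. "text/html; charset=UTF-8"
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

A null return here means the header gave no answer, so the crawler should move on to the next step.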

 

2. Second, the charset can be parsed from the webpage meta tag.

Normally, the encoding is declared when the webpage is written, so it can be read from the meta tag.

  

First, read the webpage stream. Because English letters and digits do not come out garbled, you can still parse the meta tag and extract the charset.

    StringBuffer sb = new StringBuffer();
    String line;
    try {
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        in.close();
    } catch (Exception e) {
        // Report any errors that arise
        System.err.println(e);
        System.err.println("Usage: java HttpClient <URL> [<filename>]");
    }
    String htmlcode = sb.toString();
    // Parse the HTML source, locate each <meta> tag, and extract the charset
    String strbegin = "<meta";
    String strend = ">";
    String strtmp;
    int begin = htmlcode.indexOf(strbegin);
    int end = -1;
    int inttmp;
    while (begin > -1) {
        end = htmlcode.substring(begin).indexOf(strend);
        if (begin > -1 && end > -1) {
            strtmp = htmlcode.substring(begin, begin + end).toLowerCase();
            inttmp = strtmp.indexOf("charset");
            if (inttmp > -1) {
                strencoding = strtmp.substring(inttmp + 7, end)
                        .replace("=", "").replace("/", "").replace("\"", "")
                        .replace("\'", "").replace(" ", "");
                return strencoding;
            }
        }
        // Skip past this "<meta" so the next search does not find the same tag again
        htmlcode = htmlcode.substring(begin + strbegin.length());
        begin = htmlcode.indexOf(strbegin);
    }
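
The same meta lookup can be written more compactly with a regular expression. This is a rough sketch rather than a full HTML parser: MetaCharset and parse are hypothetical names, and the pattern assumes the charset appears inside a meta tag either as charset="..." or inside content="text/html; charset=...".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharset {
    // Matches "<meta ... charset = value", with or without a quote before the value.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the first matching meta tag, or null if none is found.
    static String parse(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

The match is case-insensitive, so tags written as <META ...> are found as well, and the charset name is returned exactly as it is written in the page.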

 

3. When steps 1 and 2 fail to yield an encoding, cpdetector is used to detect it from the webpage content.

You can add multiple detector instances:

    // The cpdetector classes used here are in the info.monitorenter.cpdetector.io package
    public static void getFileEncoding(URL url) throws MalformedURLException, IOException {
        CodepageDetectorProxy codepageDetectorProxy = CodepageDetectorProxy.getInstance();
        // Probes are consulted in the order they are added
        codepageDetectorProxy.add(JChardetFacade.getInstance());
        codepageDetectorProxy.add(ASCIIDetector.getInstance());
        codepageDetectorProxy.add(UnicodeDetector.getInstance());
        codepageDetectorProxy.add(new ParsingDetector(false));
        codepageDetectorProxy.add(new ByteOrderMarkDetector());
        Charset charset = codepageDetectorProxy.detectCodepage(url);
        System.out.println(charset.name());
    }
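
Once a charset name has been obtained by any of the three steps, it still has to be turned into a Charset and applied to the raw page bytes. Here is a small sketch of that last step. PageDecoder, resolve and decode are hypothetical names, and falling back to UTF-8 when the name is missing or unsupported is an assumption of this example, not something the steps above prescribe.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageDecoder {
    // Resolves a charset name; falls back to UTF-8 (an assumption of this sketch)
    // when the name is null, blank, malformed, or not supported by the JVM.
    static Charset resolve(String name) {
        if (name == null || name.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException
            return StandardCharsets.UTF_8;
        }
    }

    // Decodes the raw page bytes with the resolved charset.
    static String decode(byte[] body, String charsetName) {
        return new String(body, resolve(charsetName));
    }
}
```

Decoding the bytes only after the charset is known is what keeps the crawled text from being garbled, so it helps to keep the page as a byte array until this point.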

 

 

 

 
