How to detect a web page's character encoding before a Java crawler crawls the page content

Source: Internet
Author: User

I have recently been working on a crawler feature: it crawls web page content, runs semantic analysis on that content, and finally tags the page, so that the attributes of the pages a user visits can be determined.

While crawling content I ran into garbled text, so the crawler first has to determine the character encoding of the web page. There are broadly three ways to do this: first, read the charset from the Content-Type HTTP response header; second, read the charset from the Content-Type meta tag in the HTML; third, infer the encoding by analyzing the page content.
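The first two approaches both come down to pulling a charset=... token out of a string. A minimal sketch, assuming the header value or the meta tag line is already in hand; the class name DeclaredCharset and the regular expression are illustrative and not part of the original code:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredCharset {
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name in lower case, or null if none is declared. */
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        // A Content-Type header value
        System.out.println(find("text/html; charset=UTF-8"));
        // A meta tag line
        System.out.println(find("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">"));
        // No declaration at all
        System.out.println(find("text/html"));
    }
}
```

A regular expression tolerates differences in case, spacing and quoting that plain split calls would trip over, but it still only reports what the page declares, not what the bytes actually are.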

Neither of the first two methods reliably indicates the page's actual encoding, because the declaration may be missing or may not match the bytes the server sends. All things considered, I added the third method.
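One way to see why a declared charset cannot be trusted blindly is to decode the page bytes strictly under that charset and check whether the decoder reports an error. The sketch below uses only the JDK; the class and method names are hypothetical and are not part of the original code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DeclaredCharsetCheck {
    /** Returns true if the bytes decode under the charset without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1 ends with the single byte 0xE9,
        // which is not a complete UTF-8 sequence.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));
    }
}
```

A clean decode does not prove the declaration is right (ISO-8859-1, for example, accepts any byte sequence); the check only catches declarations that are clearly wrong, which is why content-based detection is still useful.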

The third approach uses the open-source jar info.monitorenter.cpdetector, which can be downloaded from GitHub (https://github.com/onilton/cpdetector-maven-repo/tree/master/info/monitorenter/cpdetector/1.0.10).
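As a rough illustration of what content-based detection involves, the simplest signal is a byte order mark (BOM) at the start of the data. The sketch below is not how cpdetector is implemented; it is a hand-written example with a hypothetical class name, and it only covers the UTF-8 and UTF-16 marks:

```java
final class BomSniffer {
    /** Returns the charset name implied by a leading byte order mark, or null if there is none. */
    static String sniff(byte[] head) {
        if (head == null) {
            return null;
        }
        // EF BB BF: UTF-8
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // FE FF: UTF-16, big endian
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // FF FE: UTF-16, little endian (a UTF-32LE mark starts the same way and is not distinguished here)
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(sniff(withBom));
        System.out.println(sniff(withoutBom));
    }
}
```

Most web pages carry no BOM, so a detector has to fall back on statistical analysis of the bytes, which is the part a library such as cpdetector provides.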

package com.mobivans.encoding;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.ByteOrderMarkDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.IOUtils;

public class PageEncoding {

    private static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36";

    /**
     * Test case.
     * @param args
     */
    public static void main(String[] args) {
        // String charset = getEncodingByHeader("http://blog.csdn.net/liuzhenwen/article/details/4060922");
        // String charset = getEncodingByMeta("http://blog.csdn.net/liuzhenwen/article/details/4060922");
        String charset = getEncodingByContentStream("http://blog.csdn.net/liuzhenwen/article/details/5930910");
        System.out.println(charset);
    }

    /**
     * Get the page encoding from the response header.
     * @param strUrl
     * @return the declared charset, or null if none was found
     */
    public static String getEncodingByHeader(String strUrl) {
        String charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Get the response headers of the connection
            Map<String, List<String>> headerFields = urlConn.getHeaderFields();
            // Check whether the headers contain Content-Type
            if (headerFields.containsKey("Content-Type")) {
                // Get the Content-Type value from the headers: [text/html; charset=utf-8]
                List<String> attrs = headerFields.get("Content-Type");
                String[] parts = attrs.get(0).split(";");
                for (String att : parts) {
                    if (att.contains("charset")) {
                        // System.out.println(att.split("=")[1]);
                        charset = att.split("=")[1].trim();
                    }
                }
            }
            return charset;
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return charset;
        } catch (IOException e) {
            e.printStackTrace();
            return charset;
        }
    }

    /**
     * Get the page encoding from the meta tag.
     * @param strUrl
     * @return the declared charset, or null if none was found
     */
    public static String getEncodingByMeta(String strUrl) {
        String charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Set a User-Agent to avoid being rejected
            urlConn.setRequestProperty("User-Agent", USER_AGENT);
            // Read the HTML into a list of lines. ISO-8859-1 maps every byte to a character,
            // so the ASCII meta tag stays readable whatever the real encoding is.
            List<String> lines = IOUtils.readLines(urlConn.getInputStream(), StandardCharsets.ISO_8859_1);
            for (String line : lines) {
                if (line.contains("http-equiv") && line.contains("charset")) {
                    // System.out.println(line);
                    // Expects the form content="text/html; charset=utf-8"
                    String[] parts = line.split(";");
                    if (parts.length < 2) {
                        continue;
                    }
                    String tmp = parts[1];
                    int eq = tmp.indexOf("=");
                    int quote = tmp.indexOf("\"", eq + 1);
                    if (eq >= 0 && quote > eq) {
                        charset = tmp.substring(eq + 1, quote).trim();
                    }
                }
            }
            return charset;
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return charset;
        } catch (IOException e) {
            e.printStackTrace();
            return charset;
        }
    }

    /**
     * Detect the page encoding from the page content.
     * Case: pages that can be read directly (exception: some blog sites reject
     * requests that carry no User-Agent information).
     * @param url
     * @return the detected charset name in lower case, or null
     */
    public static String getEncodingByContentUrl(String url) {
        CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
        cdp.add(JChardetFacade.getInstance()); // depends on antlr.jar & chardet.jar
        cdp.add(ASCIIDetector.getInstance());
        cdp.add(UnicodeDetector.getInstance());
        cdp.add(new ParsingDetector(false));
        cdp.add(new ByteOrderMarkDetector());

        Charset charset = null;
        try {
            charset = cdp.detectCodepage(new URL(url));
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(charset);
        return charset == null ? null : charset.name().toLowerCase();
    }

    /**
     * Detect the page encoding from the page content.
     * Case: pages that cannot be read directly. The page is converted into an
     * input stream that supports mark, and the encoding is then detected from it.
     * @param strUrl
     * @return the detected charset name in lower case, or null
     */
    public static String getEncodingByContentStream(String strUrl) {
        Charset charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Open the connection and set a User-Agent to avoid being rejected
            urlConn.setRequestProperty("User-Agent", USER_AGENT);

            // Analyze the page content
            CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
            cdp.add(JChardetFacade.getInstance()); // depends on antlr.jar & chardet.jar
            cdp.add(ASCIIDetector.getInstance());
            cdp.add(UnicodeDetector.getInstance());
            cdp.add(new ParsingDetector(false));
            cdp.add(new ByteOrderMarkDetector());

            try (InputStream in = urlConn.getInputStream()) {
                ByteArrayInputStream bais = new ByteArrayInputStream(IOUtils.toByteArray(in));
                // detectCodepage(InputStream in, int length) only supports an InputStream that supports mark
                charset = cdp.detectCodepage(bais, Integer.MAX_VALUE);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset == null ? null : charset.name().toLowerCase();
    }
}
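Once a charset name has been detected, the remaining step is to decode the downloaded bytes with it so the text is no longer garbled. A minimal sketch using only the JDK, assuming the name comes from one of the methods above and may be null; the class PageDecoder and the choice of UTF-8 as the fallback are assumptions for illustration, not part of the original code:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class PageDecoder {
    /** Maps a detected charset name to a Charset, falling back to UTF-8 when it is missing or unusable. */
    static Charset resolve(String name) {
        if (name == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : StandardCharsets.UTF_8;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the raw page bytes with the detected charset. */
    static String decode(byte[] body, String detectedName) {
        return new String(body, resolve(detectedName));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(body, "iso-8859-1").equals("caf\u00e9"));
        System.out.println(resolve(null));
    }
}
```

Keeping the raw bytes (as getEncodingByContentStream does with its ByteArrayInputStream) means the page only has to be downloaded once for both detection and decoding.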

Notes:

1. info.monitorenter.cpdetector is not published on mvn-repository, so it cannot be downloaded from there. The jar has to be downloaded locally and then installed manually into the local repository. The mvn command is as follows:

mvn install:install-file -Dfile=<location of the jar> -DgroupId=<the jar's groupId> -DartifactId=<the jar's artifactId> -Dversion=<the jar's version> -Dpackaging=jar

Then add the jar as a dependency in pom.xml:

<!-- charset detector -->
<dependency>
    <groupId>info.monitorenter.cpdetector</groupId>
    <artifactId>cpdetector</artifactId>
    <version>1.0.10</version>
</dependency>

2. JChardetFacade.getInstance() throws an exception unless antlr.jar and chardet.jar are available, so add these two jars as dependencies in pom.xml:

<!-- antlr -->
<dependency>
    <groupId>antlr</groupId>
    <artifactId>antlr</artifactId>
    <version>2.7.7</version>
</dependency>
<!-- chardetfacade -->
<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

If it is an ordinary (non-Maven) project, there is no pom.xml to worry about: just download the three jars and add them to the project's classpath.
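Whichever way the jars are added, a quick check that all three are actually on the classpath can save a confusing exception later. The sketch below is only a convenience and not part of the original code; the class names probed for antlr.jar and chardet.jar are assumptions about what those jars contain, so adjust them if your versions differ:

```java
final class DetectorClasspathCheck {
    /** Returns true if the named class can be found on the classpath, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, DetectorClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "info.monitorenter.cpdetector.io.CodepageDetectorProxy", // taken from the imports in the code above
            "antlr.Token",                                           // assumed to be in antlr.jar
            "org.mozilla.intl.chardet.nsDetector"                    // assumed to be in chardet.jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```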
