Get the file character set with Jchardet

Source: Internet
Author: User
Tags intl

Some time ago, while learning Lucene, I ran into encoding errors when reading TXT documents. I looked at several solutions; most of them convert the file to hex (you can view this in UE with CTRL+H) and read the first four flag bytes at the beginning of the file to determine the encoding. However, some text files are still not recognized this way (in my case, some UTF-8 encoded files), and later I found Jchardet. Jchardet is a Java implementation of the charset detection algorithm used by Mozilla (that is, Firefox). I won't say more about it here; see the official website for details.
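The flag-byte check described above can be sketched as follows. This is only an illustration, not the code of any of the solutions mentioned: the class `BomSniffer` and its methods are hypothetical names, and it recognizes only files that begin with a byte order mark (BOM). A file without a BOM yields null, which may be why some files go unrecognized with this approach.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of Jchardet): guesses an encoding from a leading BOM. */
final class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if none is present. */
    static String detect(byte[] head) {
        int n = head.length;
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;
        int b3 = n > 3 ? head[3] & 0xFF : -1;

        // Check the 4-byte UTF-32 marks first: FF FE 00 00 also starts with the
        // UTF-16LE mark FF FE. (A UTF-16LE file whose first character is U+0000
        // would be misreported here; this sketch accepts that ambiguity.)
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        return null; // no BOM: the encoding cannot be determined from flag bytes
    }

    /** Reads up to the first four bytes of a file and checks them for a BOM. */
    static String detectFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[4];
            int len = 0;
            while (len < head.length) {
                int r = in.read(head, len, head.length - len);
                if (r < 0) break; // file is shorter than four bytes
                len += r;
            }
            return detect(Arrays.copyOf(head, len));
        }
    }
}
```

Calling `BomSniffer.detectFile(path)` on a UTF-8 file saved without a BOM returns null, so a caller would still need a content-based detector for such files.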

Here is the code that uses Jchardet:

package com.zhyea.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

/**
 * Use Jchardet to get the file character set
 *
 * @author Robin
 */
public class FileCharsetDetector {

    /** Character set name */
    private static String encoding;
    /** Whether the character set has been detected */
    private static boolean found;
    private static nsDetector detector;
    private static nsICharsetDetectionObserver observer;

    /**
     * Language hint enumeration
     *
     * @author Robin
     */
    enum Language {
        Japanese(1), Chinese(2), SimplifiedChinese(3), TraditionalChinese(4), Korean(5), DontKnow(6);

        private int hint;

        Language(int hint) {
            this.hint = hint;
        }

        public int getHint() {
            return this.hint;
        }
    }

    /**
     * Pass in a File object and check the file encoding
     *
     * @param file File object instance
     * @return file encoding, or null if none
     */
    public static String checkEncoding(File file) throws FileNotFoundException, IOException {
        return checkEncoding(file, getNsDetector());
    }

    /**
     * Get the encoding of the file
     *
     * @param file File object instance
     * @param lang language hint
     * @return file encoding
     */
    public static String checkEncoding(File file, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(file, new nsDetector(lang.getHint()));
    }

    /**
     * Get the encoding of the file
     *
     * @param path file path
     * @return file encoding, e.g. UTF-8, GBK, GB2312, or null if none
     */
    public static String checkEncoding(String path) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path));
    }

    /**
     * Get the encoding of the file
     *
     * @param path file path
     * @param lang language hint
     */
    public static String checkEncoding(String path, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path), lang);
    }

    /** Get the encoding of the file using the given detector */
    private static String checkEncoding(File file, nsDetector detector) throws FileNotFoundException, IOException {
        // Reset the shared state so a result from an earlier call is not returned again
        found = false;
        encoding = null;

        detector.Init(getCharsetDetectionObserver());
        if (isAscii(file, detector)) {
            encoding = "ASCII";
            found = true;
        }
        if (!found) {
            // The observer was never notified: fall back to the most probable charset
            String[] prob = detector.getProbableCharsets();
            if (prob.length > 0) {
                encoding = prob[0];
            } else {
                return null;
            }
        }
        return encoding;
    }

    /**
     * Check whether the file is pure ASCII, feeding non-ASCII data to the detector
     *
     * @param file the file whose encoding is to be checked
     * @param detector the detector to feed
     */
    private static boolean isAscii(File file, nsDetector detector) throws IOException {
        BufferedInputStream input = null;
        try {
            input = new BufferedInputStream(new FileInputStream(file));
            byte[] buffer = new byte[1024];
            int hasRead;
            boolean done = false;
            boolean ascii = true;
            while ((hasRead = input.read(buffer)) != -1) {
                if (ascii)
                    ascii = detector.isAscii(buffer, hasRead);
                if (!ascii && !done)
                    done = detector.DoIt(buffer, hasRead, false);
            }
            return ascii;
        } finally {
            detector.DataEnd();
            if (null != input)
                input.close();
        }
    }

    /** Lazily create the shared nsDetector */
    private static nsDetector getNsDetector() {
        if (null == detector) {
            detector = new nsDetector();
        }
        return detector;
    }

    /** Lazily create the shared nsICharsetDetectionObserver */
    private static nsICharsetDetectionObserver getCharsetDetectionObserver() {
        if (null == observer) {
            observer = new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    found = true;
                    encoding = charset;
                }
            };
        }
        return observer;
    }
}

This code still has a problem: for Unicode-encoded files it returns windows-1252, and when I then use windows-1252 as the encoding, I get an error.
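One possible safeguard against a wrong guess is to check whether the file's bytes decode cleanly under a candidate charset before trusting it. The sketch below uses only `java.nio.charset`; the class `CharsetCheck` and its method are hypothetical names, not part of Jchardet. Note the limitation: single-byte charsets such as windows-1252 accept almost any byte sequence, so this check is mainly useful for confirming or ruling out multi-byte candidates such as UTF-8, and it is not a fix for the windows-1252 result itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper (not part of Jchardet): validates a guessed charset name. */
final class CharsetCheck {

    /** Returns true only if the charset is known to the JVM and decodes the data without errors. */
    static boolean decodesCleanly(byte[] data, String charsetName) {
        CharsetDecoder decoder;
        try {
            decoder = Charset.forName(charsetName).newDecoder();
        } catch (IllegalArgumentException e) {
            // Covers null, illegal, and unsupported charset names
            return false;
        }
        // Report errors instead of silently substituting replacement characters
        decoder.onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A caller could, for example, try `decodesCleanly(bytes, "UTF-8")` first and only fall back to the detector's guess when that returns false.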

By the way, here is a download address for the jar, since the official website is sometimes unreachable:

http://download.csdn.net/detail/tianxiexingyun/8286849

That's it.
