Some time ago, while learning Lucene, I ran into encoding errors when reading TXT documents. I looked into a few solutions. Most of them treat the file as hex (you can view it with Ctrl+H in UE, i.e. UltraEdit) and read the first four flag bytes to decide the encoding. However, some text files were still not recognized (in my case, some of the UTF-8 encoded files), and later I found jchardet. jchardet is a Java implementation of Mozilla's (that is, Firefox's) charset detection algorithm. I won't say more about it here; see the official website for details.
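For reference, the flag-byte approach mentioned above can be sketched roughly like this. It is only an illustration, not code from jchardet: the class name BomSniffer and its methods are made up for this example, and it only recognizes files that actually start with a byte order mark (BOM), which would explain why some UTF-8 files slip through.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of jchardet): guesses a charset from the leading flag bytes. */
final class BomSniffer {

    /** Converts a signed byte to its unsigned value (0..255). */
    private static int u(byte b) {
        return b & 0xFF;
    }

    /** Returns the charset name implied by the BOM, or null if no known BOM is present. */
    static String detect(byte[] head, int len) {
        // 4-byte marks are checked first, because the UTF-32LE mark begins with the
        // UTF-16LE mark. (A UTF-16LE file whose first character is NUL is ambiguous.)
        if (len >= 4 && u(head[0]) == 0x00 && u(head[1]) == 0x00
                && u(head[2]) == 0xFE && u(head[3]) == 0xFF) {
            return "UTF-32BE";
        }
        if (len >= 4 && u(head[0]) == 0xFF && u(head[1]) == 0xFE
                && u(head[2]) == 0x00 && u(head[3]) == 0x00) {
            return "UTF-32LE";
        }
        if (len >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            return "UTF-8";
        }
        if (len >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            return "UTF-16BE";
        }
        if (len >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: the encoding cannot be decided this way
    }

    /** Reads up to the first four bytes of the file and checks them for a BOM. */
    static String detect(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            int len = 0;
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (len < head.length && (n = in.read(head, len, head.length - len)) != -1) {
                len += n;
            }
            return detect(head, len);
        }
    }
}
```

A file saved as UTF-8 without a BOM simply returns null here, so this kind of check can only ever be a first step.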
Here is the jchardet-based code:
package com.zhyea.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

/**
 * Use jchardet to get the file character set.
 *
 * @author Robin
 */
public class FileCharsetDetector {

    /** Character set name */
    private static String encoding;

    /** Whether the character set has been detected */
    private static boolean found;

    private static nsDetector detector;

    private static nsICharsetDetectionObserver observer;

    /**
     * Language hint enumeration
     *
     * @author Robin
     */
    enum Language {
        Japanese(1), Chinese(2), SimplifiedChinese(3), TraditionalChinese(4), Korean(5), DontKnow(6);

        private int hint;

        Language(int hint) {
            this.hint = hint;
        }

        public int getHint() {
            return this.hint;
        }
    }

    /**
     * Pass in a File object and check the file encoding.
     *
     * @param file File object instance
     * @return file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file) throws FileNotFoundException, IOException {
        return checkEncoding(file, getNsDetector());
    }

    /**
     * Get the encoding of the file.
     *
     * @param file File object instance
     * @param lang language hint
     * @return file encoding
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(file, new nsDetector(lang.getHint()));
    }

    /**
     * Get the encoding of the file.
     *
     * @param path file path
     * @return file encoding, e.g. UTF-8, GBK, GB2312, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path));
    }

    /**
     * Get the encoding of the file.
     *
     * @param path file path
     * @param lang language hint
     * @return file encoding
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path), lang);
    }

    /**
     * Get the encoding of the file.
     *
     * @param file file to check
     * @param detector detector to use
     * @return file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static String checkEncoding(File file, nsDetector detector) throws FileNotFoundException, IOException {
        // Reset the shared static state so the result of an earlier call cannot leak into this one.
        // Note: because the state is static, this class is not thread-safe.
        found = false;
        encoding = null;

        detector.Init(getCharsetDetectionObserver());

        if (isAscii(file, detector)) {
            encoding = "ASCII";
            found = true;
        }

        if (!found) {
            // The observer was never notified, so fall back to the most probable charset.
            String[] prob = detector.getProbableCharsets();
            if (prob.length > 0) {
                encoding = prob[0];
            } else {
                return null;
            }
        }
        return encoding;
    }

    /**
     * Check whether the file content is pure ASCII, feeding the detector along the way.
     *
     * @param file file whose encoding is to be checked
     * @param detector detector to use
     * @return true if every byte read is ASCII
     * @throws IOException
     */
    private static boolean isAscii(File file, nsDetector detector) throws IOException {
        BufferedInputStream input = null;
        try {
            input = new BufferedInputStream(new FileInputStream(file));
            byte[] buffer = new byte[1024];
            int hasRead;
            boolean done = false;
            boolean isAscii = true;
            while ((hasRead = input.read(buffer)) != -1) {
                if (isAscii)
                    isAscii = detector.isAscii(buffer, hasRead);
                // Once non-ASCII bytes show up, keep feeding the detector until it is done.
                if (!isAscii && !done)
                    done = detector.DoIt(buffer, hasRead, false);
            }
            return isAscii;
        } finally {
            detector.DataEnd();
            if (null != input)
                input.close();
        }
    }

    /**
     * Create the nsDetector singleton.
     *
     * @return the shared detector
     */
    private static nsDetector getNsDetector() {
        if (null == detector) {
            detector = new nsDetector();
        }
        return detector;
    }

    /**
     * Create the nsICharsetDetectionObserver singleton.
     *
     * @return the shared observer
     */
    private static nsICharsetDetectionObserver getCharsetDetectionObserver() {
        if (null == observer) {
            observer = new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    found = true;
                    encoding = charset;
                }
            };
        }
        return observer;
    }
}
This still has one problem: for Unicode encoded files it returns windows-1252, and when I then use windows-1252 as the encoding to read the file, I get errors.
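One possible workaround, sketched below, is to trust a UTF-16 byte order mark over the detector's guess when the file has one. This is my own assumption about how to patch the gap, not something jchardet provides: the class CharsetResolver and its resolve method are hypothetical names, the detected name is simply passed in as a string, and files without a BOM are not helped by it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper (illustrative names): prefers a UTF-16 BOM over the detector's guess. */
final class CharsetResolver {

    /**
     * @param head     the first bytes of the file (at least two, if available)
     * @param detected charset name reported by the detector, may be null
     * @param fallback charset to use when nothing else applies
     */
    static Charset resolve(byte[] head, String detected, Charset fallback) {
        if (head != null && head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            boolean littleEndianBom = b0 == 0xFF && b1 == 0xFE;
            boolean bigEndianBom = b0 == 0xFE && b1 == 0xFF;
            if (littleEndianBom || bigEndianBom) {
                // Java's "UTF-16" decoder reads the BOM itself to pick the byte order.
                return StandardCharsets.UTF_16;
            }
        }
        if (detected != null) {
            try {
                if (Charset.isSupported(detected)) {
                    return Charset.forName(detected);
                }
            } catch (IllegalArgumentException e) {
                // The detector returned a syntactically illegal name: use the fallback.
            }
        }
        return fallback;
    }
}
```

The result can then be passed to something like new InputStreamReader(stream, charset) in place of the raw name returned by the detector.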
By the way, here is a download address for the jar package, since the official website is sometimes down and cannot be accessed:
http://download.csdn.net/detail/tianxiexingyun/8286849
That's it.
Get the file character set with Jchardet