How Java Gets the File Encoding Format

Source: Internet
Author: User

1: A simple check: UTF-8 or not. On a Chinese system, text that is not UTF-8 is generally GBK, so GBK can serve as the default. When a file is saved in certain character sets, the encoding information may be stored in the first three bytes of the file (the byte-order mark, or BOM). The basic approach is therefore to read the first three bytes of the file and test their values, from which the encoding format can be learned. In practice, if the project runs on a Chinese operating system and the text files belong to the project itself (i.e., the developers control their encoding), it is enough to distinguish the two common encodings, GBK and UTF-8. Since Chinese Windows defaults to GBK, only the UTF-8 case needs to be recognized. For a UTF-8 text file with a BOM, the first three bytes (as signed Java bytes) are -17, -69, -65 (i.e., 0xEF 0xBB 0xBF), so the check looks like this:
    File file = new File(path);
    InputStream in = new java.io.FileInputStream(file);
    byte[] b = new byte[3];
    in.read(b);
    in.close();
    if (b[0] == -17 && b[1] == -69 && b[2] == -65)
        System.out.println(file.getName() + ": encoded as UTF-8");
    else
        System.out.println(file.getName() + ": may be GBK, or another encoding");

2: For more sophisticated encoding detection, you can use the open-source project cpdetector (http://cpdetector.sourceforge.net/). Its class library is small, only about 500 KB. cpdetector is based on statistical principles, so it is not guaranteed to be completely correct. Using the library to detect a text file's encoding looks like this:

Reading an external file (first use cpdetector to detect the file's encoding format, then read the file with the detected encoding):

    /**
     * Use the third-party open-source package cpdetector to get the file encoding format.
     *
     * @param path path of the source file whose encoding format is to be determined
     * @author huanglei
     * @version 2012-7-12 14:05
     */
    public static String getFileEncode(String path) {
        /*
         * detector is the detector; it delegates the detection task to concrete
         * instances of the detector implementation classes. cpdetector contains a
         * number of commonly used detector implementations, which can be added via
         * the add method, e.g. ParsingDetector, JChardetFacade, ASCIIDetector,
         * UnicodeDetector. The detector returns the detected character set on the
         * principle of "whichever detector first returns a non-null result wins".
         * Three third-party jars are used: antlr.jar, chardet.jar, and cpdetector.jar.
         * cpdetector is based on statistical principles and is not guaranteed to be
         * completely correct.
         */
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        /*
         * ParsingDetector can check the encoding of HTML, XML and similar files or
         * character streams; the constructor parameter indicates whether to display
         * the details of the detection process (false = do not display).
         */
        detector.add(new ParsingDetector(false));
        /*
         * JChardetFacade wraps the jchardet library provided by Mozilla and can
         * detect the encoding of most files, so this detector alone meets the needs
         * of most projects. If you want to be extra safe, you can add a few more
         * detectors, such as the ASCIIDetector and UnicodeDetector below.
         */
        detector.add(JChardetFacade.getInstance()); // requires antlr.jar and chardet.jar
        // ASCIIDetector handles ASCII detection
        detector.add(ASCIIDetector.getInstance());
        // UnicodeDetector handles the Unicode family of encodings
        detector.add(UnicodeDetector.getInstance());
        java.nio.charset.Charset charset = null;
        File f = new File(path);
        try {
            charset = detector.detectCodepage(f.toURI().toURL());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        if (charset != null)
            return charset.name();
        else
            return null;
    }

    String charsetName = getFileEncode(configFilePath);
    System.out.println(charsetName);
    InputStream inputStream = new FileInputStream(configFile);
    BufferedReader in = new BufferedReader(new InputStreamReader(inputStream, charsetName));

Reading a resource file inside a jar package (use cpdetector to detect the encoding format of the resource file inside the jar, then read the file with the detected encoding):

    /**
     * Use the third-party open-source package cpdetector to get the encoding of
     * the file addressed by a URL.
     *
     * @param url URL of the source file whose encoding format is to be determined
     * @author huanglei
     * @version 2012-7-12 14:05
     */
    public static String getFileEncode(URL url) {
        /*
         * The detector setup is the same as in getFileEncode(String path): the
         * CodepageDetectorProxy delegates to the concrete detector implementations
         * added via add and returns the result of whichever detector first yields
         * a non-null character set. Requires antlr.jar, chardet.jar, and
         * cpdetector.jar; detection is statistical and not guaranteed correct.
         */
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        // ParsingDetector checks HTML, XML and similar files; false = do not print details
        detector.add(new ParsingDetector(false));
        // JChardetFacade wraps Mozilla's jchardet (requires antlr.jar and chardet.jar)
        detector.add(JChardetFacade.getInstance());
        // ASCIIDetector handles ASCII detection
        detector.add(ASCIIDetector.getInstance());
        // UnicodeDetector handles the Unicode family of encodings
        detector.add(UnicodeDetector.getInstance());
        java.nio.charset.Charset charset = null;
        try {
            charset = detector.detectCodepage(url);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        if (charset != null)
            return charset.name();
        else
            return null;
    }

    URL url = CreateStationTreeModel.class.getResource("/resource/" + "config file");
    URLConnection urlConnection = url.openConnection();
    InputStream inputStream = urlConnection.getInputStream();
    String charsetName = getFileEncode(url);
    System.out.println(charsetName);
    BufferedReader in = new BufferedReader(new InputStreamReader(inputStream, charsetName));

3: To detect the encoding of any input text stream, call the overloaded form of detectCodepage:

    charset = detector.detectCodepage(new java.io.BufferedInputStream(inputStream), length);

The number of bytes to examine (length) is specified by the programmer: the more bytes, the more accurate the decision, but also the longer the detection takes. Note that the specified byte count cannot exceed the length of the text stream.
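cpdetector is a third-party library, so as a self-contained illustration of the same idea — inspecting the first few bytes of a stream to guess its encoding — here is a minimal sketch using only JDK classes and reduced to a BOM check (the class and method names are hypothetical; a real project would delegate to a statistical detector like cpdetector when no BOM is present):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamEncodingProbe {

    /**
     * Inspect up to maxBytes from the stream and guess the encoding from a
     * byte-order mark. Returns null when no BOM is found. The stream must
     * support mark/reset (e.g. a BufferedInputStream), so the caller can
     * still read it from the beginning afterwards.
     */
    public static String probe(InputStream in, int maxBytes) throws IOException {
        in.mark(maxBytes);
        byte[] head = new byte[Math.min(maxBytes, 4)];
        int n = in.read(head);
        in.reset();
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: fall back to a statistical detector or a default
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(probe(new ByteArrayInputStream(utf8Bom), 4)); // prints UTF-8
    }
}
```

Because mark/reset restores the stream position, the same stream can be handed to a reader immediately after probing.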

4: A concrete application of determining file encoding: a properties file (.properties) is a common way to store text in a Java program; for example, the Struts framework uses property files to store the program's string resources. Its contents look like this:

    # comment line
    property name = property value

The usual way to read a property file is:
    FileInputStream ios = new FileInputStream("property file name");
    Properties prop = new Properties();
    prop.load(ios);
    String value = prop.getProperty("property name");
    ios.close();
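The steps above can be exercised end to end with a temporary file (the file name and key are illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.Properties;

public class LoadPropertiesDemo {
    public static void main(String[] args) throws IOException {
        // Write a small properties file to a temp location
        File f = File.createTempFile("demo", ".properties");
        try (Writer w = new FileWriter(f)) {
            w.write("# comment line\n");
            w.write("greeting=hello\n");
        }
        // Read it back the standard way
        FileInputStream ios = new FileInputStream(f);
        Properties prop = new Properties();
        prop.load(ios);
        ios.close();
        System.out.println(prop.getProperty("greeting")); // prints hello
        f.delete();
    }
}
```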

Reading a properties file with the load method of java.util.Properties is convenient, but if the file contains Chinese, the text comes out garbled. This happens because load reads the text as a byte stream and, when converting those bytes into strings, decodes them with ISO-8859-1 (Latin-1), a single-byte character set that does not cover Chinese characters.

Method one: use explicit transcoding:

    String value = prop.getProperty("property name");
    String encValue = new String(value.getBytes("ISO-8859-1"), "actual encoding of the property file");
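The round trip works because ISO-8859-1 maps every byte value to a character losslessly, so re-encoding with ISO-8859-1 recovers the original file bytes, which can then be decoded with the file's real encoding. A runnable sketch, assuming the property file was saved as UTF-8:

```java
import java.io.UnsupportedEncodingException;

public class PropertiesEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "属性值";                     // a Chinese property value
        byte[] fileBytes = original.getBytes("UTF-8");  // bytes as stored in a UTF-8 .properties file

        // Properties.load decodes the byte stream as ISO-8859-1:
        String garbled = new String(fileBytes, "ISO-8859-1");
        System.out.println(garbled);                    // mojibake: one char per byte

        // The fix: re-encode with ISO-8859-1 (lossless for any byte),
        // then decode with the file's actual encoding.
        String recovered = new String(garbled.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(recovered.equals(original)); // prints true
    }
}
```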

Method two: if the property file belongs to the project, so that we control its encoding format, we can simply agree on one: for example, if the convention is the Windows default GBK, transcode directly with "GBK"; if the convention is UTF-8, transcode directly with "UTF-8".

Method three: for full flexibility, detect the encoding automatically using the methods described above to determine the property file's encoding, which spares developers the manual work.

Addendum: the set of encodings supported by Java can be obtained with:

    Charset.availableCharsets().keySet();

and the system default encoding with:

    Charset.defaultCharset();

This article is reproduced from http://www.cnblogs.com/java0721/archive/2012/07/21/2602963.html
