Correctly identifying a file's character set in Java (especially UTF-8 with and without a BOM)

Source: Internet
Author: User

Suppose your project needs to read a TXT file that a user uploaded a few days ago, but you do not know which character set the file uses.

Files saved as UTF-16, UTF-8 (with BOM), or "Unicode" can be told apart by their first two or three bytes, the byte order mark (BOM). Anything else is assumed here to be GBK:

    public String getTxtEncode(FileInputStream in) throws IOException {
        byte[] head = new byte[3];
        in.read(head);
        String code = "GBK"; // fallback when no BOM is found
        if (head[0] == -1 && head[1] == -2)   // FF FE
            code = "UTF-16";
        if (head[0] == -2 && head[1] == -1)   // FE FF
            code = "Unicode";
        // with BOM
        if (head[0] == -17 && head[1] == -69 && head[2] == -65)   // EF BB BF
            code = "UTF-8";
        if ("Unicode".equals(code)) {
            code = "UTF-16";
        }
        return code;
    }
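The negative numbers in these checks are the BOM bytes as seen through Java's signed byte type. The small sketch below only illustrates that conversion; the class name BomBytes and the helper signed are made up for this example and are not part of the original code.

```java
class BomBytes {
    // Hypothetical helper: the signed value Java stores for an unsigned byte 0..255.
    static int signed(int unsigned) {
        return (byte) unsigned;
    }

    public static void main(String[] args) {
        // UTF-8 BOM: EF BB BF
        System.out.println(signed(0xEF) + " " + signed(0xBB) + " " + signed(0xBF)); // -17 -69 -65
        // UTF-16 little-endian BOM: FF FE
        System.out.println(signed(0xFF) + " " + signed(0xFE)); // -1 -2
        // UTF-16 big-endian BOM: FE FF
        System.out.println(signed(0xFE) + " " + signed(0xFF)); // -2 -1
    }
}
```

Writing the comparisons as head[0] == (byte) 0xEF is equivalent and usually easier to read than -17.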

For UTF-8 without a BOM and for GBK, however, the first three bytes are not fixed, so this check cannot tell the two apart.
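One common heuristic for the BOM-less case, which is not the approach taken in this article, is to check whether the whole content decodes as strict UTF-8. The sketch below assumes the file content is already in a byte array; Utf8Guess and looksLikeUtf8 are hypothetical names. Note the limits: pure ASCII passes as UTF-8 (and is also valid GBK), and some GBK byte sequences can happen to form valid UTF-8, so a true result is a strong hint rather than proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Guess {
    // Hypothetical helper: true if the bytes decode as UTF-8 with no malformed sequence.
    static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587 encoded as UTF-8 (E4 B8 AD E6 96 87).
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters encoded as GBK (D6 D0 CE C4), written out by hand.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(looksLikeUtf8(utf8)); // true
        System.out.println(looksLikeUtf8(gbk));  // false
    }
}
```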

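A Google search shows that this is tied to a long-standing Java quirk: the JDK's UTF-8 decoder does not strip a leading BOM, so files with and without a BOM need different handling. The UnicodeInputStream class below is a commonly circulated workaround that detects a BOM and skips it.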
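To see the behavior in question, the following minimal sketch decodes a few bytes that start with the UTF-8 BOM using the JDK's built-in UTF-8 charset. The class name BomNotStripped and the helper decode are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class BomNotStripped {
    // Decodes bytes as UTF-8 using the JDK's built-in charset.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the ASCII text "ok".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        String text = decode(withBom);
        System.out.println(text.length());              // 3, not 2
        System.out.println(text.charAt(0) == '\uFEFF'); // true: the BOM survives as a character
    }
}
```

The BOM ends up in the decoded text as the character U+FEFF, which is why it has to be skipped before the stream is handed to a reader.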

    package com.justsy.sts.utf8;

    import java.io.*;

    /**
     * This InputStream recognizes Unicode BOM marks and skips those bytes if
     * getEncoding() is called before any of the read(...) methods.
     *
     * Usage pattern:
     *     String enc = "ISO-8859-1"; // or null to use the system default
     *     FileInputStream fis = new FileInputStream(file);
     *     UnicodeInputStream uin = new UnicodeInputStream(fis, enc);
     *     enc = uin.getEncoding(); // check and skip possible BOM bytes
     *     InputStreamReader in;
     *     if (enc == null) in = new InputStreamReader(uin);
     *     else in = new InputStreamReader(uin, enc);
     */
    public class UnicodeInputStream extends InputStream {
        PushbackInputStream internalIn;
        boolean isInited = false;
        String defaultEnc;
        String encoding;

        private static final int BOM_SIZE = 4;

        public UnicodeInputStream(InputStream in, String defaultEnc) {
            internalIn = new PushbackInputStream(in, BOM_SIZE);
            this.defaultEnc = defaultEnc;
        }

        public String getDefaultEncoding() {
            return defaultEnc;
        }

        public String getEncoding() {
            if (!isInited) {
                try {
                    init();
                } catch (IOException ex) {
                    IllegalStateException ise = new IllegalStateException(
                            "Init method failed.");
                    // Chain the original IOException (the original passed ise to itself).
                    ise.initCause(ex);
                    throw ise;
                }
            }
            return encoding;
        }

        /**
         * Read ahead four bytes and check for BOM marks. Extra bytes are unread
         * back to the stream; only BOM bytes are skipped.
         */
        protected void init() throws IOException {
            if (isInited)
                return;

            byte bom[] = new byte[BOM_SIZE];
            int n, unread;
            n = internalIn.read(bom, 0, bom.length);

            if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                    && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
                encoding = "UTF-32LE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                    && (bom[2] == (byte) 0xBF)) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else {
                // Unicode BOM mark not found, unread all bytes
                encoding = defaultEnc;
                unread = n;
            }
            // System.out.println("read=" + n + ", unread=" + unread);

            if (unread > 0)
                internalIn.unread(bom, (n - unread), unread);

            isInited = true;
        }

        public void close() throws IOException {
            // init();
            isInited = true;
            internalIn.close();
        }

        public int read() throws IOException {
            // init();
            // Reading before getEncoding() disables BOM detection.
            isInited = true;
            return internalIn.read();
        }
    }

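With this InputStream implementation you can read files both with and without a BOM. If a BOM is present, it is detected and skipped; otherwise the stream falls back to the default encoding you passed in. Note that getEncoding() must be called before the first read: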

    package com.justsy.sts.utf8;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;

    public class UTF8Test {
        public static void main(String[] args) throws IOException {
            File f = new File("D:" + File.separator + "Order.txt");
            String dc = Charset.defaultCharset().name();
            try (FileInputStream in = new FileInputStream(f);
                 UnicodeInputStream uin = new UnicodeInputStream(in, dc)) {
                // Call getEncoding() before reading so that any BOM is detected and skipped.
                String enc = uin.getEncoding();
                BufferedReader br = new BufferedReader(new InputStreamReader(uin, enc));
                String line = br.readLine();
                while (line != null) {
                    System.out.println(line);
                    line = br.readLine();
                }
            }
        }
    }

Combining the two approaches gives a single method that reports the encoding of BOM-marked files and falls back to GBK for everything else:

    public String getTxtEncode(FileInputStream in) throws IOException {
        // Let UnicodeInputStream inspect the BOM; fall back to GBK when none is found.
        // (The original closed the stream and then read from it again, which fails.)
        UnicodeInputStream uin = new UnicodeInputStream(in, "GBK");
        try {
            String code = uin.getEncoding();
            // Report both UTF-16 byte orders as "UTF-16", as the first version did.
            if ("UTF-16LE".equals(code) || "UTF-16BE".equals(code)) {
                code = "UTF-16";
            }
            return code;
        } finally {
            uin.close(); // also closes the underlying FileInputStream
        }
    }
