A few days ago I needed to read a TXT file that a user had uploaded to a project, but I was not sure which character set the file used.
Encodings such as UTF-16 and UTF-8 (with BOM) can be told apart by the first few bytes of the file, the byte order mark (BOM):
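The signed byte values used in the detection code below (-1, -2, -17, -69, -65) are just the BOM bytes read into a Java `byte`, which is signed. A small sketch (the sample string is only for illustration) makes the mapping visible:

```java
public class BomBytes {
    public static void main(String[] args) throws Exception {
        // Java's "UTF-16" charset writes a big-endian BOM (0xFE 0xFF) before
        // the content; read as signed bytes these show up as -2, -1.
        byte[] utf16 = "hello".getBytes("UTF-16");
        System.out.println("UTF-16 head: " + utf16[0] + " " + utf16[1]);

        // getBytes("UTF-8") emits no BOM; the UTF-8 signature EF BB BF is only
        // present when the writing tool added it. As signed bytes it reads:
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        System.out.println("UTF-8 BOM: "
                + utf8Bom[0] + " " + utf8Bom[1] + " " + utf8Bom[2]);
        // prints: UTF-16 head: -2 -1
        //         UTF-8 BOM: -17 -69 -65
    }
}
```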
public String getTxtEncode(FileInputStream in) throws IOException {
    byte[] head = new byte[3];
    in.read(head);
    String code = "GBK"; // default when no BOM is recognized
    if (head[0] == -1 && head[1] == -2)
        code = "UTF-16";  // FF FE: UTF-16 little-endian BOM
    if (head[0] == -2 && head[1] == -1)
        code = "Unicode"; // FE FF: UTF-16 big-endian BOM
    if (head[0] == -17 && head[1] == -69 && head[2] == -65)
        code = "UTF-8";   // EF BB BF: UTF-8 with BOM
    if ("Unicode".equals(code)) {
        code = "UTF-16";
    }
    return code;
}
However, UTF-8 without a BOM and GBK cannot be distinguished this way, because neither has a fixed signature in its first three bytes.
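The original post leaves this gap open, but one common heuristic (not part of the original code) is to attempt a strict UTF-8 decode: multi-byte GBK sequences are almost never well-formed UTF-8, so if the bytes decode cleanly, UTF-8 is a reasonable guess, and otherwise GBK is the fallback. A sketch, where `looksLikeUtf8` is a hypothetical helper:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Probe {
    // Heuristic: returns true if the bytes form valid UTF-8. Chinese text
    // encoded as GBK is almost never valid UTF-8, so a failed strict decode
    // is a strong hint that the file is GBK. It is still a guess, not proof.
    static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(looksLikeUtf8("中文测试".getBytes("UTF-8")));
        System.out.println(looksLikeUtf8("中文测试".getBytes("GBK")));
        // prints: true
        //         false
    }
}
```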
A Google search turns up the root cause: the JDK never strips the BOM when decoding, a long-standing Java bug. A workaround class has circulated along with that bug report:
package com.justsy.sts.utf8;

import java.io.*;

/**
 * This InputStream will recognize Unicode BOM marks and will skip bytes if
 * the getEncoding() method is called before any of the read(...) methods.
 *
 * Usage pattern:
 *   String enc = "ISO-8859-1"; // or null to use the system default
 *   FileInputStream fis = new FileInputStream(file);
 *   UnicodeInputStream uin = new UnicodeInputStream(fis, enc);
 *   enc = uin.getEncoding(); // check and skip possible BOM bytes
 *   InputStreamReader in;
 *   if (enc == null)
 *       in = new InputStreamReader(uin);
 *   else
 *       in = new InputStreamReader(uin, enc);
 */
public class UnicodeInputStream extends InputStream {
    PushbackInputStream internalIn;
    boolean isInited = false;
    String defaultEnc;
    String encoding;

    private static final int BOM_SIZE = 4;

    public UnicodeInputStream(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    public String getEncoding() {
        if (!isInited) {
            try {
                init();
            } catch (IOException ex) {
                IllegalStateException ise =
                        new IllegalStateException("Init method failed.");
                ise.initCause(ex); // was ise.initCause(ise), a typo
                throw ise;
            }
        }
        return encoding;
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are unread
     * back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (isInited)
            return;

        byte bom[] = new byte[BOM_SIZE];
        int n, unread;
        n = internalIn.read(bom, 0, bom.length);

        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        if (unread > 0)
            internalIn.unread(bom, (n - unread), unread);

        isInited = true;
    }

    public void close() throws IOException {
        isInited = true;
        internalIn.close();
    }

    public int read() throws IOException {
        isInited = true;
        return internalIn.read();
    }
}
With this InputStream implementation, a BOM is detected and skipped when present, and files without one fall back to the supplied default encoding, so both cases read correctly.
package com.justsy.sts.utf8;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class UTF8Test {
    public static void main(String[] args) throws IOException {
        File f = new File("D:" + File.separator + "Order.txt");
        FileInputStream in = new FileInputStream(f);
        String dc = Charset.defaultCharset().name();
        UnicodeInputStream uin = new UnicodeInputStream(in, dc);
        // Use the detected encoding, per the class's documented usage pattern.
        BufferedReader br = new BufferedReader(
                new InputStreamReader(uin, uin.getEncoding()));
        String line = br.readLine();
        while (line != null) {
            System.out.println(line);
            line = br.readLine();
        }
        br.close();
    }
}
Combining this workaround with the BOM check, we can identify all of the BOM-carrying character sets, falling back to GBK for files without one:
public String getTxtEncode(FileInputStream in) throws IOException {
    String dc = Charset.defaultCharset().name();
    UnicodeInputStream uin = new UnicodeInputStream(in, dc);
    String enc = uin.getEncoding(); // detects and skips any BOM
    uin.close();
    if ("UTF-8".equals(enc)) {
        return "UTF-8";
    }
    // UnicodeInputStream has already consumed a detected BOM, so check its
    // result rather than re-reading the first bytes from the closed stream.
    String code = "GBK"; // default for files with no BOM
    if ("UTF-16LE".equals(enc) || "UTF-16BE".equals(enc)) {
        code = "UTF-16";
    }
    return code;
}
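To sanity-check the flow end to end, here is a self-contained sketch (the temp file and sample text are made up for illustration) that writes a file the way Windows Notepad saves UTF-8, i.e. BOM plus content, detects the charset from the first three bytes, and reads the content back. Note that the decoded BOM surfaces as a U+FEFF character, which is exactly the JDK behavior discussed above:

```java
import java.io.*;
import java.nio.file.*;

public class DetectDemo {
    public static void main(String[] args) throws IOException {
        // Write a temp file with a UTF-8 BOM followed by Chinese content.
        Path p = Files.createTempFile("order", ".txt");
        try (OutputStream out = Files.newOutputStream(p)) {
            out.write(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            out.write("中文订单".getBytes("UTF-8"));
        }

        // Detect the charset from the first three bytes.
        byte[] head = new byte[3];
        try (InputStream in = Files.newInputStream(p)) {
            in.read(head);
        }
        String code = "GBK";
        if (head[0] == -17 && head[1] == -69 && head[2] == -65) {
            code = "UTF-8";
        }

        // Read with the detected charset. The JDK decodes the BOM bytes into
        // a U+FEFF character instead of stripping them, so skip one char.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(p.toFile()), code))) {
            br.skip(1);
            System.out.println(code + ": " + br.readLine());
        }
        Files.delete(p);
    }
}
```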