Garbled first character when Java reads a UTF-8 file with a BOM

Source: Internet
Author: User

A UTF-8 file written by Java is read back correctly. But if you save the same content as UTF-8 in Notepad, a Java program that reads the file gets one extra, invisible character at the start: the byte order mark (BOM) that Notepad writes.
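The effect can be reproduced without any file. In UTF-8 the BOM is the three bytes EF BB BF, and Java's UTF-8 decoder passes them through as the character U+FEFF instead of dropping them. The following is a minimal sketch of that behaviour; the class name BomDemo and the method decodeWithBom are illustrative names, not part of the original article.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Prepends the UTF-8 BOM to the encoded text, then decodes everything as UTF-8. */
    static String decodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, all, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, all, UTF8_BOM.length, body.length);
        return new String(all, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeWithBom("Test BOM");
        // The decoded string is one char longer than the text that was encoded.
        System.out.println(decoded.length() - "Test BOM".length());
        // The extra char is U+FEFF, which many consoles render as '?'.
        System.out.println(Integer.toHexString(decoded.charAt(0)));
    }
}
```

The same thing happens when the bytes come from a file through an InputStreamReader created with "UTF-8", which is why the first line read from a Notepad file starts with a stray character.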

Example:

Create a text file in Notepad that contains the line Test BOM, and save it as UTF-8.
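If Notepad is not available, an equivalent file can be produced from Java by writing the three BOM bytes before the text. This is only a sketch: the class name BomFileWriter, the method writeWithBom, and the default file name BOM.txt in the working directory are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomFileWriter {

    /** Writes the UTF-8 BOM (EF BB BF) followed by the UTF-8 bytes of text. */
    static void writeWithBom(Path target, String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xEF);
        out.write(0xBB);
        out.write(0xBF);
        out.write(text.getBytes(StandardCharsets.UTF_8));
        Files.write(target, out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass a different one as the first argument.
        Path target = Paths.get(args.length > 0 ? args[0] : "BOM.txt");
        writeWithBom(target, "Test BOM");
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```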

The UnicodeReader class, which detects and skips the BOM

    package com.java.io;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;

    /**
     version: 1.1 / 2007-01-25
     - changed BOM recognition ordering (longer boms first)

     Source: http://koti.mbnet.fi/akini/java/unicodereader/UnicodeReader.java.txt

     Original pseudo code   : Thomas Weidenfeller
     Implementation tweaked : Aki Nieminen

     http://www.unicode.org/unicode/faq/utf_bom.html
     BOMs:
       00 00 FE FF    = UTF-32, big-endian
       FF FE 00 00    = UTF-32, little-endian
       EF BB BF       = UTF-8
       FE FF          = UTF-16, big-endian
       FF FE          = UTF-16, little-endian

     Win2k Notepad:
       Unicode format = UTF-16LE
    ***/

    /**
     * Generic unicode textreader, which will use BOM mark
     * to identify the encoding to be used. If BOM is not found
     * then use a given default or system encoding.
     */
    public class UnicodeReader extends Reader {
        PushbackInputStream internalIn;
        InputStreamReader internalIn2 = null;
        String defaultEnc;

        private static final int BOM_SIZE = 4;

        /**
         * @param in         inputstream to be read
         * @param defaultEnc default encoding if stream does not have
         *                   BOM marker. Give NULL to use system-level default.
         */
        UnicodeReader(InputStream in, String defaultEnc) {
            internalIn = new PushbackInputStream(in, BOM_SIZE);
            this.defaultEnc = defaultEnc;
        }

        public String getDefaultEncoding() {
            return defaultEnc;
        }

        /**
         * Get stream encoding or NULL if stream is uninitialized.
         * Call init() or read() method to initialize it.
         */
        public String getEncoding() {
            if (internalIn2 == null) return null;
            return internalIn2.getEncoding();
        }

        /**
         * Read-ahead four bytes and check for BOM marks. Extra bytes are
         * unread back to the stream, only BOM bytes are skipped.
         */
        protected void init() throws IOException {
            if (internalIn2 != null) return;

            String encoding;
            byte bom[] = new byte[BOM_SIZE];
            int n, unread;
            n = internalIn.read(bom, 0, bom.length);

            if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                    && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
                encoding = "UTF-32LE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                    && (bom[2] == (byte) 0xBF)) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else {
                // Unicode BOM mark not found, unread all bytes
                encoding = defaultEnc;
                unread = n;
            }
            // System.out.println("read=" + n + ", unread=" + unread);

            if (unread > 0) internalIn.unread(bom, (n - unread), unread);

            // Use given encoding
            if (encoding == null) {
                internalIn2 = new InputStreamReader(internalIn);
            } else {
                internalIn2 = new InputStreamReader(internalIn, encoding);
            }
        }

        public void close() throws IOException {
            init();
            internalIn2.close();
        }

        public int read(char[] cbuf, int off, int len) throws IOException {
            init();
            return internalIn2.read(cbuf, off, len);
        }
    }

Test class


    package com.java.io;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class BomRead {

        /**
         * Reads a UTF-8 file with a BOM twice: first with a plain
         * InputStreamReader (garbled first character), then with UnicodeReader.
         *
         * @param args unused
         */
        public static void main(String[] args) throws Exception {
            // Plain UTF-8 reader: the BOM comes through as an extra character.
            File file = new File("E:\\JS_Exercise\\JavaExercise\\BOM.txt");
            FileInputStream in = new FileInputStream(file);
            BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String line = null;
            System.out.println("Before processing:");
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
            br.close();

            // UnicodeReader: the BOM is detected and skipped.
            File file2 = new File("E:\\JS_Exercise\\JavaExercise\\BOM.txt");
            FileInputStream in2 = new FileInputStream(file2);
            BufferedReader br2 = new BufferedReader(new UnicodeReader(in2, "UTF-8"));
            String line2 = null;
            System.out.println("After processing:");
            while ((line2 = br2.readLine()) != null) {
                System.out.println(line2);
            }
            br2.close();
        }
    }

Output result


Before processing:

? Test BOM

After processing:

Test BOM


Another solution

As things stand, Java 1.6 only fixed the failure that occurred when reading files with a BOM; it still does not treat UTF-8 files with a BOM differently from those without one. The description of Bug ID 4508058 shows that this issue was closed as one that will not be fixed, and that recognizing the BOM is left to the application. According to another bug report, the reason is that the Unicode requirements concerning the BOM may change. In other words, for a UTF-8 file the application needs to know whether a BOM was written, and then decide for itself how to handle it.

The application therefore has to treat the BOM as a special case itself.
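When the input is known to be UTF-8, one simple way for an application to do this is to decode normally and drop a leading U+FEFF if there is one. The sketch below uses mark and reset on a BufferedReader for that; the class name Utf8BomSkipper and the method openUtf8 are hypothetical, and unlike UnicodeReader this approach does not detect UTF-16 or UTF-32.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {

    /** Opens a UTF-8 reader on the stream and consumes a leading U+FEFF if present. */
    static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the stream
        if (reader.read() != 0xFEFF) {  // first char is not a BOM, or the stream is empty
            reader.reset();             // so rewind to the marked position
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'T', 'e', 's', 't'};
        byte[] withoutBom = {'T', 'e', 's', 't'};
        // Both calls yield the same line, with or without the BOM.
        System.out.println(openUtf8(new ByteArrayInputStream(withBom)).readLine());
        System.out.println(openUtf8(new ByteArrayInputStream(withoutBom)).readLine());
    }
}
```

Because the check happens after decoding, the same reader works for files with and without a BOM, so the caller does not need to know in advance how the file was saved.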


This article is from the "Fengyun beach" blog; please keep this source link when reposting: http://3950566.blog.51cto.com/3940566/1338200
