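A UTF-8 file written by Java can be read back correctly. But if the same content is saved as UTF-8 in Notepad, a program that reads the file gets one extra, invisible character at the start. Why? Notepad writes a byte order mark (BOM, the bytes EF BB BF) at the beginning of the file, and Java's UTF-8 decoder does not strip it.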
Example:
Create a text file, BOM.txt, with the content "Test BOM" and save it as UTF-8 in Notepad.
The UnicodeReader class, which handles the BOM
package com.java.io;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

/**
 * version: 1.1 / 2007-01-25
 * - changed BOM recognition ordering (longer BOMs first)
 *
 * Network address: http://koti.mbnet.fi/akini/java/unicodereader/UnicodeReader.java.txt
 *
 * Original pseudocode   : Thomas Weidenfeller
 * Implementation tweaked: Aki Nieminen
 *
 * http://www.unicode.org/unicode/faq/utf_bom.html
 * BOMs:
 *   00 00 FE FF = UTF-32, big-endian
 *   FF FE 00 00 = UTF-32, little-endian
 *   EF BB BF    = UTF-8
 *   FE FF       = UTF-16, big-endian
 *   FF FE       = UTF-16, little-endian
 *
 * Win2k Notepad:
 *   Unicode format = UTF-16LE
 *
 * Generic Unicode text reader, which uses the BOM mark to identify the
 * encoding to be used. If no BOM is found, a given default or the system
 * encoding is used.
 */
public class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalIn2 = null;
    String defaultEnc;

    private static final int BOM_SIZE = 4;

    /**
     * @param in         input stream to be read
     * @param defaultEnc default encoding if the stream does not have a
     *                   BOM marker. Pass null to use the system-level default.
     */
    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    /**
     * Get the stream encoding, or null if the stream is uninitialized.
     * Call init() or read() to initialize it.
     */
    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are
     * unread back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;
        byte[] bom = new byte[BOM_SIZE];
        int n, unread;
        n = internalIn.read(bom, 0, bom.length);

        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }
        // System.out.println("read=" + n + ", unread=" + unread);

        // Push the non-BOM bytes back so the decoder sees them
        if (unread > 0) internalIn.unread(bom, (n - unread), unread);

        // Use the given encoding
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        init();
        internalIn2.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }
}
Test class
package com.java.io;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class BomRead {

    /**
     * Shows the garbled first character when a UTF-8 file with a BOM is read,
     * first with a plain InputStreamReader and then with UnicodeReader.
     *
     * @param args not used
     */
    public static void main(String[] args) throws Exception {
        // Plain InputStreamReader: the BOM comes through as an extra character
        File file = new File("E:\\JS_Exercise\\JavaExercise\\BOM.txt");
        FileInputStream in = new FileInputStream(file);
        BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String line = null;
        System.out.println("Before processing:");
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();

        // UnicodeReader: the BOM is detected and skipped
        File file2 = new File("E:\\JS_Exercise\\JavaExercise\\BOM.txt");
        FileInputStream in2 = new FileInputStream(file2);
        BufferedReader br2 = new BufferedReader(new UnicodeReader(in2, "UTF-8"));
        String line2 = null;
        System.out.println("After processing:");
        while ((line2 = br2.readLine()) != null) {
            System.out.println(line2);
        }
        br2.close();
    }
}
Output result
Before processing:
? Test BOM
After processing:
Test BOM
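The "?" in the first output is the BOM itself: the three bytes EF BB BF decode to the single character U+FEFF, which the console cannot display. The following is a minimal sketch of that effect, not part of the original article; the class name BomDemo and the helper decodeUtf8 are made up for illustration, and the byte array stands in for a file saved by Notepad.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    /** Hypothetical helper: decodes raw bytes with Java's standard UTF-8 charset. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed file content: the UTF-8 BOM (EF BB BF) followed by "Test BOM"
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                      'T', 'e', 's', 't', ' ', 'B', 'O', 'M'};
        String text = decodeUtf8(raw);

        // The decoded text is one character longer than "Test BOM"
        System.out.println(text.length());                   // prints 9
        // The extra character is U+FEFF, kept by the decoder as ordinary text
        System.out.println((int) text.charAt(0) == 0xFEFF);  // prints true
    }
}
```

This is why the line read through a plain InputStreamReader has one more character than expected, while the same text written without a BOM decodes cleanly.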
Another solution
As things stand, JDK 1.6 only fixes the failure that occurred when reading a file with a BOM; it still cannot tell UTF-8 files with a BOM apart from UTF-8 files without one and treat them differently. The description of Bug ID 4508058 shows that this issue is being closed as one that will not be fixed, so recognizing the BOM is left to the application. The reason, given in another bug report, is that the Unicode rules for the BOM may change. In other words, for a UTF-8 file the application needs to know whether the file was written with a BOM and then decide for itself how to handle it.
Therefore, the application has to treat the BOM as a special case and handle it itself.
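When the application already knows the file is UTF-8, handling the BOM as a special case can be much smaller than UnicodeReader: decode as UTF-8 and drop the first character if it is U+FEFF. The following is a rough sketch under that assumption, not the article's own code; the class name Utf8BomSkipper and the method open are hypothetical, and the byte arrays stand in for real files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {
    /**
     * Hypothetical helper: wraps a stream assumed to be UTF-8 and drops a
     * leading U+FEFF if present. Streams without a BOM are read unchanged.
     */
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);              // remember the start of the text
        if (reader.read() != 0xFEFF) {
            reader.reset();          // first char is real content (or EOF): put it back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'T', 'e', 's', 't'};
        byte[] withoutBom = {'T', 'e', 's', 't'};
        // Both calls print "Test"
        System.out.println(open(new ByteArrayInputStream(withBom)).readLine());
        System.out.println(open(new ByteArrayInputStream(withoutBom)).readLine());
    }
}
```

Unlike UnicodeReader, this sketch does not detect UTF-16 or UTF-32 files; it only covers the case where the encoding is known and the one open question is whether a BOM was written.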
This article is from the "Fengyun beach" blog, please be sure to keep this source http://3950566.blog.51cto.com/3940566/1338200