Java reads a UTF-8 TXT file and the first line starts with a garbled question mark "?": causes and solutions for garbled text when Java reads a UTF-8 file with a BOM

Source: Internet
Author: User
Tags bug id

Java reads a UTF-8 TXT file and the first line shows a garbled "?": the problem and its solutions

Content of the test.txt file:
A medium
2 countries
3
4
5
6

 

test.txt was saved in UTF-8 format using WordPad.
After saving and closing the file, reopening the UTF-8 document in WordPad displays the Chinese characters and the letters normally.

 

Test code:

 

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class ReadTxtFile {
        public static void main(String[] args) {
            try {
                String charsetName = "UTF-8";
                String path = "D:/to_delete/test.txt";
                File file = new File(path);
                if (file.isFile() && file.exists()) {
                    // Decode the file's bytes as UTF-8
                    InputStreamReader insReader = new InputStreamReader(
                            new FileInputStream(file), charsetName);
                    BufferedReader bufReader = new BufferedReader(insReader);
                    String line;
                    // Print the file line by line
                    while ((line = bufReader.readLine()) != null) {
                        System.out.println(line);
                    }
                    bufReader.close();
                    insReader.close();
                }
            } catch (Exception e) {
                System.out.println("An error occurred while reading the file content");
                e.printStackTrace();
            }
        }
    }

 

 

Program Execution result:
? A medium
2 countries
3
4
5
6

 

My solution:

Use UltraEdit to save the TXT file above in "UTF-8 without BOM" format;

Or

Open the TXT file above in Notepad++, use the "Format" menu to convert it to UTF-8 without BOM, and then save the file.
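
The same fix can also be applied in code rather than in an editor. The following is a minimal sketch and not part of the original article: the class name BomStripper and its methods are hypothetical, and the default path is simply the one used by the test program above. It assumes the BOM is the three bytes EF BB BF at the very start of the file and rewrites the file only when those bytes are present.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not from the article): removes a leading UTF-8 BOM.
class BomStripper {
    /** Returns the data without a leading EF BB BF, or the same array if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    /** Rewrites the file in place only when a BOM was found; returns true if it was rewritten. */
    static boolean stripUtf8Bom(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripUtf8Bom(original);
        if (stripped.length != original.length) {
            Files.write(file, stripped);
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path, taken from the test program; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "D:/to_delete/test.txt");
        System.out.println(stripUtf8Bom(file) ? "BOM removed" : "No BOM found");
    }
}
```

Because the whole file is read into memory, this sketch is only suitable for small text files like the test file here.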

 

 

 

There is a very good article online that discusses the causes of this problem and how to solve it: "Causes and solutions for garbled text when Java reads a UTF-8 file with a BOM".

 

URL: http://daimojingdeyu.javaeye.com/blog/397661

 

Keywords: Java reading UTF-8, Java writing UTF-8, encoding, UTF-8 garbled

Recently, while processing files, I found that files of the same type may use different encodings. I therefore wanted to unify the file encoding and, because UTF-8 is so widely applicable, decided to standardize on UTF-8. The first problem I ran into was how to determine the encoding of an existing file.

I searched online and found some good articles, which I will not repost here:
File encoding highlights
Principle of character string encoding (charset, encoding, decoding)
Java encoding analysis
Methods for determining the encoding of a file or text stream
Taken together, the articles above can be read as a "from beginner to proficient" guide to encoding.

If you have read the articles above, you will understand that in Java, class files use UTF-8 encoding, while the JVM uses UTF-16 at run time. Java strings are always Unicode and are encoded as UTF-16.
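
To make the UTF-16 point concrete, here is a small illustrative sketch (the class name Utf16Demo is hypothetical and not from the article). It compares the number of UTF-16 code units that a String reports with the number of bytes the same text needs in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: Utf16Demo is a hypothetical name, not code from the article.
class Utf16Demo {
    /** Length in UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of bytes the same text occupies once encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String cjk = "\u4E2D";         // U+4E2D, inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(utf16Units(cjk) + " unit(s), " + utf8Bytes(cjk) + " UTF-8 byte(s)");
        System.out.println(utf16Units(emoji) + " unit(s), " + utf8Bytes(emoji) + " UTF-8 byte(s)");
    }
}
```

The first string takes one code unit and three UTF-8 bytes; the second takes two code units and four UTF-8 bytes, which is why byte counts on disk and character counts in memory should not be confused.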

I wanted to test Java's ability to read and write UTF-8 files and ran into a frustrating problem: a UTF-8 file written by Java can be read back correctly by Java, but if the same content is saved as UTF-8 with Notepad, the program reads one extra invisible character from the file.
The test code is as follows:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class Utf8Test {
        public static void main(String[] args) throws IOException {
            File f = new File("./utf.txt");
            FileInputStream in = new FileInputStream(f);
            // Specify UTF-8 as the encoding when reading the file
            BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String line = br.readLine();
            while (line != null) {
                System.out.println(line);
                line = br.readLine();
            }
            br.close();
        }
    }

utf.txt was created in Notepad, and UTF-8 was selected as the encoding when the file was saved. Its content is:
This is the first line.
This is second line.

The expected result is that the text of utf.txt is printed unchanged. The actual output, however, is:
?This is the first line.
This is second line.

The first line has an extra question mark.
The articles above should bring to mind how Java handles the BOM (byte order mark) when reading. A UTF-8 file may begin with the three bytes "EF BB BF" to mark the file as UTF-8 encoded; these three bytes are optional.
The problem above is most likely caused by those first three bytes being read as content. At first I did not quite believe this was a JDK bug, but the problem persisted after many tests, so I searched online again and found the following bug:
Bug ID: 4508058
On some pages that I have since closed, I remember reading that this bug exists only in JDK 1.5 and earlier and was fixed in 1.6. From what I can see now, 1.6 only fixed the failure that occurred when reading a file with a BOM; it still does not treat UTF-8 files with a BOM differently from those without one. According to the description of Bug ID 4508058, the issue will be closed as one that will not be fixed, and recognizing the BOM is left to the application. The reason can be found in another bug report: the Unicode requirements regarding the BOM may change. In other words, for a UTF-8 file, the application needs to know whether the file was written with a BOM and then decide how to handle it.

Inside the while loop of the Utf8Test program, you can add the following code to inspect the content that was read:

    byte[] allBytes = line.getBytes("UTF-8");
    for (int i = 0; i < allBytes.length; i++) {
        // A byte needs only two hex digits. Masking with 0xFF removes the sign
        // extension, and %02X pads values below 0x10 to two uppercase digits.
        String hexString = String.format("%02X", allBytes[i] & 0xFF);
        System.out.print(hexString);
        System.out.print(" ");
    }
    // End the hex line so that the text itself is printed on its own line
    System.out.println();

The output is as follows:
EF BB BF 54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 6C 69 6E 65 2E
?This is the first line.
54 68 69 73 20 69 73 20 73 65 63 6F 6E 64 20 6C 69 6E 65 2E
This is second line.

The leading "EF BB BF" is the BOM of the UTF-8 file. This shows that Java does not handle the BOM of a UTF-8 file when reading it: the first three bytes are treated as text content.
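
Why the BOM shows up as "?" specifically can be reproduced in isolation. The sketch below is an illustration under assumptions, not code from the article: the class name BomDecodeDemo is hypothetical, and US-ASCII stands in for whatever console encoding was in use, on the assumption that it could not represent the BOM character. Java's UTF-8 decoder turns the three bytes into the single character U+FEFF, and encoding that character into a charset that lacks it produces the replacement byte "?".

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: BomDecodeDemo is a hypothetical name, not code from the article.
class BomDecodeDemo {
    /** Decodes the three BOM bytes as UTF-8; the decoder keeps the BOM as a character. */
    static String decodeBom() {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        return new String(bom, StandardCharsets.UTF_8);
    }

    /** Encodes text in US-ASCII, used here as a stand-in for a console charset without U+FEFF. */
    static byte[] encodeForConsole(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String decoded = decodeBom();
        System.out.printf("chars=%d, first=U+%04X%n", decoded.length(), (int) decoded.charAt(0));
        byte[] encoded = encodeForConsole(decoded);
        System.out.printf("console byte=0x%02X ('%c')%n", encoded[0], (char) encoded[0]);
    }
}
```

On a console whose encoding can represent U+FEFF, the character would normally be invisible instead, which matches the "invisible character" described earlier.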

The code provided at the following link can solve the garbled-text problem:
http://koti.mbnet.fi/akini/java/unicodereader/
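
If pulling in external code is not an option, the BOM can also be skipped by hand while reading. The following is a minimal sketch under stated assumptions and is not the code from the link above: the class name BomSkippingReader and its methods are hypothetical, and it handles only UTF-8. It relies on the BOM being decoded as the single character U+FEFF, peeks at the first character with mark and reset, and discards it only if it is the BOM.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not the linked UnicodeReader): skips a leading UTF-8 BOM.
class BomSkippingReader {
    /** Opens a UTF-8 reader and drops the first character if it is U+FEFF. */
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the stream
        if (reader.read() != 0xFEFF) {  // first char is not a BOM (or the stream is empty)
            reader.reset();             // so put it back
        }
        return reader;
    }

    /** Convenience for in-memory data: returns all lines with any BOM removed. */
    static List<String> readLines(byte[] utf8Bytes) {
        try (BufferedReader reader = open(new ByteArrayInputStream(utf8Bytes))) {
            List<String> lines = new ArrayList<>();
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same file as in the test program; works whether or not it starts with a BOM.
        try (BufferedReader reader = open(new FileInputStream("./utf.txt"))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                System.out.println(line);
            }
        }
    }
}
```

This approach leaves the file on disk untouched, so it suits cases where the application cannot control how the file was saved.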
