Java output UTF-8 encoded file with BOM

Source: Internet
Author: User

When a CSV file is written to an HTTP response as UTF-8, no BOM is emitted by default. Windows Excel, however, relies on the BOM to detect UTF-8, so the BOM has to be written at the beginning of the file.


Microsoft uses the BOM in UTF-8 files to distinguish UTF-8 from ASCII and other legacy encodings.
Without it, a CSV file opened in Excel may appear garbled.
Sample code:
// response is assumed to be the servlet's HttpServletResponse; filename is the download name
response.setContentType("text/csv");
response.setHeader("Content-Disposition", "attachment;filename=" + filename);
response.setCharacterEncoding("UTF-8");
byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
OutputStream out = response.getOutputStream();
out.write(utf8Bom); // the BOM must be the very first bytes of the body

OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");

// write other content ...
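
The same idea can be tried outside a servlet container. The following is a minimal sketch, assuming an in-memory stream stands in for the response body; the class name BomCsvWriter and the method writeCsv are hypothetical names chosen for illustration, not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes a UTF-8 BOM followed by CSV text.
class BomCsvWriter {
    // U+FEFF encoded as UTF-8.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes the BOM as raw bytes, then the CSV text encoded as UTF-8.
    static void writeCsv(OutputStream out, String csv) {
        try {
            out.write(UTF8_BOM);
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(csv);
            writer.flush(); // flush only; closing the stream is left to the caller
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A ByteArrayOutputStream stands in for response.getOutputStream().
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeCsv(buffer, "name,city\n\u5F20\u4E09,\u5317\u4EAC\n");
        byte[] bytes = buffer.toByteArray();
        // Show the first three bytes in hex; they should be the BOM.
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
    }
}
```

Writing the BOM as raw bytes before wrapping the stream in a writer keeps it independent of the writer's charset handling.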

What is a BOM:
A BOM (byte order mark) is a special marker inserted at the beginning of a UTF-8, UTF-16, or UTF-32 encoded Unicode file to identify the file's encoding. For UTF-8 the BOM is not required, because the BOM exists to mark the encoding type and byte order (big-endian or little-endian) of multibyte-encoded files.

In UTF-8, the number of bytes in each encoded character is indicated by its first byte, and there is no distinction between big-endian and little-endian; see below.

BOM file headers:
00 00 FE FF = UTF-32, Big-endian
FF FE 00 00 = UTF-32, Little-endian
EF BB BF = UTF-8
FE FF = UTF-16, Big-endian
FF FE = UTF-16, Little-endian
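
As an illustration of how such a table might be used, here is a minimal sketch of BOM sniffing. The class BomDetector and its method detect are hypothetical names; the only real constraint is that the four-byte UTF-32 patterns have to be tested before the two-byte UTF-16 ones, since FF FE is a prefix of FF FE 00 00.

```java
// Hypothetical helper: guesses an encoding from the leading BOM bytes.
class BomDetector {
    // Returns the encoding name implied by the BOM, or null if no BOM is present.
    static String detect(byte[] data) {
        // Longer patterns first: FF FE 00 00 would otherwise match UTF-16LE.
        // (FF FE 00 00 is inherently ambiguous with UTF-16LE text that starts
        // with U+0000; this sketch resolves it in favour of UTF-32LE.)
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // prints: UTF-8
    }
}
```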

Another note: UTF-8 web pages should not use a BOM, otherwise things often go wrong.
Using a BOM on a web page is an error.

The BOM was not designed to support HTML and XML. To declare the text encoding, HTML has the charset attribute and XML has the encoding declaration, so there is no need to drag the BOM in for this purpose. Although in theory a BOM can be used to identify UTF-16 encoded HTML pages, very few people do this in real projects. After all, an encoding like UTF-16, in which even ASCII characters take two bytes, is rarely used for web pages.

Historical reasons why Windows uses the BOM:
Typically, a BOM is used to mark plain-text Unicode byte streams, giving a text-processing program a convenient way to recognize which Unicode encoding (UTF-8, UTF-16BE, UTF-16LE) a .txt file is in when it is read.

Windows handles the BOM relatively well because it integrates Unicode recognition code into the API, mainly CreateFile(). When a text file is opened, the BOM is automatically recognized and stripped.

Windows does this for historical reasons: it grew out of a multi-code-page environment. When Unicode was introduced, the Windows designers wanted to support both Unicode and non-Unicode (multibyte) text files at the same time without the user having to care, and this small trick was the only way to do it.



Text files with a BOM often run into problems in Linux/Unix environments.
A detailed description can be found here:
http://www.zhihu.com/question/20167122

Text file parsing:
A text file holds human-readable text, so how is binary data converted into text? Because the computer was invented in the USA, at first only English was considered: the 26 English letters plus digits and special characters make 128 characters in total, which fit in 7 bits and are stored in one byte. This is the well-known ASCII encoding.

The mapping is very simple: one character corresponds to one byte.

It was soon discovered, however, that non-English-speaking countries need far more characters than ASCII provides. Different countries devised their own encodings (China's GB2312 is one example), and converting back and forth between them was too troublesome. At this point a new encoding, Unicode, appeared to unify them by assigning a Unicode code to every character. Used directly, however, it has two problems:

1. Very many files are pure ASCII, so storing every character at Unicode's full width is too wasteful.

2. There is no flag indicating how many bytes should be parsed into one symbol.

This is when UTF, which saved the world, appeared. UTF is an implementation of Unicode, just smarter. UTF-16 uses two or four bytes per character. UTF-32 always uses four bytes.
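
To make the size trade-off concrete, the following minimal sketch prints how many bytes a few sample characters occupy in UTF-8 and UTF-16. The class EncodedLengths and its methods are hypothetical names; UTF-16BE is used rather than UTF-16 so that Java does not prepend a BOM to the encoded bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: measures encoded sizes of a string.
class EncodedLengths {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";              // U+0041
        String cjk = "\u4E2D";           // U+4E2D
        String emoji = "\uD83D\uDE00";   // U+1F600, a surrogate pair in Java
        System.out.println(utf8Length(ascii) + " " + utf16Length(ascii)); // prints: 1 2
        System.out.println(utf8Length(cjk) + " " + utf16Length(cjk));     // prints: 3 2
        System.out.println(utf8Length(emoji) + " " + utf16Length(emoji)); // prints: 4 4
    }
}
```

ASCII text stays at one byte per character in UTF-8, which is the saving the first point above is concerned with.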

UTF-8 is a very clever representation.

1, for a single-byte symbol, the first bit of the byte is 0, and the remaining 7 bits hold the character's code.

2, for an n-byte symbol, the first n bits of the first byte are set to 1 and bit n+1 is 0; every following byte starts with 10. The remaining bits hold the character's code.

For different encodings, there are different flags at the very front of the text. UTF-16 usually uses two bytes for this, either FF FE or FE FF: FF FE means little-endian encoding and FE FF means big-endian encoding.
UTF-8 begins with EF BB BF.

It can be seen that UTF-8 is self-describing, so most programs can recognize it even when the file lacks this marker.
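
The lead-byte rule can be sketched in a few lines. This is a minimal illustration, assuming only the bit patterns described in the text; the class Utf8LeadByte and the method sequenceLength are hypothetical names, and the sketch does not validate continuation bytes or reject overlong forms.

```java
// Hypothetical helper: reads the sequence length announced by a UTF-8 lead byte.
class Utf8LeadByte {
    // Returns 1 to 4 for a lead byte, or -1 for a continuation or invalid byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;                 // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;    // 0xxxxxxx: single byte (ASCII)
        if ((b & 0xE0) == 0xC0) return 2;    // 110xxxxx: two-byte sequence
        if ((b & 0xF0) == 0xE0) return 3;    // 1110xxxx: three-byte sequence
        if ((b & 0xF8) == 0xF0) return 4;    // 11110xxx: four-byte sequence
        return -1;                           // 10xxxxxx continuation, or invalid
    }

    public static void main(String[] args) {
        // U+4E2D is encoded in UTF-8 as E4 B8 AD.
        System.out.println(sequenceLength((byte) 0xE4)); // prints: 3
        System.out.println(sequenceLength((byte) 0xB8)); // prints: -1
    }
}
```

Because every byte announces its own role, a decoder never needs to know the machine's byte order, which is why the BOM carries no byte-order information in UTF-8.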

However, some programs do not recognize this marker. PHP, for example, treats the marker as ordinary text and parses it instead of ignoring it.
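
Java's built-in UTF-8 decoder behaves in a similar way: the BOM is decoded as the character U+FEFF and left at the start of the text. A minimal sketch of removing it by hand follows; the class BomStripper and the method decodeUtf8 are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8 bytes and drops a leading BOM if present.
class BomStripper {
    static String decodeUtf8(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        // The UTF-8 BOM (EF BB BF) decodes to the single char U+FEFF.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeUtf8(withBom)); // prints: hi
    }
}
```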
