Java outputs a UTF-8 encoded file with BOM

Source: Internet
Author: User

When a CSV file is written to an HTTP response with the encoding set to UTF-8, no BOM is included by default.
Excel on Windows, however, relies on the BOM to recognise UTF-8, so the BOM must be written at the very beginning of the file.
Microsoft uses a BOM with UTF-8 because it clearly distinguishes UTF-8 from ASCII.
Without it, the CSV file may appear garbled when opened in Excel.
The sample code is as follows:
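// Requires: java.io.OutputStream, java.io.OutputStreamWriter, java.nio.charset.StandardCharsets
response.setContentType("text/csv");
response.setHeader("Content-Disposition", "attachment; filename=" + fileName);
response.setCharacterEncoding("UTF-8");
// The UTF-8 BOM: the three bytes EF BB BF
byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
OutputStream out = response.getOutputStream(); // or: new FileOutputStream(new File(mainPath));
out.write(utf8Bom); // must come before any other content

// Name the charset explicitly; otherwise the platform default encoding is used
OutputStreamWriter writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);

// Write other content...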
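
The fragment above depends on a servlet response object. As a minimal self-contained sketch of the same idea, the class below writes the BOM and a few CSV rows to any OutputStream. The class name BomCsvWriter and the method writeCsv are made up for this illustration, and the rows are joined with plain commas without any CSV quoting.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class BomCsvWriter {
    // UTF-8 BOM: the three bytes EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Writes the BOM, then one comma-joined line per row (no quoting or escaping). */
    static void writeCsv(OutputStream out, List<String[]> rows) throws IOException {
        out.write(UTF8_BOM); // the BOM must be the first thing in the stream
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        for (String[] row : rows) {
            writer.write(String.join(",", row));
            writer.write("\r\n");
        }
        writer.flush(); // flush only; the caller owns and closes the stream
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeCsv(buffer, List.of(new String[] {"name", "city"}, new String[] {"张三", "北京"}));
        byte[] bytes = buffer.toByteArray();
        // Show the first three bytes of the output in hexadecimal.
        System.out.printf("%02X %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
    }
}
```

In a servlet, response.getOutputStream() could be passed as the out argument; for a file, a FileOutputStream would work the same way.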

What is BOM:
A BOM (byte-order mark) is a special marker placed at the very start of a file encoded in UTF-8, UTF-16, or UTF-32 to identify the file's Unicode encoding. The BOM is not mandatory for UTF-8: its main job is to mark the byte order (big-endian or little-endian) of encodings built from multi-byte units, and UTF-8 has no byte-order ambiguity, so there it serves only as an encoding signature.

BOM file headers:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
EF BB BF = UTF-8
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
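
As a rough sketch of how a program might use these signatures, the following class compares the leading bytes of a buffer against each BOM. The names BomDetector and detect are invented for this example. The UTF-32 little-endian check has to run before the UTF-16 little-endian one, because both begin with FF FE.

```java
// Hypothetical helper for illustration; not part of any library.
final class BomDetector {
    /** Returns the encoding signalled by a leading BOM, or null when no BOM is present. */
    static String detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE"; // must precede UTF-16LE
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, as unsigned values, with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One limitation of this sketch: a UTF-16 little-endian file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32 little-endian BOM and would be reported as UTF-32LE.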

Another thing to note is that UTF-8 web pages should not use a BOM, otherwise errors are common:
Using a BOM on a web page is a mistake. The BOM was not designed for HTML or XML. To identify the text encoding, HTML has the charset attribute and XML has the encoding attribute, so there is no need to bring in a BOM. In theory a BOM could identify an HTML page encoded in UTF-16, but few people do this in practice. After all, UTF-16 spends two bytes even on ASCII characters, which makes it a poor fit for web pages.

Reasons for using BOM in Windows:
A BOM is usually used to mark a plain Unicode character stream, so that a text-processing program reading a .txt file can tell which Unicode encoding it uses (UTF-8, UTF-16BE, or UTF-16LE). Windows handles the BOM well because BOM recognition is built into its text-reading tools and libraries (the original text names CreateFile(), although that call itself only returns a raw file handle). When they open a text file, they identify and remove the BOM automatically. There is a historical reason for this: Windows grew up in a multi-code-page environment. When Unicode was introduced, the Windows designers wanted Unicode and non-Unicode (multi-byte) text files to coexist without users having to care, so they resorted to this small trick.

Text files with a BOM are frequently encountered in Linux/Unix environments:
Text File Parsing:
A text file is a file that humans can read. How is it converted from binary to text? Since the computer was invented in the United States, the first concern was naturally how to represent English. The 26 English letters in both cases, the digits, and the special and control characters add up to 128 characters, which fit in seven bits and therefore in one byte. This is the well-known ASCII code. The mapping is simple: one character corresponds to one byte.

It soon became clear, however, that the scripts of non-English-speaking countries far exceed what ASCII can hold. Each country developed its own encoding, such as China's GB2312, and switching back and forth between them was too troublesome. Unicode was created to unify them by assigning a single code to every character.

Storing raw Unicode codes has two problems. First, many files contain only ASCII, so spending several bytes on every character is wasteful. Second, without some marker a program cannot tell how many bytes should be parsed as one symbol. This is where UTF came to the rescue. UTF is an implementation of Unicode, but a smarter one. UTF-16 uses two or four bytes per character, and UTF-32 always uses four. UTF-8 is a cleverer representation:
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the character's code.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is 0; every following byte begins with 10. The remaining bits hold the character's code.
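
The two rules can be sketched in code. The class below is a hand-written illustration (Utf8Sketch and encode are names chosen for this example); in real programs String.getBytes(StandardCharsets.UTF_8) does this work. Encoding U+FEFF with it yields EF BB BF, which is exactly the UTF-8 BOM.

```java
// Hypothetical helper for illustration; not part of any library.
final class Utf8Sketch {
    /** Encodes one Unicode code point as UTF-8 by applying the leading-bit rules directly. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {
            // Rule 1: one byte, leading bit 0, seven payload bits.
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            return new byte[] {(byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```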
Different encodings place different markers at the start of the text. UTF-16 uses two bytes: FF FE indicates little-endian encoding and FE FF indicates big-endian encoding.
UTF-8 starts with EF BB BF. As the rules above show, UTF-8 is self-describing, so most programs can recognise it without this marker. However, some programs do not handle the marker. PHP, for example, treats it as ordinary text instead of ignoring it.
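
Java's own UTF-8 decoder behaves similarly: it keeps the BOM and returns it as the character U+FEFF at the start of the string. A minimal sketch of one way to drop it when reading is shown below; BomStripper and decodeUtf8 are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class BomStripper {
    /** Decodes UTF-8 bytes and removes a single leading BOM (U+FEFF) if one is present. */
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1); // drop the decoded BOM character
        }
        return text;
    }
}
```

Without this step, the first field of a CSV header read from such a file would carry an invisible extra character, and comparisons such as "id".equals(header) would fail.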
