What is the difference between Unicode, UTF-8, and ISO8859-1?

Source: Internet
Author: User

Note: This article is reproduced on Sina Blog to facilitate knowledge summarization. Address: http://blog.sina.com.cn/s/blog_673c81990100t1lc.html

 

This article covers the following topics: basic encoding knowledge, Java, system software, URLs, tool software, and so on.

In the following description, we use the two-character word 中文 ("Chinese") as the example. Its GB2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no iso8859-1 encoding, although their bytes can still be "carried" in iso8859-1, as explained below.

2. Basic encoding knowledge

Among the earliest encodings were ascii and its single-byte extension iso8859-1. Over time, many standard encodings emerged to represent other languages. The important ones are described below.

2.1. iso8859-1

It is a single-byte encoding, so it can represent at most 256 values, 0-255. It is used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.

Clearly, iso8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding and the byte is the computer's most basic unit of representation, iso8859-1 is still often used as a carrier for data in other encodings, and many protocols use it by default. For example, 中文 has no iso8859-1 encoding. Its gb2312 encoding is the two characters "d6d0 cec4"; treated as iso8859-1, that data is split into 4 single-byte characters: "d6 d0 ce c4" (it is in fact stored byte by byte as well). For the UTF-8 encoding, it is the 6 bytes "e4 b8 ad e6 96 87". Obviously, this representation only makes sense in combination with another encoding that says what the bytes really mean.
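The "carrier" idea can be sketched in Java as follows. This is a minimal illustration, not part of the original article; the class name Latin1Carrier and the method carry are hypothetical. It decodes the UTF-8 bytes of 中文 as ISO-8859-1 and shows that each byte becomes exactly one character with the same numeric value.

```java
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // Decode arbitrary bytes as ISO-8859-1: every byte maps to one char
    // whose code is the unsigned value of that byte.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the example word; escapes avoid source-encoding issues.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        String carried = carry(utf8);
        System.out.println(carried.length()); // 6
        for (char c : carried.toCharArray()) {
            System.out.printf("%02x ", (int) c); // e4 b8 ad e6 96 87
        }
        System.out.println();
    }
}
```

The resulting string is not readable text; it only preserves the bytes so that the real encoding can be applied later.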

2.2. GB2312/GBK

This is the Chinese national standard (guobiao) encoding, designed specifically for Chinese characters. It uses two bytes per Chinese character, while English letters are encoded as a single byte identical to ascii, which is the lower half (0x00-0x7f) of iso8859-1. gbk can represent both traditional and simplified Chinese characters, while gb2312 can represent only simplified Chinese characters. gbk is backward compatible with gb2312.

2.3. unicode

This is the most unified encoding and can represent characters in all languages. In the form discussed here it is a fixed-length double-byte encoding (there is also a four-byte form), and that applies to English letters too. Strictly speaking, Unicode assigns each character a code point; the two-byte form is UCS-2 (UTF-16 extends it with surrogate pairs for characters beyond U+FFFF), and the four-byte form is UCS-4/UTF-32. Its byte sequences are therefore not compatible with iso8859-1 or with any other encoding. However, for characters that exist in iso8859-1, the unicode encoding simply adds a 0 byte in front; for example, the letter a is "00 61".

Note that fixed-length encodings are easy for a computer to process (GB2312/GBK is not fixed-length), and unicode can represent all characters. For these reasons, many software systems use unicode internally, Java among them.
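As a rough check of the two-byte form, the sketch below encodes strings as big-endian UTF-16 and prints the bytes in hex. The class name Utf16Demo and its helper methods are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Format bytes as space-separated two-digit lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode as UTF-16 big-endian (no byte order mark) and return the hex dump.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));            // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587")); // 4e 2d 65 87
    }
}
```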

2.4. UTF

unicode is not compatible with iso8859-1 and requires two bytes even for English letters, which makes it inconvenient to transmit and store. The utf encodings were created to address this. UTF-8, the form meant by "utf" in this article, is compatible with ascii (the 0x00-0x7f half of iso8859-1) and can also represent characters in all languages. However, it is a variable-length encoding: each character takes 1 to 4 bytes (the original design allowed up to 6). In addition, its byte structure provides a simple validity check. Generally, English letters take one byte and Chinese characters take three bytes.
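The variable length is easy to observe in Java. The following sketch (the class name Utf8Length and method utf8Length are hypothetical) counts the UTF-8 bytes of characters from different ranges.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1: ASCII letter
        System.out.println(utf8Length("\u00e9"));       // 2: Latin letter with accent
        System.out.println(utf8Length("\u4e2d"));       // 3: Chinese character
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: character beyond U+FFFF
    }
}
```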

Note: Although utf is designed to use less space, that is only in comparison with unicode. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though utf uses three bytes per Chinese character, a Chinese web page encoded in utf is still smaller than one encoded in unicode, because the page contains many English characters.
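The size trade-off can be illustrated with a small mixed sample. This is only a sketch: the sample string and the names SizeCompare and encodedSize are made up for the illustration, and the GBK line assumes the runtime ships that charset, which is why it is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SizeCompare {
    // Size in bytes of the string under the given charset.
    static int encodedSize(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // 14 ASCII characters of markup and text plus 2 Chinese characters.
        String page = "<p>Hello, \u4e2d\u6587</p>";
        System.out.println(encodedSize(page, StandardCharsets.UTF_8));    // 20
        System.out.println(encodedSize(page, StandardCharsets.UTF_16BE)); // 32
        // GBK is an optional charset; check before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println(encodedSize(page, Charset.forName("GBK")));
        }
    }
}
```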

3. Character handling in Java

Java applications deal with character set encoding in many places. Some require the correct setting, and some require explicit conversion.

3.1. getBytes(charset)

This is a standard Java string method. It encodes the characters of the string according to charset and returns the result as bytes. Note that Java strings are always stored in memory as unicode. For example, 中文 is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
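The three cases can be reproduced with a short program. This is a sketch under one assumption: the GBK charset is optional in some Java runtimes, so the loop skips any charset that is not supported. The class name GetBytesDemo and its helpers are hypothetical.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Format bytes as space-separated two-digit lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the string with the named charset and return the hex dump.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        for (String name : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + encodeHex(word, name));
            }
        }
    }
}
```

With ISO-8859-1, getBytes silently substitutes the replacement byte 0x3f for each character it cannot encode, which is why the original text is lost at this step.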

3.2. new String(bytes, charset)

This is another standard Java string operation. It is the reverse of the previous one: it interprets a byte array according to charset and converts the result to unicode for storage. Continuing the getBytes example, the "gbk" and "utf8" bytes both decode to the correct result "4e2d 6587", but the iso8859-1 bytes become "003f 003f" (two question marks).

Because utf8 can encode all characters, new String(str.getBytes("utf8"), "utf8") equals str; that is, the conversion is fully reversible.
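A quick way to see which charsets are reversible for a given string is to encode and decode it and compare the result. The sketch below does that; RoundTrip and roundTrips are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // True if encoding and then decoding with the same charset restores the string.
    static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        System.out.println(roundTrips(word, StandardCharsets.UTF_8));      // true
        System.out.println(roundTrips(word, StandardCharsets.ISO_8859_1)); // false
    }
}
```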

3.3. setCharacterEncoding()

This method sets the encoding of the http request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, iso8859-1 is used by default and the result needs further processing; see "form input" below. Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc states: "This method must be called prior to reading request parameters or reading input using getReader()". The setting takes effect only for the POST method, not for GET. The likely reason is that the first call to getParameter() makes Java parse all submitted content using the encoding in effect at that moment; later calls do not parse again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the content is in the URL and has already been parsed before the application code runs, so setCharacterEncoding() naturally has no effect.

For a response, it specifies the encoding of the output content. The setting is also passed to the browser so that it knows how to decode the output.
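When a parameter has already been decoded with the iso8859-1 default, the "further processing" usually amounts to turning the string back into its bytes and decoding them with the real encoding. The sketch below simulates that in plain Java without the servlet API; the class ParamRecovery and the method redecode are hypothetical, and it assumes the browser actually submitted UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecovery {
    // Undo an ISO-8859-1 decode: recover the original bytes, then decode
    // them with the encoding the client really used.
    static String redecode(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // Simulate the submitted bytes (assumed to be UTF-8).
        byte[] submitted = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // Simulate what a container hands back when it defaults to ISO-8859-1.
        String asSeen = new String(submitted, StandardCharsets.ISO_8859_1);
        String fixed = redecode(asSeen, StandardCharsets.UTF_8);
        System.out.println(fixed.equals("\u4e2d\u6587")); // true
    }
}
```

This works because decoding as ISO-8859-1 keeps every byte intact; it would not work if the wrong decode had used a charset that drops or replaces bytes.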

 


 
