Java Character Set notes

Source: Internet
Author: User

1. Overview

This article covers the basics of character encoding, how Java handles characters, system software, URLs, tool software, and related topics.

The examples below use the two-character string "中文" (the word for "Chinese"). Its GB2312 encoding is "d6d0 cec4", its Unicode code points are "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no ISO-8859-1 encoding, but their bytes can still be "represented" using ISO-8859-1.

2. Basic knowledge of coding

Among the earliest encodings was ISO-8859-1, which extends ASCII. To make it easier to represent many different languages, a number of standard encodings gradually emerged. The most important are the following.

2.1. ISO-8859-1

ISO-8859-1 is a single-byte encoding. It can represent at most 256 characters (values 0-255) and is used for English and other Western European languages. For example, the letter 'a' is encoded as 0x61 (97).

Clearly, ISO-8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding and the byte is the computer's most basic unit of storage, ISO-8859-1 is still often used to carry data, and many protocols use it by default. For example, although the two characters "中文" have no ISO-8859-1 encoding, their GB2312 encoding is the two double-byte values "d6d0 cec4". Treated as ISO-8859-1, these split into 4 bytes, "d6 d0 ce c4" (in fact, data is also stored and processed byte by byte). With UTF-8, the result is 6 bytes: "e4 b8 ad e6 96 87". Obviously, a representation like this can only be interpreted correctly if the underlying encoding is known.
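The idea of using ISO-8859-1 purely as a byte carrier can be sketched in Java as follows. This is a minimal illustration, not code from the original article: the class name `Latin1CarrierDemo` and the helper `throughLatin1` are hypothetical, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class Latin1CarrierDemo {

    // Decodes arbitrary bytes as ISO-8859-1 (one char per byte),
    // then encodes them back, which recovers the original bytes.
    static byte[] throughLatin1(byte[] raw) {
        String carrier = new String(raw, StandardCharsets.ISO_8859_1);
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String text = "\u4e2d\u6587";         // the two characters of the running example
        byte[] raw = text.getBytes(gbk);      // d6 d0 ce c4

        String carrier = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(carrier.length());                                  // 4
        System.out.println(Arrays.equals(raw, throughLatin1(raw)));            // true
        System.out.println(new String(throughLatin1(raw), gbk).equals(text));  // true
    }
}
```

The carrier string is meaningless as text, but no information is lost, so the original characters can be recovered as long as the real encoding (GBK here) is known.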

2.2. GB2312/GBK

This is the Chinese national standard (GB, "Guobiao") encoding, designed specifically for Chinese characters. Chinese characters are encoded in two bytes, while English letters are encoded in one byte, exactly as in ISO-8859-1 (strictly speaking, the compatible range is ASCII, 0x00-0x7F). GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters. GBK is backward compatible with GB2312.
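A small Java sketch can make these properties concrete. It is an illustration only: the class `GbkDemo` and its helpers are hypothetical names, it assumes the JDK provides the "GBK" and "GB2312" charsets, and it uses the traditional character U+9F8D as the test character.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
final class GbkDemo {

    // Number of bytes the string occupies in the named charset.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    // Whether the named charset has an encoding for the character.
    static boolean canEncode(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(byteLength("a", "GBK"));       // 1: ASCII letters stay single-byte
        System.out.println(byteLength("\u4e2d", "GBK"));  // 2: a Chinese character is double-byte
        System.out.println(canEncode('\u9f8d', "GBK"));    // traditional character, covered by GBK
        System.out.println(canEncode('\u9f8d', "GB2312")); // not part of the simplified-only GB2312 set
    }
}
```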

2.3. Unicode

Unicode is the unified character set that can represent the characters of all languages. In its fixed-length forms, two bytes per character (UCS-2) or four bytes per character (UCS-4), every character has the same width, including English letters. In that sense it is not byte-compatible with ISO-8859-1 or with any other encoding. However, for characters in the ISO-8859-1 range, the two-byte form simply adds a 0 byte in front, so the letter 'a' becomes "00 61".

Fixed-length encodings are convenient for a computer to process (note that GB2312/GBK is not fixed-length), and Unicode can represent all characters, so a lot of software, Java included, uses Unicode internally. (Java strings use UTF-16, in which characters outside the Basic Multilingual Plane take two 16-bit units.)
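The two-byte layout described above can be observed from Java by encoding to big-endian UTF-16. The following is a minimal sketch; the class `Utf16Demo` and its `hex` helper are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf16Demo {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Latin-1 character gets a leading zero byte.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE)));            // 00 61
        // The running example: code points 4e2d and 6587.
        System.out.println(hex("\u4e2d\u6587".getBytes(StandardCharsets.UTF_16BE))); // 4e 2d 65 87

        // A character beyond U+FFFF occupies two 16-bit units in a Java string.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());                                   // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

The last two lines show why "fixed-length" holds only for characters that fit in 16 bits.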

2.4. UTF-8

Two-byte Unicode is incompatible with ISO-8859-1 and tends to occupy more space, because even English letters need two bytes each. This makes it inconvenient for transmission and storage. UTF-8 was created to address this: it is compatible with ASCII (the 0x00-0x7F range of ISO-8859-1) and can represent characters in all languages. However, UTF-8 is a variable-length encoding. As originally designed, a character takes 1 to 6 bytes; the current standard limits this to 1 to 4 bytes. The byte structure of UTF-8 also provides a simple form of validity checking. Generally speaking, English letters take one byte and Chinese characters take three bytes.

Note that UTF-8 saves space only relative to two-byte Unicode. If the text is known to be Chinese, GB2312/GBK is the most economical choice, at two bytes per Chinese character. On the other hand, although UTF-8 uses 3 bytes per Chinese character, it is usually still more economical than two-byte Unicode even for Chinese-language web pages, because such pages contain many English (ASCII) characters in their markup.
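The size trade-off can be checked with a short Java sketch. The class `SizeDemo`, the helper `size`, and the sample markup string are hypothetical, chosen only to illustrate the comparison; the sketch assumes the JDK ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class SizeDemo {

    // Number of bytes needed to encode the string in the given charset.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");    // assumption: GBK is available in this JDK
        String hanzi = "\u4e2d\u6587";           // two Chinese characters
        String page = "<p>\u4e2d\u6587</p>";     // the same text wrapped in ASCII markup

        // Pure Chinese text: GBK and two-byte Unicode tie, UTF-8 is largest.
        System.out.println(size(hanzi, gbk));                         // 4
        System.out.println(size(hanzi, StandardCharsets.UTF_16BE));   // 4
        System.out.println(size(hanzi, StandardCharsets.UTF_8));      // 6

        // Mixed text: the ASCII markup makes UTF-8 smaller than two-byte Unicode.
        System.out.println(size(page, gbk));                          // 11
        System.out.println(size(page, StandardCharsets.UTF_8));       // 13
        System.out.println(size(page, StandardCharsets.UTF_16BE));    // 18
    }
}
```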

3. Character processing in Java

Many parts of a Java application involve character set encoding. Some need to be configured correctly, and others require explicit conversion.

3.1. getBytes(charset)

This is a standard method of Java's String class. It encodes the characters of the string according to charset and returns the result as a byte array. Note that strings are always stored as Unicode in Java memory. For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "UTF8", the result is "e4 b8 ad e6 96 87". If charset is "ISO8859-1", the result is "3f 3f" (two question marks), because the characters cannot be encoded.
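The three cases above can be reproduced with a few lines of Java. This is a sketch rather than the article's own code: the class `GetBytesDemo` and its helpers are hypothetical, and it resolves the charset with `Charset.forName` and calls `getBytes(Charset)` so that no checked exception has to be handled.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
final class GetBytesDemo {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the string in the named charset and returns the bytes as hex.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // stored in memory as 4e2d 6587
        System.out.println(encodeHex(text, "GBK"));        // d6 d0 ce c4
        System.out.println(encodeHex(text, "UTF-8"));      // e4 b8 ad e6 96 87
        System.out.println(encodeHex(text, "ISO-8859-1")); // 3f 3f (unmappable, replaced by '?')
    }
}
```

Note that the ISO-8859-1 case does not throw an exception; the unencodable characters are silently replaced, which is why this kind of data loss is easy to miss.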

3.2. new String(bytes, charset)

This is another standard String operation. In contrast to getBytes, it decodes a byte array according to charset and converts the result to Unicode for storage. Continuing the getBytes example above, the "GBK" and "UTF8" byte arrays decode back to the correct result "4e2d 6587", but the ISO8859-1 bytes become "003f 003f" (two question marks).

Because UTF-8 can encode all characters, new String(str.getBytes("UTF8"), "UTF8").equals(str) holds; the conversion is completely reversible.
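The encode-then-decode round trip can be sketched as follows. The class `RoundTripDemo` and the helper `roundTrip` are hypothetical names, and the sketch assumes the JDK ships the GBK charset. The reversibility claim applies to well-formed strings; a string containing an unpaired surrogate is an exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class RoundTripDemo {

    // Encodes the string with the charset, then decodes the bytes with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String text = "\u4e2d\u6587";

        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(roundTrip(text, gbk).equals(text));                    // true
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));         // ??

        // Decoding with a charset other than the one used for encoding yields different text:
        // the 6 UTF-8 bytes become 6 unrelated ISO-8859-1 characters.
        String wrong = new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                                       // 6
    }
}
```

Content comparison uses equals rather than ==, since == on Java strings compares object references.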
