Part I: Encoding Basics
Why encoding is needed: a computer stores only binary numbers, so an encoding is required to represent the wide variety of human-readable characters in that form.
One, basic concepts
ASCII, Unicode, Big5, GBK and so on are character sets. A character set defines only which characters it contains and which number is assigned to each of them.
UTF-8 and UTF-16 define how the characters of the Unicode character set are turned into bytes so that they can be transmitted and stored by a computer.
For example, the Unicode character U+00A9 = 1010 1001 (the copyright symbol) is encoded in UTF-8 as:
11000010 10101001 = 0xC2 0xA9
In practice the two notions do not have to be distinguished strictly. Some character sets, such as ASCII, come with their own encoding, and in every case the purpose is the same: to specify how a character is represented by a binary number.
For that reason many sources refer to UTF-8, UTF-16, ASCII, GBK and the others collectively as character encodings.
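The copyright-symbol example can be checked directly in Java. The following is a minimal sketch; the class name CopyrightUtf8 and the helper hex are illustrative names, not part of the original article.

```java
import java.nio.charset.StandardCharsets;

public class CopyrightUtf8 {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the character with code point U+00A9

        // Same character, two encodings: the number is fixed by the
        // character set, the bytes depend on the encoding.
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));      // C2 A9
        System.out.println(hex(copyright.getBytes(StandardCharsets.ISO_8859_1))); // A9
    }
}
```

The character number stays 0xA9 in both cases; only the byte representation differs.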
Two, common encodings
People who speak different languages need translation to communicate, and the same is true between human text and the computer. Computing has many such translation schemes, for example ASCII, ISO-8859-1, GB2312, GBK, UTF-8 and UTF-16. Each can be thought of as a dictionary that prescribes conversion rules, and by following those rules the computer can represent our characters correctly. Several of these formats (GB2312, GBK, UTF-8, UTF-16) can represent a Chinese character, so which one should be chosen to store Chinese text? That depends on other factors, such as whether storage space or encoding efficiency matters more. The formats are briefly introduced below.
ASCII Code
Anyone who has studied computing knows ASCII. It defines 128 codes in total, represented by the low 7 bits of a byte. 0~31 and 127 are control characters such as line feed, carriage return and delete; 32~126 are printable characters that can be typed on a keyboard and displayed.
ISO-8859-1
128 characters are obviously not enough, so the ISO organization defined a series of standards that extend ASCII, the ISO-8859 family (ISO-8859-1 through ISO-8859-16). ISO-8859-1 covers most Western European languages and is the most widely used of them. It is still a single-byte encoding and can represent 256 characters in total.
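The single-byte limit is easy to observe from Java. This is a small sketch (the class name Latin1Demo is made up for illustration): a Western European character fits in one byte, while a Chinese character has no ISO-8859-1 code and String.getBytes silently substitutes '?' (0x3F).

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {

    // Encodes the text as ISO-8859-1; unmappable characters become '?'.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] cafe = toLatin1("caf\u00E9");   // "café": every character fits in one byte
        byte[] zhong = toLatin1("\u4E2D");     // U+4E2D is outside ISO-8859-1

        System.out.println(cafe.length);                       // 4
        System.out.println(Integer.toHexString(cafe[3] & 0xFF)); // e9
        System.out.println((char) zhong[0]);                   // ?
    }
}
```

The substitution is lossy: once the character has been replaced by '?', the original text cannot be recovered from the bytes.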
GB2312
Its full name is "Chinese Coded Character Set for Information Interchange, Basic Set". It is a double-byte encoding whose overall range is A1-F7: A1-A9 is the symbol area with 682 symbols, and B0-F7 is the Chinese character area with 6,763 Chinese characters.
GBK
Its full name is "Chinese Character Internal Code Extension Specification". It was issued by the State Bureau of Technical Supervision as a new Chinese character internal code specification for Windows 95, with the aim of extending GB2312 with more Chinese characters. Its encoding range is 8140~FEFE (excluding xx7F), 23,940 code positions in total, and it can represent 21,003 Chinese characters. It is compatible with GB2312: Chinese text encoded with GB2312 can be decoded with GBK without producing garbled characters.
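The compatibility claim can be tried out with a short sketch. It assumes the running JDK ships the GB2312 and GBK charsets (they belong to the extended charsets and may be absent from a trimmed runtime); the class and method names are illustrative only.

```java
import java.nio.charset.Charset;

public class GbkCompat {

    // Encodes text with GB2312, then decodes the same bytes with GBK.
    static String encodeGb2312DecodeGbk(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // the two characters of "中文"
        String roundTrip = encodeGb2312DecodeGbk(original);

        // GBK is a superset of GB2312, so the text survives unchanged.
        System.out.println(original.equals(roundTrip)); // true
    }
}
```

The reverse direction is not guaranteed: a character that exists only in GBK has no GB2312 code and would be replaced during encoding.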
GB18030
Its full name is "Chinese Coded Character Set for Information Interchange". It is a mandatory national standard in China. A character may be encoded in one, two or four bytes, and the encoding is compatible with GB2312. Although it is the national standard, it is not widely used in actual application systems.
UTF-16
Any discussion of UTF has to mention Unicode (Universal Code). ISO set out to create a new super-language dictionary into which every language in the world can be translated. It is easy to imagine how complex this dictionary is; for the detailed specification, refer to the Unicode documentation. Unicode is the basis of Java and XML, and the following describes how Unicode is stored in a computer.
UTF-16 defines how Unicode characters are stored as 16-bit units, hence the name. Every character in the Basic Multilingual Plane (U+0000 to U+FFFF) takes exactly two bytes; characters beyond it take four bytes, as a pair of 16-bit surrogates. Because the vast majority of commonly used characters occupy a single 16-bit unit, string operations stay simple, and this is an important reason why Java uses UTF-16 as its in-memory character format (a Java char is one UTF-16 code unit).
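The difference between a two-byte BMP character and a four-byte surrogate pair can be seen with a few standard calls. This is a minimal sketch; the class name Utf16Units and the helper utf16Bytes are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Units {

    // Number of bytes the code point occupies in UTF-16 (big-endian, no BOM).
    static int utf16Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf16Bytes(0x4E2D));  // 2: a BMP character
        System.out.println(utf16Bytes(0x1F600)); // 4: outside the BMP, a surrogate pair

        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                            // 2 char units
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1 character
    }
}
```

This is why String.length() counts UTF-16 code units rather than characters, and why code that must handle supplementary characters works with code points instead.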
UTF-8
UTF-16 uses 2 bytes (or 4 bytes) per character, which makes it incompatible with the earlier and heavily used ASCII encoding. Its byte strings can also contain bytes such as '\0' or '/', which have special meanings in file names and other C library function parameters on UNIX systems. In addition, the most commonly used characters (Western European characters) need only one byte, so storing them in UTF-16 wastes bandwidth or storage space. UTF-8 avoids these problems with a variable-length encoding.
Three, UTF-8 in detail
First of all, UCS and Unicode are just code tables that assign integers to characters. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious ones store Unicode text as a sequence of 2-byte or 4-byte units; the official names of these encodings are UCS-2 and UCS-4 respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). Converting an ASCII or Latin-1 file to UCS-2 simply means inserting a 0x00 byte before every byte. Converting to UCS-4 means inserting three 0x00 bytes before every byte.
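The "insert 0x00 before every byte" rule can be sketched as follows. Java has no charset called UCS-2, so the sketch compares against UTF-16BE, which produces the same bytes for BMP characters; the class name AsciiToUcs2 and the method widen are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiToUcs2 {

    // Converts ASCII or Latin-1 bytes to big-endian UCS-2
    // by inserting a 0x00 byte before every input byte.
    static byte[] widen(byte[] latin1) {
        byte[] out = new byte[latin1.length * 2];
        for (int i = 0; i < latin1.length; i++) {
            out[2 * i] = 0x00;          // high byte is always zero
            out[2 * i + 1] = latin1[i]; // low byte is the original byte
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] ascii = "Hi".getBytes(StandardCharsets.US_ASCII);
        byte[] widened = widen(ascii);
        byte[] reference = "Hi".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(Arrays.toString(widened));          // [0, 72, 0, 105]
        System.out.println(Arrays.equals(widened, reference)); // true
    }
}
```

The zero bytes in the result are exactly what makes such text unusable with C functions that treat '\0' as a string terminator.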
Using UCS-2 (or UCS-4) under Unix causes very serious problems. Strings in these encodings can contain bytes such as '\0' or '/', which have special meanings in file names and other C library function parameters. In addition, most UNIX tools that expect ASCII files cannot read 16-bit characters without significant modification. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables and similar places.
The UTF-8 encoding defined in ISO 10646-1 Annex R and RFC 2279 does not have these problems. It is clearly the way to use Unicode on Unix-style operating systems.
UTF-8 has the following features:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is identical in ASCII and in UTF-8.
All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore no ASCII byte (0x00-0x7F) can appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character occupies. All further bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, makes the encoding stateless, and makes it robust against missing bytes.
All 2^31 possible UCS codes can be encoded.
UTF-8 encoded characters can theoretically be up to 6 bytes long, whereas 16-bit BMP characters are at most 3 bytes long. (The later RFC 3629 restricts UTF-8 to U+0000 through U+10FFFF, so that a character takes at most 4 bytes.)
The sort order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
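The resynchronization property follows from the byte ranges listed above: a byte of the form 10xxxxxx (0x80 to 0xBF) is always a continuation byte, so a reader dropped into the middle of a stream only has to skip such bytes to reach the start of the next character. A minimal sketch, with illustrative class and method names:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Resync {

    // True for bytes of the form 10xxxxxx, i.e. 0x80..0xBF.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the index of the first character boundary at or after pos.
    static int nextCharStart(byte[] utf8, int pos) {
        while (pos < utf8.length && isContinuation(utf8[pos])) {
            pos++;
        }
        return pos;
    }

    // Counts characters by counting every byte that is not a continuation byte.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (!isContinuation(b)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Bytes: 61 5a e4 b8 ad e6 96 87
        byte[] utf8 = "aZ\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);

        System.out.println(nextCharStart(utf8, 3)); // 5: skips b8 and ad
        System.out.println(countCodePoints(utf8));  // 4
    }
}
```

The sketch assumes well-formed input; it does not validate that each lead byte is followed by the right number of continuation bytes.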
The following byte sequences are used to represent a character. Which sequence is used depends on the code number of the character in Unicode.
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character's code number in binary representation, with the rightmost x being the least significant bit. Only the shortest multibyte sequence that can represent the code number may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
For example, the Unicode character U+00A9 = 1010 1001 (the copyright symbol) is encoded in UTF-8 as:
11000010 10101001 = 0xC2 0xA9
The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
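The bit patterns of the table translate into a few shifts and masks. The following sketch covers only the first four rows (up to four bytes) and does not reject surrogates or out-of-range values; the class name Utf8Manual and the method encode are illustrative, not a library API.

```java
import java.util.Arrays;

public class Utf8Manual {

    // Encodes one code point by hand, following the first four table rows.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        byte[] copyright = encode(0x00A9);
        byte[] notEqual = encode(0x2260);

        System.out.println(Arrays.equals(copyright, new byte[] { (byte) 0xC2, (byte) 0xA9 }));              // true
        System.out.println(Arrays.equals(notEqual, new byte[] { (byte) 0xE2, (byte) 0x89, (byte) 0xA0 })); // true
    }
}
```

In real code String.getBytes(StandardCharsets.UTF_8) does this work; the manual version is only meant to make the table concrete.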
The official name of this encoding is spelled UTF-8, where UTF stands for UCS Transformation Format. Do not write it in any other way in documentation, such as utf8 or UTF_8, unless you are referring to a variable name rather than to the encoding itself.
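Part Two: Java encoding
One, encoding in String
1. The JVM's default encoding
The default encoding depends on the platform and on how the JVM is started (since Java 18 it is UTF-8 unless overridden). In the environment used for this article it is UTF-8:
System.out.println(Charset.defaultCharset());
Output: UTF-8
2. Get the bytes of a String in a given encoding
You can get the encoded bytes of a string with public byte[] getBytes(String charsetName) throws UnsupportedEncodingException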
Get the bytes of a string in various encodings:
String s = "aZ中文";
// 1. Default encoding (UTF-8 in this environment)
byte[] bytesDefault = s.getBytes();
System.out.println(getHexString(bytesDefault));
// 2. Also the default encoding, passed explicitly
byte[] bytesDefault2 = s.getBytes(Charset.defaultCharset());
System.out.println(getHexString(bytesDefault2));
// 3. UTF-8 encoding
byte[] bytesUTF8 = s.getBytes("UTF-8");
System.out.println(getHexString(bytesUTF8));
// 4. UTF-16 encoding
byte[] bytesUTF16 = s.getBytes("UTF-16");
System.out.println(getHexString(bytesUTF16));
// 5. "Unicode" encoding
byte[] bytesUnicode = s.getBytes("Unicode");
System.out.println(getHexString(bytesUnicode));
// 6. GBK encoding
byte[] bytesGBK = s.getBytes("GBK");
System.out.println(getHexString(bytesGBK));
Output:
615ae4b8ade69687
615ae4b8ade69687
615ae4b8ade69687
feff0061005a4e2d6587
feff0061005a4e2d6587
615ad6d0cec4
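As you can see, the first three results are identical because the default encoding here is UTF-8, and Java treats the charset name "Unicode" the same as UTF-16: both produce big-endian bytes prefixed with the byte order mark FE FF.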
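The leading feff in the UTF-16 output is a byte order mark (BOM), which Java's "UTF-16" charset writes when encoding. The explicit-endian variants UTF-16BE and UTF-16LE write none. A small sketch showing the three side by side (the class name BomDemo and the helper hex are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {

    // Formats bytes as a contiguous lower-case hex string.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "aZ";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // feff0061005a
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 0061005a
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 61005a00
    }
}
```

When decoding with "UTF-16", Java uses the BOM to pick the byte order and falls back to big-endian if there is none.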
3. Build a String from encoded bytes
public String(byte[] bytes, String charsetName) throws UnsupportedEncodingException
Bytes must be decoded with the same encoding that was used to produce them:
// 1. Default encoding
String sDefault = new String(bytesDefault);
System.out.println(sDefault);
// 2. Default encoding, passed explicitly
String sDefault2 = new String(bytesDefault2, Charset.defaultCharset());
System.out.println(sDefault2);
// 3. UTF-8 encoding
String sUTF8 = new String(bytesUTF8, "UTF8");
System.out.println(sUTF8);
// 4. UTF-16 encoding
String sUTF16 = new String(bytesUTF16, "UTF16");
System.out.println(sUTF16);
// 5. "Unicode" encoding
String sUnicode = new String(bytesUnicode, "Unicode");
System.out.println(sUnicode);
// 6. GBK encoding
String sGBK = new String(bytesGBK, "GBK");
System.out.println(sGBK);
Output:
aZ中文
aZ中文
aZ中文
aZ中文
aZ中文
aZ中文
If the bytes are decoded with a different encoding, the result is garbled:
// 1. Decode the UTF-8 bytes as UTF-16
String sUTF16Error = new String(bytesDefault, "UTF16");
System.out.println(sUTF16Error);
// 2. Decode the UTF-8 bytes as "Unicode"
String sUnicodeError = new String(bytesDefault, "Unicode");
System.out.println(sUnicodeError);
// 3. Decode the UTF-8 bytes as GBK
String sGBKError = new String(bytesDefault, "GBK");
System.out.println(sGBKError);
Output:
Ashamed ? 隇
Ashamed ? 隇
AZ Juan PO
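The String constructor never complains about a wrong encoding; it silently substitutes or misreads characters. When a mismatch should be detected instead, a CharsetDecoder configured to report errors can be used. This is a minimal sketch with illustrative names; the byte values D6 D0 CE C4 are the GBK bytes of the sample text shown in the hex output earlier.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {

    // Returns true if the bytes form a valid sequence in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbkBytes = { (byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4 };
        byte[] utf8Bytes = "aZ\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodesCleanly(gbkBytes, StandardCharsets.UTF_8));  // false
        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.UTF_8)); // true
    }
}
```

A clean decode only shows that the bytes are valid in that charset, not that the charset is the right one: many byte sequences are valid in several encodings and still come out as the wrong text.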
The complete code and its output:
package org.ljh.javademo.encode;

import java.io.IOException;
import java.nio.charset.Charset;

public class DefaultEncode {

    public static void main(String[] args) throws IOException {
        // One: get the default encoding of the current JVM environment
        System.out.println(Charset.defaultCharset());

        // Two: get the bytes of a string in various encodings
        String s = "aZ中文";
        // 1. Default encoding (UTF-8 in this environment)
        byte[] bytesDefault = s.getBytes();
        System.out.println(getHexString(bytesDefault));
        // 2. Also the default encoding, passed explicitly
        byte[] bytesDefault2 = s.getBytes(Charset.defaultCharset());
        System.out.println(getHexString(bytesDefault2));
        // 3. UTF-8 encoding
        byte[] bytesUTF8 = s.getBytes("UTF-8");
        System.out.println(getHexString(bytesUTF8));
        // 4. UTF-16 encoding
        byte[] bytesUTF16 = s.getBytes("UTF-16");
        System.out.println(getHexString(bytesUTF16));
        // 5. "Unicode" encoding
        byte[] bytesUnicode = s.getBytes("Unicode");
        System.out.println(getHexString(bytesUnicode));
        // 6. GBK encoding
        byte[] bytesGBK = s.getBytes("GBK");
        System.out.println(getHexString(bytesGBK));

        // Three: build a String from bytes, decoding with the encoding that produced them
        // 1. Default encoding
        String sDefault = new String(bytesDefault);
        System.out.println(sDefault);
        // 2. Default encoding, passed explicitly
        String sDefault2 = new String(bytesDefault2, Charset.defaultCharset());
        System.out.println(sDefault2);
        // 3. UTF-8 encoding
        String sUTF8 = new String(bytesUTF8, "UTF8");
        System.out.println(sUTF8);
        // 4. UTF-16 encoding
        String sUTF16 = new String(bytesUTF16, "UTF16");
        System.out.println(sUTF16);
        // 5. "Unicode" encoding
        String sUnicode = new String(bytesUnicode, "Unicode");
        System.out.println(sUnicode);
        // 6. GBK encoding
        String sGBK = new String(bytesGBK, "GBK");
        System.out.println(sGBK);

        // Decoding with a different encoding produces garbled text
        // 1. Decode the UTF-8 bytes as UTF-16
        String sUTF16Error = new String(bytesDefault, "UTF16");
        System.out.println(sUTF16Error);
        // 2. Decode the UTF-8 bytes as "Unicode"
        String sUnicodeError = new String(bytesDefault, "Unicode");
        System.out.println(sUnicodeError);
        // 3. Decode the UTF-8 bytes as GBK
        String sGBKError = new String(bytesDefault, "GBK");
        System.out.println(sGBKError);
    }

    // Converts a byte[] to a hexadecimal string for output, e.g. {90, 20, 21}
    // returns "5a1415". Needed because bytes are printed in decimal by default.
    private static String getHexString(byte[] b) {
        String hexs = "";
        for (int i = 0; i < b.length; i++) {
            String hex = Integer.toHexString(b[i] & 0xFF);
            if (hex.length() == 1) {
                hex = "0" + hex;
            }
            hexs += hex;
        }
        return hexs;
    }
}
Output:
UTF-8
615ae4b8ade69687
615ae4b8ade69687
615ae4b8ade69687
feff0061005a4e2d6587
feff0061005a4e2d6587
615ad6d0cec4
aZ中文
aZ中文
aZ中文
aZ中文
aZ中文
aZ中文
Ashamed ? 隇
Ashamed ? 隇
AZ Juan PO
Two, encoding in Java IO
Summary of "Java Coding topics"