If you have worked on software that supports multiple languages, you have almost certainly run into garbled text, and while trying to fix it you will have met terms such as ASCII, ISO-8859-1, Unicode, UTF-8, GBK and GB2312. This article explains these concepts.

1. ASCII

ASCII is a 7-bit encoding that stores English letters and some common symbols as numbers from 0 to 127.

2. ISO-8859-1

Western European languages such as French, Spanish and German use an encoding called ISO-8859-1 (also called "Latin-1"). It uses the same codes as 7-bit ASCII for 0 to 127, and extends into the range 128 to 255 for characters such as n with a tilde (ñ, 241) and u with two dots (ü, 252). ASCII is a subset of ISO-8859-1.

3. Unicode

Unicode, in the 16-bit form described here, uses a 2-byte number from 0 to 65535 for each character. (Unicode has since grown beyond 16 bits, up to U+10FFFF; this article covers only the 2-byte range.) Each number represents a unique character used in at least one language, and a character used in several languages has a single number. Every character therefore has exactly one number, so Unicode data is never ambiguous.

Unicode uses the same numbers as ASCII and ISO-8859-1 for the characters those encodings cover. Those encodings use one byte per character and Unicode uses two, so for those characters only the low byte is used and the high byte is 0.

4. UTF-8

UTF-8 is a variable-length encoding of Unicode characters. As originally designed, one character takes 1 to 6 bytes (the current standard, RFC 3629, limits this to 4). ASCII characters are encoded as the same single byte they have in ASCII. All other characters take two or more bytes; a 2-byte Unicode value needs at most three bytes in UTF-8. The first byte of a multi-byte sequence starts with n one-bits followed by a zero (1 < n <= 6), where n is the number of bytes in the sequence.
Each following byte starts with the bits 10, and its low 6 bits carry data. Concatenating the remaining bits of the first byte with the low six bits of every following byte gives the Unicode value. For example, the Chinese character "中" is encoded as:

Unicode: 4E 2D      01001110 00101101
UTF-8:   E4 B8 AD   11100100 10111000 10101101

The 16 Unicode bits split into 0100, 111000 and 101101, which fill the data positions of the pattern 1110xxxx 10xxxxxx 10xxxxxx.
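The bit packing in the example above can be sketched in Java. This is only an illustration under stated assumptions: the class name Utf8BitsDemo and the helpers encodeThreeBytes and hex are hypothetical names made up for this sketch, and the manual packing handles only values from 0x0800 to 0xFFFF. The result is compared with the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BitsDemo {
    // Pack a value in the range 0x0800..0xFFFF into three UTF-8 bytes.
    // (Hypothetical helper for illustration; no range checking is done.)
    public static byte[] encodeThreeBytes(int codePoint) {
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    // Render bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x4E2D);
        // \u4E2D is the character used in the example; the escape avoids
        // depending on the encoding of this source file.
        byte[] library = "\u4E2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));   // E4 B8 AD
        System.out.println(hex(library));  // E4 B8 AD
    }
}
```

Both lines should show the same three bytes, which is a quick way to check the hand-computed bit layout against a trusted encoder.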
You can confirm this with Notepad: create a text file, type the Chinese character "中", and save it once in Unicode format and once in UTF-8 format. Disable UltraEdit's option for automatically detecting UTF-8 files, then open both files with it in hex view.

The UTF-8 file contains "EF BB BF E4 B8 AD". The three-byte prefix "EF BB BF" is the marker of a UTF-8 text file. The prefix is not required; some text viewers can also recognize UTF-8 from the byte patterns alone. "E4 B8 AD" is the UTF-8 encoding of "中".

The Unicode file contains "FF FE 2D 4E". The two-byte prefix "FF FE" is the marker of a Unicode text file. The bytes that follow are "2D 4E" rather than the "4E 2D" given earlier. Why? Because the number is stored low byte first (little-endian), so the actual Unicode value is still 4E2D.

5. GB2312 and GBK

GB2312 and GBK are both Chinese character encoding standards, and the former is a subset of the latter. GBK-encoded text uses the same single-byte representation as ASCII for ASCII characters, and two bytes for Chinese characters and Chinese punctuation. The high (first) byte of a two-byte character is greater than 0x80, while every ASCII character is less than 0x80, so ASCII and GBK characters can be mixed in the same text. There is no formula for converting between the GBK character set and Unicode; the conversion requires a lookup table.

6. Converting between Unicode and UTF-8 in Java

The following is a snippet of a Unicode and UTF-8 conversion program written in Java, for your reference. Because Java characters are Unicode encoded, the program converts between a byte array holding a UTF-8-encoded string and Java's String type. Each character in a String object is a (16-bit) Unicode character.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // The class name is illustrative; the original snippet showed only the methods.
    public class Utf8Converter {

        // Decodes UTF-8 bytes (sequences of 1 to 3 bytes) into a String.
        // Returns null for null input or for a malformed or longer sequence.
        public String utf8bytes2string(byte[] buff) {
            if (buff == null) return null;
            StringBuilder sb = new StringBuilder();
            int idx = 0;
            if (buff.length >= 3 && buff[0] == (byte) 0xef
                    && buff[1] == (byte) 0xbb && buff[2] == (byte) 0xbf)
                idx = 3; // skip utf8 header
            while (idx < buff.length) {
                int hb = buff[idx] & 0xff;
                // Count the leading 1 bits of the first byte.
                int bcnt = 0;
                int check = 0x80;
                for (int i = 0; i < 8; i++) {
                    if ((hb & check) != 0) {
                        bcnt++;
                        check >>= 1;
                    } else break;
                }
                if (bcnt == 0) { // 0xxxxxxx: single byte
                    sb.append((char) hb);
                    idx++;
                } else if (bcnt == 2) { // 110xxxxx 10xxxxxx
                    if (idx + 1 >= buff.length) return null;
                    char c = 0;
                    c |= buff[idx] & 0x1f;
                    c <<= 6;
                    if ((buff[idx + 1] & 0xc0) != 0x80) return null;
                    c |= buff[idx + 1] & 0x3f;
                    idx += 2;
                    sb.append(c);
                } else if (bcnt == 3) { // 1110xxxx 10xxxxxx 10xxxxxx
                    if (idx + 2 >= buff.length) return null;
                    char c = 0;
                    c |= buff[idx] & 0x0f;
                    c <<= 6;
                    if ((buff[idx + 1] & 0xc0) != 0x80) return null;
                    c |= buff[idx + 1] & 0x3f;
                    c <<= 6;
                    if ((buff[idx + 2] & 0xc0) != 0x80) return null;
                    c |= buff[idx + 2] & 0x3f;
                    idx += 3;
                    sb.append(c);
                } else return null; // stray continuation byte, or 4 or more bytes
            }
            return sb.toString();
        }

        // Encodes a String as UTF-8 bytes; returns null for null input.
        public byte[] string2utf8bytes(String str) {
            if (str == null) return null;
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try {
                string2utf8stream(str, bos);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return bos.toByteArray();
        }

        // Writes each 16-bit char of str to os as 1 to 3 UTF-8 bytes.
        // Surrogate pairs are not combined, so characters above 0xFFFF
        // are not handled by this snippet.
        public void string2utf8stream(String str, OutputStream os) throws IOException {
            if (str == null || os == null) return;
            for (int i = 0; i < str.length(); i++) {
                char c = str.charAt(i);
                if (c < 0x80) {
                    os.write(c);
                } else if (c < 0x800) {
                    int hi = c >> 6;
                    hi |= 0xc0;
                    int lo = c & 0x3f;
                    lo |= 0x80;
                    os.write(hi);
                    os.write(lo);
                } else {
                    int first = c >> 12;
                    first |= 0xe0;
                    int second = c >> 6;
                    second &= 0x3f;
                    second |= 0x80;
                    int third = c & 0x3f;
                    third |= 0x80;
                    os.write(first);
                    os.write(second);
                    os.write(third);
                }
            }
        }
    }
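The byte order and GBK points from sections 4 and 5 can also be checked with the JDK's built-in charsets instead of a hex editor. The sketch below is illustrative only: ByteOrderDemo and hex are hypothetical names, and it assumes the running JDK ships the GBK charset (most full JDK installations do, which is why the code checks first). Note that Java's UTF-16LE and UTF-16BE encoders write no prefix bytes, so only the character bytes appear.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Render bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // the character used in the article's example

        // Same 16-bit value, two byte orders.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D

        // GBK is table-driven: its bytes are unrelated to the value 4E2D.
        // Assumption: the GBK charset may be absent on a trimmed-down runtime.
        if (Charset.isSupported("GBK")) {
            System.out.println(hex(s.getBytes(Charset.forName("GBK")))); // D6 D0
        }
    }
}
```

The little-endian output matches the "2D 4E" seen in the Notepad file, while the GBK bytes show why a lookup table is needed for that conversion.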