While studying IO recently I kept running into garbled text. Only after asking a more experienced developer did I learn that the cause was a mismatch between file encodings. To avoid the same problem in future, here is a summary of what I have learned about character encoding. I am a beginner, so there are bound to be omissions or mistakes, and corrections from experts are welcome.
First, what is encoding?
Computers store everything as bytes, that is, as binary data. A single byte can take only 256 values, far too few to cover every symbol used in human languages, so rules are needed to map characters to byte sequences. Applying those rules is called encoding. It is a bit like translating Chinese into English: the act of translating is the encoding step. Common encodings include ASCII, ISO-8859-1, GB2312, GBK, UTF-8 and UTF-16. Of these, GB2312, GBK, UTF-8 and UTF-16 can all represent Chinese characters.
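As a rough illustration of the idea that one string becomes different bytes under different encodings, here is a minimal Java sketch. The class name EncodeDemo and the helper toHex are made up for this example; only the charsets that every Java runtime is required to provide are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Encode the text with the given charset and return the bytes as uppercase hex.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A\u4E2D"; // 'A' followed by the Chinese character U+4E2D
        System.out.println(toHex(text, StandardCharsets.UTF_8));      // 41E4B8AD
        System.out.println(toHex(text, StandardCharsets.UTF_16BE));   // 00414E2D
        // ISO-8859-1 cannot represent U+4E2D, so getBytes substitutes '?' (3F).
        System.out.println(toHex(text, StandardCharsets.ISO_8859_1)); // 413F
    }
}
```

The last line shows where garbled text comes from: once a character has been replaced by '?', no decoder can bring it back.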
ASCII Code
Anyone who has studied computing knows ASCII. It defines 128 characters in total, represented by the low 7 bits of a byte. Codes 0~31 and code 127 are control characters such as line feed, carriage return and delete. Codes 32~126 are printable characters, which can be typed on a keyboard and displayed.
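The ranges above can be sketched as a small classifier. The class and method names are hypothetical and exist only for this illustration.

```java
class AsciiDemo {
    // Classify a numeric code according to the ASCII ranges described above.
    static String classify(int code) {
        if (code < 0 || code > 127) {
            return "not ASCII";          // needs more than 7 bits
        }
        if (code <= 31 || code == 127) {
            return "control";            // e.g. 10 = line feed, 13 = carriage return
        }
        return "printable";              // 32 (space) through 126 ('~')
    }

    public static void main(String[] args) {
        System.out.println(classify('\n')); // control
        System.out.println(classify('A'));  // printable
        System.out.println(classify(0xE9)); // not ASCII
    }
}
```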
ISO-8859-1
128 characters are clearly not enough, so ISO defined a series of standards that extend ASCII: ISO-8859-1 through ISO-8859-15. ISO-8859-1 covers most Western European languages and is the most widely used of them. It is still a single-byte encoding and can represent 256 characters in total.
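Because ISO-8859-1 assigns a character to each of the 256 byte values, any byte sequence can be decoded with it and encoded back without loss. A minimal sketch of that property follows; the class name Latin1Demo is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Demo {
    // Decode the bytes as ISO-8859-1, encode the result again,
    // and report whether the original bytes come back unchanged.
    static boolean roundTrips(byte[] data) {
        String decoded = new String(data, StandardCharsets.ISO_8859_1);
        byte[] encoded = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(data, encoded);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;           // every possible byte value
        }
        System.out.println(roundTrips(all)); // true
    }
}
```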
GB2312
Its full name is "Chinese Coded Character Set for Information Interchange, Basic Set". It is a double-byte encoding whose first byte ranges over A1-F7. Within that range, A1-A9 is the symbol area, with 682 symbols in total, and B0-F7 is the Chinese character area, with 6,763 Chinese characters.
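The areas above can be expressed as a check on the first byte of a two-byte GB2312 code. This is only a sketch with hypothetical names; it looks at the first byte alone and does not validate the second byte.

```java
class Gb2312Zones {
    // Report which GB2312 area the first byte of a two-byte code falls into.
    static String zoneOf(int leadByte) {
        if (leadByte >= 0xA1 && leadByte <= 0xA9) {
            return "symbol";             // symbol area A1-A9
        }
        if (leadByte >= 0xB0 && leadByte <= 0xF7) {
            return "hanzi";              // Chinese character area B0-F7
        }
        return "other";                  // outside the two areas described above
    }

    public static void main(String[] args) {
        // U+4E2D is encoded in GB2312 as the two bytes D6 D0.
        System.out.println(zoneOf(0xD6)); // hanzi
        System.out.println(zoneOf(0xA1)); // symbol
        System.out.println(zoneOf(0x41)); // other
    }
}
```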
GBK
Its full name is the "Chinese Internal Code Extension Specification". It is a new Chinese character code specification issued by the State Bureau of Technical Supervision for Windows 95. It was created to extend GB2312 with more Chinese characters. Its code range is 8140~FEFE (excluding xx7F), 23,940 code points in total, and it can represent 21,003 Chinese characters. GBK is compatible with GB2312: Chinese text encoded with GB2312 can be decoded with GBK without any garbled characters.
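The figure of 23,940 code points can be checked by counting: first bytes 81~FE, second bytes 40~FE without 7F. The sketch below does that count and also tries the GB2312-to-GBK compatibility claim. The class and method names are hypothetical, and the second method assumes the runtime ships the GB2312 and GBK charsets, which Java does not guarantee, so main checks for them first.

```java
import java.nio.charset.Charset;

class GbkDemo {
    // Count the code points in 8140~FEFE, skipping every xx7F.
    static int countCodePoints() {
        int count = 0;
        for (int lead = 0x81; lead <= 0xFE; lead++) {
            for (int trail = 0x40; trail <= 0xFE; trail++) {
                if (trail != 0x7F) {
                    count++;
                }
            }
        }
        return count;
    }

    // Encode with GB2312, then decode the same bytes with GBK.
    // Throws UnsupportedCharsetException if either charset is missing.
    static String gb2312ThenGbk(String text) {
        byte[] bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println(countCodePoints()); // 23940
        if (Charset.isSupported("GB2312") && Charset.isSupported("GBK")) {
            String text = "\u4E2D\u6587";      // two common Chinese characters
            System.out.println(text.equals(gb2312ThenGbk(text)));
        }
    }
}
```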
GB18030
Its full name is the "Chinese Coded Character Set for Information Interchange". It is a mandatory national standard in China. A character may be encoded in one, two or four bytes, and the encoding is compatible with GB2312. Although it is the national standard, few systems actually use it in practice.
UTF-16
To talk about UTF we first have to mention Unicode (Universal Code, the unified code). ISO set out to create a new super-dictionary through which all the languages of the world could be translated into one another. You can imagine how complex that dictionary is; for the details, refer to the Unicode specification. Unicode is the foundation of Java and XML. The following describes how Unicode is stored in a computer.
UTF-16 defines how Unicode characters are stored and accessed in a computer. It is built on 16-bit units, which is where the name comes from. Characters in the Basic Multilingual Plane, which covers almost all commonly used characters, take exactly one unit (two bytes). The rarer supplementary characters take two units (a surrogate pair, four bytes), so UTF-16 is not strictly fixed-length. Because the great majority of characters are two bytes each, string operations are much simpler, which is an important reason why Java uses UTF-16 as its in-memory character format.
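Here is a small sketch of how this looks from Java, where a String is a sequence of 16-bit char values. One point worth seeing in code is that a character outside the Basic Multilingual Plane occupies two char values (four bytes) rather than one. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Number of bytes the string occupies in UTF-16 (big-endian, no byte order mark).
    static int utf16ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                                     // a common Chinese character
        String supplementary = new String(Character.toChars(0x1F600)); // outside the BMP

        System.out.println(utf16ByteLength(bmp));                  // 2
        System.out.println(utf16ByteLength(supplementary));        // 4
        System.out.println(supplementary.length());                // 2 (char units)
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

In other words, String.length() counts 16-bit units, not characters; codePointCount is the method to use when supplementary characters may be present.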
UTF-8
UTF-16 uses at least two bytes for every character. That is simple and convenient, but it has a drawback: a large number of characters that would fit in one byte now need two, so the storage space doubles. With network bandwidth still limited, this adds network traffic that is not necessary. UTF-8 instead uses a variable-length scheme in which different ranges of characters are encoded with different numbers of bytes. The original design allowed 1~6 bytes per character; the current standard (RFC 3629) limits this to 1~4 bytes.
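To make the variable length concrete, the following sketch measures how many UTF-8 bytes a few sample characters need. The class name and the choice of sample characters are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1: 'A', plain ASCII
        System.out.println(utf8Length(0xE9));    // 2: Latin small letter e with acute
        System.out.println(utf8Length(0x4E2D));  // 3: a common Chinese character
        System.out.println(utf8Length(0x1F600)); // 4: a character outside the BMP
    }
}
```

For text that is mostly ASCII, UTF-8 is therefore about half the size of UTF-16, while for Chinese text it is larger (three bytes per character instead of two).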
UTF-8 has the following encoding rules:
If the highest bit (the 8th bit) of a byte is 0, the byte is an ASCII character (00-7F). It follows that any ASCII-encoded text is already valid UTF-8.
If a byte begins with 11, it is the first byte of a multi-byte character, and the number of consecutive leading 1s gives the number of bytes in that character. For example, 110xxxxx is the first byte of a two-byte UTF-8 character.
If a byte begins with 10, it is not a first byte; search back toward the start of the data to find the first byte of the current character.
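The three rules above can be sketched directly as bit tests. This is an illustration only, with hypothetical class and method names, and it does not validate that the input is well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8Rules {
    // Returns 1 for an ASCII byte (0xxxxxxx), 0 for a continuation byte (10xxxxxx),
    // and otherwise the number of leading 1 bits, i.e. the length of the sequence.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if ((v & 0x80) == 0) {
            return 1;
        }
        if ((v & 0xC0) == 0x80) {
            return 0;
        }
        int count = 0;
        for (int mask = 0x80; (v & mask) != 0; mask >>= 1) {
            count++;
        }
        return count;
    }

    // From any index, step back over continuation bytes to the first byte of the character.
    static int startOfChar(byte[] bytes, int index) {
        while (index > 0 && sequenceLength(bytes[index]) == 0) {
            index--;
        }
        return index;
    }

    public static void main(String[] args) {
        byte[] bytes = "A\u4E2D".getBytes(StandardCharsets.UTF_8); // 41 E4 B8 AD
        System.out.println(sequenceLength(bytes[0])); // 1
        System.out.println(sequenceLength(bytes[1])); // 3
        System.out.println(sequenceLength(bytes[2])); // 0
        System.out.println(startOfChar(bytes, 3));    // 1
    }
}
```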
Encoding types in Java