[Python Basics] Character encoding: how ASCII, GBK, Unicode, and UTF-8 relate

Source: Internet
Author: User


Character Encoding
Computers only work with 0s and 1s. To let a computer handle text, letters, and other symbols in a convenient and practical way,
character encodings were created: a character encoding defines the correspondence between the symbols of human languages and the 0s and 1s a computer stores.
It is said that never understanding character encoding can be a lifelong regret for a programmer, so it is summarized separately here.

Development history:
    ASCII -> Unicode -> UTF-8
    ASCII is the earliest of these and originated in the United States. It maps A-Z, a-z, the digits, and a set of special symbols and control codes, 128 characters in total, to binary numbers.
        For example: lowercase w is decimal 119, which is binary 1110111
             A space is decimal 32, which is binary 100000
        Written back to back without padding, [space] [space] w would be 1000001000001110111 # the two leading 100000 groups are the two spaces
        The problem is that such a bit string gives no way to tell where one character ends and the next begins. So every character was standardized to occupy 8 bits, and the data is read in 8-bit units.
        With that rule, [space] [space] w is 00100000 00100000 01110111 # each 1 or 0 occupies 1 bit
        As a result, an ASCII character takes 1 byte today. (ASCII itself defines only 7 bits; the eighth, leading bit is 0.)
            Unit conversions:
                bit: the smallest unit of information in a computer, a single 0 or 1; every character is ultimately converted into bits
                8 bits = 1 byte (B): the byte is the smallest addressable unit of storage
                1KB = 1024B
                1MB = 1024KB
                1GB = 1024MB
                ...
        Advantage: it established a correspondence between characters and binary numbers
        Disadvantage: it covers only English; text in other languages, such as Chinese or Korean, cannot be represented and shows up as garbled characters
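The boundary problem and the fixed 8-bit fix can be checked directly in Python. The following is a minimal sketch; the helper name `to_bits` is made up for this illustration, and only the built-ins `ord` and `format` are used.

```python
def to_bits(text: str, width: int = 8) -> str:
    """Return each character's code as a zero-padded binary group (hypothetical helper)."""
    return " ".join(format(ord(ch), f"0{width}b") for ch in text)


# Code values Python reports for the two characters in the example.
w_code = ord("w")
space_code = ord(" ")

# Without padding, the bit strings have different lengths, so once they are
# joined there is no way to see where one character ends and the next begins.
unpadded = "".join(format(ord(ch), "b") for ch in "  w")

# With fixed 8-bit groups, the boundaries are unambiguous.
padded = to_bits("  w")

print(w_code, space_code)
print(unpadded)
print(padded)
```

Reading the padded string back is just a matter of cutting it every 8 bits, which is the point of giving every character the same width.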

    GBK: to support both Chinese and English (ASCII), China defined GBK. Chinese and English text can both be written in GBK, but text in other languages, such as Japanese, still comes out garbled.
            GBK: 2 bytes represent a Chinese character, 1 byte represents an English character
            Other countries did the same for their own languages, each defining its own encoding
            Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR
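As a rough illustration of the byte counts and of how garbling arises, the sketch below encodes a mixed Chinese/English string with Python's built-in `gbk` codec and then decodes the same bytes with `shift_jis`. The sample string is an assumption chosen for this example, and `errors="replace"` is used so the wrong decode cannot raise.

```python
text = "中文 abc"  # sample string chosen for this illustration

# 2 Chinese characters at 2 bytes each, plus 1 space and 3 letters at 1 byte each.
gbk_bytes = text.encode("gbk")
print(len(gbk_bytes))

# Decoding with the codec that wrote the bytes restores the text.
restored = gbk_bytes.decode("gbk")
print(restored)

# Decoding the same bytes with another national codec yields different,
# meaningless characters: this is the garbling described above.
garbled = gbk_bytes.decode("shift_jis", errors="replace")
print(garbled)
```

The ASCII part (`" abc"`) survives the wrong decode because both codecs keep ASCII in a single byte; only the multi-byte part is misread.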

    Unicode was created so that the languages of all countries can coexist in one text. It is often called the "universal code" and, in this article's model, it is the form text takes in memory.
            One byte has 8 bits, so its largest value is 11111111, decimal 255: a single byte can distinguish at most 256 characters (ASCII uses only 128 of them).
            That is nowhere near enough for Chinese, which has tens of thousands of characters, so a second byte was added to the original one.

            ASCII: English only, 128 characters, 1 byte per character
            Unicode: Chinese and every other script, room for more than 1,000,000 characters, 2-4 bytes per character depending on the encoding
            To stay compatible with ASCII, Unicode gives the 128 ASCII characters the same numbers; in the 2-byte form, an ASCII character is simply preceded by a 00000000 byte.
            The original Unicode design (UCS-2) was fixed-length: every character occupied exactly 2 bytes.
            Two bytes can express only 2 ** 16 = 65536 different values (the largest being 2 ** 16 - 1 = 65535),
            so the code space was later extended to 1,114,112 code points (U+0000 to U+10FFFF), which is how Unicode has room for 1,000,000+ characters. To be precise, Unicode is a character set that assigns a number (a code point) to each character rather than a byte-level encoding; UTF-8, UTF-16, and UTF-32 are the encodings that turn code points into bytes.
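A short sketch of these points in Python 3, where `ord` returns a character's Unicode code point. The `utf-16-be` codec stands in here for the 2-byte form discussed above, which is an assumption made for illustration; the sample characters are arbitrary.

```python
import sys

# Every character has one number (code point), whatever its language.
w_point = ord("w")
zhong_point = ord("中")
print(w_point, hex(zhong_point))

# The largest code point Python supports is U+10FFFF.
print(hex(sys.maxunicode))

# In a 2-byte form, an ASCII character is its ASCII byte preceded by a zero byte.
ascii_2byte = "w".encode("utf-16-be")
print(ascii_2byte)

# A common Chinese character also fits in 2 bytes...
chinese_2byte = "中".encode("utf-16-be")

# ...but a character beyond the first 65536 code points needs 4 bytes.
rare_4byte = "\U00020000".encode("utf-16-be")
print(len(chinese_2byte), len(rare_4byte))
```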

    UTF-8: for text that is entirely English, a 2-byte-per-character Unicode format doubles the storage space compared with ASCII (the binary data is ultimately stored on a medium electrically or magnetically).
            So UTF-8 (a variable-length encoding; the name stands for 8-bit Unicode Transformation Format) was created. It uses 1 byte for an English character, 3 bytes for most Chinese characters,
            and 4 bytes for rarer characters. This is one of the reasons UTF-8 is so widely adopted today.
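The byte counts can be measured with `str.encode`. This is a minimal sketch; the sample characters and the English sentence are assumptions picked for the example, and `utf-16-le` is used as a stand-in for a 2-byte-per-character format.

```python
# One sample per category; the specific characters are arbitrary choices.
samples = {
    "English letter": "a",
    "Chinese character": "中",
    "rare character": "\U00020000",
}

# Number of bytes UTF-8 needs for each sample.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
for label, size in utf8_lengths.items():
    print(label, size)

# For all-English text, a 2-byte format takes twice the space of UTF-8.
english = "hello world"
utf8_size = len(english.encode("utf-8"))
two_byte_size = len(english.encode("utf-16-le"))
print(utf8_size, two_byte_size)
```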

The entire development process:
    ASCII -> GBK -> Unicode -> UTF-8
Usage process:
    In the model used here, the encoding in memory is always Unicode; the only thing we choose is the character encoding used on the hard disk.
    At this point you may wonder: if all software used the fixed-width Unicode format everywhere, wouldn't everything be unified? The idea of unification is right, but we cannot use
    that format to write program files: for all-English text it takes almost twice the space, so reading the file into memory or writing it to disk requires more I/O,
    which reduces the execution efficiency of the program. Therefore, program files should be written in a more compact encoding, UTF-8 (1 byte for English, 3 bytes for Chinese).
    To repeat: the encoding in memory is fixed as Unicode.
        1. When saving to disk, Unicode is converted (encoded) to the more compact UTF-8 format.
        2. When reading into memory, UTF-8 is converted (decoded) back to Unicode.
    So to be clear: Unicode is used in memory to stay compatible with software from every country. Even if files on the hard disk were written in various national encodings, Unicode has a mapping for each of them.
    In current development, programmers generally use UTF-8. Presumably, once all the old software has been retired, the picture can become: UTF-8 in memory <-> UTF-8 on disk.
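The two conversion steps can be sketched in Python 3, where a `str` holds Unicode text and `open(..., encoding="utf-8")` performs the encode on write and the decode on read. The temporary file and the sample text are assumptions made for this illustration; how the interpreter lays out a `str` internally is an implementation detail not shown here.

```python
import os
import tempfile

text = "hello 中文"  # Unicode text in memory (sample chosen for illustration)

# Create a throwaway file path for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Step 1: saving to disk encodes the Unicode text as UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Inspect the raw bytes that actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Step 2: reading into memory decodes the UTF-8 bytes back to Unicode.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()
finally:
    os.remove(path)

print(len(text), len(raw))
print(loaded == text)
```

Passing `encoding` explicitly matters because the default used by `open` depends on the platform; reading a file with a different codec than the one that wrote it is the usual source of garbled text.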


Reference: http://www.cnblogs.com/linhaifeng/articles/5950339.html
For more illustrations, see the link above.


