Python3-Character encoding

Source: Internet
Author: User

1.1 What is a byte

A byte is a unit that computing uses to measure storage capacity. In some programming languages it is also a data type and the unit in which characters are stored.

A bit is the smallest unit of information in a computer. Each bit represents a digital signal of 0 or 1.

1.1.1 Byte representations in different character encodings
    1. ASCII: An English letter (upper or lower case) occupies one byte (1 B). ASCII itself cannot represent Chinese characters; in the Chinese extensions GB2312 and GBK, a Chinese character occupies two bytes (2 B). A byte is a sequence of binary digits treated as one unit, typically 8 bits, so its decimal value ranges from 0 to 255.
    2. UTF-8 encoding: An English character occupies one byte, and a common Chinese character (simplified or traditional) occupies 3 bytes.
    3. Unicode encoding (the two-byte form, UCS-2/UTF-16): An English character occupies two bytes, and a common Chinese character (simplified or traditional) also occupies 2 bytes.

Symbols: English punctuation occupies one byte, and Chinese punctuation occupies two bytes in GBK (three in UTF-8). Example: the English period "." takes one byte, while the Chinese period "。" takes two bytes in GBK.
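The byte counts above can be checked directly in Python 3. The following is a minimal sketch; the sample characters and the choice of "gbk" as the two-byte Chinese encoding and "utf-16-le" as the two-byte Unicode form are assumptions made for illustration.

```python
# Byte length of the same characters in three encodings.
samples = {
    "English letter": "A",
    "English period": ".",
    "Chinese character": "\u4e2d",  # 中
    "Chinese period": "\u3002",     # 。
}
encodings = ("gbk", "utf-8", "utf-16-le")

# For each sample, record how many bytes each encoding produces.
sizes = {
    label: {enc: len(ch.encode(enc)) for enc in encodings}
    for label, ch in samples.items()
}

for label, per_encoding in sizes.items():
    print(label, per_encoding)
```

str.encode() returns a bytes object, so len() on the result counts bytes rather than characters.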

1.2 Representation and conversion of number bases

Representation in each base:

Binary (base 2): 0 1

Octal (base 8): 0 1 2 3 4 5 6 7 (combinations of 0-7) (when converting octal to binary, each octal digit maps to 3 bits; pad with leading zeros to 3 bits)

Decimal (base 10): combinations of 0-9

Hexadecimal (base 16): 0 1 2 3 4 5 6 7 8 9 A B C D E F (combinations of 0-F) (when converting hexadecimal to binary, each hexadecimal digit maps to 4 bits; pad with leading zeros to 4 bits)

Conversions between bases:

Http://jingyan.baidu.com/article/495ba84109665338b30ede98.html
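In Python 3 the built-in functions bin(), oct(), hex(), and int() cover these conversions. The sketch below also expands digits group by group, as described above; the helper names octal_to_binary and hex_to_binary are hypothetical and exist only for this illustration.

```python
n = 255

binary_text = bin(n)  # binary string, prefixed with 0b
octal_text = oct(n)   # octal string, prefixed with 0o
hex_text = hex(n)     # hexadecimal string, prefixed with 0x

# int() parses a digit string in the given base.
from_binary = int("11111111", 2)
from_octal = int("377", 8)
from_hex = int("ff", 16)


def octal_to_binary(octal_digits: str) -> str:
    """Expand each octal digit to 3 bits, padding with leading zeros."""
    return "".join(format(int(d, 8), "03b") for d in octal_digits)


def hex_to_binary(hex_digits: str) -> str:
    """Expand each hexadecimal digit to 4 bits, padding with leading zeros."""
    return "".join(format(int(d, 16), "04b") for d in hex_digits)


print(binary_text, octal_text, hex_text)
print(octal_to_binary("377"), hex_to_binary("ff"))
```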

1.3 Data storage and transmission

Data storage

Hard disk manufacturers calculate capacity in decimal (10^3 = 1000, so 1 MB = 1000 KB), while the computer calculates in binary (2^10 = 1024, so 1 MB = 1024 KB).

This is why many hard drives sold as 1 TB show only about 931 GB on the computer:

1 TB = 1000 GB = 1000000 MB = 1000000000 KB = 1000000000000 B (as labeled by the hard disk manufacturer)

1 TB = 1000000000000 B / 1024 / 1024 / 1024 ≈ 931 GB (as reported by the computer)
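The same arithmetic as a short Python sketch; the variable names are illustrative only.

```python
# Capacity printed on the box: 1 TB counted in powers of 1000.
marketed_bytes = 1 * 1000**4

# Capacity shown by the computer: the same bytes counted in powers of 1024.
reported_gb = marketed_bytes / 1024**3

print(f"{reported_gb:.2f} GB")
```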

1 KB = 1024 B; 1 MB = 1024 KB = 1024 x 1024 B, where 1024 = 2^10. 1 B (byte) = 8 bits (see below). Under this binary convention:

1 KB (kilobyte) = 1024 B = 2^10 B
1 MB (megabyte) = 1024 KB = 2^20 B
1 GB (gigabyte) = 1024 MB = 2^30 B
1 TB (terabyte) = 1024 GB = 2^40 B
1 PB (petabyte) = 1024 TB = 2^50 B
1 EB (exabyte) = 1024 PB = 2^60 B
1 ZB (zettabyte) = 1024 EB = 2^70 B
1 YB (yottabyte) = 1024 ZB = 2^80 B
1 BB (brontobyte) = 1024 YB = 2^90 B
1 NB (nonabyte) = 1024 BB = 2^100 B
1 DB (doggabyte) = 1024 NB = 2^110 B

The names beyond yottabyte (brontobyte, nonabyte, doggabyte) are informal and not part of any standard. Strictly speaking, the binary units are named kibibyte (KiB), mebibyte (MiB), and so on, but the KB/MB/GB spellings are widely used for them.

Data transmission:

Data storage is measured in bytes, while data transfer is mostly measured in bits. A bit represents 0 or 1 (that is, one binary digit). Every 8 bits (bit, abbreviated b) make up one byte (byte, abbreviated B). The bit is the smallest unit of information.

The smallest unit transmitted in a computer is the bit, so the basic unit of digital traffic is the bit and the basic unit of time is the second (s). Therefore bit/s (bits per second) is the basic unit for describing bandwidth.

Bandwidth (bps) is the maximum number of bits that can be transferred in a fixed time (1 second).

bps = bits per second

Example:

A broadband carrier sells a 20 Mbps connection, but the actual maximum download speed is about 2.5 MB/s (download speed is shown in bytes per second, with a capital B).

20 Mb/s = 20/8 MB/s = 2.5 MB/s

1 Mb/s = 1024 Kb/s = 1024/8 KB/s = 128 KB/s

Upload and download traffic share your bandwidth.
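The bit-to-byte conversion in the example is a single division by 8. A minimal Python sketch, using a hypothetical helper name and the same 1024-based prefixes as the example:

```python
BITS_PER_BYTE = 8


def bits_to_bytes_rate(rate_in_bits: float) -> float:
    """Convert a rate in bits per second to the same rate in bytes per second."""
    return rate_in_bits / BITS_PER_BYTE


download_mb_per_s = bits_to_bytes_rate(20)       # 20 Mb/s expressed in MB/s
one_mbit_in_kb_per_s = bits_to_bytes_rate(1024)  # 1 Mb/s = 1024 Kb/s, expressed in KB/s

print(download_mb_per_s, one_mbit_in_kb_per_s)
```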

1.4 Character encoding

A character encoding is a standard that specifies which number each character corresponds to.

1.ASCII encoding

Because the computer was invented by Americans, at first only 128 characters were encoded: the English letters, the digits, and some symbols. This encoding is called ASCII (American Standard Code for Information Interchange).

Although the standard ASCII character set has only 128 characters, the basic processing unit of a computer is the byte (1 byte = 8 bits), so an ASCII character is generally stored in a single byte. The extra bit (the highest bit) in each byte is usually kept as 0 inside the computer (in data transfer it can be used as a parity bit). Because the standard ASCII set is often not enough in practice, more characters were added later, such as the horizontal lines, vertical lines, and crosses needed for drawing tables, extending the numbering up to 255. The characters numbered 128 to 255 are called the "extended character set".
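The mapping between ASCII characters and numbers can be inspected in Python 3 with ord() and chr(). This is a small sketch; the variable names are illustrative only.

```python
code = ord("A")             # character -> number
letter = chr(97)            # number -> character
bits = format(code, "08b")  # the number as 8 bits; the highest bit is 0 for ASCII

encoded = "A".encode("ascii")  # one ASCII character is stored in one byte

# A character outside the 128 ASCII characters cannot be encoded as ASCII.
try:
    "\u4e2d".encode("ascii")  # 中
    ascii_can_encode_chinese = True
except UnicodeEncodeError:
    ascii_can_encode_chinese = False

print(code, letter, bits, encoded, ascii_can_encode_chinese)
```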

2.GBK, GB18030, GB2312

GB2312 (1980) contains a total of 7,445 characters: 6,763 Chinese characters and 682 other symbols. In the Chinese character area, the internal code has a high byte from B0 to F7 and a low byte from A1 to FE, giving 72*94 = 6768 code points. Five of these, D7FA-D7FE, are unassigned.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area holds 21,003 characters. GB18030, published in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters, as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. PC platforms are now required to support GB18030, but embedded products are not, so mobile phones and MP3 players generally support only GB2312.

GBK is a specification for Chinese character encoding.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in every scheme that includes it, and each later standard supports more characters.
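This compatibility can be observed with Python's built-in "gb2312", "gbk", and "gb18030" codecs. The sketch below is illustrative: the helper encode_in_all is hypothetical, and the three sample characters were chosen as assumptions (an ASCII letter, a common simplified character, and a traditional character that GB2312 does not include).

```python
CODECS = ("gb2312", "gbk", "gb18030")


def encode_in_all(ch: str) -> dict:
    """Encode one character with each codec; None means the codec lacks it."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = ch.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result


ascii_letter = encode_in_all("A")            # same single byte everywhere
common_hanzi = encode_in_all("\u4e2d")       # 中, present in all three
traditional_hanzi = encode_in_all("\u9f8d")  # 龍, added by GBK

for name, table in (("A", ascii_letter),
                    ("U+4E2D", common_hanzi),
                    ("U+9F8D", traditional_hanzi)):
    print(name, table)
```

Where a character exists in more than one of the schemes, the bytes are identical; the later schemes only add characters.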

3.Unicode encoding

There are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and so on. With every country using its own standard, conflicts are inevitable, and text that mixes several languages is displayed as garbled characters.

Unicode emerged to solve this. Unicode unifies all languages into one set of codes, so the garbling problem goes away. The Unicode standard keeps evolving, but most commonly a character is represented in two bytes (four bytes for very rare characters). Unicode is supported directly by modern operating systems and most programming languages.

Unicode's character set is also known as UCS (Universal Multiple-Octet Coded Character Set).

With Unicode, the era in which one Chinese character counted as two English characters is almost over.

Whether it is a half-width letter or a full-width Chinese character, each is uniformly "one character", and in the two-byte form each is also uniformly "two bytes".

Note that "character" and "byte" are different terms: a "byte" is an 8-bit physical storage unit, while a "character" is a culture-related symbol. In the two-byte form of Unicode, a character is two bytes.

We already know that one byte is enough for an English letter. If every character followed the uniform Unicode rule of two bytes, a great deal of storage space would be wasted: an English text file would be twice as large, which is unacceptable.
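The doubling is easy to see in Python 3. This sketch assumes "utf-16-le" as a stand-in for the fixed two-byte form (it uses two bytes for each of these characters and writes no byte-order mark); the sample text is arbitrary.

```python
text = "hello world"

one_byte_size = len(text.encode("ascii"))      # one byte per English character
two_byte_size = len(text.encode("utf-16-le"))  # two bytes per character

print(one_byte_size, two_byte_size)
```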

4.UTF-8

For a long time Unicode was not widely adopted. With the rise of the Internet, many UTF (UCS Transformation Format) standards appeared to solve the problem of transmitting Unicode over the network. As the names suggest, UTF-8 handles data in 8-bit units and UTF-16 in 16-bit units. UTF-8 is the most widely used Unicode implementation on the Internet. It was designed for transmission and is not tied to any one country, so it can represent characters from all cultures around the world.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol. A character in the ASCII range takes one byte, so single-byte ASCII is preserved as a subset of UTF-8. (Note that a Chinese character takes 2 bytes in two-byte Unicode but 3 bytes in UTF-8.) The mapping from Unicode to UTF-8 is not direct; it goes through a set of rules for conversion.
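The conversion rules can be sketched by hand and compared against Python's own encoder. The function utf8_encode below is a hypothetical helper written only for illustration; in real code, str.encode("utf-8") does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustration only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (most Chinese characters)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample per length: A, e with acute accent, 中, and an emoji.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, which is what makes the variable length workable.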

