Python3-Character encoding

Last Update:2016-11-09 Source: Internet

Author: User

1.1 What is a byte

A byte is a unit of measurement used in computing to measure storage capacity. In some programming languages it also denotes a data type or a character unit.

A bit is the smallest unit of information in a computer. In a binary computer system, each bit represents a digital signal of 0 or 1.

1.1.1 Byte representations in different character encodings

ASCII: An English letter (upper or lower case) occupies one byte (1 B) of space. In the Chinese encodings that extend ASCII, such as GB2312 and GBK, a Chinese character occupies two bytes (2 B). A byte is a sequence of binary digits, typically 8 bits; converted to decimal, its minimum value is 0 and its maximum value is 255.
UTF-8 encoding: An English character occupies one byte, and a Chinese character (simplified or traditional) typically occupies 3 bytes.
Unicode encoding (the two-byte form, such as UCS-2 or UTF-16): An English character occupies two bytes, and a Chinese character (simplified or traditional) also occupies 2 bytes.

Symbols: English punctuation occupies one byte, and Chinese punctuation occupies two bytes (in GBK). Example: the English period "." takes one byte, while the Chinese period "。" takes two bytes.
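The byte counts above can be checked directly in Python 3. The following is a minimal sketch, assuming GBK stands in for the two-byte Chinese encoding and `utf-16-le` stands in for the two-byte "Unicode encoding"; the helper name `byte_lengths` is made up for this illustration.

```python
# Count how many bytes one character occupies in several encodings.
# Assumption: "utf-16-le" represents the two-byte Unicode form (no byte-order mark).
ENCODINGS = ["gbk", "utf-8", "utf-16-le"]


def byte_lengths(ch: str) -> dict:
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


samples = {
    "English letter": "A",
    "Chinese character": "中",
    "English period": ".",
    "Chinese period": "。",
}

for label, ch in samples.items():
    print(label, repr(ch), byte_lengths(ch))
```

The English letter and period take 1 byte in GBK and UTF-8 but 2 bytes in UTF-16, while the Chinese character and period take 2 bytes in GBK and UTF-16 but 3 bytes in UTF-8.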

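1.2 Representation and conversion of number bases

Number base representations:

Binary (base 2): digits 0 and 1

Octal (base 8): digits 0 1 2 3 4 5 6 7 (each octal digit corresponds to 3 binary digits; when converting, a group with fewer than 3 bits is padded with leading zeros to 3 bits)

Decimal (base 10): digits 0-9

Hexadecimal (base 16): digits 0 1 2 3 4 5 6 7 8 9 A B C D E F (each hexadecimal digit corresponds to 4 binary digits; when converting, a group with fewer than 4 bits is padded with leading zeros to 4 bits)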

Conversions between the bases:

http://jingyan.baidu.com/article/495ba84109665338b30ede98.html
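As a rough illustration of these conversions, Python's built-in `bin`, `oct`, `hex`, and `int(text, base)` can be used. The value 173 and the two helper functions below are chosen only for this sketch; they are not taken from the linked article.

```python
# Convert one decimal number to the other bases and back.
n = 173

binary_text = bin(n)   # '0b10101101'
octal_text = oct(n)    # '0o255'
hex_text = hex(n)      # '0xad'

# int(text, base) parses a string written in the given base.
from_binary = int("10101101", 2)
from_octal = int("255", 8)
from_hex = int("ad", 16)


def octal_digit_to_bits(digit: str) -> str:
    """One octal digit maps to exactly 3 binary digits (zero-padded)."""
    return format(int(digit, 8), "03b")


def hex_digit_to_bits(digit: str) -> str:
    """One hexadecimal digit maps to exactly 4 binary digits (zero-padded)."""
    return format(int(digit, 16), "04b")


print(binary_text, octal_text, hex_text)
print(from_binary, from_octal, from_hex)
print("".join(octal_digit_to_bits(d) for d in "255"))  # 010101101
print("".join(hex_digit_to_bits(d) for d in "ad"))     # 10101101
```

Converting digit by digit works because 8 = 2^3 and 16 = 2^4, which is why the padding to 3 or 4 bits matters.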

1.3 Data storage and transmission

Data storage

Hard disk manufacturers calculate capacity in decimal (10^3 = 1000, so 1 MB = 1000 KB), while the computer calculates in binary (2^10 = 1024, so 1 MB = 1024 KB).

This is why a hard drive sold as 1 TB shows up on the computer as only about 931 GB:

1 TB = 1000 GB = 1,000,000 MB = 1,000,000,000 KB = 1,000,000,000,000 B (as labeled by the hard disk manufacturer)

1 TB = 1,000,000,000,000 B / 1024 / 1024 / 1024 ≈ 931 GB (as reported by the computer)

In binary units, 1 KB = 1024 B and 1 MB = 1024 KB = 1024 x 1024 B, where 1024 = 2^10. The full scale is:

1 B (byte) = 8 bits
1 KB (kilobyte; strictly, kibibyte) = 1024 B = 2^10 B
1 MB (megabyte; strictly, mebibyte) = 1024 KB = 2^20 B
1 GB (gigabyte) = 1024 MB = 2^30 B
1 TB (terabyte) = 1024 GB = 2^40 B
1 PB (petabyte) = 1024 TB = 2^50 B
1 EB (exabyte) = 1024 PB = 2^60 B
1 ZB (zettabyte) = 1024 EB = 2^70 B
1 YB (yottabyte) = 1024 ZB = 2^80 B
1 BB (brontobyte) = 1024 YB = 2^90 B
1 NB (nonabyte) = 1024 BB = 2^100 B
1 DB (doggabyte) = 1024 NB = 2^110 B

The last three names (brontobyte, nonabyte, doggabyte) are informal and are not part of any official standard.
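The 1 TB versus 931 GB arithmetic can be reproduced with a few lines of Python. This is only a sketch of the calculation described above; the function name `advertised_tb_to_reported_gb` is hypothetical.

```python
def advertised_tb_to_reported_gb(tb: float) -> float:
    """Convert a manufacturer's decimal TB figure to binary GB (2^30 bytes each)."""
    total_bytes = tb * 1000 ** 4   # manufacturer: 1 TB = 10^12 bytes
    return total_bytes / 1024 ** 3  # computer: 1 GB = 2^30 bytes


print(round(advertised_tb_to_reported_gb(1), 2))  # 931.32
```

The gap grows with the unit: each step up the scale multiplies the difference between 1000 and 1024 once more.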

Data transmission:

Data storage is measured in bytes, whereas data transfer is mostly measured in bits. A bit represents 0 or 1 (a binary digit), every 8 bits (abbreviated b) make up one byte (abbreviated B), and the bit is the smallest unit of information.

The smallest unit transmitted in a computer is the bit. The basic unit of digital traffic is the bit and the basic unit of time is the second (s), so bit/s (bits per second) is the basic unit for describing bandwidth.

Bandwidth (bps) is the maximum number of bits that can be transferred within a fixed time (1 second).

bps (bits per second)

Example:

A broadband carrier advertises a bandwidth of 20 Mbps, while the actual maximum download speed is about 2.5 MB/s (download speed is shown in bytes per second, with a capital B).

20 Mb/s = 20 / 8 MB/s = 2.5 MB/s

1 Mb/s = 1024 Kb/s = 1024 / 8 KB/s = 128 KB/s

Your upload traffic and download traffic share your bandwidth.
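The bit-to-byte conversion in this example can be written as a small Python sketch. The function names are made up for illustration, and the factor of 1024 between Mb and Kb follows the convention used in this article.

```python
BITS_PER_BYTE = 8


def mbps_to_mb_per_s(mbps: float) -> float:
    """Megabits per second -> megabytes per second."""
    return mbps / BITS_PER_BYTE


def mbps_to_kb_per_s(mbps: float) -> float:
    """Megabits per second -> kilobytes per second (1 Mb = 1024 Kb here)."""
    return mbps * 1024 / BITS_PER_BYTE


print(mbps_to_mb_per_s(20))  # 2.5
print(mbps_to_kb_per_s(1))   # 128.0
```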

1.4 Character encoding

A character encoding is a standard that specifies which number each character corresponds to.
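In Python 3 this character-to-number correspondence can be seen with the built-in `ord` and `chr` functions. A minimal sketch follows; the variable names are chosen only for this example.

```python
# ord() maps a character to its number; chr() maps a number back to a character.
letter_code = ord("A")        # 65
chinese_code = ord("中")      # 20013, i.e. 0x4E2D

letter = chr(65)              # 'A'
chinese = chr(0x4E2D)         # '中'

print(letter_code, hex(chinese_code), letter, chinese)
```

The numbers returned here are Unicode code points; how those numbers are laid out as bytes depends on the specific encoding chosen.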

1.ASCII encoding

Since the computer was invented by Americans, only 128 characters were initially encoded into the computer: the upper- and lower-case English letters, digits, and some symbols. This code table is called ASCII (American Standard Code for Information Interchange).

Although the standard ASCII character set has a limited number of characters, the basic processing unit of a computer is the byte (1 byte = 8 bits), so an ASCII character is generally stored in a single byte. The extra bit (the highest bit) of each byte is usually kept as 0 inside the computer (in data transfer it can be used as a parity bit). Because the standard ASCII character set is often not enough in practice, more symbols were added later, such as the horizontal lines, vertical lines, and crosses needed for drawing tables, extending the code numbers up to 255. The characters from 128 to 255 are called the "extended character set".
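The role of the highest bit can be illustrated in Python. This sketch assumes code page 437 (`cp437`) as one example of an extended character set with table-drawing symbols; the article does not name a specific code page, and the helper `is_standard_ascii` is hypothetical.

```python
def is_standard_ascii(data: bytes) -> bool:
    """True if every byte has its highest bit (0x80) set to 0, i.e. value < 128."""
    return all((b & 0x80) == 0 for b in data)


plain = "Hello, 123".encode("ascii")
# Assumption: cp437 is used as the extended set; it contains box-drawing characters.
box = "─│┼".encode("cp437")

for b in plain[:3]:
    print(b, format(b, "08b"))   # highest bit is always 0
for b in box:
    print(b, format(b, "08b"))   # highest bit is 1, so the value is 128 or more

print(is_standard_ascii(plain), is_standard_ascii(box))
```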

2.GBK, GB18030, GB2312

GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area is B0-F7 for the high byte and A1-FE for the low byte, which occupies 72*94=6768 code positions. Of these, the 5 positions D7FA-D7FE are unused.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters. GB18030, published in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters, as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. PC platforms are now required to support GB18030, while embedded products are not, so mobile phones and MP3 players generally support only GB2312.

GBK is a specification for Chinese character encoding.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters.
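The compatibility chain can be observed with Python's built-in codecs. In this sketch the sample characters and the helper `encode_all` are chosen for illustration; the traditional character "國" is used as an example of a character assumed to be outside GB2312 but inside GBK.

```python
CODECS = ["ascii", "gb2312", "gbk", "gb18030"]


def encode_all(ch: str) -> dict:
    """Encode one character with each codec; None means the codec cannot represent it."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = ch.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result


for ch in ["A", "中", "國"]:
    print(repr(ch), encode_all(ch))
```

Wherever a character is supported, the bytes are the same in every later standard; a `None` entry only appears for the earlier, smaller character sets.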

3.Unicode encoding

There are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and each country had its own national standard. Conflicts were inevitable, and as a result, text that mixes several languages is displayed as garbled characters.
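The garbling described here can be reproduced by decoding bytes with the wrong national encoding. This is a small sketch; the sample text and the choice of GBK bytes read as Shift_JIS are assumptions made for the illustration.

```python
original = "中文"

# Bytes written under one national standard (GBK)...
raw = original.encode("gbk")

# ...read back under another (Shift_JIS). errors="replace" avoids an exception
# for byte sequences that are invalid in Shift_JIS.
garbled = raw.decode("shift_jis", errors="replace")

# Decoding with the matching encoding recovers the text.
restored = raw.decode("gbk")

print(repr(garbled))
print(repr(restored))
```

The bytes themselves are unchanged; only the code table used to interpret them differs, which is why the same file can look fine on one system and garbled on another.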

As a result, Unicode emerged. Unicode unifies all languages into a single set of encodings, so the garbling problem goes away. The Unicode standard is still evolving, but its most common form represents a character with two bytes (4 bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.

Unicode corresponds to the Universal Multiple-Octet Coded Character Set, abbreviated UCS.

With Unicode, the era in which one Chinese character counted as two English characters is almost over.

Whether it is a half-width letter or a full-width Chinese character, each is uniformly "one character", and in the two-byte form each is uniformly "two bytes".

Note that the terms "character" and "byte" are different: a byte is an 8-bit physical storage unit, while a character is a culture-related symbol. In two-byte Unicode, a character is two bytes.

We already know that one byte is enough for an English letter. If every character took two bytes under a uniform Unicode rule, storage would be wasted badly: an English text file would be twice as large, which is unacceptable.
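The difference between counting characters and counting bytes, and the doubling for English text, can be checked in Python 3. This sketch assumes `utf-16-le` as the two-byte Unicode form; the sample strings are arbitrary.

```python
mixed = "A中"
print(len(mixed))                       # 2 characters
print(len(mixed.encode("utf-16-le")))   # 4 bytes: two bytes per character

english = "hello world"
ascii_size = len(english.encode("ascii"))        # 11 bytes
utf16_size = len(english.encode("utf-16-le"))    # 22 bytes, twice as large
print(ascii_size, utf16_size)

# A very rare character outside the two-byte range needs 4 bytes.
rare = "\U00020000"
print(len(rare), len(rare.encode("utf-16-le")))  # 1 character, 4 bytes
```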

4.UTF-8

Unicode was not widely adopted for a long time. With the rise of the Internet, a number of UTF (UCS Transformation Format) standards appeared to solve the problem of transmitting Unicode over the network. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits data 16 bits at a time. UTF-8 is the most widely used Unicode implementation on the Internet. It is designed for transmission and knows no borders, so it can display characters from all cultures around the world.

One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the byte length depending on the symbol. A character in the ASCII range is represented by one byte, so the one-byte ASCII encoding is preserved as a subset of UTF-8. Note that a Chinese character takes 2 bytes in two-byte Unicode but 3 bytes in UTF-8. Converting from Unicode to UTF-8 is not a direct one-to-one mapping; it follows a set of algorithms and rules.
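The conversion rules from a Unicode code point to UTF-8 bytes can be sketched by hand and compared with Python's built-in encoder. The function `utf8_encode_code_point` below is a hypothetical illustration of the standard UTF-8 bit layout, not code from the original article.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the standard bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:          # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([         # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in ["A", "é", "中", "\U00020000"]:
    manual = utf8_encode_code_point(ord(ch))
    builtin = ch.encode("utf-8")
    print(repr(ch), len(manual), manual == builtin)
```

In real programs `str.encode("utf-8")` and `bytes.decode("utf-8")` should be used; the manual version only shows why an ASCII character stays at one byte while a common Chinese character needs three.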

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
