"Python3 character encoding"

Source: Internet
Author: User

One. Character sets and character encodings

1. Definition

Everything a computer stores is represented as binary numbers. The characters we see on screen, such as English letters and Chinese characters, are the result of converting those binary numbers. Put simply, the rule that determines which binary number is stored for a character such as 'a' is called "encoding"; conversely, turning the stored binary number back into a displayable character is called "decoding". The pair is analogous to encryption and decryption in cryptography. If the wrong rule is used when decoding, 'a' may be read back as 'b', or the text comes out garbled.
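As a minimal sketch of this idea in Python 3, the snippet below encodes one character and then decodes the stored bytes twice, once with the matching rule and once with a different one. The choice of Latin-1 as the "wrong" rule is only an illustrative assumption.

```python
# Encode one character to bytes, then decode with the right and the wrong rule.
text = "中"                         # a single Chinese character
stored = text.encode("utf-8")       # encoding: character -> bytes
right = stored.decode("utf-8")      # decoding with the same rule
wrong = stored.decode("latin-1")    # decoding with a different rule

print(stored)       # b'\xe4\xb8\xad'
print(right)        # 中
print(repr(wrong))  # three unrelated characters instead of one: garbled text
```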

Character: a unit of information. In a computer, a Chinese character is a character, an English letter is a character, an Arabic numeral is a character, and a punctuation mark is also a character.

Character set (charset): the collection of all abstract characters that a system supports. It is usually pictured as a two-dimensional table whose content and size depend on the language it serves, such as English, Chinese, or Arabic.

Character encoding: a set of rules that pairs the characters of a natural language (an alphabet or a syllabary, for example) with something else, such as numbers or electrical pulses. Here it means mapping each character in a character set to a specific binary number so that it can be stored in a computer. In terms of the two-dimensional table, the encoding is typically the algorithm that converts a character's row and column coordinates into a number. Establishing this correspondence between a set of symbols and a number system is a basic technique of information processing. In short: character --------(translation process)-------> binary number

2. Commonly used character sets and character encodings

A character set and its character encoding usually come as a pair. For ASCII and GBK, a single name covers both the character set and its encoding. Unicode is a character set, and UTF-8 is one of the encodings defined for it. For brevity, all of them are referred to as "encodings" below.

3. History of character encoding

Stage one: the origin, ASCII

Computers were invented in the United States, where people use American English, which needs relatively few characters. The first table designed was therefore small: 128 characters, named ASCII (American Standard Code for Information Interchange). A 7-bit coded character set can represent only 128 characters, so ASCII was extended to cover the characters commonly used in European languages. The extended ASCII character sets use 8 bits (one byte) per character, for a total of 256 characters, which is the most that a single byte can represent.
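The limits described above can be checked from Python 3. This is a small sketch, and it assumes Latin-1 (ISO 8859-1) as the 8-bit extension, which is only one of several extended ASCII character sets.

```python
# 'a' fits in 7 bits, so plain ASCII can encode it.
print(ord("a"))                  # 97
print(format(ord("a"), "08b"))   # 01100001
print("a".encode("ascii"))       # b'a'

# An accented European letter is outside the 128 ASCII characters.
accented = "\u00e9"              # é
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An 8-bit extension (Latin-1 is assumed here) stores it in one byte.
latin1_bytes = accented.encode("latin-1")
print(ascii_ok, latin1_bytes)    # False b'\xe9'
```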

Stage two: GBK

When computers spread to Asia, and to East Asia in particular, this standard broke down: 256 code points cannot cover even the characters of everyday speech. China therefore defined GBK, which represents a Chinese character with 2 bytes (ASCII characters still take 1 byte). Other countries defined their own encodings as well, for example:

Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR.
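As a rough illustration, Python 3 ships codecs for all three national encodings. The sample characters below are arbitrary choices, and the sketch only checks byte lengths and what happens when bytes written under one national encoding are read under another.

```python
# One character from each language, encoded with its national encoding.
chinese = "中".encode("gbk")
japanese = "あ".encode("shift_jis")
korean = "한".encode("euc_kr")
print(len(chinese), len(japanese), len(korean))  # 2 2 2

# ASCII characters still take a single byte in GBK.
ascii_in_gbk = "a".encode("gbk")
print(len(ascii_in_gbk))                         # 1

# Reading GBK bytes as Shift_JIS does not give back the original character.
misread = chinese.decode("shift_jis", errors="replace")
print(misread != "中")                           # True
```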

Stage three: Unicode

When the Internet swept the world and broke down geographic barriers, computers in different countries and regions began exchanging data, and text came out garbled, much as geographic isolation leaves people unable to understand one another's languages. Unicode (the "universal code") was created to solve this problem. The Unicode system is designed to represent any character of any language.

Unicode was originally designed to represent every character and symbol with 16 bits (2 bytes), which gives 2 ** 16 = 65536 possible values. Note that 2 bytes is a minimum in this scheme: characters outside that range need more. Strictly speaking, Unicode assigns each character a number called a code point; how that number is laid out in bytes is decided by an encoding form such as UTF-16 or UTF-8.
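A short sketch of code points in Python 3, using the built-in ord and chr. The sample characters are arbitrary; the emoji is included only to show a character that does not fit in 16 bits.

```python
# Each character has one Unicode code point, whatever its language.
samples = ["a", "\u00e9", "中", "\U0001F600"]   # a, é, 中, and an emoji
code_points = {ch: ord(ch) for ch in samples}
for ch, cp in code_points.items():
    print(ch, f"U+{cp:04X}")

# chr() is the inverse of ord().
back = chr(0x4E2D)
print(back)  # 中

# Some characters lie beyond the 16-bit range and need 4 bytes in UTF-16.
emoji = "\U0001F600"
beyond_16_bits = ord(emoji) > 0xFFFF
utf16_len = len(emoji.encode("utf-16-le"))
print(beyond_16_bits, utf16_len)  # True 4
```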

Stage four: UTF-8

Unicode covers the characters of every language, but storing each character in at least 2 bytes wastes space for text that is mostly English. UTF-8 was designed as a compact encoding of Unicode that uses as few bytes as each character needs. It no longer uses a minimum of 2 bytes; instead, characters are grouped by range: ASCII characters are stored in 1 byte, most European characters in 2 bytes, and most East Asian characters in 3 bytes. Characters outside the first 65536 code points take 4 bytes.
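The byte counts above can be confirmed with a few lines of Python 3. The four sample characters are illustrative picks, one for each length.

```python
# Number of bytes UTF-8 uses for characters from different ranges.
samples = ["a", "\u00e9", "中", "\U0001F600"]   # a, é, 中, and an emoji
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(ch, n)   # prints 1, 2, 3 and 4 bytes respectively

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")
print(same_as_ascii)  # True
```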

Supplement:

Unicode (stored at a fixed width per character): covers every character. The advantage is fast conversion between characters and numbers; the disadvantage is that it takes up a lot of space.
UTF-8: precise, with different characters taking different lengths. The advantage is that it saves space; the disadvantage is that conversion between characters and numbers is slower, because each time the decoder must first work out how many bytes the character occupies.
In memory, text is held as Unicode, trading space for time so that processing is fast. A program has to be loaded into memory to run, so memory access should be as fast as possible.
On disk and over the network, text is stored as UTF-8. Network and disk I/O latency is far greater than the cost of converting to and from UTF-8, and I/O should save as much bandwidth as possible.
Data transmission values stability and efficiency, and the less data there is to send, the more reliable the transfer, so data is converted to UTF-8 rather than sent as fixed-width Unicode.
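The trade-off can be sketched as follows. UTF-32 is used here only as a stand-in for a fixed-width layout; that is an assumption for illustration, since CPython's actual in-memory string representation chooses 1, 2, or 4 bytes per character depending on the string's contents.

```python
text = "hello, 世界"                  # 9 characters

fixed = text.encode("utf-32-le")     # fixed width: 4 bytes per character
compact = text.encode("utf-8")       # variable width: 1 to 4 bytes each
print(len(fixed), len(compact))      # 36 13

# Fixed width: the n-th character sits at a known byte offset.
n = 7
nth = fixed[4 * n:4 * n + 4].decode("utf-32-le")
print(nth)                           # 世

# Variable width: there is no such shortcut. Byte 7 of the UTF-8 data is
# only the first of the three bytes that make up the same character.
first_byte_only = compact[7:8]
print(first_byte_only)               # b'\xe4'
```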


4. Use of character encoding

1) How a text editor reads and saves files (Notepad++, PyCharm, Word)

Opening an editor starts a process, and a process lives in memory. Whatever you type into the editor is therefore also held in memory and is lost if the power goes out. To keep it, you click Save, which flushes the data from memory to the hard disk. At this point, writing a py file (without running it) is no different from writing any other file: it is just a sequence of characters.

Whatever the editor, the core rule for avoiding garbled files is: open a file with the same encoding it was saved in.
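The rule can be tried out with Python's built-in open. In this sketch the file name demo.txt and the temporary directory are hypothetical; the file is saved as GBK, then read back once with the same encoding and once with a mismatched one.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name

    # "Save" the text to disk as GBK.
    with open(path, "w", encoding="gbk") as f:
        f.write("你好")

    # Opening with the same encoding returns the original text.
    with open(path, "r", encoding="gbk") as f:
        same = f.read()

    # Opening with a different encoding fails (or, for some byte
    # sequences and codecs, silently yields garbled text instead).
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        mismatch_failed = False
    except UnicodeDecodeError:
        mismatch_failed = True

print(same, mismatch_failed)   # 你好 True
```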

2) How the Python interpreter executes a py file (python test.py)

Stage one: The Python interpreter starts, which is comparable to launching a text editor.

Stage two: Acting like a text editor, the Python interpreter opens test.py and reads its contents from the hard disk into memory.

Stage three: The Python interpreter interprets and executes the test.py code that was just loaded into memory.
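Stages two and three can be imitated by hand. This is a simplified sketch, not how the interpreter is actually implemented: it writes a hypothetical one-line test.py, reads the raw bytes back, decodes them as UTF-8, and only then compiles and runs the resulting text.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.py")    # hypothetical script
    with open(path, "w", encoding="utf-8") as f:
        f.write('greeting = "你好"\n')

    # Stage two: read the bytes from disk and decode them into text.
    with open(path, "rb") as f:
        raw = f.read()
    source_text = raw.decode("utf-8")

    # Stage three: compile and execute the decoded source.
    namespace = {}
    exec(compile(source_text, path, "exec"), namespace)

print(type(raw).__name__, type(source_text).__name__)  # bytes str
print(namespace["greeting"])                           # 你好
```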

  

Supplement:

Therefore, to avoid garbled text when writing code, UTF-8 is recommended, and the header # -*- coding: utf-8 -*- is added to the file.

For example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print("Hello, World")

The Python interpreter reads the second line of test.py, # -*- coding: utf-8 -*-, to decide which encoding to use when reading the file into memory. In other words, this line sets the encoding that the interpreter uses to decode the source file.

If the # -*- coding: utf-8 -*- header is not specified in a Python file, the default is used: ASCII in Python2 and UTF-8 in Python3.
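The standard library exposes this header detection as tokenize.detect_encoding. The sketch below, for Python 3, feeds it two in-memory sources, one with a header and one without; the gbk header is chosen only so that the result differs visibly from the default.

```python
import io
import tokenize

# A source file whose second line declares its encoding.
with_header = (
    b"#!/usr/bin/env python\n"
    b"# -*- coding: gbk -*-\n"
    b'print("Hello, World")\n'
)
declared, _ = tokenize.detect_encoding(io.BytesIO(with_header).readline)
print(declared)   # gbk

# A source file with no header falls back to the Python 3 default.
no_header = b'print("Hello, World")\n'
fallback, _ = tokenize.detect_encoding(io.BytesIO(no_header).readline)
print(fallback)   # utf-8
```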

Summary:

1) The Python interpreter interprets and executes the contents of a file, so it must be able to read py files, just as a text editor does.

2) Unlike a text editor, the Python interpreter not only reads the contents of the file but also executes them.

5. Some differences between Python2 and Python3

1) The default source encoding is ASCII in Python2 and UTF-8 in Python3.

2) In Python2, str holds the encoded result, that is, bytes (str = bytes), so a str s should only be decoded.

3) A string in Python3 and a u'string' in Python2 are both Unicode and can only be encoded. Printing them does not produce garbled text, because printing can be understood as going from memory to memory: Unicode -> Unicode.

4) In Python3, str is Unicode. When the program runs, a str is stored in newly allocated memory in Unicode form even without the u prefix, and it can be encoded directly into any encoding, for example s.encode('utf-8') or s.encode('gbk').

# unicode (str) -----encode----> utf-8 (bytes)
# utf-8 (bytes) -----decode----> unicode (str)
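The two arrows correspond directly to str.encode and bytes.decode in Python 3. A minimal round trip, with an arbitrary sample string:

```python
s = "你好"                          # str: Unicode text in memory

utf8_bytes = s.encode("utf-8")      # str -> bytes
gbk_bytes = s.encode("gbk")         # the same text in another encoding
print(len(utf8_bytes), len(gbk_bytes))   # 6 4

round_trip = utf8_bytes.decode("utf-8")  # bytes -> str
print(round_trip == s)                   # True

# In Python 3, str can only be encoded and bytes can only be decoded.
str_has_decode = hasattr(s, "decode")
bytes_has_encode = hasattr(utf8_bytes, "encode")
print(str_has_decode, bytes_has_encode)  # False False
```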

5) The terminal encoding is typically GBK on Windows (on Chinese-language systems) and UTF-8 on Linux.
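Rather than assuming a terminal encoding, a script can ask at run time. The sketch below uses only standard-library calls; apart from the first value, the results depend on the operating system and its locale settings, so no fixed output is shown for them.

```python
import locale
import sys

# Codec used for implicit str <-> bytes conversions: always UTF-8 in Python 3.
default_codec = sys.getdefaultencoding()

# Encoding of the standard output stream (depends on the terminal).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding that the operating system's locale prefers for text files.
preferred = locale.getpreferredencoding(False)

print(default_codec)     # utf-8
print(stdout_encoding)
print(preferred)
```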
