Chapter 7: Python character encoding

Source: Internet
Author: User

I. Definitions

All information in a computer is stored as binary numbers; the characters we see on screen, such as English letters and Chinese characters, are the result of converting those binary numbers. Put simply, the rule that determines how a character such as 'a' is stored in the computer is called "encoding"; conversely, parsing the stored binary numbers back into characters for display is called "decoding". The two are analogous to encryption and decryption. If the wrong rule is used when decoding, 'a' may be parsed as 'b', or the text may come out garbled.

    • Character: a unit of information. In a computer, a Chinese character is a character, an English letter is a character, an Arabic numeral is a character, and a punctuation mark is also a character.
    • Character set (charset): the collection of all abstract characters that a system supports. It is usually laid out as a two-dimensional table whose content and size depend on the language it serves, which may be English, Chinese, or Arabic.
    • Character encoding: a set of rules that pairs each character of a natural-language set (such as an alphabet or a syllabary) with something else (such as a number or an electrical pulse). Here it means mapping each character in the character set to a specific binary number so that it can be stored in the computer. The encoding is generally an algorithm that converts a character's row and column position in the two-dimensional table into a number. Establishing this correspondence between a set of symbols and a number system is a basic technique of information processing. In short: character --(translation process)--> binary number
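The three definitions above can be tried out in Python 3, where the built-in character set is Unicode. This is only a sketch: the sample characters are arbitrary, and UTF-8 is used as one example of an encoding.

```python
# ord() looks up a character's number in the character set;
# chr() maps the number back to the character.
for ch in ["a", "7", "?", "中"]:
    number = ord(ch)
    print(ascii(ch), number, bin(number))
    assert chr(number) == ch

# A character encoding then turns that number into concrete bytes.
code_point = ord("中")               # position in the character set
utf8_bytes = "中".encode("utf-8")    # bytes produced by one particular encoding
print(code_point, utf8_bytes)        # 20013 b'\xe4\xb8\xad'
```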
II. Commonly used character sets and character encodings

Character sets and character encodings generally appear in pairs. Names such as ASCII, GBK, Unicode, and UTF-8 are commonly used for both a character set and its corresponding character encoding; below they are referred to simply as "encodings".

III. The history of character encoding
    • First stage: the origin, ASCII

Computers were invented in the United States, where people use American English, which needs relatively few characters. The first table designed was therefore small, with 128 characters, and was named ASCII (American Standard Code for Information Interchange). A 7-bit coded character set can hold only 128 characters, so to represent more of the characters commonly used in European languages, ASCII was extended: the extended ASCII character sets use 8 bits to represent one character, for a total of 256 characters. In other words, a character takes at most 8 bits (one byte).
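The 128-character limit is easy to observe in Python 3. In this sketch, Latin-1 stands in for "an 8-bit extension of ASCII"; that choice is an assumption, since several different extended sets exist.

```python
# Plain ASCII covers code points 0..127, so 7 bits are enough.
ascii_bytes = "Hello".encode("ascii")
print(len(ascii_bytes))   # 5: one byte per character

# A European letter such as e-acute (U+00E9) is outside the 128-character table.
e_acute = "\u00e9"
try:
    e_acute.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# An 8-bit extended set (Latin-1 here) stores it in a single byte.
latin1_bytes = e_acute.encode("latin-1")
print(fits_in_ascii, latin1_bytes)   # False b'\xe9'
```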

    • Second stage: GBK

When computers reached Asia, and East Asia in particular, the existing standard broke down: even a child in the street uses more than 256 different characters, so 256 code values were nowhere near enough. China therefore defined GBK, which represents one Chinese character with 2 bytes. Other countries also defined their own encodings, for example:

Japan put Japanese into Shift_JIS, and Korea put Korean into EUC-KR.
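Python ships codecs for all three national encodings, so the 2-byte claim can be checked directly. This is a small sketch; the sample characters are arbitrary choices.

```python
samples = {
    "gbk": "中",        # Chinese
    "shift_jis": "日",  # Japanese
    "euc_kr": "한",     # Korean
}

sizes = {}
for codec, ch in samples.items():
    encoded = ch.encode(codec)
    sizes[codec] = len(encoded)
    print(codec, encoded, len(encoded))

# ASCII characters still take a single byte in GBK.
ascii_in_gbk = "a".encode("gbk")

# The same bytes mean something else under another country's encoding.
gbk_bytes = "中".encode("gbk")
as_japanese = gbk_bytes.decode("shift_jis", errors="replace")
print(ascii(as_japanese))
```

The last two lines hint at why these per-country encodings caused trouble once data started crossing borders.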

    • Third stage: Unicode

When the internet swept the world and geographical barriers fell, computers in different countries and regions began exchanging data, and text became garbled in transit, much as languages isolated by geography are mutually unintelligible. To solve this problem a great idea was born: Unicode (the universal code). The Unicode system is designed to express any character of any language.

Unicode assigns every character and symbol a number (a code point). In the fixed-width form described here, each character is stored in at least 16 bits (2 bytes), which gives 2 ** 16 = 65536 possible values. Note that 2 bytes is a minimum: some characters need more.
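The "at least 2 bytes, possibly more" rule can be seen with Python's UTF-16 codec, used here as one concrete 16-bit encoding of Unicode. That choice is an assumption, since the text does not name a specific encoding.

```python
two_byte_limit = 2 ** 16
print(two_byte_limit)   # 65536

# Both an English letter and a Chinese character take 2 bytes.
for ch in ["a", "中"]:
    print(ascii(ch), hex(ord(ch)), len(ch.encode("utf-16-le")))

# A character whose code point is above 65535 needs more than 2 bytes.
emoji = "\U0001F600"
emoji_size = len(emoji.encode("utf-16-le"))
print(hex(ord(emoji)), emoji_size)   # 0x1f600 4
```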

    • Fourth stage: UTF-8

Unicode covers the characters of every nation, but storing it this way wastes a great deal of space on characters such as English letters. UTF-8 was introduced as a compressed, optimized encoding of Unicode that represents each character in as few bytes as possible. It no longer uses a minimum of 2 bytes; instead, characters and symbols are divided into categories: ASCII characters are stored in 1 byte, European characters in 2 bytes, and East Asian characters in 3 bytes.
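The byte counts above can be verified with a short Python 3 sketch. The helper name utf8_len and the sample characters are illustrative.

```python
def utf8_len(ch):
    """Return how many bytes UTF-8 needs for the given text."""
    return len(ch.encode("utf-8"))

widths = {
    "ASCII letter": utf8_len("a"),
    "European letter": utf8_len("\u00e9"),   # e-acute
    "East Asian character": utf8_len("中"),
}
print(widths)

# English text costs nothing extra compared with ASCII.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
```

Because ASCII text encodes to identical bytes in UTF-8, existing English-only files are already valid UTF-8.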

Supplement:

    • Unicode: all-inclusive. The advantage is fast conversion between characters and numbers; the disadvantage is that it takes up a lot of space.

    • UTF-8: precise, using different lengths for different characters. The advantage is that it saves space; the disadvantage is slower conversion between characters and numbers, because each time the number of bytes needed to represent the character must be worked out.

The encoding used in memory is Unicode, trading space for time: a program must be loaded into memory to run, so memory access should be as fast as possible.

On the hard disk and in network transmission, UTF-8 is used. Network or disk I/O latency is far greater than the cost of UTF-8 conversion, and I/O should save as much bandwidth as possible to keep data transmission stable. Transmission values stability and efficiency, and the less data is sent, the more reliable the transfer, so data is converted to UTF-8 rather than kept as Unicode.
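The trade-off between memory and storage can be sketched in Python 3, where a str in memory holds Unicode code points and bytes are produced only at the I/O boundary. The file name note.txt and the sample text are assumptions for illustration.

```python
import os
import tempfile

text = "hello world"   # in memory: a str of Unicode code points

# Size of the same English text under different encodings.
sizes = {codec: len(text.encode(codec))
         for codec in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)   # {'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}

# Writing: str in memory -> UTF-8 bytes on disk.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("你好, world")

# The raw bytes stored on disk.
with open(path, "rb") as f:
    on_disk = f.read()

# Reading: UTF-8 bytes on disk -> str in memory.
with open(path, encoding="utf-8") as f:
    in_memory = f.read()

print(len(on_disk), len(in_memory))   # 13 9
```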

IV. Using character encodings
  • 1) How a text editor reads and writes files (Notepad++, PyCharm, Word)

    • Opening an editor starts a process, which lives in memory, so whatever is typed into the editor is also held in memory and is lost if the power fails. To keep it, click the Save button to flush the data from memory to the hard disk. At this point, writing a .py file (without executing it) is no different from writing any other file: it is just a sequence of characters.

    • Whatever the editor, the core rule for avoiding garbled files is: open a file with the same encoding it was saved in.
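The core rule can be demonstrated with Python's open(), standing in for an editor's Save and Open actions. This is a sketch; the file name and the GBK/UTF-8 pairing are assumptions chosen to trigger the mismatch.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "saved_as_gbk.txt")

# "Save" the file in GBK.
with open(path, "w", encoding="gbk") as f:
    f.write("你好")

# "Open" it with a different encoding: decoding fails.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    wrong_codec_result = "decoded"
except UnicodeDecodeError:
    wrong_codec_result = "UnicodeDecodeError"

# "Open" it with the encoding it was saved in: the text comes back intact.
with open(path, encoding="gbk") as f:
    right_codec_result = f.read()

print(wrong_codec_result, right_codec_result == "你好")
```

Depending on the bytes involved, a mismatched codec either raises an error, as here, or silently produces garbled characters.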

  • 2) How the Python interpreter executes a .py file (python test.py)

    • First stage: the Python interpreter starts, which is comparable to launching a text editor.

    • Second stage: the Python interpreter acts like a text editor, opening test.py and reading its contents from the hard disk into memory.

    • Third stage: the Python interpreter interprets and executes the test.py code that was just loaded into memory.
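The second and third stages can be imitated by hand in Python 3. This is a simplified sketch of the idea, not how the interpreter is actually implemented; the file contents and the UTF-8 assumption are illustrative.

```python
import os
import tempfile

# Prepare a small test.py on disk (plain characters, nothing executed yet).
source_text = 'message = "Hello, World"\nprint(message)\n'
path = os.path.join(tempfile.mkdtemp(), "test.py")
with open(path, "w", encoding="utf-8") as f:
    f.write(source_text)

# Second stage: read the bytes from disk and decode them into memory,
# exactly as a text editor would.
with open(path, "rb") as f:
    raw = f.read()
loaded = raw.decode("utf-8")

# Third stage: interpret and execute the loaded text as Python code.
namespace = {}
code = compile(loaded, path, "exec")
exec(code, namespace)
```

Only the third stage distinguishes the interpreter from an editor; if the decode in the second stage uses the wrong encoding, execution never gets that far.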

Supplement

Therefore, to avoid garbled text when writing code, it is recommended to use UTF-8 and to add the following line at the top of the file:

    # -*- coding: utf-8 -*-

That is:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    print("Hello, World")

The Python interpreter reads the second line of test.py, # -*- coding: utf-8 -*-, to decide which encoding to use when reading the file into memory. This line tells the interpreter which encoding the source code is written in; it must appear on the first or second line of the file.

If the Python file has no # -*- coding: utf-8 -*- header, the default is used: Python 2 defaults to ASCII, and Python 3 defaults to UTF-8.
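Python 3's standard library exposes this header lookup as tokenize.detect_encoding, so the behaviour can be inspected directly. In this sketch the helper name declared_encoding and the two sample sources are hypothetical.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# Header on the second line, after the shebang.
with_header = b"#!/usr/bin/env python\n# -*- coding: gbk -*-\nprint('hi')\n"

# No header at all: the Python 3 default applies.
without_header = b"print('hi')\n"

print(declared_encoding(with_header))      # gbk
print(declared_encoding(without_header))   # utf-8
```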

Summary:

1) The Python interpreter interprets and executes the contents of a file, so, like a text editor, it must be able to read .py files.

2) Unlike a text editor, the Python interpreter can not only read the contents of a file but also execute them.
