Baptism of the soul, practicing Python (3): encoding problems, operating principles, and syntax habits revealed by a simple print statement


With the preparation done, you can open an IDE or editor. You can use IDLE, which ships with Python, or a third-party tool. Here I use PyCharm, an IDE dedicated to Python.

By convention, the first program in every programming language prints "hello, world", right? No, not this time.

 

The first line is the encoding declaration. UTF-8 is an international standard. If I leave the declaration out, Python 2 assumes the source file is ASCII, so any non-ASCII character in the file causes an error.

The print statement on the second line is a keyword statement in Python 2. It can print a character, a number, or anything else you want displayed on the screen.

The C:\python... path that follows is the location where Python is installed. It indirectly shows that I am running this code with Python 2.

1 is the result of running the code: print writes 1 to the screen.
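The script being described is not reproduced in the text. A minimal sketch of what it probably looked like, assuming it consisted of just the encoding declaration and a single print, is given here in Python 3 syntax; under Python 2 the second line could also be written as the statement `print 1`.

```python
# -*- coding: utf-8 -*-
print(1)  # writes 1 to the screen
```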

 

This raises several questions:

Question 1. What is encoding? Why is encoding needed?

A: UTF-8 is an international standard. Put simply, computers store all data as 0s and 1s, so an encoding is needed to turn those 0s and 1s into something humans can read. Going the other way, the characters humans write must be encoded before a computer can store and process them. The same applies to source code. Code in a high-level language (one that is close to human language, as opposed to machine language, which is made of 0s and 1s) must be translated by an interpreter or compiler into instructions the machine can recognize.
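The round trip between a character and its 0/1 form can be sketched in a few lines. This is only an illustration using Python 3 built-ins; the variable names are made up for the example.

```python
text = "A"

code_point = ord(text)          # the number assigned to the character
raw = text.encode("ascii")      # the character as stored bytes
bits = format(raw[0], "08b")    # the same byte written as 0s and 1s
restored = raw.decode("ascii")  # bytes turned back into a readable character

print(code_point, raw, bits, restored)
```

Encoding goes from the character to the bytes, and decoding goes from the bytes back to the character.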

The earliest character encoding was ASCII.

ASCII: designed for American English. It covers only digits, English letters, punctuation, and control characters. ASCII is an ANSI standard and contains 128 characters (7 bits). What Windows calls "ANSI encoding" is usually an extended ASCII code page (ANSI is the default encoding on Windows): it extends ASCII to 8 bits and adds 128 more characters in the range 0x80-0xFF. On CJK (Chinese, Japanese, Korean) systems, ANSI usually refers to a multi-byte code page instead. So "ANSI encoding" is really a family of ASCII-compatible, non-Unicode character sets with no single international standard. Standardizing them is impossible, because the extended parts of the different code pages overlap.

EASCII: extended ASCII, introduced because German and other European languages use letters derived from Latin that plain ASCII does not contain.
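Both points can be checked quickly in Python 3. The sketch below assumes two Windows code pages, cp1252 (Western European) and cp1251 (Cyrillic), chosen here only as examples of extended 8-bit sets; they are not named in the text above.

```python
import unicodedata

# ASCII covers only code points 0-127, so an accented letter cannot be encoded.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same byte 0xE9 means different things in different 8-bit code pages.
byte = bytes([0xE9])
as_western = byte.decode("cp1252")
as_cyrillic = byte.decode("cp1251")

print(ascii_ok)
print(unicodedata.name(as_western))
print(unicodedata.name(as_cyrillic))
```

The overlap in the 0x80-0xFF range is why a file written with one code page looks garbled when opened with another.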

However, these are not enough for other languages in the world. Chinese, Japanese, and Korean have far too many characters for a single byte, so they need multi-byte encodings.
GBK series: to solve the Chinese encoding problem, China defined the GB family of character sets, which are compatible with ASCII. Note that different character sets are not compatible with one another. GBK covers both traditional and simplified Chinese characters. Traditional characters are rarely used in mainland China, so the earlier GB2312 set contains only simplified characters. Note that a Chinese character in GBK takes two bytes.
Although GBK solves the Chinese encoding problem, if China uses its own character set and Japan and South Korea each use theirs, then whenever the receiving computer lacks the sender's character set, the decoded data comes out wrong or garbled. Can we have one character set for the whole world? Unicode came into being for exactly this purpose: it is the grand unification of character sets.
Unicode: this character set can hold all the characters in the world. In its simplest fixed-width form, each character takes four bytes. The drawback is obvious: an English document stored this way is four times the size of the same document in ASCII, so transmission is inefficient. UTF-8 was created to solve this problem. It is one way of implementing Unicode.
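The two problems described above, garbled text from a mismatched character set and the fourfold size of fixed-width Unicode, can be reproduced in Python 3. This is a sketch: latin-1 stands in for "some other character set", and utf-32-be stands in for the four-byte fixed-width form; both choices are assumptions made for the example.

```python
chinese = "中文"
gbk_bytes = chinese.encode("gbk")        # two bytes per Chinese character
wrong = gbk_bytes.decode("latin-1")      # decoded with the wrong character set
right = gbk_bytes.decode("gbk")          # decoded with the matching character set

english = "hello world"
ascii_size = len(english.encode("ascii"))
fixed_width_size = len(english.encode("utf-32-be"))  # 4 bytes per character
utf8_size = len(english.encode("utf-8"))

print(len(gbk_bytes), wrong == chinese, right == chinese)
print(ascii_size, fixed_width_size, utf8_size)
```

For pure English text UTF-8 costs exactly as much as ASCII, which is the efficiency gain the paragraph refers to.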

Bytes         Unicode range (hex)      UTF-8 encoding (binary)
Single byte:  0000 0000-0000 007F      0xxxxxxx
Two bytes:    0000 0080-0000 07FF      110xxxxx 10xxxxxx
Three bytes:  0000 0800-0000 FFFF      1110xxxx 10xxxxxx 10xxxxxx
Four bytes:   0001 0000-0010 FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
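The table can be turned into code almost row by row. The function below is a hand-written sketch for illustration only (the name utf8_encode is hypothetical, and it skips validation such as rejecting surrogates or out-of-range values); in practice you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point):
    """Encode one Unicode code point following the four rows of the UTF-8 table."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written encoder with Python's built-in one.
for ch in ["A", "\u00e9", "中", "\U0001F600"]:
    print(hex(ord(ch)), utf8_encode(ord(ch)), utf8_encode(ord(ch)) == ch.encode("utf-8"))
```

The x positions in the table are filled with the bits of the code point, six at a time from the right, which is what the shifts and the 0x3F mask do.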

The Unicode standard has three concrete encoding forms: UTF-8, UTF-16, and UTF-32.
UTF-32 represents every character in 4 bytes.
UTF-8 is variable-length: a character occupies 1 to 4 bytes. It is compatible with ASCII, and common Chinese characters occupy 3 bytes. UTF-8 is therefore the most flexible of the three: it uses 4 bytes only when it has to, which saves a lot of space.
UTF-16 represents a character in two bytes, or in four bytes for characters outside the Basic Multilingual Plane.
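The byte counts for the three forms can be measured directly. The sketch below uses three sample characters picked for illustration, and the "-be" codec variants so that no byte order mark is added to the counts.

```python
samples = {
    "ASCII letter": "A",
    "Chinese character": "中",
    "emoji": "\U0001F600",   # a character outside the Basic Multilingual Plane
}
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Number of bytes each sample needs in each encoding form.
sizes = {
    label: {enc: len(ch.encode(enc)) for enc in encodings}
    for label, ch in samples.items()
}

for label, row in sizes.items():
    print(label, row)
```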

An additional benefit of UTF-8 is that ASCII can be seen as a subset of it: an English letter is one byte in ASCII and the same single byte in UTF-8. A large amount of legacy software that only supports ASCII can therefore keep working with UTF-8 text, as long as the text contains only ASCII characters. ASCII itself does not support Chinese characters.

For example, take the Chinese character "中" ("middle"):

In Python 2, the u prefix on a string literal marks it as a Unicode string.

From top to bottom, the outputs are the GBK, Unicode, and UTF-8 forms. Note that bytes cannot be converted directly from GBK to UTF-8; the conversion has to go through Unicode. Calling encode("UTF-8") on a Unicode string gives UTF-8 bytes, and calling decode("UTF-8") on UTF-8 bytes gives back the Unicode string.

encode converts a Unicode string into bytes in a specific encoding such as UTF-8; decode converts such bytes back into Unicode. The Windows kernel uses Unicode internally.
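The conversion path GBK -> Unicode -> UTF-8 looks like this in Python 3, where every str is already a Unicode string (in Python 2 the literal would be written u"中"). This is a sketch for illustration, not the original session.

```python
text = "中"                          # Unicode string
gbk_bytes = text.encode("gbk")       # the character in GBK
utf8_bytes = text.encode("utf-8")    # the character in UTF-8

# GBK bytes cannot become UTF-8 bytes in one step: decode to Unicode first.
converted = gbk_bytes.decode("gbk").encode("utf-8")

print(gbk_bytes, hex(ord(text)), utf8_bytes)
print(converted == utf8_bytes)
```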

During print, Python automatically encodes a Unicode string to sys.stdout.encoding, the default encoding of the output stream.
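What print does with a Unicode string can be imitated by hand. The sketch below is only an approximation of that step: the fallback to UTF-8 and the errors="replace" policy are assumptions made so the example runs on any terminal.

```python
import sys

text = "中"

# The encoding of the output stream; fall back to UTF-8 if the stream reports none.
target = getattr(sys.stdout, "encoding", None) or "utf-8"

# errors="replace" avoids an exception if the terminal cannot represent the character.
encoded = text.encode(target, errors="replace")

print(target, len(encoded))
```

The byte count differs from terminal to terminal, which is why the same script can print correctly on one machine and show garbled text on another.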

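Note: in Python 2, calling encode on a byte string that is already encoded raises an error if it contains non-ASCII bytes, because Python first tries to decode it implicitly with the default ASCII codec.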
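The failure can be reproduced in Python 3 by making the hidden step explicit. In Python 3 bytes objects have no encode method at all, so the sketch below performs the ASCII decode by hand to show where the error comes from, followed by the correct route.

```python
raw = "中".encode("utf-8")     # bytes that are already encoded

try:
    raw.decode("ascii")        # the step Python 2 performs implicitly before encoding
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Correct route: decode with the real encoding first, then encode to the target.
fixed = raw.decode("utf-8").encode("gbk")

print(error_name, fixed)
```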

 

As a metaphor, Unicode is like a pawnshop. If I have an object and need money, I take the object to the pawnshop for valuation (decode), and the pawnshop pays me money (encode as money). If I have money and need an object, I also go through the pawnshop: it values my money (decode) and hands me the object (encode as an object). I cannot trade an object for money directly, because without the pawnshop to set the price it is easy to lose out. (The metaphor is a little loose, but you only need to grasp the principle. We will meet it again in the crawler chapter.)

 

Since we have talked about encoding, compilers and interpreters deserve a mention too. High-level languages are close to human language, so machines cannot execute them directly. A compiler or an interpreter has to translate them first.

 

Bytecode and machine code:

Differences between bytecode and machine code (also called native code):

C code is compiled into machine code, which executes directly on the processor. Each instruction controls the CPU directly. The standard Python interpreter is itself written in C.

Java code is compiled into bytecode, which executes on an abstract computer, the Java Virtual Machine (JVM). Each instruction is processed by the JVM, which interprets it or translates it into machine code and interacts with the real computer on the program's behalf.

In short: machine code is much faster, but bytecode is more portable and safer.

Definition of an interpreted language:

The program does not need to be compiled ahead of time. It is translated only while it runs, each statement being translated when it is executed. Every run of an interpreted program therefore requires statement-by-statement translation again, which is less efficient.

Modern interpreted languages usually compile the source program into intermediate code first, and then use an interpreter to execute the intermediate code one instruction at a time.
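Python's own intermediate code can be inspected with the standard dis module. The sketch below lists the instruction names for a tiny function defined just for this example; the exact names vary between Python versions, so no particular output is shown.

```python
import dis


def add(a, b):
    return a + b


# Each entry is one intermediate-code instruction that the interpreter executes in turn.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```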

When you run code later, you will see that Python behaves as an interpreted language.

Definition of a compiled language:

Before a program written in a compiled language can run, a separate compilation step translates it into a machine-language file, such as an exe file. To run the program later you do not translate it again; you simply use the compilation result (the exe file). Because translation happens only once and not at run time, compiled languages execute efficiently.

How Python works:

Python is an interpreted programming language. A Python script can also be compiled into a pyc file ahead of time. Either way, what actually runs is a sequence of instructions for the Python virtual machine.

Python first compiles the source code into bytecode and then interprets the bytecode. Inside the Python virtual machine the bytecode is represented by a PyCodeObject object, and the pyc file is the on-disk representation of that bytecode.
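Both forms of the bytecode, the in-memory code object and the pyc file on disk, can be produced with the standard library. This is a sketch that assumes a throwaway one-line script named demo.py (a hypothetical file created in a temporary directory for the example).

```python
import os
import py_compile
import tempfile
import types

source = "print(1)\n"

# In memory: compile() returns a code object, the Python-level view of PyCodeObject.
code = compile(source, "<demo>", "exec")

# On disk: py_compile writes the bytecode of a script to a pyc file.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo.py")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write(source)
    pyc_path = py_compile.compile(script, cfile=os.path.join(tmp, "demo.pyc"))
    pyc_created = os.path.exists(pyc_path)

print(type(code).__name__, len(code.co_code) > 0, pyc_created)
```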

Question 2. Why is an error reported if I do not add the declaration?

A: Without the UTF-8 declaration, Python 2 reports an error as soon as the file contains non-ASCII characters. The error message says that no encoding was declared.

Python 3 solved this at the language level, because the encoding problems of Python 2 (ASCII by default) are really annoying. You will see the problem again later in the crawler chapter.

Note the following when you run this code under Python 3:

  • 1. In Python 3, print is a built-in function and no longer a syntax keyword, so parentheses are required. In Python 2, writing print with parentheses does not cause an error either.
  • 2. In Python 3 the default source encoding is UTF-8 and strings are Unicode, so the text prints normally. We still recommend adding the encoding declaration: # -*- coding: utf-8 -*-. Writing just # coding: utf-8 is also accepted, but the former is the widely used convention, and good habits reflect your programming ability.
  • 3. To print a string, you must enclose it in quotation marks. This is covered later in the section on types.
  • 4. If you use IDLE, which ships with Python, note that IDLE in Python 2 uses the cp936 encoding. cp936 is the Windows code page for GBK, which is compatible with ASCII.
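The first note, that print is a function in Python 3, has practical consequences: it accepts keyword arguments and can be passed around like any other object. The sketch below writes into an in-memory buffer so the result can be inspected; the name emit is made up for the example.

```python
import io

buffer = io.StringIO()

# print is an ordinary function: it takes keyword arguments such as sep, end, and file.
print("hello", "world", sep=", ", end="!\n", file=buffer)
output = buffer.getvalue()

# Being a function, it can also be stored in a variable and called later.
emit = print
emit(output, end="")
```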

Question 3. Why use UTF-8 and not another encoding?

A: As mentioned above, Unicode is the grand unification of character sets, and UTF-8 is an implementation of Unicode and the best choice among them. Therefore, UTF-8 is used.

Question 4. Can I use another word in place of "print"? How can I print a piece of Chinese text?

A: Python's keywords are fixed by the language syntax and cannot be changed. A keyword cannot be redefined as a variable name, nor replaced by another word.
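Whether a word is a keyword can be checked with the standard keyword module. The sketch below runs under Python 3, where print is a built-in function and therefore reports False; under Python 2, print appears in keyword.kwlist. The test string "if = 1" is just an illustration of trying to use a keyword as a variable name.

```python
import keyword

# Ask the interpreter which of these words are reserved keywords.
checks = {word: keyword.iskeyword(word) for word in ("if", "def", "print")}

# Trying to use a keyword as a variable name is rejected at compile time.
try:
    compile("if = 1", "<demo>", "exec")
    assignable = True
except SyntaxError:
    assignable = False

print(checks, assignable)
```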

The result of printing Chinese characters was shown above. The code is attached for your own practice:

# -*- coding: utf-8 -*-
print('my ')

 
