Geek College Python Learning notes

Source: Internet
Author: User

Basics

Character encoding

    • Character: A character is a unit of information. It is the general term for letters and symbols of all kinds, including the scripts of different languages, punctuation, graphic symbols, digits, and so on.
    • Character set: A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones are ASCII, GB2312, and Unicode.
    • Character encoding: The rule that maps each character in a character set to a specific binary number so that a computer can process it. Common character encodings are ASCII, UTF-8, and GBK.
        • ASCII:   In the 1960s, the United States developed a character encoding scheme that specifies how letters, digits, and some common symbols are converted to binary. It is known as ASCII.
        • Unicode:

          ASCII defines codes for only 128 characters, which was sufficient in the United States. As computers spread to Europe, Asia, and the rest of the world, ASCII turned out to be far from enough to represent other languages, so different countries and regions developed their own encoding schemes, such as GB2312 and GBK in mainland China and Shift_JIS in Japan.

          Because each country or region had its own scheme, text exchanged between computers in different regions often turned into garbled characters (mojibake), which was a disaster for data interchange. What can be done about it?

          The idea is simply to unify all the world's languages under a single scheme called Unicode, which assigns a unique number (a code point) to each character in each language, so that text can be processed across languages and platforms.

        • UTF-8:

          Unicode appears to be perfect, since everything is unified. However, storing every character at a fixed width has a big drawback: it wastes space.

          Why? To cover all the world's languages, Unicode originally used two bytes per character; two bytes later proved insufficient, and four bytes were used. If ASCII characters are also stored this way, each one takes two or four bytes instead of one, which wastes storage. To solve this problem, several encodings were built on top of Unicode: UTF-16, UTF-32, and UTF-8.

          UTF-8 (8-bit Unicode Transformation Format) is a variable-length encoding of Unicode that uses one to four bytes per character. ASCII characters are still encoded in one byte; Arabic, Greek, and similar scripts use two bytes; commonly used Chinese characters use three bytes; and so on.

          Therefore, UTF-8 is one of the ways Unicode is implemented. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character).
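
The byte counts described above can be checked directly. The following is a minimal Python 3 sketch, not part of the original notes: the helper name encoded_lengths and the sample characters are chosen here purely for illustration. It reports the Unicode code point of a character and how many bytes it occupies in UTF-8, UTF-16, and UTF-32, and then shows how decoding GBK bytes with the wrong scheme produces garbled text.

```python
# Sample characters: Latin letter, accented Latin, Greek, Chinese, emoji.
# The selection is illustrative only.
SAMPLES = ["A", "é", "α", "中", "😀"]


def encoded_lengths(ch):
    """Return the code point and the encoded size in bytes of one character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants are used so that no byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in SAMPLES:
    # Only ASCII is printed, so this works on any terminal encoding.
    print(encoded_lengths(ch))

# Mojibake: bytes written in one encoding but read back with another.
gbk_bytes = "中文".encode("gbk")        # bytes as a GBK system would store them
garbled = gbk_bytes.decode("latin-1")   # wrong scheme: unrelated characters
restored = gbk_bytes.decode("gbk")      # right scheme: original text
print(ascii(garbled), restored == "中文")
```

ASCII characters stay at one byte in UTF-8, while the same characters cost two bytes in UTF-16 and four in UTF-32, which is the storage waste the text refers to.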

    • Python's default encoding: The default encoding in Python 2 is ASCII, and in Python 3 it is UTF-8. It can be checked as follows:

import sys

sys.getdefaultencoding()

      • Python 2 has two string-related types, str and unicode, whose common parent class is basestring. A str holds bytes in some encoding (ASCII by default, but also GBK, UTF-8, and so on), while a unicode string is written with the u'...' literal.
      • To convert a UTF-8 encoded str 'xxx' to the unicode string u'xxx', use the decode('utf-8') method.
      • To convert u'xxx' to a UTF-8 encoded str 'xxx', use the encode('utf-8') method.
      • When an operation mixes str and unicode, Python 2 first implicitly decodes the str into unicode using the default ASCII encoding, which easily raises UnicodeDecodeError when the str contains non-ASCII bytes.
      • If a function or class expects a str but is passed a unicode, Python 2 implicitly encodes it to str using the default ASCII encoding, which easily raises UnicodeEncodeError when the text contains non-ASCII characters.
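
The conversions and errors listed above can be reproduced in Python 3, where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode. The sketch below is an illustration under that assumption, not Python 2 code: Python 3 never converts between the two types implicitly, so the ASCII step that Python 2 performs behind the scenes is written out explicitly here. The helper names try_ascii_decode and try_ascii_encode are hypothetical.

```python
def try_ascii_decode(raw):
    """Decode bytes as ASCII; return the text, or the error name on failure."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return type(exc).__name__


def try_ascii_encode(text):
    """Encode text as ASCII; return the bytes, or the error name on failure."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# Explicit round trip with the correct encoding always works.
utf8_bytes = "中文".encode("utf-8")    # text -> UTF-8 bytes
text = utf8_bytes.decode("utf-8")      # UTF-8 bytes -> text

# The ASCII conversions succeed only for pure-ASCII data.
print(try_ascii_decode(b"abc"))        # abc
print(try_ascii_decode(utf8_bytes))    # UnicodeDecodeError
print(try_ascii_encode("abc"))         # b'abc'
print(try_ascii_encode(text))          # UnicodeEncodeError
```

Naming the encoding at every decode and encode call, as in the round trip above, avoids both errors regardless of the interpreter's default.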
