Types of str and bytes in Python 3

Source: Internet
Author: User

One of the most important changes in Python 3 is the clear distinction between text and binary data. Text is always Unicode and is represented by the str type; binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly: you cannot concatenate a string with a bytes object, you cannot search for a string inside a bytes object (or vice versa), and you cannot pass a string to a function that expects bytes (or vice versa).
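The rules above can be checked with a small sketch. The helper name try_mix is hypothetical and exists only for this illustration; it reports whether an operation raises TypeError.

```python
text = "hello"    # str: Unicode text
data = b"hello"   # bytes: raw binary data


def try_mix(operation):
    """Run operation() and report whether it raised TypeError (hypothetical helper)."""
    try:
        operation()
    except TypeError:
        return "TypeError"
    return "ok"


concat_result = try_mix(lambda: text + data)    # str + bytes
search_result = try_mix(lambda: text in data)   # str searched inside bytes
reverse_result = try_mix(lambda: data in text)  # bytes searched inside str
are_equal = (text == data)                      # comparing is allowed, but they are never equal

print(concat_result, search_result, reverse_result, are_equal)
```

Note that comparing a str with a bytes object using == does not raise an error by default; it simply evaluates to False, which can hide bugs.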

A brief history of character encoding

Before discussing bytes and str, we need to look at how character encoding developed. In the early days of computing, English-speaking countries, led by the United States, dominated the computer industry, and English text needs only the 26 letters plus digits and punctuation. The earliest widely used character encoding standard was therefore ASCII, a 7-bit code that is stored one character per 8-bit byte and covers everything English text requires.

What is encoding? Encoding is the representation of a character in binary. Everything, whether English, Chinese, or symbols, is ultimately stored on disk as a sequence such as 010101. Inside a computer, reading and storing data boils down to a bit stream of 0s and 1s. The problem is that humans cannot read such a bit stream, so how do these 0s and 1s become readable text? This is the job of a character encoding: it acts as a translator inside the computer that transparently converts the bit stream into text that people can read directly. The average user does not need to know how this process works, but a programmer must understand it clearly.
Take ASCII as an example: each character occupies 1 byte (8 bits), so the bit stream is interpreted one byte at a time. For example, 01000001 represents the capital letter A, and we often write the same value in decimal as 65. Eight bits can represent up to 2^8 = 256 distinct values; ASCII itself defines only 128 of them (0 to 127).
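As a quick illustration of the ASCII example, the following sketch uses only Python built-ins to move between a character, its decimal code, and its 8-bit pattern. The variable names are chosen for this example only.

```python
code = ord("A")               # decimal code of the character: 65
bits = format(code, "08b")    # the same value as an 8-bit pattern: '01000001'
back = chr(int(bits, 2))      # interpreting the bit pattern again gives 'A'

one_byte_values = 2 ** 8      # distinct values that fit in one byte: 256
ascii_values = 2 ** 7         # values ASCII actually defines: 128

print(code, bits, back, one_byte_values, ascii_values)
```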

Later, as computers spread around the world, languages such as Chinese, Japanese, and Korean also had to be represented, and the 256 values of a single byte were far from enough. Standards bodies then developed a universal character set called Unicode, which assigns a unique number (a code point) to every character regardless of language. The early fixed-width encodings of Unicode, such as UCS-2 and its successor UTF-16, use at least 2 bytes per character, so an English letter takes 2 bytes and so does a common Chinese character. This meets everyone's needs, but it is not byte-compatible with ASCII and it wastes space, because most characters in computing are English letters that fit in 1 byte.
So UTF-8, a variable-length encoding of Unicode, came into being. It encodes ASCII characters in 1 byte, most Chinese characters in 3 bytes, and so on, using 1 to 4 bytes per character. It is therefore compatible with ASCII, so existing ASCII documents remain valid, and UTF-8 was soon widely adopted. Along the way, encodings were also created specifically for Chinese, such as GB2312 and GBK in mainland China and Big5 for traditional Chinese; these are used mainly in Chinese-language regions and are not universal. In GBK, a Chinese character occupies 2 bytes.
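The byte counts mentioned above can be verified with a short sketch. The helper encoded_length is a hypothetical name used only here; it returns None when a character cannot be represented in the given encoding.

```python
samples = ["A", "中"]
encodings = ["ascii", "utf-8", "utf-16-le", "gbk"]


def encoded_length(char, encoding):
    """Return how many bytes char needs in encoding, or None if it cannot be encoded."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


# Map (character, encoding) to the number of bytes required.
table = {(c, e): encoded_length(c, e) for c in samples for e in encodings}

for (c, e), n in table.items():
    print(c, e, n)
```

The codec name "utf-16-le" is used instead of "utf-16" so that the byte order mark Python would otherwise prepend does not inflate the count.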

The similarities and differences between bytes and str

Back to bytes and str. A bytes object is a sequence of bytes, which ultimately exists as a bit stream such as 01010011. Whether we are writing code or reading an article, nobody reads this bit stream directly; it has to be decoded into meaningful text rather than left as an obscure string of 0s and 1s. Because different encodings interpret the same bit stream differently, this causes a lot of trouble in practice. Let's look at how Python handles these encoding issues:

>>> s = "中文"
>>> s
'中文'
>>> type(s)
<class 'str'>
>>> b = bytes(s, encoding="utf-8")
>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>

As the example above shows, s is a str. The built-in function bytes() converts the str to bytes. b is really just a sequence of 0s and 1s, but to make it easier to inspect in the interactive interpreter, Python displays it as b'\xe4\xb8\xad\xe6\x96\x87'. The leading b indicates that this is a bytes object. \xe4 is hexadecimal notation for a single byte, so encoding "中文" as UTF-8 takes 6 bytes in total, 3 for each Chinese character, which confirms the discussion above. When converting a str with bytes(), you must pass the encoding argument explicitly; it cannot be omitted.
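To make the "sequence of 0s and 1s" concrete, here is a minimal sketch that looks at the individual bytes of b and shows what happens when the encoding argument is left out. The variable names are illustrative.

```python
s = "中文"
b = bytes(s, encoding="utf-8")

byte_values = list(b)                         # each byte as an integer from 0 to 255
bit_strings = [format(v, "08b") for v in b]   # each byte as 8 bits

try:
    bytes(s)                                  # no encoding given for a str argument
    missing_encoding_error = None
except TypeError as exc:
    missing_encoding_error = str(exc)

print(len(b), byte_values)
print(bit_strings)
print(missing_encoding_error)
```

Iterating over a bytes object yields integers, not one-character strings, which is another practical difference from str.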

As we all know, the str type has an encode() method, which performs the encoding from a string to a byte stream. The bytes type in turn has a decode() method, which decodes bytes back into a string. Beyond that, a look at the Python source code shows that bytes and str have almost the same list of methods; the biggest difference is encode versus decode. In essence, strings are also stored on disk as 0s and 1s, so they need encoding and decoding too. If the explanation above still does not make the difference clear, keep the following points in mind:

    • 1. When strings are saved to and read from disk in text mode, Python encodes and decodes them for us automatically, and we do not need to manage that process;
    • 2. Using the bytes type tells Python that we do not want automatic encoding and decoding; we do it ourselves and specify the encoding;
    • 3. Python strictly distinguishes between the bytes and str data types: we cannot pass a str where a bytes argument is required, or vice versa. This is easy to run into when reading and writing files on disk.
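The three points above can be sketched with a temporary file. This is only an illustration: the file name demo.txt and the variable names are assumptions made for the example, and the encoding is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

text = "中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # Text mode: Python encodes on write and decodes on read for us.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        read_as_text = f.read()

    # Binary mode: no automatic decoding, we get the raw bytes back.
    with open(path, "rb") as f:
        read_as_bytes = f.read()

    # Passing a str where bytes is required fails.
    try:
        with open(path, "wb") as f:
            f.write(text)
        binary_write_error = None
    except TypeError as exc:
        binary_write_error = type(exc).__name__

print(read_as_text, read_as_bytes, binary_write_error)
```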

Converting between bytes and str is precisely the process of encoding and decoding, and the encoding must be specified explicitly. Without an encoding, str(b) does not decode anything; it only returns the printable representation of the bytes object:

>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>
>>> res = str(b)
>>> res
"b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'"
>>> type(res)
<class 'str'>
>>> result = str(b, encoding="utf-8")
>>> result
'中文'
>>> type(result)
<class 'str'>

We then convert result to GBK-encoded bytes and back again:

>>> gbk_b = bytes(result, encoding="gbk")
>>> gbk_b
b'\xd6\xd0\xce\xc4'
>>> s = str(gbk_b, encoding="gbk")
>>> s
'中文'
>>> type(s)
<class 'str'>
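The same conversions can also be written with the encode() and decode() methods. The sketch below wraps the decode-then-encode step in a helper named transcode, which is a hypothetical name for this illustration and not a standard library function. It also shows what may happen when bytes are decoded with the wrong encoding.

```python
def transcode(data, source, target):
    """Decode data from the source encoding, then re-encode it in the target encoding."""
    return data.decode(source).encode(target)


utf8_bytes = "中文".encode("utf-8")                 # same result as bytes("中文", encoding="utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")   # UTF-8 bytes converted to GBK bytes
round_trip = gbk_bytes.decode("gbk")                # back to the original text

# Decoding with the wrong encoding either fails outright...
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = type(exc).__name__

# ...or silently produces unreadable text.
garbled = utf8_bytes.decode("latin-1")

print(gbk_bytes, round_trip, wrong_codec_error, garbled != "中文")
```

The silent case is the more dangerous one in practice, because no exception signals that the text has been misinterpreted.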
