Python string encoding

Source: Internet
Author: User
Tags: readable python script

Text usually refers to the characters and other symbols that appear on screen, but a computer cannot handle characters directly; it only deals in bits and bytes. Every piece of text you see on screen is actually stored in some particular character encoding. Roughly speaking, a character encoding provides a mapping between what is displayed on the screen and what is stored in memory and on disk. There are many different character encodings; some are designed and optimized for particular languages, such as Russian, Chinese, or English, while others can represent multiple languages.

In practice it is more complicated than that. Many characters are common to several encodings, but each encoding may use a different byte sequence to store them in memory or on disk. So you can think of a character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes, a file, a web page, anything, and tells you it is text, you need to know which encoding was used before you can decode those bytes into characters. If they give you the wrong key, or no key at all, you are left to crack the code yourself, which is a difficult task. You may end up decoding with the wrong method and getting inexplicable results.
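
You can reproduce this in a few lines of Python, the language this article builds toward. A minimal sketch: the same bytes decoded with the right key and with a wrong one.

>>> data = 'café'.encode('utf-8')   # in UTF-8, the letter é becomes the two bytes C3 A9
>>> data.decode('utf-8')            # right key
'café'
>>> data.decode('cp1252')           # wrong key: mojibake
'cafÃ©'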

Everything you think you know about strings is wrong.

You have surely seen web pages where an apostrophe (') was replaced by a strange question-mark-like character. This usually means the page's author did not correctly declare the encoding used, the browser was left to guess, and the result is a mixture of expected and unexpected characters. If the original is English, it is merely inconvenient to read; in other languages, the result may be completely unreadable.

Character encodings exist for every major language in the world. Because each language is different, and because memory and disk space used to be expensive, each encoding is optimized for a particular language. Each encoding uses numbers (0–255) to represent the characters of that language. For example, you are probably familiar with ASCII, which stores English characters as numbers from 0 to 127 (65 represents uppercase 'A', 97 represents lowercase 'a', etc.). The English alphabet is simple enough to express in fewer than 128 numbers. If you can count in base 2, that means ASCII uses only 7 of the 8 bits in a byte.
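
You can check these numbers from Python's interactive shell:

>>> ord('A'), ord('a')       # the ASCII numbers for 'A' and 'a'
(65, 97)
>>> 'ABC'.encode('ascii')    # English text fits in one byte per character
b'ABC'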

Some Western European languages, such as French, Spanish, and German, have more letters than English. Or, more precisely, those languages contain letters combined with diacritical marks, like the ñ in Spanish. The most common encoding for these languages is CP-1252, also known as windows-1252 because it is widely used in Microsoft's Windows operating system. CP-1252 agrees with ASCII in the 0–127 range, but extends into the 128–255 range with characters such as ñ (n-with-a-tilde-over-it, 241) and ü (u-with-two-dots-over-it, 252). It is still a single-byte encoding: the largest possible number is 255, which still fits in one byte.
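
A short sketch using Python's built-in cp1252 codec, showing the agreement below 128 and the extension above it:

>>> bytes([65]).decode('cp1252')     # 0–127: identical to ASCII
'A'
>>> bytes([241]).decode('cp1252')    # 241 is ñ
'ñ'
>>> bytes([252]).decode('cp1252')    # 252 is ü
'ü'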

However, languages such as Chinese, Japanese, and Korean have so many characters that they require multi-byte character sets; that is, each character is represented by a two-byte number in the range 0–65535. But different multi-byte encodings suffer from the same problem as different single-byte encodings: they use the same numbers to mean different things. They simply use a wider range of numbers, because there are many more characters to represent.
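
As an illustrative sketch of the mismatch, here is the same Chinese character encoded with two multi-byte codecs that ship with Python (byte values as I understand these legacy tables; the point is that they differ):

>>> '中'.encode('gb2312')      # a Chinese encoding: one two-byte sequence...
b'\xd6\xd0'
>>> '中'.encode('shift_jis')   # a Japanese encoding: a different one entirely
b'\x92\x86'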

In the era before networks, when text was typed by yourself and occasionally printed out, these encoding schemes were mostly adequate. There was not much plain text around then. Source code was ASCII, and everyone else used word processors, which defined their own (non-text) formats that recorded character-encoding information along with styles, etc. People read these documents with the same word-processing software as the original author, so everything more or less worked.

Now consider the advent of global networks such as email and the web. Vast amounts of "plain text" circulate around the world, written on one computer, transmitted through a second, and finally displayed on a third. Computers can only see numbers, but those numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of "plain text". Remember, the encoding is the decryption key that maps computer-readable numbers to human-readable characters. Losing the key means garbled, inexplicable text, or worse.

Now imagine trying to store multiple pieces of text in the same place, such as the database that holds all your received mail. You still need to store the character encoding alongside each piece of text so that it can be displayed correctly. Think that's hard? Try searching your email database, which means converting between encodings on the fly. Fun, isn't it?

Now consider another possibility: multilingual documents, where characters from several different languages are mixed together in the same document. (Hint: programs that handled such documents typically used escape codes to switch between "modes".) Pop! Now you are in Russian koi8-r mode, so 241 means Я; now you are in Mac Greek mode, so the same 241 means ώ. And of course you will want to search these documents, too. At this point, there is no such thing as plain text.
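
Python still ships codecs for these legacy encodings, so you can reproduce the mode problem directly (a sketch using the koi8_r and mac_greek codecs):

>>> bytes([241]).decode('koi8_r')     # "Russian mode": byte 241 is Я
'Я'
>>> bytes([241]).decode('mac_greek')  # "Greek mode": the same byte is ώ
'ώ'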


Unicode

The Unicode encoding system is designed to represent every character of every language. Unicode represents each letter, symbol, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world's languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes would not be enough.) Characters that are used in several languages generally share the same number, unless there is a good etymological reason not to. In any case, there is exactly one number per character and one character per number. There is no ambiguity, and no modes to keep track of: U+0041 is always 'A', even if a language has no 'A' in it.
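
In Python 3 you can move between characters and their Unicode code points with the built-ins ord() and chr():

>>> ord('A')        # 'A' is code point U+0041, i.e. 65
65
>>> chr(0x4E2D)     # U+4E2D is the Chinese character 中
'中'
>>> '\u0041'        # escape syntax for a code point inside a string literal
'A'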

At first glance, this seems wonderful. One encoding to solve every problem. Documents can contain multiple languages. No more mode switching between encodings. But an obvious question quickly arises. Four bytes? Just for a single character? That seems terribly wasteful, especially for languages like English and Spanish, which need no more than one byte (256 numbers) to express every possible character. In fact, it is wasteful even for ideograph-based languages such as Chinese, whose characters never need more than two bytes each.

There is a Unicode encoding that uses four bytes per character. It is called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding: it takes each Unicode character (a 4-byte number) and represents the character with that number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at byte 4×N. It also has some drawbacks, the most obvious being that it spends four whole bytes on every single character.
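
A short sketch of that layout, using Python's utf-32-be codec (big-endian, without a byte order mark):

>>> s = 'A中'
>>> s.encode('utf-32-be')    # every character padded to exactly 4 bytes
b'\x00\x00\x00A\x00\x00N-'
>>> len(s.encode('utf-32-be')) == 4 * len(s)
True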

Although there are many Unicode characters, most people never actually use anything beyond the first 65535. So there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character in the range 0–65535 as two bytes, and resorts to some dirty tricks if you actually need to represent the rarely used "astral plane" Unicode characters beyond 65535. The most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because each character requires only two bytes instead of four (except for the ones that don't). And if you assume a string contains no astral plane characters, you can still find the Nth character in constant time. Which is a good assumption, right up until the moment it isn't.
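
You can see both the saving and the astral plane exception from Python (😀 is U+1F600, an astral plane character):

>>> len('中'.encode('utf-16-be'))    # below 65536: two bytes
2
>>> len('😀'.encode('utf-16-be'))    # U+1F600: four bytes (a surrogate pair)
4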

But there are also some non-obvious drawbacks to both UTF-32 and UTF-16. Different computer systems store individual bytes in different orders. That means the character U+4E2D could be stored in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (In UTF-32 there are even more possible byte orderings.) As long as a document never leaves your computer, it is safe; different programs on the same computer all use the same byte order. But the moment you need to transfer the document between systems, perhaps over the World Wide Web, you need a way to indicate which order the bytes are in. Otherwise, the receiving computer has no way of knowing whether the two bytes 4E 2D mean U+4E2D or U+2D4E.

To solve this problem, the multi-byte Unicode encodings define a Byte Order Mark: a special non-printing character that you can include at the beginning of a document to indicate the byte order in use. For UTF-16, the byte order mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the bytes are in one order (little-endian); if it starts with FE FF, you know they are reversed (big-endian).
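
Python's UTF-16 codecs show both the endianness problem and the byte order mark (the plain utf-16 codec uses the machine's native order; the output below assumes a little-endian machine):

>>> '中'.encode('utf-16-be')   # big-endian: U+4E2D stored as 4E 2D
b'N-'
>>> '中'.encode('utf-16-le')   # little-endian: the same character as 2D 4E
b'-N'
>>> '中'.encode('utf-16')      # plain utf-16 prepends the BOM (FF FE here)
b'\xff\xfe-N'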

However, UTF-16 is not ideal either, especially when you are dealing with lots of ASCII characters. If you think about it, even a Chinese web page contains many ASCII characters: all the elements and attributes of the markup surrounding the printable Chinese characters. Finding the Nth character in constant time is certainly nice, but there is still the nagging problem of those astral plane characters: you cannot guarantee that every character takes exactly two bytes, so you cannot really find the Nth character in constant time unless you maintain a separate index. And, friends, there sure is a lot of ASCII text in the world.

Other people pondered these questions, and they found a solution:

UTF-8

UTF-8 is a variable-length encoding system for Unicode. That is, different characters are encoded with different numbers of bytes. ASCII characters (A–Z, etc.) are encoded in UTF-8 with just one byte each; in fact, the first 128 characters (0–127) are encoded in UTF-8 exactly as they are in ASCII. Extended Latin characters such as ñ and ö take two bytes. (The bytes are not simply the Unicode code point, as they are in UTF-16; some serious bit-twiddling is involved.) Chinese characters take three bytes. The rarely used astral plane characters take four bytes.

Disadvantages: because each character can take a different number of bytes, finding the Nth character in a string is an O(N) operation; the longer the string, the longer it takes to locate a particular character. There is also the bit-twiddling involved in encoding characters into bytes and decoding bytes back into characters.

Advantages: super-efficient encoding of the most commonly used ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you will have to trust me on this, because I am not going to show you the math), by the very nature of the bit-twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
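
The variable lengths are easy to verify from Python (one character each from the one-, two-, three-, and four-byte ranges):

>>> [len(c.encode('utf-8')) for c in ['a', 'ñ', '中', '😀']]
[1, 2, 3, 4]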

How to encode Python source code

Python 3 assumes that our source code, i.e. each .py file, is encoded in UTF-8.

In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8.

If you want to save Python code in a different encoding, you can place an encoding declaration on the first line of each file. The following declaration defines a .py file to be windows-1252-encoded:

# -*- coding: windows-1252 -*-

Technically, the character encoding override declaration can also be placed on the second line, if the first line is taken up by the hash-bang command on a UNIX-like system:

#!/usr/bin/python3
# -*- coding: windows-1252 -*-

For more information, see PEP 263: Defining Python Source Code Encodings.

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8 or encoded as CP-1252. In other words, "is this string UTF-8?" is no longer a valid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to turn a string into a sequence of bytes in a particular encoding, Python 3 can do that for you. If you want to turn a sequence of bytes into a string, Python 3 can do that too. Bytes are bytes, not characters; characters are an abstraction inside the computer. A string is a sequence of those abstractions.
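
A sketch of that round trip, using the same string as the example below:

>>> s = '深入 Python'
>>> b = s.encode('utf-8')    # string -> bytes, with an explicit encoding
>>> b
b'\xe6\xb7\xb1\xe5\x85\xa5 Python'
>>> b.decode('utf-8')        # bytes -> string, using the same key
'深入 Python'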

>>> s = '深入 Python'
>>> len(s)
9
>>> s[0]
'深'
>>> s + ' 3'
'深入 Python 3'
    1. To create a string, enclose it in quotation marks. Python strings can be defined with either single quotes (') or double quotes (").

    2. The built-in function len() returns the length of the string, i.e. the number of characters. It is the same function you use to find the length of a list, tuple, set, or dictionary. A string can be thought of as a tuple of characters.

    3. Just as you can get individual items out of a list, you can get individual characters out of a string using index notation.

    4. As with lists, you can concatenate strings using the + operator.
