Python Advanced Path 1.5: Python data types - strings

Source: Internet
Author: User
Tags: escape, quotes, stdin

String

Python strings and character encoding

Character encoding

As mentioned earlier, a string is also a data type, but what makes strings special is the question of encoding.

Because a computer can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, more bytes are needed. For example, the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.
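The byte arithmetic above can be checked directly. The following is a minimal sketch; the helper name max_unsigned is an illustrative assumption, not something from the original text.

```python
def max_unsigned(num_bytes):
    """Return the largest unsigned integer that fits in num_bytes bytes."""
    return 2 ** (8 * num_bytes) - 1


for n in (1, 2, 4):
    print(n, "byte(s):", max_unsigned(n))
```

Each extra byte adds 8 bits, so the limit grows by a factor of 256 per byte.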

Since computers were invented by Americans, at first only 128 characters (codes 0 to 127) were encoded: uppercase and lowercase English letters, digits, and some symbols. This code table is called ASCII; for example, the code for capital A is 65 and the code for lowercase z is 122.

Handling Chinese clearly takes more than one byte; at least two bytes are needed, and the encoding must not conflict with ASCII. China therefore developed the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and with every country having its own standard, conflicts were inevitable. The result is that text mixing several languages shows up garbled.

Unicode emerged to solve this. Unicode unifies all languages into a single set of codes, so the garbling problem goes away.

The Unicode standard is still evolving. In the simplified picture used here, a character is most commonly represented with two bytes, and very rare characters need four bytes. Modern operating systems and most programming languages support Unicode directly.

Now the difference between ASCII and Unicode is clear: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001.

The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different things.

The Chinese character 中 is outside the ASCII range. Its Unicode code is decimal 20013, binary 01001110 00101101.

As you might guess, to express the ASCII-encoded A in Unicode you only need to pad it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
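These code values can be inspected from Python with the built-in ord() and chr() functions. The sketch below pads each code to 16 bits only to mirror the two-byte picture used in the text; the variable names are illustrative.

```python
# ord() returns a character's Unicode code; chr() is the inverse.
codes = {ch: ord(ch) for ch in ("A", "0", "中")}

for ch, code in codes.items():
    # "016b" formats the number as 16 binary digits with leading zeros.
    print(ch, code, format(code, "016b"))

print(chr(65))     # the character whose code is 65
print("0" == 0)    # the character '0' is not the integer 0
```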

A new problem arises: with everything unified under Unicode, garbled text disappears, but if your text is almost entirely English, Unicode needs twice the storage of ASCII, which is wasteful for storage and transmission.

So, in the spirit of economy, the variable-length UTF-8 encoding was introduced as a way to store Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code (the original design allowed up to 6 bytes, but the current standard stops at 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transfer contains a large proportion of English characters, UTF-8 saves space:

Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         x         01001110 00101101  11100100 10111000 10101101

The table above also shows an additional benefit of UTF-8: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8 text.
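The UTF-8 column of the table can be reproduced with str.encode(). This is a small sketch; the helper name utf8_bits is a hypothetical name chosen for illustration.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


print(utf8_bits("A"))   # one byte, identical to ASCII
print(utf8_bits("中"))  # three bytes
```

Because the single byte for A matches its ASCII code, pure ASCII text is already valid UTF-8.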

Having sorted out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding generally works in today's computer systems:

In memory, text is held as Unicode; it is converted to UTF-8 when it is saved to disk or transmitted.

When you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory, and when you save, the Unicode characters are converted back to UTF-8 and written to the file.

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.

That is why the source of many web pages contains a charset declaration stating that the page is encoded in UTF-8.
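In Python 3 this split between memory and storage corresponds to the str and bytes types. The following sketch assumes Python 3 and uses a sample string chosen only for illustration.

```python
text = "ABC 中文"                 # str: Unicode text held in memory
data = text.encode("utf-8")       # bytes: what is saved or transmitted
restored = data.decode("utf-8")   # str again after reading the bytes back

print(len(text), "characters")    # 6 characters
print(len(data), "bytes")         # 4 one-byte + 2 three-byte characters
print(restored == text)
```

Decoding with the wrong encoding is what produces garbled text, so the encoding used to write the bytes should be the one used to read them.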

Python string

A Python string can be written in several ways. It can be enclosed in single or double quotation marks:

name = 'lixingli'
user = "lixingli"
print(name, user)

This prints:

lixingli lixingli

When the interactive interpreter echoes a string, it encloses it in double quotes if the string contains a single quote and no double quotes; otherwise it uses single quotes. For such strings, the print() function produces more readable output, because it omits the enclosing quotes and prints escaped characters as they are.
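The difference between the interpreter's echo and print() can be seen in a script by calling repr(), which returns the same text the interactive interpreter would echo. A minimal sketch with illustrative variable names:

```python
s = "doesn't"
t = '"Yes," he said.'

print(repr(s))  # contains a single quote only, so double quotes are used
print(repr(t))  # contains double quotes only, so single quotes are used
print(s)        # print() shows the text without enclosing quotes
print(t)
```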

String literals can span multiple lines in several ways. One is the continuation character: a backslash as the last character on a line indicates that the next line is a logical continuation of it. First, note how \n is used to add a new line:

>>> '"Isn\'t," she said.'
'"Isn\'t," she said.'
>>> print('"Isn\'t," she said.')
"Isn't," she said.
>>> s = 'First line.\nSecond line.'  # \n means newline
>>> s  # without print(), \n is shown in the output
'First line.\nSecond line.'
>>> print(s)  # with print(), \n produces a new line
First line.
Second line.

The following uses backslashes (\) to continue the line:

hello = "This is a rather long string containing\n\
several lines of text just as you would do in C.\n\
    Note that whitespace at the beginning of the line is significant."
print(hello)

Note that newlines still have to be written as \n; the line break that follows a trailing backslash is discarded. The example prints:

This is a rather long string containing
several lines of text just as you would do in C.
    Note that whitespace at the beginning of the line is significant.

Alternatively, a string can be enclosed in a pair of matching triple quotes: """ (three double quotes) or ''' (three single quotes). With triple quotes, line ends do not need to be escaped; they are included in the string. The example uses a backslash right after the opening quotes to avoid an unwanted empty line at the very beginning.

print("""\
Usage: thingy [OPTIONS]
     -h                        Display this usage message
     -H hostname               Hostname to connect to
""")

The output is as follows:

Usage: thingy [OPTIONS]
     -h                        Display this usage message
     -H hostname               Hostname to connect to

If we make the string literal a "raw" string, \n sequences are not converted to newlines, and both the backslash at the end of a line and the line break in the source are kept in the string as data. For example:

hello = r"This is a rather long string containing\n\
several lines of text much as you would do in C."
print(hello)

The r prefix marks the literal as raw, so backslashes are kept as ordinary characters. The output is:

This is a rather long string containing\n\
several lines of text much as you would do in C.
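Raw strings are convenient wherever backslashes should be taken literally, for example in Windows-style paths. The path below is a made-up value used only for illustration.

```python
normal = "C:\\temp\\new"   # ordinary literal: each backslash must be doubled
raw = r"C:\temp\new"       # raw literal: backslashes are kept as written

print(normal == raw)           # both spell the same string
print(len("\n"), len(r"\n"))   # one newline character vs. backslash + n
```

Without the r prefix and without doubling, \t and \n in such a path would silently turn into a tab and a newline.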

String operations

Strings can be concatenated with the + operator or repeated using the * operator:

>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> '<' + word*5 + '>'
'<HelpAHelpAHelpAHelpAHelpA>'

Two adjacent string literals are concatenated automatically; the first line of the example above could also be written as word = 'Help' 'A'. This only works between two literals and cannot be used with arbitrary string expressions:

>>> 'str' 'ing'                   #  <-  this is correct
'string'
>>> 'str'.strip() + 'ing'   #  <-  this is correct
'string'
>>> 'str'.strip() 'ing'     #  <-  this is an error
  File "<stdin>", line 1
    'str'.strip() 'ing'
                      ^
SyntaxError: invalid syntax
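Adjacent-literal concatenation is mostly useful for breaking a long string across source lines. A minimal sketch, with variable names chosen for illustration:

```python
# Adjacent literals inside parentheses are joined at compile time.
text = ("Put several strings within parentheses "
        "to have them joined together.")

# The rule applies to literals only; variables need the + operator.
prefix = "Py"
joined = prefix + "thon"

print(text)
print(joined)
```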

Strings can be indexed. As in C, the first character of a string has index 0. There is no separate character type; a character is simply a string of length one. As in the Icon programming language, substrings can be specified with slice notation: two indexes separated by a colon.

word = 'Help you'
print(word[3])    # the fourth character
print(word[:2])   # the first two characters (indexes 0 and 1)
print(word[2:4])  # the third and fourth characters (indexes 2 and 3)

The output is as follows:

p
He
lp

Slice indexes have useful defaults: an omitted first index defaults to zero, and an omitted second index defaults to the length of the string being sliced.
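The slice defaults can be spelled out explicitly to confirm that both forms give the same result. A small sketch using the same sample string:

```python
word = "Help you"

head = word[:2]    # same as word[0:2]
tail = word[2:]    # same as word[2:len(word)]
copy = word[:]     # both indexes omitted: the whole string

print(head == word[0:2])
print(tail == word[2:len(word)])
print(copy == word)
```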

Unlike a C string, a Python string cannot be changed. Assigning to an indexed position raises an error:

>>> word = 'HelpA'
>>> word[0] = 'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> word[:1] = 'Splat'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

However, creating a new string by combining existing content is simple and efficient:

>>> 'x' + word[1:]
'xelpA'
>>> 'Splat' + word[4]
'SplatA'
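Since strings are immutable, "changing" a character means building a new string from slices of the old one. The helper replace_at below is a hypothetical name used for illustration, not a built-in.

```python
def replace_at(s, index, new):
    """Return a copy of s with the character at index replaced by new."""
    return s[:index] + new + s[index + 1:]


word = "HelpA"
changed = replace_at(word, 0, "x")

print(changed)  # the new string
print(word)     # the original is untouched
```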

Slicing has a very useful invariant: s[:i] + s[i:] equals s.

>>> word = 'HelpA'
>>> word[:2] + word[2:]
'HelpA'
>>> word[:3] + word[3:]
'HelpA'
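The invariant holds for every integer i, including negative and out-of-range values, which a short loop can confirm. This is only a spot check over a small range, not a proof.

```python
s = "HelpA"

# Collect any i for which the invariant fails; the list should stay empty.
failures = [i for i in range(-10, 11) if s[:i] + s[i:] != s]

print(failures)
```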

Here are a few more examples of single- and double-quoted strings as echoed by the interactive interpreter:

>>> 'spam eggs'
'spam eggs'
>>> 'doesn\'t'
"doesn't"
>>> "doesn't"
"doesn't"
>>> '"Yes," he said.'
'"Yes," he said.'
>>> "\"Yes,\" he said."
'"Yes," he said.'
>>> '"Isn\'t," she said.'
'"Isn\'t," she said.'

Python uses backslashes to escape quotes and other special characters so that they are represented exactly.

With word set to 'HelpA' again, indexing and slicing work as follows:

>>> word = 'HelpA'
>>> word[4]
'A'
>>> word[0:2]
'He'
>>> word[2:4]
'lp'

Omitted slice indexes fall back to their defaults:

>>> word[:2]    # the first two characters
'He'
>>> word[2:]    # everything except the first two characters
'lpA'

Degenerate slice indexes are handled gracefully: an index that is too large is replaced by the string size, and an upper bound smaller than the lower bound returns an empty string.

>>> word = 'HelpA'
>>> word[1:100]
'elpA'
>>> word[10:]
''
>>> word[2:1]
''
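How Python clamps slice bounds can be made visible with the indices() method of the built-in slice type, which returns the effective (start, stop, step) for a sequence of a given length. A minimal sketch:

```python
word = "HelpA"

# Effective bounds after clamping to the length of the string.
clamped_high = slice(1, 100).indices(len(word))
clamped_start = slice(10, None).indices(len(word))

print(clamped_high)    # stop is reduced to len(word)
print(clamped_start)   # start is reduced to len(word), so the slice is empty
print(repr(word[10:]), repr(word[2:1]))
```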

Indexes can be negative numbers, which count from the right. For example:

>>> word = 'HelpA'
>>> word[-1]     # the last character
'A'
>>> word[-2]     # the second-to-last character
'p'
>>> word[-2:]    # the last two characters
'pA'
>>> word[:-2]    # everything except the last two characters
'Hel'

But note that -0 is exactly the same as 0, so it does not count from the right!

>>> word[-0]     # (since -0 equals 0)
'H'

Out-of-range negative slice indexes are truncated, but do not try this with single-element (non-slice) indexes:

>>> word[-100:]
'HelpA'
>>> word[-10]    # error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
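The contrast between slices and single indexes can be handled in code as follows. This is a sketch only; char_at is a hypothetical helper, not part of Python.

```python
word = "HelpA"

# A slice with an out-of-range bound is silently truncated.
print(word[-100:])

# A single index that is out of range raises IndexError.
try:
    word[-10]
except IndexError as exc:
    print("IndexError:", exc)


def char_at(s, i, default=""):
    """Hypothetical helper: return s[i], or default if i is out of range."""
    return s[i] if -len(s) <= i < len(s) else default


print(repr(char_at(word, -10)))
print(char_at(word, -1))
```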

One way to remember how slices work is to think of the indexes as pointing between characters, with the left edge of the first character numbered 0. The right edge of the last character of a string of n characters then has index n, for example:

 +---+---+---+---+---+
 | H | e | l | p | A |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1

The first row of numbers, 0...5, gives the positions of the indexes in the string, and the second row gives the corresponding negative indexes. The slice from i to j consists of all the characters between the edges labeled i and j. For non-negative indexes, if both are within bounds, the length of the slice is the difference of the indexes. For example, the length of word[1:3] is 2. The built-in function len() returns the length of a string:

>>> s = 'supercalifragilisticexpialidocious'
>>> len(s)
34
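The length rule for in-range slices, and the fact that len() counts characters rather than bytes, can both be checked briefly. The sample strings below are chosen only for illustration.

```python
word = "HelpA"

# For 0 <= i <= j <= len(word), the slice length is j - i.
lengths_match = all(len(word[i:j]) == j - i for i, j in [(0, 2), (1, 3), (2, 5)])
print(lengths_match)

# len() of a str counts characters; len() of the encoded bytes counts bytes.
s = "中文"
print(len(s), len(s.encode("utf-8")))
```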

Python formatted output

The last common question is how to output a formatted string. We often need to output strings such as 'Dear xxx, hello! Your bill for month xx is xx, and your balance is xx', where the xxx parts vary with variables, so we need a convenient way to format strings. One way in Python follows the C language and is implemented with the % operator, for example:

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

Another approach is to use the format() method of strings:

name = "hello, {0}'s phone number is {1}."
info = name.format('brutyli', 13853966793)
print(info)

The output is as follows:

hello, brutyli's phone number is 13853966793.
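Besides positional fields such as {0}, format() also accepts named fields and format specifications, and Python 3.6 or later offers f-strings with the same syntax. The sketch below assumes Python 3.6+; the variable names are illustrative.

```python
name, balance = "Michael", 1000000

by_position = "Hi, {0}, you have ${1}.".format(name, balance)
# A named field with a format spec: "," inserts thousands separators.
by_name = "Hi, {name}, you have ${balance:,}.".format(name=name, balance=balance)
# An f-string evaluates the expressions in braces in place.
f_string = f"Hi, {name}, you have ${balance:,}."

print(by_position)
print(by_name)
print(f_string)
```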

As you may have guessed from the first example, the % operator formats a string. Inside the string, %s is replaced by a string and %d by an integer. There must be as many variables or values after the % operator as there are % placeholders, in the same order. If there is only one placeholder, the parentheses can be omitted.

Common placeholders are:

%d  integer
%f  floating-point number
%s  string
%x  hexadecimal integer
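A short sketch of the most frequently used % placeholders follows. The sample values are arbitrary and chosen only for illustration.

```python
formatted = [
    "%d" % 42,              # integer
    "%.2f" % 3.14159,       # float rounded to two decimal places
    "%s" % "text",          # string (other objects are converted with str())
    "%x" % 255,             # hexadecimal integer
    "%5d|%-5d|" % (7, 7),   # minimum width: right-aligned, then left-aligned
    "%d%%" % 50,            # %% produces a literal percent sign
]

for line in formatted:
    print(line)
```

If you are unsure which placeholder fits a value, %s works for any type because it converts the value to a string first.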
