Details about text processing in Python

Last Update:2015-04-13 Source: Internet

Author: User

Tags character classes string methods


Strings: immutable sequences

Like most high-level programming languages, Python treats variable-length strings as a basic type. Python allocates memory behind the scenes to hold strings (or other values), so programmers do not have to think about it. Python also has some string-handling features not found in other high-level languages.

In Python, strings are "immutable sequences". Although a string (like a tuple) cannot be modified in place, a program can refer to elements or subsequences of a string just as with any sequence. Python refers to subsequences with a flexible "slice" operation, whose format resembles a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices

>>> s = "mary had a little lamb"
>>> s[0]          # index is zero-based
'm'
>>> s[3] = 'x'    # changing element in-place fails
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s[11:18]      # 'slice' a subsequence
'little '
>>> s[:4]         # empty slice-begin assumes zero
'mary'
>>> s[4]          # index 4 is not included in slice [:4]
' '
>>> s[5:-5]       # can use "from end" index with negatives
'had a little'
>>> s[:5]+s[5:]   # slice-begin & slice-end are complementary
'mary had a little lamb'
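Because a string cannot be changed in place, the usual workaround is to build a new string from slices of the old one. The following is a minimal sketch of that idea, assuming Python 3 syntax (the session above uses the older print statement); the variable names are illustrative only.

```python
# Sketch in Python 3 syntax; names such as assignment_failed are illustrative.
s = "mary had a little lamb"

# Strings are immutable, so item assignment raises TypeError.
try:
    s[3] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a modified copy from two slices instead of changing s in place.
t = s[:3] + "x" + s[4:]

print(assignment_failed)
print(t)
print(s)  # the original string is untouched
```

The slices s[:3] and s[4:] skip exactly the one character at index 3, which is why the new string differs from the old one in a single position.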

Another powerful string operation is the simple in keyword. It offers two intuitive and useful constructs:
The in keyword

>>> s = "mary had a little lamb"
>>> for c in s[11:18]: print c,   # print each char in slice
...
l i t t l e
>>> if 'x' in s: print 'got x'    # test for char occurrence
...
>>> if 'y' in s: print 'got y'    # test for char occurrence
...
got y
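The same two uses of in, iteration and membership testing, can be sketched in Python 3 syntax as follows. This is an illustration rather than part of the original session; note that in current Python versions the membership test also accepts multi-character substrings, not only single characters.

```python
# Sketch in Python 3 syntax; variable names are illustrative.
s = "mary had a little lamb"

# Iteration: visit each character of a slice, skipping the trailing space.
letters = [c for c in s[11:18] if c != " "]

# Membership: test whether a character (or a longer substring) occurs.
has_x = "x" in s
has_y = "y" in s
has_lamb = "lamb" in s

print(letters)
print(has_x, has_y, has_lamb)
```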

Python offers several ways to write string literals. You can use single or double quotation marks, as long as the opening and closing marks match, and there are other useful variants. If a string contains line breaks or embedded quotation marks, triple quotation marks are a convenient way to define it, as in the following example:
Use of triple quotation marks

>>> s2 = """Mary had a little lamb
... its fleece was white as snow
... and everywhere that Mary went
... the lamb was sure to go"""
>>> print s2
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go

A string in single, double, or triple quotes can be preceded by the letter "r" to tell Python not to interpret the backslash sequences that are special in regular expressions. For example:
Using "r-strings"

>>> s3 = "this \n and \n that"
>>> print s3
this
 and
 that
>>> s4 = r"this \n and \n that"
>>> print s4
this \n and \n that

In an "r-string", a backslash that would otherwise begin an escape sequence is treated as an ordinary backslash. This topic will come up again in later discussions of regular expressions.
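To see why this matters for regular expressions, consider the sequence \b, which Python reads as a backspace character in an ordinary string but which the re module reads as a word boundary. The sketch below assumes Python 3 syntax, and the sample text and variable names are chosen only for illustration.

```python
import re

plain = "this \n and \n that"    # \n becomes a real newline character
raw = r"this \n and \n that"     # backslash and "n" remain two characters

newlines_plain = plain.count("\n")
newlines_raw = raw.count("\n")
length_diff = len(raw) - len(plain)

text = "mary had a little lamb"
# In the raw string, \b reaches the re module intact and means "word boundary".
word_raw = re.search(r"\bhad\b", text)
# In the ordinary string, "\b" is already a backspace character, so the
# pattern looks for backspace + "had" + backspace and finds nothing.
word_plain = re.search("\bhad\b", text)

print(newlines_plain, newlines_raw, length_diff)
print(word_raw is not None, word_plain is not None)
```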

Files and string variables

When we talk about "text processing", we usually mean processing the contents of a text. Python makes it easy to read the contents of a text file into a string variable that can then be manipulated. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each method can take an argument that limits how much data is read at a time, but they are usually called without one. .read() reads the entire file at once and is typically used to put the file contents into a string variable. Although .read() gives the most direct string representation of a file's contents, it is inconvenient for sequential line-oriented processing, and it is impossible if the file is larger than available memory.

.readline() and .readlines() are very similar. Both are used in constructs like the following:
Python .readlines() example

fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print line

The difference between .readline() and .readlines() is that the latter, like .read(), reads the entire file at once. .readlines() automatically parses the file contents into a list of lines that can be processed by Python's for ... in ... construct. .readline(), on the other hand, reads only a single line at a time and is generally much slower than .readlines(). Use .readline() only when there is not enough memory to read the entire file at once.
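The three read methods can be compared side by side on a small file. The sketch below assumes Python 3 and creates its own temporary file so that it is self-contained; the file contents and variable names are invented for illustration. It also shows iteration over the file object itself, which later Python versions support and which reads one line at a time without loading the whole file.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on any real path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

with open(path) as fh:
    whole = fh.read()         # entire file as one string

with open(path) as fh:
    lines = fh.readlines()    # entire file as a list of lines

with open(path) as fh:
    first = fh.readline()     # a single line per call

# Iterating over the file object yields lines lazily, one at a time.
with open(path) as fh:
    lazy = [line.rstrip("\n") for line in fh]

os.remove(path)
print(len(whole), len(lines), repr(first), lazy)
```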

If you are using a standard module that works with files, you can turn a string into a "virtual file" with the cStringIO module (the StringIO module can be used instead when you need to subclass it, but beginners are unlikely to need that). For example:
The cStringIO module

>>> import cStringIO
>>> fh = cStringIO.StringIO()
>>> fh.write("mary had a little lamb")
>>> fh.getvalue()
'mary had a little lamb'
>>> fh.seek(5)
>>> fh.write('ATE')
>>> fh.getvalue()
'mary ATE a little lamb'

Keep in mind, however, that unlike a real file, a cStringIO "virtual file" is not persistent. Unless you save it (for example by writing it to a real file, or by using the shelve module or a database), it disappears when the program ends.
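In Python 3 the cStringIO module no longer exists, and the same kind of in-memory "virtual file" is provided by io.StringIO. A rough equivalent of the session above, offered as a sketch under that assumption, looks like this:

```python
import io

# io.StringIO is the Python 3 counterpart of the in-memory file shown above.
fh = io.StringIO()
fh.write("mary had a little lamb")
before = fh.getvalue()

# Move to offset 5 and overwrite three characters, as with a real file.
fh.seek(5)
fh.write("ATE")
after = fh.getvalue()

# The object can also be read back like a file.
fh.seek(0)
first_word = fh.read(4)

print(before)
print(after)
print(first_word)
```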

Standard module: string

The string module is perhaps the most generally useful module in the standard Python distribution. In fact, in Python 1.6 and later, the functions of the string module are to become built-in string methods (details had not been released at the time of writing). Certainly, any program that performs a text processing task should probably start with the following line:
Starting to use string

import string

A general rule of thumb is that if the string module can do the job, it is the right tool for it. Compared with re (regular expressions), string functions are usually faster, and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast ones written in C, are suited to specialized tasks, but for portability and familiarity it is still best to stick with string whenever possible. There are exceptions, but not as many as you might think if you are used to other languages.

The string module contains several kinds of things, such as functions, methods, and classes. It also contains strings of common constants. For example:
String usage Example 1

>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Although you could write these constants by hand, the string versions ensure that the constants are correct for the locale and platform on which your Python script runs.
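For readers on Python 3, note that string.uppercase is no longer available; the closest constant is string.ascii_uppercase, which is fixed to the ASCII letters and does not vary with locale. The short sketch below assumes Python 3 and uses these constants for a simple character test; the helper name is hypothetical.

```python
import string

upper = string.ascii_uppercase   # fixed ASCII letters in Python 3
spaces = string.whitespace       # space, tab, newline, and similar characters

def is_upper_ascii(text):
    """Hypothetical helper: True if every character is an ASCII capital."""
    return all(c in string.ascii_uppercase for c in text)

print(upper)
print(is_upper_ascii("LAMB"), is_upper_ascii("Lamb"))
```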

string also includes functions that transform strings in common ways (and that can be combined to perform less common transformations). For example:
String usage Example 2

>>> import string
>>> s = "mary had a little lamb"
>>> string.capwords(s)
'Mary Had A Little Lamb'
>>> string.replace(s, 'little', 'ferocious')
'mary had a ferocious lamb'

There are many other conversions not described here; you can find details in the Python manual.

You can also use string functions to report string attributes, such as the length of a string or the position of a substring. For example:
String usage Example 3

>>> import string
>>> s = "mary had a little lamb"
>>> string.find(s, 'had')
5
>>> string.count(s, 'a')
4

Finally, string provides something very Pythonic. The pair .split() and .join() gives you a quick way to convert between strings and sequences such as lists and tuples, which you will find remarkably useful. Usage is simple:
String usage Example 4

>>> import string
>>> s = "mary had a little lamb"
>>> L = string.split(s)
>>> L
['mary', 'had', 'a', 'little', 'lamb']
>>> string.join(L, "-")
'mary-had-a-little-lamb'

Of course, you will probably do other things with a list besides .join() it (for example, something involving the familiar for ... in ... construct).
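As noted earlier in this section, the string functions were due to become built-in string methods. The sketch below shows roughly how the examples in this section look in that method form, assuming Python 3, where module-level functions such as string.replace and string.split are no longer available but string.capwords still is.

```python
import string

s = "mary had a little lamb"

capitalized = string.capwords(s)              # still a module-level function
replaced = s.replace("little", "ferocious")   # method form of string.replace
position = s.find("had")                      # method form of string.find
count = s.count("a")                          # method form of string.count

words = s.split()                             # string -> list of words
joined = "-".join(words)                      # list -> string; the separator comes first

print(capitalized)
print(replaced)
print(position, count)
print(words)
print(joined)
```

Note that in the method form .join() is called on the separator string and receives the list as its argument, the reverse of the argument order in string.join(L, "-").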

Standard module: re

The re module supersedes the regex and regsub modules used in older Python code. Although regex still has a few limited advantages, they are minor and not worth relying on in new code. The obsolete modules are likely to be dropped from future Python releases, and version 1.6 may well have an improved, interface-compatible re module. So stick with re for regular expressions.

Regular expressions are complicated. One could write a book on the topic, and in fact a number of people have! This article tries to capture the overall shape of regular expressions so that readers can get a feel for them.

A regular expression is a concise way of describing a pattern that might occur in a text. Do certain characters occur? In a particular order? Do subpatterns repeat a given number of times? Are other subpatterns excluded from a match? Conceptually, this is not unlike the way you would intuitively describe a pattern in natural language. The trick is to encode that description in the compact syntax of regular expressions.

When approaching a regular expression, treat it as its own programming problem, even though only one or two lines of code may be involved; those lines effectively form a small program.

Start with the smallest pieces. At its most basic, any regular expression involves matching particular "character classes". The simplest character class is a single character, which is simply included in the pattern as a literal. Often you want to match a class of characters. You indicate a class by enclosing it in square brackets; within the brackets, you can have both a set of characters and character ranges indicated with a dash. You can also use a number of named character classes that are determined by your platform and locale. Some examples:
Character classes

>>> import re
>>> s = "mary had a little lamb"
>>> if re.search("m", s): print "Match!"       # char literal
...
Match!
>>> if re.search("[@A-Z]", s): print "Match!"  # char class
...     # match either at-sign or capital letter
...
>>> if re.search("\d", s): print "Match!"      # digits class
...
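The same three tests can be written as expressions that return True or False, which makes the outcome of each character class explicit. This is a sketch in Python 3 syntax; the helper found is hypothetical, and the digit pattern is written as a raw string so that the backslash reaches re unchanged.

```python
import re

s = "mary had a little lamb"

def found(pattern, text):
    """Hypothetical helper: True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

has_m = found("m", s)                # single literal character
has_at_or_cap = found("[@A-Z]", s)   # at-sign or any capital letter
has_digit = found(r"\d", s)          # named class for digits

# The same classes do match once suitable characters are present.
cap_in_title = found("[@A-Z]", "Mary had a little lamb")
digit_in_count = found(r"\d", "mary had 2 little lambs")

print(has_m, has_at_or_cap, has_digit)
print(cap_in_title, digit_in_count)
```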

Character classes can be thought of as the "atoms" of regular expressions, and you will usually want to combine those atoms into "molecules". You do this with a combination of grouping and repetition. Grouping is indicated by parentheses: any subexpression enclosed in parentheses is treated as an atom for the purposes of further grouping or repetition. Repetition is indicated by one of several operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, consider the following expression:
Sample regular expression

ABC([d-w]*\d\d?)+XYZ

For a string to match this expression, it must start with "ABC" and end with "XYZ", but what must be in the middle? The middle subexpression is ([d-w]*\d\d?), followed by the "one or more" operator. So the middle of the string must contain one (or two, or a thousand) pieces that each match the subexpression in parentheses. The string "ABCXYZ" does not match, because it has none of the required material in the middle.

So what is this inner subexpression? It starts with zero or more letters in the range d-w. Note that zero letters is a valid match, which may seem counterintuitive if you use the English word "some" to describe it. Next, the string must have exactly one digit, followed by zero or one additional digit. (The first digit character class has no repetition operator, so it occurs exactly once. The second digit character class has the "?" operator.) In short, this translates to "either one or two digits". Here are some strings that match the regular expression:
Strings matching the sample expression

ABC1234567890XYZ
ABCd12e1f37g3XYZ
ABC1XYZ

Here are some strings that do not match the regular expression (try to work out why they do not match):
Strings that do not match the sample expression

ABC123456789dXYZ
ABCdefghijklmnopqrstuvwXYZ
ABcd12e1f37g3XYZ
ABC12345%67890XYZ
ABCD12E1F37G3XYZ
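The matching and non-matching strings in this section can be checked mechanically. The sketch below assumes Python 3 and assumes that the sample expression reads ABC([d-w]*\d\d?)+XYZ, that is, one required digit followed by one optional digit, as the prose describes. It uses fullmatch so that the string really must start with "ABC" and end with "XYZ"; the comments give one possible explanation for each failure.

```python
import re

# Assumed reading of the sample expression, following the prose description.
PATTERN = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

should_match = ["ABC1234567890XYZ", "ABCd12e1f37g3XYZ", "ABC1XYZ"]
should_fail = [
    "ABCXYZ",                      # nothing between ABC and XYZ
    "ABC123456789dXYZ",            # trailing "d" is not followed by a digit
    "ABCdefghijklmnopqrstuvwXYZ",  # letters but no digit at all
    "ABcd12e1f37g3XYZ",            # lowercase "c", so no literal "ABC"
    "ABC12345%67890XYZ",           # "%" fits neither character class
    "ABCD12E1F37G3XYZ",            # capital letters are outside the range d-w
]

def is_match(text):
    """True if the whole string matches the sample expression."""
    return PATTERN.fullmatch(text) is not None

for text in should_match + should_fail:
    print(text, is_match(text))
```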

It takes some practice to get comfortable creating and understanding regular expressions. Once you have mastered them, however, you have a great deal of expressive power at your disposal. That said, it is often easy to reach for a regular expression to solve a problem that could actually be solved with simpler (and faster) tools, such as string.

