Detailed text processing in Python

Last Update:2016-06-06 Source: Internet

Author: User

Tags: character classes, string methods


Strings: immutable sequences

As in most high-level programming languages, variable-length strings are a basic type in Python. Python allocates memory in the background to hold strings (or other values), so programmers do not have to worry about it. Python also has some string-handling capabilities that other high-level languages lack.

In Python, strings are "immutable sequences". Although strings, like tuples, cannot be modified in place, a program can refer to the elements or subsequences of a string just as it can with any other sequence. Python refers to subsequences with the flexible "slice" operation, whose format resembles a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices

    >>> s = "mary had a little lamb"
    >>> s[0]          # index is zero-based
    'm'
    >>> s[3] = 'x'    # changing element in-place fails
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: object doesn't support item assignment
    >>> s[11:18]      # 'slice' a subsequence
    'little '
    >>> s[:4]         # empty slice-begin assumes zero
    'mary'
    >>> s[4]          # index 4 is not included in slice [:4]
    ' '
    >>> s[5:-5]       # can use "from end" index with negatives
    'had a little'
    >>> s[:5]+s[5:]   # slice-begin & slice-end are complementary
    'mary had a little lamb'
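Because strings are immutable, the usual way to "change" one character is to build a new string from slices. The following is a minimal sketch, assuming a modern Python 3 interpreter rather than the older interpreter used in the sessions of this article; the helper name replace_at is hypothetical and chosen only for illustration.

```python
def replace_at(s, index, new_char):
    """Return a copy of s with the character at index replaced."""
    # The slices on either side of index are joined around the new
    # character; the original string is never modified.
    return s[:index] + new_char + s[index + 1:]


s = "mary had a little lamb"
t = replace_at(s, 3, "x")

print(t)  # marx had a little lamb
print(s)  # mary had a little lamb (unchanged)
```

The complementary slices s[:index] and s[index + 1:] skip exactly one character, which is the same property the session above shows with s[:5]+s[5:].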

Another powerful string operation is the simple in keyword. It provides two intuitive and efficient constructs:
in keyword

    >>> s = "mary had a little lamb"
    >>> for c in s[11:18]: print c,  # print each char in slice
    ...
    l i t t l e
    >>> if 'x' in s: print 'got x'   # test for char occurrence
    ...
    >>> if 'y' in s: print 'got y'   # test for char occurrence
    ...
    got y
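The same two constructs can be sketched for a current interpreter. This assumes Python 3, where print is a function and where in also accepts a multi-character substring on the left-hand side; the variable names are illustrative only.

```python
s = "mary had a little lamb"

# Iteration: visit each character of a slice in order
chars = [c for c in s[11:18]]

# Membership: test whether a character (or substring) occurs
has_x = "x" in s
has_y = "y" in s
has_lamb = "lamb" in s  # substring test, assuming Python 3

print(chars)
print(has_x, has_y, has_lamb)
```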

There are several ways to compose string literals in Python. Single or double quotation marks can be used, as long as the opening and closing quotes match, and other quoting variations are often useful. If the string contains line breaks or embedded quotation marks, triple quotes are an easy way to define it, as in the following example:
Use of triple quotes

    >>> s2 = """Mary had a little lamb
    ... its fleece is white as snow
    ... and everywhere that Mary went
    ... the lamb is sure to go"""
    >>> print s2
    Mary had a little lamb
    its fleece is white as snow
    and everywhere that Mary went
    the lamb is sure to go

Both single-quoted and triple-quoted strings can be preceded by the letter "r" to indicate that Python should not interpret backslash escapes, which is especially useful for the special characters of regular expressions. For example:
Use of "r-strings"

    >>> s3 = "this \n and \n that"
    >>> print s3
    this
     and
     that
    >>> s4 = r"this \n and \n that"
    >>> print s4
    this \n and \n that

In "r-strings", a backslash that might otherwise begin an escape code is treated as an ordinary backslash. This topic is explained further in the later discussion of regular expressions.
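One way to see the difference is to compare the lengths of the two strings: in the ordinary literal each \n is a single newline character, while in the raw literal it is two characters, a backslash and the letter n. A small sketch, assuming Python 3; the variable names cooked and raw are illustrative.

```python
cooked = "this \n and \n that"   # \n is interpreted as a newline
raw = r"this \n and \n that"     # backslash and 'n' are kept as-is

print(len(cooked), len(raw))     # 17 19
print(cooked.count("\n"))        # 2 newline characters
print(raw.count("\n"))           # 0 newline characters
print(raw.count("\\n"))          # 2 backslash-n pairs
```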

File and string variables

When we talk about "text processing," we usually mean processing the contents of a file. Python makes it easy to read the contents of a text file into a string variable, where they can be manipulated. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each method can accept an argument that limits the amount of data read at one time, but they are usually called without arguments. .read() reads the entire file at once and is typically used to place the contents of a file into a string variable. Although .read() produces the most direct string representation of the file contents, it is inconvenient for sequential line-oriented processing, and it is not possible if the file is larger than available memory.

.readline() and .readlines() are very similar. Both are used in constructs like the following:
Python .readlines() example

    fh = open('c:\\autoexec.bat')
    for line in fh.readlines():
        print line

The difference between .readline() and .readlines() is that the latter, like .read(), reads the entire file at once. .readlines() automatically parses the file contents into a list of lines that can be processed with Python's for ... in ... construct. By contrast, .readline() reads only one line at a time and is usually much slower than .readlines(). You should use .readline() only if there is not enough memory to read the entire file at once.
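The three read methods can be compared side by side. The sketch below assumes Python 3 and writes a small temporary file first so that it is self-contained; the file contents are made up for illustration. It also shows iterating over the file object itself, which in current Python versions reads one line at a time without loading the whole file.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

with open(path) as fh:
    whole = fh.read()        # the entire file as one string

with open(path) as fh:
    lines = fh.readlines()   # a list of lines, newlines included

with open(path) as fh:
    first = fh.readline()    # only the first line

# Iterating over the file object reads lazily, line by line.
stripped = []
with open(path) as fh:
    for line in fh:
        stripped.append(line.rstrip("\n"))

os.remove(path)
print(repr(whole))
print(lines)
print(repr(first))
print(stripped)
```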

If you are using a standard module that works with files, you can use the cStringIO module to turn a string into a "virtual file" (if you need to subclass the module, you can use the StringIO module instead, which beginners are unlikely to need). For example:
cStringIO module

    >>> import cStringIO
    >>> fh = cStringIO.StringIO()
    >>> fh.write("mary had a little lamb")
    >>> fh.getvalue()
    'mary had a little lamb'
    >>> fh.seek(5)
    >>> fh.write('ate')
    >>> fh.getvalue()
    'mary ate a little lamb'

However, keep in mind that, unlike real files, cStringIO "virtual files" are not persistent. If you do not save the contents (for example by writing them to a real file, or by using the shelve module or a database), they disappear when the program ends.
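In Python 3 the cStringIO module no longer exists; the standard io module provides io.StringIO with the same "virtual file" behavior. The following sketch assumes Python 3 and mirrors the session above.

```python
import io

fh = io.StringIO()
fh.write("mary had a little lamb")
before = fh.getvalue()

fh.seek(5)           # move to the start of "had"
fh.write("ate")      # overwrite three characters in place
after = fh.getvalue()

# A StringIO object can also be read like a file opened for reading.
lines = io.StringIO("line one\nline two\n").readlines()

print(before)  # mary had a little lamb
print(after)   # mary ate a little lamb
print(lines)
```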

Standard module: string

The string module may be the most commonly used module in the Python 1.5.* standard distribution. In fact, in Python 1.6 and later, the functionality of the string module will be available as built-in string methods (details had not yet been published at the time of writing). Certainly, any program that performs text processing tasks should probably begin with the following line:
Starting to use the string module

    import string

The general rule of thumb is that if you can accomplish a task with the string module, then that is the right approach. String functions are usually faster than re (regular expressions), and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast ones written in C, are suitable for specialized tasks, but portability and familiarity still suggest sticking with string whenever possible. There are exceptions, but not as many as you might think if you are used to other languages.

The string module contains several kinds of things, such as functions, methods, and classes; it also contains strings of common constants. For example:
string usage example 1

    >>> import string
    >>> string.whitespace
    '\011\012\013\014\015 '
    >>> string.uppercase
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Although you could write these constants by hand, the string versions more or less ensure that the constants are correct for the national language and platform on which your Python script runs.

The string module also includes functions that transform strings in common ways, and these can be combined to perform several less common transformations. For example:
string usage example 2

    >>> import string
    >>> s = "mary had a little lamb"
    >>> string.capwords(s)
    'Mary Had A Little Lamb'
    >>> string.replace(s, 'little', 'ferocious')
    'mary had a ferocious lamb'

There are many other transformations that are not specifically described here; you can find more information in the Python manual.

You can also use string functions to report string attributes, such as the length of a string or the position of a substring. For example:
string usage example 3

    >>> import string
    >>> s = "mary had a little lamb"
    >>> string.find(s, 'had')
    5
    >>> string.count(s, 'a')
    4

Finally, string provides a very Pythonic oddity: the pair .split() and .join() offers a quick way to convert between strings and sequences such as lists and tuples, and you will find them useful. The usage is simple:
string usage example 4

    >>> import string
    >>> s = "mary had a little lamb"
    >>> L = string.split(s)
    >>> L
    ['mary', 'had', 'a', 'little', 'lamb']
    >>> string.join(L, "-")
    'mary-had-a-little-lamb'

Of course, besides calling .join(), you might do other things with the list (for example, something involving the familiar for ... in ... construct).
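The built-in string methods anticipated above are how the same operations are written today. The following sketch assumes Python 3, where string.replace, string.find, string.count, string.split, and string.join have become methods of str, and string.uppercase has become string.ascii_uppercase; string.capwords remains a module function.

```python
import string

s = "mary had a little lamb"

# Constants: string.uppercase is string.ascii_uppercase in Python 3
upper = string.ascii_uppercase

# Transformations
capped = string.capwords(s)                  # still a module function
swapped = s.replace("little", "ferocious")   # method on the string itself

# Reporting
pos = s.find("had")     # index of the first match, or -1 if absent
count = s.count("a")    # number of non-overlapping occurrences

# Splitting and joining: the separator string owns the join method
words = s.split()
joined = "-".join(words)

print(upper)       # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(capped)      # Mary Had A Little Lamb
print(swapped)     # mary had a ferocious lamb
print(pos, count)  # 5 4
print(joined)      # mary-had-a-little-lamb
```

Note that join is called on the separator rather than on the list, which is the main surprise when moving from string.join(L, "-") to the method form.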

Standard module: re

The re module obsoletes the regex and regsub modules used in older Python code. Although regex still has a few limited advantages, they are minor and do not justify its use in new code. The obsolete modules are likely to be dropped from future Python releases, and version 1.6 is likely to have an improved, interface-compatible re module. So stick with re for regular expressions.

Regular expressions are complex. One could write a book on the subject, and in fact a number of people have! This article attempts to capture the "gestalt" of regular expressions so that the reader can get a feel for them.

A regular expression is a concise way of describing a pattern that might occur in a text. Do certain characters occur? Do they occur in a particular order? Are subpatterns repeated a certain number of times? Are other subpatterns excluded from a match? Conceptually, this is not unlike the way you would intuitively describe a pattern in natural language. The trick is to encode this description in the compact syntax of regular expressions.

When working with a regular expression, treat it as a programming problem in its own right, even though only one or two lines of code may be involved; those lines effectively form a small program.

Start with the smallest pieces. At the most basic level, any regular expression involves matching particular "character classes". The simplest character class is a single character, which is simply included in the pattern as a literal. Frequently, you will want to match a class of characters. You can indicate a class by enclosing it in square brackets; within the brackets, you can have a set of characters or a range of characters specified with a dash. You can also use a number of named character classes that are accurate for your platform and national language. Here are some examples:
Character classes

    >>> import re
    >>> s = "mary had a little lamb"
    >>> if re.search("m", s): print "Match!"       # char literal
    ...
    Match!
    >>> if re.search("[@A-Z]", s): print "Match!"  # char class
    ...     # match either at-sign or capital letter
    ...
    >>> if re.search("\d", s): print "Match!"      # digits class
    ...
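The same three searches can be written for a current interpreter. This sketch assumes Python 3 and uses raw strings for the patterns so that backslashes reach the re module unchanged; re.search returns a match object on success and None otherwise.

```python
import re

s = "mary had a little lamb"

literal = re.search(r"m", s)             # single literal character
at_or_capital = re.search(r"[@A-Z]", s)  # at-sign or a capital letter
digit = re.search(r"\d", s)              # any decimal digit

# The bracketed class does match once a capital letter is present.
capital = re.search(r"[@A-Z]", "Mary had a little lamb")

print(literal.group())   # m
print(at_or_capital)     # None
print(digit)             # None
print(capital.group())   # M
```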

Character classes can be thought of as the "atoms" of regular expressions, and you will usually want to combine those atoms into "molecules". You do this with a combination of grouping and repetition. Grouping is indicated by parentheses: any subexpression enclosed in parentheses is treated as an atom for the purpose of further grouping or repetition. Repetition is indicated by one of several operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, consider the following expression:
Sample regular expression

    ABC([d-w]*\d\d?)+XYZ

For a string to match this expression, it must start with "ABC" and end with "XYZ", but what must it have in the middle? The middle subexpression is ([d-w]*\d\d?), followed by the "one or more" operator. So the middle of the string must consist of one (or two, or one thousand) sequences of characters that match the subexpression in parentheses. The string "ABCXYZ" does not match, because it does not have the required material in the middle.

But what is this inner subexpression? It starts with zero or more letters in the range d-w. Note that zero letters is a valid match, even though describing it with the English word "some" may feel awkward. Next, the string must have exactly one digit, followed by zero or one additional digit. (The first digit character class has no repetition operator, so it occurs exactly once. The second digit character class has the "?" operator.) In short, this translates to "one or two digits". Here are some strings that match the regular expression:
Strings that match the sample expression

    ABC1234567890XYZ
    ABCd12e1f37g3XYZ
    ABC1XYZ

There are also strings that do not match the regular expression (think about why they do not match):
Strings that do not match the sample expression

    ABC123456789dXYZ
    ABCdefghijklmnopqrstuvwXYZ
    ABcd12e1f37g3XYZ
    ABC12345%67890XYZ
    ABCD12E1F37G3XYZ
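A quick way to check such reasoning is to run the sample expression against candidate strings. The sketch below assumes Python 3 and uses re.fullmatch so that the whole string, from "ABC" to "XYZ", must match. The test strings are written with the capitalization the pattern expects, and the list names should_match and should_fail are illustrative only.

```python
import re

pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

should_match = [
    "ABC1234567890XYZ",
    "ABCd12e1f37g3XYZ",
    "ABC1XYZ",
]

should_fail = [
    "ABCXYZ",                      # nothing in the middle
    "ABC123456789dXYZ",            # trailing letter with no digit after it
    "ABCdefghijklmnopqrstuvwXYZ",  # letters but no digits
    "ABcd12e1f37g3XYZ",            # does not start with literal "ABC"
    "ABC12345%67890XYZ",           # "%" is not allowed by the pattern
    "ABCD12E1F37G3XYZ",            # capital letters are outside d-w
]

for text in should_match + should_fail:
    result = "match" if pattern.fullmatch(text) else "no match"
    print(f"{text:30} {result}")
```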

It takes some practice to get used to creating and understanding regular expressions. Once you have mastered them, however, you have a great deal of expressive power at your disposal. That said, it is often easy to reach for regular expressions to solve a problem that could actually be solved with simpler (and faster) tools, such as string.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
