Detailed text processing in Python


Strings--Immutable sequences

As in most high-level programming languages, variable-length strings are a basic type in Python. Python allocates memory behind the scenes to hold strings (or other values), so programmers do not have to worry about it. Python also has some string-handling features that are not found in other high-level languages.

In Python, strings are "immutable sequences." Although a string, like a tuple, cannot be modified in place, a program can refer to elements or subsequences of a string just as with any other sequence. Python refers to subsequences with a flexible "slice" operation, whose format resembles a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices

>>> s = "mary had a little lamb"
>>> s[0]          # index is zero-based
'm'
>>> s[3] = 'x'    # changing element in-place fails
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s[11:18]      # 'slice' a subsequence
'little '
>>> s[:4]         # empty slice-begin assumes zero
'mary'
>>> s[4]          # index 4 is not included in slice [:4]
' '
>>> s[5:-5]       # can use 'from end' index with negatives
'had a little'
>>> s[:5]+s[5:]   # slice-begin & slice-end are complementary
'mary had a little lamb'
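The same indexing and slicing rules can be sketched as a small script in current Python 3 syntax (the session above uses the interpreter of the article's era). The variable names and the final "build a new string" step are illustrative additions, not part of the original session.

```python
s = "mary had a little lamb"

first = s[0]        # indexing is zero-based
word = s[11:18]     # from index 11 up to, but not including, 18
head = s[:4]        # an omitted slice start defaults to 0
middle = s[5:-5]    # negative indices count from the end

# Strings are immutable: item assignment raises TypeError.
try:
    s[3] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" a character, build a new string from slices instead.
changed = s[:3] + "x" + s[4:]

print(repr(first), repr(word), repr(head), repr(middle))
print(assignment_failed, repr(changed))
```

Because the original string is never modified, `s` still holds its initial value after `changed` is built.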

Another powerful string operation is the simple "in" keyword. It offers two intuitive and efficient constructs: iterating over the characters of a string, and testing whether a character occurs in it:
The in keyword

>>> s = "mary had a little lamb"
>>> for c in s[11:18]: print c,   # print each char in slice
...
l i t t l e
>>> if 'x' in s: print 'got x'    # test for char occurrence
...
>>> if 'y' in s: print 'got y'    # test for char occurrence
...
got y
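For readers on a current interpreter, here is a rough Python 3 equivalent of the two uses of "in". One difference worth knowing: in modern Python, "in" also tests for whole substrings, not just single characters. The variable names are illustrative.

```python
s = "mary had a little lamb"

# Iteration: visit each character of a slice in order.
chars = [c for c in s[11:17]]
print(" ".join(chars))

# Membership: test whether a character occurs in the string.
has_x = "x" in s
has_y = "y" in s

# In modern Python the same operator accepts a multi-character substring.
has_little = "little" in s

print(has_x, has_y, has_little)
```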

In Python, there are several ways to write string literals. You can use either single or double quotes, as long as the opening and closing quotes match, and there are other commonly used variations on quoting. If a string contains line breaks or embedded quotes, triple quotes are a convenient way to define it, as in the following example:
Use of triple quotes

>>> s2 = """Mary had a little lamb
... its fleece was white as snow
... and everywhere that Mary went
... the lamb was sure to go"""
>>> print s2
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go
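A short Python 3 sketch of the two situations triple quotes are meant for, embedded line breaks and embedded quote characters. The sample text and variable names are made up for illustration.

```python
# A triple-quoted literal may span several lines; the line breaks
# become newline characters inside the string.
verse = """Mary had a little lamb
its fleece was white as snow"""
lines = verse.split("\n")

# Both kinds of quote can appear unescaped inside triple quotes.
quote = '''She said "it's fine"'''

print(len(lines), lines[1])
print(quote)
```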

A single-quoted or triple-quoted string can be preceded by the letter "r" to indicate that Python should not interpret backslash escapes, which keeps regular expression special characters intact. For example:
Using "r-strings"

>>> s3 = "this \n and \n that"
>>> print s3
this
 and
 that
>>> s4 = r"this \n and \n that"
>>> print s4
this \n and \n that

In "r-strings", a backslash that would otherwise introduce an escaped character is treated as a regular backslash. This topic will be explained further in later discussions of regular expressions.
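The difference is easy to check by comparing an ordinary literal with its "r" counterpart. The following is a minimal Python 3 sketch; the variable names are illustrative.

```python
cooked = "this \n and \n that"   # \n is interpreted as a newline character
raw = r"this \n and \n that"     # backslash and 'n' are kept as two characters

print(len(cooked), len(raw))

# Each "\n" is one character in the ordinary string but two in the raw one.
extra_chars = len(raw) - len(cooked)

# A raw literal is just another spelling of a string with doubled backslashes.
same_pattern = (r"\d" == "\\d")
```

This is why regular expression patterns are commonly written as raw strings: the backslashes reach the pattern compiler unchanged.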

Files and string variables

When we talk about "text processing", we usually mean processing the contents of a file. Python makes it easy to read the contents of a text file into a string variable where it can be manipulated. File objects provide three read methods: .read(), .readline(), and .readlines(). Each method can take an argument that limits the amount of data read at a time, but they are usually called without one. .read() reads the entire file at once, and is typically used to place the contents of the file in a string variable. Although .read() produces the most direct string representation of a file's contents, it is awkward for sequential line-oriented processing, and it is not possible if the file is larger than available memory.

.readline() and .readlines() are very similar. Both are used in constructs like the following:
Python .readlines() example

fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print line

The difference between .readline() and .readlines() is that the latter, like .read(), reads the entire file at once. .readlines() automatically parses the file's contents into a list of lines that can be processed with Python's for ... in ... construct. .readline(), on the other hand, reads only a single line at a time, and is generally much slower than .readlines(). You should use .readline() only if there is not enough memory to read the entire file at once.
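To compare the three read methods side by side, the sketch below writes a small temporary file and reads it back each way, in Python 3 syntax. The file contents and variable names are invented for the example. The last variant, iterating over the file object directly, is not discussed in this article; it is the usual line-at-a-time idiom in modern Python.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as fh:
    fh.write("first line\nsecond line\nthird line\n")

with open(path) as fh:
    whole = fh.read()            # the entire file as one string

with open(path) as fh:
    all_lines = fh.readlines()   # the entire file as a list of lines

with open(path) as fh:
    first = fh.readline()        # one line per call, newline included
    second = fh.readline()

with open(path) as fh:
    # Iterating over the file object yields one line at a time.
    stripped = [line.rstrip("\n") for line in fh]

os.remove(path)
print(all_lines)
```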

If you are using a standard module that operates on files, you can use the cStringIO module to turn a string into a "virtual file" (if you need to subclass the module, you can use the StringIO module instead, but beginners are unlikely to need this). For example:
cStringIO module

>>> import cStringIO
>>> fh = cStringIO.StringIO()
>>> fh.write("mary had a little lamb")
>>> fh.getvalue()
'mary had a little lamb'
>>> fh.seek(5)
>>> fh.write('ATE')
>>> fh.getvalue()
'mary ATE a little lamb'

However, keep in mind that, unlike a real file, a cStringIO "virtual file" is not persistent. If you do not save it (for example by writing it to a real file, or by using the shelve module or a database), it disappears when the program ends.
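In Python 3 the cStringIO module no longer exists; io.StringIO from the standard library plays the same role. The following sketch repeats the session above under that assumption, with illustrative variable names.

```python
import io

fh = io.StringIO()                  # an in-memory, file-like object
fh.write("mary had a little lamb")
before = fh.getvalue()

fh.seek(5)                          # move the file position to index 5
fh.write("ATE")                     # overwrites the three characters there
after = fh.getvalue()

print(before)
print(after)
```

Any function that expects an open text file (something with .read(), .write(), or line iteration) can usually be handed such an object instead.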

Standard module: string

The string module is perhaps the most commonly used module in the Python 1.5.* standard distribution. In fact, in Python 1.6 or later, the functionality of the string module will be available as built-in string methods (details had not been published at the time of this writing). Certainly, any program that performs text processing tasks should probably begin with the following line:
Starting to use string

import string

The general rule of thumb is that if you can complete a task with the string module, that is the right way to do it. string functions are usually faster than re (regular expressions), and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast ones written in C, are suited to specialized tasks, but portability and familiarity still argue for using string whenever possible. There are exceptions, but not as many as you might think if you are used to other languages.

The string module contains several kinds of things, such as functions, methods, and classes; it also contains strings of common constants. For example:
string usage example 1

>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Although you could write these constants by hand, the string versions more or less ensure that the constants are correct for the national language and platform on which your Python script is running.
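The constants are still available in Python 3, though the upper-case letters are now spelled string.ascii_uppercase and are no longer locale-dependent. Here is a small sketch of how such constants are typically used; the sample text is invented.

```python
import string

upper = string.ascii_uppercase     # Python 3 name for the A-Z constant
text = "mary\thad a\nlamb"

# Count characters that belong to the predefined whitespace set.
ws_count = sum(1 for ch in text if ch in string.whitespace)

print(upper)
print(ws_count)
```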

string also includes functions that transform strings in common ways (and which can be combined to perform less common transformations). For example:
string usage example 2

>>> import string
>>> s = "mary had a little lamb"
>>> string.capwords(s)
'Mary Had A Little Lamb'
>>> string.replace(s, 'little', 'ferocious')
'mary had a ferocious lamb'

There are many other transformations that are not specifically described here; you can find details in the Python manual.
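On a current interpreter, string.capwords() is still a module function, while replacement is done with the built-in string method that the article anticipates for later Python versions. This is a minimal Python 3 sketch with illustrative variable names.

```python
import string

s = "mary had a little lamb"

title = string.capwords(s)                  # capitalize each word
fierce = s.replace("little", "ferocious")   # built-in string method

# Transformations return new strings and can be chained.
both = string.capwords(s.replace("little", "ferocious"))

print(title)
print(fierce)
print(both)
```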

You can also use string functions to report string properties, such as the position of a substring or the number of times it occurs. For example:
string usage example 3

>>> import string
>>> s = "mary had a little lamb"
>>> string.find(s, 'had')
5
>>> string.count(s, 'a')
4
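The same queries written as Python 3 string methods, as a rough sketch. The "wolf" lookup is an added illustration of what .find() returns when the substring is absent.

```python
s = "mary had a little lamb"

pos = s.find("had")        # index of the first occurrence
missing = s.find("wolf")   # -1 when the substring does not occur
n_a = s.count("a")         # number of non-overlapping occurrences
length = len(s)            # length comes from the built-in len()

print(pos, missing, n_a, length)
```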

Finally, string provides a very Pythonic oddity. The pair .split() and .join() gives a quick way to convert between strings and lists, and you will find it remarkably useful. The usage is simple:
string usage example 4

>>> import string
>>> s = "mary had a little lamb"
>>> L = string.split(s)
>>> L
['mary', 'had', 'a', 'little', 'lamb']
>>> string.join(L, "-")
'mary-had-a-little-lamb'

Of course, you will probably do something else with the list besides passing it to .join() (such as something involving the familiar for ... in ... construct).
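In Python 3 both operations are string methods, and .join() is called on the separator. The sketch below also does "something else" with the list in between, here measuring word lengths as an arbitrary illustration.

```python
s = "mary had a little lamb"

words = s.split()                    # split on runs of whitespace
lengths = [len(w) for w in words]    # work on the list with for ... in
joined = "-".join(words)             # the separator string does the joining

print(words)
print(lengths)
print(joined)
```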

Standard module: re

The re module supersedes the regex and regsub modules used in older Python code. While regex still has a few limited advantages, they are minor and it is not worth using in new code. The obsolete modules may be removed from a future Python release, and version 1.6 may have an improved, interface-compatible re module. So stick with re for regular expressions.

Regular expressions are complicated. One could write a book on the subject, and in fact quite a few people have! This article tries to capture the overall shape of regular expressions and let the reader take it from there.

A regular expression is a concise way to describe a pattern that might occur in text. Do certain characters occur? Do they occur in a particular order? Do subpatterns repeat a given number of times? Do other subpatterns rule out a match? Conceptually, this is not very different from describing a pattern intuitively in natural language. The trick is to encode that description in the compact syntax of regular expressions.

When you work with a regular expression, treat it as a programming problem in its own right, even if only one or two lines of code are involved; those lines effectively form a small program.

Start small. At its most basic, any regular expression involves matching particular "character classes". The simplest character class is a single character, which is simply included in the pattern as a literal. Often, you want to match a class of characters. You can indicate a class by enclosing it in square brackets; within the brackets you can have a set of characters or character ranges indicated with a dash. A number of named character classes are also available that will be correct for your platform and national language. Here are some examples:
Character classes

>>> import re
>>> s = "mary had a little lamb"
>>> if re.search("m", s): print "Match!"        # char literal
...
Match!
>>> if re.search("[@A-Z]", s): print "Match!"   # char class
...                      # match either at-sign or capital letter
...
>>> if re.search("\d", s): print "Match!"       # digits class
...
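The three kinds of character class above behave the same way in the modern re module. In this Python 3 sketch the pattern for digits is written as a raw string, and the extra "Mary" test is an added illustration of a case where the bracketed class does match.

```python
import re

s = "mary had a little lamb"

literal_hit = re.search("m", s) is not None       # single literal character
class_hit = re.search("[@A-Z]", s) is not None    # at-sign or capital letter
digit_hit = re.search(r"\d", s) is not None       # named class: any digit

# The same bracketed class matches once a capital letter is present.
capital_hit = re.search("[@A-Z]", "Mary") is not None

print(literal_hit, class_hit, digit_hit, capital_hit)
```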

You can think of character classes as the "atoms" of regular expressions, and you will usually want to combine those atoms into "molecules." This is done with a combination of grouping and looping. Grouping is indicated by parentheses: any subexpression contained in parentheses is treated as an atom for purposes of further grouping or looping. Looping is indicated by one of several operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, consider the following expression:
Sample regular expression

ABC([d-w]*\d\d?)+XYZ

For a string to match this expression, it must begin with "ABC" and end with "XYZ", but what must it have in the middle? The middle subexpression is ([d-w]*\d\d?) and it is followed by the "one or more" operator. Therefore, the middle of the string must contain one (or two, or a thousand) runs of characters that match the subexpression in parentheses. The string "ABCXYZ" does not match because it has nothing of the required kind in the middle.

But what is this inner subexpression? It starts with zero or more letters in the range d-w. Note that zero letters is a valid match, even though it may feel awkward to describe that with the English word "some". Next, the string must have exactly one digit, followed by zero or one additional digit. (The first digit character class has no looping operator, so it occurs exactly once; the second digit character class has the "?" operator.) In short, this translates to "one or two digits". Here are some strings that match the regular expression:
Strings that match the sample expression

ABC1234567890XYZ
ABCd12e1f37g3XYZ
ABC1XYZ

There are also strings that do not match the regular expression (think about why they do not match):
Strings that do not match the sample expression

ABC123456789dXYZ
ABCdefghijklmnopqrstuvwXYZ
ABcd12e1f37g3XYZ
ABC12345%67890XYZ
ABCD12E1F37G3XYZ
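One way to check your reasoning is to run the sample expression against candidate strings. The sketch below assumes the pattern is written with upper-case "ABC" and "XYZ" and the lower-case range d-w, and it uses re.fullmatch() so that the whole string must fit the pattern. The list names are illustrative.

```python
import re

pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

should_match = ["ABC1234567890XYZ", "ABCd12e1f37g3XYZ", "ABC1XYZ"]
should_fail = [
    "ABCXYZ",                      # nothing in the middle
    "ABC123456789dXYZ",            # trailing letter not followed by a digit
    "ABCdefghijklmnopqrstuvwXYZ",  # letters but no digits at all
    "ABcd12e1f37g3XYZ",            # does not begin with "ABC"
    "ABC12345%67890XYZ",           # '%' is in neither character class
    "ABCD12E1F37G3XYZ",            # capital letters are outside d-w
]

matched = [t for t in should_match if pattern.fullmatch(t)]
rejected = [t for t in should_fail if not pattern.fullmatch(t)]

print(len(matched), "matched;", len(rejected), "rejected")
```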

It takes some practice to get used to creating and understanding regular expressions. Once you have mastered them, however, you have a great deal of expressive power. That said, it is often tempting to reach for a regular expression to solve a problem that could actually be solved with simpler (and faster) tools, such as string.
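As a small illustration of that last point, the following Python 3 sketch answers the same question, "does this word occur in the text?", once with plain string operations and once with re. The example is an assumption about a typical case, not a benchmark.

```python
import re

s = "mary had a little lamb"

# Plain string tools: short, readable, and no pattern syntax to get wrong.
found_plain = "little" in s
index_plain = s.find("little")

# The regular expression version gives the same answers here.
m = re.search("little", s)
found_re = m is not None
index_re = m.start() if m else -1

print(found_plain, index_plain, found_re, index_re)
```

When the thing being searched for is a fixed piece of text, the string version is usually the better choice; the regular expression earns its keep only when the pattern really varies.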
