Python for Informatics, Chapter 11: Regular Expressions (4)

Source: Internet
Author: User

Note: The original article is from Python for Informatics by Dr Charles Severance.

11.3 Combined Query and Extraction

Suppose we want to find the numbers on lines that start with the string "X-", such as these two:

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We do not want just any floating-point number from any line, though. We only want the numbers from lines that have the format shown above.

We can create the following regular expression to select such lines:

^X-.*: [0-9.]+

This expression matches lines that start with "X-", followed by zero or more characters ".*", followed by a colon ":" and then a space. After the space come one or more characters that are each either a digit (0-9) or a period, "[0-9.]+". Note that a period inside square brackets matches an actual period. It is not a wildcard there, unlike "." outside square brackets.
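The behavior described above can be checked on a few strings held in memory. This is only a sketch: the sample lines other than the two X-DSPAM lines are invented for illustration and do not come from the file.

```python
import re

# Only the first two lines follow the X-DSPAM format discussed above;
# the other two are made-up examples that should be rejected.
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Result: Innocent',
    'Subject: version 1.5 released',
]

pattern = '^X-.*: [0-9.]+'
matching = [line for line in lines if re.search(pattern, line)]
print(matching)

# Outside brackets "." is a wildcard; inside brackets it is a literal period.
print(re.findall('.', 'a.b'))    # ['a', '.', 'b']
print(re.findall('[.]', 'a.b'))  # ['.']
```

The third line starts with "X-" but has no digit or period after the colon and space, so it is rejected along with the "Subject:" line.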

This regular expression is very tight and matches essentially only the lines we are interested in:

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # Print only the lines that match the X- header pattern
    if re.search('^X-.*: [0-9.]+', line):
        print(line)

When we run this program, the output is filtered down to the lines we are looking for:

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000

We still have to extract the numbers from these lines, though. We could do that with split, but another feature of regular expressions lets us search and parse the line at the same time.

Parentheses are another special character in regular expressions. When you add parentheses to an expression, they are ignored when matching the string. When you use findall(), however, parentheses indicate that although the whole expression must match, you only want to extract the part of the match that falls inside the parentheses.
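The effect of the parentheses can be seen by running findall() on one sample line with and without them. This is a small sketch on an in-memory string rather than on the file; the variable names are chosen here for illustration.

```python
import re

line = 'X-DSPAM-Confidence: 0.8475'

# Without parentheses, findall() returns the whole matched text.
whole = re.findall('^X-.*: [0-9.]+', line)
# With parentheses, the same line must match, but only the group is returned.
part = re.findall('^X-.*: ([0-9.]+)', line)

print(whole)  # ['X-DSPAM-Confidence: 0.8475']
print(part)   # ['0.8475']

# A line that does not match gives an empty list in both cases.
print(re.findall('^X-.*: ([0-9.]+)', 'Subject: hello'))  # []
```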

So we modify the program as follows:

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X-.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

In the regular expression, we add parentheses around the part that matches the floating-point number, and we use findall() instead of search() so that the number itself is returned. The output of this program is as follows:

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
...

Although the numbers in the list still need to be converted from a string to a floating point number, we use the regular expression to search for and extract information we are interested in at the same time.
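The remaining conversion step might look like the sketch below. It works on a short in-memory list instead of the file; the "Subject:" line and the try/except guard are additions made here for illustration, not part of the original program.

```python
import re

lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
    'Subject: nothing to extract here',
]

values = []
for line in lines:
    for text in re.findall('^X-.*: ([0-9.]+)', line):
        try:
            values.append(float(text))
        except ValueError:
            # "[0-9.]+" also accepts strings such as "1.2.3" that are not numbers.
            pass

print(values)  # [0.8475, 0.0, 0.6178]
```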

The following is another example of this technique. If you look at the file, you will find many lines in this format:

Details: http://source.sakaiproject.org/viewsvn?view=rev&rev=39772

If we want to use the same technique to extract all revision numbers (integers at the end of the line), we can write code like this:

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        print(x)

Our regular expression looks for lines that start with "Details:", followed by any number of characters ".*", followed by "rev=", and then by one or more digits. We want lines that match the entire expression, but we only want to extract the integer at the end of the line, so we put "[0-9]+" in parentheses. When we run the program, we get the following output:

['39772']
...

Remember that "[0-9]+" is greedy: it tries to match as long a string of digits as possible before extracting them. This greedy behavior is why we get all five digits of each number. The regular expression library expands the match in both directions until it encounters a non-digit or the beginning or end of the line.
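To see the greedy behavior in isolation, the sketch below compares "[0-9]+" with its non-greedy form "[0-9]+?" on the sample line. The non-greedy variant is not used in the original program; it is shown here only for contrast.

```python
import re

line = 'Details: http://source.sakaiproject.org/viewsvn?view=rev&rev=39772'

# Greedy: "+" takes as many digits as it can.
greedy = re.findall('^Details:.*rev=([0-9]+)', line)
# Non-greedy: "+?" stops after the fewest digits that still allow a match.
lazy = re.findall('^Details:.*rev=([0-9]+?)', line)

print(greedy)  # ['39772']
print(lazy)    # ['3']
```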

We can use regular expressions to redo an exercise from earlier in the book, in which we were interested in the time of day of each mail message. We looked for lines in this format:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

From each such line we wanted to extract the hour of the day. Previously, we did this with two calls to split. First we split the line into words, and then we took the word containing the time and split it again on the colon character to pull out the two characters we were interested in.

This works, but it produces brittle code that assumes the lines are well formatted. If you add enough error checking (or a big try/except block) to ensure that the program never fails when it meets an incorrectly formatted line, the code grows to 10-15 lines and becomes hard to read.
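For comparison, a guarded version of the two-split approach might look like the sketch below. The function name hour_with_split and the specific checks are assumptions made for illustration; the earlier exercise may have been written differently.

```python
def hour_with_split(line):
    """Return the two-character hour from a 'From ' line, or None."""
    if not line.startswith('From '):
        return None
    words = line.split()
    # Expect: From, address, weekday, month, day, time, year
    if len(words) < 6:
        return None
    pieces = words[5].split(':')
    if len(pieces) != 3 or len(pieces[0]) != 2 or not pieces[0].isdigit():
        return None
    return pieces[0]


print(hour_with_split('From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'))  # 09
print(hour_with_split('From someone@example.com'))  # None
```

Even in this short form, most of the function is error checking rather than extraction.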

We can use the following regular expression to make work easier:

^From .* [0-9][0-9]:

This expression matches lines that start with "From " (note the space), followed by any number of characters ".*", followed by a space, followed by two digits "[0-9][0-9]", followed by a colon. This is the definition of the kinds of lines we are looking for.

To extract only the hour with findall(), we add parentheses around the two digits:

^From .* ([0-9][0-9]):

The final program looks like this:

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # The pattern must begin with an ASCII caret (^) to anchor at line start
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        print(x)

Running the program produces the following output:

['09']
['18']
['16']
['15']
...
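The same expression can be tried on a few strings without opening the file. In this sketch the "From:" line is an invented example, included to show why the space after "From" matters.

```python
import re

lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    'From: stephen.marquard@uct.ac.za',
    'X-DSPAM-Confidence: 0.8475',
]

hours = []
for line in lines:
    # Only a space followed by two digits and a colon can supply the group,
    # so the minutes ("14", preceded by a colon) are never captured.
    hours.extend(re.findall('^From .* ([0-9][0-9]):', line))

print(hours)  # ['09']
```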
