A detailed explanation of the use of Python's regular expressions _python

Source: Internet
Author: User
Tags lowercase

Since learning Python, it has often been found that Python is a tool. Especially in the text processing, it is more comfortable to use.

When it comes to text processing, regular expressions are a great tool, and they can be searched or replaced in a very concise way by a number of complex characters.

We are dealing with text, or query crawl, or replace.

One. Find
If you want to implement such a function module, enter an IP address to get the details of the area where the IP address is located.

And then you find out that http://ip138.com can find very detailed data.

However, the API is not provided for external calls, but we can simulate the query by code and then crawl the results.

By looking at the source code of the corresponding page, we can find that the result is placed in the three <li></li>

Copy Code code as follows:

<table width= "80%" border= "0" align= "center" cellpadding= "0" cellspacing= "0" >
<tr>
&LT;TD align= "center" ></tr>
<tr>
&LT;TD align= "center" ></tr>
<tr>

&LT;TD align= "center" ><ul class= "UL1" ><li> main data: Alibaba, Hangzhou, Zhejiang province </li><li> reference data one: Zhejiang Province, Hangzhou, Alibaba </li><li> reference data two: Zhejiang Province, Hangzhou, Alibaba </li></ul></td>
</tr>
<tr>
&LT;TD align= "center" > If you find that the query results are not detailed or incorrect, use <a href= "ip_add.asp?ip=121.0.29.231" ><font color= "#006600" ><b>ip Database Self-Service add </b></font></a> function modification <br/><br/>
<iframe src= "/jss/bd_460x60.htm" frameborder= "no" width= "460" height= "" border= "0" marginwidth= "0" marginheight= "0" scrolling= "no" ></iframe><br/><br/></td>

</tr>
<form method= "Get" action= "ips8.asp" name= "Ipform" onsubmit= "return Checkip ();" >
<tr>
&LT;TD align= "center" &GT;IP address or domain name: <input type= "text" name= "IP" size= "" > <input type= "Submit" value= "Query" ><input type= "hidden" name= "Action" value= "2" ></td>
</tr><br>
<br>
</form>
</table>

If you know the regular expression, you might write

Regular expressions

Copy Code code as follows:

(?<=<li>). *? (?=</li>)

Here's a preview: Lookahead Speaks: Lookbehind, the advantage is that the matching results will not include the HTML Li tag.

If you're not confident about your regular expression, you can do some testing on the online or local regular test tool to make sure it's correct.

The next thing to do is to use Python to implement such a function, first we have to express the regular expression:

Copy Code code as follows:

R "(?<=<li>). *? (?=</li>) "

The character in Python preceded by a leading R, which means that the string is an R aw string (the original string), that is, the Python string itself does not escape the characters in the string. This is because the regular expression also has an escape character Fu Zhi said, if double escapes, the readability is very poor.

This string in Python, we call it "regular expression pattern."

If we were to compile the pattern,

Copy Code code as follows:

Prog = Re.compile (r) (?<=<li>). *? =</li>) ")

We can then get a regular Expression object regular expression object, through which we can do related operations.

Like what

Copy Code code as follows:

Result=prog.match (String)
# #这个等同于
Result=re.match (R "(?<=<li>). *? =</li>) ", String)
# #但是如果这个正则需要在程序匹配多次, the way through regular expression objects is more efficient

The next step is to look up, assuming that our HTML results have been stored in text in HTML format, then

Copy Code code as follows:

Result_list = Re.findall (r) (?<=<li>). *? =</li>) ", text)

You can get a list of the results you want.

Two. Replace
It's very flexible to replace with regular expressions.

For example, when I was reading the source code of the Wiki module in the TRAC system, I found that the implementation of the wiki syntax was done through a regular substitution.

The concept of group grouping in regular expressions is involved when replacing is used.

Suppose it is used in wiki syntax! The functional characters that represent the escape character, the exclamation point, are output as is, and the bold syntax is

Wrote
"" is shown here as bold ""
Then there are regular expressions for

Copy Code code as follows:

R "(? p<bold>! "") "

In here? P<bold> is part of the Python regular syntax, which indicates that the group is named "Bold"

Here is the scene when the substitution occurs, where the first argument of the sub function is pattern, the second argument can be a string or a function, and if it is a string, the result of the target match is replaced with the specified result, and if it is a function, then the function accepts a match object argument and returns the replacement string, and the third argument is the source string.

Copy Code code as follows:

result = Re.sub (?) p<bold>!? ') ', replace, line

Whenever a three single quote is matched, the Replace function runs once, possibly requiring a global variable to record whether the current three single quotes are open or closed in order to add the appropriate markup.

When the actual TRAC wiki is implemented, it is this way that some tag variables are used to record the opening and closing of some grammatical tags to determine the results of the Replace function.

--------------------

Example

I. Determine if the string is all lowercase

Code

Copy Code code as follows:

#-*-coding:cp936-*-
Import re
S1 = ' ADKKDK '
S2 = ' ABC123EFG '

an = Re.search (' ^[a-z]+$ ', S1)
If an:
print ' S1: ', An.group (), ' all lowercase '
Else
Print S1, "Not all lowercase!" "

an = Re.match (' [a-z]+$ ', S2)
If an:
print ' s2: ', An.group (), ' all lowercase '
Else
Print S2, "Not all lowercase!" "

Results

Investigate the reason

1. Regular expressions are not part of Python and need to be referenced by the RE module

2. The matching form is: Re.search (regular expression, with matching string) or re.match (regular expression, with matching string). The difference is that the latter default begins with the start character (^). So

Re.search (' ^[a-z]+$ ', S1) is equivalent to Re.match (' [a-z]+$ ', S2 ')
3. If the match fails, an = Re.search (' ^[a-z]+$ ', S1) returns none

Group is used to group matching results

For example

Copy Code code as follows:

Import re
A = "123abc456"
Print Re.search ([0-9]*) ([a-z]*) ([0-9]*), a). Group (0) #123abc456, return to the whole
Print Re.search ([0-9]*) ([a-z]*) ([0-9]*), a). Group (1) #123
Print Re.search ([0-9]*) ([a-z]*) ([0-9]*), a). Group (2) #abc
Print Re.search ([0-9]*) ([a-z]*) ([0-9]*), a). Group (3) #456

1 The three sets of parentheses in the regular expression divide the result into three groups

Group () is the same group (0) that matches the overall result of the regular expression

Group (1) lists the first bracket matching part, Group (2) lists the second bracket matching part, and group (3) lists the third bracket matching part.

2) No match succeeded, Re.search () returns none

3 of course Zheng is the expression of no parentheses, group (1) certainly wrong.

Two. Acronym expansion

Specific examples

FEMA Emergency Management Agency
IRA Irish Republican Army
DUP Democratic Unionist Party

FDA Food and Drug Administration
OLC Office of Legal counsel
Analysis

Abbreviated Words FEMA
Decomposed into f*** e*** m*** a***
Regular uppercase + lowercase (greater than or equal to 1) + Space
Reference Code

Copy Code code as follows:

Import re
Def expand_abbr (Sen, abbr):
LENABBR = Len (abbr)
Ma = '
For I in range (0, LENABBR):
Ma + + abbr[i] + "[a-z]+" + "
print ' ma: ', MA
Ma = Ma.strip (')
p = Re.search (MA, sen)
If P:
Return P.group ()
Else
Return ""

Print Expand_abbr ("Welcome to Algriculture Bank", ' ABC ')

Results

Problem

The code above is correct for the first 3 in the example, but the next two are wrong, because the words at the beginning of the capital letter are mixed with lowercase letters.

Law

Uppercase + lowercase (greater than or equal to 1) + space + [lowercase + space] (0 or 1 times)

Reference Code

Copy Code code as follows:

Import re
Def expand_abbr (Sen, abbr):
LENABBR = Len (abbr)
Ma = '
For I in range (0, lenabbr-1):
Ma + + abbr[i] + [a-z]+] + ' + ' ([a-z]+)?
Ma + + abbr[lenabbr-1] + "[a-z]+"
print ' ma: ', MA
Ma = Ma.strip (')
p = Re.search (MA, sen)
If P:
Return P.group ()
Else
Return ""

Print Expand_abbr ("Welcome to Algriculture Bank of", ' ABC ')

Skills

The middle of the lowercase letter set + a space, as a whole, put parentheses. Either at the same time, or not at the same time, so it needs to be used? To match the whole of the front.

Three. Remove the comma from the number

Specific examples

When dealing with natural language 123,000,000 if the punctuation is divided, there will be problems, a good number of a comma dismembered, so you can first hand the number processing clean (comma removed).

Analysis

A number is often a group of 3 digits followed by a comma, so the rule is: ***,***,***

Regular type

[A-z]+,[a-z]?

Reference Code 3-1

Copy Code code as follows:

Import re

Sen = "ABC,123,456,789,MNP"
p = re.compile ("\d+,\d+?")

For COM in p.finditer (SEN):
mm = Com.group ()
Print "Hi:", mm
Print "Sen_before:", Sen
Sen = sen.replace (mm, Mm.replace (",", ""))
Print "Sen_back:", Sen, ' \ n '

Results

Skills

Using Functions Finditer (string[, pos[, Endpos]) | Re.finditer (pattern, string[, flags]):

Searches for a string that returns an iterator that accesses each matching result (match object) sequentially.

Reference Code 3-2

Copy Code code as follows:

Sen = "ABC,123,456,789,MNP"
While 1:
MM = Re.search ("\d,\d", Sen)
if mm:
mm = Mm.group ()
Sen = sen.replace (mm, Mm.replace (",", ""))
Print Sen
Else
Break

Results

Extended

Such a program for the specific problem, that is, the number 3-digit group, if the number mixed with the letter, kill the comma between the numbers, that is, "ABC,123,4,789,MNP" into "ABC,1234789,MNP"

Ideas

The more specific is to find the regular type "number, number" to find a replacement with the comma removed

Reference Code 3-3

Copy Code code as follows:

Sen = "ABC,123,4,789,MNP"
While 1:
MM = Re.search ("\d,\d", Sen)
if mm:
mm = Mm.group ()
Sen = sen.replace (mm, Mm.replace (",", ""))
Print Sen
Else
Break
Print Sen

Results

Four. Year conversion of Chinese processing (e.g.: 1949--->1949 years)

Chinese processing involves coding problems. For example, when the lower program identifies the year (* *)

Copy Code code as follows:

#-*-coding:cp936-*-
Import re
M0 = "Founded in 1949 New China"
M1 = "5.2% lower than 1990"
M2 = ' Man defeated the Russian army in 1996 and achieved substantial independence '

def fuc (m):
A = Re.findall ("[0 | one | two | |-three | |-Five | six | seven | | eight | Nine]+ years", m)
If a:
For key in a:
Print key
Else
Print "NULL"

FUC (M0)
FUC (M1)
Fuc (m2)

Run results

You can see that the second and third are all errors.

Improvement--quasi-Unicode recognition

Copy Code code as follows:

#-*-coding:cp936-*-
Import re
M0 = "Founded in 1949 New China"
M1 = "5.2% lower than 1990"
M2 = ' Man defeated the Russian army in 1996 and achieved substantial independence '

def fuc (m):
m = M.decode (' cp936 ')
A = Re.findall (u "[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", M)

If a:
For key in a:
Print key
Else
Print "NULL"

FUC (M0)
FUC (M1)
Fuc (m2)

Results

Recognize that you can replace the Chinese characters with numbers by replacing them.

Reference

Copy Code code as follows:

Numhash = {}
numhash[' 0 '. Decode (' utf-8 ')] = ' 0 '
numhash[' one '. Decode (' utf-8 ')] = ' 1 '
numhash[' II ' Decode (' utf-8 ')] = ' 2 '
numhash[' three '. Decode (' utf-8 ')] = ' 3 '
numhash[' four ' decode (' utf-8 ')] = ' 4 '
numhash[' five ' decode (' utf-8 ')] = ' 5 '
Numhash[' VI ' decode (' utf-8 ')] = ' 6 '
Numhash[' seven ' decode (' utf-8 ')] = ' 7 '
Numhash[' eight ' decode (' utf-8 ')] = ' 8 '
numhash[' Nine ' decode (' utf-8 ')] = ' 9 '

Def change2num (words):
    print "words:", words
    newword = '
 & nbsp;  for key in words:
        print key
         if key in Numhash:
            Newword + = Numhash[key]
        else:
             Newword + = key
    return Newword

def chi2num (line):
A = Re.findall (u "[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", line)
If a:
Print "------"
Print Line
For words in a:
Newwords = Change2num (words)
Print words
Print Newwords
line = Line.replace (words, newwords)
Return line

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.