Python Regular Expressions: Four Cases

Source: Internet
Author: User

This article is reposted from http://www.cnblogs.com/kaituorensheng/p/3489492.html

Contents

    • One. Determine whether a string is all lowercase
    • Two. Acronym expansion
    • Three. Remove the commas from a number
    • Four. Converting years in Chinese text (e.g. 一九四九年 ---> 1949年)

Syntax used in this article

Pattern   Meaning                                          Example
+         The preceding element occurs one or more times   ab+ matches ab, abbbb, etc.
*         The preceding element occurs zero or more times  ab* matches a, ab, abb, etc.
?         The preceding element occurs zero or one time    ab? matches a, ab
^         Anchors the match at the start of the string     ^a matches in abc, aaaaaa, etc.
$         Anchors the match at the end of the string       c$ matches in abc, cccc, etc.
\d        A digit                                          3, 4, 9, etc.
\D        A non-digit                                      A, a, -, etc.
[a-z]     Any lowercase letter from a to z                 a, p, m, etc.
[0-9]     Any digit from 0 to 9                            0, 2, 9, etc.
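The entries in the syntax table can be checked directly in an interpreter. The following is a minimal sketch, written for Python 3 (the article's own listings use Python 2 print statements). The helper name `matches` is made up for illustration; it relies on `re.fullmatch`, which requires the pattern to cover the whole test string.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Quantifiers apply to the element immediately before them
print(matches(r"ab+", "abbbb"))   # True
print(matches(r"ab+", "a"))       # False: + needs at least one b
print(matches(r"ab*", "a"))       # True: * allows zero b's
print(matches(r"ab?", "abb"))     # False: ? allows at most one b

# Anchors constrain where a search may match
print(re.search(r"^a", "abc") is not None)   # True
print(re.search(r"c$", "abc") is not None)   # True
print(re.search(r"^a", "bac") is not None)   # False

# Character classes match exactly one character from the set
print(matches(r"\d", "7"))        # True
print(matches(r"\D", "-"))        # True
print(matches(r"[a-z]", "p"))     # True
print(matches(r"[0-9]", "x"))     # False
```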

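Notes:

1. Escape characters

    >>> s = '(abc)def'
    >>> m = re.search(r"(\(.*\)).*", s)
    >>> print m.group(1)
    (abc)

The literal parentheses are escaped as \( and \), while the unescaped pair forms a capturing group. For group() usage, see the explanation in case One below.

2. Repeating the preceding element a given number of times

    >>> a = "kdlal123dk345"
    >>> b = "kdlal123345"
    >>> m = re.search("([0-9]+(dk){0,1})[0-9]+", a)
    >>> m.group(1), m.group(2)
    ('123dk', 'dk')
    >>> m = re.search("([0-9]+(dk){0,1})[0-9]+", b)
    >>> m.group(1)
    '12334'
    >>> m.group(2)
    >>>

{0,1} means zero or one repetition, the same as ?. In the second search the optional group (dk) does not take part in the match, so m.group(2) is None and the interpreter prints nothing.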
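As a quick check of the note on repetition, the sketch below (Python 3) compares the `{0,1}` spelling with the equivalent `?` spelling and shows what an optional group returns when it does not take part in the match. The variable names are chosen for illustration only.

```python
import re

pattern = re.compile(r"([0-9]+(dk){0,1})[0-9]+")
same_with_qmark = re.compile(r"([0-9]+(dk)?)[0-9]+")

for text in ("kdlal123dk345", "kdlal123345"):
    m1 = pattern.search(text)
    m2 = same_with_qmark.search(text)
    # Both spellings produce identical groups
    assert m1.groups() == m2.groups()
    print(text, m1.groups())
# kdlal123dk345 ('123dk', 'dk')
# kdlal123345 ('12334', None)
```

In the second string the first `[0-9]+` has to give back one digit so that the trailing `[0-9]+` still has something to match, which is why group 1 is `'12334'` rather than `'123345'`.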

Examples

One. Determine whether a string is all lowercase

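Code

    # -*- coding: cp936 -*-
    import re

    s1 = 'adkkdk'
    s2 = 'abc123efg'

    # search() scans the whole string, so both ends are anchored explicitly
    an = re.search('^[a-z]+$', s1)
    if an:
        print 's1:', an.group(), 'all lowercase'
    else:
        print s1, 'Not all lowercase!'

    # match() only matches at the start of the string, so '^' is not needed
    an = re.match('[a-z]+$', s2)
    if an:
        print 's2:', an.group(), 'all lowercase'
    else:
        print s2, 'Not all lowercase!'

Results

    s1: adkkdk all lowercase
    abc123efg Not all lowercase!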

Explanation

1. Regular expressions are not built into the Python language itself; import the re module to use them.

2. A match is written as re.search(pattern, string) or re.match(pattern, string). The difference is that re.match only matches at the beginning of the string, as if the pattern started with ^. So

re.search('^[a-z]+$', s) is equivalent to re.match('[a-z]+$', s)

3. If the match fails, an = re.search('^[a-z]+$', s1) returns None.
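The difference between the two calls can be seen in a short Python 3 sketch. The helper `is_all_lowercase` is a hypothetical name, and it swaps in `re.fullmatch` (Python 3.4+) instead of the `^...$` anchors used in the article.

```python
import re

def is_all_lowercase(text):
    """Hypothetical helper: True if text consists only of the letters a-z."""
    return re.fullmatch(r"[a-z]+", text) is not None

# search() finds a match anywhere; match() only at the start
found = re.search(r"[a-z]+", "123abc")
print(found.group())                      # abc
print(re.match(r"[a-z]+", "123abc"))      # None

print(is_all_lowercase("adkkdk"))         # True
print(is_all_lowercase("abc123efg"))      # False

# '$' also matches just before a trailing newline; fullmatch() does not
print(re.match(r"[a-z]+$", "abc\n") is not None)   # True
print(is_all_lowercase("abc\n"))                   # False
```

Because a failed match returns None, the result should be tested before calling group() on it, as the article's if/else does.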

group() returns the parts of a match, group by group.

For example

    import re
    a = "123abc456"
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)   # 123abc456, the whole match
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)   # 123
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)   # abc
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)   # 456

1) The three pairs of parentheses in the regular expression split the match into three groups.

group() and group(0) both return the whole match.

group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.

2) If nothing matches, re.search() returns None.

3) If the regular expression contains no parentheses, group(1) raises an error.
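The three points above can be verified with a small Python 3 sketch. It also uses `groups()`, which returns all parenthesized groups at once; the variable names are illustrative only.

```python
import re

m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")
print(m.group())       # 123abc456
print(m.group(0))      # 123abc456
print(m.groups())      # ('123', 'abc', '456')
print(m.group(1, 3))   # ('123', '456')

# 2) No match: search() returns None, so test before calling group()
no_match = re.search(r"[0-9]+", "abc")
print(no_match is None)   # True

# 3) No parentheses in the pattern: group(1) raises IndexError
plain = re.search(r"[0-9]+", "abc123")
try:
    plain.group(1)
except IndexError as exc:
    print("IndexError:", exc)
```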

Two. Acronym expansion

Specific examples

    FEMA   Federal Emergency Management Agency
    IRA    Irish Republican Army
    DUP    Democratic Unionist Party
    FDA    Food and Drug Administration
    OLC    Office of Legal Counsel

Analysis

The abbreviation FEMA expands to F*** E*** M*** A***. The pattern for each word is: uppercase letter + lowercase letters (one or more) + space.

Reference code

    import re

    def expand_abbr(sen, abbr):
        lenabbr = len(abbr)
        ma = ''
        # One "capital letter + lowercase letters + space" unit per letter
        for i in range(0, lenabbr):
            ma += abbr[i] + "[a-z]+" + ' '
        print 'ma:', ma
        ma = ma.strip(' ')   # drop the trailing space
        p = re.search(ma, sen)
        if p:
            return p.group()
        else:
            return ''

    print expand_abbr("Welcome to Agriculture Bank China", 'ABC')

Results

    ma: A[a-z]+ B[a-z]+ C[a-z]+
    Agriculture Bank China

Problem

The code above handles the first three examples correctly but fails on the last two, because lowercase words ("and", "of") appear between the capitalized words.

Pattern

Uppercase letter + lowercase letters (one or more) + space + [lowercase letters + space] (zero or one time)

Reference code

    import re

    def expand_abbr(sen, abbr):
        lenabbr = len(abbr)
        ma = ''
        # After every word except the last, allow one optional lowercase word
        for i in range(0, lenabbr - 1):
            ma += abbr[i] + "[a-z]+" + ' ' + '([a-z]+ )?'
        ma += abbr[lenabbr - 1] + "[a-z]+"
        print 'ma:', ma
        ma = ma.strip(' ')
        p = re.search(ma, sen)
        if p:
            return p.group()
        else:
            return ''

    print expand_abbr("Welcome to Agriculture Bank of China", 'ABC')

Tip

Treat the intermediate lowercase word and its trailing space as a single unit by wrapping them in parentheses. The two either appear together or not at all, so follow the group with ? to make the whole group optional.
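The optional-group idea can also be written more compactly. The following is a sketch in Python 3, not the author's code: it builds the same pattern with `str.join`, uses a non-capturing group `(?:...)` because the optional word never needs to be retrieved, and adds `re.escape` as a precaution in case the abbreviation contains a non-letter character.

```python
import re

def expand_abbr(sentence, abbr):
    """Return the phrase in sentence whose capitalized words spell abbr, or ''."""
    # One capitalized word per letter of the abbreviation
    words = [re.escape(letter) + r"[a-z]+" for letter in abbr]
    # Between two capitalized words: a space, then optionally one
    # lowercase word followed by a space
    pattern = r" (?:[a-z]+ )?".join(words)
    found = re.search(pattern, sentence)
    return found.group() if found else ""

print(expand_abbr("Welcome to Agriculture Bank of China", "ABC"))
# Agriculture Bank of China
print(expand_abbr("the Food and Drug Administration said", "FDA"))
# Food and Drug Administration
print(expand_abbr("Office of Legal Counsel", "OLC"))
# Office of Legal Counsel
```

Like the article's version, this allows at most one lowercase word between two capitalized words.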

Three. Remove the commas from a number

Specific examples

In natural language processing, a number such as 123,000,000 causes trouble if the text is split on punctuation: the commas cut a single number into pieces. The numbers can be cleaned first by removing their commas.

Analysis

Digits usually come in groups of three separated by commas, so the pattern is: ***,***,***

Regular expression

\d+,\d+?

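Reference code 3-1

    import re

    sen = "abc,123,456,789,mnp"
    p = re.compile(r"\d+,\d+?")
    # finditer() iterates over the matches found in the original string
    for com in p.finditer(sen):
        mm = com.group()
        print "hi:", mm
        print "sen_before:", sen
        # Replace the matched text with a copy that has the comma removed
        sen = sen.replace(mm, mm.replace(",", ""))
        print "sen_back:", sen, '\n'

Results

    hi: 123,4
    sen_before: abc,123,456,789,mnp
    sen_back: abc,123456,789,mnp

    hi: 56,7
    sen_before: abc,123456,789,mnp
    sen_back: abc,123456789,mnp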

Tip

Use the function finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

It scans the string and returns an iterator that yields each match (a match object) in order.
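A short Python 3 sketch of both forms of finditer, showing that each item is a match object with its own position information. The optional `pos` argument is available only on the compiled-pattern method.

```python
import re

sen = "abc,123,456,789,mnp"

# Module-level form: re.finditer(pattern, string)
for match in re.finditer(r"\d+", sen):
    print(match.group(), match.start(), match.end())
# 123 4 7
# 456 8 11
# 789 12 15

# Compiled-pattern form: p.finditer(string, pos) starts scanning at index pos
p = re.compile(r"\d+")
later = [m.group() for m in p.finditer(sen, 8)]
print(later)   # ['456', '789']
```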

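Reference code 3-2

    import re

    sen = "abc,123,456,789,mnp"
    # Keep removing the first "digit,digit" comma until none is left
    while 1:
        mm = re.search(r"\d,\d", sen)
        if mm:
            mm = mm.group()
            sen = sen.replace(mm, mm.replace(",", ""))
            print sen
        else:
            break

Results

    abc,123456,789,mnp
    abc,123456789,mnp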

Extension

The programs above target one specific case, digits in groups of three. A more general task: when numbers are mixed in with letters, remove every comma that sits between two digits, so that "abc,123,4,789,mnp" becomes "abc,1234789,mnp".

Idea

More specifically, look for the pattern "digit,digit" and replace each occurrence with the comma removed.

Reference code 3-3

    import re

    sen = "abc,123,4,789,mnp"
    while 1:
        mm = re.search(r"\d,\d", sen)
        if mm:
            mm = mm.group()
            sen = sen.replace(mm, mm.replace(",", ""))
            print sen
        else:
            break
    print sen

Results

    abc,1234,789,mnp
    abc,1234789,mnp
    abc,1234789,mnp
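The search-and-replace loop can also be expressed as a single substitution. This is an alternative sketch in Python 3, not the article's method: it uses `re.sub` with a lookbehind `(?<=\d)` and a lookahead `(?=\d)`, which check for a digit on each side of the comma without consuming it. The function name `strip_digit_commas` is made up for illustration.

```python
import re

def strip_digit_commas(text):
    """Remove every comma that has a digit on both sides."""
    # The lookarounds are zero-width, so only the comma itself is replaced
    return re.sub(r"(?<=\d),(?=\d)", "", text)

print(strip_digit_commas("abc,123,4,789,mnp"))     # abc,1234789,mnp
print(strip_digit_commas("abc,123,456,789,mnp"))   # abc,123456789,mnp
print(strip_digit_commas("a,b,c"))                 # a,b,c
```

Because the neighbouring digits are not consumed, adjacent matches such as "3,4" and "4,7" are handled in one pass and no loop is needed.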

Four. Converting years in Chinese text (e.g. 一九四九年 ---> 1949年)

Processing Chinese text involves encoding issues. For example, the program below tries to identify years written in Chinese numerals (****年):

    # -*- coding: cp936 -*-
    import re

    m0 = "在一九四九年新中国成立"              # New China was established in 1949
    m1 = "比一九九零年低百分之五点二"          # 5.2% lower than in 1990
    m2 = '人一九九六年击败俄军,取得实质独立'   # defeated the Russian army in 1996 and achieved de facto independence

    def fuc(m):
        # Byte-string pattern: one or more Chinese numerals followed by 年
        a = re.findall("[零一二三四五六七八九]+年", m)
        if a:
            for key in a:
                print key
        else:
            print "NULL"

    fuc(m0)
    fuc(m1)
    fuc(m2)

Run results

The second and third strings are matched incorrectly. In Python 2 both the pattern and the text here are cp936 byte strings, so the character class matches individual bytes of the multi-byte characters rather than whole characters.

Improvement: decode to Unicode before matching

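    # -*- coding: cp936 -*-
    import re

    m0 = "在一九四九年新中国成立"              # New China was established in 1949
    m1 = "比一九九零年低百分之五点二"          # 5.2% lower than in 1990
    m2 = '人一九九六年击败俄军,取得实质独立'   # defeated the Russian army in 1996 and achieved de facto independence

    def fuc(m):
        # Decode the cp936 bytes so the pattern works on whole characters
        m = m.decode('cp936')
        # No '|' inside the character class: there it would match a literal '|'
        a = re.findall(u"[\u96f6\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d]+\u5e74", m)
        if a:
            for key in a:
                print key
        else:
            print "NULL"

    fuc(m0)
    fuc(m1)
    fuc(m2)

Results

    一九四九年
    一九九零年
    一九九六年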
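For comparison, here is a sketch of the same detection in Python 3, where str is already Unicode and no decode step is needed. The names `YEAR_PATTERN` and `find_years` are hypothetical, and the sample sentences are assumed to match the ones used in the listings.

```python
import re

# One or more Chinese numerals (零 to 九) followed by 年 (year)
YEAR_PATTERN = re.compile("[零一二三四五六七八九]+年")

def find_years(text):
    """Return every Chinese-numeral year found in text."""
    return YEAR_PATTERN.findall(text)

print(find_years("在一九四九年新中国成立"))       # ['一九四九年']
print(find_years("比一九九零年低百分之五点二"))   # ['一九九零年']
print(find_years("no year here"))                  # []

# Inside [...] a '|' is a literal character, not an alternation
print(re.findall("[零|一]+年", "一|零年"))         # ['一|零年']
```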

Once the years have been identified, the Chinese numerals can be replaced with digits.

Reference code

    # -*- coding: utf-8 -*-
    # Assumes this file is saved as UTF-8 so that decode('utf-8') works
    import re

    # Map each Chinese numeral to its digit
    numhash = {}
    numhash['零'.decode('utf-8')] = '0'
    numhash['一'.decode('utf-8')] = '1'
    numhash['二'.decode('utf-8')] = '2'
    numhash['三'.decode('utf-8')] = '3'
    numhash['四'.decode('utf-8')] = '4'
    numhash['五'.decode('utf-8')] = '5'
    numhash['六'.decode('utf-8')] = '6'
    numhash['七'.decode('utf-8')] = '7'
    numhash['八'.decode('utf-8')] = '8'
    numhash['九'.decode('utf-8')] = '9'

    def change2num(words):
        # Translate one matched year character by character
        print "words:", words
        newword = ''
        for key in words:
            print key
            if key in numhash:
                newword += numhash[key]
            else:
                newword += key
        return newword

    def Chi2num(line):
        # line must be a unicode string
        a = re.findall(u"[\u96f6\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d]+\u5e74", line)
        if a:
            print "------"
            print line
            for words in a:
                newwords = change2num(words)
                print words
                print newwords
                line = line.replace(words, newwords)
        return line
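In Python 3 the same conversion can be sketched more briefly. This is an alternative under stated assumptions, not the author's code: `DIGITS`, `YEAR_PATTERN`, and `chi2num` are illustrative names, the lookup table is built with `str.maketrans`, and `re.sub` is given a function so that only the matched years are translated.

```python
import re

# Translation table from Chinese numerals to ASCII digits
DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")
YEAR_PATTERN = re.compile("[零一二三四五六七八九]+年")

def chi2num(line):
    """Replace each Chinese-numeral year in line with Arabic digits."""
    # The replacement function receives each match object in turn
    return YEAR_PATTERN.sub(lambda m: m.group().translate(DIGITS), line)

print(chi2num("在一九四九年新中国成立"))       # 在1949年新中国成立
print(chi2num("比一九九零年低百分之五点二"))   # 比1990年低百分之五点二
```

Numerals that are not part of a year, such as 五 and 二 in the second sentence, are left unchanged because the translation is applied only to text matched by the year pattern.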
