This article is reposted from http://www.cnblogs.com/kaituorensheng/p/3489492.html
Contents
- 1. Determine whether a string is all lowercase
- 2. Expand abbreviations
- 3. Remove commas from numbers
- 4. Year conversion in Chinese text (e.g. 一九四九年 ---> 1949年)
Syntax used in this article
| Pattern | Meaning | Examples |
|---|---|---|
| + | The preceding element appears one or more times | ab+: ab, abbbb, etc. |
| * | The preceding element appears zero or more times | ab*: a, ab, abb, etc. |
| ? | The preceding element appears zero or one time | ab?: a, ab |
| ^ | Anchors the match at the start of the string | ^a: abc, aaaaaa, etc. |
| $ | Anchors the match at the end of the string | c$: abc, cccc, etc. |
| \d | Any digit | 3, 4, 9, etc. |
| \D | Any non-digit character | A, a, -, etc. |
| [a-z] | Any lowercase letter between a and z | a, p, m, etc. |
| [0-9] | Any digit between 0 and 9 | 0, 2, 9, etc. |
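The table can be checked mechanically. The following is a minimal sketch in Python 3 syntax (the article's own listings use Python 2); the `cases` list and the `check` helper are names invented here for illustration.

```python
import re

# Each entry: (pattern, strings it should match in full, strings it should not)
cases = [
    (r"ab+", ["ab", "abbbb"], ["a"]),
    (r"ab*", ["a", "ab", "abb"], ["b"]),
    (r"ab?", ["a", "ab"], ["abb"]),
    (r"\d", ["3", "9"], ["a"]),
    (r"\D", ["a", "-"], ["3"]),
    (r"[a-z]", ["a", "p"], ["A", "5"]),
    (r"[0-9]", ["0", "9"], ["x"]),
]

def check(pattern, good, bad):
    """Return True if pattern fully matches every string in good and none in bad."""
    return (all(re.fullmatch(pattern, s) for s in good)
            and not any(re.fullmatch(pattern, s) for s in bad))

for pattern, good, bad in cases:
    print(pattern, check(pattern, good, bad))

# ^ and $ anchor a match to the start and the end of the string.
print(bool(re.search(r"^a", "abc")), bool(re.search(r"^a", "bac")))
print(bool(re.search(r"c$", "abc")), bool(re.search(r"c$", "acb")))
```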
Notes:
1. Escape characters
    >>> s = '(abc)def'
    >>> m = re.search(r"(\(.*\)).*", s)
    >>> print m.group(1)
    (abc)
For how group() works, see the explanation in example 1.
2. Repeating the preceding group a given number of times ({0,1} means 0 or 1 times)
    >>> a = "kdlal123dk345"
    >>> b = "kdlal123345"
    >>> m = re.search("([0-9]+(dk){0,1})[0-9]+", a)
    >>> m.group(1), m.group(2)
    ('123dk', 'dk')
    >>> m = re.search("([0-9]+(dk){0,1})[0-9]+", b)
    >>> m.group(1)
    '12334'
    >>> m.group(2)
    >>>
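The two notes can be restated as a short script. This is a sketch in Python 3 syntax; the variable names `pattern`, `with_dk` and `without_dk` are chosen here for illustration. It shows that an optional group which takes no part in the match is reported as None.

```python
import re

# Escaped parentheses match literal "(" and ")"; the unescaped pair forms group 1.
m = re.search(r"(\(.*\)).*", "(abc)def")
print(m.group(1))

# (dk){0,1} behaves like (dk)?: the group may be absent from the match.
pattern = re.compile(r"([0-9]+(dk){0,1})[0-9]+")
with_dk = pattern.search("kdlal123dk345")
without_dk = pattern.search("kdlal123345")

print(with_dk.groups())     # both groups took part in the match
print(without_dk.groups())  # group 2 did not take part, so it is None
```

In the second string the greedy `[0-9]+` first consumes all six digits and then gives one back so that the trailing `[0-9]+` can still match, which is why group 1 ends one digit early.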
Examples
1. Determine whether a string is all lowercase
Code
    # -*- coding: cp936 -*-
    import re

    s1 = 'adkkdk'
    s2 = 'abc123efg'

    # search() needs an explicit ^ to anchor the pattern at the start
    a = re.search('^[a-z]+$', s1)
    if a:
        print 's1:', a.group(), 'all lowercase'
    else:
        print s1, 'Not all lowercase!'

    # match() always starts at the beginning of the string
    a = re.match('[a-z]+$', s2)
    if a:
        print 's2:', a.group(), 'all lowercase'
    else:
        print s2, 'Not all lowercase!'
Results
    s1: adkkdk all lowercase
    abc123efg Not all lowercase!
Explanation
1. Regular expressions are not built into the Python language itself; import the re module to use them.
2. The call forms are re.search(pattern, string) and re.match(pattern, string). The difference is that re.match only matches at the beginning of the string, as if the pattern started with ^. So for any string s,
re.search('^[a-z]+$', s) is equivalent to re.match('[a-z]+$', s)
3. If the match fails, a = re.search('^[a-z]+$', s1) returns None
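A compact way to compare the call forms is sketched below in Python 3 syntax. The helper `is_all_lowercase` is a name made up here, and it uses `re.fullmatch`, which the article does not use, as a swapped-in alternative to anchoring with `^` and `$`.

```python
import re

def is_all_lowercase(s):
    """True if s is non-empty and consists only of the letters a-z."""
    return re.fullmatch(r"[a-z]+", s) is not None

for s in ["adkkdk", "abc123efg", ""]:
    via_search = re.search(r"^[a-z]+$", s) is not None
    via_match = re.match(r"[a-z]+$", s) is not None
    print(repr(s), via_search, via_match, is_all_lowercase(s))

# One difference: $ also matches just before a trailing newline,
# whereas fullmatch requires the pattern to cover the whole string.
print(re.match(r"[a-z]+$", "abc\n") is not None, is_all_lowercase("abc\n"))
```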
group() is used to retrieve the groups of a match result
For example
    import re

    a = "123abc456"
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)  # 123abc456, the whole match
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)  # 123
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)  # abc
    print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)  # 456
1) The three pairs of parentheses in the regular expression divide the match into three groups
group() is the same as group(0): the whole text matched by the regular expression
group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.
2) If nothing matches, re.search() returns None
3) If the regular expression contains no parentheses, calling group(1) raises an error.
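The three points above can be seen in one short sketch (Python 3 syntax; the sample strings other than "123abc456" are invented for illustration):

```python
import re

m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")
print(m.group(0))   # the whole match
print(m.groups())   # all numbered groups as a tuple

# A failed search returns None, so test the result before calling group().
print(re.search(r"[0-9]+", "no digits here"))

# Asking for a group the pattern does not define raises IndexError.
try:
    re.search(r"[0-9]+", "abc123").group(1)
except IndexError as exc:
    print("IndexError:", exc)
```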
2. Expand abbreviations
Specific examples
FEMA Federal Emergency Management Agency
IRA Irish Republican Army
DUP Democratic Unionist Party
FDA Food and Drug Administration
OLC Office of Legal Counsel
Analysis
Taking FEMA as an example, the abbreviation decomposes into F*** E*** M*** A***. The rule for each word is: an uppercase letter + one or more lowercase letters + a space
Reference Code
    import re

    def expand_abbr(sen, abbr):
        lenabbr = len(abbr)
        ma = ''
        # one capitalized word followed by a space for each letter
        for i in range(0, lenabbr):
            ma += abbr[i] + "[a-z]+" + ' '
        print 'ma:', ma
        ma = ma.strip(' ')
        p = re.search(ma, sen)
        if p:
            return p.group()
        else:
            return ''

    print expand_abbr("Welcome to Agriculture Bank China", 'ABC')
Results
    ma: A[a-z]+ B[a-z]+ C[a-z]+
    Agriculture Bank China
Problem
The code above handles the first three examples correctly but fails on the last two, because lowercase words (such as "and" and "of") are mixed in between the capitalized words.
Rule
Uppercase letter + one or more lowercase letters + space + [lowercase word + space] (0 or 1 times)
Reference Code
    import re

    def expand_abbr(sen, abbr):
        lenabbr = len(abbr)
        ma = ''
        # every word except the last may be followed by one optional lowercase word
        for i in range(0, lenabbr - 1):
            ma += abbr[i] + "[a-z]+" + ' ' + '([a-z]+ )?'
        ma += abbr[lenabbr - 1] + "[a-z]+"
        print 'ma:', ma
        ma = ma.strip(' ')
        p = re.search(ma, sen)
        if p:
            return p.group()
        else:
            return ''

    print expand_abbr("Welcome to Agriculture Bank of China", 'ABC')
Tip
Wrap the optional middle part (lowercase letters + a space) in parentheses so that it is treated as a unit: either both appear or neither does. Then apply ? to the whole group.
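The rule can be checked against all five abbreviations listed earlier. This is a sketch in Python 3 syntax, not the author's code: it builds the pattern with `str.join`, uses a non-capturing group `(?:...)` for the optional word, and the surrounding sentences in `examples` are invented for illustration.

```python
import re

def expand_abbr(sentence, abbr):
    """Return the first phrase in sentence whose capitalized words spell abbr, or ''."""
    # One capitalized word per letter; one optional lowercase word may sit between them.
    words = [re.escape(letter) + "[a-z]+" for letter in abbr]
    pattern = " (?:[a-z]+ )?".join(words)
    found = re.search(pattern, sentence)
    return found.group() if found else ""

examples = {
    "FEMA": "the Federal Emergency Management Agency said",
    "IRA": "talks with the Irish Republican Army began",
    "DUP": "the Democratic Unionist Party voted",
    "FDA": "approved by the Food and Drug Administration today",
    "OLC": "a memo from the Office of Legal Counsel",
}

for abbr, sentence in examples.items():
    print(abbr, "->", expand_abbr(sentence, abbr))
```

Like the original, this allows at most one lowercase word between two capitalized words and returns an empty string when nothing matches.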
3. Remove commas from numbers
Specific examples
In natural language processing, splitting text on punctuation causes a problem for numbers such as 123,000,000: a single number gets broken apart at its commas. So the numbers should be cleaned first by removing the commas.
Analysis
Numbers are usually written in groups of 3 digits separated by commas, so the pattern is: ***,***,***
Regular expression
\d+,\d+?
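Reference Code 3-1
    import re

    sen = "abc,123,456,789,mnp"
    p = re.compile(r"\d+,\d+?")
    # finditer walks over the matches found in the original value of sen
    for com in p.finditer(sen):
        mm = com.group()
        print "hi:", mm
        print "sen_before:", sen
        sen = sen.replace(mm, mm.replace(",", ""))
        print "sen_back:", sen, '\n'
Results
    hi: 123,4
    sen_before: abc,123,456,789,mnp
    sen_back: abc,123456,789,mnp

    hi: 56,7
    sen_before: abc,123456,789,mnp
    sen_back: abc,123456789,mnp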
Tip
This uses the finditer function: pattern.finditer(string[, pos[, endpos]]) or re.finditer(pattern, string[, flags]).
It searches the string and returns an iterator that yields each match result (match object) in order.
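To see what finditer yields, the sketch below (Python 3 syntax, variable names chosen here) prints each match with its position. The lazy `\d+?` takes only one digit after the comma, which is why the matches do not overlap and the scan can continue with the remaining digits.

```python
import re

sen = "abc,123,456,789,mnp"
pattern = re.compile(r"\d+,\d+?")

# finditer scans the string left to right and yields non-overlapping match objects.
for found in pattern.finditer(sen):
    print(found.group(), found.span())

matches = [found.group() for found in pattern.finditer(sen)]
```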
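Reference Code 3-2
    import re

    sen = "abc,123,456,789,mnp"
    # keep removing the first "digit,digit" comma until none is left
    while 1:
        mm = re.search(r"\d,\d", sen)
        if mm:
            mm = mm.group()
            sen = sen.replace(mm, mm.replace(",", ""))
            print sen
        else:
            break
Results
    abc,123456,789,mnp
    abc,123456789,mnp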
Extension
The programs above target the specific case where digits come in groups of three. More generally, when numbers are mixed with letters, only the commas between digits should be removed, so that "abc,123,4,789,mnp" becomes "abc,1234789,mnp".
Approach
More specifically, find the pattern "digit,digit" and replace it with the same text minus the comma.
Reference Code 3-3
    import re

    sen = "abc,123,4,789,mnp"
    while 1:
        mm = re.search(r"\d,\d", sen)
        if mm:
            mm = mm.group()
            sen = sen.replace(mm, mm.replace(",", ""))
            print sen
        else:
            break
    print sen
Results
    abc,1234,789,mnp
    abc,1234789,mnp
    abc,1234789,mnp
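The same "digit,digit" idea can also be written as a single substitution. This sketch (Python 3 syntax) swaps in `re.sub` with lookbehind and lookahead assertions, a technique the article does not use; the function name `strip_digit_commas` is made up here.

```python
import re

def strip_digit_commas(text):
    """Remove every comma that has a digit directly on both sides."""
    # (?<=\d) and (?=\d) test the neighbours without consuming them,
    # so only the comma itself is replaced.
    return re.sub(r"(?<=\d),(?=\d)", "", text)

print(strip_digit_commas("abc,123,4,789,mnp"))
print(strip_digit_commas("abc,123,456,789,mnp"))
```

Because the digits are not consumed, one pass handles consecutive commas and no loop is needed.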
4. Year conversion in Chinese text (e.g. 一九四九年 ---> 1949年)
Processing Chinese text involves encoding issues. For example, the program below tries to recognize years written as ****年:
    # -*- coding: cp936 -*-
    import re

    m0 = "在一九四九年新中国成立"              # New China was established in 1949
    m1 = "比一九九零年低百分之五点二"          # 5.2% lower than in 1990
    m2 = '人一九九六年击败俄军,取得实质独立'   # ... defeated the Russian army in 1996 and achieved substantial independence

    def fuc(m):
        a = re.findall("[零|一|二|三|四|五|六|七|八|九]+年", m)
        if a:
            for key in a:
                print key
        else:
            print "NULL"

    fuc(m0)
    fuc(m1)
    fuc(m2)
Run results
The output shows that the second and third sentences are matched incorrectly. In a cp936 byte string every Chinese character occupies two bytes, so the character class matches individual bytes rather than whole characters.
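The byte-level effect can be reproduced in Python 3 by matching on explicitly encoded bytes. This is a sketch under assumptions: the sample string `text` is chosen here for illustration, and the character class is written without the `|` separators.

```python
import re

text = "人一九九六年"            # illustrative sample: 人 followed by a year
digits = "零一二三四五六七八九"
year_pattern = "[" + digits + "]+年"

# On text (Unicode) strings the class contains ten whole characters.
text_hits = re.findall(year_pattern, text)

# On cp936-encoded bytes the class contains single bytes instead,
# so bytes belonging to a neighbouring character can be swallowed.
byte_hits = [hit.decode("cp936", errors="replace")
             for hit in re.findall(year_pattern.encode("cp936"), text.encode("cp936"))]

print(text_hits)
print(byte_hits)
```

The two lists differ because both bytes of 人 in cp936 also occur among the bytes of the numerals in the class, so the byte-level match starts one character too early.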
Improvement: match on Unicode strings
    # -*- coding: cp936 -*-
    import re

    m0 = "在一九四九年新中国成立"              # New China was established in 1949
    m1 = "比一九九零年低百分之五点二"          # 5.2% lower than in 1990
    m2 = '人一九九六年击败俄军,取得实质独立'   # ... defeated the Russian army in 1996 and achieved substantial independence

    def fuc(m):
        # decode the cp936 bytes so that the regex works on whole characters
        m = m.decode('cp936')
        a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", m)
        if a:
            for key in a:
                print key
        else:
            print "NULL"

    fuc(m0)
    fuc(m1)
    fuc(m2)
Results
    一九四九年
    一九九零年
    一九九六年
Once the years are recognized, the Chinese numerals can be replaced with Arabic digits.
Reference code
    # -*- coding: utf-8 -*-
    import re

    # map each Chinese numeral (as a unicode character) to its digit
    numhash = {}
    numhash['零'.decode('utf-8')] = '0'
    numhash['一'.decode('utf-8')] = '1'
    numhash['二'.decode('utf-8')] = '2'
    numhash['三'.decode('utf-8')] = '3'
    numhash['四'.decode('utf-8')] = '4'
    numhash['五'.decode('utf-8')] = '5'
    numhash['六'.decode('utf-8')] = '6'
    numhash['七'.decode('utf-8')] = '7'
    numhash['八'.decode('utf-8')] = '8'
    numhash['九'.decode('utf-8')] = '9'

    def change2num(words):
        print "words:", words
        newword = ''
        for key in words:
            print key
            if key in numhash:
                newword += numhash[key]
            else:
                newword += key
        return newword

    def chi2num(line):
        # line must be a unicode string
        a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", line)
        if a:
            print "------"
            print line
            for words in a:
                newwords = change2num(words)
                print words
                print newwords
                line = line.replace(words, newwords)
        return line
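In Python 3, where strings are Unicode by default, the same conversion can be sketched more compactly. The names `DIGIT_TABLE` and `YEAR_PATTERN` are introduced here for illustration, and `str.translate` with `re.sub` is a swapped-in technique, not the author's method.

```python
import re

# Map each Chinese numeral to its Arabic digit.
DIGIT_TABLE = str.maketrans("零一二三四五六七八九", "0123456789")
YEAR_PATTERN = re.compile("[零一二三四五六七八九]+年")

def chi2num(line):
    """Replace Chinese-numeral years such as 一九四九年 with 1949年."""
    # Only the matched year is translated; the rest of the line is left alone.
    return YEAR_PATTERN.sub(lambda m: m.group().translate(DIGIT_TABLE), line)

print(chi2num("一九四九年"))
print(chi2num("一九九零年 and 一九九六年"))
```

Like the original, this converts digit by digit, so it only covers years written one numeral per digit; numerals outside the year pattern are not touched.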
Python Regular Expressions Four cases