Learning Python Data Analysis: the re Regular Expression Module

Source: Internet
Author: User

Regular expressions provide the foundation for advanced text pattern matching, extraction, and search-and-replace functionality. In short, regular expressions (regex for short) are strings of characters and special symbols that describe repetition of a pattern or alternatives among several characters, so a single regular expression can match a whole set of strings that share similar characteristics. In other words, one pattern can match many strings; a regular expression that matched only a single string would be tedious and of little use. Python supports regular expressions through the re module in the standard library.

List of special characters for regular expressions

'.' matches any single character except \n
'-' denotes a range inside brackets, e.g. [0-9]
'*' matches the preceding subexpression zero or more times. To match a literal * character, use \*
'+' matches the preceding subexpression one or more times. To match a literal + character, use \+
'^' matches the start of the string
'$' matches the end of the string
'\' escape character: changes the meaning of the character that follows it. For example, to match a literal * in the string, use \*
'?' matches the preceding subexpression zero or one time
'{m}' matches the preceding subexpression exactly m times
'{n,m}' matches the preceding subexpression from n to m times
'\d' matches a digit; for ASCII text, equivalent to [0-9]
'\D' matches a non-digit; for ASCII text, equivalent to [^0-9]
'\w' matches a letter, digit, or underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
'\W' matches any character that is not a letter, digit, or underscore; for ASCII text, equivalent to [^a-zA-Z0-9_]
'\s' matches a whitespace character
'\S' matches a non-whitespace character
'\A' matches only at the start of the string
'\Z' matches only at the end of the string
'\b' matches at a word boundary, that is, at the start or end of a word. A word is defined as a sequence of alphanumeric characters (and underscores), so the end of a word is marked by whitespace or a non-alphanumeric character
'\B' the opposite of \b: matches only where the current position is not at a word boundary
'(?P<name>...)' a named group: in addition to its usual number, the group gets an additional alias
'[]' defines a set of characters to match. For example, [a-zA-Z0-9] means that the character at that position must be an English letter or a digit. [\s*] means a whitespace character or a literal *.
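As a rough illustration of a few of the symbols listed above, the following sketch runs them against short sample strings. The sample strings and variable names are made up for this example.

```python
import re

# \d matches a digit; \D matches anything that is not a digit.
digits = re.findall(r'\d', 'a1b22')
non_digits = re.findall(r'\D', 'a1b22')

# {n,m} bounds repetition: here, runs of two to three digits.
runs = re.findall(r'\d{2,3}', '1 22 333 4444')

# \b anchors at a word boundary; \B requires NOT being at one.
whole_word = re.findall(r'\bcat\b', 'cat concat cats')
inside_word = re.findall(r'\Bcat', 'cat concat cats')

# (?P<name>...) gives a group an alias in addition to its number.
m = re.match(r'(?P<user>\w+)@(?P<host>\w+)', 'alice@example')

print(digits, non_digits, runs)
print(whole_word, inside_word)
print(m.group('user'), m.group(2))
```

Note that '4444' contributes '444' to runs: the quantifier is greedy and takes at most three digits, and the single digit left over is too short to match.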

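Functions provided by Python's re regular expression module
re.match(pattern, string, flags=0)      # Matches at the start of the string; if the start does not match, match() returns None
re.search(pattern, string, flags=0)     # Scans the whole string and returns the first successful match
re.findall(pattern, string, flags=0)    # Finds all strings matched by the pattern and returns them as a list
re.finditer(pattern, string, flags=0)   # Finds all strings matched by the pattern and returns them as an iterator
re.sub(pattern, repl, string, count=0, flags=0)   # Replaces the matched strings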
Function parameter descriptions:

pattern: the regular expression to match
string: the string to match against
flags: control how the regular expression is matched, for example case sensitivity, multiline matching, and so on
repl: the replacement string; it can also be a function
count: the maximum number of substitutions to make; the default 0 means replace all matches
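The effect of the flags and count parameters can be sketched as follows. The sample text is invented for illustration; re.I and re.M are the standard short names for re.IGNORECASE and re.MULTILINE.

```python
import re

text = 'Spam spam SPAM eggs'

# Without flags the match is case-sensitive.
case_sensitive = re.findall(r'spam', text)
# re.I makes the match case-insensitive.
case_insensitive = re.findall(r'spam', text, re.I)

# re.M makes ^ match at the start of every line, not only the string.
lines = 'one\ntwo\nthree'
first_only = re.findall(r'^\w+', lines)
every_line = re.findall(r'^\w+', lines, re.M)

# count limits the number of substitutions; 0 (the default) replaces all.
replace_all = re.sub(r'spam', 'ham', text, flags=re.I)
replace_one = re.sub(r'spam', 'ham', text, count=1, flags=re.I)

print(case_sensitive, case_insensitive)
print(first_only, every_line)
print(replace_all, '|', replace_one)
```

Passing count and flags by keyword, as here, avoids accidentally supplying a flag where count is expected.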

import re
m = re.match(r'f..', r'begin fool hello')  # match() looks only at the start of the string; to search from any position, use search(), described below
if m is not None:
    print('found : ' + m.group())
else:
    print('not found!')
not found!

The search() function looks for the first occurrence of the pattern anywhere in the string, scanning strictly from left to right.

m = re.search(r'foo', 'beginfool hello')
if m is not None:
    print('found : ' + m.group())
else:
    print('not found...')
found : foo

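Matching any single character

pattern = '.end'
m = re.match(pattern, 'bend')  # the dot matches 'b'
if m is not None: print(m.group())
bend
m = re.match(pattern, 'end')  # no character for the dot to match, so nothing is printed
if m is not None: print(m.group())
m = re.match(pattern, '\nbend')  # the dot matches any character except '\n', so nothing is printed
if m is not None: print(m.group())
m = re.search(pattern, 'The end.')  # with search(), the dot matches the space, giving ' end'
if m is not None: print(m.group())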

Using the group() and groups() methods

m = re.match(r'(\w{3})-(\d{3})', 'abc-123')
if m is not None:
    print('m.group(): ' + m.group())
    print('m.group(1): ' + m.group(1))
    print('m.group(2): ' + m.group(2))
    print('m.groups(): ' + str(m.groups()))
m.group(): abc-123
m.group(1): abc
m.group(2): 123
m.groups(): ('abc', '123')
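A match object offers a few more accessors than group() and groups(). The sketch below reuses the 'abc-123' example but gives the groups names; the names word and num are assumptions made for illustration.

```python
import re

# Same pattern as above, with named groups (names are illustrative).
m = re.match(r'(?P<word>\w{3})-(?P<num>\d{3})', 'abc-123')
assert m is not None

whole = m.group(0)          # group 0 is the entire match, same as m.group()
pair = m.group(1, 2)        # several groups at once, returned as a tuple
by_name = m.group('word')   # a named group can be fetched by its alias
as_dict = m.groupdict()     # all named groups as a dictionary
num_span = m.span('num')    # (start, end) offsets of a group in the string

print(whole, pair, by_name, as_dict, num_span)
```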

findall() looks up all non-overlapping occurrences of a regular expression pattern in a string. It is similar to search() in that it searches the string, but it differs from match() and search() in that findall() always returns a list. If findall() finds no match, it returns an empty list; if it succeeds, the list contains all successful matches, in left-to-right order of appearance.

re.findall('car', 'car')
['car']
re.findall('car', 'scary')
['car']
re.findall('car', 'carry the brcardi to the car')
['car', 'car', 'car']
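What each list element looks like depends on how many groups the pattern has, which is easy to overlook. A small sketch with made-up sample strings:

```python
import re

# Matches never overlap: 'aaaa' holds only two disjoint 'aa' matches.
overlap = re.findall(r'aa', 'aaaa')

sample = 'a=1, b=22'
# No groups: each element is the whole match.
no_group = re.findall(r'\w+=\d+', sample)
# One group: each element is the text of that group only.
one_group = re.findall(r'\w+=(\d+)', sample)
# Several groups: each element is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', sample)
# No match at all: an empty list, never None.
no_match = re.findall(r'z', sample)

print(overlap, no_group, one_group, two_groups, no_match)
```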

finditer() is similar to findall(), but where findall() returns the matched strings in a list, finditer() returns an iterator that yields match objects.

s = 'This and that.'
re.findall(r'(th\w+)', s, re.I)
['This', 'that']
it = re.finditer(r'(th\w+)', s, re.I)
it
<callable_iterator at 0x594a780>
[m.group() for m in it]  # findall() returns a list, whereas finditer() returns an iterator
['This', 'that']
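Because finditer() yields match objects rather than strings, positions are available as well as text. The sketch below also shows that the iterator can be consumed only once; the variable names are chosen for this example.

```python
import re

s = 'This and that.'

# Each item is a match object, so start() and end() are available.
found = [(m.group(), m.start(), m.end())
         for m in re.finditer(r'th\w+', s, re.I)]

# An iterator is exhausted after one pass; a second pass yields nothing.
it = re.finditer(r'th\w+', s, re.I)
first_pass = [m.group() for m in it]
second_pass = [m.group() for m in it]

print(found)
print(first_pass, second_pass)
```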

Two functions/methods implement search and replace: sub() and subn(). They are almost identical: both replace every part of a string that matches a regular expression. The replacement is usually a string, but it can also be a function that returns the replacement string. subn() does the same as sub(), but it also reports how many substitutions were made: it returns a two-element tuple containing the new string followed by the total number of replacements.

print(re.sub('X', 'Mr. Iceman', 'attn: X\n\nDear X,\n'))
attn: Mr. Iceman

Dear Mr. Iceman,

print(re.subn('X', 'Mr. Iceman', 'attn: X\n\nDear X,\n'))
('attn: Mr. Iceman\n\nDear Mr. Iceman,\n', 2)
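The case where the replacement is a function, mentioned above, might look like the following sketch. The helper name double and the sample sentence are hypothetical.

```python
import re

def double(match):
    """Return the matched integer doubled, as a string."""
    return str(int(match.group()) * 2)

sentence = '3 apples and 10 pears'

# The function is called once per match and must return a string.
doubled = re.sub(r'\d+', double, sentence)

# subn() returns the new string together with the substitution count.
result, total = re.subn(r'\d+', double, sentence)

# A replacement string can refer back to groups with \1, \2, ...
swapped = re.sub(r'(\w+)@(\w+)', r'\2 at \1', 'user@host')

print(doubled, total, swapped)
```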

The re module function and the regular expression object method split() work much like their string counterpart, but rather than splitting on a fixed string, they split on a regular expression pattern, which adds considerable power to string splitting.

re.split() works the same way as str.split() when the given delimiter is not a regular expression that uses special symbols to match multiple patterns.

re.split(':', 'str1:str2:str3')
['str1', 'str2', 'str3']
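To make the difference from str.split() concrete, here is a brief sketch with invented input strings. It also shows the maxsplit parameter and the effect of a capturing group in the delimiter pattern.

```python
import re

# With a plain delimiter, re.split() and str.split() agree.
fixed = re.split(':', 'str1:str2:str3')
same_as_str = fixed == 'str1:str2:str3'.split(':')

# A pattern can cover several delimiters plus optional whitespace.
mixed = re.split(r'[,;]\s*', 'a, b;c,  d')

# maxsplit limits the number of splits performed.
limited = re.split(r'[,;]\s*', 'a, b;c,  d', maxsplit=1)

# A capturing group keeps the delimiters in the result.
with_delims = re.split(r'([,;])\s*', 'a, b;c')

print(fixed, same_as_str)
print(mixed, limited, with_delims)
```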
DATA = (
    'Mountain View, CA 94040',
    'Sunnyvale, CA',
    'Los Altos, 94023',
    'Cupertino 95014',
    'Palo Alto CA',
)
for item in DATA:
    # split on ', ' or on a space that is followed by a five-digit code or two capital letters
    print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', item))
['Mountain View', 'CA', '94040']
['Sunnyvale', 'CA']
['Los Altos', '94023']
['Cupertino', '95014']
['Palo Alto', 'CA']
import os
import re
# tasklist is a Windows command; /nh suppresses the column headers
with os.popen('tasklist /nh', 'r') as f:
    for line in list(f)[:5]:
        # print(re.split(r'\s\s+|\t', line.rstrip()))  # does not separate the PID from the session name
        print(re.findall(r'([\w.]+(?: [\w.]+)*)\s\s*(\d+)\s(\w+)\s\s*(\d+)\s\s*([\d,]+\sK)', line.strip()))
[]
[('System Idle Process', '0', 'Services', '0', '24 K')]
[('System', '4', 'Services', '0', '2,852 K')]
[('smss.exe', '364', 'Services', '0', '1,268 K')]
[('csrss.exe', '612', 'Services', '0', '6,648 K')]
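Since tasklist exists only on Windows, the same pattern can be tried on any platform against hard-coded lines. The sample lines below are assumptions, laid out to resemble the column format implied by the output above; they are not captured from a real run.

```python
import re

# Pattern from the example above: name, PID, session name, session number, memory.
TASK_RE = r'([\w.]+(?: [\w.]+)*)\s\s*(\d+)\s(\w+)\s\s*(\d+)\s\s*([\d,]+\sK)'

# Hypothetical lines imitating `tasklist /nh` output (first line is blank).
sample_lines = [
    '\n',
    'System Idle Process              0 Services                   0         24 K\n',
    'smss.exe                       364 Services                   0      1,268 K\n',
]

parsed = [re.findall(TASK_RE, line.strip()) for line in sample_lines]
for row in parsed:
    print(row)
```

The process name part, [\w.]+(?: [\w.]+)*, allows single spaces inside a name, so 'System Idle Process' stays in one field while the run of two or more spaces ends it.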

This article ends with a complete example that uses regular expressions to manipulate strings in several ways. The script first generates random data for practising regular expressions, and then extracts the numbers and email addresses from the generated data.

from random import randrange, choice
from string import ascii_lowercase as lc
from datetime import datetime
from time import ctime
import re

result_data = []

# generate data
tlds = ('com', 'cn', 'edu', 'net', 'gov', 'org')
for i in range(randrange(4, 9)):
    max_seconds = int(datetime.now().timestamp())
    dtint = randrange(max_seconds)  # random timestamp between the epoch and now
    # dtstr = str(datetime.fromtimestamp(dtint))
    dtstr = ctime(dtint)
    llen = randrange(4, 8)  # login length
    login = ''.join(choice(lc) for j in range(llen))
    dlen = randrange(llen, 13)  # domain length; the upper bound is missing in the source, 13 is an assumption
    dom = ''.join(choice(lc) for j in range(dlen))
    result_data.append('%s::%s@%s.%s::%d-%d-%d' % (dtstr, login, dom, choice(tlds), dtint, llen, dlen))
# print(result_data)

# test re
re_patt = r'^(\w{3}).*::(?P<email>\w+@\w+\.\w+)::(?P<number>\d+-\d+-\d+)'
for item in result_data:
    m = re.match(re_patt, item)
    if m is not None:
        print('*' * 30)
        print(item)
        print('email: ' + m.group('email'))
        print('number: ' + m.group('number'))
******************************
Tue Jan 15:34:09 1992::[email protected]::696584049-7-11
email: [email protected]
number: 696584049-7-11
******************************
Thu Dec 22:35:52 1971::[email protected]::62346952-6-7
email: [email protected]
number: 62346952-6-7
******************************
Sat Jan 25 11:26:50 2003::[email protected]::1043465210-6-8
email: [email protected]
number: 1043465210-6-8
******************************
Wed Sep 23:37:34 1977::[email protected]::244309054-7-10
email: [email protected]
number: 244309054-7-10
******************************
Sun Feb 7 12:08:11 1988::[email protected]::571205291-4-9
email: [email protected]
number: 571205291-4-9
******************************
Sat 00:04:58 1996::[email protected]::841421098-4-8
email: [email protected]
number: 841421098-4-8
******************************
Tue Oct 9 05:32:20 1984::[email protected]::466119140-7-7
email: [email protected]
number: 466119140-7-7
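Because the script's data is random, the extraction step is easier to check against one fixed record. The sketch below assumes a record in the same '::'-separated layout; the email address in it is made up for illustration.

```python
import re

# Hypothetical record: '<ctime>::<login>@<domain>.<tld>::<timestamp>-<login len>-<domain len>'
record = 'Sat Jan 25 11:26:50 2003::abcdef@ghijklmn.com::1043465210-6-8'

pattern = r'^(\w{3}).*::(?P<email>\w+@\w+\.\w+)::(?P<number>\d+-\d+-\d+)'
m = re.match(pattern, record)
assert m is not None

weekday = m.group(1)          # first group: the three-letter day name
email = m.group('email')      # named group for the address
number = m.group('number')    # named group for the trailing number triple

# The triple can be split further into its three integer fields.
timestamp, login_len, domain_len = (int(part) for part in number.split('-'))

print(weekday, email, number)
print(timestamp, login_len, domain_len)
```

The greedy .* first runs to the end of the record and then backtracks until the text after a '::' looks like an email address, which is why the pattern still lands on the correct separator.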
