Python regular expressions: the re module
Search for a pattern in a piece of text
Python has a library for this: re
import re
In a regular expression, a dot (.) stands for any single character
[a-z] means this position must be a lowercase letter from a to z
print(len(result))
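A minimal sketch of the points above. The sample string is made up for illustration and stands in for the text read from a file.

```python
import re

# Hypothetical sample text standing in for the file contents.
text = "ant a1z ate"

# 'a', then any single character, then one lowercase letter.
result = re.findall(r"a.[a-z]", text)

print(result)       # the dot also accepts the digit in 'a1z'
print(len(result))  # number of matches found
```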
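#!/usr/bin/python
import re

# Read the whole file into one string
text = ''
file = open('shi.txt')
for line in file:
    text = text + line
file.close()

# A space, then a followed by two lowercase letters, then a space
result = re.findall(' a[a-z][a-z] ', text)
print(result)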
result = re.findall(' (a.[a-z]) ', text)  # with (), only the part inside the parentheses is returned
so the spaces on the left and right are dropped from the results
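The effect of the parentheses can be seen by running the same pattern with and without them. This is only a sketch; the sample string is an assumption.

```python
import re

# Hypothetical sample text with spaces around each word.
text = " abc  axe "

# Without a group, findall returns the whole match, spaces included.
with_spaces = re.findall(r" a.[a-z] ", text)

# With a group, findall returns only what the parentheses captured.
without_spaces = re.findall(r" (a.[a-z]) ", text)

print(with_spaces)
print(without_spaces)
```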
To remove duplicate results:
result = re.findall(' (a.[a-z]) ', text)
result = set(result)
print(result)
So far every result starts with a lowercase a.
Uppercase: [Aa] means the first letter can be an uppercase A or a lowercase a
* matches the preceding item zero or more times
a* matches:
(the empty string)
a
aa
aaaaaaaaaaaaaaaa
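A quick check of the list above, as a sketch. It uses re.fullmatch so that the whole string has to fit the pattern; the candidate strings are chosen for illustration.

```python
import re

# Strings to test against a*; "ab" is included as a non-matching case.
candidates = ["", "a", "aa", "aaaaaaaaaaaaaaaa", "ab"]

# True when the entire string consists of zero or more 'a' characters.
matches_whole = {s: re.fullmatch(r"a*", s) is not None for s in candidates}

print(matches_whole)
```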
Spaces
' *' (a space followed by *) matches any number of spaces: none, one, or many
result = re.findall(' *([Aa].[a-z]) ', text)
This unexpectedly matches afe from safe. The reason:
' *' also allows zero spaces, so afe only needs a space after it, not one before it
Solution: filter with two rules
result = re.findall(' ([Aa].[a-z]) |([A].[a-z]) ', text)
Each result is now a pair.
Look at the code:
There are two pairs of parentheses. When the first one does not match, its slot holds an empty string and the match sits in the right-hand slot.
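The problem and the two-rule fix can be sketched as follows. The sample string is made up, and the two-rule pattern shown here is one plausible form of it (an assumption): a space is required before the left-hand rule, while the right-hand rule accepts a capital A with nothing in front.

```python
import re

# Hypothetical sample text; note the trailing space after 'ape'.
text = "And the safe ape "

# Loose rule: zero leading spaces are allowed, so 'afe' inside 'safe' slips in.
loose = re.findall(r" *([Aa].[a-z]) ", text)

# Two rules joined by |: each result is a (left, right) pair, and the
# group that did not take part in the match is an empty string.
pairs = re.findall(r" ([Aa].[a-z]) |([A].[a-z]) ", text)

# Collect the non-empty half of every pair.
words = {w for pair in pairs for w in pair if w}

print(loose)
print(pairs)
print(words)
```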
#!/usr/bin/python
import re

text = ''
file = open('shi.txt')
for line in file:
    text = text + line
file.close()

result = re.findall(' ([Aa].[a-z]) |([A].[a-z]) ', text)
final_result = set()  # set() creates an empty set
for pair in result:
    if pair[0] not in final_result:
        final_result.add(pair[0])  # the match from the left-hand rule is pair[0]
    if pair[1] not in final_result:
        final_result.add(pair[1])  # the match from the right-hand rule is pair[1]
final_result.discard('')  # drop the empty placeholder; no error if it is absent
print(final_result)
A short summary:
. means any single character at this position
\d must be a digit
\d+ means at least one digit
(the difference from *: a* can also match the empty string)
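The summary can be tried out on a short string. This is a sketch; the sample text and variable names are assumptions.

```python
import re

# Hypothetical sample text.
text = "room 42, floor 7"

singles = re.findall(r"\d", text)   # one digit per match
runs = re.findall(r"\d+", text)     # each run of digits is one match

# * accepts the empty string, + does not.
star_accepts_empty = re.fullmatch(r"a*", "") is not None
plus_accepts_empty = re.fullmatch(r"\d+", "") is not None

print(singles, runs, star_accepts_empty, plus_accepts_empty)
```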
Try it out:
#!/usr/bin/python
import re

text = ''
file = open('shi.txt')
for line in file:
    text = text + line
file.close()

# Every run of one or more digits in the file
result = re.findall(r'\d+', text)
print(result)
\d{2} matches exactly two digits
\d{2,3} matches two to three digits
\w matches a letter, digit, or underscore: [a-zA-Z0-9_]
\w{2,3} matches two or three such characters
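A small sketch of the counted repetitions above, run on made-up strings.

```python
import re

# Exactly two digits per match; the leftover '5' is not matched.
exactly_two = re.findall(r"\d{2}", "12345")

# Two or three digits; the engine takes three first when it can.
two_or_three = re.findall(r"\d{2,3}", "12345")

# \w covers letters, digits, and the underscore.
word_chars = re.findall(r"\w{2,3}", "ab_1 x")

print(exactly_two, two_or_three, word_chars)
```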
Lines that start with a given string
f = open('imooc.txt')
for line in f:
    if line.startswith('imooc'):
        print(line)
f.close()
Lines that both start and end with a given string
#!/usr/bin/python
import re

def find_start_imooc(fname):
    f = open(fname)
    for line in f:
        if line.startswith('imooc'):
            print(line)
    f.close()

#find_start_imooc('imooc.txt')

def find_in_mooc(fname):
    f = open(fname)
    for line in f:
        # every line read from the file ends with \n
        if line.startswith('imooc') and line.endswith('imooc\n'):
            print(line)
    f.close()

find_in_mooc('imooc.txt')
The same function, using a slice to drop the trailing newline before the check:
def find_in_mooc(fname):
    f = open(fname)
    for line in f:
        # line[:-1] is the line without its last character (the \n)
        if line.startswith('imooc') and line[:-1].endswith('imooc'):
            print(line)
    f.close()

find_in_mooc('imooc.txt')
Match a variable name that begins with an underscore or a letter
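The notes do not give a pattern for this, so the one below is an assumption: one letter or underscore, followed by any number of word characters. NAME_RE and is_variable_name are hypothetical names used only for this sketch.

```python
import re

# Assumed pattern: first character is a letter or underscore,
# the rest are letters, digits, or underscores.
NAME_RE = re.compile(r"[A-Za-z_]\w*")

def is_variable_name(s):
    """Return True when the whole string fits the assumed pattern."""
    return NAME_RE.fullmatch(s) is not None

print(is_variable_name("_count"))
print(is_variable_name("name2"))
print(is_variable_name("2name"))
```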
s3 = '1 dsf se'
s3.split() returns the list of pieces, split on whitespace
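res = r't[io]p'
Inside square brackets, ^ means "not any of these": res = r't[^io]p'
^ at the start of a pattern anchors it to the beginning of the line: r'^hello' matches only when the line begins with hello
$ anchors to the end of the line: r'hello$' matches only when the line ends with hello
In r't[abc$]' the $ sits inside the brackets and is an ordinary character, so this definitely does not mean "a, b, or c at the end of the line"
Likewise, in [^abc] the ^ means any character except a, b, or c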
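The bracket and anchor rules above, sketched on made-up strings.

```python
import re

# t, then i or o, then p.
in_class = re.findall(r"t[io]p", "tip top tap")

# t, then anything except i or o, then p.
negated = re.findall(r"t[^io]p", "tip top tap")

# ^ anchors to the start, $ to the end.
starts_with = re.search(r"^hello", "hello world") is not None
not_at_start = re.search(r"^hello", "say hello") is not None
ends_with = re.search(r"hello$", "say hello") is not None

# Inside brackets, $ is just a literal dollar sign.
literal_dollar = re.findall(r"t[abc$]", "ta t$ td")

print(in_class, negated, starts_with, not_at_start, ends_with, literal_dollar)
```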
\d a decimal digit [0-9]
\D a non-digit character [^0-9]
\s any whitespace character [ \t\n\r\f\v]
\S any non-whitespace character [^ \t\n\r\f\v]
\w any alphanumeric character or underscore [a-zA-Z0-9_]
\W any character that is not alphanumeric or underscore [^a-zA-Z0-9_]
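Each shorthand and its uppercase opposite can be compared on one short string. The sample is an assumption chosen to contain a letter, a digit, a space, and an underscore.

```python
import re

sample = "a1 _b"  # hypothetical sample

digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
spaces = re.findall(r"\s", sample)
non_spaces = re.findall(r"\S", sample)
word = re.findall(r"\w", sample)
non_word = re.findall(r"\W", sample)

print(digits, non_digits, spaces, non_spaces, word, non_word)
```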
010-12345656
r = r'^010-\d{8}'  {8} repeats the preceding rule eight times; a{8} means a repeated 8 times
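A sketch of the phone-number pattern in use. The test strings other than 010-12345656 are made up.

```python
import re

pattern = r"^010-\d{8}"

# Prefix 010-, then exactly eight digits.
full_number = re.match(pattern, "010-12345656") is not None

# Too few digits after the dash.
short_number = re.match(pattern, "010-1234") is not None

# The prefix is not at the start of the string.
shifted = re.match(pattern, "x010-12345656") is not None

print(full_number, short_number, shifted)
```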
*
r = r'ab*'  the preceding b appears zero or more times (it may not appear at all)
+ matches one or more times; unlike *, it needs at least one occurrence
? means optional: the preceding item appears zero times or once
Greedy versus non-greedy matching
r = r'ab+?'  a ? after the quantifier takes as few characters as possible, so against abbbbbbbb it returns ab rather than the whole string
{} curly braces: {m,n} repeats the preceding item at least m times and at most n times
r = r'a{1,3}'
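The quantifiers above, side by side on made-up strings, as a sketch.

```python
import re

# Greedy: + takes as many b's as it can.
greedy = re.match(r"ab+", "abbbbbbbb").group()

# Non-greedy: +? stops after the first b.
lazy = re.match(r"ab+?", "abbbbbbbb").group()

# * allows zero b's, so a lone 'a' still matches.
star = re.match(r"ab*", "a").group()

# ? allows zero or one b.
optional = re.findall(r"ab?", "a ab abb")

# {1,3}: between one and three a's per match.
bounded = re.findall(r"a{1,3}", "aaaaa")

print(greedy, lazy, star, optional, bounded)
```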
match() returns a match object if the pattern matches at the start of the string, otherwise None
csvt_re = re.compile(r'csvt', re.I)
csvt_re.match('csvt hello')
search() finds a match no matter where it is in the string
finditer() returns an iterator of match objects
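A sketch comparing the three methods on the compiled pattern. The input strings are assumptions chosen to show the difference.

```python
import re

# re.I makes the pattern case-insensitive.
csvt_re = re.compile(r"csvt", re.I)

# match() only looks at the start of the string.
at_start = csvt_re.match("CSVT hello")
not_at_start = csvt_re.match("hello csvt")

# search() scans the whole string.
anywhere = csvt_re.search("hello csvt")

# finditer() yields one match object per occurrence.
starts = [m.start() for m in csvt_re.finditer("csvt Csvt CSVT")]

print(at_start.group(), not_at_start, anywhere.start(), starts)
```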
rs = r'c..t'
re.sub(rs, 'python', 'csvt scat')  # sub() replaces every match of rs with 'python'
re.split(r'[\+\-\*]', s)  # split() cuts s at every +, -, or *
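A sketch of sub() and split(). The string passed to split() and the second sub() input are made up; s is not defined in the notes.

```python
import re

rs = r"c..t"  # c, any two characters, t

# 'csvt' fits the pattern; 'cat' inside 'scat' has only one
# character between c and t, so it is left alone.
replaced = re.sub(rs, "python", "csvt scat")

# Both words fit here.
replaced_both = re.sub(rs, "python", "csvt coat")

# Split on any of the operators +, -, *.
parts = re.split(r"[\+\-\*]", "1+2-3*4")

print(replaced, replaced_both, parts)
```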