Objective
I will not cover the basics of regular expressions here. Extraction generally falls into two cases: extracting the string at a single position in the text, and extracting the strings at several consecutive positions. Both cases come up in log analysis, and this article gives a method for each.
1. Single-position string extraction
In this case we can use the regular expression (.+?) to extract the string. For example, given the string "a123b", if we want the value 123 that sits between a and b, we can pass the regular expression to findall, which returns a list containing every match.
The code is as follows:
import re
text = "a123b"
print(re.findall(r"a(.+?)b", text))  # prints ['123']
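When only the first occurrence is needed, re.search can be used instead of findall. The following is a minimal sketch, not part of the original example: the helper name extract_between and the second sample string are assumptions made for illustration.

```python
import re

def extract_between(text, start, end):
    """Return the first substring found between start and end, or None."""
    # re.escape keeps the markers literal even if they contain regex metacharacters
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so guard before calling group()
    return match.group(1) if match else None

print(extract_between("a123b", "a", "b"))            # prints 123
print(extract_between("no markers here", "[", "]"))  # prints None
```

Unlike findall, this returns a single string rather than a list, which can be more convenient when a line is expected to contain the value only once.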
1.1 Greedy and non-greedy matches
Suppose we have the string "a123b456b" and want to match everything between a and the last b, rather than the value between a and the first b. The ? after a quantifier switches between greedy and non-greedy matching, so we can use it to control which result we get.
The code is as follows:
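import re
text = "a123b456b"
# Non-greedy: the ? after + makes the match as short as possible,
# so only the part up to the nearest b is returned
print(re.findall(r"a(.+?)b", text))  # prints ['123']
# Greedy: + and * match as much as possible, up to the last b
print(re.findall(r"a(.+)b", text))   # prints ['123b456']
print(re.findall(r"a(.*)b", text))   # prints ['123b456']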
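The difference is easier to see when the string contains several a...b pairs. The sketch below uses a sample string that is not in the original article and is only meant as an illustration.

```python
import re

text = "a1b a2b a3b"  # sample input assumed for illustration

lazy = re.findall(r"a(.+?)b", text)    # shortest match for each a...b pair
greedy = re.findall(r"a(.+)b", text)   # one match from the first a to the last b

print(lazy)    # prints ['1', '2', '3']
print(greedy)  # prints ['1b a2b a3']
```

The non-greedy form is usually the one you want when a line holds several delimited values.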
1.2 Multi-line matching
To match across multiple lines, add the re.S and re.M flags. With re.S, the . also matches line breaks; by default, . does not match a line break.
The code is as follows:
import re
text = "a23b\na34b"
# prints [] because . cannot cross the \n in the middle of text
print(re.findall(r"a(\d+)b.+a(\d+)b", text))
# prints [('23', '34')]
print(re.findall(r"a(\d+)b.+a(\d+)b", text, re.S))
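The same flag is handy for pulling out a block that spans several lines. The following sketch assumes a made-up text with BEGIN and END markers, which are not from the original article.

```python
import re

# Hypothetical multi-line input, used only for illustration
text = "BEGIN\nline one\nline two\nEND"

# Without re.S the dot cannot match the newline after BEGIN, so nothing is found
without_flag = re.findall(r"BEGIN(.+?)END", text)
# With re.S (the same flag as re.DOTALL) the dot crosses newlines
with_flag = re.findall(r"BEGIN(.+?)END", text, re.S)

print(without_flag)  # prints []
print(with_flag)     # prints ['\nline one\nline two\n']
```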
With re.M, ^ and $ match at the start and end of every line; by default, ^ and $ only match at the start and end of the whole string.
The code is as follows:
import re
text = "a23b\na34b"
print(re.findall(r"^a(\d+)b", text))        # prints ['23']
print(re.findall(r"^a(\d+)b", text, re.M))  # prints ['23', '34']
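The two flags can also be combined with the | operator. The sketch below is an illustration only; the third line of the sample text is an assumption added to show the effect.

```python
import re

# Sample text assumed for illustration; the last line starts with x instead of a
text = "a23b\na34b\nx99b"

# With re.M (the same flag as re.MULTILINE), ^ and $ anchor at every line
per_line = re.findall(r"^a(\d+)b$", text, re.M)
# With re.S | re.M, the dot crosses newlines and ^ still matches at each line start
across = re.findall(r"^a(\d+)b.+^x(\d+)b", text, re.S | re.M)

print(per_line)  # prints ['23', '34']
print(across)    # prints [('23', '99')]
```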
2. Extracting strings at several consecutive positions
In this case we can use named groups of the form (?P<name>…). For example, suppose we have one line from a web server access log:
'192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'
To extract every field from this line, we can write several (?P<name>expr) groups, where name is the variable name you choose for the string at that position and expr is the regular expression that matches it.
The code is as follows:
import re
line = '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'
# One named group per field; fields are separated by spaces or wrapped in double quotes
reg = re.compile(r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" (?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"')
regMatch = reg.match(line)
# groupdict() maps each group name to the text it matched
linebits = regMatch.groupdict()
print(linebits)
for k, v in linebits.items():
    print(k + ": " + v)
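In real log analysis the pattern is applied to many lines, and some of them may not match. The following is a minimal sketch of that loop: the names LOG_PATTERN and parse_lines and the second sample line are assumptions for illustration, while the pattern itself is the one shown above.

```python
import re

# Same pattern as above, compiled once and reused for every line
LOG_PATTERN = re.compile(
    r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" '
    r'(?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_lines(lines):
    """Return one dict per line that matches; skip lines that do not."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:  # match() returns None when the line has a different shape
            continue
        records.append(match.groupdict())
    return records

sample = [
    '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"',
    'this line is not an access log entry',  # hypothetical malformed line
]
records = parse_lines(sample)
print(len(records))          # prints 1
print(records[0]["status"])  # prints 200
```

Checking for None before calling groupdict() avoids an AttributeError on lines that do not follow the expected layout.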
The loop prints the following (the order of the keys can differ between Python versions):
status: 200
referrer: http://abc.com/search
request: GET /api HTTP/1.1
user_agent: Mozilla/5.0
date: 25/Oct/2012:14:46:34
size: 44
remote_ip: 192.168.0.1
Summary
That is all for this article. I hope it is of some help in your study or work. If you have any questions, feel free to leave a message.
Python extracts strings using regular expressions