This article shows how to use regular expressions that contain Chinese characters in Python to match a specified Chinese string. It is shared here for reference.
Business Scenario:
Extract a specified Chinese substring from a Chinese sentence. I run into this situation often at work, so I have summarized the approach below.
Difficulties:
Character encodings such as GBK and UTF-8 need careful handling. When the regular expression pattern itself contains Chinese characters, the encoding must be handled with great care for the Chinese characters to match correctly. It is best to standardize on UTF-8 throughout; when that is not possible, handle each encoding deliberately.
A single general-purpose regular expression can often simplify a program considerably and achieve more with less effort. This is often the clearest difference between an expert and a beginner.
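The encoding point can be illustrated with a minimal sketch. It is written for Python 3 (an assumption; the scripts in this article target Python 2), where patterns and text are Unicode strings and encoded input arrives as bytes. The sample sentence and the names raw_gbk and find_in_bytes are made up for illustration.

```python
import re

# Hypothetical sample: a sentence as it might arrive from a GBK-encoded source.
raw_gbk = "陕西省西安市雁塔区".encode("gbk")

# A Unicode (str) pattern that contains Chinese characters.
PATTERN = re.compile(r"西安市")


def find_in_bytes(raw, encoding):
    """Decode the raw bytes first, then search the decoded text."""
    decoded = raw.decode(encoding)
    m = PATTERN.search(decoded)
    return m.group() if m else None


# Decoding with the codec the data was actually written in: the match works.
found = find_in_bytes(raw_gbk, "gbk")

# Decoding GBK bytes as UTF-8 fails, so no match is possible.
try:
    wrong = find_in_bytes(raw_gbk, "utf-8")
except UnicodeDecodeError:
    wrong = None

# Searching the undecoded bytes with a str pattern is rejected outright.
try:
    PATTERN.search(raw_gbk)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(found, wrong, mixed_ok)
```

The design choice is to decode once at the boundary, using the encoding the data really has, and to keep both the pattern and the text as Unicode from then on.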
Example one:
Parse province, city, and county names out of the Chunzhen (纯真) QQ IP database. The regular expression here basically covers this business scenario. The lazy-matching "?" is essential: if it is handled badly, we will not get the result we want. I only touch on the subtleties here and leave the rest for you to work through.
The code is as follows:
#!/usr/bin/env python
# encoding: utf-8
# description: extract province, city, and county names from a string;
#              used to parse geographic data from the Chunzhen (纯真) IP database
import re
import sys

# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf8')

# The pattern must carry the u prefix; the r prefix is optional.
# The question mark in the first group makes it lazy, which is required here.
PATTERN = \
    ur'([\u4e00-\u9fa5]{2,5}?(?:省|自治区|市))([\u4e00-\u9fa5]{2,7}?(?:市|区|县|州)){0,1}([\u4e00-\u9fa5]{2,7}?(?:市|区|县)){0,1}'

data_list = [
    '北京市',
    '陕西省西安市雁塔区',
    '西班牙',
    '北京市海淀区',
    '黑龙江省佳木斯市汤原县',
    '内蒙古自治区赤峰市',
    '贵州省黔南州贵定县',
    '新疆维吾尔自治区伊犁州奎屯市',
]

for data in data_list:
    # Decode the UTF-8 byte string to unicode before matching.
    data_utf8 = data.decode('utf8')
    print data_utf8
    country = data
    province = ''
    city = ''
    district = ''
    #pattern = re.compile(PATTERN3)
    pattern = re.compile(PATTERN)
    m = pattern.search(data_utf8)
    if not m:
        # No match: treat the whole string as the country name.
        print country + '|||'
        continue
    #print m.group()
    country = '中国'
    # lastindex is the index of the last group that took part in the match.
    if m.lastindex >= 1:
        province = m.group(1)
    if m.lastindex >= 2:
        city = m.group(2)
    if m.lastindex >= 3:
        district = m.group(3)
    out = '%s|%s|%s|%s' % (country, province, city, district)
    print out
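To see why the lazy quantifier in the first group matters, here is a small sketch that compares it with a greedy variant on one of the sample strings. It is written for Python 3 (an assumption; the script above is Python 2), and the helper names build and split_region are made up for illustration. The pattern has the same structure as PATTERN above; only the first quantifier differs.

```python
import re

HAN = r"[\u4e00-\u9fa5]"  # the same character range used in PATTERN


def build(first_quantifier):
    """Build the three-group pattern with the given quantifier in group 1."""
    return re.compile(
        "(" + HAN + first_quantifier + "(?:省|自治区|市))"
        + "(" + HAN + "{2,7}?(?:市|区|县|州)){0,1}"
        + "(" + HAN + "{2,7}?(?:市|区|县)){0,1}"
    )


LAZY = build("{2,5}?")   # as in the article
GREEDY = build("{2,5}")  # variant without the lazy marker


def split_region(pattern, text):
    """Return the (province, city, district) groups, or None if no match."""
    m = pattern.search(text)
    return m.groups() if m else None


lazy_result = split_region(LAZY, "陕西省西安市雁塔区")
greedy_result = split_region(GREEDY, "陕西省西安市雁塔区")

print(lazy_result)    # ('陕西省', '西安市', '雁塔区')
print(greedy_result)  # ('陕西省西安市', '雁塔区', None)
```

With the greedy quantifier, the first group consumes as many characters as it can and swallows the city name as well, because "市" is also an accepted ending for the first group. The lazy form stops at the earliest accepted ending, which is what keeps the province, city, and district in separate groups.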
Example two:
Get the location information for a specified IP address from ip138.
ip138 is an IP lookup site we use a lot in daily work. To obtain the ISP information for each IP, I need to query its lookup page.
I searched online for a long time and found no ip138 interface that returns JSON or a similar format, so querying the page is the only option. That means we have to parse out the ISP information, which is highlighted with a red box on that page. Using DOM parsing to locate a particular DIV tag is not very effective here. A more direct way is to use a regular expression containing Chinese characters and pull the "本站主数据:" ("main data of this site") part straight out of the returned HTML.
Here is the code I worked out:
#!/usr/bin/env python
# encoding: utf-8
# date: 2016-03-31
# note: during testing, requests to this URL sometimes timed out; retry a few times if needed
import re
import sys

import requests

# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf8')

IP138_API = 'http://www.ip138.com/ips138.asp?ip='
PATTERN = ur'<li>本站主数据:(.*?)</li>'


def query_api(url):
    # Return the raw response body, or an empty string on a non-200 status.
    data = ''
    r = requests.get(url)
    if r.status_code == 200:
        data = r.content
    return data


def parse_ip138(html):
    # The HTML must be unicode; converting it back to UTF-8 afterwards
    # makes the regular expression fail to match.
    html = unicode(html, 'gb2312')
    #html = unicode(html, 'gb2312').encode('utf-8')
    #print html
    pattern = re.compile(PATTERN)
    m = pattern.search(html)
    if m:
        print m.group(1)
    else:
        print 'regex match failed'


if __name__ == '__main__':
    url = IP138_API + '14.192.60.0'
    resp = query_api(url)
    if not resp:
        print 'no content'
    else:
        # Only parse when the request actually returned a body.
        parse_ip138(resp)
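The decode-then-match step can also be tried without any network access. The following is a minimal sketch for Python 3 (an assumption; the script above is Python 2). The HTML fragment is a made-up stand-in for the real response, not actual ip138 output, and this parse_ip138 returns the value instead of printing it.

```python
import re

# Same pattern as in the script above, as a Python 3 str pattern.
PATTERN = re.compile(r"<li>本站主数据:(.*?)</li>")


def parse_ip138(raw_html, encoding="gb2312"):
    """Decode the page bytes to str, then extract the text after the label."""
    html = raw_html.decode(encoding, errors="replace")
    m = PATTERN.search(html)
    return m.group(1) if m else None


# Hypothetical fragment standing in for the real response body.
sample = (
    "<ul><li>本站主数据:示例省示例市 电信</li>"
    "<li>其他:示例</li></ul>"
).encode("gb2312")

isp = parse_ip138(sample)
print(isp)
```

The lazy "(.*?)" stops at the first closing "</li>", so only the entry that follows the label is captured even when further list items come after it.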