[Reprint] Python: examples of matching specified Chinese strings with regular expressions that contain Chinese characters

Source: Internet
Author: User

This article shows, by example, how to use regular expressions containing Chinese characters in Python to match specified Chinese strings. It is shared here for reference.

Business scenario:

Extracting a specified Chinese substring from a Chinese sentence. I run into this often at work, so I have summarized my approach here.

Difficulties:

Character encodings such as GBK and UTF-8 have to be handled, and the regular expression pattern itself contains Chinese characters, so care is needed for the Chinese characters to match correctly. Ideally everything is unified as UTF-8; when that is not possible, each encoding has to be handled case by case.

A single, general regular expression often simplifies the program and its code, achieving more with less effort. This is often the most visible difference between an expert and a beginner.
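The point about encodings can be illustrated with a minimal sketch. It assumes Python 3, where str is always Unicode (the article's own code targets Python 2); the function name find_province and the sample sentence are made up for illustration. The idea is to decode the raw bytes once, with the encoding they really use, and only then run the Chinese pattern on the decoded text.

```python
import re

# The pattern is a str, so it must be matched against decoded text, not bytes.
PROVINCE = re.compile(r"[\u4e00-\u9fa5]{2,5}?省")

def find_province(raw, encoding):
    """Decode raw bytes with the stated encoding, then search the text."""
    text = raw.decode(encoding)
    m = PROVINCE.search(text)
    return m.group(0) if m else None

# Hypothetical input: the same sentence arriving in two different encodings.
sentence = "陕西省西安市雁塔区"
print(find_province(sentence.encode("gbk"), "gbk"))
print(find_province(sentence.encode("utf-8"), "utf-8"))
```

Both calls give the same result because the matching happens on Unicode text; the byte encoding only matters at the decoding step.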

Example one:

This example parses province, city and county names out of records from the QQ Chunzhen (纯真) database. The regular expression here basically covers the business scenario. The lazy quantifier "?" is essential: if it is left out, the match does not give the result we want. The finer points are left for the reader to work out; this is only meant as a starting point.
The code is as follows:

    #!/usr/bin/env python
    #encoding: utf-8
    #description: extract province, city and county names from a string;
    #             used to parse geographic data from the Chunzhen database
    # Note: this is Python 2 code (reload(sys), ur'' literals, print statements).
    import re
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    # The pattern must have the u prefix; the r prefix is optional.
    # The question mark in the first group makes the match lazy; this is required.
    PATTERN = \
    ur'([\u4e00-\u9fa5]{2,5}?(?:省|自治区|市))([\u4e00-\u9fa5]{2,7}?(?:市|区|县|州)){0,1}([\u4e00-\u9fa5]{2,7}?(?:市|区|县)){0,1}'

    data_list = ['北京市', '陕西省西安市雁塔区', '西班牙', '北京市海淀区',
                 '黑龙江省佳木斯市汤原县', '内蒙古自治区赤峰市',
                 '贵州省黔南州贵定县', '新疆维吾尔自治区伊犁州奎屯市']

    # Compile once, outside the loop.
    pattern = re.compile(PATTERN)

    for data in data_list:
        # Despite its name, data_utf8 is a unicode object decoded from UTF-8 bytes.
        data_utf8 = data.decode('utf8')
        print data_utf8
        country = data
        province = ''
        city = ''
        district = ''
        m = pattern.search(data_utf8)
        if not m:
            # No province-level match: print the raw input as the country.
            print country + '|||'
            continue
        #print m.group()
        country = '中国'
        # lastindex is the index of the last group that took part in the match.
        if m.lastindex >= 1:
            province = m.group(1)
        if m.lastindex >= 2:
            city = m.group(2)
        if m.lastindex >= 3:
            district = m.group(3)
        out = '%s|%s|%s|%s' % (country, province, city, district)
        print out
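Why the lazy quantifier in the first group matters can be checked in isolation. The following is a small sketch in Python 3 (an assumption; the article's code is Python 2) that compares a lazy and a greedy version of the first group only. The names HAN, SUFFIX, lazy, greedy and first_unit are introduced here for illustration.

```python
import re

HAN = r"[\u4e00-\u9fa5]"           # same character range as in the article
SUFFIX = r"(?:省|自治区|市)"        # province-level suffixes from the first group

lazy = re.compile("(" + HAN + "{2,5}?" + SUFFIX + ")")
greedy = re.compile("(" + HAN + "{2,5}" + SUFFIX + ")")

def first_unit(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

address = "陕西省西安市雁塔区"
print(first_unit(lazy, address))    # 陕西省
print(first_unit(greedy, address))  # 陕西省西安市
```

The greedy form first consumes five characters and then finds 市 as a valid suffix, so the province group swallows the city as well. The lazy form stops at the earliest suffix, which is what splitting the address into separate levels needs.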


Example two:

Getting the location information of a specified IP from ip138.

ip138 is an IP lookup site we use frequently. I wanted to obtain the ISP information for each IP, which requires querying its lookup page.

I searched online for a long time but found no ip138 interface that returns JSON or a similar format, so querying the page is the only option, and the ISP information highlighted on that page has to be parsed out of the HTML. The usual approach of locating a particular DIV tag with a DOM parser does not work well here. A more direct way is to use a regular expression containing Chinese characters and extract the "本站主数据:" (the site's main data) part straight from the returned HTML.

Here is the code I worked out:

    #!/usr/bin/env python
    #encoding: utf-8
    #date: 2016-03-31
    #note: a problem seen during testing: requests to this URL sometimes
    #      time out, so it may be necessary to retry a few times
    # Note: this is Python 2 code (reload(sys), ur'' literals, unicode(), print statements).
    import re
    import sys

    import requests

    reload(sys)
    sys.setdefaultencoding('utf8')

    IP138_API = 'http://www.ip138.com/ips138.asp?ip='
    PATTERN = ur'<li>本站主数据:(.*?)</li>'

    def query_api(url):
        """Return the raw response body, or '' if the status is not 200."""
        data = ''
        r = requests.get(url)
        if r.status_code == 200:
            data = r.content
        return data

    def parse_ip138(html):
        # The text must be a unicode object. Do not encode it to UTF-8
        # afterwards, otherwise the regular expression will not match.
        html = unicode(html, 'gb2312')
        #html = unicode(html, 'gb2312').encode('utf-8')
        #print html
        pattern = re.compile(PATTERN)
        m = pattern.search(html)
        if m:
            print m.group(1)
        else:
            print 'regex match failed'

    if __name__ == '__main__':
        url = IP138_API + '14.192.60.0'
        resp = query_api(url)
        if not resp:
            print 'no content'
        else:
            # Only parse when a response body was actually received.
            parse_ip138(resp)
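The decode-then-match step can be tried offline. This is a sketch in Python 3 (an assumption; the article's code is Python 2), run against a made-up HTML fragment rather than a real ip138 response. The helper name extract_main_data and the sample content are hypothetical; the pattern is the one used above.

```python
import re

# Lazily capture whatever sits between the label and the first closing </li>.
PATTERN = re.compile(r"<li>本站主数据:(.*?)</li>")

def extract_main_data(raw_html, encoding="gb2312"):
    """Decode the response bytes, then search the resulting str."""
    html = raw_html.decode(encoding, errors="replace")
    m = PATTERN.search(html)
    return m.group(1) if m else None

# Made-up sample page, encoded the way the code above assumes the site responds.
sample = ("<ul><li>本站主数据:示例省示例市 电信</li>"
          "<li>参考数据一:示例</li></ul>").encode("gb2312")

print(extract_main_data(sample))

# Searching the undecoded bytes with a str pattern is rejected in Python 3.
try:
    PATTERN.search(sample)
except TypeError as exc:
    print("bytes must be decoded first:", exc)
```

In Python 3 the mistake the author warns about (matching against encoded bytes) surfaces as a TypeError instead of a silent failure to match.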

