Using a Regular Expression in Python to Match a Specified Chinese String

Source: Internet
Author: User

This article shows how to use a regular expression in Python to match a specified Chinese string. It is shared here for reference. The details are as follows:

Business scenario:

Match a specified Chinese substring within a sentence of text. I run into this situation often at work, so it is summarized here.

Difficulties:

Take care when text containing Chinese characters passes through encodings such as GBK and UTF-8 and then into regular expression matching. UTF-8 is the recommended encoding. If UTF-8 is not an option, handle the encoding case by case.
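To illustrate the encoding point, here is a minimal sketch. It assumes Python 3, whereas the scripts in this article target Python 2. It decodes the bytes to text before matching, so one pattern works whether the input arrived as GBK or as UTF-8. The sample string and the names `province_re` and `find_province` are hypothetical and only serve the illustration.

```python
import re

# Hypothetical sample: the same text encoded in two different ways.
text = "陕西省西安市雁塔区"
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# One or more characters from the CJK range used in this article,
# matched lazily and followed by 省 (province).
province_re = re.compile(r"[\u4e00-\u9fa5]+?省")


def find_province(raw, encoding):
    """Decode raw bytes with the given encoding, then search the decoded text."""
    decoded = raw.decode(encoding)
    m = province_re.search(decoded)
    return m.group() if m else None


gbk_result = find_province(gbk_bytes, "gbk")
utf8_result = find_province(utf8_bytes, "utf-8")
print(gbk_result, utf8_result)
```

The pattern never sees the raw bytes. Once the input is decoded with its real encoding, the matching logic no longer depends on how the text was stored.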

A general-purpose regular expression simplifies both the code and the processing flow, and gets twice the result with half the effort. This is often the most visible difference between an expert and a novice.

Example 1:

Can one regular expression basically meet the needs of the business scenario? It is worth checking, because if the text is not handled well we will not get the result we want. There is one subtle point here, which the comments in the code point out.
The code is as follows:

    #!/usr/bin/env python
    # encoding: utf-8
    # description: extract the names of provinces, cities, counties, and so on
    #              from a string, for parsing geographic data from the
    #              Chunzhen (纯真, "pure") library
    import re
    import sys

    reload(sys)
    sys.setdefaultencoding('utf8')

    # The pattern must carry the u prefix; the r prefix is optional.
    # The question mark in the first group makes the match lazy.
    PATTERN = \
        ur'([\u4e00-\u9fa5]{2,5}?(?:省|自治区|市))([\u4e00-\u9fa5]{2,7}?(?:市|区|县|州)){0,1}([\u4e00-\u9fa5]{2,7}?(?:市|区|县)){0,1}'

    data_list = ['北京市', '陕西省西安市雁塔区', '西班牙', '北京市海淀区',
                 '黑龙江省佳木斯市汤原县', '内蒙古自治区赤峰市',
                 '贵州省黔南州贵定县', '新疆维吾尔自治区伊犁州奎屯市']

    # Compile once, outside the loop.
    pattern = re.compile(PATTERN)

    for data in data_list:
        # Decode the UTF-8 byte string to unicode before matching.
        data_utf8 = data.decode('utf8')
        print data_utf8
        country = data
        province = ''
        city = ''
        district = ''
        m = pattern.search(data_utf8)
        if not m:
            # No Chinese administrative suffix found: print the raw text only.
            print country + '|||'
            continue
        country = '中国'
        if m.lastindex >= 1:
            province = m.group(1)
        if m.lastindex >= 2:
            city = m.group(2)
        if m.lastindex >= 3:
            district = m.group(3)
        out = '%s|%s|%s|%s' % (country, province, city, district)
        print out
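For readers on Python 3, the same idea can be sketched as follows. This is an illustrative port under stated assumptions, not the author's script: Python 3 strings are already Unicode, so the `u` prefix, `reload(sys)`, and `setdefaultencoding` are not needed. The helper `split_region`, the two single-group patterns `GREEDY_FIRST` and `LAZY_FIRST`, and the sample inputs are hypothetical names and data added for the illustration.

```python
import re

# The same three-group pattern, written as a Python 3 str pattern.
PATTERN = re.compile(
    r"([\u4e00-\u9fa5]{2,5}?(?:省|自治区|市))"  # province-level unit, lazy
    r"([\u4e00-\u9fa5]{2,7}?(?:市|区|县|州))?"  # optional city-level unit
    r"([\u4e00-\u9fa5]{2,7}?(?:市|区|县))?"     # optional district-level unit
)


def split_region(text):
    """Return (province, city, district), or None when nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Optional groups that did not take part come back as None.
    return tuple(g or "" for g in m.groups())


# First group only, greedy versus lazy, to show why the '?' matters.
GREEDY_FIRST = re.compile(r"[\u4e00-\u9fa5]{2,5}(?:省|自治区|市)")
LAZY_FIRST = re.compile(r"[\u4e00-\u9fa5]{2,5}?(?:省|自治区|市)")

for sample in ["陕西省西安市雁塔区", "北京市海淀区", "西班牙"]:
    print(sample, "->", split_region(sample))

# The greedy form swallows the city name along with the province.
print(GREEDY_FIRST.search("吉林省吉林市").group())
# The lazy form stops at the first administrative suffix.
print(LAZY_FIRST.search("吉林省吉林市").group())
```

The last two lines show the effect of the lazy quantifier. For an input such as 吉林省吉林市, where the province and the city share a name, the greedy form consumes both units in the first group, while the lazy form stops at 省.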

Example 2:

Obtain the geographic location of a specified IP address from ip138.

ip138 is a website that many people use to look up IP addresses. To obtain the ISP information for each IP address, I need to query its result page.

I searched the Internet for a long time and did not find any ip138 interface that returns JSON, so the page has to be queried directly and the ISP information parsed out of the returned HTML. Parsing the DOM to locate the specific div tag is probably not very effective. A simpler way is to use a Chinese regular expression and take the part labeled "main data of this site:" (本站主数据) directly from the returned HTML.

Below is the code I found

    #!/usr/bin/env python
    # encoding: utf-8
    # date: 2016-03-31
    # note: during testing, requests to this link sometimes timed out;
    #       if that happens, send the request a few more times
    import re
    import sys

    import requests

    reload(sys)
    sys.setdefaultencoding('utf8')

    IP138_API = 'http://www.ip138.com/ips138.asp?ip='
    PATTERN = ur'<li>本站主数据:(.*?)</li>'


    def query_api(url):
        # Return the raw response body, or '' if the status is not 200.
        data = ''
        r = requests.get(url)
        if r.status_code == 200:
            data = r.content
        return data


    def parse_ip138(html):
        # The HTML must stay a unicode object. Do not re-encode it to UTF-8
        # afterwards, or the regular expression will fail to match.
        html = unicode(html, 'gb2312')
        # html = unicode(html, 'gb2312').encode('utf-8')
        # print html
        pattern = re.compile(PATTERN)
        m = pattern.search(html)
        if m:
            print m.group(1)
        else:
            print 'regex match failed'


    if __name__ == '__main__':
        url = IP138_API + '14.192.60.0'
        resp = query_api(url)
        if not resp:
            print 'no content'
        else:
            parse_ip138(resp)
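The decode-then-match step can be tried without any network access. The sketch below assumes Python 3 and runs against a made-up HTML fragment: the `sample` bytes and their contents are hypothetical, and the real page layout may differ or may have changed. Here `parse_ip138` returns the captured text instead of printing it.

```python
import re

# str pattern containing the Chinese label that precedes the wanted field.
PATTERN = re.compile(r"<li>本站主数据:(.*?)</li>")


def parse_ip138(raw_html, encoding="gb2312"):
    """Decode the raw response bytes, then capture the field after the label."""
    html = raw_html.decode(encoding, errors="replace")
    m = PATTERN.search(html)
    return m.group(1) if m else None


# Hypothetical response fragment, encoded the way the article says the page is.
sample = (
    "<ul><li>本站主数据:北京市 联通</li>"
    "<li>参考数据一:北京市 联通</li></ul>"
).encode("gb2312")

result = parse_ip138(sample)
print(result)
```

The lazy `(.*?)` stops at the first closing `</li>`, so only the first list item is captured. In Python 3 the decoding step is mandatory rather than a matter of care, because applying a str pattern to a bytes object makes `re` raise a `TypeError`.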

PS: here are two convenient regular expression tools for reference:

JavaScript regular expression online testing tool:
http://tools.jb51.net/regex/javascript

Regular expression generation tool:
http://tools.jb51.net/regex/create_reg
