Detecting Uncommon Characters in Python
Solution
The first idea that comes to mind is to use a Python regular expression to match illegal characters and thereby identify invalid records. However, the ideal is full while reality is cruel: during implementation I discovered gaps in my knowledge of character encodings and string representation in Python, and I stepped on quite a few pitfalls along the way. Although some points remain fuzzy, I now have a generally clear picture. I am recording the experience here so I do not fall into the same traps in the future.
The test environment below is the Python 2.7.8 interpreter that ships with ArcGIS 10.3; other Python environments are not guaranteed to behave the same way.
Python Regular Expression
Regular expression support in Python is provided by the built-in re module, and three functions cover most uses. re.compile() builds a reusable pattern object, whose match() and search() methods return a match result. The difference between the two: match() matches only at the specified position, while search() scans forward from that position until a matching substring is found. In the following code, match_result starts matching at the first character 'f' and fails, so None is returned; search_result scans forward from 'f' until it finds the first matching character 'a', so calling group() on the result yields 'a'.
import re

pattern = re.compile('[abc]')

match_result = pattern.match('fabc')
if match_result:
    print(match_result.group())

search_result = pattern.search('fabc')
if search_result:
    print(search_result.group())
The approach above compiles a pattern before matching. The module-level re.match(pattern, string) achieves the same result directly, but direct matching is less flexible than compiling first. First, the regular expression cannot be reused: if a large amount of data is matched against the same pattern, the pattern must be processed internally on every call, costing performance. Second, re.match() is less powerful than pattern.match(): the latter can specify the position at which matching starts.
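A minimal sketch of that difference (shown with Python 3 print syntax, which also runs under the article's Python 2.7.8):

```python
import re

pattern = re.compile('[abc]')

# Module-level shortcut: same result, but the pattern is
# looked up and processed on every call.
print(re.match('[abc]', 'fabc'))   # None: 'f' is not in [abc]

# A compiled pattern's match() accepts a starting position,
# which the module-level re.match() cannot take.
m = pattern.match('fabc', 1)       # start matching at index 1 ('a')
print(m.group())                   # 'a'
```

This is why compiling once and reusing the pattern object is the more flexible choice when matching many records.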
Encoding Problems
With the basics of Python regular expressions covered, the next step is to find a suitable expression to match uncommon and invalid characters. Invalid characters are the easy part; they can be matched with the following pattern:
pattern = re.compile(r'[~!@#$%^&* ]')
Matching uncommon characters, however, proved harder. What exactly counts as an uncommon character? After consulting the project manager, we settled on a definition: any character outside the GB2312 character set is uncommon. That raised the next question: how do you match GB2312 characters?
Some research showed that GB2312 byte pairs fall in the range [\xA1-\xF7][\xA1-\xFE], and the Chinese-character (hanzi) area occupies [\xB0-\xF7][\xA1-\xFE]. The expression extended to match uncommon characters therefore becomes:
pattern = re.compile(r'[~!@#$%^&* ]|[^\xA1-\xF7][^\xA1-\xFE]')
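To make the byte ranges concrete, here is a small sketch in Python 3 syntax (where patterns over raw bytes need the b prefix; the positive hanzi-area range is used for clarity, and '中' is just a sample character):

```python
import re

# GB2312 hanzi area: first byte 0xB0-0xF7, second byte 0xA1-0xFE.
hanzi = re.compile(rb'[\xB0-\xF7][\xA1-\xFE]')

raw = u'中'.encode('gb2312')     # b'\xd6\xd0': both bytes in range
print(hanzi.match(raw).group())  # matches the two-byte sequence

print(hanzi.match(b'ab'))        # None: ASCII bytes fall outside the range
```

Note that this only works on the raw GB2312-encoded bytes, which is exactly where the approach runs into trouble next.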
The problem seemed solved, but I was still too simple, too naive. Since the strings to be checked are read from a layer file, arcpy thoughtfully decodes everything it reads into unicode. I would therefore need the unicode range of the GB2312 character set; the reality, however, is that GB2312 characters are not contiguous in unicode, so a regular expression covering that range would be extremely complex. The idea of matching uncommon characters with a regular expression had hit a dead end.
Solution
Since the provided string is in unicode, could it be converted to GB2312 for matching? The unicode character set is much larger than GB2312: the conversion GB2312 => unicode always succeeds, whereas the reverse, unicode => GB2312, may fail.
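This asymmetry is easy to demonstrate (Python 3 syntax; '镕' is assumed here as a sample character that exists in unicode and GBK but has no GB2312 code point):

```python
# GB2312 -> unicode always works: every valid byte pair decodes.
print(b'\xd6\xd0'.decode('gb2312'))   # '中'

# unicode -> GB2312 may fail: '镕' is not in the GB2312 table.
try:
    u'镕'.encode('gb2312')
except UnicodeEncodeError:
    print('not representable in GB2312')
```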
That failure suggested another idea: if a string fails the unicode => GB2312 conversion, doesn't that mean it contains characters outside the GB2312 character set? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify uncommon characters.
The final code is as follows:
import re

def is_rare_name(string):
    pattern = re.compile(u"[~!@#$%^&* ]")
    match = pattern.search(string)
    if match:
        return True
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True
    return False
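A quick usage sketch (the function is repeated so the snippet is self-contained, in Python 3 where str is already unicode; the sample names are made up):

```python
import re

def is_rare_name(string):
    pattern = re.compile(u"[~!@#$%^&* ]")
    if pattern.search(string):
        return True   # contains an invalid character
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True   # contains a character outside GB2312
    return False

print(is_rare_name(u'张伟'))      # False: ordinary GB2312 name
print(is_rare_name(u'朱镕基'))    # True: '镕' is not in GB2312
print(is_rare_name(u'bad name'))  # True: contains a space
```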
Summary
That is all for this article. I hope it helps you in your study or work. If you have any questions, please leave a comment.