How to Detect Rare Characters in Python

Last Update:2017-01-18 Source: Internet

Author: User

Tags character set in python

Approach

My first thought was to use Python's regular expressions to match illegal characters and thereby find the offending records. The plan sounded simple, but reality was harsher. During implementation I discovered how little I knew about character encodings and about how Python represents strings internally. I fell into quite a few pitfalls along the way; some details are still hazy, but I finally have a reasonably clear overall understanding. I am recording the experience here to avoid tripping over the same problems later.

The test environment below is the Python 2.7.8 installation that ships with ArcGIS 10.3; the code is not guaranteed to work in other Python environments.

Python Regular Expressions

Regular expression support in Python is provided by the built-in re module, and three functions do most of the work. re.compile() returns a reusable pattern object, while match() and search() return a match result. The difference between the two is that match() only matches starting at the specified position, whereas search() scans forward from the specified position until it finds a match. In the code below, match_result tries to match at the first character, f, fails, and is therefore None. search_result scans forward from f until it finds the first matching character, a, and group() then outputs that matched character.

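import re

pattern = re.compile('[abc]')

# match() only tries at the start of the string; 'f' is not in [abc], so this is None
match_result = pattern.match('fabc')
if match_result:
    print(match_result.group())

# search() scans forward and stops at the first 'a'
search_result = pattern.search('fabc')
if search_result:
    print(search_result.group())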

The code above compiles a pattern before matching with it. You can also call re.match(pattern, string) directly to get the same result. Direct matching is less flexible than compiling first, though, because you never hold a reusable pattern object. When the same pattern is applied to a large amount of data, every re.match() call has to resolve the pattern string again internally (the re module caches recently compiled patterns, but the lookup still happens on each call), which costs some performance compared with calling pattern.match() on a precompiled pattern.
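
A minimal sketch of the compile-once, reuse-many pattern described above. The records list is made-up sample data, not something from the original project.

```python
import re

# Compile once, then reuse the same pattern object for every record.
pattern = re.compile('[abc]')

records = ['fabc', 'xyz', 'cat']  # hypothetical sample data

# Keep only the records that contain at least one of a, b or c.
hits = [text for text in records if pattern.search(text)]
print(hits)
```

Holding the compiled object in a variable also makes it easy to pass the same pattern to several functions.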

Encoding problems

With the basics of Python regular expressions in hand, what remains is to find a suitable regular expression that matches rare and illegal characters. Illegal characters are simple and can be matched with the following pattern:

pattern = re.compile(r'[~!@#$%^&*]')
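
As an illustration of how this pattern might be used, the sketch below lists the illegal characters found in a string. The helper name find_illegal and the sample names are assumptions made for the example.

```python
import re

illegal = re.compile(r'[~!@#$%^&*]')

def find_illegal(text):
    """Return every illegal character found in text, in order of appearance."""
    return illegal.findall(text)

print(find_illegal('Zhang*San'))  # contains one illegal character
print(find_illegal('ZhangSan'))   # contains none
```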

Matching rare characters, however, really stumped me. First, the definition: which characters count as rare? After consulting the project manager, we settled on the rule that any character outside the GB2312 character set is a rare character. The next question: how do you match GB2312 characters?

A quick search showed that GB2312 characters are encoded as two bytes in the range [\xA1-\xF7][\xA1-\xFE], of which the Chinese character area is [\xB0-\xF7][\xA1-\xFE]. With rare-character matching added, the expression becomes:

pattern = re.compile(r'[~!@#$%^&*]|[^\xA1-\xF7][^\xA1-\xFE]')
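
As a rough sketch of what those byte ranges mean in practice, the block below applies them to GB2312-encoded byte strings rather than to Unicode text. The helper name is_gb2312_bytes and the sample strings are illustrative assumptions. The pattern only checks that every byte pair falls inside the two-byte range quoted above, not that each pair is an assigned GB2312 character.

```python
import re

# One or more two-byte sequences in the GB2312 range, up to the end of the data.
# Note: plain ASCII bytes are deliberately not accepted by this sketch.
GB2312_PAIRS = re.compile(br'(?:[\xA1-\xF7][\xA1-\xFE])+\Z')

def is_gb2312_bytes(data):
    """Return True if data consists only of byte pairs in the GB2312 range."""
    return GB2312_PAIRS.match(data) is not None

gb_bytes = u'\u554a\u963f'.encode('gb2312')  # two common Chinese characters
utf8_bytes = u'\u554a'.encode('utf-8')       # same first character, other encoding

print(is_gb2312_bytes(gb_bytes))
print(is_gb2312_bytes(utf8_bytes))
```

The range check is only meaningful on bytes that are already GB2312-encoded; the same text in another encoding produces different bytes.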

The problem seemed solved, but I was still too naive. The strings to be checked are read from a layer file, and arcpy helpfully decodes them into Unicode. So I needed the range that the GB2312 character set occupies in Unicode. In reality, GB2312 characters are not laid out contiguously in Unicode, so a regular expression covering them would be very complex. Using regular expressions to match rare characters looked like a dead end.
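
To make the non-contiguity concrete, here is a small sketch that decodes two adjacent GB2312 codes and prints their Unicode code points. The two byte pairs are sample values chosen for illustration.

```python
# Two neighbouring codes at the start of the GB2312 Chinese character area.
adjacent_codes = [b'\xb0\xa1', b'\xb0\xa2']

# Decode each pair and take the Unicode code point of the resulting character.
code_points = [ord(raw.decode('gb2312')) for raw in adjacent_codes]
for raw, cp in zip(adjacent_codes, code_points):
    print(repr(raw), hex(cp))

# Neighbours in GB2312 are far apart in Unicode.
print(code_points[1] - code_points[0])
```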

Solution

Since the supplied string is Unicode, could I convert it to GB2312 and then match? Because the Unicode character set is far larger than GB2312, the conversion GB2312 => Unicode always succeeds, but the reverse, Unicode => GB2312, does not necessarily succeed.

That suggested another line of thinking: if the Unicode => GB2312 conversion of a string fails, doesn't that simply mean the string contains a character outside the GB2312 character set? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify rare characters.
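
The asymmetry can be sketched as follows. The sample characters are assumptions picked for the example: U+554A is in GB2312, while U+9F8D is a traditional form that GB2312 does not include.

```python
# GB2312 => Unicode => GB2312 round-trips without loss.
raw = b'\xb0\xa1'
assert raw.decode('gb2312').encode('gb2312') == raw

# Unicode => GB2312 fails as soon as one character is outside the set.
text = u'\u554a\u9f8d'
failed = None
try:
    text.encode('gb2312')
except UnicodeEncodeError as exc:
    # exc.object, exc.start and exc.end locate the part that could not be encoded.
    failed = exc.object[exc.start:exc.end]

print(repr(failed))
```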

The final code is as follows:

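import re

def is_rare_name(string):
    # Any illegal punctuation character flags the name immediately
    pattern = re.compile(u"[~!@#$%^&*]")
    match = pattern.search(string)
    if match:
        return True

    # A character outside GB2312 cannot be encoded and raises an error
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True

    return False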
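
If you also want to know which characters are rare, a per-character variant of the same idea is one option. This is a sketch, not part of the original solution; the name find_rare_chars and the sample strings are assumptions.

```python
def find_rare_chars(text):
    """Return the characters of a Unicode string that GB2312 cannot encode."""
    rare = []
    for ch in text:
        try:
            ch.encode('gb2312')
        except UnicodeEncodeError:
            # Encoding failed, so this character is outside GB2312.
            rare.append(ch)
    return rare

print(find_rare_chars(u'\u554a\u9f8d\u963f'))  # middle character is outside GB2312
print(find_rare_chars(u'abc'))                 # ASCII encodes fine
```

Encoding one character at a time is slower than encoding the whole string, so it is probably best reserved for records that the whole-string check has already flagged.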

Summary

That is all for this article. I hope it is of some help in your study or work; if you have questions, feel free to leave a message.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
