How to detect rare characters in Python

Source: Internet
Author: User
Tags: character set, python

Approach

My first thought was to use Python regular expressions to match illegal characters and so find the illegal records. The idea was appealing, but reality proved harsh. During implementation I discovered how little I knew about character encodings and about Python's internal string representation. I fell into a lot of pits along the way, and although some points are still hazy, I ended up with a reasonably clear overall understanding. I am recording the experience here to avoid falling into the same pits again.

The test environment below is Python 2.7.8 as bundled with ArcGIS 10.3; the code is not guaranteed to work in other Python environments.

Python Regular Expressions

Regular expression support in Python is provided by the built-in re module, which has three main functions. re.compile() builds a reusable pattern object, while match() and search() return a match result. The difference between them is that match() only matches at the specified starting position, whereas search() scans forward from that position until it finds a matching string. In the code below, match_result tries to match at the first character, f; the match fails and None is returned. search_result scans forward from f until it reaches the first matching character, a, and group() then outputs that character.

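import re

# Compile once so the pattern object can be reused.
pattern = re.compile('[abc]')

# match() only tries at the start of the string: 'f' is not in [abc], so this is None.
match_result = pattern.match('fabc')
if match_result:
    print(match_result.group())

# search() scans forward and stops at the first match, which is 'a'.
search_result = pattern.search('fabc')
if search_result:
    print(search_result.group())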

The approach above compiles a pattern before matching with it. We can also call re.match(pattern, string) directly to get the same result. Direct matching is less flexible than compiling first, however: there is no pattern object to hold on to and reuse, so when the same pattern is applied to a large amount of data, every call has to resolve the pattern internally (the re module keeps a cache of recently used patterns, but the lookup still adds overhead), and performance suffers. For repeated matching, prefer pattern.match() over re.match().
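As a minimal sketch of the reuse point, the snippet below runs the same pattern over a small list of records, once through a compiled pattern object and once through the module-level re.search(). The records list is made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical sample records, invented for this illustration.
records = ['fabc', 'xyz', 'cab', 'hello']

# Compile once, outside the loop, and reuse the pattern object.
pattern = re.compile('[abc]')
compiled_hits = [r for r in records if pattern.search(r)]

# Equivalent module-level call: the pattern string is passed in on every call.
direct_hits = [r for r in records if re.search('[abc]', r)]

print(compiled_hits)
print(compiled_hits == direct_hits)
```

Both forms select the same records; the compiled form simply avoids handing the pattern string back to the re module on each iteration.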

Coding problems

Once you know the basics of Python regular expressions, what remains is to find a suitable expression that matches rare and illegal characters. Illegal characters are simple and can be matched with the following pattern:

pattern = re.compile(r'[~!@#$%^&*]')

Matching rare characters, however, really stumped me. First, the definition: which characters count as rare? After consulting the project manager, we settled on the rule that any character outside GB2312 is a rare character. The next question is how to match GB2312 characters.

A search showed that the GB2312 range is [\xA1-\xF7][\xA1-\xFE], within which the Chinese character area is [\xB0-\xF7][\xA1-\xFE]. With rare-character matching added, the expression becomes:

pattern = re.compile(r'[~!@#$%^&*]|[^\xA1-\xF7][^\xA1-\xFE]')
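The byte ranges quoted above can be checked by encoding a character to GB2312 and inspecting its two bytes. The following is only a sketch: the helper name in_gb2312_hanzi_area is hypothetical, and the input is assumed to be a single Unicode character.

```python
def in_gb2312_hanzi_area(char):
    """Return True if char encodes to two GB2312 bytes inside the Chinese character area."""
    try:
        # bytearray yields integers on both Python 2.7 and Python 3.
        encoded = bytearray(char.encode('gb2312'))
    except UnicodeEncodeError:
        # The character is not in GB2312 at all.
        return False
    if len(encoded) != 2:
        # Single-byte result, for example an ASCII letter.
        return False
    lead, trail = encoded
    return 0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE

print(in_gb2312_hanzi_area(u'\u4e2d'))  # a common Chinese character
print(in_gb2312_hanzi_area(u'A'))       # ASCII letter, one byte
print(in_gb2312_hanzi_area(u'\u3002'))  # ideographic full stop, symbol area
```

The lead byte is tested against \xB0-\xF7 and the trail byte against \xA1-\xFE, which mirrors the Chinese character area given in the text.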

The problem seemed to be solved, but I was still too simple, too naive. Because the strings to be checked are read from a layer file, arcpy helpfully decodes them into Unicode. I would therefore need to know where the GB2312 character set falls within Unicode. In reality, the GB2312 characters are not laid out contiguously in Unicode, so a regular expression describing that range would have to be very complex. Using regular expressions to match rare characters looked like a dead end.
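The non-contiguous layout is easy to observe. As a small sketch, the snippet below decodes three adjacent GB2312 byte pairs, taken from the start of the Chinese character area, and prints the Unicode code point of each.

```python
# Three neighbouring GB2312 codes at the start of the Chinese character area.
adjacent = [b'\xb0\xa1', b'\xb0\xa2', b'\xb0\xa3']

# Decode each pair to a Unicode character and take its code point.
code_points = [ord(pair.decode('gb2312')) for pair in adjacent]

for pair, cp in zip(adjacent, code_points):
    print('GB2312 %s -> U+%04X' % (repr(pair), cp))

# Adjacent in GB2312 does not mean adjacent in Unicode.
print(code_points[1] - code_points[0])
```

Neighbours in GB2312 land far apart in Unicode, so no single character range in a Unicode pattern can cover the set.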

Solution

Since the supplied string is Unicode, can I convert it to GB2312 and then match? In fact, because the Unicode character set is much larger than GB2312, the conversion GB2312 => Unicode always succeeds, while the reverse, Unicode => GB2312, does not necessarily succeed.
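The asymmetry between the two directions can be seen with a pair of characters. This is an illustrative sketch; the two sample characters, the simplified and traditional forms of "dragon", are my own choice and do not come from the original data.

```python
# GB2312 => Unicode: a valid GB2312 byte pair decodes without trouble.
decoded = b'\xd6\xd0'.decode('gb2312')
print(decoded == u'\u4e2d')

# Unicode => GB2312: succeeds only for characters that GB2312 contains.
simplified = u'\u9f99'   # simplified "dragon", part of GB2312
traditional = u'\u9f8d'  # traditional "dragon", not part of GB2312

results = {}
for label, char in [('simplified', simplified), ('traditional', traditional)]:
    try:
        char.encode('gb2312')
        results[label] = True
    except UnicodeEncodeError:
        results[label] = False

print(results['simplified'], results['traditional'])
```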

This suddenly suggested another approach: if the Unicode => GB2312 conversion of a string fails, doesn't that mean exactly that the string contains characters outside the GB2312 character set? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify rare characters.

The final code is as follows:

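import re

def is_rare_name(string):
    # Reject names that contain any of the illegal punctuation characters.
    pattern = re.compile(u"[~!@#$%^&*]")
    match = pattern.search(string)
    if match:
        return True

    # A name that cannot be encoded as GB2312 contains a rare character.
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True

    return False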
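When a record is flagged, it can help to know which characters caused it. The sketch below applies the same encode-and-catch idea one character at a time; find_rare_chars and the sample names are hypothetical additions, not part of the original code.

```python
def find_rare_chars(text, encoding='gb2312'):
    """Return the characters of text that cannot be encoded with the given codec."""
    rare = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            rare.append(ch)
    return rare

# Hypothetical sample names: the first uses only GB2312 characters,
# the second starts with a traditional character that GB2312 lacks.
names = [u'\u5f20\u4f1f', u'\u9f8d\u5c71']
for name in names:
    print(repr(find_rare_chars(name)))
```

Checking character by character is slower than encoding the whole string at once, so it is probably best run only on records that is_rare_name has already flagged.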

Summary

That is all for this article. I hope it is of some help in your study or work; if you have questions, feel free to leave a message.
