How to detect uncommon characters with Python

Source: Internet
Author: User
At work I recently got a requirement to detect whether a field contains uncommon characters or illegal characters such as ~!@#$%^&*. I solved it with some online research, and I am sharing the process and sample code here for anyone who needs it.

Solution Ideas

My first thought was to use Python regular expressions to match the illegal characters and flag the offending records. The idea was attractive, but reality was harsher: during implementation I discovered how little I knew about character encodings and Python's internal string representation. I fell into quite a few pitfalls along the way. Some points are still hazy, but I now have a reasonably clear overall picture, and I am recording it here so I don't trip over the same things again.

The test environment below is the Python 2.7.8 that ships with ArcGIS 10.3; I can't guarantee the code behaves the same in other Python environments.

Python Regular Expressions

Regular expression support in Python comes from the built-in re module, and three of its functions are used here. re.compile() returns a reusable compiled pattern, while match() and search() each return a match result. The difference between the two is that match() only matches at the specified starting position, whereas search() scans forward from that position until it finds a match. In the code below, match() starts at the first character, f, fails to match, and returns None. search() scans forward from f until it reaches the first matching character, a, and group() then returns that character.

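    import re

    pattern = re.compile('[abc]')

    # match() only tries at the start of the string; 'f' is not in [abc], so the result is None
    match_result = pattern.match('fabc')
    if match_result:
        print(match_result.group())

    # search() scans forward and stops at the first hit, 'a'
    search_result = pattern.search('fabc')
    if search_result:
        print(search_result.group())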

The code above compiles the pattern before matching. You can also call re.match(pattern, string) directly and get the same result, but this direct form is less flexible than compiling first. For one thing, you do not hold a reusable pattern object: when the same pattern is applied to a large amount of data, every call has to go through the module's internal pattern lookup, and possibly recompilation, which costs performance. For another, re.match() is less capable than pattern.match(), which accepts an optional position at which to start matching.
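The two differences can be seen in a minimal sketch. The helper count_hits and the sample strings are made up for illustration; the optional second argument to pattern.match() is the start position mentioned above.

```python
import re

pattern = re.compile('[abc]')

# Module-level re.match() always starts at index 0, so 'f' makes it fail.
module_result = re.match('[abc]', 'fabc')

# A compiled pattern's match() accepts a start position (pos).
pos_result = pattern.match('fabc', 1)


def count_hits(values):
    """Hypothetical helper: reuse the one compiled pattern across many strings."""
    return sum(1 for value in values if pattern.search(value))
```

Here module_result is None, while pos_result matches the a at index 1.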

Encoding Issues

Once the basics of Python regular expressions are clear, what remains is finding a suitable pattern for uncommon and illegal characters. Illegal characters are easy and can be matched with the following pattern:

    pattern = re.compile(r'[~!@#$%^&*]')

Matching uncommon characters, however, really stumped me. First, what counts as an uncommon character? After checking with the project manager, the answer was: any character outside GB2312. The next question was how to match GB2312 characters.

After some searching, I found that the GB2312 range is [\xA1-\xF7][\xA1-\xFE], within which the Chinese character area is [\xB0-\xF7][\xA1-\xFE]. With uncommon-character matching added, the expression becomes:

    pattern = re.compile(r'[~!@#$%^&*]|[^\xA1-\xF7][^\xA1-\xFE]')
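Note that these ranges describe the two bytes of a GB2312-encoded character, not Unicode code points. The following is a small sketch of what they cover, assuming the input is a GB2312-encoded byte string; the names gb2312_char and gb2312_hanzi are made up for illustration.

```python
import re

# Byte-level pattern for one two-byte GB2312 character (ranges from the text).
gb2312_char = re.compile(br'[\xA1-\xF7][\xA1-\xFE]')
# Narrower pattern that covers only the Chinese character area.
gb2312_hanzi = re.compile(br'[\xB0-\xF7][\xA1-\xFE]')

# U+554A is the first character of the Chinese character area.
hanzi_bytes = u'\u554a'.encode('gb2312')
# U+3002 (ideographic full stop) sits in the symbol area, before lead byte 0xB0.
symbol_bytes = u'\u3002'.encode('gb2312')

hanzi_in_full_range = gb2312_char.match(hanzi_bytes) is not None
hanzi_in_hanzi_area = gb2312_hanzi.match(hanzi_bytes) is not None
symbol_in_full_range = gb2312_char.match(symbol_bytes) is not None
symbol_in_hanzi_area = gb2312_hanzi.match(symbol_bytes) is not None
```

The Chinese character matches both patterns, while the punctuation mark matches only the full GB2312 range.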

At this point the problem seemed solved, but I was still too naive. The strings to check are read from a layer file, and arcpy helpfully decodes everything it reads into Unicode. So I would need the range of the GB2312 character set within Unicode. In reality the GB2312 characters are not contiguous in Unicode, so a regular expression covering them would need a very complicated set of ranges. Matching uncommon characters with a regular expression looked like a dead end.
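The scattering is easy to check. The sketch below assumes Python's built-in gb2312 codec: it decodes every byte pair in the Chinese character area and counts how many separate contiguous runs of Unicode code points result. The function names are made up for illustration.

```python
def gb2312_hanzi_code_points():
    """Decode every byte pair in the GB2312 hanzi area to its Unicode code point."""
    points = []
    for lead in range(0xB0, 0xF8):
        for trail in range(0xA1, 0xFF):
            pair = bytes(bytearray([lead, trail]))
            try:
                ch = pair.decode('gb2312')
            except UnicodeDecodeError:
                continue  # unassigned slot in the table
            points.append(ord(ch))
    return sorted(points)


def count_runs(sorted_points):
    """Count maximal runs of consecutive code points in a sorted list."""
    runs = 0
    previous = None
    for point in sorted_points:
        if previous is None or point != previous + 1:
            runs += 1
        previous = point
    return runs


points = gb2312_hanzi_code_points()
runs = count_runs(points)
```

Each run would need its own range in a character class, so the value of runs gives a rough idea of how unwieldy a Unicode-side regular expression would be.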

Solution

Since the supplied string is Unicode, can I convert it to GB2312 and then match? Not reliably: the Unicode character set is much larger than GB2312, so GB2312 => Unicode always works, but Unicode => GB2312 does not necessarily succeed.

That suggested another approach: if converting a string from Unicode to GB2312 fails, doesn't that mean it contains a character outside GB2312? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify uncommon characters.
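A short sketch of that asymmetry, assuming Python's built-in gb2312 codec. The sample characters are my own choice: U+554A is in GB2312, while the traditional-form character U+9F8D is used here as an example of one that is not.

```python
gb_bytes = b'\xb0\xa1'                    # GB2312 encoding of U+554A
as_unicode = gb_bytes.decode('gb2312')    # GB2312 => Unicode succeeds
round_trip = as_unicode.encode('gb2312')  # and converts back unchanged

traditional_dragon = u'\u9f8d'            # traditional form, not part of GB2312
try:
    traditional_dragon.encode('gb2312')   # Unicode => GB2312 may fail
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

The round trip returns the original bytes, while the second conversion raises UnicodeEncodeError and leaves encodable set to False.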

The final code is as follows:

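    import re

    def is_rare_name(string):
        # Illegal punctuation is caught by the regular expression
        pattern = re.compile(u"[~!@#$%^&*]")
        match = pattern.search(string)
        if match:
            return True
        # Anything that cannot be encoded as GB2312 counts as uncommon
        try:
            string.encode("gb2312")
        except UnicodeEncodeError:
            return True
        return False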
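When a record is flagged, it can help to know which characters caused it. The following variant is only a sketch built on the same idea, not part of the original solution; the names ILLEGAL and find_rare_chars are hypothetical, and it checks one character at a time instead of the whole string.

```python
import re

ILLEGAL = re.compile(u'[~!@#$%^&*]')


def find_rare_chars(text):
    """Return the characters in text that are illegal or cannot be encoded as GB2312."""
    offenders = []
    for ch in text:
        if ILLEGAL.search(ch):
            offenders.append(ch)
            continue
        try:
            ch.encode('gb2312')
        except UnicodeEncodeError:
            offenders.append(ch)
    return offenders
```

For example, find_rare_chars(u'a\u9f8d!') reports the traditional-form character and the exclamation mark, while a string made up only of GB2312 characters yields an empty list.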
