Detecting Uncommon Characters in Python
Solution
The first idea that comes to mind is to use a Python regular expression to match illegal characters and thereby identify invalid records. However, the ideal is full while reality is cruel: during implementation I discovered gaps in my knowledge of character encodings and string representation in Python, and I stepped on quite a few pitfalls along the way. Although some points remain fuzzy, I now have a generally clear picture. I am recording the experience here so I do not fall into the same traps in the future.
The test environment below is the Python 2.7.8 interpreter that ships with ArcGIS 10.3; other Python environments are not guaranteed to behave the same way.
Python Regular Expression
Regular expression support in Python is provided by the built-in re module, and three functions cover most uses. re.compile() builds a reusable pattern object, whose match() and search() methods return a match result. The difference between the two: match() matches only at the specified position, while search() scans forward from that position until a matching substring is found. In the following code, match_result starts matching at the first character 'f' and fails, so None is returned; search_result scans forward from 'f' until it finds the first matching character 'a', so calling group() on the result yields 'a'.
import re

pattern = re.compile('[abc]')

match_result = pattern.match('fabc')
if match_result:
    print(match_result.group())

search_result = pattern.search('fabc')
if search_result:
    print(search_result.group())
The approach above compiles a pattern before matching. The module-level re.match(pattern, string) achieves the same result directly, but direct matching is less flexible than compiling first. First, the regular expression cannot be reused: if a large amount of data is matched against the same pattern, the pattern must be processed internally on every call, costing performance. Second, re.match() is less powerful than pattern.match(): the latter can specify the position at which matching starts.
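A minimal sketch of that difference (shown with Python 3 print syntax, which also runs under the article's Python 2.7.8):

```python
import re

pattern = re.compile('[abc]')

# Module-level shortcut: same result, but the pattern is
# looked up and processed on every call.
print(re.match('[abc]', 'fabc'))   # None: 'f' is not in [abc]

# A compiled pattern's match() accepts a starting position,
# which the module-level re.match() cannot take.
m = pattern.match('fabc', 1)       # start matching at index 1 ('a')
print(m.group())                   # 'a'
```

This is why compiling once and reusing the pattern object is the more flexible choice when matching many records.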
Encoding Problems
With the basics of Python regular expressions covered, the next step is to find a suitable expression to match uncommon and invalid characters. Invalid characters are the easy part; they can be matched with the following pattern:
pattern = re.compile(r'[~!@#$%^&* ]')
Matching uncommon characters, however, proved harder. What exactly counts as an uncommon character? After consulting the project manager, we settled on a definition: any character outside the GB2312 character set is uncommon. That raised the next question: how do you match GB2312 characters?
Some research showed that GB2312 byte pairs fall in the range [\xA1-\xF7][\xA1-\xFE], and the Chinese-character (hanzi) area occupies [\xB0-\xF7][\xA1-\xFE]. The expression extended to match uncommon characters therefore becomes:
pattern = re.compile(r'[~!@#$%^&* ]|[^\xA1-\xF7][^\xA1-\xFE]')
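To make the byte ranges concrete, here is a small sketch in Python 3 syntax (where patterns over raw bytes need the b prefix; the positive hanzi-area range is used for clarity, and '中' is just a sample character):

```python
import re

# GB2312 hanzi area: first byte 0xB0-0xF7, second byte 0xA1-0xFE.
hanzi = re.compile(rb'[\xB0-\xF7][\xA1-\xFE]')

raw = u'中'.encode('gb2312')     # b'\xd6\xd0': both bytes in range
print(hanzi.match(raw).group())  # matches the two-byte sequence

print(hanzi.match(b'ab'))        # None: ASCII bytes fall outside the range
```

Note that this only works on the raw GB2312-encoded bytes, which is exactly where the approach runs into trouble next.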
The problem seemed solved, but I was still too simple, too naive. Since the strings to be checked are read from a layer file, arcpy thoughtfully decodes everything it reads into unicode. I would therefore need the unicode range of the GB2312 character set; the reality, however, is that GB2312 characters are not contiguous in unicode, so a regular expression covering that range would be extremely complex. The idea of matching uncommon characters with a regular expression had hit a dead end.
Solution
Since the provided string is in unicode, could it be converted to GB2312 for matching? The unicode character set is much larger than GB2312: the conversion GB2312 => unicode always succeeds, whereas the reverse, unicode => GB2312, may fail.
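This asymmetry is easy to demonstrate (Python 3 syntax; '镕' is assumed here as a sample character that exists in unicode and GBK but has no GB2312 code point):

```python
# GB2312 -> unicode always works: every valid byte pair decodes.
print(b'\xd6\xd0'.decode('gb2312'))   # '中'

# unicode -> GB2312 may fail: '镕' is not in the GB2312 table.
try:
    u'镕'.encode('gb2312')
except UnicodeEncodeError:
    print('not representable in GB2312')
```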
That failure suggested another idea: if a string fails the unicode => GB2312 conversion, doesn't that mean it contains characters outside the GB2312 character set? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify uncommon characters.
The final code is as follows:
import re

def is_rare_name(string):
    pattern = re.compile(u"[~!@#$%^&* ]")
    match = pattern.search(string)
    if match:
        return True
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True
    return False
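A quick usage sketch (the function is repeated so the snippet is self-contained, in Python 3 where str is already unicode; the sample names are made up):

```python
import re

def is_rare_name(string):
    pattern = re.compile(u"[~!@#$%^&* ]")
    if pattern.search(string):
        return True   # contains an invalid character
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True   # contains a character outside GB2312
    return False

print(is_rare_name(u'张伟'))      # False: ordinary GB2312 name
print(is_rare_name(u'朱镕基'))    # True: '镕' is not in GB2312
print(is_rare_name(u'bad name'))  # True: contains a space
```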
Summary
That is all for this article. I hope it helps you in your study or work. If you have any questions, please leave a comment.