Python Chinese Regular Expression notes

Last Update:2017-01-12 Source: Internet

Author: User



Viewed as strings, Chinese text is less tidy and regular than English; that is simply a fact of life. This article combines material found online with personal experience and, using Python as the example language, gives a brief summary. Additions and corrections are welcome.
Some practical notes
You can use the repr() function to view the raw representation of a string, escape sequences included. This is useful when writing regular expressions.
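As a minimal sketch of the point above, written for Python 3 (an assumption; the article's own script targets Python 2), the raw form of a string can be inspected like this. The sample text is made up for illustration. In Python 3, ascii() shows the escaped form that repr() shows for a Python 2 unicode string.

```python
# Hypothetical sample text, chosen only for illustration.
text = "正则"                  # a Unicode string (str in Python 3)
raw = text.encode("utf-8")    # the same text as UTF-8 bytes

print(repr(text))    # shows the characters themselves in Python 3
print(ascii(text))   # '\u6b63\u5219'
print(repr(raw))     # b'\xe6\xad\xa3\xe5\x88\x99'
```

The escaped forms are what a character class such as [\u4e00-\u9fa5] or [\x80-\xff] is written against.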
Python's re module has two similar functions: re.match() and re.search(). Both match in the same way; they differ only in where matching may start. match() tries only at the beginning of the string and gives up if that fails, while search() scans every possible position in the string until it finds a match or reaches the end without one. If you understand how match() behaves (it can be faster in some cases), use it freely; if you are not sure, search() is usually the function you need.
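The difference can be seen in a short Python 3 sketch. The sample text and pattern are invented for illustration.

```python
import re

text = "abc 正则"     # hypothetical sample: the Chinese part is not at the start
pattern = "正则"

m1 = re.match(pattern, text)    # tries only at position 0, so it fails here
m2 = re.search(pattern, text)   # scans forward until it finds a match

print(m1)           # None
print(m2.span())    # (4, 6)
```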
To find every match in a body of text and get the results back as a list, use re.findall(). The example program below shows it in use.
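A small Python 3 sketch of re.findall(), using a made-up sample string and the Chinese range quoted later in this article:

```python
import re

text = "正则 regex 表达式 re 文本"   # hypothetical mixed sample

# Each run of characters in the range becomes one list element.
parts = re.findall("[\u4e00-\u9fa5]+", text)
print(parts)    # ['正则', '表达式', '文本']

# When nothing matches, the result is an empty list, not None.
none_found = re.findall("[\u4e00-\u9fa5]+", "plain ascii")
print(none_found)   # []
```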
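In UTF-8, each commonly used Chinese character occupies 3 bytes (not 3 characters), so the well-known byte-level pattern is [\x80-\xff]{3}. Note that this pattern matches any three non-ASCII bytes, so it also matches text in other scripts, such as kana or Hangul.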
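The byte-level pattern can be tried out as follows. This is a Python 3 sketch in which bytes objects stand in for the article's Python 2 byte strings; the sample data is invented.

```python
import re

data = "zh: 中文 ok".encode("utf-8")   # hypothetical UTF-8 input

# A common Chinese character encodes to 3 bytes in UTF-8.
print(len("中".encode("utf-8")))    # 3

# Bytes pattern against bytes data: each match is one 3-byte group.
groups = re.findall(rb"[\x80-\xff]{3}", data)
chars = [g.decode("utf-8") for g in groups]
print(chars)    # ['中', '文']

# The pattern is not specific to Chinese: hiragana is also 3 bytes.
kana = re.findall(rb"[\x80-\xff]{3}", "あ".encode("utf-8"))
print(kana)     # [b'\xe3\x81\x82']
```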
In Unicode strings, Chinese characters are written as \uXXXX escapes. As long as you know the code point range of the character set you want, you can match the corresponding strings, which makes it easy to pick out text in one language from multilingual text. However, for an agglutinative language such as Japanese, which mixes Chinese characters (kanji) with hiragana and katakana, matching a single range can give skewed results.
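The caveat about Japanese can be illustrated with a short Python 3 sketch. The Japanese phrase is taken from the sample text of the example program in this article; the variable names are illustrative.

```python
import re

chinese = re.compile("[\u4e00-\u9fa5]+")

# Japanese text: kanji fall inside the "Chinese" range, kana do not.
jp = "正規表現は非常に役に立つ"
kanji_runs = chinese.findall(jp)
print(kanji_runs)   # ['正規表現', '非常', '役', '立']
```

The range alone cannot tell Chinese text from Japanese kanji, and it splits a Japanese sentence at every kana character.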
Several ranges can be combined in one character class. For example, to match hiragana, katakana, and Chinese characters together, use u"[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+". Define the class according to the text you need to match.
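A Python 3 sketch of the combined class follows (in Python 3 the u prefix is optional, so it is omitted). The sample sentence is invented for illustration.

```python
import re

# Chinese characters, hiragana and katakana in one character class.
cjk = re.compile("[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+")

text = "jp: 正規表現は非常に役に立つツール。 en: tool"   # hypothetical sample
runs = cjk.findall(text)
print(runs)   # ['正規表現は非常に役に立つツール']
```

The full stop 。 (U+3002) lies outside all three ranges, so it ends the run.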
When you match Chinese, the regular expression and the target string must be in the same format. This is crucial. Either both are in the default UTF-8 byte encoding, in which case you do not need to do anything, or both are Unicode, in which case the pattern must be written with the u"" prefix.
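In Python 3 this rule is enforced by the re module itself, which the following sketch demonstrates. It assumes Python 3, where str corresponds to a Python 2 unicode string and bytes to a Python 2 byte string; the sample text is invented.

```python
import re

text = "zh: 中文"              # Unicode string
data = text.encode("utf-8")   # the same text as UTF-8 bytes

# Matching formats: str pattern on str, bytes pattern on bytes.
from_text = re.findall("[\u4e00-\u9fa5]+", text)
from_data = re.findall(rb"[\x80-\xff]+", data)
print(from_text)   # ['中文']
print(from_data)   # [b'\xe4\xb8\xad\xe6\x96\x87']

# Mismatched formats: Python 3 refuses with a TypeError.
try:
    re.findall("[\u4e00-\u9fa5]+", data)
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
print(mixed_allowed)   # False
```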
You can define a Unicode string like this: string = u"I love regular expressions." If a string is not Unicode, you can convert it with the unicode() function (Python 2). If you know the encoding of the source string, convert it with newstr = unicode(oldstring, original_coding_name); for example, unicode(string, "utf8") is common under Linux, while Windows may need cp936 (not tested).
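For readers on Python 3, where unicode() no longer exists, the closest equivalent is bytes.decode(). This sketch assumes Python 3 and uses made-up byte strings; both codec names are part of the standard library.

```python
# Hypothetical inputs: the same two characters in two encodings.
raw_utf8 = b"\xe4\xb8\xad\xe6\x96\x87"    # UTF-8 bytes
raw_gbk = "中文".encode("cp936")           # cp936 (GBK) bytes

# Decode each with the encoding it was written in.
text_from_utf8 = raw_utf8.decode("utf-8")
text_from_gbk = raw_gbk.decode("cp936")

print(text_from_utf8 == text_from_gbk)   # True
print(len(raw_utf8), len(raw_gbk))       # 6 4
```

Decoding with the wrong encoding name either raises UnicodeDecodeError or yields the wrong characters, so the source encoding has to be known.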
Example Program

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    #
    # author: Rex
    # blog: http://iregex.org
    # filename: py_utf8_unicode.py
    # created: 2010-06-27 09:11
    # Python 2 script: it uses print statements, unicode() and u"" literals.

    import re


    def findpart(regex, text, name):
        """Print every match of regex in text, labelled with name."""
        res = re.findall(regex, text)
        if res:
            print "There is %d %s parts:\n" % (len(res), name)
            for r in res:
                print "\t", r
            print

    # sample is UTF8 by default.
    sample = '''en: Regular expression is a powerful tool for manipulating text.
    zh: 正则表达式是一种很有用的处理文本的工具。
    jp: 正規表現は非常に役に立つツールテキストを操作することです。
    jp-char: あアいイうウえエおオ
    kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
    puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃
    '''

    # let's look at its raw representation under the hood:
    print "The raw UTF8 string is:\n", repr(sample)
    print

    # find the non-ascii chars:
    findpart(r"[\x80-\xff]+", sample, "non-ascii")

    # convert the UTF8 to Unicode
    usample = unicode(sample, 'utf8')

    # let's look at its raw representation under the hood:
    print "The raw Unicode string is:\n", repr(usample)
    print

    # get each language's parts:
    findpart(u"[\u4e00-\u9fa5]+", usample, "Unicode Chinese")
    findpart(u"[\uac00-\ud7ff]+", usample, "Unicode Korean")
    findpart(u"[\u30a0-\u30ff]+", usample, "Unicode Japanese Katakana")
    findpart(u"[\u3040-\u309f]+", usample, "Unicode Japanese Hiragana")
    findpart(u"[\u3000-\u303f\ufb00-\ufffd]+", usample, "Unicode CJK Punctuation")

The program outputs the following:

    The raw UTF8 string is:
    'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'

    There is 14 non-ascii parts:

        正则表达式是一种很有用的处理文本的工具。
        正規表現は非常に役に立つツールテキストを操作することです。
        あアいイうウえエおオ
        정규
        표현식은
        매우
        유용한
        도구
        텍스트를
        조작하는
        것입니다
        。？！、，；：“
        ”‘
        ’——……·－·《》〈〉！￥％＆＊＃

    The raw Unicode string is:
    u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'

    There is 6 Unicode Chinese parts:

        正则表达式是一种很有用的处理文本的工具
        正規表現
        非常
        役
        立
        操作

    There is 8 Unicode Korean parts:

        정규
        표현식은
        매우
        유용한
        도구
        텍스트를
        조작하는
        것입니다

    There is 6 Unicode Japanese Katakana parts:

        ツールテキスト
        ア
        イ
        ウ
        エ
        オ

    There is 11 Unicode Japanese Hiragana parts:

        は
        に
        に
        つ
        を
        することです
        あ
        い
        う
        え
        お

    There is 5 Unicode CJK Punctuation parts:

        。
        。
        。？！、，；：
        －
        《》〈〉！￥％＆＊＃

This article is an English version of an article originally published in Chinese on aliyun.com.
