From the point of view of string handling, Chinese text is not as tidy and regular as English, and that is simply a fact of life. This article combines material found online with personal experience and gives a brief summary, using Python as the example language. Additions and corrections are welcome.
A few tips
You can use the repr() function to view the raw form of a string. This is useful when writing regular expressions.
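As a small illustration of the tip above, the sketch below prints the raw form of a short sample string. It is written for Python 3, which is an assumption (the article's own program targets Python 2); the sample string and variable names are made up for the example. In Python 3, repr() keeps printable non-ASCII characters as they are, so ascii() and the encoded bytes are shown alongside it to expose the escapes.

```python
# Python 3 sketch: inspect what a string really contains.
text = "正则\t表达式"             # hypothetical sample containing a tab
encoded = text.encode("utf-8")   # the UTF-8 byte form of the same text

print(repr(text))     # the tab appears as the escape \t
print(ascii(text))    # non-ASCII characters appear as \uXXXX escapes
print(repr(encoded))  # each byte above 0x7f appears as a \xNN escape
```

The \uXXXX and \xNN escapes printed here are the values that the character classes later in this article are built from.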
Python's re module has two similar functions: re.match() and re.search(). The matching process is exactly the same in both; only the starting point differs. match() tries to match only at the beginning of the string and gives up if that fails, whereas search() keeps trying every possible position in the string until it finds a match or reaches the end without one. If you understand how match() behaves (it is faster in some cases), use it freely; if you are not sure, search() is usually the function you need.
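The difference between the two functions can be seen in a minimal sketch. It assumes Python 3, where ordinary strings are already Unicode; the sample text and variable names are illustrative only.

```python
import re

text = "regex: 正则表达式"                 # ASCII prefix, then Chinese
pattern = re.compile(r"[\u4e00-\u9fa5]+")  # one or more Chinese characters

m = pattern.match(text)    # tries position 0 only, so this is None
s = pattern.search(text)   # scans forward until the first match

print(m)                     # None
print(s.group(), s.start())  # 正则表达式 7
```

match() succeeds here only if the text itself begins with a Chinese character, which is why search() is the safer default for mixed text.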
To find every match in a body of text and get the results back as a list, use re.findall(). The example program at the end of this article shows it in use.
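Before the full program, here is a shorter sketch of re.findall() on its own. It assumes Python 3, and the sample sentence is invented for the illustration.

```python
import re

text = "en: text, zh: 文本, jp: テキスト, zh: 工具"

# findall returns every non-overlapping match as a list of strings.
runs = re.findall(r"[\u4e00-\u9fa5]+", text)
print(runs)   # ['文本', '工具']

# When nothing matches, the result is an empty list, not None.
print(re.findall(r"[\u4e00-\u9fa5]+", "plain ascii"))   # []
```

The katakana word is skipped because its code points lie outside the \u4e00-\u9fa5 range.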
In UTF-8, each commonly used Chinese character occupies 3 bytes (not 3 characters), so the corresponding regular expression is [\x80-\xff]{3}. This much is well known.
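The byte-level pattern above can be tried out as follows. This is a sketch for Python 3, where UTF-8 data is a bytes object and the pattern must be a bytes pattern too; the sample strings are assumptions. The second, stricter pattern is not from the article: it is added only to show that [\x80-\xff]{3} counts bytes rather than characters and can drift when 2-byte characters are mixed in.

```python
import re

# Three ASCII bytes followed by two Chinese characters, encoded as UTF-8.
data = "abc正则".encode("utf-8")

# The article's pattern: any three consecutive bytes in 0x80-0xff.
chars = re.findall(rb"[\x80-\xff]{3}", data)
decoded = [c.decode("utf-8") for c in chars]
print(decoded)   # ['正', '则']

# A stricter variant (an addition, not the article's pattern): a 3-byte
# UTF-8 lead byte followed by exactly two continuation bytes. It stays
# aligned even when a 2-byte character such as "é" comes first.
mixed = "é正".encode("utf-8")
strict = re.findall(rb"[\xe0-\xef][\x80-\xbf]{2}", mixed)
print([c.decode("utf-8") for c in strict])   # ['正']
```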
In Unicode, Chinese characters are written in the form \uXXXX. As long as you know the code-point range of the character set you want, you can match the corresponding strings, which makes it easy to pick out the text of one language from multilingual text. For agglutinative languages such as Japanese, however, which mix Chinese characters with hiragana and katakana, the results may be off.
Several ranges can be combined in one character class. For example, hiragana, katakana, and Chinese characters together give u"[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+". Define the class according to the text you need to match.
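The effect of combining ranges is easiest to see on a Japanese sentence. The sketch below assumes Python 3; the constant names and the sample sentence are made up for the illustration, while the ranges themselves are the ones given above.

```python
import re

# Ranges from the text above, kept as separate pieces so they can be combined.
HAN = r"\u4e00-\u9fa5"
HIRAGANA = r"\u3040-\u309f"
KATAKANA = r"\u30a0-\u30ff"

text = "jp: 正規表現は役に立つツール。"

# Chinese characters only: the sentence is chopped up at every kana.
han_only = re.findall("[" + HAN + "]+", text)
print(han_only)   # ['正規表現', '役', '立']

# All three ranges in one class: the sentence comes back in one piece.
japanese = re.findall("[" + HAN + HIRAGANA + KATAKANA + "]+", text)
print(japanese)   # ['正規表現は役に立つツール']
```

The trailing full stop is left out in both cases because CJK punctuation sits in a separate range.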
When matching Chinese, the regular expression and the target string must use the same encoding. This is crucial. You can stay with the default UTF-8, in which case nothing extra is needed; if the target is a Unicode string, the regular expression must also be written as a Unicode literal in the u"" form.
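Python 3 enforces this rule explicitly, which makes it easy to demonstrate. The following is a sketch under that assumption (the article itself is written against Python 2); the sample text and variable names are illustrative.

```python
import re

text = "正则表达式"              # a Unicode string (str in Python 3)
data = text.encode("utf-8")     # the same text as UTF-8 bytes

# A Unicode pattern applied to bytes is rejected outright in Python 3.
try:
    re.findall(r"[\u4e00-\u9fa5]+", data)
    mismatch = None
except TypeError as exc:
    mismatch = str(exc)
print(mismatch)

# Matching pairs work: str pattern on str, bytes pattern on bytes.
str_hits = re.findall(r"[\u4e00-\u9fa5]+", text)
bytes_hits = re.findall(rb"[\x80-\xff]+", data)
print(str_hits, bytes_hits == [data])
```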
You can define a Unicode string like this: string = u"I love regular expressions." If a string is not Unicode, you can convert it with the unicode() function. If you know the encoding of the source string, convert it with newstr = unicode(oldstring, original_coding_name). For example, unicode(string, "utf8") is common under Linux; Windows may need cp936, though I have not tested this.
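For readers on Python 3, where unicode() no longer exists, the same conversion is done by decoding bytes. The sketch below is written under that assumption; the sample sentence and variable names are invented for the example, and the cp936 case is simulated by encoding the text first rather than by reading real Windows data.

```python
# Python 3 sketch: bytes in a known encoding -> Unicode string.
original = "我爱正则表达式"

raw_utf8 = original.encode("utf-8")    # 3 bytes per character here
raw_cp936 = original.encode("cp936")   # 2 bytes per character here

text_a = raw_utf8.decode("utf-8")      # method form
text_b = str(raw_cp936, "cp936")       # closest analogue of unicode(s, enc)
print(text_a == text_b, len(raw_utf8), len(raw_cp936))

# Decoding with the wrong codec fails, which is why the source
# encoding has to be known.
try:
    raw_cp936.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```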
Example Program
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# author: Rex
# blog: http://iregex.org
# filename: py_utf8_unicode.py
# created: 2010-06-27 09:11

import re

def findpart(regex, text, name):
    # Print every match of regex in text, labelled with name.
    res = re.findall(regex, text)
    if res:
        print "There is %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r
        print

# sample is UTF8 by default.
sample = '''en: Regular expression is a powerful tool for manipulating text.
zh: 正则表达式是一种很有用的处理文本的工具。
jp: 正規表現は非常に役に立つツールテキストを操作することです。
jp-char: あアいイうウえエおオ
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃
'''

# let's look at its raw representation under the hood:
print "the raw UTF8 string is:\n", repr(sample)
print

# find the non-ascii chars:
findpart(r"[\x80-\xff]+", sample, "Non-ascii")

# convert the UTF8 to Unicode
usample = unicode(sample, 'utf8')

# let's look at its raw representation under the hood:
print "the raw Unicode string is:\n", repr(usample)
print

# get each language's parts:
findpart(u"[\u4e00-\u9fa5]+", usample, "Unicode Chinese")
findpart(u"[\uac00-\ud7ff]+", usample, "Unicode Korean")
findpart(u"[\u30a0-\u30ff]+", usample, "Unicode Japanese Katakana")
findpart(u"[\u3040-\u309f]+", usample, "Unicode Japanese Hiragana")
findpart(u"[\u3000-\u303f\ufb00-\ufffd]+", usample, "Unicode CJK Punctuation")
The program outputs the following:
the raw UTF8 string is:
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'

There is 14 Non-ascii parts:

	正则表达式是一种很有用的处理文本的工具。
	正規表現は非常に役に立つツールテキストを操作することです。
	あアいイうウえエおオ
	정규
	표현식은
	매우
	유용한
	도구
	텍스트를
	조작하는
	것입니다
	。？！、，；：“
	”‘
	’——……·－·《》〈〉！￥％＆＊＃

the raw Unicode string is:
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'

There is 6 Unicode Chinese parts:

	正则表达式是一种很有用的处理文本的工具
	正規表現
	非常
	役
	立
	操作

There is 8 Unicode Korean parts:

	정규
	표현식은
	매우
	유용한
	도구
	텍스트를
	조작하는
	것입니다

There is 6 Unicode Japanese Katakana parts:

	ツールテキスト
	ア
	イ
	ウ
	エ
	オ

There is 11 Unicode Japanese Hiragana parts:

	は
	に
	に
	つ
	を
	することです
	あ
	い
	う
	え
	お

There is 5 Unicode CJK Punctuation parts:

	。
	。
	。？！、，；：
	－
	《》〈〉！￥％＆＊＃