Reference: http://hi.baidu.com/nivrrex/blog/item/e6ccaf511d0926888d543071.html
http://topic.csdn.net/u/20070404/15/b011aa83-f9b7-43b3-bbff-bfe4f653df03.html
First, make sure that everything is Unicode, for example:
str.decode('utf8')  # when reading from UTF-8 text
u"啊"               # when typing literals in the console
(Aside: I first tried converting the text to its hex encoding, but then each character seemed to occupy two positions and the regex match came up empty.)
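For readers on Python 3, here is a minimal sketch of the same decoding step. The sample string is illustrative, not from the original post; in Python 3 the decoded result is simply `str`, which is always Unicode.

```python
# Python 3 equivalent of the Python 2 str.decode('utf8') step:
# bytes read from a UTF-8 source must be decoded to str before regex work.
raw = "啊 hello".encode("utf-8")   # simulate bytes read from a UTF-8 file
text = raw.decode("utf-8")         # now a Unicode str
print(type(text).__name__)         # prints: str
```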
Second, determine the Chinese range: [\u4e00-\u9fa5]
(Note that with Python's re this is written u"[\u4e00-\u9fa5]" — the regular expression itself must also be Unicode.)
import re

def filter(string):
    string = "ABC 测试 123<>!*(^)$%~!@#$…&%¥-+=,。;'\"：、"
    string = string.decode("utf-8")
    filtrate = re.compile(u'[^\u4e00-\u9fa5]')  # non-Chinese
    filtered_str = filtrate.sub(r'', string)    # replace
    print filtered_str
    return filtered_str
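The function above is Python 2. A sketch of the same idea for Python 3, where no explicit decode is needed and `\uXXXX` escapes work in plain string patterns (the sample string here is my own, not the original's):

```python
import re

def keep_chinese(s):
    """Strip every character outside \\u4e00-\\u9fa5, keeping only Chinese."""
    non_chinese = re.compile(r'[^\u4e00-\u9fa5]')  # matches non-Chinese characters
    return non_chinese.sub('', s)

print(keep_chinese("ABC 测试 123<>!*(^)$%"))  # prints: 测试
```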
Demo (Python 2 interactive session):

>>> print re.match(ur"[\u4e00-\u9fa5]+", "啊")
None
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"啊")
<_sre.SRE_Match object at 0x2a98981308>
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"t")
None
>>> tt = "现在才明白"
>>> print tt
现在才明白
>>> tt
'\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd'
>>> print re.match(r"[\u4e00-\u9fa5]", tt.decode('utf8'))
None
>>> print re.match(ur"[\u4e00-\u9fa5]", tt.decode('utf8'))
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,匹配到了")
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,no no")
None

(现在才明白 means "now I finally get it"; 匹配到了 means "it matched".)
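In Python 3 the bytes-vs-Unicode pitfall the demo illustrates largely disappears: `str` is always Unicode, `\uXXXX` escapes work in plain patterns, and mixing a bytes pattern with a `str` subject is an explicit error rather than a silent non-match. A small sketch:

```python
import re

tt = "现在才明白"                          # already a Unicode str in Python 3
m = re.match(r"[\u4e00-\u9fa5]+", tt)     # no ur"" prefix needed
print(m.group())                          # prints: 现在才明白

# A bytes pattern against a str raises TypeError instead of returning None:
try:
    re.match(rb"abc", tt)
except TypeError:
    print("bytes pattern against str raises TypeError")
```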
Other, wider ranges (reposted)
Here are the main non-English character ranges (found via Google):
2E80~33FFH: CJK symbol area. Holds Kangxi radicals, CJK auxiliary radicals, phonetic symbols, Japanese kana, Korean compatibility jamo, CJK symbols and punctuation, circled and enclosed numbers, months, kana combinations, units, era names, dates, times, and so on.
3400~4DFFH: CJK Unified Ideographs Extension A, holding a total of 6,582 CJK characters.
4E00~9FFFH: CJK Unified Ideographs area, holding a total of 20,902 CJK characters.
A000~A4FFH: Yi script area, holding the Yi syllables and radicals used in southern China.
AC00~D7FFH: Hangul syllables area, holding syllables composed from Korean jamo.
F900~FAFFH: CJK Compatibility Ideographs area, holding a total of 302 CJK characters.
FB00~FFFDH: presentation forms area, holding Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, halfwidth and fullwidth forms, and so on.
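As a quick way to check which of the blocks above a given character falls into, here is a small Python 3 sketch. The block names are informal labels taken from the list above, not official Unicode block names:

```python
# Informal block table based on the ranges listed above (assumption: labels are mine).
BLOCKS = [
    (0x2E80, 0x33FF, "CJK symbols / radicals / kana"),
    (0x3400, 0x4DFF, "CJK Ext A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFB00, 0xFFFD, "Presentation forms"),
]

def block_of(ch):
    """Return the informal block label for a single character."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"

print(block_of("中"))   # prints: CJK Unified Ideographs (U+4E2D)
print(block_of("한"))   # prints: Hangul syllables (U+D55C)
print(block_of("の"))   # prints: CJK symbols / radicals / kana (U+306E)
```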
For example, to match all Chinese and Japanese Han characters, the regular expression should be ^[\u3400-\u9fff]+$.
In theory, yes. But I went to msn.co.ko and copied some Korean text, and found it did not match — strange. Then I copied 'お尻' from msn.co.jp, and that failed too.
Extending the range to ^[\u2e80-\u9fff]+$ made everything pass. This should be the regular expression that matches CJK characters, including the Traditional Chinese still used in Taiwan.
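The kana failure is easy to reproduce: お is U+304A, which sits below U+3400, so only the wider class catches it. A Python 3 sketch:

```python
import re

narrow = re.compile(r'^[\u3400-\u9fff]+$')   # the Han-only attempt from above
wide = re.compile(r'^[\u2e80-\u9fff]+$')     # the extended CJK range

print(bool(narrow.match("お尻")))  # prints: False — kana お (U+304A) is below U+3400
print(bool(wide.match("お尻")))    # prints: True — kana falls inside U+2E80..U+9FFF
```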
As for Chinese alone, the regular expression should be ^[\u4e00-\u9fff]+$, which is very close to the ^[\u4e00-\u9fa5]+$ often cited on forums.
Note that although forums describe ^[\u4e00-\u9fa5]+$ as a class specifically for Simplified Chinese, Traditional characters actually fall inside it as well: I ran the traditional form '中華人民共和國' through a regex tester and it passed. And of course ^[\u4e00-\u9fff]+$ matches it too.
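That claim is easy to verify: every character of the traditional form 中華人民共和國 has a code point at or below U+9FA5, so the supposedly "Simplified-only" class matches it. A Python 3 check:

```python
import re

simplified_range = re.compile(r'^[\u4e00-\u9fa5]+$')
wider_range = re.compile(r'^[\u4e00-\u9fff]+$')

print(bool(simplified_range.match("中華人民共和國")))  # prints: True — traditional passes too
print(bool(wider_range.match("中華人民共和國")))       # prints: True
```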
This article transferred from: http://blog.csdn.net/wangran51/article/details/7752104
Python: matching Chinese characters