Python matches Chinese characters

Source: Internet
Author: User

Reference: http://hi.baidu.com/nivrrex/blog/item/e6ccaf511d0926888d543071.html
Http://topic.csdn.net/u/20070404/15/b011aa83-f9b7-43b3-bbff-bfe4f653df03.html

First, make sure that all encodings are Unicode
such as Str.decode (' UTF8 ') #从utf8文本中
U "Ah L" #在控制台输出中
(wordy) I want to use a reference must be coded Hex but depressed is that each word seems to occupy 2 positions, using regular match without fruit.
Second, determine the Chinese range: [\u4e00-\u9fa5]

(note here that Python's re is written) to u"[\u4e00-\u9fa5]" #确定正则表达式也是 Unicode


DEF filter (String):

String = "AVB Test 123<>"! * (^) $%[email protected]#$...&%¥-+=,. ,; ' "": • ' text ';
String = String.decode ("Utf-8")
filtrate = re.compile (U ' [^\u4e00-\u9fa5] ') #非中文
Filtered_str = Filtrate.sub (R ', string) #replace
Print Filtered_str
Return ' Te '


Demo:

>>>PrintRe.match (Ur"[\u4e00-\u9fa5]+","Ah")
None
>>>PrintRe.match (Ur"[\u4e00-\u9fa5]+", u"Ah")
<_sre. Sre_match Object at0x2a98981308>


>>>PrintRe.match (Ur"[\u4e00-\u9fa5]+", u"T")
None


>>>PrintTt
Now we understand.
>>>Tt
‘\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd‘
>>>PrintRe.match (R"[\u4e00-\u9fa5]", Tt.decode (‘Utf8‘))
None
>>>PrintRe.match (Ur"[\u4e00-\u9fa5]", Tt.decode (‘Utf8‘))
<_sre. Sre_match Object at0x2a955d9c60>


>>>PrintRe.match (Ur".*["U4e00-"u9fa5]+", u"Hi, match it to the")
<_sre. Sre_match object at 0x2a955d9c60>< Span style= "COLOR: #000000" >
>>> < Span style= "COLOR: #0000ff" >print re.match (Ur " Span style= "COLOR: #800000" >.*[ "u4e00-" u9fa5]+,u "hi,no no ")
None





Other expansion range (RPM)

Here are a few of the main non-English language character ranges (found on Google):
2E80 ~33FFH: Chinese-Japanese-Korean symbol area. Host Kangxi Radical, CJK Auxiliary radicals, phonetic symbols, Japanese kana, Korean notes, symbols of CJK, punctuation, circled or with rune numbers, month, and Japanese kana combination, unit, era name, month, date, time, etc.
3400 ~4DFFH: CJK Identity Ideographs expands zone A to accommodate 6 of the total,582 Chinese and Japanese Korean characters.
4E00 ~9FFFH: China, Japan, South Korea identify ideographs area, total admission 20,902 Chinese and Japanese Korean characters.
A000~A4FFH: Yi writing area, accommodating the Chinese Southern Yi characters and the word root.
AC00~D7FFH: Korean phonetic combination, accommodating words made of Korean notes.
F900~FAFFH: CJK compatible Ideographs area, with a total of 302 Chinese and Japanese Korean characters.
FB00~FFFDH: Text representation area, accommodating combination Latin text, Hebrew, Arabic, CJK Straight punctuation, small symbols, half-width symbols, full-width symbols and so on.
For example, you need to match all Chinese and Japanese Han characters,Then the regular expression should be ^ [\u3400-\u9fff ]+$
Theoretically, yes.,But I went to Msn.co.ko and copied a Korean.,I found it wrong.,Weird
And then the msn.co.jp copied a ' お尻 '.,Also cannot be done.
Then extend the range to ^ [\u2e80-\u9fff ]+$,It all went through.,This should be a regular expression that matches CJK characters.,Including the traditional Chinese that our Taiwan Province is still using blindly.
And the regular expression about Chinese,It's supposed to be ^ [\u4e00-\u9fff ]+$,And the forum is often mentioned ^ [\u4e00-\u9fa5 ]+$ very close
Note that the forum said ^ [\u4e00-\u9fa5 ]+$ This is specifically used to match the Simplified Chinese regular expression , in fact, the traditional characters are also inside , I tested the next ' Chinese People's Republic ' with a tester,and also passed , of course , ^ [\u4e00-\u9fff ]
This article transferred from: http://blog.csdn.net/wangran51/article/details/7752104

Python matches Chinese characters

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.