Python regular expressions match Chinese characters,
#-*-Coding: UTF-8-*-import redef findPart (regex, text, name): res = re. findall (regex, text) if res: print "There are % d % s parts: \ n" % (len (res), name) for r in res: print "\ t", r. encode ("utf8") print text = "# who # helloworld # a Chinese x #" usample = unicode (text, 'utf8 ') findPart (u "# [\ w \ u2E80-\ u9FFF] + #", usample, "unicode chinese") Note: The main non-English-speaking characters range from 2E80 ~ 33FFh: Symbol area of China, Japan, and South Korea. Reception of Kangxi Dictionary heads, China-Japan-South Korea auxiliary departments heads, phonetic symbols, Japanese Kana, Korean Notes, Chinese-Japan-South Korea symbols, punctuation marks, circled or including Rune numbers, months, and Japanese Kana combination, unit, year, month, date, and time. 3400 ~ 4 DFFh: Japan and South Korea recognized the expansion of ideographic text area A, a total of 6,582 Chinese and Korean characters. 4E00 ~ 9 FFFh: Japan and South Korea recognized the ideographic text area, a total of 20,902 Chinese and Korean characters. A000 ~ A4FFh: Yi text area, which contains the texts and roots of Yi people in southern China. AC00 ~ D7FFh: A combination area of Korean and pinyin. It contains text in Korean Notes. F900 ~ FAFFh: compatible with ideographic text area, a total of 302 Chinese and Korean characters. FB00 ~ FFFDh: Text representation area. It contains a combination of Latin characters, Hebrew characters, Arabic characters, China-Japan-South Korea vertices, small characters, halfwidth characters, and fullwidth characters (#! /Usr/bin/python3 #-*-coding: UTF-8-*-import remessage = u 'Heaven and man '. encode ('utf8') print (re. search (u'people '. encode ('utf8'), message ). example in group () Interaction Mode> import re> s = 'phone No. 010-87654321 '>>>>> r = re. compile (R' (\ d +)-(\ d +) ')> m = r. search (s) >>> m <_ sre. SRE_Match object at 0x010EE218>)