To match Chinese characters in Python, first look at the following table of byte-range regular expressions for common encodings.
UTF8
[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}
UTF16
[\x00-\xd7][\xe0-\xff]|[\xd8-\xdf][\x00-\xff]{2}
JIS
[\x20-\x7e]|[\x21-\x5f]|[\x21-\x7e]{2}
SJIS
[\x20-\x7e]|[\xa1-\xdf]|([\x81-\x9f]|[\xe0-\xef])([\x40-\x7e]|[\x80-\xfc])
EUC_JP
[\x20-\x7e]|\x81[\xa1-\xdf]|[\xa1-\xfe][\xa1-\xfe]|\x8f[\xa1-\xfe]{2}
EUC_JP punctuation and special characters
[\xa1-\xa2][\xa0-\xfe]
EUC_JP full-width digits
\xa3[\xb0-\xb9]
EUC_JP full-width uppercase letters
\xa3[\xc1-\xda]
EUC_JP full-width lowercase letters
\xa3[\xe1-\xfa]
EUC_JP full-width hiragana
\xa4[\xa1-\xf3]
EUC_JP full-width katakana
\xa3[\xb0-\xb9]|\xa3[\xc1-\xda]|\xa5[\xa1-\xf6][\xa3][\xb0-\xfa]|[\xa1][\xbc-\xbe]|[\xa1][\xdd]
EUC_JP full-width kanji
[\xb0-\xcf][\xa0-\xd3]|[\xd0-\xf4][\xa0-\xfe]|[\xb0-\xf3][\xa1-\xfe]|[\xf4][\xa1-\xa6]|[\xa4][\xa1-\xf3]|[\xa5][\xa1-\xf6]|[\xa1][\xbc-\xbe]
Big5
[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])
GBK
[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]
GB2312 Chinese characters
[\xb0-\xf7][\xa0-\xfe]
GB2312 punctuation marks and special symbols
\xa1[\xa2-\xfe]
GB2312 Roman numerals and list item numbers
\xa2([\xa1-\xaa]|[\xb1-\xbf]|[\xc0-\xdf]|[\xe0-\xe2]|[\xe5-\xee]|[\xf1-\xfc])
GB2312 full-width punctuation and full-width letters
\xa3[\xa1-\xfe]
GB2312 Japanese hiragana
\xa4[\xa1-\xf3]
GB2312 Japanese katakana
\xa5[\xa1-\xf6]
GB18030
[\x00-\x7f]|[\x81-\xfe][\x40-\xfe]|[\x81-\xfe][\x30-\x39][\x81-\xfe][\x30-\x39]
Japanese half-width space
\x20
SJIS full-width space
(?:\x81\x81)
SJIS full-width digits
(?:\x82[\x4f-\x58])
SJIS full-width uppercase letters
(?:\x82[\x60-\x79])
SJIS full-width lowercase letters
(?:\x82[\x81-\x9a])
SJIS full-width hiragana
(?:\x82[\x9f-\xf1])
SJIS full-width hiragana, extended
(?:\x82[\x9f-\xf1]|\x81[\x4a\x4b\x54\x55])
SJIS full-width katakana
(?:\x83[\x40-\x96])
SJIS full-width katakana, extended
(?:\x83[\x40-\x96]|\x81[\x45\x5b\x52\x53])
EUC_JP full-width space
(?:\xa1\xa1)
EUC half-width katakana
(?:\x8e[\xa6-\xdf])
OK, now let's write a regular expression that matches Chinese.
The code is as follows |
Copy Code |
#-*-Coding:utf-8-*- Import re def findpart (regex, text, name): Res=re.findall (regex, text) If Res: Print "There are%d%s parts:n"% (len (res), name) For R in Res: Print "T", R.encode ("UTF8") Print
Text = "#who #helloworld#a Chinese x#" Usample=unicode (text, ' UTF8 ') Findpart (U "#[wu2e80-u9fff]+#", Usample, "Unicode Chinese") |
Testing a Chinese match
The code is as follows:

    import re

    # Both the pattern and the message are encoded to UTF-8 bytes before searching.
    message = u'天人合一'.encode('utf8')
    print(re.search(u'人'.encode('utf8'), message).group())
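Searching encoded bytes and searching decoded text find the same character but report different positions. The following sketch, which assumes Python 3 and the Chinese sample string 天人合一, compares the two; the variable names are illustrative.

```python
import re

message = '天人合一'

# Search the decoded text: offsets count characters.
char_match = re.search('人', message)
# Search the UTF-8 bytes: offsets count bytes (three per character here).
byte_match = re.search('人'.encode('utf8'), message.encode('utf8'))

char_span = char_match.span()
byte_span = byte_match.span()
print(char_span, byte_span)
```

If match positions are later used to slice the original string, matching on decoded text avoids converting byte offsets back to character offsets.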
Example in interactive mode
The code is as follows:

    >>> import re
    >>> s = 'Phone No. 010-87654321'
    >>>
    >>> r = re.compile(r'(\d+)-(\d+)')
    >>> m = r.search(s)
    >>> m
    <_sre.SRE_Match object at 0x010ee218>
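The session above stops at the match object itself. As a short sketch of how the two captured groups might be read from it, the following reuses the same pattern and string; the names `area_code` and `number` are labels chosen here for illustration.

```python
import re

s = 'Phone No. 010-87654321'
r = re.compile(r'(\d+)-(\d+)')
m = r.search(s)

# group(0) is the whole match; groups() returns the parenthesized parts in order.
whole = m.group(0)
area_code, number = m.groups()
print(whole)
print(area_code, number)
```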
Note: the main non-English character ranges in Unicode
2E80~33FFh: CJK symbols area. Holds Kangxi dictionary radicals, CJK supplementary radicals, phonetic symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized numbers, months, combined Japanese kana forms, units, era names, and symbols for months, days, hours, and so on.
3400~4DFFh: CJK Unified Ideographs Extension A, 6,582 CJK Han characters in total.
4E00~9FFFh: CJK Unified Ideographs, 20,902 CJK Han characters in total.
A000~A4FFh: Yi script area, holding the characters and radicals of the Yi script of southern China.
AC00~D7FFh: Hangul syllables area, holding the syllables composed from Korean jamo.
F900~FAFFh: CJK Compatibility Ideographs, 302 CJK Han characters in total.
FB00~FFFDh: presentation forms area, holding Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, half-width symbols, and full-width symbols.
#!/usr/bin/python3
# -*- coding: utf-8 -*-