Matching Chinese Characters with Python Regular Expressions

Source: Internet
Author: User


To match Chinese characters in Python, first look at the following table of byte-level patterns for common encodings.

UTF8
[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}
UTF16
[\x00-\xd7][\xe0-\xff]|[\xd8-\xdf][\x00-\xff]{2}
JIS
[\x20-\x7e]|[\x21-\x5f]|[\x21-\x7e]{2}
SJIS
[\x20-\x7e]|[\xa1-\xdf]|([\x81-\x9f]|[\xe0-\xef])([\x40-\x7e]|[\x80-\xfc])
EUC_JP
[\x20-\x7e]|\x81[\xa1-\xdf]|[\xa1-\xfe][\xa1-\xfe]|\x8f[\xa1-\xfe]{2}
EUC_JP punctuation and special characters
[\xa1-\xa2][\xa0-\xfe]
EUC_JP full-width digits
\xa3[\xb0-\xb9]
EUC_JP full-width uppercase letters
\xa3[\xc1-\xda]
EUC_JP full-width lowercase letters
\xa3[\xe1-\xfa]
EUC_JP full-width hiragana
\xa4[\xa1-\xf3]
EUC_JP full-width katakana
\xa3[\xb0-\xb9]|\xa3[\xc1-\xda]|\xa5[\xa1-\xf6][\xa3][\xb0-\xfa]|[\xa1][\xbc-\xbe]|[\xa1][\xdd]
EUC_JP full-width kanji
[\xb0-\xcf][\xa0-\xd3]|[\xd0-\xf4][\xa0-\xfe]|[\xb0-\xf3][\xa1-\xfe]|[\xf4][\xa1-\xa6]|[\xa4][\xa1-\xf3]|[\xa5][\xa1-\xf6]|[\xa1][\xbc-\xbe]
Big5
[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])
GBK
[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]
GB2312 Chinese characters
[\xb0-\xf7][\xa0-\xfe]
GB2312 punctuation marks and special symbols
\xa1[\xa2-\xfe]
GB2312 Roman numerals and item serial numbers
\xa2([\xa1-\xaa]|[\xb1-\xbf]|[\xc0-\xdf]|[\xe0-\xe2]|[\xe5-\xee]|[\xf1-\xfc])
GB2312 full-width punctuation and full-width letters
\xa3[\xa1-\xfe]
GB2312 Japanese hiragana
\xa4[\xa1-\xf3]
GB2312 Japanese katakana
\xa5[\xa1-\xf6]
Mesenchymal
GB18030
[\x00-\x7f]|[\x81-\xfe][\x40-\xfe]|[\x81-\xfe][\x30-\x39][\x81-\xfe][\x30-\x39]
Japanese half-width space
\x20
SJIS full-width space
(?:\x81\x81)
SJIS full-width digits
(?:\x82[\x4f-\x58])
SJIS full-width uppercase letters
(?:\x82[\x60-\x79])
SJIS full-width lowercase letters
(?:\x82[\x81-\x9a])
SJIS full-width hiragana
(?:\x82[\x9f-\xf1])
SJIS full-width hiragana, extended
(?:\x82[\x9f-\xf1]|\x81[\x4a\x4b\x54\x55])
SJIS full-width katakana
(?:\x83[\x40-\x96])
SJIS full-width katakana, extended
(?:\x83[\x40-\x96]|\x81[\x45\x5b\x52\x53])
EUC_JP full-width space
(?:\xa1\xa1)
EUC half-width katakana
(?:\x8e[\xa6-\xdf])

With that background, here is a regular expression that matches Chinese text.

The code is as follows:
# -*- coding: utf-8 -*-
import re

def findpart(regex, text, name):
    # Collect every non-overlapping match of regex in text.
    res = re.findall(regex, text)
    if res:
        print("There are %d %s parts:\n" % (len(res), name))
        for r in res:
            print("\t", r)
        print()

# Python 3: str is already Unicode, so no decode step is needed.
text = "#who#helloworld#中文x#"
# U+2E80 to U+9FFF covers CJK radicals, symbols, kana, and unified ideographs.
findpart(r"#[\w\u2e80-\u9fff]+#", text, "Unicode Chinese")
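
One detail worth checking in the example above is how `\w` behaves. In Python 3, `\w` in a str pattern already matches Unicode word characters, including Chinese, unless the `re.ASCII` flag is given. The sketch below assumes Python 3 and a sample string containing the Chinese word 中文; the variable names are illustrative only.

```python
import re

text = "#who#helloworld#中文x#"

# Default (Unicode) mode: \w matches Chinese characters as well as ASCII.
unicode_words = re.findall(r"#\w+#", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Chinese segment is skipped.
ascii_words = re.findall(r"#\w+#", text, flags=re.ASCII)

# An explicit range matches only CJK unified ideographs.
han_only = re.findall(r"[\u4e00-\u9fff]+", text)

print(unicode_words)  # ['#who#', '#中文x#']
print(ascii_words)    # ['#who#']
print(han_only)       # ['中文']
```

Because `findall` returns non-overlapping matches, `#helloworld#` is not reported: its leading `#` was consumed by `#who#`.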


Testing a Chinese match

The code is as follows:

import re

# Pattern and subject are both UTF-8 encoded bytes.
message = '天人合一'.encode('utf8')
# Decode the matched bytes so the character prints readably.
print(re.search('人'.encode('utf8'), message).group().decode('utf8'))
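
The same search can be run on str or on encoded bytes, and the difference shows up in the reported positions. The sketch below assumes Python 3, where a str pattern can only be applied to a str subject and a bytes pattern only to a bytes subject; the sample phrase 天人合一 is an assumption for illustration.

```python
import re

text = "天人合一"

# Matching on str: positions are counted in characters.
m_str = re.search("人", text)

# Matching on UTF-8 bytes: pattern and subject must both be bytes,
# and positions are counted in bytes (each of these characters is 3 bytes).
m_bytes = re.search("人".encode("utf8"), text.encode("utf8"))

print(m_str.group(), m_str.start())                     # 人 1
print(m_bytes.group().decode("utf8"), m_bytes.start())  # 人 3
```

Matching on str is usually the simpler choice, since offsets then line up with what a reader sees on screen.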

Examples in interactive mode

The code is as follows:

>>> import re
>>> s = 'Phone No. 010-87654321'
>>>
>>> r = re.compile(r'(\d+)-(\d+)')
>>> m = r.search(s)
>>> m
<_sre.SRE_Match object at 0x010ee218>
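
The match object from the session above is mostly useful for its captured groups. The following short sketch, assuming Python 3, shows how the two parenthesized groups might be read back; the names area_code and number are hypothetical.

```python
import re

s = "Phone No. 010-87654321"
m = re.search(r"(\d+)-(\d+)", s)

# search() returns None when nothing matches, so check before using it.
if m:
    area_code, number = m.groups()  # the two parenthesized groups
    print(m.group(0))   # whole match: 010-87654321
    print(area_code)    # 010
    print(number)       # 87654321
    print(m.span())     # (10, 22)
```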

Note: several major non-English character ranges in Unicode (hexadecimal code points):
2E80~33FF: CJK symbols area. Contains Kangxi dictionary radicals, CJK supplementary radicals, phonetic symbols (Bopomofo), Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized numbers, months, and Japanese kana combinations, units, era names, months, days, hours, and so on.
3400~4DFF: CJK Unified Ideographs Extension A, 6,582 Chinese, Japanese, and Korean Han characters in total.
4E00~9FFF: CJK Unified Ideographs, 20,902 Chinese, Japanese, and Korean Han characters in total.
A000~A4FF: Yi area. Contains the syllables and radicals of the Yi script of southern China.
AC00~D7FF: Hangul syllables area. Contains the syllables composed from Korean jamo.
F900~FAFF: CJK Compatibility Ideographs, 302 Chinese, Japanese, and Korean Han characters in total.
FB00~FFFD: presentation forms area. Contains Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, and half-width and full-width forms.
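
The ranges in the note above can be turned into character classes mechanically. The following is a minimal sketch, assuming Python 3; the dictionary keys and the helper name range_pattern are hypothetical labels invented for this example, and only the code-point boundaries come from the note.

```python
import re

# Code-point ranges taken from the note; the key names are hypothetical.
RANGES = {
    "cjk_symbols": (0x2E80, 0x33FF),
    "cjk_ext_a": (0x3400, 0x4DFF),
    "cjk_unified": (0x4E00, 0x9FFF),
    "yi": (0xA000, 0xA4FF),
    "hangul": (0xAC00, 0xD7FF),
    "cjk_compat": (0xF900, 0xFAFF),
}

def range_pattern(*names):
    """Compile a regex matching runs of characters from the named ranges."""
    parts = "".join("\\u%04x-\\u%04x" % RANGES[n] for n in names)
    return re.compile("[" + parts + "]+")

han = range_pattern("cjk_ext_a", "cjk_unified", "cjk_compat")
hangul = range_pattern("hangul")

sample = "Python 正则表达式 and 한국어 text"
print(han.findall(sample))     # ['正则表达式']
print(hangul.findall(sample))  # ['한국어']
```

These ranges cover only the Basic Multilingual Plane, so rarer ideographs encoded above U+FFFF would not be matched by this sketch.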
