Introduction to matching Chinese with regular expressions in Python

Source: Internet
Author: User

I am writing a small microblog updater program that needs to count the number of characters the user has entered. Chinese and English characters are encoded with different byte lengths, so calling len() on the encoded string in Python returns its actual byte length, and a Chinese character does not count the same as an English letter. Each Chinese character therefore needs to be treated as a single English letter before counting.

I wrote the following statement to handle this:

The code is as follows:

length = len(re.sub('[\x80-\xff]{3}', 'a', msg))

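It replaces every Chinese character with the English letter a and then counts the characters. The {3} is there because a common Chinese character occupies three bytes in UTF-8, each of them in the range 0x80-0xff. (This is only for counting; the source string is not modified.) The statement works correctly on Windows with UTF-8 encoded files.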
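The statement above dates from Python 2, where msg was a byte string. A minimal sketch of the same idea for Python 3 follows, assuming the input is UTF-8 encoded bytes; the function name count_display_chars and the sample text are hypothetical, chosen only for illustration. Note that a four-byte UTF-8 sequence (for example an emoji) would not be counted as a single letter by this pattern.

```python
import re

def count_display_chars(msg: bytes) -> int:
    """Count characters in UTF-8 bytes, treating each 3-byte run as one letter."""
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, and common
    # Chinese characters occupy exactly three such bytes.
    return len(re.sub(rb'[\x80-\xff]{3}', b'a', msg))

# Hypothetical sample: two Chinese characters followed by seven ASCII characters.
msg = "你好, world"
encoded = msg.encode("utf-8")

byte_count = len(encoded)                  # raw byte length of the UTF-8 data
char_count = count_display_chars(encoded)  # Chinese characters counted once each
print(byte_count, char_count)
```

On a Python 3 str the substitution is unnecessary, because len(msg) already counts code points rather than bytes.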


Comparison of matching results for common Chinese regular expressions

Pattern [\u4e00-\u9fff] match result: 2 mi ^ ^
Pattern [^\u4e00-\u9fff] match result: My dear only Mao, you know? I was thinking about you guys-
Pattern [\u4e00-\u9fa5] match result: 2 mi ^ ^
Pattern [^\u4e00-\u9fa5] match result: My dear only Mao, you know? I'm thinking of you guys –
Pattern [\u4e00-\u9fa5\uf900-\ufa2d] match result: 2 mi ^ ^
Pattern [^\u4e00-\u9fa5\uf900-\ufa2d] match result: My dear only Mao, you know? I was thinking about you guys-

Pattern [chr(0xa1)-chr(0xff)] match result: 2 mi, ^-^
Pattern [^chr(0xa1)-chr(0xff)] match result: My dear only Mao, you know? I was thinking about you.
Pattern [\x80-\xff] match result: My dear only Mao you know? I was thinking about you.
Pattern [^\x80-\xff] match result: 2 mi, ^-^

Pattern [\x00-\xff] match result: My dear 2 Maomi, you know? I was thinking about you.
Pattern [^\x00-\xff] match result:

Pattern [\x80-\xff][\x80-\xff] match result: My dear only Mao you know? I was thinking about you.
Pattern [^\x80-\xff][^\x80-\xff] match result: MI, ^-^


There is plenty of material online about the encoding ranges of GBK, GB2312, and Big5, but much less about the Japanese encodings. I have summarized them here and hope this helps when using regular expressions to identify these character sets, especially the various characters, punctuation marks, and special symbols of the Japanese character sets.

The code is as follows:

UTF-8
[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}

UTF-16
[\x00-\xd7][\xe0-\xff]|[\xd8-\xdf][\x00-\xff]{2}

JIS
[\x20-\x7e]|[\x21-\x5f]|[\x21-\x7e]{2}

SJIS
[\x20-\x7e]|[\xa1-\xdf]|([\x81-\x9f]|[\xe0-\xef])([\x40-\x7e]|[\x80-\xfc])

EUC_JP
[\x20-\x7e]|\x81[\xa1-\xdf]|[\xa1-\xfe][\xa1-\xfe]|\x8f[\xa1-\xfe]{2}

EUC_JP punctuation and special characters

[\xa1-\xa2][\xa0-\xfe]

EUC_JP full-width digits

\xa3[\xb0-\xb9]

EUC_JP full-width uppercase letters

\xa3[\xc1-\xda]

EUC_JP full-width lowercase letters

\xa3[\xe1-\xfa]

EUC_JP full-width hiragana

\xa4[\xa1-\xf3]

EUC_JP full-width katakana

\xa3[\xb0-\xb9]|\xa3[\xc1-\xda]|\xa5[\xa1-\xf6][\xa3][\xb0-\xfa]|[\xa1][\xbc-\xbe]|[\xa1][\xdd]

EUC_JP full-width Chinese characters (kanji)

[\xb0-\xcf][\xa0-\xd3]|[\xd0-\xf4][\xa0-\xfe]|[\xb0-\xf3][\xa1-\xfe]|[\xf4][\xa1-\xa6]|[\xa4][\xa1-\xf3]|[\xa5][\xa1-\xf6]|[\xa1][\xbc-\xbe]

Big5

[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])

GBK

[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]

GB2312 Chinese characters

[\xb0-\xf7][\xa0-\xfe]

GB2312 punctuation marks and special symbols

\xa1[\xa2-\xfe]

GB2312 Roman numerals and item numbering

\xa2([\xa1-\xaa]|[\xb1-\xbf]|[\xc0-\xdf]|[\xe0-\xe2]|[\xe5-\xee]|[\xf1-\xfc])

GB2312 full-width punctuation and full-width letters

\xa3[\xa1-\xfe]

GB2312 Japanese hiragana

\xa4[\xa1-\xf3]

GB2312 Japanese katakana

\xa5[\xa1-\xf6]

Mesenchymal

GB18030
[\x00-\x7f]|[\x81-\xfe][\x40-\xfe]|[\x81-\xfe][\x30-\x39][\x81-\xfe][\x30-\x39]

Japanese half-width space

\x20

SJIS full-width space

(?:\x81\x81)

SJIS full-width digits

(?:\x82[\x4f-\x58])

SJIS full-width uppercase letters

(?:\x82[\x60-\x79])

SJIS full-width lowercase letters

(?:\x82[\x81-\x9a])

SJIS full-width hiragana

(?:\x82[\x9f-\xf1])

SJIS full-width hiragana, extended

(?:\x82[\x9f-\xf1]|\x81[\x4a\x4b\x54\x55])

SJIS full-width katakana

(?:\x83[\x40-\x96])

SJIS full-width katakana, extended

(?:\x83[\x40-\x96]|\x81[\x45\x5b\x52\x53])

EUC_JP full-width space

(?:\xa1\xa1)

EUC half-width katakana

(?:\x8e[\xa6-\xdf])
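As a rough illustration of how the SJIS ranges above might be used from Python 3, the sketch below classifies Shift_JIS encoded bytes into hiragana, katakana, and everything else. The function classify_sjis and the sample text are hypothetical. The data is split into whole characters with the general SJIS pattern first, because SJIS trail bytes can fall in the ASCII range (for example, a katakana trail byte of 0x4a is also the letter J).

```python
import re

# Patterns taken from the list above, with the groups made non-capturing.
SJIS_CHAR = re.compile(
    rb'[\x20-\x7e]|[\xa1-\xdf]|(?:[\x81-\x9f]|[\xe0-\xef])(?:[\x40-\x7e]|[\x80-\xfc])'
)
SJIS_HIRAGANA = re.compile(rb'\x82[\x9f-\xf1]')
SJIS_KATAKANA = re.compile(rb'\x83[\x40-\x96]')

def classify_sjis(data: bytes) -> dict:
    """Count hiragana, katakana, and other characters in Shift_JIS bytes."""
    counts = {"hiragana": 0, "katakana": 0, "other": 0}
    for ch in SJIS_CHAR.findall(data):
        if SJIS_HIRAGANA.fullmatch(ch):
            counts["hiragana"] += 1
        elif SJIS_KATAKANA.fullmatch(ch):
            counts["katakana"] += 1
        else:
            counts["other"] += 1
    return counts

# Hypothetical sample: five hiragana, four katakana, a space, three ASCII letters.
counts = classify_sjis("ひらがなとカタカナ abc".encode("shift_jis"))
print(counts)
```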
