Python Chinese Regular Expression notes

Last Update:2017-01-18 Source: Internet

Author: User

Tags: character classes

From a string-processing perspective, Chinese text is not as neat and uniform as English, and that is an unavoidable reality. Drawing on material found online and on personal experience, this article gives a short summary, using Python as the example language. Additions and corrections are welcome.
A few lessons learned
You can use the repr() function to view the raw representation of a string. This is useful when writing regular expressions.
Python's re module has two similar functions: re.match() and re.search(). Both match in exactly the same way; only the starting point differs. match() tries only at the beginning of the string and gives up if that fails, whereas search() perseveres, trying every possible position in the string until it either finds a match or reaches the end without one. If you understand how match() behaves (it can be faster in some cases), feel free to use it; if you are not sure, search() is usually the function you need.
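The difference between the two functions can be seen in a small sketch. It is written for Python 3, which is an assumption on our part (these notes themselves use Python 2), and the sample string and variable names are made up for illustration.

```python
import re

# Assumption: Python 3, where every str is already a Unicode string.
text = "abc 中文 def"            # hypothetical sample string
pattern = "[\u4e00-\u9fa5]+"     # a run of common Chinese characters

# match() only tries at position 0, so it fails: the text starts with "abc".
at_start = re.match(pattern, text)

# search() scans forward and reports the first run, wherever it occurs.
anywhere = re.search(pattern, text)

print(at_start)           # None
print(anywhere.group())   # 中文
print(anywhere.span())    # (4, 6)
```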
To find every match in a block of text and get the results back as a list, use the findall() function. The example program further down shows it in use.
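As a minimal illustration of findall(), assuming Python 3 and a made-up sample string, the call returns a list with one entry per non-overlapping match, or an empty list when nothing matches:

```python
import re

text = "Python 正则 notes, 中文 text"   # hypothetical mixed-language sample

# Each run of Chinese characters becomes one list element.
parts = re.findall("[\u4e00-\u9fa5]+", text)
print(parts)      # ['正则', '中文']

# No match yields an empty list rather than None.
no_hits = re.findall("[\u4e00-\u9fa5]+", "plain ASCII")
print(no_hits)    # []
```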
In UTF-8, each commonly used Chinese character occupies 3 bytes, so the usual pattern is [\x80-\xff]{3}. Note that this pattern matches any three consecutive non-ASCII bytes, not only Chinese characters.
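A short sketch of the byte-level view, assuming Python 3 (where UTF-8 data is a bytes object and the pattern must be a bytes pattern as well). It also shows repr() exposing the raw bytes; the sample text is hypothetical.

```python
import re

# Assumption: Python 3. Encoding a str gives the UTF-8 bytes a Python 2
# script would have seen in a plain (non-unicode) string.
raw = "正则 ok".encode("utf-8")
print(repr(raw))   # b'\xe6\xad\xa3\xe5\x88\x99 ok'

# Three non-ASCII bytes per chunk; the pattern is a bytes literal (rb"...").
chunks = re.findall(rb"[\x80-\xff]{3}", raw)

# Each chunk decodes back to one character for this sample.
decoded = [chunk.decode("utf-8") for chunk in chunks]
print(decoded)     # ['正', '则']
```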
In Unicode, Chinese characters are written in the form \uXXXX. Once you look up the range of the relevant character set, you can match strings in that script, which makes it easy to pick out the text of one language from multilingual text. However, for an agglutinative language such as Japanese, which uses Chinese characters alongside hiragana and katakana, the results may be skewed.
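The Japanese caveat is easy to reproduce. In this sketch (Python 3 assumed, sample text invented), the Chinese range also picks up the kanji inside the Japanese phrase, so a range match alone cannot tell the two languages apart:

```python
import re

CHINESE = re.compile("[\u4e00-\u9fa5]+")   # range used throughout these notes

mixed = "zh: 中文文本 / jp: 日本語のテキスト"   # hypothetical sample
hits = CHINESE.findall(mixed)

# The kanji 日本語 fall in the same range as Chinese characters;
# the hiragana の and the katakana テキスト do not.
print(hits)   # ['中文文本', '日本語']
```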
Character classes can be combined. For example, to match hiragana, katakana, and Chinese characters together, use u"[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+", adjusting the ranges to the text you need to match.
When matching Chinese, the regular expression and the target string must be in the same format. This is vital. If both are in the default UTF-8, you do not need to do anything else; if the target is Unicode, write the regular expression as a u"" literal too.
You can define a Unicode string like this: string = u"I love regular expressions." If a string is not Unicode, you can convert it with the unicode() function. If you know the encoding of the source string, use newstr = unicode(oldstring, original_coding_name). For example, unicode(string, "utf8") is common on Linux; on Windows, cp936 may work, but this is untested.
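The rule that pattern and target must share a format carries over to Python 3, where it is enforced by the type system. The sketch below assumes Python 3, uses bytes.decode() as the counterpart of Python 2's unicode(), and works on an invented sample string.

```python
import re

# Assumption: Python 3. raw stands in for bytes read from a UTF-8 file.
raw = "かなカナ漢字 abc".encode("utf-8")
text = raw.decode("utf-8")   # Python 3 counterpart of unicode(raw, "utf8")

# Combined class: Chinese characters, hiragana, and katakana.
cjk = re.compile("[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+")
print(cjk.findall(text))     # ['かなカナ漢字']

# A str pattern applied to bytes is rejected outright.
try:
    cjk.findall(raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)              # False
```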
Example Program


#!/usr/bin/python
# -*- coding: utf-8 -*-
#
#author: Rex
#blog: http://iregex.org
#filename py_utf8_unicode.py
#created: 2010-06-27 09:11
# Note: this is a Python 2 script (print statement, unicode()).
import re

def findpart(regex, text, name):
    # Print every match of regex in text, labelled with name.
    res = re.findall(regex, text)
    if res:
        print "There are %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r
        print

#sample is UTF8 by default.
sample = '''en: Regular expression is a powerful tool for manipulating text.
zh: 正则表达式是一种很有用的处理文本的工具。
jp: 正規表現は非常に役に立つツールテキストを操作することです。
jp-char: あアいイうウえエおオ
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃
'''
#let's look at its raw representation under the hood:
print "The raw UTF8 string is:\n", repr(sample)
print
#find the non-ASCII chars:
findpart(r"[\x80-\xff]+", sample, "non-ASCII")
#convert the UTF8 to Unicode
usample = unicode(sample, 'utf8')
#let's look at its raw representation under the hood:
print "The raw Unicode string is:\n", repr(usample)
print
#get each language's parts:
findpart(u"[\u4e00-\u9fa5]+", usample, "Unicode Chinese")
findpart(u"[\uac00-\ud7ff]+", usample, "Unicode Korean")
findpart(u"[\u30a0-\u30ff]+", usample, "Unicode Japanese Katakana")
findpart(u"[\u3040-\u309f]+", usample, "Unicode Japanese Hiragana")
findpart(u"[\u3000-\u303f\ufb00-\ufffd]+", usample, "Unicode CJK Punctuation")

The output is:


The raw UTF8 string is:
'en: Regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr:\xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'
There are 14 non-ASCII parts:
正则表达式是一种很有用的处理文本的工具。
正規表現は非常に役に立つツールテキストを操作することです。
あアいイうウえエおオ
정규
표현식은
매우
유용한
도구
텍스트를
조작하는
것입니다
。？！、，；：“
”‘
’——……·－·《》〈〉！￥％＆＊＃
The raw Unicode string is:
u'en: Regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr:\uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'
There are 6 Unicode Chinese parts:
正则表达式是一种很有用的处理文本的工具
正規表現
非常
役
立
操作
There are 8 Unicode Korean parts:
정규
표현식은
매우
유용한
도구
텍스트를
조작하는
것입니다
There are 6 Unicode Japanese Katakana parts:
ツールテキスト
ア
イ
ウ
エ
オ
There are 11 Unicode Japanese Hiragana parts:
は
に
に
つ
を
することです
あ
い
う
え
お
There are 5 Unicode CJK Punctuation parts:
。
。
。？！、，；：
－
《》〈〉！￥％＆＊＃
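
The example program above targets Python 2 (print statements, unicode()). A rough Python 3 adaptation might look like the sketch below. The shortened sample string, the find_parts name, and the RANGES table are our own illustrative choices, not part of the original script.

```python
import re

def find_parts(regex, text):
    """Return every non-overlapping match of regex in text as a list."""
    return re.findall(regex, text)

# Assumption: Python 3, so this literal is already a Unicode str.
sample = "zh: 正则表达式。\njp: 正規表現は役に立つツール。\nkr: 정규 표현식"

# Same code point ranges as the original program (hypothetical table name).
RANGES = {
    "chinese": "[\u4e00-\u9fa5]+",
    "korean": "[\uac00-\ud7ff]+",
    "katakana": "[\u30a0-\u30ff]+",
    "hiragana": "[\u3040-\u309f]+",
}

results = {name: find_parts(rx, sample) for name, rx in RANGES.items()}
for name, parts in results.items():
    print("There are %d %s parts: %s" % (len(parts), name, parts))

# The byte-level search needs bytes on both sides: encode the text and
# use a bytes pattern.
non_ascii = find_parts(rb"[\x80-\xff]+", sample.encode("utf-8"))
print(len(non_ascii), "non-ASCII runs")
```

Keeping the ranges in a dictionary makes it straightforward to add another script without touching the matching code.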

This article is an English version of an article originally published in Chinese on aliyun.com.
