From a string-processing perspective, Chinese text is not as neat and uniform as English; that is simply a fact of life. Drawing on online materials and personal experience, this article uses Python as the example language. Additions and corrections are welcome.
Experience
You can use the repr() function to view the raw, escaped form of a string. This is helpful when writing regular expressions.
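As a rough illustration of this tip, the sketch below uses Python 3 syntax (an assumption made so the snippet runs on current interpreters; the example program later in this article is Python 2). The sample text is arbitrary. In Python 3, ascii() gives the \u-escaped view that Python 2's repr() gives for a Unicode string.

```python
# Python 3 sketch: inspect what a regular expression would actually see.
text = "正则"  # arbitrary sample text

# repr() of the UTF-8 bytes shows one \x escape per byte.
utf8_view = repr(text.encode("utf-8"))
print(utf8_view)     # b'\xe6\xad\xa3\xe5\x88\x99'

# ascii() shows one \u escape per character.
unicode_view = ascii(text)
print(unicode_view)  # '\u6b63\u5219'
```

Each of the two characters becomes three bytes in the first view and a single \u escape in the second.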
Python's re module has two similar functions: re.match() and re.search(). They match in the same way but start differently. match() only tries to match at the beginning of the string and gives up if that fails. search() tries every possible position in the string until it finds a match, or fails when it reaches the end of the string. If you understand how match() behaves (in some cases it is faster), use it freely. If you are unsure, search() is usually the function you need.
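The difference can be seen in a short sketch. The sample string and pattern are made up for illustration.

```python
import re

text = "price: 42 yuan"   # hypothetical sample
pattern = r"\d+"

# match() only tries at position 0, where "p" is not a digit.
m = re.match(pattern, text)
print(m)           # None

# search() scans forward until it finds the first match.
s = re.search(pattern, text)
print(s.group())   # 42
```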
To find all matches in a body of text and return them as a list, use the findall() function. See the example program below for details.
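A minimal sketch of findall() on a made-up string, showing that the result is a plain list of the matched substrings:

```python
import re

text = "a1, b22, c333"   # hypothetical sample

# findall() returns every non-overlapping match, in order, as a list.
numbers = re.findall(r"\d+", text)
print(numbers)   # ['1', '22', '333']
```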
In UTF-8, each commonly used Chinese character occupies three bytes, so the regular expression for one character is [\x80-\xff]{3}.
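A minimal sketch of the byte-level pattern, assuming Python 3, where UTF-8 data is a bytes object and the pattern must be a bytes pattern as well. The sample text is made up. The pattern only counts bytes, so treat it as a heuristic: it assumes every non-ASCII character in the data is a three-byte sequence.

```python
import re

# Hypothetical sample, encoded to UTF-8 bytes.
data = "abc 中文 xyz".encode("utf-8")

# Each run of exactly three high bytes is taken as one character.
three_byte_chunks = re.findall(rb"[\x80-\xff]{3}", data)

# Decode each chunk back to text to check what was captured.
decoded = [chunk.decode("utf-8") for chunk in three_byte_chunks]
print(decoded)   # ['中', '文']
```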
In Unicode, a Chinese character is written as \uXXXX. Once you know the range of the relevant character set, you can match the corresponding strings, which makes it easy to pick out text in one language from multilingual text. However, an agglutinative language such as Japanese mixes Chinese characters (kanji) with hiragana and katakana, so the results may be skewed.
Several ranges can be combined in one character class to customize the text to be matched. For example, u"[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+" matches Chinese characters, hiragana, and katakana together.
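The combined class can be tried out as follows. This is a sketch in Python 3 syntax (an assumption; there, ordinary string literals are already Unicode, so no u prefix is needed), and the sample sentence is made up.

```python
import re

# Ranges from the text above: Chinese characters, hiragana, katakana.
cjk_or_kana = re.compile("[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+")

text = "regex は便利なツール tool"   # hypothetical mixed-script sample
runs = cjk_or_kana.findall(text)
print(runs)   # ['は便利なツール']
```

Because all three ranges sit in one class, the hiragana, kanji, and katakana come back as a single run instead of being split at each script change.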
When matching Chinese characters, the regular expression and the target string must be of the same string type and encoding. This is crucial. Either use the default UTF-8 byte strings for both, in which case nothing else is needed, or use Unicode for both, in which case the regular expression must be written as a u"" literal.
A Unicode string can be defined like this: string = u"I love regular expressions". If a string is not Unicode, you can convert it with the unicode() function. If you know the encoding of the source string, use newstr = unicode(oldstring, original_coding_name). For example, unicode(string, "utf8") is common on Linux; on Windows, cp936 may be needed, but this has not been tested.
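The same rule can be sketched in Python 3 (an assumption; there the conversion unicode(s, enc) is spelled s.decode(enc), and mixing a text pattern with byte data raises an error instead of silently failing). The sample text is made up.

```python
import re

# Hypothetical raw input, e.g. bytes read from a file or socket.
raw = "我爱正则表达式 regex".encode("utf-8")

# Decode with the known source encoding before using a Unicode pattern.
text = raw.decode("utf-8")
chinese = re.findall("[\u4e00-\u9fa5]+", text)
print(chinese)   # ['我爱正则表达式']

# Pattern and target of different types: Python 3 refuses to match.
mixed_error = None
try:
    re.findall("[\u4e00-\u9fa5]+", raw)
except TypeError as exc:
    mixed_error = type(exc).__name__
print(mixed_error)   # TypeError
```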
Example Program
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Author: Rex
# Blog: http://iregex.org
# Filename: py_utf8_unicode.py
# Created:
# Note: Python 2 syntax (print statement, unicode()).
import re
def findpart(regex, text, name):
    # Print every match of regex in text, labeled with name.
    res = re.findall(regex, text)
    if res:
        print "there are %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r
        print
# sample is utf8 by default.
sample = '''en: regular expression is a powerful tool for manipulating text.
zh: 正则表达式是一种很有用的处理文本的工具。
jp: 正規表現は非常に役に立つツールテキストを操作することです。
jp-char: あアいイうウえエおオ
kr: 정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
puc: 。？！、，；：“ ”‘ ’——……·－·《》〈〉！￥％＆＊＃
'''
# Let's look at its raw representation under the hood:
print "the raw utf8 string is:\n", repr(sample)
print
# Find the non-ASCII chars:
findpart(r"[\x80-\xff]+", sample, "non-ASCII")
# Convert the utf8 to Unicode
usample = unicode(sample, 'utf8')
# Let's look at its raw representation under the hood:
print "the raw Unicode string is:\n", repr(usample)
print
# Get each language's parts:
findpart(u"[\u4e00-\u9fa5]+", usample, "Unicode Chinese")
findpart(u"[\uac00-\ud7ff]+", usample, "Unicode Korean")
findpart(u"[\u30a0-\u30ff]+", usample, "Unicode Japanese Katakana")
findpart(u"[\u3040-\u309f]+", usample, "Unicode Japanese Hiragana")
findpart(u"[\u3000-\u303f\ufb00-\ufffd]+", usample, "Unicode CJK punctuation")
The output is:
the raw utf8 string is:
'en: regular expression is a powerful tool for manipulating text.\nzh: \xe6\xad\xa3\xe5\x88\x99\xe8\xa1\xa8\xe8\xbe\xbe\xe5\xbc\x8f\xe6\x98\xaf\xe4\xb8\x80\xe7\xa7\x8d\xe5\xbe\x88\xe6\x9c\x89\xe7\x94\xa8\xe7\x9a\x84\xe5\xa4\x84\xe7\x90\x86\xe6\x96\x87\xe6\x9c\xac\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe3\x80\x82\njp: \xe6\xad\xa3\xe8\xa6\x8f\xe8\xa1\xa8\xe7\x8f\xbe\xe3\x81\xaf\xe9\x9d\x9e\xe5\xb8\xb8\xe3\x81\xab\xe5\xbd\xb9\xe3\x81\xab\xe7\xab\x8b\xe3\x81\xa4\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88\xe3\x82\x92\xe6\x93\x8d\xe4\xbd\x9c\xe3\x81\x99\xe3\x82\x8b\xe3\x81\x93\xe3\x81\xa8\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\njp-char: \xe3\x81\x82\xe3\x82\xa2\xe3\x81\x84\xe3\x82\xa4\xe3\x81\x86\xe3\x82\xa6\xe3\x81\x88\xe3\x82\xa8\xe3\x81\x8a\xe3\x82\xaa\nkr: \xec\xa0\x95\xea\xb7\x9c \xed\x91\x9c\xed\x98\x84\xec\x8b\x9d\xec\x9d\x80 \xeb\xa7\xa4\xec\x9a\xb0 \xec\x9c\xa0\xec\x9a\xa9\xed\x95\x9c \xeb\x8f\x84\xea\xb5\xac \xed\x85\x8d\xec\x8a\xa4\xed\x8a\xb8\xeb\xa5\xbc \xec\xa1\xb0\xec\x9e\x91\xed\x95\x98\xeb\x8a\x94 \xea\xb2\x83\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4.\npuc: \xe3\x80\x82\xef\xbc\x9f\xef\xbc\x81\xe3\x80\x81\xef\xbc\x8c\xef\xbc\x9b\xef\xbc\x9a\xe2\x80\x9c \xe2\x80\x9d\xe2\x80\x98 \xe2\x80\x99\xe2\x80\x94\xe2\x80\x94\xe2\x80\xa6\xe2\x80\xa6\xc2\xb7\xef\xbc\x8d\xc2\xb7\xe3\x80\x8a\xe3\x80\x8b\xe3\x80\x88\xe3\x80\x89\xef\xbc\x81\xef\xbf\xa5\xef\xbc\x85\xef\xbc\x86\xef\xbc\x8a\xef\xbc\x83\n'
there are 14 non-ASCII parts:
正则表达式是一种很有用的处理文本的工具。
正規表現は非常に役に立つツールテキストを操作することです。
あアいイうウえエおオ
정규
표현식은
매우
유용한
도구
텍스트를
조작하는
것입니다
。？！、，；：“
”‘
’——……·－·《》〈〉！￥％＆＊＃
the raw Unicode string is:
u'en: regular expression is a powerful tool for manipulating text.\nzh: \u6b63\u5219\u8868\u8fbe\u5f0f\u662f\u4e00\u79cd\u5f88\u6709\u7528\u7684\u5904\u7406\u6587\u672c\u7684\u5de5\u5177\u3002\njp: \u6b63\u898f\u8868\u73fe\u306f\u975e\u5e38\u306b\u5f79\u306b\u7acb\u3064\u30c4\u30fc\u30eb\u30c6\u30ad\u30b9\u30c8\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3067\u3059\u3002\njp-char: \u3042\u30a2\u3044\u30a4\u3046\u30a6\u3048\u30a8\u304a\u30aa\nkr: \uc815\uaddc \ud45c\ud604\uc2dd\uc740 \ub9e4\uc6b0 \uc720\uc6a9\ud55c \ub3c4\uad6c \ud14d\uc2a4\ud2b8\ub97c \uc870\uc791\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.\npuc: \u3002\uff1f\uff01\u3001\uff0c\uff1b\uff1a\u201c \u201d\u2018 \u2019\u2014\u2014\u2026\u2026\xb7\uff0d\xb7\u300a\u300b\u3008\u3009\uff01\uffe5\uff05\uff06\uff0a\uff03\n'
there are 6 Unicode Chinese parts:
正则表达式是一种很有用的处理文本的工具
正規表現
非常
役
立
操作
there are 8 Unicode Korean parts:
정규
표현식은
매우
유용한
도구
텍스트를
조작하는
것입니다
there are 6 Unicode Japanese Katakana parts:
ツールテキスト
ア
イ
ウ
エ
オ
there are 11 Unicode Japanese Hiragana parts:
は
に
に
つ
を
することです
あ
い
う
え
お
there are 5 Unicode CJK punctuation parts:
。
。
。？！、，；：
－
《》〈〉！￥％＆＊＃