How to match Chinese characters in Python

Last Update:2017-01-13 Source: Internet

Author: User

Tags ord regular expression strlen in python


First, make sure everything is Unicode

For example, s.decode('utf8')  # when the text comes from UTF-8 bytes

u"啊l"  # for a literal typed at the console

(Aside: I had wanted to match against the hex bytes of a specific encoding, but frustratingly each character seems to occupy two positions, so the regular expression found nothing.)

Second, determine the Chinese range: [\u4e00-\u9fa5]

(Note how the Python re pattern is written.) Use u"[\u4e00-\u9fa5]"  # make sure the regular expression is Unicode as well

Demo (Python 2):

>>> print re.match(ur"[\u4e00-\u9fa5]+", "啊")
None
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"啊")
<_sre.SRE_Match object at 0x2a98981308>
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"t")
None
>>> print tt
现在才明白
>>> tt
'\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd'
>>> print re.match(r"[\u4e00-\u9fa5]", tt.decode('utf8'))
None
>>> print re.match(ur"[\u4e00-\u9fa5]", tt.decode('utf8'))
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,匹配到了")
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,no no")
None
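The session above uses Python 2 syntax (the ur"" prefix and the print statement). As a minimal sketch of the same idea in Python 3, where str is already Unicode, something like the following should behave the same way. The helper names starts_with_chinese and find_chinese_runs are made up for illustration and do not come from the original article.

```python
import re

# Range used in the article: U+4E00 to U+9FA5.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def starts_with_chinese(text):
    """Return True if text begins with at least one Chinese character."""
    return CHINESE_RUN.match(text) is not None


def find_chinese_runs(text):
    """Return every run of consecutive Chinese characters found in text."""
    return CHINESE_RUN.findall(text)


# The same UTF-8 bytes as the tt value in the demo.
raw = b"\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd"
decoded = raw.decode("utf8")  # bytes must be decoded before matching

print(starts_with_chinese(decoded))  # True
print(starts_with_chinese("t"))      # False
print(find_chinese_runs("hi," + decoded))
```

In Python 3 the pattern and the subject are both str, so the only step that remains from the article's advice is decoding raw bytes before matching.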

Other ranges

Here are the main non-English character ranges (found on Google):

2E80~33FFh: CJK symbols area. Contains Kangxi dictionary radicals, CJK supplementary radicals, phonetic symbols (Bopomofo), Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized numerals and months, Japanese kana combinations, units, era names, months, days, hours and so on.

3400~4DFFh: CJK Unified Ideographs Extension A, a total of 6,582 CJK Han characters.

4E00~9FFFh: CJK Unified Ideographs, a total of 20,902 CJK Han characters.

A000~A4FFh: Yi script area, containing the syllables and radicals of the Yi script of southern China.

AC00~D7FFh: Hangul syllables area, containing the syllables composed from Korean jamo.

F900~FAFFh: CJK Compatibility Ideographs, a total of 302 CJK Han characters.

FB00~FFFDh: Presentation forms area, containing Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, halfwidth forms, fullwidth forms and so on.

For example, to match all CJK Han characters, the regular expression should be ^[\u3400-\u9FFF]+$

In theory that is correct, but when I copied some Korean text from msn.co.ko, it did not match, which was odd.

I then copied some Japanese text from msn.co.jp, and that did not match either.

After widening the range to ^[\u2E80-\u9FFF]+$, everything passed. This should be the regular expression for matching CJK characters, including Traditional Chinese.

The regular expression for Chinese alone should be ^[\u4E00-\u9FFF]+$, which is very close to the ^[\u4E00-\u9FA5]+$ often quoted on forums.

Note that forums describe ^[\u4E00-\u9FA5]+$ as a regular expression specifically for Simplified Chinese. In fact Traditional characters fall inside it too: I tested "中華人民共和國" and it also passed. ^[\u4E00-\u9FFF]+$ gives the same result, of course.
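The comparison between these ranges can be checked directly. The following is a small sketch in Python 3. The sample strings and the variable names are chosen here for illustration and are not the strings the author originally tested.

```python
import re

HAN_FORUM = re.compile(r"^[\u4e00-\u9fa5]+$")  # range often quoted on forums
HAN_BLOCK = re.compile(r"^[\u4e00-\u9fff]+$")  # the whole 4E00-9FFF area
CJK_WIDE = re.compile(r"^[\u2e80-\u9fff]+$")   # widened range from the text

# Illustrative samples (assumptions, not taken from the article).
samples = {
    "simplified": "中华人民共和国",
    "traditional": "中華人民共和國",
    "kana": "ひらがな",
    "hangul": "한국어",
}

for name, text in samples.items():
    print(
        name,
        bool(HAN_FORUM.match(text)),
        bool(HAN_BLOCK.match(text)),
        bool(CJK_WIDE.match(text)),
    )
```

Simplified and Traditional samples pass all three patterns, and kana passes only the widened one. Hangul syllables start at U+AC00, outside all three ranges, so text made of them needs the AC00-D7A3 range as well.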

Using regular expressions in Python to match Chinese: question and answer

I want to use a regular expression in Python to match Chinese, using [\u4e00-\u9fa5]. The result is wrong, though: the expression matches not only Chinese but also English characters.
The same test works fine in other languages but not in Python. What is going on? Is it an encoding problem?

Encoding problems are fairly complex. You have to consider the encoding of the data source itself, and different operating systems and settings lead to different results.
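One plausible cause of the symptom in the question, offered here as an assumption rather than a confirmed diagnosis, is that the \u escapes never reach the regex engine as code points. The character class then consists of ordinary ASCII characters plus the range "0" through "u", which covers most English letters. A small Python 3 sketch of the difference:

```python
import re

# Hypothetical broken pattern: without working \u escapes the class holds
# the literal characters u, 4, e, 0, 9, f, a, 5 and the ASCII range "0"-"u".
broken = re.compile("[u4e00-u9fa5]")

# Intended pattern: a range between two real Unicode code points.
intended = re.compile("[\u4e00-\u9fa5]")

for ch in ["a", "Z", "7", "z", "中"]:
    print(repr(ch), bool(broken.match(ch)), bool(intended.match(ch)))
```

The broken class accepts "a", "Z" and "7" and rejects "中", which is the reverse of what was wanted.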

I. Encoding ranges

1. GBK (GB2312/GB18030)
\x00-\xff GBK double-byte encoding range
\x20-\x7f ASCII
\xa1-\xff Chinese
\x80-\xff Chinese

2. UTF-8 (Unicode)
\u4e00-\u9fa5 (Chinese)
\u3130-\u318f (Korean)
\uac00-\ud7a3 (Korean)
\u0800-\u4e00 (Japanese)
PS: Korean characters are those greater than [\u9fa5]

Regular expression examples (PHP):
preg_replace("/([\x80-\xff])/", "", $str);
preg_replace("/([\x{4e00}-\x{9fa5}])/u", "", $str);
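To see why the byte ranges differ between the two encodings, it can help to look at the raw bytes of a single character. This is a short Python 3 sketch. The helper name byte_view is invented for this example.

```python
def byte_view(text, encoding):
    """Return the encoded bytes of text as a list of two-digit hex strings."""
    return ["%02x" % b for b in text.encode(encoding)]


print(byte_view("中", "gbk"))    # ['d6', 'd0']: two bytes, both >= 0x80
print(byte_view("中", "utf-8"))  # ['e4', 'b8', 'ad']: three bytes
print(byte_view("a", "gbk"))     # ['61']: a single ASCII byte
```

A byte-level pattern such as [\x80-\xff] therefore matches pieces of a character, not whole characters, which is why the GBK helpers work byte by byte while the \u ranges apply only after decoding to Unicode.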

II. Code examples

Determine whether the content contains Chinese - GBK (PHP)
function check_is_chinese($s) {
    // A byte in 0x80-0xff followed by any byte indicates a double-byte character
    return preg_match('/[\x80-\xff]./', $s);
}

Get string length - GBK (PHP)
function gb_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $s = substr($str, $i, 1);
        // A high byte starts a double-byte character, so skip its second byte
        if (preg_match("/[\x80-\xff]/", $s)) ++$i;
        ++$count;
    }
    return $count;
}

Truncate a string - GBK (PHP)
function gb_substr($str, $len) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        if ($count == $len) break;
        // A high byte starts a double-byte character, so skip its second byte
        if (preg_match("/[\x80-\xff]/", substr($str, $i, 1))) ++$i;
        ++$count;
    }
    return substr($str, 0, $i);
}
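For comparison, the same two GBK operations are much shorter in Python 3, because decoding turns the bytes into a sequence of characters. This is only a sketch under the assumption that the input is valid GBK. The function names mirror the PHP ones for readability and are not part of any library.

```python
def gb_strlen(data):
    """Return the number of characters in GBK-encoded bytes."""
    return len(data.decode("gbk"))


def gb_substr(data, length):
    """Return the first `length` characters of GBK-encoded bytes, re-encoded as GBK."""
    return data.decode("gbk")[:length].encode("gbk")


encoded = "中文ab".encode("gbk")
print(len(encoded))        # 6 bytes
print(gb_strlen(encoded))  # 4 characters
print(gb_substr(encoded, 3).decode("gbk"))
```

Invalid input raises UnicodeDecodeError instead of being silently miscounted, which differs from the byte-skipping PHP versions.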

Count string length - UTF-8 (PHP)
function utf8_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // Extra increment: a multi-byte character adds 2 to the total
            $count++;
            // The lead byte tells how many continuation bytes to skip
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
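The lead-byte technique used above can be sketched in Python 3 as follows. Unlike the PHP function, this version counts every character once, whatever its byte length. The name utf8_char_count is made up for this example, and continuation bytes are not validated.

```python
def utf8_char_count(data):
    """Count characters in UTF-8 bytes by inspecting each lead byte."""
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            step = 1  # ASCII
        elif 0xC0 <= lead <= 0xDF:
            step = 2  # two-byte sequence
        elif 0xE0 <= lead <= 0xEF:
            step = 3  # three-byte sequence (most Chinese characters)
        elif 0xF0 <= lead <= 0xF7:
            step = 4  # four-byte sequence
        else:
            raise ValueError("not a UTF-8 compatible string")
        i += step
        count += 1
    return count


sample = "hi,中文".encode("utf-8")
print(len(sample), utf8_char_count(sample))  # 9 5
```

In everyday Python code, len(data.decode("utf-8")) gives the same count. The explicit loop is shown only to mirror the byte arithmetic.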

Extract a substring - UTF-8 (PHP)
function utf8_substr($str, $position, $length) {
    $start_position = strlen($str);
    $start_byte = 0;
    $end_position = strlen($str);
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Record the byte offset where the requested position begins
        if ($count >= $position && $start_position > $i) {
            $start_position = $i;
            $start_byte = $count;
        }
        // Stop once the requested length has been covered
        if (($count - $start_byte) >= $length) {
            $end_position = $i;
            break;
        }
        $value = ord($str[$i]);
        if ($value > 127) {
            // Extra increment: a multi-byte character adds 2 to the count
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return substr($str, $start_position, $end_position - $start_position);
}
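A rough Python 3 counterpart is shown below as a sketch. It measures position and length in characters, whereas the PHP version above counts each multi-byte character as 2, so the two are not drop-in equivalents. The name utf8_substr_chars is hypothetical.

```python
def utf8_substr_chars(data, position, length):
    """Return `length` characters starting at character index `position`.

    Input and output are UTF-8 bytes. Slicing is done on the decoded text,
    so a multi-byte character is never cut in half.
    """
    text = data.decode("utf-8")
    return text[position:position + length].encode("utf-8")


source = "ab中文cd".encode("utf-8")
print(utf8_substr_chars(source, 2, 2).decode("utf-8"))  # 中文
```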

String length statistics - UTF-8 [Chinese and Korean take 3 bytes, Russian takes 2 bytes, letters take 1 byte] (Ruby)
require 'cgi'

def utf8_string_length(str)
  temp = CGI.unescape(str)
  i = 0
  temp.each_byte do |b|
    if b < 0x80
      i += 1   # ASCII letter counts as 1
    elsif b >= 0xC0
      i += 2   # lead byte of a multi-byte character counts as 2
    end        # continuation bytes (0x80-0xBF) add nothing
  end
  return i
end
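The same weighting (1 for an ASCII character, 2 for anything else) can be sketched in Python 3 on decoded text. This assumes the input is already a str and skips the URL-unescaping step of the Ruby version. The name display_length is made up for illustration.

```python
def display_length(text):
    """Count ASCII characters as 1 and every other character as 2."""
    return sum(1 if ord(ch) < 0x80 else 2 for ch in text)


print(display_length("ab中文"))   # 6
print(display_length("привет"))  # 12
```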

Determine whether the string contains Korean - UTF-8 (JavaScript)
function checkKoreaChar(str) {
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // Hangul compatibility jamo or Hangul syllables
        if ((code > 0x3130 && code < 0x318f) || (code >= 0xac00 && code <= 0xd7a3)) {
            return true;
        }
    }
    return false;
}
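An equivalent check in Python 3 could use a single character class. This sketch keeps the same bounds as the JavaScript function (exclusive 0x3130 and 0x318F, inclusive 0xAC00 to 0xD7A3). The name contains_korean is invented for the example.

```python
import re

# U+3131-U+318E (compatibility jamo) and U+AC00-U+D7A3 (Hangul syllables).
KOREAN_CHAR = re.compile(r"[\u3131-\u318e\uac00-\ud7a3]")


def contains_korean(text):
    """Return True if text contains at least one Korean character."""
    return KOREAN_CHAR.search(text) is not None


print(contains_korean("hello 한국어"))  # True
print(contains_korean("中文 only"))     # False
```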

Determine whether the string contains Chinese characters - GBK (JavaScript)
function check_chinese_char(s) {
    // Every character outside \x00-\xff becomes two characters, so the length changes
    return (s.length != s.replace(/[^\x00-\xff]/g, "**").length);
}
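The JavaScript test above really asks whether any character lies outside the \x00-\xff range. A minimal Python 3 sketch of that same test follows. Like the JavaScript version, it flags any non-Latin-1 character, not only Chinese ones. The name has_wide_char is hypothetical.

```python
def has_wide_char(text):
    """Return True if any character has a code point above 0xFF."""
    return any(ord(ch) > 0xFF for ch in text)


print(has_wide_char("abc"))     # False
print(has_wide_char("abc中"))   # True
print(has_wide_char("café"))    # False: é is U+00E9, inside \x00-\xff
```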

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
