How to match Chinese characters in Python

Source: Internet
Author: User
Tags: ord, regular expression, strlen, in python

First, make sure every string is Unicode

For example:

str.decode('utf8')  # when the text comes from UTF-8 bytes

u"啊l"  # when typing at the console

(An aside: I first tried to match against the hex byte values of a particular encoding, but frustratingly each character seemed to occupy two positions, and the regular expression matched nothing.)
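The "each character occupies several positions" effect comes from looking at encoded bytes rather than decoded characters. The following is a minimal Python 3 sketch of the same first step, assuming the input arrives as UTF-8 bytes; in Python 3, `str` is already Unicode, so only `bytes` need decoding. The variable names are illustrative.

```python
import re

# UTF-8 bytes for a five-character Chinese string, as they might arrive
# from a file or a socket (assumed input for this sketch).
raw = b"\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd"

# Decode to str (Unicode) before applying a Unicode pattern.
text = raw.decode("utf-8")

han = re.compile(r"[\u4e00-\u9fa5]+")

print(len(raw))    # length in bytes: three per character here
print(len(text))   # length in characters
print(han.fullmatch(text) is not None)
```

Matching against `raw` directly would compare byte values, not code points, which is why a byte-level pattern finds nothing useful.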

Second, determine the range for Chinese characters: [\u4e00-\u9fa5]

(Note that in Python the pattern itself must also be a Unicode string: u"[\u4e00-\u9fa5]")

Demo:



>>> import re
>>> print re.match(ur"[\u4e00-\u9fa5]+", "啊")
None
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"啊")
<_sre.SRE_Match object at 0x2a98981308>

>>> print re.match(ur"[\u4e00-\u9fa5]+", u"t")
None
>>> print tt
现在才明白
>>> tt
'\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd'
>>> print re.match(r"[\u4e00-\u9fa5]", tt.decode('utf8'))
None
>>> print re.match(ur"[\u4e00-\u9fa5]", tt.decode('utf8'))
<_sre.SRE_Match object at 0x2a955d9c60>

>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,匹配到了")
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,no no")
None
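The session above is Python 2. As a rough Python 3 equivalent (a sketch, not the author's code), the `u`/`ur` prefixes disappear because every `str` is Unicode, and mixing a `str` pattern with `bytes` raises an error instead of silently returning None. The helper name `leading_han` and the sample strings are illustrative.

```python
import re

# In Python 3 the re module itself understands \uXXXX in str patterns.
HAN = re.compile(r"[\u4e00-\u9fa5]+")
ANY_THEN_HAN = re.compile(r".*[\u4e00-\u9fa5]+")


def leading_han(text):
    """Return the run of Han characters at the start of text, or None."""
    m = HAN.match(text)
    return m.group(0) if m else None


print(leading_han("啊"))
print(leading_han("t"))
print(bool(ANY_THEN_HAN.match("hi,匹配到了")))
print(bool(ANY_THEN_HAN.match("hi,no no")))

# A str pattern cannot be applied to undecoded bytes in Python 3.
try:
    HAN.match("啊".encode("utf-8"))
except TypeError as exc:
    print("TypeError:", exc)
```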



Extending to other ranges

Here are the main non-English character ranges (found via Google):

2E80~33FFh: CJK symbols area. Contains Kangxi Dictionary radicals, CJK supplementary radicals, Bopomofo phonetic symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters and numbers, months, Japanese kana combinations, units, era names, and month, date, and hour symbols.

3400~4DFFh: CJK Unified Ideographs Extension A, 6,582 Han characters in total.

4E00~9FFFh: CJK Unified Ideographs, 20,902 Han characters in total.

A000~A4FFh: Yi area, containing the syllables and radicals of the Yi script of southern China.

AC00~D7FFh: Hangul syllables area, containing the syllables composed from Korean jamo.

F900~FAFFh: CJK Compatibility Ideographs, 302 Han characters in total.

FB00~FFFDh: presentation forms area, containing Latin ligatures, Hebrew and Arabic presentation forms, CJK vertical punctuation, small form variants, and half-width and full-width forms.

So to match all Chinese, Japanese, and Korean Han characters, the regular expression should be ^[\u3400-\u9fff]+$

In theory that is right, but when I casually copied some Korean text from msn.co.ko it did not match, which was odd.

I then copied some Japanese text from msn.co.jp, and that did not match either.

Extending the range to ^[\u2e80-\u9fff]+$ made everything pass, so this should be the regular expression for CJK text, traditional Chinese included.

For Chinese alone the expression should be ^[\u4e00-\u9fff]+$, which is very close to the ^[\u4e00-\u9fa5]+$ often quoted on forums.

Note that forums describe ^[\u4e00-\u9fa5]+$ as matching Simplified Chinese specifically, but traditional characters are in that range too: I tested the traditional string "中華人民共和國" and it passed as well. ^[\u4e00-\u9fff]+$ of course gives the same result.
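The four expressions discussed above can be compared side by side. This is a small Python 3 sketch; the dictionary names and sample strings are assumptions chosen for illustration, not the text the author tested. Note that Hangul syllables (AC00~D7A3) lie outside all four ranges, so a string made only of them matches none.

```python
import re

# The candidate ranges from the discussion above (labels are illustrative).
RANGES = {
    "3400-9FFF": re.compile(r"^[\u3400-\u9fff]+$"),
    "2E80-9FFF": re.compile(r"^[\u2e80-\u9fff]+$"),
    "4E00-9FFF": re.compile(r"^[\u4e00-\u9fff]+$"),
    "4E00-9FA5": re.compile(r"^[\u4e00-\u9fa5]+$"),
}

SAMPLES = {
    "simplified": "中华人民共和国",
    "traditional": "中華人民共和國",
    "hiragana": "ひらがな",
    "hangul": "한국어",
}


def matching_ranges(text):
    """Return the labels of every range whose pattern matches all of text."""
    return [name for name, pattern in RANGES.items() if pattern.match(text)]


for label, text in SAMPLES.items():
    print(label, matching_ranges(text))
```

Both Chinese samples match every range, while the kana sample matches only the range that starts at 2E80, which is consistent with the observation that widening the range made the Japanese text pass.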


Questions and answers: matching Chinese with regular expressions in Python

I want to match Chinese with a regular expression in Python using [\u4e00-\u9fa5], but the results are wrong: the expression matches English characters as well as Chinese.
The same expression works in other languages but not in Python. What is going on? Is it an encoding problem?

Encoding problems are fairly involved. You have to consider the encoding of the data source itself, and different operating systems and settings lead to different results.

I. Encoding ranges

1. GBK (GB2312/GB18030)
\x00-\xff GBK double-byte encoding range
\x20-\x7f ASCII
\xa1-\xff Chinese (GB2312)
\x80-\xff Chinese (GBK)

2. UTF-8 (Unicode)
\u4e00-\u9fa5 (Chinese)
\u3130-\u318F (Korean)
\uAC00-\uD7A3 (Korean)
\u0800-\u4e00 (Japanese)
PS: Korean characters are those above [\u9fa5]
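The GBK ranges above are byte values, while the \u ranges are Unicode code points, and the two are easy to confuse. As a small Python 3 illustration (the helper name `describe` and the sample characters are arbitrary choices), the same character can be shown in all three forms:

```python
def describe(ch):
    """Return the code point plus the GBK and UTF-8 byte forms of one character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "gbk": ch.encode("gbk").hex(),
        "utf8": ch.encode("utf-8").hex(),
    }


for ch in ("A", "中"):
    print(ch, describe(ch))
```

For the Chinese character, every encoded byte is 0x80 or higher in both encodings, which is what the byte-level patterns rely on, whereas the code point falls inside \u4e00-\u9fa5.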


Regular expression examples (PHP):
preg_replace("/([\x80-\xff])/", "", $str);
preg_replace('/([\x{4e00}-\x{9fa5}])/u', "", $str); // PCRE has no \u escape: use \x{...} with the /u modifier



II. Code examples

Check whether the content contains Chinese - GBK (PHP)
function check_is_chinese($s) {
    return preg_match('/[\x80-\xff]./', $s);
}

Get string length - GBK (PHP)
function gb_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $s = substr($str, $i, 1);
        // A high byte starts a double-byte character: skip its trail byte.
        if (preg_match("/[\x80-\xff]/", $s)) ++$i;
        ++$count;
    }
    return $count;
}

Truncate a string - GBK (PHP)
function gb_substr($str, $len) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        if ($count == $len) break;
        // Skip the trail byte so a character is never cut in half.
        if (preg_match("/[\x80-\xff]/", substr($str, $i, 1))) ++$i;
        ++$count;
    }
    return substr($str, 0, $i);
}
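The GBK helpers above treat any byte of 0x80 or more as the first half of a double-byte character. The same idea can be sketched in Python 3 over a `bytes` object; this is an illustrative port under that assumption, not a replacement for simply calling `data.decode("gbk")`, which is usually the better choice.

```python
def gb_strlen(data: bytes) -> int:
    """Count characters in GBK bytes: a byte >= 0x80 starts a 2-byte character."""
    count = 0
    i = 0
    while i < len(data):
        i += 2 if data[i] >= 0x80 else 1
        count += 1
    return count


def gb_substr(data: bytes, n: int) -> bytes:
    """Return the first n characters of GBK bytes without splitting a character."""
    count = 0
    i = 0
    while i < len(data) and count < n:
        i += 2 if data[i] >= 0x80 else 1
        count += 1
    return data[:i]


sample = "abc中文".encode("gbk")
print(len(sample), gb_strlen(sample))
print(gb_substr(sample, 4).decode("gbk"))
```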

Count string length - UTF-8 (PHP)
function utf8_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // Non-ASCII characters count as 2 (display width);
            // remove this increment to count characters instead.
            $count++;
            // Skip the continuation bytes that follow the lead byte.
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
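The PHP function above decides how many bytes to skip from the value of the lead byte. The same lead-byte technique can be written in Python 3 as follows. This sketch counts every character as 1, unlike the PHP version, which adds an extra 1 for each non-ASCII character; the function name is illustrative, and in practice `len(data.decode("utf-8"))` gives the same count with full validation.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in UTF-8 bytes by inspecting lead bytes only."""
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:            # ASCII, 1 byte
            i += 1
        elif 0xC0 <= lead <= 0xDF:  # 192..223: 2-byte sequence
            i += 2
        elif 0xE0 <= lead <= 0xEF:  # 224..239: 3-byte sequence
            i += 3
        elif 0xF0 <= lead <= 0xF7:  # 240..247: 4-byte sequence
            i += 4
        else:
            raise ValueError("not a UTF-8 lead byte at offset %d" % i)
        count += 1
    return count


sample = "abc中文한é".encode("utf-8")
print(len(sample), utf8_char_count(sample))
```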


Extract a substring - UTF-8 (PHP)
function utf8_substr($str, $position, $length) {
    $start_position = strlen($str);
    $start_byte = 0;
    $end_position = strlen($str);
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Record the byte offset where the requested position begins.
        if ($count >= $position && $start_position > $i) {
            $start_position = $i;
            $start_byte = $count;
        }
        // Stop once the requested length has been covered.
        if (($count - $start_byte) >= $length) {
            $end_position = $i;
            break;
        }
        $value = ord($str[$i]);
        if ($value > 127) {
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return (substr($str, $start_position, $end_position - $start_position));
}


Count string length - UTF-8 [Chinese and Korean take 3 bytes, Russian takes 2 bytes, letters take 1 byte] (Ruby)
require 'cgi'

def utf8_string_length(str)
  temp = CGI.unescape(str)
  i = 0  # accumulated length; each non-ASCII character adds 2
  j = 0  # bytes seen so far of the current multi-byte character
  # each_byte yields integers, as temp[t] did in Ruby 1.8
  temp.each_byte { |b|
    if b < 127
      i += 1
    elsif b >= 127 and b < 224
      j += 1
      if 0 == (j % 2)
        i += 2
        j = 0
      end
    else
      j += 1
      if 0 == (j % 3)
        i += 2
        j = 0
      end
    end
  }
  return i
end

Check whether the string contains Korean - UTF-8 (JavaScript)
function checkKoreaChar(str) {
    for (var i = 0; i < str.length; i++) {
        if ((str.charCodeAt(i) > 0x3130 && str.charCodeAt(i) < 0x318F) || (str.charCodeAt(i) >= 0xAC00 && str.charCodeAt(i) <= 0xD7A3)) {
            return true;
        }
    }
    return false;
}
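Since the article is about Python, here is the same Korean check as a Python 3 sketch. It uses the two ranges from the JavaScript function above, with the same exclusive bounds on the first range; the function name is illustrative.

```python
def contains_korean(text: str) -> bool:
    """Return True if text has a character in the two Korean ranges used above."""
    for ch in text:
        cp = ord(ch)
        # Compatibility jamo (bounds exclusive, as in the JavaScript version)
        # or precomposed Hangul syllables.
        if 0x3130 < cp < 0x318F or 0xAC00 <= cp <= 0xD7A3:
            return True
    return False


print(contains_korean("한국어"))
print(contains_korean("hello 中文"))
```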

Check whether the string contains Chinese characters - GBK (JavaScript)
function check_chinese_char(s) {
    // Each character outside \x00-\xff becomes two characters, so the length changes.
    return (s.length != s.replace(/[^\x00-\xff]/g, "**").length);
}
