First, make sure all text is Unicode,
for example str.decode('utf8')  # for text read from a UTF-8 source
u"啊l"  # for literals typed at the console
(An aside: I had wanted to match against the hex values of a specific encoding, but frustratingly each character seems to occupy 2 positions there, and the regular expression matched nothing.)
Second, determine the range for Chinese: [\u4e00-\u9fa5]
(Note when writing the pattern for Python's re:) use u"[\u4e00-\u9fa5]"  # make sure the regular expression is Unicode too
Demo (Python 2):
>>> print re.match(ur"[\u4e00-\u9fa5]+", "啊")
None
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"啊")
<_sre.SRE_Match object at 0x2a98981308>
>>> print re.match(ur"[\u4e00-\u9fa5]+", u"t")
None
>>> print tt
现在才明白
>>> tt
'\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd'
>>> print re.match(r"[\u4e00-\u9fa5]", tt.decode('utf8'))
None
>>> print re.match(ur"[\u4e00-\u9fa5]", tt.decode('utf8'))
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,匹配到了")
<_sre.SRE_Match object at 0x2a955d9c60>
>>> print re.match(ur".*[\u4e00-\u9fa5]+", u"hi,no no")
None
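The session above is Python 2, where byte strings and Unicode strings are separate types. As a rough sketch of the same checks under Python 3 (an assumption about the reader's environment, not part of the original session), every str is already Unicode and only bytes need decoding. The names HAN_RUN and starts_with_han are made up for illustration.

```python
import re

# Hypothetical names for illustration; the range is the one used in the demo.
HAN_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def starts_with_han(text: str) -> bool:
    """Return True if text begins with at least one character in the range."""
    return HAN_RUN.match(text) is not None


# The same UTF-8 bytes as tt in the demo; decode them before matching.
raw = b"\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x89\x8d\xe6\x98\x8e\xe7\x99\xbd"
text = raw.decode("utf8")

print(starts_with_han(text))  # True
print(starts_with_han("t"))   # False
```

Matching against raw directly would fail with a TypeError, because a str pattern cannot be applied to bytes; the decode step plays the role that tt.decode('utf8') plays in the demo.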
Extending to other ranges
Here are the main non-English character ranges (found on Google):
2E80~33FFH: CJK symbols area. Contains Kangxi dictionary radicals, CJK supplementary radicals, phonetic symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized numbers and months, Japanese kana combinations, units, era names, months, dates, times and so on.
3400~4DFFH: CJK Unified Ideographs Extension A, 6,582 CJK Han characters in total.
4E00~9FFFH: CJK Unified Ideographs, 20,902 CJK Han characters in total.
A000~A4FFH: Yi script area, containing the syllables and radicals of the Yi script of southern China.
AC00~D7FFH: Hangul syllables area, containing the syllables composed from Korean jamo.
F900~FAFFH: CJK Compatibility Ideographs, 302 CJK Han characters in total.
FB00~FFFDH: Presentation forms area, containing Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, half-width forms, full-width forms and so on.
For example, to match all CJK Han characters, the regular expression should be ^[\u3400-\u9FFF]+$
That works in theory, but when I casually copied some Korean text from msn.co.ko, it did not match, which was odd.
I then copied a Japanese character from msn.co.jp, and that did not match either.
So I widened the range to ^[\u2E80-\u9FFF]+$, and with that everything I had copied passed. This should be the regular expression for matching CJK text, including Traditional Chinese. (Hangul syllables in AC00~D7FFH, listed above, still lie outside this range.)
The regular expression for Chinese alone should be ^[\u4E00-\u9FFF]+$, which is very close to the ^[\u4E00-\u9FA5]+$ often mentioned on forums.
Note that forums describe ^[\u4E00-\u9FA5]+$ as a regular expression specifically for Simplified Chinese. In fact, Traditional characters are inside that range too: I tested the Traditional-character form of "People's Republic of China" and it also passed. Of course, ^[\u4E00-\u9FFF]+$ gives the same result.
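The three ranges discussed here can be compared side by side. The following is a minimal sketch in Python 3; the constant names and the sample characters are chosen for illustration and are not the strings the author copied from the websites.

```python
import re

# Hypothetical names; the ranges are the ones discussed in the text.
HAN_ONLY = re.compile(r"^[\u3400-\u9fff]+$")     # Extension A plus unified ideographs
CJK_WIDE = re.compile(r"^[\u2e80-\u9fff]+$")     # also radicals, kana, symbols
FORUM_RANGE = re.compile(r"^[\u4e00-\u9fa5]+$")  # the range usually quoted on forums


def matches(pattern, text):
    """Return True if the whole text consists of characters in the range."""
    return pattern.match(text) is not None


kana = "\u304a"         # HIRAGANA LETTER O
hangul = "\ud55c"       # a Hangul syllable
traditional = "\u570b"  # a Traditional Chinese character

print(matches(HAN_ONLY, kana), matches(CJK_WIDE, kana))  # False True
print(matches(CJK_WIDE, hangul))                         # False
print(matches(FORUM_RANGE, traditional))                 # True
```

The kana result mirrors the observation that widening the lower bound to 2E80 is what lets Japanese text through, and the last line shows a Traditional character passing the "Simplified" range.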
Using regular expressions in Python to match Chinese: a question and answer
I want to use a regular expression in Python to match Chinese, using [\u4e00-\u9fa5]. But the result is wrong: the expression matches not only Chinese but also English characters.
The same experiment works fine in other languages but not in Python. What is going on? Is it an encoding problem?
Encoding problems are fairly complex: the outcome depends on the encoding of the data source itself, and different operating systems and settings lead to different results.
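One plausible explanation for the symptom in the question, offered as an assumption because the asker's exact code is not shown: if the \u escapes are not interpreted (for example, when the pattern is a plain byte string in Python 2 rather than a Unicode string), the character class degenerates into a set of literal characters that happens to include most English letters. The sketch below reproduces that effect in Python 3 by simply leaving the backslashes out.

```python
import re

# Without the escapes, the class is a set of literal characters:
# 'u', '4', 'e', '0', the range '0'-'u', '9', 'f', 'a', '5'.
# The range '0'-'u' covers digits, uppercase letters and lowercase a-u.
broken = re.compile(r"[u4e00-u9fa5]+")

# With the escapes intact, the class is the code point range U+4E00..U+9FA5.
intended = re.compile(r"[\u4e00-\u9fa5]+")

print(bool(broken.fullmatch("Hello")))           # True
print(bool(intended.fullmatch("Hello")))         # False
print(bool(intended.fullmatch("\u4e2d\u6587")))  # True
print(bool(broken.fullmatch("\u4e2d\u6587")))    # False
```

This matches the advice earlier in the text: both the pattern and the subject string need to be Unicode for the range to mean what it appears to mean.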
I. Encoding ranges
1. GBK (GB2312/GB18030)
\x00-\xff GBK double-byte encoding range
\x20-\x7f ASCII
\xa1-\xff Chinese
\x80-\xff Chinese
2. UTF-8 (Unicode)
\u4e00-\u9fa5 (Chinese)
\u3130-\u318F (Korean)
\uAC00-\uD7A3 (Korean)
\u0800-\u4e00 (Japanese)
PS: Korean characters are those above [\u9fa5]
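To make the difference between the byte-level ranges (\x80-\xff) and the code point ranges (\u4e00-\u9fa5) concrete, here is a small Python 3 sketch that encodes one Chinese character both ways. The sample character and the helper all_high_bytes are illustrative choices, not taken from the original answer.

```python
# Illustrative sample: the character U+4E2D in both encodings.
ch = "\u4e2d"

gbk_bytes = ch.encode("gbk")
utf8_bytes = ch.encode("utf-8")

print(gbk_bytes)   # b'\xd6\xd0'
print(utf8_bytes)  # b'\xe4\xb8\xad'


def all_high_bytes(data: bytes) -> bool:
    """True if every byte lies in 0x80-0xff, the range the byte-level patterns test."""
    return all(b >= 0x80 for b in data)


print(all_high_bytes(gbk_bytes), all_high_bytes(b"abc"))  # True False

# A Hangul syllable such as U+D55C sorts above U+9FA5, as the PS notes.
print(ord("\ud55c") > 0x9FA5)  # True
```

The same character is two bytes in GBK and three in UTF-8, which is why the byte-counting helpers in the next part come in separate GBK and UTF-8 versions.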
Regex examples:
preg_replace('/([\x80-\xff])/', '', $str);
// PCRE has no \u escape; code points are written as \x{...} with the /u modifier.
preg_replace('/([\x{4e00}-\x{9fa5}])/u', '', $str);
II. Code examples
Check whether the content contains Chinese - GBK (PHP)
function check_is_chinese($s) {
    // A byte in \x80-\xff followed by any byte is taken as a double-byte character.
    return preg_match('/[\x80-\xff]./', $s);
}
Get string length - GBK (PHP)
function gb_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $s = substr($str, $i, 1);
        // A byte in \x80-\xff starts a two-byte character, so skip its second byte.
        if (preg_match("/[\x80-\xff]/", $s)) ++$i;
        ++$count;
    }
    return $count;
}
Extract a substring - GBK (PHP)
function gb_substr($str, $len) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Stop once $len characters have been counted.
        if ($count == $len) break;
        if (preg_match("/[\x80-\xff]/", substr($str, $i, 1))) ++$i;
        ++$count;
    }
    return substr($str, 0, $i);
}
Count string length - UTF-8 (PHP)
function utf8_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // A non-ASCII character counts twice (once here, once below).
            $count++;
            // Skip the continuation bytes according to the lead byte.
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
Extract a substring - UTF-8 (PHP)
function utf8_substr($str, $position, $length) {
    $start_position = strlen($str);
    $start_byte = 0;
    $end_position = strlen($str);
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Record the byte offset at which the requested start position is reached.
        if ($count >= $position && $start_position > $i) {
            $start_position = $i;
            $start_byte = $count;
        }
        // Stop once $length units have been collected.
        if (($count - $start_byte) >= $length) {
            $end_position = $i;
            break;
        }
        $value = ord($str[$i]);
        if ($value > 127) {
            // A non-ASCII character counts twice (once here, once below).
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return (substr($str, $start_position, $end_position - $start_position));
}
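The lead-byte rule that both UTF-8 functions rely on can be isolated into one helper. The sketch below is an illustrative Python 3 version, not a line-for-line port: it counts every character as 1, whereas the PHP functions count a non-ASCII character as 2, and all function names are invented for this example.

```python
def utf8_char_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def utf8_strlen(data: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        i += utf8_char_width(data[i])
        count += 1
    return count


def utf8_substr(data: bytes, position: int, length: int) -> bytes:
    """Return `length` characters starting at character index `position`."""
    offsets = []  # byte offset of each character, plus the end of the data
    i = 0
    while i < len(data):
        offsets.append(i)
        i += utf8_char_width(data[i])
    offsets.append(len(data))
    start = min(position, len(offsets) - 1)
    end = min(position + length, len(offsets) - 1)
    return data[offsets[start]:offsets[end]]


# One character each of 1, 2, 3 and 4 bytes.
sample = "a\u0436\u4e2d\U0001F600".encode("utf-8")
print(utf8_strlen(sample))                      # 4
print(utf8_substr(sample, 1, 2).decode("utf-8") == "\u0436\u4e2d")  # True
```

Cutting on the recorded offsets guarantees the slice never ends in the middle of a multi-byte sequence, which is the whole point of walking the lead bytes instead of slicing raw bytes.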
String length statistics - UTF-8 [Chinese and Korean take 3 bytes, Russian takes 2 bytes, ASCII letters take 1 byte] (Ruby)
require 'cgi'

def utf8_string_length(str)
  temp = CGI.unescape(str)
  i = 0
  temp.each_byte do |b|
    if b < 128
      i += 1   # an ASCII byte counts as 1
    elsif b >= 192
      i += 2   # the lead byte of a multi-byte character counts as 2
    end        # continuation bytes (128..191) add nothing
  end
  i
end
Check whether the string contains Korean - UTF-8 (JavaScript)
function checkKoreaChar(str) {
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // Hangul compatibility jamo, or Hangul syllables.
        if ((code > 0x3130 && code < 0x318F) || (code >= 0xAC00 && code <= 0xD7A3)) {
            return true;
        }
    }
    return false;
}
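The same code point test carries over directly to Python 3, where ord() plays the role of charCodeAt(). This is a small sketch with a made-up function name, using the same boundaries as the JavaScript version.

```python
def contains_korean(text: str) -> bool:
    """True if any character is in the jamo or Hangul syllable ranges used above."""
    return any(
        0x3130 < ord(ch) < 0x318F or 0xAC00 <= ord(ch) <= 0xD7A3
        for ch in text
    )


print(contains_korean("abc\ud55c"))     # True
print(contains_korean("\u4e2d\u6587"))  # False
```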
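Check whether the string contains Chinese (double-byte) characters - GBK (JavaScript)
function check_chinese_char(s) {
    // Every character outside \x00-\xff becomes two characters, so the length changes if any exist.
    return (s.length != s.replace(/[^\x00-\xff]/g, "**").length);
}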