Chinese character encoding
Double-byte character encoding ranges:
1. GBK (GB2312/GB18030)
\x00-\xff → full byte range used by GBK double-byte encoding
\x20-\x7f → ASCII; matches the non-Chinese (printable ASCII) characters
\xa1-\xff → GB2312; matches all Chinese characters (excludes ASCII letters, digits, and symbols)
\x80-\xff → GBK; matches all Chinese characters (excludes ASCII letters, digits, and symbols)
2. UTF-8 (Unicode)
\u4e00-\u9fa5 → (Chinese) matches all Chinese characters
\u3130-\u318f → (Korean) matches Korean (Hangul compatibility jamo)
\uac00-\ud7a3 → (Korean) matches Korean (Hangul syllables)
\u0800-\u4e00 → (Japanese) matches Japanese
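Put ^ at the start of a character class to invert any of these ranges; for example, [^\u4e00-\u9fa5] matches everything except Chinese.
PS: Korean Hangul syllables have code points greater than \u9fa5.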
Code examples
Check whether the content contains Chinese - GBK (PHP)
function check_is_chinese($s) {
    // A byte in 0x80-0xff followed by any byte is treated as a GBK character.
    return preg_match('/[\x80-\xff]./', $s);
}
Get string length - GBK (PHP)
function gb_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $s = substr($str, $i, 1);
        // A byte in 0x80-0xff starts a double-byte character: skip its second byte.
        if (preg_match('/[\x80-\xff]/', $s)) ++$i;
        ++$count;
    }
    return $count;
}
Truncate a string - GBK (PHP)
function gb_substr($str, $len) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        if ($count == $len) break;
        // Skip the second byte so a double-byte character is never split.
        if (preg_match('/[\x80-\xff]/', substr($str, $i, 1))) ++$i;
        ++$count;
    }
    return substr($str, 0, $i);
}
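JavaScript strings are UTF-16, so the byte-walking logic of the two GBK functions above only applies once the raw GBK bytes are available, for example in a Uint8Array read from a file. The sketch below mirrors that logic under this assumption. The names isLeadByte, gbkCharCount and gbkTruncate are hypothetical, and the lead-byte test (0x80-0xff) is the one used in this article.

```javascript
// Assumes `bytes` holds GBK-encoded data (e.g. a Uint8Array read from a file).
// A byte in 0x80-0xff is treated as the first byte of a two-byte character.
function isLeadByte(b) {
  return b >= 0x80 && b <= 0xff;
}

// Count characters: a two-byte character counts as one.
function gbkCharCount(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (isLeadByte(bytes[i])) i++; // skip the second byte of the pair
    count++;
  }
  return count;
}

// Return the bytes of the first `len` characters without splitting a pair.
function gbkTruncate(bytes, len) {
  let count = 0;
  let i = 0;
  while (i < bytes.length && count < len) {
    i += isLeadByte(bytes[i]) ? 2 : 1;
    count++;
  }
  return bytes.slice(0, Math.min(i, bytes.length));
}

// "中a文" in GBK: 中 = D6 D0, a = 61, 文 = CE C4
const sample = Uint8Array.from([0xd6, 0xd0, 0x61, 0xce, 0xc4]);
console.log(gbkCharCount(sample));          // 3
console.log(gbkTruncate(sample, 2).length); // 3
```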
Count string length - UTF-8 (PHP)
function utf8_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // A multi-byte character adds 2 to $count (once here, once below).
            $count++;
            // The lead byte tells how many continuation bytes to skip.
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
Truncate a string - UTF-8 (PHP)
function utf8_substr($str, $position, $length) {
    $start_position = strlen($str);
    $start_byte = 0;
    $end_position = strlen($str);
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Record the byte offset where the requested start position is reached.
        if ($count >= $position && $start_position > $i) {
            $start_position = $i;
            $start_byte = $count;
        }
        // Stop once $length units have been collected.
        if (($count - $start_byte) >= $length) {
            $end_position = $i;
            break;
        }
        $value = ord($str[$i]);
        if ($value > 127) {
            // A multi-byte character adds 2 to $count (once here, once below).
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return substr($str, $start_position, $end_position - $start_position);
}
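The lead-byte ranges used by the two UTF-8 functions above (192-223, 224-239, 240-247) determine how many bytes each character occupies. As a sketch of the same idea in JavaScript, the code below walks the bytes produced by the standard TextEncoder API and counts each character once. The names utf8SeqLength and utf8CharCount are invented for this example.

```javascript
// Number of bytes in the UTF-8 sequence that starts with lead byte `b`.
function utf8SeqLength(b) {
  if (b <= 127) return 1;             // ASCII
  if (b >= 192 && b <= 223) return 2; // 110xxxxx
  if (b >= 224 && b <= 239) return 3; // 1110xxxx
  if (b >= 240 && b <= 247) return 4; // 11110xxx
  throw new Error("Not a UTF-8 compatible string");
}

// Count characters by jumping from lead byte to lead byte.
function utf8CharCount(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i += utf8SeqLength(bytes[i])) {
    count++;
  }
  return count;
}

// "a" is 1 byte, "中" is 3 bytes, "\u00e9" (é) is 2 bytes.
const utf8Bytes = new TextEncoder().encode("a中\u00e9");
console.log(utf8Bytes.length);         // 6
console.log(utf8CharCount(utf8Bytes)); // 3
```

Throwing an error instead of calling die() keeps the sketch usable inside a larger program; a stray continuation byte (128-191) in lead position triggers it.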
Check whether the content contains Korean - UTF-8 (JavaScript)
function checkKoreaChar(str) {
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if ((code > 0x3130 && code < 0x318f) || (code >= 0xac00 && code <= 0xd7a3)) {
            return true;
        }
    }
    return false;
}
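Check whether the content contains Chinese characters - GBK (JavaScript)
function check_chinese_char(s) {
    // Each character outside \x00-\xff is replaced by two characters, so the length changes if any is present.
    return (s.length != s.replace(/[^\x00-\xff]/g, "**").length);
}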
Summary
Regular expressions are often used to detect non-English characters such as Chinese and Korean, so the encoding ranges are recorded here for easy lookup.
When matching Chinese with regular expressions, remember that a Chinese character takes two bytes in GBK and three bytes in UTF-8.
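To make the UTF-8 byte count concrete, the short sketch below encodes a few characters with the standard TextEncoder API and prints how many bytes each one takes. TextEncoder only produces UTF-8, so the GBK side is not shown. The helper name utf8ByteLength is chosen for this example.

```javascript
// Number of bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

console.log(utf8ByteLength("a"));    // 1
console.log(utf8ByteLength("中"));   // 3
console.log(utf8ByteLength("中文")); // 6
console.log(utf8ByteLength("한"));   // 3
```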