Character Encoding
I. Encoding Ranges
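1. GBK (GB2312/GB18030)
\x00-\xff full byte range; both bytes of a GBK two-byte code fall inside it
\x20-\x7f ASCII
\xa1-\xff Chinese (GB2312; assigned codes use \xa1-\xfe for both bytes)
\x80-\xff Chinese (GBK; assigned lead bytes are \x81-\xfe)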
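The byte ranges above can be turned into a small validity check. The sketch below is not from the original text: `isGbkPair` is a hypothetical helper name, and it assumes the commonly documented GBK layout (lead byte \x81-\xfe, trail byte \x40-\xfe excluding \x7f), which is narrower than the loose \x80-\xff range listed above.

```javascript
// Hypothetical helper: returns true when (lead, trail) looks like a
// GBK two-byte code. Assumes lead 0x81-0xfe and trail 0x40-0xfe
// (0x7f excluded); it does not check whether the code is assigned.
function isGbkPair(lead, trail) {
  const leadOk = lead >= 0x81 && lead <= 0xfe;
  const trailOk = trail >= 0x40 && trail <= 0xfe && trail !== 0x7f;
  return leadOk && trailOk;
}
```

For instance, the pair 0xd6 0xd0 passes the check, while two ASCII bytes do not.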
2. UTF-8 (Unicode)
\u4e00-\u9fa5 (Chinese)
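\u3130-\u318f (Korean, Hangul compatibility jamo)
\uac00-\ud7a3 (Korean, Hangul syllables)
\u0800-\u4e00 (Japanese; a loose range, with kana at \u3040-\u30ff)
Note: Korean Hangul syllables have code points greater than \u9fa5.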
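As an illustration of how the ranges listed above might be used, here is a minimal sketch that labels a single character by code point. The `RANGES` table and the `classifyChar` name are assumptions made for this example; only the Chinese and Korean ranges are included, since the Japanese range above is too loose to be useful on its own.

```javascript
// Ranges copied from the list above; the labels are illustrative only.
const RANGES = [
  { name: "chinese", from: 0x4e00, to: 0x9fa5 },
  { name: "korean", from: 0x3130, to: 0x318f },
  { name: "korean", from: 0xac00, to: 0xd7a3 },
];

// Returns the label of the first range that contains the character's
// code point, or "other" when no range matches.
function classifyChar(ch) {
  const cp = ch.codePointAt(0);
  for (const r of RANGES) {
    if (cp >= r.from && cp <= r.to) return r.name;
  }
  return "other";
}
```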
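Regular expression examples (PHP):
preg_replace('/[\x80-\xff]/', '', $str); // strip every non-ASCII byte
preg_replace('/[\x{4e00}-\x{9fa5}]/u', '', $str); // strip Chinese characters from a UTF-8 string (PCRE needs \x{...} plus the u modifier, not \u)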
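For comparison, JavaScript regular expressions accept the \uXXXX notation directly. The following sketch mirrors the second replacement above; `stripChinese` is a name chosen for this example only.

```javascript
// Removes every character in the \u4e00-\u9fa5 range from the input.
function stripChinese(str) {
  return str.replace(/[\u4e00-\u9fa5]/g, "");
}
```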
II. Code Examples
Check whether the content contains Chinese - GBK (PHP)
function check_is_chinese($s) {
    // A high byte followed by any byte indicates a double-byte (Chinese) character.
    return preg_match('/[\x80-\xff]./', $s);
}
Get string length - GBK (PHP)
function gb_strlen($str) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $s = substr($str, $i, 1);
        // A byte >= 0x80 starts a two-byte character, so skip its second byte.
        if (preg_match('/[\x80-\xff]/', $s)) ++$i;
        ++$count;
    }
    return $count;
}
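The same counting loop can be sketched in JavaScript over raw bytes. This assumes the GBK data is available as a `Uint8Array` (or Node.js `Buffer`); `gbkCharCount` is a hypothetical name, and like the PHP version it does not validate the trail byte.

```javascript
// Counts characters in a GBK-encoded byte sequence.
// Any byte >= 0x80 is treated as the first half of a two-byte
// character; its trail byte is skipped without validation.
function gbkCharCount(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] >= 0x80) i++; // skip the trail byte
    count++;
  }
  return count;
}
```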
Truncate a string - GBK (PHP)
function gb_substr($str, $len) {
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        if ($count == $len) break;
        // Skip the second byte of a two-byte character so it is never split.
        if (preg_match('/[\x80-\xff]/', substr($str, $i, 1))) ++$i;
        ++$count;
    }
    return substr($str, 0, $i);
}
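A byte-level JavaScript counterpart might look like the sketch below. It assumes the input is a `Uint8Array` of GBK bytes and returns the bytes of the first `len` characters; `gbkTake` is a name invented for this example.

```javascript
// Returns the bytes that make up the first `len` characters of a
// GBK-encoded byte sequence, never cutting a two-byte character in half.
function gbkTake(bytes, len) {
  let i = 0;
  let count = 0;
  while (i < bytes.length && count < len) {
    i += bytes[i] >= 0x80 ? 2 : 1; // two bytes for a high lead byte
    count++;
  }
  // Clamp in case the input ends with a dangling lead byte.
  return bytes.slice(0, Math.min(i, bytes.length));
}
```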
String length - UTF-8 (PHP)
function utf8_strlen($str) {
    // As written, each ASCII character adds 1 to the count and each
    // multi-byte character adds 2 (a display-width style length).
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            $count++;
            // The lead byte tells how many continuation bytes to skip.
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
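The lead-byte ranges used in the function above (192-223, 224-239, 240-247) determine how long each UTF-8 sequence is. The sketch below isolates that rule in JavaScript; `utf8SeqLength` and `utf8CharCount` are hypothetical names, the input is assumed to be a `Uint8Array` of UTF-8 bytes, and unlike the PHP function it counts every character exactly once.

```javascript
// Returns how many bytes a UTF-8 sequence occupies, given its lead byte.
function utf8SeqLength(lead) {
  if (lead <= 0x7f) return 1;                 // ASCII
  if (lead >= 0xc0 && lead <= 0xdf) return 2; // 192-223
  if (lead >= 0xe0 && lead <= 0xef) return 3; // 224-239
  if (lead >= 0xf0 && lead <= 0xf7) return 4; // 240-247
  throw new Error("Not a UTF-8 lead byte: 0x" + lead.toString(16));
}

// Counts characters by jumping from lead byte to lead byte.
function utf8CharCount(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i += utf8SeqLength(bytes[i])) count++;
  return count;
}
```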
Truncate a string - UTF-8 (PHP)
function utf8_substr($str, $position, $length) {
    // $position and $length use the same units as the count below:
    // 1 per ASCII character, 2 per multi-byte character.
    $start_position = strlen($str);
    $start_byte = 0;
    $end_position = strlen($str);
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        // Remember the byte offset where the requested position begins.
        if ($count >= $position && $start_position > $i) {
            $start_position = $i;
            $start_byte = $count;
        }
        // Stop once the requested length has been collected.
        if (($count - $start_byte) >= $length) {
            $end_position = $i;
            break;
        }
        $value = ord($str[$i]);
        if ($value > 127) {
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return substr($str, $start_position, $end_position - $start_position);
}
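In JavaScript, strings are already decoded, so a character-safe substring does not need to walk bytes. The sketch below is a simpler technique than the PHP function above, not a port of it: it measures `position` and `length` in code points, and `codePointSubstr` is a name assumed for this example.

```javascript
// Returns `length` characters starting at character index `position`.
// Array.from splits the string by code point, so surrogate pairs
// (characters outside the BMP) are never cut in half.
function codePointSubstr(str, position, length) {
  return Array.from(str).slice(position, position + length).join("");
}
```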
String length - UTF-8 [Chinese and Korean take 3 bytes, Russian takes 2 bytes, ASCII letters take 1 byte] (Ruby)
require 'cgi'

def utf8_string_length(str)
  temp = CGI.unescape(str)
  # ASCII characters count as 1; every multi-byte character counts as 2.
  # Assumes Ruby 1.9 or later, where strings iterate by character.
  temp.each_char.inject(0) do |total, ch|
    total + (ch.bytesize == 1 ? 1 : 2)
  end
end
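The counting rule used here (1 for an ASCII character, 2 for anything wider) can be sketched in JavaScript as well. `displayWidth` is a hypothetical name, and treating every non-ASCII character as width 2 is a simplification carried over from the code above, not a general rule.

```javascript
// Sums a simple display width: 1 per ASCII character, 2 per
// non-ASCII character. Iterating with for...of walks code points.
function displayWidth(str) {
  let width = 0;
  for (const ch of str) {
    width += ch.codePointAt(0) < 0x80 ? 1 : 2;
  }
  return width;
}
```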
Check whether the string contains Korean - UTF-8 (JavaScript)
function checkKoreaChar(str) {
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // Hangul compatibility jamo or Hangul syllables
        if ((code > 0x3130 && code < 0x318f) || (code >= 0xac00 && code <= 0xd7a3)) {
            return true;
        }
    }
    return false;
}
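The same test can be written as a single regular expression. This sketch uses the same bounds as the loop above (exclusive 0x3130 and 0x318f become inclusive \u3131 and \u318e); `containsKorean` is a name chosen for illustration.

```javascript
// True when the string holds at least one Hangul compatibility jamo
// (\u3131-\u318e) or Hangul syllable (\uac00-\ud7a3).
function containsKorean(str) {
  return /[\u3131-\u318e\uac00-\ud7a3]/.test(str);
}
```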
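Check whether the string contains Chinese characters - GBK (JavaScript)
function check_chinese_char(s) {
    // Each character outside \x00-\xff becomes two characters, so the
    // length changes only when such a character is present.
    return (s.length != s.replace(/[^\x00-\xff]/g, "**").length);
}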
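The replace trick in this function also gives a rough GBK byte length, since each double-byte character is expanded to two placeholders. The sketch below is an estimate under that assumption only: `estimateGbkByteLength` is a hypothetical name, and characters in \x80-\xff are counted as one byte even though GBK has no single-byte form for them.

```javascript
// Estimates how many bytes the string would occupy in GBK by
// counting 2 for every character outside \x00-\xff and 1 otherwise.
function estimateGbkByteLength(s) {
  return s.replace(/[^\x00-\xff]/g, "**").length;
}
```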
III. References
http://www.unicode.org/
http://examples.oreilly.com/cjkvinfo/doc/cjk.inf
http://www.ansell-uebersetzungen.com/gbuni.html
http://www.haiyan.com/steelk/navigator/ref/gbk/gbindex.htm
http://baike.baidu.com/view/40801.htm
http://www.chedong.com/tech/hello_unicode.html