function utf8_gb2312($str, $default = 'gb2312')
{
    $str = preg_replace("/[\x01-\x7f]+/", "", $str);
    if (empty($str)) return $default;
    $preg = array(
        "gb2312" => "/^([\xa1-\xf7][\xa0-\xfe])+$/", // regex: does the string consist only of gb2312 byte pairs?
        "utf-8" => "/^[\x{4e00}-\x{9fa5}]+$/u", // regex: is the string utf-8 Chinese? (this range actually also covers Traditional Chinese)
    );
    if ($default == 'gb2312') {
        $option = 'utf-8';
    } else {
        $option = 'gb2312';
    }
    if (!preg_match($preg[$default], $str)) {
        return $option;
    }
    $str = @iconv($default, $option, $str);
    // If the string cannot be converted to $option, the original was not $default.
    if (empty($str)) {
        return $option;
    }
    return $default;
}
The default encoding is gb2312; by my statistics, roughly 90% of the strings I see are gb2312, so the detection function must never report a string that is actually gb2312 as utf-8. The basic idea is:
1. Strip out all ASCII bytes. If the whole string was ASCII, treat it as gb2312.
2. Assume the string is gb2312 and test it with the gb2312 regex. If it does not match, it is UTF-8.
3. Otherwise, use iconv to convert the string to utf-8. If the conversion fails, the bytes were probably not real gb2312 (I made the regex match as precise as I could, but the gb2312 code space is not contiguous and there are still holes), so the final answer is UTF-8.
4. Otherwise, it is gb2312.
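The four steps above can be sketched as a compact, self-contained check (a sketch, not the original function; `detect_gb2312_or_utf8` is a name chosen here, the sample byte strings are assumed inputs, and the iconv extension is required):

```php
<?php
// "\xd6\xd0\xce\xc4" is "中文" in gb2312; "\xe4\xb8\xad\xe6\x96\x87" is the same text in utf-8.
function detect_gb2312_or_utf8($str, $default = 'gb2312')
{
    // Step 1: strip ASCII bytes; an all-ASCII string is reported as the default.
    $str = preg_replace("/[\x01-\x7f]+/", "", $str);
    if (empty($str)) return $default;

    // Step 2: every gb2312 character is a high byte 0xA1-0xF7 followed by
    // 0xA0-0xFE; if the string does not fit that pattern, call it utf-8.
    if (!preg_match("/^([\xa1-\xf7][\xa0-\xfe])+$/", $str)) {
        return 'utf-8';
    }

    // Step 3: the gb2312 code space has holes, so confirm with iconv;
    // a failed conversion means the bytes were not really gb2312.
    if (@iconv('gb2312', 'utf-8', $str) === false) {
        return 'utf-8';
    }

    // Step 4: otherwise treat it as gb2312.
    return 'gb2312';
}

echo detect_gb2312_or_utf8("\xd6\xd0\xce\xc4"), "\n";         // gb2312
echo detect_gb2312_or_utf8("\xe4\xb8\xad\xe6\x96\x87"), "\n"; // utf-8
echo detect_gb2312_or_utf8("hello"), "\n";                    // gb2312 (all ASCII)
```

Note that the utf-8 sample fails the gb2312 regex at its third byte pair (0x96 is below 0xA1), which is exactly the kind of mismatch step 2 relies on.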
After adding this check, only about one keyword in 1,000 comes out garbled, far fewer than the roughly one in 100 before.