As we all know, a character occupies a different number of bytes depending on its encoding. An ASCII character occupies 1 byte, while a Chinese character occupies 3 bytes in UTF-8 and 2 bytes in GBK.
PHP comes with several substring functions; the most commonly used are substr and mb_substr.
When you use substr to cut a string of Chinese characters, garbled output appears because substr cuts by byte. For example, cutting one byte from UTF-8 encoded Chinese gives you only 1/3 of a character, which is of course garbled.
The signature of mb_substr is:
mb_substr(string $str, int $start [, int $length [, string $encoding]])
The $encoding parameter names the character encoding; if it is omitted, the internal character encoding is used.
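The difference between cutting by byte and cutting by character can be seen in a small sketch. It assumes the mbstring extension is loaded, and the sample string is written as hex escapes so that the result does not depend on how the file itself is saved.

```php
<?php
// "中国" written as UTF-8 bytes, so the example does not depend on the file's encoding.
$str = "\xE4\xB8\xAD\xE5\x9B\xBD";

$byteCut = substr($str, 0, 1);              // first byte only: 1/3 of "中"
$charCut = mb_substr($str, 0, 1, 'UTF-8');  // first character: "中"

echo strlen($str), "\n";              // 6 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n";  // 2 (characters)
echo bin2hex($byteCut), "\n";         // e4
echo bin2hex($charCut), "\n";         // e4b8ad
```

The single byte e4 is not a valid UTF-8 sequence on its own, which is why the substr result shows up garbled.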
If you do not know the encoding of the string, you can check it with mb_detect_encoding:
$encoding = mb_detect_encoding($string, array("ASCII", "UTF-8", "GB2312", "GBK", "BIG5"));
And then:
mb_substr(string $str, int $start [, int $length [, string $encoding]])
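Put together, detection followed by mb_substr might look like the sketch below. The helper name cut_by_chars and the candidate list are assumptions made for illustration, and detection is only a guess: short strings are often valid in more than one encoding, so the detected encoding is not guaranteed to be the real one.

```php
<?php
// Hypothetical helper: detect the encoding first, then cut by characters.
function cut_by_chars(string $string, int $start, int $length): string
{
    // The candidate list is an assumption; list the most likely encodings first.
    $encoding = mb_detect_encoding($string, array('ASCII', 'UTF-8', 'GBK'), true);
    if ($encoding === false) {
        // Detection failed: fall back to the internal character encoding.
        $encoding = mb_internal_encoding();
    }
    return mb_substr($string, $start, $length, $encoding);
}

echo cut_by_chars('hello world', 0, 5), "\n"; // hello
```

If you already know the encoding of your data, passing it to mb_substr directly is both faster and more reliable than detecting it.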
If you implement mb_substr yourself in PHP, the efficiency is not very good.
Using encoding-related PHP functions
ord(substr($str, $i, 1)) > 0xa0
ord($string) returns the value (0 to 255) of the first byte of the string. It is used to determine whether the cut string starts with a Chinese character: in GB2312 a Chinese character takes 2 bytes, each greater than 0xa0, while in UTF-8 it takes three bytes, each 0x80 or greater. That is, the code of a Chinese character is larger than 255, so it never fits in a single byte.
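As a rough sketch of how the ord check can be used, the hypothetical helper gb2312_cut below cuts a GB2312 string to a byte limit without splitting a two-byte character. It assumes the input really is GB2312; it is not meant for UTF-8, where characters are three bytes wide.

```php
<?php
// Hypothetical helper: cut a GB2312 string to at most $maxBytes bytes
// without splitting a two-byte character.
function gb2312_cut(string $str, int $maxBytes): string
{
    $out = '';
    $i   = 0;
    $len = strlen($str);
    while ($i < $len) {
        // A byte above 0xa0 starts a two-byte GB2312 character.
        $width = (ord($str[$i]) > 0xa0) ? 2 : 1;
        if ($i + $width > $maxBytes || $i + $width > $len) {
            break; // the next character would not fit completely
        }
        $out .= substr($str, $i, $width);
        $i   += $width;
    }
    return $out;
}

$gb = "a\xD6\xD0\xB9\xFA"; // "a中国" as GB2312 bytes
echo bin2hex(gb2312_cut($gb, 2)), "\n"; // 61
echo bin2hex(gb2312_cut($gb, 3)), "\n"; // 61d6d0
```

With a limit of 2 bytes only "a" is kept, because taking one more byte would cut the first Chinese character in half.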
Regular expressions:
Match Chinese characters: preg_match_all('/[\x80-\xff]+/', $string, $match);
Match English: preg_match_all('/[\x01-\x7f]+/', $string, $match);
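Here is a small sketch of both patterns on a mixed string. Note that the patterns work on bytes: [\x80-\xff] matches any run of non-ASCII bytes, not only Chinese characters. The sample string is an assumption, written with hex escapes for the UTF-8 bytes of "中国".

```php
<?php
// "abc" + "中国" (UTF-8 bytes) + "def"
$string = "abc\xE4\xB8\xAD\xE5\x9B\xBDdef";

// Runs of bytes in 0x80-0xff: the multi-byte (here Chinese) part.
preg_match_all('/[\x80-\xff]+/', $string, $chinese);

// Runs of bytes in 0x01-0x7f: the ASCII part.
preg_match_all('/[\x01-\x7f]+/', $string, $english);

echo bin2hex($chinese[0][0]), "\n";     // e4b8ade59bbd
echo implode(',', $english[0]), "\n";   // abc,def
```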
Encoding Conversion
iconv(string $in_charset, string $out_charset, string $str)
For example, to convert GB2312 to UTF-8: iconv("GB2312", "UTF-8", $text)
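A quick way to see what iconv does is to look at the bytes before and after conversion. This sketch assumes the iconv extension is available and uses "中国" written as GB2312 bytes.

```php
<?php
$gb = "\xD6\xD0\xB9\xFA";                 // "中国" as GB2312 bytes

$utf8 = iconv('GB2312', 'UTF-8', $gb);    // GB2312 -> UTF-8
$back = iconv('UTF-8', 'GB2312', $utf8);  // and back again

echo bin2hex($utf8), "\n"; // e4b8ade59bbd
echo bin2hex($back), "\n"; // d6d0b9fa
```

The text is the same in both cases; only the byte representation changes, from 2 bytes per character to 3.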
urlencode returns a string in which all non-alphanumeric characters except -_. are replaced with a percent sign (%) followed by two hexadecimal digits, and spaces are encoded as plus signs (+). This is the same encoding used for WWW form POST data, that is, the application/x-www-form-urlencoded media type.
Note, however, that you should encode only part of the URL (such as a parameter value), otherwise the colon and slashes in the URL will be escaped as well.
urlencode generally gives one of two results, depending on the encoding of the input: the traditional one based on GB2312 and the one based on UTF-8. For example:
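The point about encoding only part of the URL can be sketched like this. The search term is a made-up value; the base URL is the Baidu address used in this article.

```php
<?php
$keyword = 'a b&c';                    // hypothetical search term
$base    = 'http://www.baidu.com/s';

// Encode only the parameter value.
$good = $base . '?wd=' . urlencode($keyword);

// Encoding the whole URL also escapes ":", "/", "?" and "=".
$bad = urlencode($base . '?wd=' . $keyword);

echo $good, "\n"; // http://www.baidu.com/s?wd=a+b%26c
echo $bad, "\n";  // http%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Da+b%26c
```

The second result is no longer usable as a link, although it would be fine as a parameter value inside another URL.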
$url = '中国';
echo urlencode($url);
// UTF-8: %E4%B8%AD%E5%9B%BD
// GB2312: %D6%D0%B9%FA
http://www.baidu.com/s?wd=%e4%b8%ad%e5%9b%bd&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8&rsv_sug3=16&rsv_sug=0&rsv_sug4=302&rsv_sug1=11&inputt=22928
In this Baidu search URL, the wd parameter is %E4%B8%AD%E5%9B%BD, the UTF-8 form.
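Both results can be reproduced from one script regardless of how the file is saved, by writing the UTF-8 bytes explicitly and converting them with iconv. This is only a sketch and assumes the iconv extension is available.

```php
<?php
$utf8 = "\xE4\xB8\xAD\xE5\x9B\xBD";        // "中国" as UTF-8 bytes
$gb   = iconv('UTF-8', 'GB2312', $utf8);   // the same text as GB2312 bytes

echo urlencode($utf8), "\n"; // %E4%B8%AD%E5%9B%BD
echo urlencode($gb), "\n";   // %D6%D0%B9%FA
```

urlencode itself knows nothing about character sets; it simply percent-encodes whatever bytes it is given.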
urlencode and rawurlencode: urlencode encodes a space as a plus sign "+", while rawurlencode encodes a space as "%20".
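The difference is easy to check with a string that contains a space. The sample value is arbitrary.

```php
<?php
$text = 'hello world';

$plus    = urlencode($text);     // space becomes "+"
$percent = rawurlencode($text);  // space becomes "%20"

echo $plus, "\n";    // hello+world
echo $percent, "\n"; // hello%20world
```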
URL decoding: urldecode and rawurldecode
1. When decoding, use the corresponding urldecode() or rawurldecode(). rawurldecode() does not decode a plus sign ('+') into a space, while urldecode() does.
2. urldecode() and rawurldecode() return the bytes exactly as they were encoded and do not convert the character set. If the URL contains UTF-8 encoded Chinese, the decoded string is UTF-8 and has to be converted when the page uses another encoding.
For example, first save the PHP file in GB2312 encoding. You will see that one part of the output is garbled and the other part is normal.
$url = '中国';
echo $a = urldecode(urlencode($url)), '<br />';
echo iconv('gb2312', 'utf-8', $a);
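The two decoding points above can be checked with a short sketch. The encoded sample strings are assumptions chosen for illustration, and the conversion step assumes the iconv extension is available.

```php
<?php
$encoded = 'a+b%20c';

$viaUrl = urldecode($encoded);     // "+" and "%20" both become spaces
$viaRaw = rawurldecode($encoded);  // "+" is left alone

echo $viaUrl, "\n"; // a b c
echo $viaRaw, "\n"; // a+b c

// Decoding returns the original bytes; convert them if the page needs another encoding.
$utf8 = urldecode('%E4%B8%AD%E5%9B%BD');   // "中国" as UTF-8 bytes
$gb   = iconv('UTF-8', 'GB2312', $utf8);   // the same text as GB2312 bytes

echo bin2hex($utf8), "\n"; // e4b8ade59bbd
echo bin2hex($gb), "\n";   // d6d0b9fa
```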