The number of bytes a character occupies depends on the encoding. For example, an ASCII character occupies 1 byte, a Chinese character encoded in UTF-8 usually occupies 3 bytes, and a Chinese character encoded in GBK occupies 2 bytes.
PHP also comes with several string truncation functions, including substr and mb_substr.
Using substr to truncate a string that contains Chinese characters can produce garbled output, because substr counts bytes rather than characters. With UTF-8 encoded Chinese text, a cut can land inside a 3-byte character and keep only part of it (for example 1/3 of the character), which naturally displays as garbage.
mb_substr(string $str, int $start [, int $length [, string $encoding ]])
The $encoding parameter specifies the character encoding. If it is omitted, the internal character encoding is used.
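The difference between byte-based and character-based truncation can be seen in a minimal sketch. It assumes the mbstring extension is loaded, and it writes "中国" as explicit UTF-8 bytes so that the encoding of the script file does not affect the result.

```php
<?php
// "中国" written as explicit UTF-8 bytes (3 bytes per character).
$str = "\xE4\xB8\xAD\xE5\x9B\xBD";

// substr counts bytes: 2 bytes is only part of the first character.
$byteCut = substr($str, 0, 2);

// mb_substr counts characters when it is told the encoding.
$charCut = mb_substr($str, 0, 1, 'UTF-8');

var_dump(strlen($str));                          // length in bytes
var_dump(mb_strlen($str, 'UTF-8'));              // length in characters
var_dump(mb_check_encoding($byteCut, 'UTF-8'));  // is the byte cut valid UTF-8?
var_dump($charCut === "\xE4\xB8\xAD");           // is the character cut exactly "中"?
```

The byte cut is no longer valid UTF-8, which is why it shows up as garbage in a browser or terminal.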
If you do not know the encoding format of the string, you can use mb_detect_encoding to check:
$encoding = mb_detect_encoding($string, array('ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'));
Then pass the detected encoding to mb_substr:
mb_substr($str, $start, $length, $encoding);
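Detection and truncation can be combined roughly as follows. This is only a sketch: truncate_detected() is a hypothetical helper, not a PHP built-in, and it assumes PHP 7.1 or later with mbstring. The candidate list here is deliberately short; with longer lists the same bytes can be valid in several encodings, so the detected result depends on the candidates and their order.

```php
<?php
// Hypothetical helper: keep the first $length characters of $str,
// using whichever candidate encoding mb_detect_encoding() reports.
function truncate_detected(string $str, int $length, array $candidates): ?string
{
    // Strict mode (third argument) rejects encodings for which the bytes are invalid.
    $encoding = mb_detect_encoding($str, $candidates, true);
    if ($encoding === false) {
        return null; // none of the candidates fits
    }
    return mb_substr($str, 0, $length, $encoding);
}

$utf8 = "\xE4\xB8\xAD\xE5\x9B\xBDabc"; // "中国abc" as UTF-8 bytes

// Three characters: "中", "国" and "a".
var_dump(truncate_detected($utf8, 3, ['ASCII', 'UTF-8']) === "\xE4\xB8\xAD\xE5\x9B\xBDa");

// Pure ASCII input is detected as ASCII and cut by character as well.
var_dump(truncate_detected('hello', 2, ['ASCII', 'UTF-8']));
```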
If you implement mb_substr yourself, the result is usually less efficient than the built-in function.
Encoding-related PHP functions
ord(substr($str, $i, 1)) > 0xa0
ord($string) returns the value (0 to 255) of the first byte of the string. It can be used to check whether the string to be truncated starts with a Chinese character. A Chinese character is 2 bytes in GB2312 and 3 bytes in UTF-8, and every one of those bytes is greater than 0x7f; in GB2312 each byte is in fact greater than 0xa0. That is, if the byte value is greater than 0xa0, the byte is treated as part of a Chinese character.
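The byte check can be tried out with a small sketch. is_non_ascii_byte() is a hypothetical helper introduced only for illustration; it uses the more general 0x7f threshold, which also covers UTF-8 continuation bytes below 0xa0.

```php
<?php
// Hypothetical helper (not a PHP built-in): true if the byte at offset $i
// is outside the 7-bit ASCII range, i.e. part of a multibyte character.
function is_non_ascii_byte(string $str, int $i): bool
{
    return ord($str[$i]) > 0x7f;
}

$gb   = "\xD6\xD0";     // "中" as GB2312 bytes
$utf8 = "\xE4\xB8\xAD"; // "中" as UTF-8 bytes

// ord() only looks at the first byte, whatever the string length.
printf("0x%02X\n", ord($gb));   // first byte of the GB2312 form
printf("0x%02X\n", ord($utf8)); // first byte of the UTF-8 form
printf("0x%02X\n", ord('A'));   // plain ASCII letter

var_dump(is_non_ascii_byte('a' . $utf8, 0)); // the ASCII "a"
var_dump(is_non_ascii_byte('a' . $utf8, 1)); // lead byte of "中"
```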
Regular expressions:
Match Chinese characters: preg_match_all('/[\x80-\xff]?./', $string, $match);
Match English: preg_match_all('/[\x01-\x7f]+/', $string, $match);
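As a rough illustration, the two patterns behave as follows on a GB2312 byte string. The first pattern pairs a high byte with the byte after it, so it suits double-byte encodings such as GB2312 or GBK; applying it to 3-byte UTF-8 characters would split them. The sample strings are assumptions chosen for the demonstration.

```php
<?php
// "中a国" as GB2312 bytes: two double-byte characters around an ASCII "a".
$gb = "\xD6\xD0a\xB9\xFA";

// Optional high byte plus one more byte: each match is one character.
preg_match_all('/[\x80-\xff]?./', $gb, $chars);

// Runs of 7-bit ASCII bytes only.
preg_match_all('/[\x01-\x7f]+/', "ab\xD6\xD0cd", $ascii);

var_dump(count($chars[0])); // number of characters found in $gb
var_dump($ascii[0]);        // ASCII runs on either side of the Chinese character
```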
Encoding conversion
iconv(string $in_charset, string $out_charset, string $str)
For example, GB2312 to UTF-8: iconv("GB2312", "UTF-8", $text)
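A minimal round trip, assuming the iconv extension is available and the underlying iconv library knows the GB2312 charset. The bytes are written out explicitly so the example does not depend on how the script file is saved.

```php
<?php
$gb = "\xD6\xD0\xB9\xFA"; // "中国" as GB2312 bytes (2 bytes per character)

// GB2312 -> UTF-8: the same two characters now take 3 bytes each.
$utf8 = iconv('GB2312', 'UTF-8', $gb);

// UTF-8 -> GB2312 restores the original bytes.
$back = iconv('UTF-8', 'GB2312', $utf8);

var_dump(bin2hex($utf8));  // hex dump of the UTF-8 bytes
var_dump($back === $gb);   // did the round trip preserve the bytes?
```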
URL encoding
urlencode: all non-alphanumeric characters except -_. are replaced with a percent sign (%) followed by two hexadecimal digits, and spaces are encoded as plus signs (+). This is the same encoding used for WWW form POST data, that is, the application/x-www-form-urlencoded media type.
Note, however, that only individual parts of a URL (such as a query parameter value) should be encoded. Otherwise, the colon and slashes in the URL will be escaped as well.
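The point about encoding only part of the URL can be sketched like this. The base URL and the wd parameter are taken from the Baidu example in this article; the variable names are illustrative.

```php
<?php
$base    = 'http://www.baidu.com/s';
$keyword = "\xE4\xB8\xAD\xE5\x9B\xBD"; // "中国" as UTF-8 bytes

// Encode only the parameter value: the URL structure stays intact.
$good = $base . '?wd=' . urlencode($keyword);

// Encoding the whole URL also escapes ":", "/", "?" and "=".
$bad = urlencode($base . '?wd=' . $keyword);

echo $good, "\n";
echo $bad, "\n";
```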
URL encoding of Chinese text generally comes in two forms: the traditional one based on GB2312 bytes, and the other based on UTF-8 bytes. For example:
$url = '中国';
echo urlencode($url);
// UTF-8: %E4%B8%AD%E5%9B%BD
// GB2312: %D6%D0%B9%FA
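Both results can be reproduced from a single script regardless of how the file is saved, by starting from explicit UTF-8 bytes and converting with iconv before encoding. This is a sketch that assumes the iconv extension is available.

```php
<?php
$utf8 = "\xE4\xB8\xAD\xE5\x9B\xBD";          // "中国" as UTF-8 bytes
$gb   = iconv('UTF-8', 'GB2312', $utf8);      // the same text as GB2312 bytes

// urlencode works on bytes, so the result depends on the string's encoding.
$encodedUtf8 = urlencode($utf8);
$encodedGb   = urlencode($gb);

echo $encodedUtf8, "\n";
echo $encodedGb, "\n";
```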
For example, open Baidu in a browser and search for "中国" ("China"). The address bar shows: http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=UTF-8&rsv_sug3=16&rsv_sug=0&rsv_sug4=302&rsv_sug1=11&inputT=22928. Here "中国" has automatically been converted to %E4%B8%AD%E5%9B%BD.
The difference between urlencode and rawurlencode: urlencode encodes a space as the plus sign "+", while rawurlencode encodes a space as "%20".
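A quick check of the space handling, using an arbitrary sample string:

```php
<?php
$text = 'hello world';

$form = urlencode($text);    // form-style encoding: space becomes "+"
$raw  = rawurlencode($text); // space becomes "%20"

echo $form, "\n";
echo $raw, "\n";
```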
URL decoding: urldecode and rawurldecode
1. To decode, use the matching function, urldecode() or rawurldecode(). rawurldecode() does not decode the plus sign ('+') into a space, while urldecode() does.
2. urldecode() and rawurldecode() only restore the original bytes; they do not convert between character sets. If the URL contains Chinese characters encoded in something other than UTF-8 and the result is displayed as UTF-8, the decoded string must be converted.
Save the PHP file below in GB2312 encoding. When the output is viewed as UTF-8, the first value is garbled and the second is displayed correctly.
$url = '中国';
echo $a = urldecode(urlencode($url)), '';
echo iconv('gb2312', 'utf-8', $a);
Output:
�� 中国
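The decoding behaviour can be verified without saving the file in GB2312 by starting from the percent-encoded form. This is a sketch that assumes the iconv extension; the sample inputs are illustrative.

```php
<?php
// Plus sign handling differs between the two decoders.
$plusDecoded = urldecode('a+b');    // "+" is turned into a space
$plusRaw     = rawurldecode('a+b'); // "+" is left untouched

// Decoding restores raw bytes; here they are GB2312, not UTF-8.
$decoded = urldecode('%D6%D0%B9%FA');

// Convert explicitly before showing the text on a UTF-8 page.
$asUtf8 = iconv('GB2312', 'UTF-8', $decoded);

var_dump($plusDecoded, $plusRaw);
var_dump(bin2hex($decoded)); // still the GB2312 bytes
var_dump(bin2hex($asUtf8));  // the UTF-8 bytes of "中国"
```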