After urlencoding Chinese text, each %xx represents one byte, is that right?
So the result of urlencode('中') is %xx%xx%xx (with UTF-8 encoding)?
Reply content:
Yes. The UTF-8 specification arose to address the problems that Unicode takes up too much memory and is hard to extend.
For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits hold the symbol's Unicode code point. So for the English alphabet, the UTF-8 encoding and the ASCII code are the same.
For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each subsequent byte are set to 10. All the remaining bits hold the symbol's Unicode code point.
That is to say, the UTF-8 encoding is variable-length.
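The bit rules above can be sketched by hand in Python. This is only an illustration: the function name utf8_encode is made up for this example, it does not validate its input (surrogates or values above U+10FFFF), and in practice str.encode("utf-8") does the job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand (hypothetical helper, no validation)."""
    if code_point < 0x80:
        # Single byte: leading 0 bit, then the 7 bits of the code point.
        return bytes([code_point])
    # Choose the byte count n from the size of the code point.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4
    # Subsequent bytes: 10xxxxxx, filled starting from the lowest 6 bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # First byte: n one-bits, then a zero bit, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    first = prefix | code_point
    return bytes([first] + tail[::-1])


print(utf8_encode(ord("A")).hex().upper())   # 41
print(utf8_encode(ord("中")).hex().upper())  # E4B8AD
```

The ASCII letter stays one byte, while 中 (U+4E2D) needs three, which is the variable-length behaviour described above.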
The UTF-8 encoding of the character 中 is E4B8AD, so the corresponding urlencode result is %E4%B8%AD.
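This is easy to check. The sketch below uses Python's urllib.parse.quote as a stand-in for urlencode (an assumption; the thread does not say which language is in use), and the variable names are only for illustration.

```python
from urllib.parse import quote

# The raw UTF-8 bytes of the character, shown as hexadecimal.
utf8_bytes = "中".encode("utf-8")
hex_form = utf8_bytes.hex().upper()

# quote() percent-encodes using UTF-8 by default.
encoded = quote("中")

print(len(utf8_bytes))  # 3
print(hex_form)         # E4B8AD
print(encoded)          # %E4%B8%AD
```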
Yes. URL encoding simply writes the bytes of special and non-ASCII characters in hexadecimal, one percent sign followed by two hexadecimal digits per byte. Non-special ASCII characters are left as they are.
The same character takes two bytes if the encoding is GBK and three bytes if it is UTF-8.
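A minimal sketch of both points, assuming Python: the helper percent_encode_bytes is a made-up name that just formats each byte as %XX, and urllib.parse.quote is used to show that the result depends on the character encoding chosen.

```python
from urllib.parse import quote


def percent_encode_bytes(data: bytes) -> str:
    """Write every byte as a percent sign plus two hex digits (hypothetical helper)."""
    return "".join(f"%{b:02X}" for b in data)


gbk_bytes = "中".encode("gbk")      # two bytes in GBK
utf8_bytes = "中".encode("utf-8")   # three bytes in UTF-8

print(percent_encode_bytes(gbk_bytes))   # %D6%D0
print(percent_encode_bytes(utf8_bytes))  # %E4%B8%AD

# quote() gives the same results when told which encoding to use,
# and leaves non-special ASCII characters untouched.
print(quote("中", encoding="gbk"))  # %D6%D0
print(quote("abc-123"))             # abc-123
```

So the same URL-encoded string can only be decoded correctly if the receiver knows which character encoding the sender used.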
Of course, the trouble is that URL encoding is not consistent. For example, some places use + to represent a space and others use %20, so each case has to be looked at individually. The former corresponds to the function urlencode, the latter to rawurlencode. The former is usually used for form data (including the part of the URL after the ?), and the latter is used in the URL path (the part after the host and before the query).
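The difference can be seen with Python's standard library, where quote_plus and quote behave roughly like urlencode and rawurlencode respectively (a rough analogy, not an exact equivalence). The host example.com and the parameter name q are placeholders for illustration.

```python
from urllib.parse import quote, quote_plus, urlencode

text = "hello world"

# Form-style encoding: a space becomes "+".
form_style = quote_plus(text)
# Path-style encoding: a space becomes "%20"; safe="" also escapes "/".
path_style = quote(text, safe="")

print(form_style)  # hello+world
print(path_style)  # hello%20world

# Building a URL: path-style for the path segment, form-style for the query.
query = urlencode({"q": text})
url = "https://example.com/" + path_style + "?" + query
print(url)  # https://example.com/hello%20world?q=hello+world
```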