One, Character sets
1) Characters and bytes (Character)
"Character" is the general term for all kinds of letters and symbols, including ones that show up as garbled text. One character corresponds to 1~n bytes, one byte is 8 bits, and each bit is either 0 or 1.
2) Character set (Character set)
A character set is a collection of characters, and different character sets contain different numbers of characters. Common character sets include ASCII, GB2312, and Unicode.
3) Character encoding (Character Encoding)
Encoding converts symbols into binary that the computer can read; decoding converts that binary back into symbols people can read.
Most character sets correspond to a single encoding (for example, the GBK character set uses the GBK encoding), but Unicode has several encodings, including UTF-8, UTF-16, UTF-32, and UTF-7.
The encoding most widely used on web pages today is UTF-8. It uses one to four bytes per character and is a superset of ASCII, so existing ASCII text does not need to be converted.
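The one-to-four-byte range can be checked with a small sketch. It assumes an environment that provides the standard TextEncoder (modern browsers and Node.js); the helper name utf8ByteLength is made up for illustration.

```javascript
// TextEncoder always encodes to UTF-8, so the array length is the byte count.
const encoder = new TextEncoder();

// Hypothetical helper: number of UTF-8 bytes needed for a string.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// One sample per byte length: U+0041, U+00E9, U+4EE3, U+1F600.
const samples = ["A", "\u00e9", "\u4ee3", "\u{1F600}"];
for (const ch of samples) {
  console.log(ch, utf8ByteLength(ch)); // 1, 2, 3 and 4 bytes respectively
}

// ASCII text is byte-for-byte identical in UTF-8.
const asciiBytes = Array.from(encoder.encode("Hi"));
console.log(asciiBytes); // [72, 105]
```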
Two, Number bases in the browser
1) Use decimal and hexadecimal in HTML attributes
In HTML, a decimal character reference is written like "&#56;" and a hexadecimal one like "&#x5a;". The hexadecimal form adds an x after the #, and its digits also include the six letters a~f to represent 10~15.
2) Use decimal and hexadecimal in CSS properties
CSS written inside HTML (for example in a style attribute) accepts the HTML forms above, because HTML decodes them first. In addition, CSS has its own hexadecimal escape, written in the form "\6c".
3) JavaScript encodings
Octal and hexadecimal string escapes can be executed directly through eval; octal is written as "\56" and hexadecimal as "\x5c".
If Chinese characters appear in the code and need to be escaped, only the hexadecimal Unicode escape can be used, written as "\u4ee3\u7801".
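The three escape forms can be compared in a short sketch. The variable names are illustrative only. Octal escapes are rejected in strict mode, so the octal literal is evaluated through an indirect eval here, which runs as non-strict code; this is one way to try it out, not the only one.

```javascript
// Hexadecimal and Unicode escapes are resolved when the literal is parsed.
const backslash = "\x5c";    // U+005C, the backslash character
const word = "\u4ee3\u7801"; // two Chinese characters, U+4EE3 and U+7801

// Calling eval through another name makes it an indirect eval, which is
// evaluated as non-strict global code, so the octal escape is accepted.
const indirectEval = eval;
const dot = indirectEval('"\\56"'); // octal 56 = decimal 46 = "."

console.log(backslash === "\\"); // true
console.log(word.length);        // 2
console.log(dot === ".");        // true
```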
The book "Web front-end hacker technology Disclosure" wraps encoding and decoding in two methods; the specific code can be seen here.
The core code is "str.charCodeAt(index).toString(radix)" for encoding and "String.fromCharCode(parseInt(code, radix))" for decoding, where radix is the number base (for example 8 or 16).
The charCodeAt() method returns an integer from 0 to 65535 that represents the UTF-16 code unit at the given index.
The static String.fromCharCode() method returns a string created from the specified sequence of UTF-16 code units.
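As a minimal sketch of how those two calls fit together (this is not the book's actual code; the function names encodeToRadix and decodeFromRadix are made up for illustration):

```javascript
// Hypothetical helper: turn each UTF-16 code unit into a number string
// in the given radix (for example 8 or 16).
function encodeToRadix(str, radix) {
  const codes = [];
  for (let i = 0; i < str.length; i++) {
    codes.push(str.charCodeAt(i).toString(radix));
  }
  return codes;
}

// Hypothetical helper: reverse of encodeToRadix.
function decodeFromRadix(codes, radix) {
  return codes
    .map((code) => String.fromCharCode(parseInt(code, radix)))
    .join("");
}

const hexCodes = encodeToRadix("Hi", 16);
console.log(hexCodes);                      // ["48", "69"]
console.log(decodeFromRadix(hexCodes, 16)); // "Hi"
console.log(encodeToRadix("\u4ee3\u7801", 16)); // ["4ee3", "7801"]
```

Because both helpers work on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two codes (a surrogate pair) and is reassembled correctly on decoding.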
Encoding and decoding can also be done through the online web page "Monyerjs".
4) HTML automatic decoding mechanism
For example, if the hexadecimal references "&#x0048;&#x0065;&#x006c;&#x006c;&#x006f;" are entered in a web page, they are automatically decoded to "Hello".
The well-known non-breaking space "&nbsp;" works through the same mechanism.
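What the browser does with numeric references can be imitated in a few lines. This is only a rough sketch that handles the decimal and hexadecimal numeric forms; it is not the browser's real parser, it ignores named references, and the function name decodeNumericReferences is made up for illustration.

```javascript
// Hypothetical helper: replace numeric character references with the
// characters they stand for. Out-of-range values are left untouched.
function decodeNumericReferences(html) {
  return html.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, body) => {
    const isHex = body[0] === "x" || body[0] === "X";
    const codePoint = isHex
      ? parseInt(body.slice(1), 16)
      : parseInt(body, 10);
    if (codePoint > 0x10ffff) return match; // not a valid code point
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericReferences("&#x0048;&#x0065;&#x006c;&#x006c;&#x006f;")); // "Hello"
console.log(decodeNumericReferences("&#56;"));  // "8"
console.log(decodeNumericReferences("&#x5a;")); // "Z"
```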
Three, Browser encoding
There are three pairs of functions in JavaScript that encode and decode strings:
escape/unescape, encodeURI/decodeURI, encodeURIComponent/decodeURIComponent.
The main difference between them is which characters are left unencoded.
1) escape does not encode 69 characters
*, +, -, ., /, @, _, 0~9, a~z, A~Z. When encoding Unicode values outside 0~255, escape outputs the %u**** format.
2) encodeURI does not encode 82 characters
!, #, $, &, ', (, ), *, +, ,, -, ., /, :, ;, =, ?, @, _, ~, 0~9, a~z, A~Z
3) encodeURIComponent does not encode 71 characters
!, ', (, ), *, -, ., _, ~, 0~9, a~z, A~Z
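The three counts can be verified by running every ASCII character through each function and counting the ones that come back unchanged. The helper name countUnencoded is made up for illustration; note that escape is a deprecated legacy function, although browsers and Node.js still provide it.

```javascript
// Hypothetical helper: count the ASCII characters (0~127) that the given
// encoding function returns unchanged.
function countUnencoded(encodeFn) {
  let count = 0;
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (encodeFn(ch) === ch) count++;
  }
  return count;
}

console.log(countUnencoded(escape));             // 69
console.log(countUnencoded(encodeURI));          // 82
console.log(countUnencoded(encodeURIComponent)); // 71

// Outside 0~255, escape uses the %u**** format, while the other two
// percent-encode the UTF-8 bytes.
console.log(escape("\u4ee3"));             // "%u4EE3"
console.log(encodeURIComponent("\u4ee3")); // "%E4%BB%A3"
```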