Most of the time we don't pay much attention to character encoding. For Chinese websites we generally use gb2312, gbk, gb18030, or UTF-8. However, if we don't understand how these encodings differ, our choice can introduce defects into a program.

Multi-byte encodings are the ones we use most often. The smallest character set is ASCII, whose codes in hexadecimal are 00-7F. It is also the earliest character set in common use on computers, and it can represent almost everything needed for English text. When we later wanted to represent Chinese characters, it turned out to be far from enough: there are more than 7000 commonly used Chinese characters, while ASCII has only 128 codes, 0-127. So the question became how to build a larger character set while staying compatible with ASCII.

A larger character set means a single byte is no longer enough, so one character has to be described with multiple bytes. The general practice is to make every byte of a multi-byte character greater than 7F, that is, [>7F][>7F]. This keeps multi-byte characters cleanly separated from ASCII while expanding the character set. The gb2312 range, for example, is [0xA1-0xF7][0xA1-0xFE] (with many unassigned positions in between), which guarantees that every byte is above A0 and therefore satisfies the "greater than 7F" rule. From this analysis we can see that gb2312 separates itself from ASCII well.

Now let's look at GBK. It is fully compatible with gb2312 (that is, every character in gb2312 is included in GBK at exactly the same code position), but it contains more than 20 thousand characters. To make room for them, the only option was to extend the ranges downward, below A1 and A0. The actual GBK range is [0x81-0xFE][0x40-0x7E | 0x80-0xFE]. A GBK character consists of two bytes: the first byte is always greater than 7F, but the second byte can fall in 0x40-0x7E, which overlaps the ASCII range.
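The byte ranges above can be checked with a few lines of code. The following is a minimal sketch in Python, used here only for illustration; the helper names in_gb2312_range and in_gbk_range and the constant ASCII_TRAIL_OVERLAP are hypothetical, and the ranges are the ones quoted in this article rather than a complete description of either standard.

```python
def in_gb2312_range(pair: bytes) -> bool:
    """True if a two-byte sequence lies in [0xA1-0xF7][0xA1-0xFE]."""
    lead, trail = pair[0], pair[1]
    return 0xA1 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE


def in_gbk_range(pair: bytes) -> bool:
    """True if a two-byte sequence lies in [0x81-0xFE][0x40-0x7E | 0x80-0xFE]."""
    lead, trail = pair[0], pair[1]
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)


# ASCII characters that a GBK second byte can collide with (0x40 to 0x7E).
ASCII_TRAIL_OVERLAP = bytes(range(0x40, 0x7F)).decode("ascii")

# A gb2312 character keeps the same bytes in GBK, and both bytes are above 0x7F.
zhong = "中".encode("gb2312")
print(zhong == "中".encode("gbk"), all(b > 0x7F for b in zhong))

# The byte pair D5 5C is valid in GBK but outside gb2312,
# and its second byte is an ordinary ASCII value.
pair = b"\xd5\x5c"
print(in_gbk_range(pair), in_gb2312_range(pair), pair[1] <= 0x7F)

print(len(ASCII_TRAIL_OVERLAP), ASCII_TRAIL_OVERLAP)
```

The last line lists every ASCII character that can appear as the second byte of a GBK character, and the backslash is one of them.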
This is the cause of the bug. Let's look at an example! The ASCII characters in the range 0x40-0x7E are "@ A-Z [ \ ] ^ _ ` a-z { | } ~", 63 characters in total. To reproduce the GBK encoding problem, save a PHP file in gbk encoding that assigns a string containing such a character to $a. Running it fails with an error on a simple assignment statement, which means that the string assignment never ended! Haha, many people will probably think this is a PHP bug. However, if we change the string so that the character is followed by an a before the closing quote, it works normally. Isn't that amazing!!

Cause analysis: Files are stored on disk in binary. Whatever characters you save, they are stored as the bytes that the selected character encoding assigns to them. When PHP parses a file, its smallest unit of analysis is the byte. Whether a character is multi-byte or single-byte, in the end everything is processed byte by byte. The GBK encoding of the character in the example is D55C. PHP interprets it byte by byte, and 5C corresponds to the character "\". It is immediately followed by the closing quote, so the quote is escaped, and the error occurs because the string is never closed!

Now the problem is clear. A program that processes text byte by byte will naturally split a multi-byte character into single bytes, which causes many strange problems.

Summary: We have analyzed how multi-byte encodings work and what causes the vulnerability. In fact, many programming languages parse input one byte at a time. When a byte of a multi-byte Chinese character lands on a value with special meaning, strange problems occur, and these can also lead to security vulnerabilities. I will discuss examples of vulnerabilities caused by the GBK encoding defect later. Stay tuned!
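To make the cause analysis concrete, here is a minimal sketch in Python of a scanner that looks for the end of a quoted string one byte at a time. It is a deliberately simplified, hypothetical model and not PHP's real lexer: the function name find_string_end, the gbk_aware flag, and the choice of double quotes are all assumptions made for illustration.

```python
BACKSLASH = 0x5C  # "\"
QUOTE = 0x22      # '"'


def find_string_end(source: bytes, gbk_aware: bool = False) -> int:
    """Return the index of the closing quote of the literal that starts at
    source[0], or -1 if the literal is never closed."""
    i = 1  # skip the opening quote
    while i < len(source):
        byte = source[i]
        if gbk_aware and byte >= 0x81:
            i += 2  # treat lead byte and trail byte as one character
        elif byte == BACKSLASH:
            i += 2  # the byte after a backslash is escaped, skip it
        elif byte == QUOTE:
            return i
        else:
            i += 1
    return -1


ch = b"\xd5\x5c"             # one GBK character whose second byte is 0x5C
print(len(ch.decode("gbk")))  # number of characters when decoded as GBK

broken = b'"' + ch + b'"'    # character directly before the closing quote
works = b'"' + ch + b'a"'    # same character followed by the letter a

print(find_string_end(broken))                  # byte-wise scan finds no end
print(find_string_end(works))                   # byte-wise scan finds the quote
print(find_string_end(broken, gbk_aware=True))  # two-byte-aware scan finds it
```

In the byte-wise scan the 5C byte swallows the closing quote of the first string, so the literal never ends. In the second string it swallows the a instead and the quote is found. The gbk_aware variant is only there to show that the problem disappears once the two bytes are consumed together as one character.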