Yesterday I listened to a talk about the store's architecture, and an interesting problem came up. Because the Chinese-language site uses GBK encoding throughout, searching can produce false matches: strstr is used to find substrings so that tags can be inserted around them, and when a match is wrong the front end ends up showing garbled text.
The GBK encoding of "夏新" (Xia Xin) is "0xcf 0xc4 0xd0 0xc2", and the GBK encoding of "男" ("male") is "0xc4 0xd0", which exactly matches the two middle bytes. A byte-level search for "男" therefore hits inside "夏新", and if more Chinese characters follow "夏新", they get corrupted as well once the tags are inserted. With UTF-8 there is no such problem: a common Chinese character takes three bytes in UTF-8 (1110xxxx 10xxxxxx 10xxxxxx), so the high hex digit of the first byte is always 'E', while the next two bytes start with the bits 10, so their high hex digit is at most 'B'. A leading byte can never be mistaken for a continuation byte, which means a match can only begin on a character boundary and this kind of false match cannot occur.
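The false match is easy to reproduce. The following is a minimal sketch in Python, where a search on bytes stands in for C's strstr; the sample string "夏新手机" and the `<b>` tag are assumptions chosen for illustration, not taken from the system discussed in the talk.

```python
# A byte-level search on encoded text, standing in for strstr.
text = "夏新手机"   # illustrative sample: "夏新" followed by more Chinese characters
needle = "男"

# GBK: the needle's bytes appear across the boundary between 夏 and 新.
gbk_text = text.encode("gbk")
gbk_needle = needle.encode("gbk")
gbk_pos = gbk_text.find(gbk_needle)
print(gbk_text[:4].hex(), gbk_needle.hex(), gbk_pos)

# Wrapping the "match" in a tag splits 夏 and 新 in half, so the result
# no longer decodes back to the original characters.
tagged = (gbk_text[:gbk_pos] + b"<b>" + gbk_needle + b"</b>"
          + gbk_text[gbk_pos + len(gbk_needle):])
garbled = tagged.decode("gbk", errors="replace")
print(garbled)

# UTF-8: lead bytes (0xE?) and continuation bytes (0x80-0xBF) never overlap,
# so the same search finds nothing.
utf8_pos = text.encode("utf-8").find(needle.encode("utf-8"))
print(utf8_pos)
```

In GBK the search reports a hit at byte offset 1, in the middle of "夏", while the UTF-8 search returns -1.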
In the early days, bandwidth and storage were limited, so many websites adopted GBK encoding; they had little choice. Linux supports UTF-8 very well, but now I have to switch encodings every day, and running into garbled text in screen is a real nuisance: the window simply freezes. Sometimes it recovers after a while, and sometimes the only option is to kill the window. Hadoop's support for GBK is poor as well. When writing an MR job, I often have to convert the input to UTF-8 first, and then convert back to GBK for the output when the task runs.
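The round trip described for MR jobs can be sketched as follows. This is only an illustration of the transcoding step in plain Python, not Hadoop code; the helper names `gbk_to_utf8` and `utf8_to_gbk` and the sample record are hypothetical.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Decode GBK input bytes and re-encode them as UTF-8 for processing."""
    return raw.decode("gbk").encode("utf-8")


def utf8_to_gbk(raw: bytes) -> bytes:
    """Convert processed UTF-8 bytes back to GBK for the final output."""
    return raw.decode("utf-8").encode("gbk")


# Hypothetical input record as it would arrive from a GBK-encoded source.
record_gbk = "夏新手机".encode("gbk")

# Work on UTF-8 inside the job, where byte-level matching is safe.
record_utf8 = gbk_to_utf8(record_gbk)
false_hit = "男".encode("utf-8") in record_utf8

# Convert back only when writing the result.
output_gbk = utf8_to_gbk(record_utf8)
print(false_hit, output_gbk == record_gbk)
```

Doing the matching while the data is in UTF-8 avoids the false hit, and the conversion back to GBK reproduces the original bytes for any text that GBK can represent.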