Yesterday we found that after decoding with HtmlDecode(), "&nbsp;" does not come out as an ordinary half-width space (ASCII 0x20) but as a question mark "?" (ASCII 0x3F). Strangely, only the spaces at the start of a line are affected; if Chinese characters come first, the spaces stay spaces. Stranger still, if Trim() is called directly after HtmlDecode(), the question mark disappears, even though Trim() normally removes only whitespace and never question marks.
When the problem appeared I was writing the decoded content to a database, so I kept assuming it was a character set or encoding mismatch between SQL ***** and the application. Only after a long time did I discover that the content was already a question mark before it was sent to SQL.
I searched for a long time without finding a proper fix, so I fell back on a makeshift workaround:
1. Replace &nbsp; with ordinary spaces before decoding.
2. Call Trim() directly after HtmlDecode().
Obviously this is not a good solution: when the page is displayed in the browser, the spaces disappear.
Recently I looked into the problem carefully and found the key: the encoding. It happens when the encoding in use is UTF-8.
The root of the problem is a special character, the non-breaking space (Unicode U+00A0), which UTF-8 encodes as the two bytes 0xC2 0xA0. As a character it displays as a space, just like the normal half-width space (ASCII 0x20); the only difference is that its width is not collapsed, so it is used for web page layout (such as indenting the first line of a paragraph). Legacy encodings such as gb2312 do not have this character. Therefore, if you perform a simple encoding conversion to gb2312, the character is replaced with a question mark (ASCII 0x3F). If you then write to a database or a file, the question mark is written as-is. Of course, you could replace the question marks with spaces afterwards, but that would also destroy the genuine question marks in the original text.
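To make the byte values concrete, here is a minimal sketch that encodes the character and prints the bytes as hex. The class and method names are made up for illustration, and Encoding.ASCII is used only as a stand-in for an encoding that lacks U+00A0 (an assumption; the post itself is about gb2312, which may need extra setup on some runtimes).

```csharp
using System;
using System.Text;

public static class NbspEncodingDemo
{
    // U+00A0 NO-BREAK SPACE, the character that &nbsp; decodes to.
    public const string Nbsp = "\u00A0";

    // Encodes the text with the given encoding and returns the bytes
    // as hyphen-separated hex.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

With this helper, ToHex(NbspEncodingDemo.Nbsp, Encoding.UTF8) gives "C2-A0", while ToHex(NbspEncodingDemo.Nbsp, Encoding.ASCII) gives "3F", the question mark, because the ASCII encoder substitutes "?" for anything it cannot represent.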
When HtmlDecode() runs under UTF-8, a &nbsp; at the start of a line is automatically converted to this special space, perhaps on the reasoning that a space placed at the start must be there for layout. Until the string is converted to another encoding, the special space is treated as whitespace just like a normal half-width space, which is why Trim() removes it.
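The Trim() behaviour can be checked with a small sketch like the one below. It uses System.Net.WebUtility.HtmlDecode as a stand-in for whichever HtmlDecode() the original application called (an assumption), and the class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class NbspTrimDemo
{
    // Decodes an HTML fragment; &nbsp; becomes U+00A0, not U+0020.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    // True when the first character is the no-break space U+00A0.
    public static bool StartsWithNbsp(string text)
    {
        return text.Length > 0 && text[0] == '\u00A0';
    }
}
```

Here StartsWithNbsp(Decode("&nbsp;abc")) is true, and Decode("&nbsp;abc").Trim() returns "abc", because .NET classifies U+00A0 as white space (char.IsWhiteSpace('\u00A0') is true).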
So two things combine to cause this problem: first, decoding under UTF-8 produces this character; second, the character is used directly in the web page for layout.
Now that the specific cause is known, there is a proper solution: after obtaining the UTF-8 string, first replace this special space with an ordinary space or, if it is an HTML string, with &nbsp;. The C# code is as follows:
// requires: using System.Text;
byte[] space = new byte[] { 0xC2, 0xA0 };  // UTF-8 bytes of the special space
string utfSpace = Encoding.GetEncoding("UTF-8").GetString(space);
htmlStr = htmlStr.Replace(utfSpace, "&nbsp;");
This way the erroneous question marks never appear, and there is no need to replace question marks with spaces afterwards, so the genuine question marks in the original string are preserved.
It must be emphasized that the replacement has to happen before any encoding conversion, while the string is still UTF-8. Once it has been converted to another encoding, the error is irreversible: there is no way to tell an erroneous question mark from a genuine one.
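The ordering constraint can be illustrated with a short sketch. The helper names are hypothetical, and the legacy encoding is passed in as a parameter; Encoding.ASCII is a convenient stand-in for any encoding without U+00A0 (an assumption, not the gb2312 setup from the original application).

```csharp
using System;
using System.Text;

public static class NbspOrderDemo
{
    // Round-trips text through a legacy encoding, roughly what happens
    // when the text is written to a non-Unicode file or column and read back.
    public static string RoundTrip(string text, Encoding legacy)
    {
        return legacy.GetString(legacy.GetBytes(text));
    }

    // Replaces U+00A0 with an ordinary space first, then converts,
    // so genuine question marks are the only ones left in the result.
    public static string ReplaceThenConvert(string text, Encoding legacy)
    {
        return RoundTrip(text.Replace("\u00A0", " "), legacy);
    }
}
```

For the input "\u00A0Why?", RoundTrip with Encoding.ASCII yields "?Why?", where the leading question mark can no longer be told apart from the genuine one, while ReplaceThenConvert yields " Why?".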
Address: http://www.jiaonan.TV/html/blog/1/29483.htm