Java web scraping: spaces in UTF-8 web pages turn into garbled question marks

Source: Internet
Author: User

http://blog.csdn.net/bob007/article/details/27098875

After converting with this method, the list view looked normal, but the text box on the detail page showed a literal &nbsp;, so I had to filter out these spaces altogether:

html = html.replaceAll(utfspace, "&nbsp;"); was changed to html = html.replaceAll(utfspace, "");

-------------- The following is reposted --------------

Yesterday I found that after decoding with HtmlDecode(), "&nbsp;" is not decoded as a half-width space (ASCII 0x20) but becomes a half-width question mark "?" (ASCII 0x3F). Oddly enough, only the spaces at the start of each line are affected; when other characters precede it, the space comes out as a normal space. Stranger still, the question mark disappears if you call Trim() directly on the result of HtmlDecode(). Normally Trim() removes only whitespace, never a question mark.
When this happened, I was writing the decoded content to the database, so I kept assuming it was a character-set or encoding problem between sql****** and the application. After a long time spent on it, I finally discovered that the content already contained the question mark before it reached sql******.

Reposted from http://www.jiaonan.tv/html/blog/1/29483.htm

I searched for a long time without finding a real solution, so I could only fall back on makeshift workarounds:

1. Before decoding, replace &nbsp; with an ordinary space.

2. After decoding, call Trim() directly on the result.

Obviously, neither is a good approach: when the page is displayed in a browser, the spaces disappear.

Recently I looked into this problem seriously and found that the key is the encoding: it happens when the encoding in use is UTF-8.

The root of the problem is a special character that UTF-8 encodes as "0xC2 0xA0" (the no-break space, U+00A0). When rendered it looks like a space, just like the ordinary half-width space (ASCII 0x20); the only difference is that browsers do not collapse its width, so it is mostly used for web page layout (first-line indentation and the like). Some other encodings, such as GB2312, have no such character, so if you simply convert the encoding, the character is replaced with a question mark (ASCII 0x3F) in the resulting GB2312 string. If you then write to a database or a file, the question mark is what gets written. There is of course a makeshift fix, replacing question marks with spaces, but that also destroys the question marks that were there originally.
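To see the substitution described above, the following minimal Java sketch prints the bytes produced when the special space is encoded in different charsets. The class and method names are hypothetical. US-ASCII is used as a stand-in for an encoding that lacks the character because every Java runtime provides it; GB2312 is tried only if the runtime happens to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspEncodingDemo {
    /** Encodes text in the given charset and returns the bytes as hex, e.g. "C2 A0". */
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // getBytes(Charset) replaces unmappable characters with the
        // charset's default replacement byte instead of throwing.
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the special space discussed above

        System.out.println(hex(nbsp, StandardCharsets.UTF_8));    // C2 A0
        System.out.println(hex(nbsp, StandardCharsets.US_ASCII)); // 3F, i.e. '?'

        // GB2312 is an optional charset, so check before using it.
        if (Charset.isSupported("GB2312")) {
            System.out.println(hex(nbsp, Charset.forName("GB2312")));
        }
    }
}
```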

When HtmlDecode is used with UTF-8, a &nbsp; at the beginning of a line is automatically converted into this special space, perhaps because a space placed at the very start is assumed to be there for layout. Until the string is converted to another encoding, this special space is treated much like an ordinary half-width space and is even removed by Trim().
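A note for Java readers: the Trim() behaviour described above is the one observed in .NET. Java's String.trim() only strips characters with code points up to U+0020, and Character.isWhitespace() explicitly excludes U+00A0, so in Java the special space survives trim(). The sketch below shows this and a possible helper that strips both kinds of space; the class and method names are hypothetical.

```java
public class NbspTrimDemo {
    private static final char NBSP = '\u00A0';

    /** Treats the special space like ordinary whitespace. */
    private static boolean isBlank(char c) {
        return c == NBSP || Character.isWhitespace(c);
    }

    /** Strips leading and trailing whitespace, including U+00A0. */
    public static String trimWithNbsp(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) {
            start++;
        }
        while (end > start && isBlank(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "\u00A0\u00A0indented text ";

        // String.trim() removes the trailing 0x20 but keeps the leading U+00A0.
        System.out.println(line.trim().length());         // 15
        // The helper removes both.
        System.out.println(trimWithNbsp(line).length());  // 13
    }
}
```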

Therefore, two things combine to cause this problem: first, decoding under UTF-8 produces this character; second, the character is used directly for page layout.


Once the specific cause is known, there is a proper solution: after obtaining the UTF-8 string, replace this special space with an ordinary space or, if it is an HTML string, preferably with &nbsp;. The C# code is as follows:


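// requires: using System.Text;
byte[] space = new byte[] { 0xC2, 0xA0 };                        // UTF-8 bytes of the special space
string utfspace = Encoding.GetEncoding("UTF-8").GetString(space);
htmlstr = htmlstr.Replace(utfspace, "&nbsp;");                   // restore the HTML entity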

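Java version:

// requires: import java.nio.charset.StandardCharsets;
byte[] bytes = {(byte) 0xC2, (byte) 0xA0};                       // UTF-8 bytes of the special space
String utfspace = new String(bytes, StandardCharsets.UTF_8);
html = html.replace(utfspace, "&nbsp;");                         // restore the HTML entity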
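The Java fragment above can be wrapped into a runnable sketch as follows. The class name NbspReplacer and the method names are hypothetical, chosen only for illustration; the two methods correspond to the two replacement targets suggested above (the entity for HTML strings, an ordinary space otherwise).

```java
import java.nio.charset.StandardCharsets;

public class NbspReplacer {
    // The special space: bytes 0xC2 0xA0 decoded as UTF-8, i.e. U+00A0.
    private static final String UTF_SPACE =
            new String(new byte[] {(byte) 0xC2, (byte) 0xA0}, StandardCharsets.UTF_8);

    /** For HTML strings: turn the special space back into the entity. */
    public static String toEntity(String html) {
        return html.replace(UTF_SPACE, "&nbsp;");
    }

    /** For plain text: turn the special space into an ordinary space (0x20). */
    public static String toPlainSpace(String text) {
        return text.replace(UTF_SPACE, " ");
    }

    public static void main(String[] args) {
        String decoded = "\u00A0\u00A0Is this indented?";

        System.out.println(toEntity(decoded));     // &nbsp;&nbsp;Is this indented?
        System.out.println(toPlainSpace(decoded)); // two ordinary spaces, then the text
    }
}
```

String.replace() is used here rather than replaceAll() because the search string is a literal, not a regular expression; the genuine question mark in the sample text is left untouched either way.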

Done this way, genuine question marks in the string are not replaced with spaces, the stray question marks never appear, and the original string is preserved as it was.
It is important to emphasize that the replacement must happen before any encoding conversion, so keep working in UTF-8 until it is done. Once the string has been converted to another encoding, the error is irreversible: there is no way to tell these erroneous question marks apart from genuine ones.
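Why the order matters can be illustrated with a small Java sketch. It assumes US-ASCII as a stand-in for any target encoding that lacks the special space (GB2312 in the text above), since US-ASCII is available on every Java runtime; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplaceBeforeConvert {
    /** Simulates converting text to another encoding and reading it back. */
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    /** Replaces the special space first, then converts. */
    public static String replaceThenConvert(String text, Charset charset) {
        return roundTrip(text.replace('\u00A0', ' '), charset);
    }

    public static void main(String[] args) {
        Charset target = StandardCharsets.US_ASCII; // stand-in for GB2312
        String text = "\u00A0Why?";

        // Converting first: the special space has already become '?',
        // indistinguishable from the genuine one at the end.
        System.out.println(roundTrip(text, target));          // ?Why?

        // Replacing first: only the genuine question mark remains.
        System.out.println(replaceThenConvert(text, target)); // " Why?"
    }
}
```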
