[Reprinted] The debug version of the isspace function has problems with Chinese text

Source: Internet
Author: User

From: http://www.cppblog.com/luonjtu/archive/2009/03/12/76332.html

 

The debug version of the isspace function in VC 2005 SP1 has problems handling Chinese text.

Today I ran into a strange problem. Some code written by someone else was originally developed under GCC and then ported to VC. When run under the debugger, it always failed with an assertion. Reading the code showed that the failure occurs in a trim function. The trim function takes a char* string and removes leading and trailing spaces, tabs, and other whitespace characters, using isspace to decide whether a character is whitespace. In principle, both GBK and UTF-8 are ASCII-compatible when stored in a char* string, so isspace should not misbehave. In practice, whenever the string contains Chinese characters, whether encoded as GBK or UTF-8, isspace triggers an assertion failure. For ease of discussion, the relevant code is extracted below:

// Fragment; needs <ctype.h>, <stdio.h> and <string.h>.
char* lpszBuffer = (char*)"高";
int nLen = (int)strlen(lpszBuffer);
for (int i = 0; i < nLen; i++)
{
    printf("0x%x %d\n", lpszBuffer[i], isspace(lpszBuffer[i]));
}

This code triggers an assertion failure in a debug build. With no other option, I stepped into the C runtime library. isspace is implemented as follows (in the file "_ctype.c"):

extern __inline int (__cdecl isspace) (
        int c
        )
{
    if (__locale_changed == 0)
    {
        return __fast_ch_check(c, _SPACE);
    }
    else
    {
        return (_isspace_l)(c, NULL);
    }
}

Tracing showed that __locale_changed is 0, so the first branch is taken and __fast_ch_check is called. This is actually a macro, and it ends up in the _chvalidator function, which is implemented as follows:

extern "C" int __cdecl _chvalidator(
        int c,
        int mask
        )
{
    _ASSERTE((unsigned)(c + 1) <= 256);
    return _chvalidator_l(NULL, c, mask);
}

The error occurs on the first line of the function, "_ASSERTE((unsigned)(c + 1) <= 256);".

The code reveals the cause. The GBK encoding of the character 高 ("high") is the two bytes B8 DF, and isspace is called on one byte at a time. However, both isspace and _chvalidator take an int parameter, so each char is converted to int. In VC, plain char is signed by default, so the byte B8 becomes a negative int (-72). The assertion then casts c + 1 to unsigned, which turns it into a very large positive number, and the assertion naturally fails.
Why is there no error under GCC? I have not looked at the isspace implementation there, but char is unsigned by default in my GCC environment, so even if isspace were implemented the same way as above, there would be no problem.
In addition, isspace is implemented differently in Win32 release builds and on WinCE, and the problem does not occur in those environments.
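
The conversion described above can be reproduced without the CRT. The following is only a minimal sketch: the function names are hypothetical, and passes_range_check merely copies the expression from the quoted assertion rather than calling anything in the runtime library.

```cpp
#include <cstdio>

// Same expression as the quoted assertion: it accepts EOF (-1) and 0..255.
bool passes_range_check(int c)
{
    return static_cast<unsigned>(c + 1) <= 256;
}

// Widen a byte the way a platform with signed plain char does (sign extension).
// Converting 0xB8 to signed char is implementation-defined before C++20,
// but yields -72 on the usual two's-complement platforms.
int widen_as_signed(unsigned char byte)
{
    return static_cast<signed char>(byte);
}

// Widen a byte the way a platform with unsigned plain char does (zero extension).
int widen_as_unsigned(unsigned char byte)
{
    return byte;
}

// Print both widenings of one byte and whether each passes the range check.
void print_widening(unsigned char byte)
{
    int s = widen_as_signed(byte);
    int u = widen_as_unsigned(byte);
    std::printf("byte 0x%02X: signed -> %d (%s), unsigned -> %d (%s)\n",
                static_cast<unsigned>(byte),
                s, passes_range_check(s) ? "passes" : "fails",
                u, passes_range_check(u) ? "passes" : "fails");
}
```

Calling print_widening(0xB8) shows the two cases side by side: the sign-extended value -72 fails the check, while the zero-extended value 184 passes it.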

With the cause found, what remains is how to fix it.
Start with the simplest approach. Since the problem arises in the conversion, it is enough to make plain char unsigned by default, and VC provides a compiler option for this (/J). However, this has two drawbacks: first, anyone who uses the code has to change their compiler options; second, changing the compiler options for an entire project may adversely affect other code.
Since that approach is not ideal, consider alternatives. isspace chooses its code path based on the __locale_changed variable, and a search of the entire CRT source shows that __locale_changed is modified only in the setlocale function. In the end, I changed the code to the following:

// Fragment; additionally needs <locale.h> for setlocale.
#if defined(WIN32) && defined(_DEBUG)
char* locale = setlocale(LC_ALL, ".OCP");
#endif
char* lpszBuffer = (char*)"高";
int nLen = (int)strlen(lpszBuffer);
for (int i = 0; i < nLen; i++)
{
    printf("0x%x %d\n", lpszBuffer[i], isspace(lpszBuffer[i]));
}

With this change, the debug build no longer has the problem, though some hidden risks may remain. I will deal with those when they come up and get on with the work at hand for now.
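
A commonly used alternative, which is not the approach taken in this post, is to convert each byte to unsigned char before passing it to isspace. The C standard requires the argument to be EOF or a value representable as unsigned char, so this keeps the argument in range regardless of the compiler's default char signedness. The sketch below is only an illustration: is_space_byte and trim_copy are hypothetical names, and trim_copy stands in for the original trim function, whose source is not shown here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Classify one byte safely: the cast keeps the argument within 0..255,
// which is the range isspace is defined for (besides EOF).
inline bool is_space_byte(char ch)
{
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}

// Hypothetical trim helper: returns a copy of s without leading and
// trailing whitespace bytes. Bytes of multibyte characters are left alone
// as long as the current C locale does not classify them as whitespace.
std::string trim_copy(const std::string& s)
{
    std::size_t begin = 0;
    std::size_t end = s.size();
    while (begin < end && is_space_byte(s[begin]))
    {
        ++begin;
    }
    while (end > begin && is_space_byte(s[end - 1]))
    {
        --end;
    }
    return s.substr(begin, end - begin);
}
```

With the default "C" locale, trim_copy("  \t\xB8\xDF \n") returns only the two bytes B8 DF, and no negative value ever reaches isspace.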

 

Featured comment:

If I understand the error correctly, you are using isspace with an ASCII locale to test for whitespace in a GBK-encoded string, right? If so, this is not a VC problem but a usage problem.

For C++, isspace(ch, loc) should be used. In this version, loc is a variable of type std::locale. If you want to test for whitespace in GBK, make loc the GBK locale and the function will behave correctly.
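
As a rough illustration of the two-argument form mentioned in this comment, the sketch below wraps std::isspace from <locale>. The helper name is_space_in is hypothetical. The name of a GBK locale is platform-specific and is not shown here, so the usage described afterwards relies only on the classic "C" locale.

```cpp
#include <locale>

// Hypothetical helper: classify one char using an explicitly supplied
// locale instead of the global C locale used by the one-argument isspace.
bool is_space_in(char ch, const std::locale& loc)
{
    return std::isspace(ch, loc);
}
```

For example, is_space_in(' ', std::locale::classic()) is true, while is_space_in(static_cast<char>(0xB8), std::locale::classic()) is false. This form takes a char directly, so no conversion to int is involved at the call site.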

You are currently using the C function isspace(ch), which uses the default global locale. You can set this global locale to GBK to solve the problem. In short, calling isspace under the default ASCII locale to decide whether a byte of a GBK-encoded string is whitespace is logically incorrect.
