Filter non-GBK characters

Source: Internet
Author: User

Here is one of Baidu's written-test questions.

A string is known to consist of GBK-encoded Chinese characters mixed with ANSI-encoded digits and letters. Write a C function that removes all ANSI-encoded digits and letters (both uppercase and lowercase) from it. The result must be returned in the original string.
Function interface: int filter_ansi(char *gbk_string).
Note: The GBK encoding range of Chinese characters is 0x8140-0xFEFE.

The idea behind this question is similar to deleting specific characters from a string. It is actually simpler, because there is no filtering rule to pass in (in the referenced blog post, a second parameter acts as the filtering rule).

GBK and GB2312 are the encodings commonly used for Chinese characters. To distinguish a Chinese character from an ANSI character, the highest bit of its first byte is set to 1 (in GB2312, the highest bit of both bytes is 1).

The code is as follows:

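#include <stdio.h>

/* Remove every single-byte (ANSI) character from gbk, in place. */
int del_ansic(char *gbk)
{
    char *first = gbk;  /* write position */
    char *last = gbk;   /* read position */

    while (*last)
    {
        /* A Chinese character takes two bytes, and the high bit of its first byte is 1.
           The last[1] check avoids copying past the terminator on truncated input. */
        if ((*last & 0x80) && last[1])
        {
            *first++ = *last++;
            *first++ = *last++;
        } else
            ++last;
    }
    *first = '\0';
    return 0;
}

int main()
{
    /* This literal is expected to hold GBK-encoded Chinese characters mixed with ASCII letters. */
    char gbk_str[] = "the Chinese Bak has a profound and profound supername, so we need to be down to TER Baidu ";
    del_ansic(gbk_str);
    printf("GBK %s\n", gbk_str);
    return 0;
}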

However, this code does not work for some inputs. For example, the string "quiet s" is not converted correctly, and testing suggests that the encoding of the character "fixed" is the problem. If you are interested, try using the GBK encoding range given in the question: add a ch variable and test it against that range.
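The range test suggested above could be sketched as follows. This is only one possible reading of the exercise, not the original author's solution. The helper is_gbk_pair is a hypothetical name. The trail-byte range (0x40-0xFE, excluding 0x7F) comes from the GBK definition rather than from the question, which only gives 0x8140-0xFEFE. Keeping ASCII bytes that are neither letters nor digits, and returning the new length, are assumptions as well.

```c
/* Hypothetical helper: returns 1 if (lead, trail) is a valid GBK
   double-byte code, i.e. lead in 0x81-0xFE and trail in 0x40-0xFE
   except 0x7F. */
static int is_gbk_pair(unsigned char lead, unsigned char trail)
{
    return lead >= 0x81 && lead <= 0xFE &&
           trail >= 0x40 && trail <= 0xFE && trail != 0x7F;
}

/* Removes ASCII letters and digits from gbk_string in place.
   Returns the length in bytes of the filtered string (an assumption;
   the question does not say what the return value means). */
int filter_ansi(char *gbk_string)
{
    char *dst = gbk_string;        /* write position */
    const char *src = gbk_string;  /* read position */

    while (*src) {
        unsigned char ch = (unsigned char)*src;

        /* src[1] is safe to read: *src is not the terminator, so the
           next byte is at worst the terminator itself. */
        if (is_gbk_pair(ch, (unsigned char)src[1])) {
            *dst++ = *src++;       /* copy both bytes of the character */
            *dst++ = *src++;
        } else if ((ch >= '0' && ch <= '9') ||
                   (ch >= 'a' && ch <= 'z') ||
                   (ch >= 'A' && ch <= 'Z')) {
            ++src;                 /* drop the letter or digit */
        } else {
            *dst++ = *src++;       /* keep any other byte, e.g. punctuation */
        }
    }
    *dst = '\0';
    return (int)(dst - gbk_string);
}
```

Checking the second byte against the range matters because a GBK trail byte may fall in 0x40-0x7E, which overlaps ASCII letters. Such a byte must be copied as part of the character rather than tested on its own.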

The following describes the layout of GB2312.

Areas 01-09 contain special symbols.
Areas 16-55 contain first-level Chinese characters, sorted by pinyin.
Areas 56-87 contain second-level Chinese characters, sorted by radical and stroke count.
Each Chinese character and symbol is represented by two bytes. The first byte is called the "high byte", and the second byte is called the "low byte".
The high byte uses 0xA1-0xF7 (the area number 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (the position number 01-94 plus 0xA0).
For example, the character "啊" ("ah") is stored as 0xB0A1 in most programs. (Compare with its location code 1601: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1.)
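The arithmetic above can be written as a small conversion routine. This is a minimal sketch. The function name gb2312_from_location and its error convention are hypothetical, and the accepted ranges (area 1-87, position 1-94) simply follow the description above.

```c
/* Hypothetical helper: converts a GB2312 location code (area 1-87,
   position 1-94) into its two-byte encoding by adding 0xA0 to each part.
   Returns 0 on success, or -1 if either part is out of range. */
int gb2312_from_location(int area, int pos, unsigned char out[2])
{
    if (area < 1 || area > 87 || pos < 1 || pos > 94)
        return -1;
    out[0] = (unsigned char)(0xA0 + area);  /* high byte: 0xA1-0xF7 */
    out[1] = (unsigned char)(0xA0 + pos);   /* low byte:  0xA1-0xFE */
    return 0;
}
```

Calling it with area 16 and position 1 fills out with 0xB0 and 0xA1, matching the example above.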
