Preventing garbled trailing characters in GBK-encoded strings

2012-07-14 wcdj

Problem description:

When a string is copied into a fixed-length buffer, it is truncated if it does not fit. Consider a string that is GBK-encoded and contains Chinese characters: if the truncation point falls in the middle of a double-byte character, the buffer ends with half a character, which shows up as garbled text at its tail. This can cause errors when the data is imported into a database, among other problems.
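
To make the failure concrete, here is a minimal sketch. It assumes the text "你好" is stored as the GBK bytes C4 E3 BA C3. The helper names `naive_truncate` and `ends_with_dangling_lead` are made up for illustration. The first cuts at a fixed byte count, the way a fixed-size buffer does; the second reports whether that cut landed inside a double-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: keep at most max_bytes bytes, ignoring character
// boundaries (this is what a fixed-size buffer copy effectively does).
std::string naive_truncate(const std::string& s, std::size_t max_bytes) {
    return s.substr(0, max_bytes);
}

// Hypothetical helper: walk the bytes the way a GBK decoder would and report
// whether the string ends with a lead byte that has no trail byte.
// Assumption: any byte with the high bit set starts a double-byte character.
bool ends_with_dangling_lead(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            i += 1;                              // single-byte ASCII
        } else {
            if (i + 1 >= s.size()) return true;  // half a character left
            i += 2;                              // complete double-byte character
        }
    }
    return false;
}
```

With the 10-byte string "123abc" followed by C4 E3 BA C3, cutting at 9 bytes leaves the lone lead byte BA at the end, while cutting at 8 or 10 bytes does not.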

Solution:

First, you need to understand the cause of the problem, which is character set encoding. The following links introduce GBK, UTF-8, and Unicode:

GBK http://baike.baidu.com/view/25421.htm
UTF-8 http://baike.baidu.com/view/25412.htm
Unicode http://baike.baidu.com/view/40801.htm

You need to know:

(1) GBK is a double-byte encoding. Its overall code range is 0x8140-0xFEFE: the first (lead) byte is in 0x81-0xFE and the second (trail) byte is in 0x40-0xFE, excluding 0x7F. Chinese characters take two bytes. English characters use their ASCII codes and take a single byte, although the GBK table also contains double-byte (full-width) forms of the English letters, so an English letter can be written in two ways in GBK. To tell the two apart, the highest bit of the lead byte of a double-byte character is always 1, while the highest bit of a single-byte English character is 0. A GBK decoder therefore looks at the highest bit of each byte it starts on: if it is 0, the byte is decoded with the ASCII table; if it is 1, that byte and the next one are decoded together with the GBK table. Note that a trail byte may itself fall in the ASCII range (0x40-0x7E), so the highest bit identifies a lead byte reliably only when scanning from the start of the string.
(2) UTF-8 is a variable-length encoding of Unicode (Unicode itself is sometimes called "Wanguo code", the universal code). The original design used 1 to 6 bytes per character; the current standard (RFC 3629) limits it to 1 to 4 bytes, which is enough to cover code points up to 0x10FFFF.
(3) Unicode is a character encoding scheme developed by international organizations to accommodate all the world's scripts and symbols. Unicode maps characters to the numbers 0-0x10FFFF, so it has 1,114,112 code points; a code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that convert these numbers into program data.
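
The byte ranges in the points above can be turned into small predicates. The following is only a sketch. The function names are hypothetical, the GBK trail-byte check excludes 0x7F (which GBK does not use as a trail byte), and the UTF-8 function assumes the modern 4-byte limit of RFC 3629 rather than the older 6-byte design.

```cpp
#include <cstddef>

// GBK lead byte: 0x81-0xFE.
bool is_gbk_lead(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// GBK trail byte: 0x40-0xFE, excluding 0x7F.
// Note that this range overlaps ASCII (0x40-0x7E).
bool is_gbk_trail(unsigned char b) {
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

// Length in bytes of a UTF-8 sequence, judged from its first byte.
// Returns 0 for a continuation byte or a byte that cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
    if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx
    if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx, up to U+10FFFF
    return 0;
}
```

For example, the character "你" is C4 E3 in GBK (a lead byte followed by a trail byte) and E4 BD A0 in UTF-8 (a three-byte sequence), so the same text has a different length depending on the encoding.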

How to check a file's encoding in Linux:

Method 1: Use the vi editor

We usually write code on Windows and then upload the files to Linux, which can lead to encoding conversion problems. On a Simplified Chinese Windows system the default file encoding is GBK (a superset of GB2312), while Linux generally uses UTF-8.

In Linux, you can use the vi editor to view and convert a file's encoding.
To view the file encoding in vi:
:set fileencoding
To convert the file encoding in vi:
:set fileencoding=utf-8
:set fileencoding=gbk
The file used in this article is GBK-encoded.

PS:
vi has four options related to character encoding:
(1) encoding: the character encoding vi uses internally
(2) fileencoding: the character encoding of the file currently being edited
(3) fileencodings: the list of encodings vi tries, in order, when detecting a file's encoding
(4) termencoding: the character encoding of the terminal vi runs in

For the possible values of these options, see the vi online help: :help encoding-names

In vi, the following two commands show characters in hexadecimal, which helps confirm the current encoding:
(1) ga
Displays the value of the character under the cursor in decimal, hexadecimal, and octal.
(2) :%!xxd
Displays the file in hexadecimal format so that it can be edited.
For example:
:%!xxd      displays the entire file in hexadecimal format
:3!xxd      displays line 3 in hexadecimal format
:%!od       filters the file through od; the output has no text column on the right
Note: After editing, run :%!xxd -r to convert the modified hexadecimal content back. Otherwise, the hexadecimal dump is treated as ordinary text.

Method 2: Use the od command

cat file | od -x
This shows the actual bytes stored in the file, which differ depending on the encoding used.

Method 3: Use the xxd command

xxd file | less

Method 4: Use the hexdump command

hexdump -C file | less
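
The same kind of check can be done from inside a program when debugging string handling. The sketch below is not one of the tools above; `to_hex` is a hypothetical helper that formats the bytes of a string roughly like the hex column those tools print.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: format each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[c >> 4];     // high nibble
        out += digits[c & 0x0F];   // low nibble
    }
    return out;
}
```

Applied to the character "你", this gives "c4 e3" when the bytes are GBK and "e4 bd a0" when they are UTF-8, which is the difference a hex dump of the file reveals.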

Test code

/** Prevent garbled characters (GBK) at the end of the string * gerryyang * 2012-07-14 */# include <stdio. h> # include <string. h ># include <string> Using STD: string;/** function: Calculate the length of a GBK Chinese string to prevent garbled characters at the end of the string * @ para s: header pointer containing GBK encoded strings * @ para ileft: the buffer size that can be used outside the function * @ para RET: function returns the length of string s to be intercepted */INT gbksubstring (const char * s, int ileft) {int Len = 0, I = 0; if (S = NULL | * s = 0 | ileft <= 0) Return (0); While (* s) {If (* S & 0x80) = 0) {I + +; S ++; Len ++;} else {If (* (S + 1) = 0) break; I + = 2; S + = 2; len + = 2;} if (I = ileft) break; else if (I> ileft) {len-= 2; break ;}} return (LEN );} int main (INT argc, char ** argv) {char szbuf [10] = {0}; string STR = "123abc hello "; /** [1] Chinese truncation not processed */snprintf (szbuf, sizeof (szbuf), "% s", str. c_str (); printf ("szbuf: % s \ n", szbuf);/** [2] handling possible truncation of Chinese characters */memset (szbuf, 0x0, sizeof (szbuf); In T ibufleftlen = sizeof (szbuf)-1; // calculate the valid length as igbkvalidlenint igbkvalidlen = gbksubstring (Str. c_str (), ibufleftlen); puts (""); printf ("the length of STR is reasonably intercepted: % d \ n", igbkvalidlen); snprintf (szbuf, igbkvalidlen + 1, "% s", str. c_str (); printf ("szbuf: % s \ n", szbuf); Return 0;}/* g ++-wall-g-o Test code_test.coutput: szbuf: 123abc you? The reasonable length of STR is: 8 szbuf: 123abc you */
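
For code that already works with std::string, the same idea can be packaged so that it returns the truncated string directly. This is one possible variant, not the author's code. The name `gbk_safe_prefix` is hypothetical, and like `gbksubstring` it assumes that every byte with the high bit set is the lead byte of a double-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the longest prefix of s that is at most
// max_bytes long and does not end in the middle of a double-byte character.
std::string gbk_safe_prefix(const std::string& s, std::size_t max_bytes) {
    std::size_t i = 0;  // number of bytes accepted so far
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t step = (c < 0x80) ? 1 : 2;
        // Stop if the next character would not fit in the budget, or if a
        // lead byte has no trail byte left in the input.
        if (i + step > max_bytes || i + step > s.size()) break;
        i += step;
    }
    return s.substr(0, i);
}
```

With the 10-byte string from the test code and a budget of 9 bytes, the result is the 8-byte prefix ending in the complete character C4 E3. An input that already ends in a lone lead byte has that byte dropped as well.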
