I. How to determine whether a character is a Western character or a Chinese character
Western characters are mainly ASCII codes, which occupy one byte; converted to a number, such a byte is greater than 0. A Chinese character occupies two bytes, and its first byte, converted to a number, is less than 0 (where char is a signed type). You can therefore tell whether a byte starts a Chinese character by checking whether it is less than 0 after being converted to a number.
For example, if the input string is strin:
if (strin.at(0) < 0)
    cout << "is a Chinese character" << endl;
else
    cout << "not a Chinese character" << endl;
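Whether plain char is signed is implementation-defined in C++, so the comparison with 0 only works on platforms where char is signed. A minimal sketch of a variant that does not depend on this, testing the high bit through unsigned char instead; the helper name hasHighBit is hypothetical and not part of the original code:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the byte at position pos has its
// high bit set, i.e. it cannot be a 7-bit ASCII character.
bool hasHighBit(const std::string& s, std::size_t pos)
{
    if (pos >= s.size())
        return false;
    // Convert through unsigned char so the result does not depend on
    // whether plain char is signed on this platform.
    return static_cast<unsigned char>(s[pos]) >= 0x80;
}
```

Calling hasHighBit(strin, 0) can then replace strin.at(0) < 0 in the check above.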
II. Determining Chinese character encoding in C++
The encoding can be determined from the byte ranges that each encoding uses for Chinese characters. For GB2312 and GBK, use the following two functions.
1. Determine whether it is GB2312
bool isgbcode(const string& strin)
{
    unsigned char ch1;
    unsigned char ch2;
    if (strin.size() >= 2)
    {
        ch1 = (unsigned char)strin.at(0);
        ch2 = (unsigned char)strin.at(1);
        // GB2312 Chinese characters: first byte 176-247 (0xB0-0xF7),
        // second byte 161-254 (0xA1-0xFE)
        if (ch1 >= 176 && ch1 <= 247 && ch2 >= 161 && ch2 <= 254)
            return true;
        else return false;
    }
    else return false;
}
2. Determine whether it is GBK
bool isgbkcode(const string& strin)
{
    unsigned char ch1;
    unsigned char ch2;
    if (strin.size() >= 2)
    {
        ch1 = (unsigned char)strin.at(0);
        ch2 = (unsigned char)strin.at(1);
        // GBK: first byte 129-254 (0x81-0xFE),
        // second byte 64-254 (0x40-0xFE), excluding the unused value 127 (0x7F)
        if (ch1 >= 129 && ch1 <= 254 && ch2 >= 64 && ch2 <= 254 && ch2 != 127)
            return true;
        else return false;
    }
    else return false;
}
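Both functions above inspect only the first two bytes of the string. To process a whole string, the same range test can be applied while walking the bytes, advancing by two whenever a double-byte character is recognized. The following is a minimal sketch under the GBK ranges (first byte 0x81-0xFE, second byte 0x40-0xFE except 0x7F); the function name countDoubleByteChars is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the double-byte characters in a byte string,
// assuming GBK ranges for the first and second byte.
std::size_t countDoubleByteChars(const std::string& s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char ch1 = static_cast<unsigned char>(s[i]);
        if (ch1 >= 0x81 && ch1 <= 0xFE && i + 1 < s.size()) {
            unsigned char ch2 = static_cast<unsigned char>(s[i + 1]);
            if (ch2 >= 0x40 && ch2 <= 0xFE && ch2 != 0x7F) {
                ++count;
                i += 2;   // consume both bytes of the character
                continue;
            }
        }
        ++i;              // single-byte character or unrecognized byte
    }
    return count;
}
```

Bytes that do not form a valid pair are skipped one at a time, so a truncated or mixed string does not throw the scan out of step by more than one character.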
3. Determine whether it is Big5
The high byte ranges from 0xA0 to 0xFE; the low byte ranges from 0x40 to 0x7E or from 0xA1 to 0xFE. To determine whether a Chinese character is Big5-encoded, check that its two bytes fall within these ranges.
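No function is given for Big5, so here is a minimal sketch in the style of the two functions above. It assumes that the first range (0xA0-0xFE) applies to the high byte and the other two ranges (0x40-0x7E and 0xA1-0xFE) to the low byte; the name isBig5Code is hypothetical:

```cpp
#include <string>

// Hypothetical counterpart to the GB2312/GBK checks: tests whether the
// first two bytes of strIn fall into the Big5 ranges stated in the text.
bool isBig5Code(const std::string& strIn)
{
    if (strIn.size() < 2)
        return false;
    unsigned char ch1 = static_cast<unsigned char>(strIn[0]);
    unsigned char ch2 = static_cast<unsigned char>(strIn[1]);
    bool highOk = ch1 >= 0xA0 && ch1 <= 0xFE;
    bool lowOk = (ch2 >= 0x40 && ch2 <= 0x7E) ||
                 (ch2 >= 0xA1 && ch2 <= 0xFE);
    return highOk && lowOk;
}
```

Because the GBK ranges overlap these ones, a byte pair can pass both tests; a range check alone cannot say which of the two encodings a text really uses.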
III. Character locating
1. Big5
How is a character located? Imagine all codes arranged on a two-dimensional grid, with the high byte as the vertical coordinate and the low byte as the horizontal coordinate. The number of Chinese characters in one row is then (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1) = 157. The positioning algorithm has two branches:
if 0x40 <= ch2 <= 0x7e:    # is big5 char, first low-byte range
    index = ((ch1 - 0xA1) * 157 + (ch2 - 0x40)) * 2
elif 0xa1 <= ch2 <= 0xfe:  # is big5 char, second low-byte range
    index = ((ch1 - 0xA1) * 157 + (ch2 - 0xA1 + 63)) * 2
In the second branch, the offset must also account for the first low-byte range, which comes before it in every row. That range holds 0x7E - 0x40 + 1 = 63 values, hence the + 63.
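The same positioning can be sketched in C++. This version returns the ordinal position of the character within the grid (row * 157 + column); the factor 2 in the pseudocode would then be applied to that position, on the assumption that each table entry occupies two bytes. The function name big5Position and the -1 return value for bytes outside the ranges are hypothetical:

```cpp
// Hypothetical helper: ordinal position of a Big5 character in a grid with
// 157 characters per row, or -1 if the bytes are outside the ranges used
// by the positioning formula.
int big5Position(unsigned char ch1, unsigned char ch2)
{
    const int kRowSize = 157;         // (0x7E-0x40+1) + (0xFE-0xA1+1)
    const int kFirstRangeSize = 63;   // 0x7E-0x40+1

    if (ch1 < 0xA1 || ch1 > 0xFE)
        return -1;
    int row = ch1 - 0xA1;

    if (ch2 >= 0x40 && ch2 <= 0x7E)   // first low-byte range
        return row * kRowSize + (ch2 - 0x40);
    if (ch2 >= 0xA1 && ch2 <= 0xFE)   // second range follows the first
        return row * kRowSize + (ch2 - 0xA1) + kFirstRangeSize;
    return -1;
}
```

The last column of the first range (0x7E) maps to 62 and the first column of the second range (0xA1) maps to 63, so the two ranges join without a gap.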
2. Other encodings can be located in a similar way.
IV. Hashing Chinese characters
To make Chinese text processing easier, a hash is commonly used when looking up Chinese characters. How is the position of a Chinese character determined? It depends on how each encoding arranges its characters. The hash functions below cover the main encodings.
(1) GB2312 encoding
For GB2312, let the input Chinese character be gbword (a std::string). The index gbindex can be computed with the formula (C1 - 176) * 94 + (C2 - 161), where C1 is the first byte and C2 is the second byte. Specifically:
gbindex = ((unsigned char)gbword.at(0) - 176) * 94 + ((unsigned char)gbword.at(1) - 161);
The unsigned char type is used because char is one byte. If the byte were converted directly to an int type instead, then, because int is 4 bytes, a negative char would be sign-extended and the result would be wrong.
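As a minimal sketch, the GB2312 formula can be wrapped in a function, with two small helpers that show the widening pitfall. All names here (gb2312Index, widenWithCast, widenWithoutCast) and the -1 return value are hypothetical; the byte ranges assumed are 176-247 for the first byte and 161-254 for the second:

```cpp
#include <string>

// Hypothetical wrapper around (C1 - 176) * 94 + (C2 - 161).
// Returns -1 if the first two bytes are not in the assumed GB2312 ranges.
int gb2312Index(const std::string& gbword)
{
    if (gbword.size() < 2)
        return -1;
    unsigned char c1 = static_cast<unsigned char>(gbword[0]);
    unsigned char c2 = static_cast<unsigned char>(gbword[1]);
    if (c1 < 176 || c1 > 247 || c2 < 161 || c2 > 254)
        return -1;
    return (c1 - 176) * 94 + (c2 - 161);
}

// Going through unsigned char keeps the byte value (0..255).
unsigned int widenWithCast(signed char c)
{
    return static_cast<unsigned char>(c);
}

// Widening a negative signed char directly sign-extends it, so the
// result is a very large number rather than the byte value.
unsigned int widenWithoutCast(signed char c)
{
    return static_cast<unsigned int>(c);
}
```

With 72 possible first bytes and 94 second bytes, the index runs from 0 to 72 * 94 - 1 = 6767, so a table of 6768 slots is enough.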
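(2) GBK encoding
For GBK, if the input Chinese character is gbkword, the index can be computed with the formula index = (ch1 - 0x81) * 190 + (ch2 - 0x40) - (ch2 / 128), where ch1 is the first byte and ch2 is the second byte. The term ch2 / 128 is an integer division: it subtracts 1 for second bytes of 0x80 and above, which skips the unused value 0x7F and leaves 190 positions per row.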
Specifically:
gbkindex = ((unsigned char)gbkword[0] - 129) * 190 + ((unsigned char)gbkword[1] - 64) - (unsigned char)gbkword[1] / 128;
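The GBK formula can be wrapped the same way. This is a sketch only, assuming the ranges 0x81-0xFE for the first byte and 0x40-0xFE (without 0x7F) for the second; the name gbkIndex and the -1 return value are hypothetical:

```cpp
#include <string>

// Hypothetical wrapper around
// (ch1 - 0x81) * 190 + (ch2 - 0x40) - (ch2 / 128).
// Returns -1 if the first two bytes are not in the assumed GBK ranges.
int gbkIndex(const std::string& gbkword)
{
    if (gbkword.size() < 2)
        return -1;
    unsigned char ch1 = static_cast<unsigned char>(gbkword[0]);
    unsigned char ch2 = static_cast<unsigned char>(gbkword[1]);
    if (ch1 < 0x81 || ch1 > 0xFE)
        return -1;
    if (ch2 < 0x40 || ch2 > 0xFE || ch2 == 0x7F)
        return -1;
    // ch2 / 128 is 0 below 0x80 and 1 from 0x80 upward, which closes the
    // gap left by the unused second byte 0x7F.
    return (ch1 - 0x81) * 190 + (ch2 - 0x40) - (ch2 / 128);
}
```

Second byte 0x7E maps to column 62 and 0x80 maps to column 63, so each row is contiguous and the next row starts at a multiple of 190.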
[Reference] Character set encoding details
http://www.cppblog.com/humanchao/archive/2007/09/27/32989.html
[Reference] A tool for determining character encoding and transcoding
http://hi.baidu.com/pazhu/blog/item/efcce7a2034ae9a8caefd05b.html