1. Difference Between Full-width and Half-width
Chinese text brings with it two basic and important concepts: full-width and half-width. Visually, if an English character (such as a) occupies one half-width position on a computer screen, a Chinese character occupies the width of two English characters, which is why it is called full-width.
In an English input method, every letter, symbol, and digit occupies one English character position, that is, half-width. A Chinese input method offers two modes: full-width and half-width. The mode has no effect on Chinese characters, which always occupy the width of two English characters. It does matter for the symbols, digits, and English letters entered in that mode, as shown below:
China
Ｃｈｉｎａ
The first line was entered in half-width mode and the second in full-width mode. In full-width mode, letters, symbols, and digits are all treated like Chinese characters, without exception. They look rather awkward because each one takes up the width of two English characters.
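As a rough illustration of the "one position versus two positions" idea, the sketch below counts screen columns for a GBK-encoded byte string. It assumes that every byte below 0x80 is a half-width character and that every byte at or above 0x80 starts a two-byte full-width character. The names `GbkDisplayColumns` and `GbkDisplayColumnsExample` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts screen columns for a GBK-encoded string,
// assuming 1 column per single-byte character and 2 columns per
// double-byte character (lead byte >= 0x80).
std::size_t GbkDisplayColumns(const std::string &text) {
    std::size_t columns = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c >= 0x80 && i + 1 < text.size()) {
            columns += 2;  // full-width: two bytes, two columns
            ++i;           // skip the second byte
        } else {
            columns += 1;  // half-width (or a truncated trailing byte)
        }
    }
    return columns;
}

// Usage: half-width "China" versus the same word in full-width GBK bytes.
void GbkDisplayColumnsExample() {
    assert(GbkDisplayColumns("China") == 5);
    assert(GbkDisplayColumns("\xA3\xC3\xA3\xE8\xA3\xE9\xA3\xEE\xA3\xE1") == 10);
}
```

The full-width form takes twice the bytes and twice the columns, which is the visual difference described above.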
2. Conversion Between Full-width and Half-width
On a Chinese-locale system, the ANSI multi-byte character set encodes Chinese text (Chinese characters and Chinese symbols) in GBK and English text (English letters and English symbols) in ASCII. ANSI-encoded text is generally convenient to process in a program. Before performing string operations on a text, it is advisable to normalize it to either all full-width or all half-width. For example, the full-width space "　" has the GBK encoding A1A1, while the half-width space " " has the ASCII code \x20. If a text mixes the two, finding space positions or splitting a paragraph on spaces becomes needlessly troublesome. The content of a text can be divided into the following four types:
| Category | Explanation | Conversion method |
|---|---|---|
| Chinese characters and Chinese symbols | Chinese characters (for example, the characters for "China" and "good") and Chinese-only symbols such as the Chinese ellipsis and the Chinese dash. These can only be full-width, so there is nothing to convert. | None |
| English letters and symbols | Characters 33-126 in the ASCII table, for example: ! * + 0 1 2 a b c. They have both a full-width and a half-width form, such as China and Ｃｈｉｎａ. | Half-width to full-width: prepend the byte A3 and set the highest bit of the original byte to 1. Full-width to half-width: discard the first byte and clear the highest bit of the second byte to 0. |
| Space | The ASCII space is \x20, and the GBK full-width space is A1A1. | Does not follow the rule for English letters and symbols, so it must be handled as a special case. |
| Control characters | Characters 0-31 in the ASCII table, used to control the file format. For example, pressing Enter to break a line produces \x0d\x0a (CR LF). These can only be half-width and do not need to be converted. | None |
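The four categories in the table can be sketched as a small classifier over GBK bytes. This is an illustration only: the names `CharKind`, `Classified`, and `ClassifyGbk` are hypothetical, a truncated double-byte character at the end of the input is arbitrarily reported as a one-byte Chinese character, and DEL (127) is grouped with the control characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The four categories from the table (names are illustrative).
enum class CharKind { Chinese, LetterOrSymbol, Space, Control };

struct Classified {
    CharKind kind;
    std::size_t bytes;  // 1 for half-width, 2 for full-width
};

// Classifies the character that starts at text[pos]; pos must be < text.size().
Classified ClassifyGbk(const std::string &text, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(text[pos]);
    if (lead < 0x80) {
        if (lead == 0x20) return {CharKind::Space, 1};
        if (lead >= 33 && lead <= 126) return {CharKind::LetterOrSymbol, 1};
        return {CharKind::Control, 1};  // 0-31, plus DEL (127)
    }
    if (pos + 1 >= text.size()) return {CharKind::Chinese, 1};  // truncated input
    const unsigned char trail = static_cast<unsigned char>(text[pos + 1]);
    if (lead == 0xA1 && trail == 0xA1) return {CharKind::Space, 2};
    if (lead == 0xA3 && trail >= 0xA1 && trail <= 0xFE) {
        return {CharKind::LetterOrSymbol, 2};  // full-width letter or symbol
    }
    return {CharKind::Chinese, 2};
}

// Usage: one sample per category.
void ClassifyGbkExample() {
    assert(ClassifyGbk("a", 0).kind == CharKind::LetterOrSymbol);
    assert(ClassifyGbk("\xA3\xE1", 0).kind == CharKind::LetterOrSymbol);
    assert(ClassifyGbk("\xA1\xA1", 0).kind == CharKind::Space);
    assert(ClassifyGbk("\r\n", 0).kind == CharKind::Control);
    assert(ClassifyGbk("\xD6\xD0", 0).kind == CharKind::Chinese);
}
```

Only the LetterOrSymbol and Space categories ever change during conversion; the other two are copied through unchanged.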
Based on this, the routine for converting full-width to half-width is as follows:
/*
 * Author: Hou Kai
 * Description: mutual conversion between full-width and half-width
 * Date:
 */
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

const unsigned char Sbc_high = 0xA3;   // First byte of a full-width letter or symbol
const unsigned char Sbc_space = 0xA1;  // The full-width space is A1 A1

// Full-width to half-width
string Sbc2dbc(const string &sbc) {
    string dbc;
    const size_t len = sbc.length();
    for (size_t i = 0; i < len; ++i) {
        const unsigned char lead = static_cast<unsigned char>(sbc[i]);
        if (lead < 0x80 || i + 1 >= len) {
            // Single-byte character or control character (or a truncated last byte)
            dbc.append(1, sbc[i]);
            continue;
        }
        const unsigned char trail = static_cast<unsigned char>(sbc[i + 1]);
        if (lead == Sbc_high) {
            // Full-width English letter or symbol, such as ! (A3 A1)
            dbc.append(1, static_cast<char>(trail & 0x7F));
        } else if (lead == Sbc_space && trail == Sbc_space) {
            // The full-width space is handled separately
            dbc.append(1, ' ');
        } else {
            // Chinese characters and Chinese-only symbols are copied unchanged
            dbc += sbc.substr(i, 2);
        }
        ++i;  // Skip the second byte of the pair
    }
    return dbc;
}
The routine for converting half-width to full-width is as follows:
// Half-width to full-width
string Dbc2sbc(const string &dbc) {
    string sbc;
    const size_t len = dbc.length();
    for (size_t i = 0; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(dbc[i]);
        if (c >= 0x80) {
            // Double-byte character (Chinese character or symbol): copy both bytes
            sbc += dbc.substr(i, 2);
            ++i;
        } else if (c == ' ') {
            // The space is handled separately: emit the full-width space A1 A1
            sbc += "\xA1\xA1";
        } else if (c >= 33 && c <= 126) {
            // English letter or symbol: prepend A3 and set the highest bit
            sbc.append(1, static_cast<char>(Sbc_high));
            sbc.append(1, static_cast<char>(c | 0x80));
        } else {
            // Control characters are copied unchanged
            sbc.append(1, dbc[i]);
        }
    }
    return sbc;
}
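The byte arithmetic behind the two routines can be checked in isolation. The sketch below covers only the second byte of a full-width letter or symbol (the first byte is always A3). The helper names `GbkTrailFromAscii` and `AsciiFromGbkTrail` are hypothetical and do not appear in the original program.

```cpp
#include <cassert>

// Half-width ASCII (33-126) -> second byte of the full-width GBK form:
// set the highest bit.
inline unsigned char GbkTrailFromAscii(unsigned char ascii) {
    return static_cast<unsigned char>(ascii | 0x80);
}

// Second byte of a full-width GBK letter or symbol -> half-width ASCII:
// clear the highest bit.
inline unsigned char AsciiFromGbkTrail(unsigned char trail) {
    return static_cast<unsigned char>(trail & 0x7F);
}

// Usage: "!" is 0x21, so its full-width form is A3 A1, and the mapping
// round-trips for every character in 33-126.
void GbkTrailExample() {
    assert(GbkTrailFromAscii('!') == 0xA1);
    assert(GbkTrailFromAscii('A') == 0xC1);
    assert(AsciiFromGbkTrail(0xA1) == '!');
    for (unsigned int c = 33; c <= 126; ++c) {
        const unsigned char ch = static_cast<unsigned char>(c);
        assert(AsciiFromGbkTrail(GbkTrailFromAscii(ch)) == ch);
    }
}
```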
An example program that converts a text file is as follows:
// A simple example of processing a TXT file in place
void Processfile(const char *filename) {
    ifstream infile(filename);
    if (!infile) {
        return;  // Could not open the file; leave it untouched
    }
    string strline;
    string strresult;
    // Loop on getline itself; testing eof() first would append a spurious last line
    while (getline(infile, strline)) {
        strresult += strline + "\n";
    }
    infile.close();

    strresult = Sbc2dbc(strresult);  // Conversion

    ofstream outfile(filename);  // Overwrites the original file
    outfile.write(strresult.c_str(), strresult.length());
    outfile.flush();
    outfile.close();
}
3. Unicode Conversion
In practice, you may also encounter Unicode-encoded text. The half-width and full-width conversion works the same way as in the preceding method: you only need to process "English letters and English symbols" and "spaces". In the two-byte Unicode encoding (UTF-16) used here, each of these characters is a single 16-bit value; for example, the half-width space is \x0020 and the full-width space is \x3000 (for Unicode encoding, see: http://www.cnblogs.com/houkai/archive/2013/06/04/3116955.html ). Because both forms have the same size, no bytes need to be added or dropped, which makes processing easier. You can download the Unicode encoding table for reference. Conversion method:
A. The full-width space is 12288 (U+3000), and the half-width space is 32 (U+0020).
B. The other half-width characters (33-126) correspond in order to the full-width characters (65281-65374); the difference is always 65248 (0xFEE0).
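Rules A and B can be sketched as portable functions over UTF-16 code units, independent of any platform API. This is a minimal illustration, not the author's implementation: the names `ToHalfWidthUnit`, `ToFullWidthUnit`, and `ToHalfWidth` are made up here, and the text is assumed to be held in a `std::u16string`.

```cpp
#include <cassert>
#include <string>

// Rule A and rule B, full-width to half-width, for one UTF-16 code unit.
char16_t ToHalfWidthUnit(char16_t c) {
    if (c == 0x3000) return 0x0020;               // A: space, 12288 -> 32
    if (c >= 0xFF01 && c <= 0xFF5E) {             // B: 65281-65374
        return static_cast<char16_t>(c - 0xFEE0); //    subtract 65248
    }
    return c;                                     // everything else unchanged
}

// The reverse direction, half-width to full-width.
char16_t ToFullWidthUnit(char16_t c) {
    if (c == 0x0020) return 0x3000;               // A: space, 32 -> 12288
    if (c >= 0x0021 && c <= 0x007E) {             // B: 33-126
        return static_cast<char16_t>(c + 0xFEE0); //    add 65248
    }
    return c;
}

// Converts a whole string to half-width, one code unit at a time.
std::u16string ToHalfWidth(std::u16string text) {
    for (char16_t &c : text) c = ToHalfWidthUnit(c);
    return text;
}

// Usage: full-width "A", full-width space, full-width "b" become "A b".
void ToHalfWidthExample() {
    assert(ToHalfWidth(u"\uFF21\u3000\uFF42") == u"A b");
    assert(ToFullWidthUnit(u'!') == 0xFF01);
}
```

Working per code unit is safe here because none of the affected values fall in the surrogate range, so characters outside these two rules pass through untouched.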
The implementation is relatively simple. The following example is a rewritten version of the Changetxtencoding function, which converts UTF-8 to ANSI:
// The original function converts the UTF-8 string szu8 to Unicode, and then Unicode to ANSI.
// Here, a full-width-to-half-width pass is added after the UTF-8-to-Unicode step.
// The caller must release the returned buffer with delete[].
#include <windows.h>

char *Changetxtencoding(char *szu8) {
    // First call returns the required length, including the terminating null
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szu8, -1, NULL, 0);
    wchar_t *wszString = new wchar_t[wcsLen];
    ::MultiByteToWideChar(CP_UTF8, 0, szu8, -1, wszString, wcsLen);

    // Full-width to half-width; wszString is Unicode (UTF-16) encoded
    for (int i = 0; i < wcsLen; i++) {
        if (wszString[i] == 12288) {
            // Space
            wszString[i] = 32;
        } else if (wszString[i] >= 65281 && wszString[i] <= 65374) {
            // Other characters
            wszString[i] -= 65248;
        }
    }

    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, NULL, 0, NULL, NULL);
    char *szAnsi = new char[ansiLen];
    ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, szAnsi, ansiLen, NULL, NULL);
    delete[] wszString;
    return szAnsi;
}