Converting Strings Between UTF-8, Unicode, and GB2312 in C


Source: http://www.study-code.com/visual-studio/c/72913.htm

(Declaration: This article is original. If you reproduce it, please credit the author and link to the original.)
/* Author: Wu Jian (English name: Sword) */
/* Date: 2007-12-13 */
/* Purpose: Knowledge sharing */

Recently I ran into the problem of converting UTF-8 to GB2312 in an embedded environment where no conversion API is available. I checked a lot of material online, but most of it calls interfaces provided by VC or Linux. Here I summarize my work of the past two days.
Broadly, there are two major steps (background knowledge is not covered here):

1. UTF8 -> Unicode
UTF-8 is simply an encoding of Unicode values, so the conversion can be done directly with bit operations, without any library. First, you must understand the UTF-8 encoding format:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The number of leading 1 bits in the first byte tells how many bytes belong to the same character. This is useful when parsing a long UTF-8 string. The following function counts those leading 1 bits. (Here APP_PRINT is defined as printf, that is, #define APP_PRINT printf. For a release build the macro can be defined as empty, so the debug calls do not have to be removed one by one, and debugging stays convenient.)
int GetUtf8ByteNumForWord(u8 firstCh)
{
    u8 temp = 0x80;
    int num = 0;

    // Count the leading 1 bits of the first byte
    while (temp & firstCh)
    {
        num++;
        temp = (temp >> 1);
    }

    APP_PRINT("the num is: %d", num);
    return num;
}
With this function we can tell how many bytes each character in the string occupies. Although a UTF-8 character can take up to 6 bytes, I only convert the 1-byte and 3-byte encodings based on the return value, because Chinese characters generally take 3 bytes in UTF-8.
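As a small illustration of how the lead byte lets you walk through a long string, here is a minimal sketch. The names Utf8SequenceLength and Utf8CountChars are hypothetical and are not part of the original code. Unlike GetUtf8ByteNumForWord, the first helper returns 1 for ASCII and 0 for a byte that cannot start a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with firstCh. Returns 1 for ASCII, 2 to 6 for multi-byte lead bytes,
 * and 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
int Utf8SequenceLength(unsigned char firstCh)
{
    int ones = 0;
    unsigned char mask = 0x80;

    /* Count the leading 1 bits, as in the article's function. */
    while (mask != 0 && (firstCh & mask))
    {
        ones++;
        mask >>= 1;
    }

    if (ones == 0)
    {
        return 1;               /* 0xxxxxxx: single-byte ASCII */
    }
    if (ones == 1 || ones > 6)
    {
        return 0;               /* cannot start a character */
    }
    return ones;
}

/* Hypothetical helper: counts characters by hopping from lead byte to
 * lead byte. A stray byte counts as one character so the loop advances. */
size_t Utf8CountChars(const unsigned char *s, size_t len)
{
    size_t i = 0;
    size_t count = 0;

    assert(s != NULL);

    while (i < len)
    {
        int n = Utf8SequenceLength(s[i]);
        i += (n > 0) ? (size_t)n : 1;
        count++;
    }
    return count;
}
```

For the five bytes 'a', 0xE4, 0xB8, 0xAD, 'b' the count is three characters, because the middle three bytes form one Chinese character.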

// Convert len bytes of UTF-8 text to GB2312 and store the result in temp,
// a buffer the caller has allocated in advance.
// Note: the memcpy calls below assume a little-endian target, and the input
// is assumed not to be truncated in the middle of a character.
void Utf8ToGb2312(const char *utf8, int len, char *temp)
{
    int byteCount = 0;  // bytes consumed from utf8 in this step
    int outCount = 0;   // bytes written to temp in this step
    int i = 0;
    int j = 0;
    int k = 0;

    unsigned short unicodeKey = 0;
    unsigned short gbKey = 0;

    APP_PRINT("utf8->unicode: \n");
    APP_PRINT("utf8: [");
    for (k = 0; k < len; k++)
    {
        APP_PRINT("%02x ", (u8)utf8[k]);
    }
    APP_PRINT("]\n");

    // Parse the input one character at a time
    while (i < len)
    {
        switch (GetUtf8ByteNumForWord((u8)utf8[i]))
        {
        case 0:
            // Single-byte ASCII: copy unchanged
            temp[j] = utf8[i];
            byteCount = 1;
            outCount = 1;
            break;

        case 2:
            // 2-byte sequence: copied through without conversion
            temp[j] = utf8[i];
            temp[j + 1] = utf8[i + 1];
            byteCount = 2;
            outCount = 2;
            break;

        case 3:
            // UTF8 -> Unicode: temp[j] holds the low byte, temp[j + 1] the high byte
            temp[j + 1] = ((utf8[i] & 0x0F) << 4) | (((u8)utf8[i + 1] >> 2) & 0x0F);
            temp[j] = ((utf8[i + 1] & 0x03) << 6) + (utf8[i + 2] & 0x3F);

            // Obtain the Unicode value
            memcpy(&unicodeKey, (temp + j), 2);
            APP_PRINT("unicode key is: 0x%04X\n", unicodeKey);

            // Look up the corresponding GB2312 value
            gbKey = SearchCodeTable(unicodeKey);
            APP_PRINT("gb2312 key is: 0x%04X\n", gbKey);

            if (gbKey != 0)
            {
                // A non-zero value means the search succeeded. Swap the high
                // and low bytes so the high byte is stored first.
                gbKey = (gbKey >> 8) | (gbKey << 8);
                APP_PRINT("after changing, gb2312 key is: 0x%04X\n", gbKey);
                memcpy((temp + j), &gbKey, 2);
            }
            // If the search failed, the two Unicode bytes stay in temp.

            byteCount = 3;
            outCount = 2;
            break;

        case 4:
            // Sequences of 4 to 6 bytes are skipped without output
            byteCount = 4;
            outCount = 0;
            break;
        case 5:
            byteCount = 5;
            outCount = 0;
            break;
        case 6:
            byteCount = 6;
            outCount = 0;
            break;

        default:
            // A continuation byte or an invalid lead byte: skip one byte
            // so that the loop always advances.
            APP_PRINT("invalid UTF-8 lead byte\n");
            byteCount = 1;
            outCount = 0;
            break;
        }

        i += byteCount;
        j += outCount;
    }

    APP_PRINT("gb2312: [");
    for (k = 0; k < j; k++)
    {
        APP_PRINT("%02x ", (u8)temp[k]);
    }
    APP_PRINT("]\n");
}
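The 3-byte case is the heart of the UTF8 -> Unicode step, so it may help to see it in isolation. The sketch below uses two hypothetical helpers (neither name appears in the original code). The first assembles the 16-bit value with shifts only, which avoids depending on the byte order of the target. The second builds the same high and low bytes that the function above stores in temp, to show that both approaches yield the same value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes one 3-byte UTF-8 sequence
 * (1110xxxx 10xxxxxx 10xxxxxx) into a 16-bit Unicode value.
 * The caller must pass a pointer to at least three valid bytes. */
unsigned short DecodeUtf8ThreeBytes(const unsigned char *p)
{
    assert(p != NULL);

    return (unsigned short)(((p[0] & 0x0F) << 12) |   /* top 4 bits    */
                            ((p[1] & 0x3F) << 6)  |   /* middle 6 bits */
                             (p[2] & 0x3F));          /* low 6 bits    */
}

/* Hypothetical helper: the same value built from the high and low bytes
 * that the article's code computes separately. */
unsigned short DecodeViaHighLowBytes(const unsigned char *p)
{
    unsigned char high;
    unsigned char low;

    assert(p != NULL);

    high = (unsigned char)(((p[0] & 0x0F) << 4) | ((p[1] >> 2) & 0x0F));
    low  = (unsigned char)(((p[1] & 0x03) << 6) | (p[2] & 0x3F));

    return (unsigned short)((high << 8) | low);
}
```

For example, the bytes E4 B8 AD decode to 0x4E2D with either helper.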

2. Unicode -> GB2312
This conversion uses a look-up table. First, download a code table. Such tables usually list the GB2312 value first and the Unicode value second, which is inconvenient for our lookup, so I swapped the columns to put Unicode first and sorted the table in ascending order of Unicode. (Only the case where both values are two bytes needs to be considered, because the UTF8 -> Unicode step above does not convert single-byte ASCII to Unicode.)
(1) Build the table (it can be downloaded here: http://blog.91bs.com/?action=show&id=20. Thanks to Pig of Dregs.)
This is the original format:
0x8140 0x4E02 # CJK UNIFIED IDEOGRAPH
0x8141 0x4E04 # CJK UNIFIED IDEOGRAPH
0x8142 0x4E05 # CJK UNIFIED IDEOGRAPH
First, convert it to the following form (this can be done with a small program; I wrote mine in VC, contact me if you need it):
{0x4E02, 0x8140}, // CJK UNIFIED IDEOGRAPH
{0x4E04, 0x8141}, // CJK UNIFIED IDEOGRAPH
{0x4E05, 0x8142}, // CJK UNIFIED IDEOGRAPH
These entries can then be put in a .h file. Below is my definition:
typedef struct unicode_gb
{
    unsigned short unicode;
    unsigned short gb;
} UNICODE_GB;

UNICODE_GB code_table[] =
{
    {0x4E02, 0x8140}, // CJK UNIFIED IDEOGRAPH
    {0x4E04, 0x8141}, // CJK UNIFIED IDEOGRAPH
    {0x4E05, 0x8142}, // CJK UNIFIED IDEOGRAPH
    // ...... (remaining entries omitted)
};
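The small program that swaps the two columns is not shown in the article. A minimal sketch of the per-line step might look like the following. The function name ConvertTableLine is hypothetical, and the input is assumed to have exactly the "0xGGGG 0xUUUU # comment" layout shown above. It relies on snprintf, so it needs a C99 library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrites one line of the downloaded table,
 * "0xGGGG 0xUUUU # comment", as "{0xUUUU, 0xGGGG}, // comment".
 * Returns 1 on success, 0 if the line does not start with two hex values. */
int ConvertTableLine(const char *line, char *out, size_t outSize)
{
    unsigned int gb = 0;
    unsigned int unicode = 0;
    const char *comment;
    size_t n;

    assert(line != NULL && out != NULL && outSize > 0);

    /* %x accepts an optional 0x prefix, so both values parse directly. */
    if (sscanf(line, "%x %x", &gb, &unicode) != 2)
    {
        return 0;
    }

    /* Keep the text after '#', if any, as a C comment. */
    comment = strchr(line, '#');
    if (comment != NULL)
    {
        comment++;
        while (*comment == ' ' || *comment == '\t')
        {
            comment++;
        }
    }
    else
    {
        comment = "";
    }

    /* Unicode goes first, GB second. */
    snprintf(out, outSize, "{0x%04X, 0x%04X}, // %s", unicode, gb, comment);

    /* Drop a trailing newline copied from the input line. */
    n = strlen(out);
    while (n > 0 && (out[n - 1] == '\n' || out[n - 1] == '\r'))
    {
        out[--n] = '\0';
    }
    return 1;
}
```

Calling this for every line read with fgets and writing the result with fputs would produce the array entries shown above.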

The next step is also simple. In VC, sort the whole table by Unicode value with a bubble sort and print the sorted result. Running the program in cmd as name > 1.txt (where name is the executable) redirects the output to a file. This gives a unicode -> gb2312 code table ordered by Unicode. The source code is as follows:

#include <stdio.h>
// Also include the .h file that defines UNICODE_GB and code_table.

int main(int argc, char *argv[])
{
    int num = 0;
    UNICODE_GB temp;
    int i = 0;
    int j = 0;

    num = sizeof(code_table) / sizeof(UNICODE_GB);

    printf("struct size: %d | total size: %d | num is: %d\n",
           (int)sizeof(UNICODE_GB), (int)sizeof(code_table), num);

    // Bubble sort in ascending order of unicode
    for (i = 0; i < num; i++)
    {
        for (j = 1; j < num - i; j++)
        {
            if (code_table[j - 1].unicode > code_table[j].unicode)
            {
                temp.unicode = code_table[j - 1].unicode;
                temp.gb = code_table[j - 1].gb;
                code_table[j - 1].unicode = code_table[j].unicode;
                code_table[j - 1].gb = code_table[j].gb;
                code_table[j].unicode = temp.unicode;
                code_table[j].gb = temp.gb;
            }
        }
    }

    printf("here is the code table sorted by unicode\n");

    for (i = 0; i < num; i++)
    {
        printf("{\t0x%04X,\t0x%04X\t},\t\n", code_table[i].unicode, code_table[i].gb);
    }

    printf("\n print over!\n");

    // The commented-out code below was used to add the ',', '{', '}' and
    // so on to the original code table.
    /*
    char buff[100];
    char buff_1[100];

    FILE *fp = NULL;
    FILE *fp_1 = NULL;

    memset(buff, 0, 100);
    memset(buff_1, 0, 100);

    fp = fopen("table.txt", "r");
    fp_1 = fopen("table_1.txt", "a+");

    if ((fp == NULL) || (fp_1 == NULL))
    {
        printf("open file error!\n");
        return 1;
    }

    while (fgets(buff, 100, fp) != NULL)
    {
        buff[8] = ',';

        fputs(buff, fp_1);
    }
    */

    return 0;
}
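The bubble sort is fine for a one-off job on a PC, but with a table of this size it performs a very large number of comparisons. As an alternative (not what the original author used), the standard library function qsort can do the same ordering. The sketch below assumes the UNICODE_GB struct defined earlier; SortCodeTable and CompareByUnicode are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct unicode_gb
{
    unsigned short unicode;
    unsigned short gb;
} UNICODE_GB;

/* Comparison callback for qsort: ascending order of the unicode field. */
static int CompareByUnicode(const void *a, const void *b)
{
    const UNICODE_GB *x = (const UNICODE_GB *)a;
    const UNICODE_GB *y = (const UNICODE_GB *)b;

    /* Yields -1, 0 or 1 without any risk of overflow. */
    return (x->unicode > y->unicode) - (x->unicode < y->unicode);
}

/* Hypothetical helper: sorts count entries of table in place. */
void SortCodeTable(UNICODE_GB *table, size_t count)
{
    assert(table != NULL);

    qsort(table, count, sizeof(table[0]), CompareByUnicode);
}
```

In the program above, the two nested loops could then be replaced by a single call such as SortCodeTable(code_table, num).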

Finally, the search algorithm. The table is now sorted, so we put the sorted code table in the .h file that is actually used. You can probably guess which algorithm I used to search it: binary search.

#define CODE_TABLE_SIZE 21791
// The table is fixed, so its length is given by a macro instead of being
// computed with sizeof each time. This may hurt portability, however.
unsigned short SearchCodeTable(unsigned short unicodeKey)
{
    int first = 0;
    int end = CODE_TABLE_SIZE - 1;
    int mid = 0;

    while (first <= end)
    {
        mid = (first + end) / 2;

        if (code_table[mid].unicode == unicodeKey)
        {
            return code_table[mid].gb;
        }
        else if (code_table[mid].unicode > unicodeKey)
        {
            end = mid - 1;
        }
        else
        {
            first = mid + 1;
        }
    }
    // 0 means the value was not found in the table
    return 0;
}
At this point a UTF8 string can be converted to GB2312. The code handles a whole string, not just the encoding conversion of a single Chinese character.
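To tie the two steps together, here is a compact sketch of the whole pipeline on a tiny table. It is not the author's implementation. The three table entries are the ones quoted in the article, while Utf8ToGbSketch, LookupSample, and kSampleTable are hypothetical names. Two simplifications are assumed: output bytes are written high byte first with explicit shifts, so no byte swap or memcpy is needed, and characters missing from the table are dropped rather than left as Unicode bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct unicode_gb
{
    unsigned short unicode;
    unsigned short gb;
} UNICODE_GB;

/* Three entries taken from the article; the real table is far longer. */
static const UNICODE_GB kSampleTable[] =
{
    {0x4E02, 0x8140},
    {0x4E04, 0x8141},
    {0x4E05, 0x8142},
};

/* Binary search over the sample table; returns 0 when not found. */
static unsigned short LookupSample(unsigned short unicodeKey)
{
    int first = 0;
    int end = (int)(sizeof(kSampleTable) / sizeof(kSampleTable[0])) - 1;

    while (first <= end)
    {
        int mid = first + (end - first) / 2;

        if (kSampleTable[mid].unicode == unicodeKey)
        {
            return kSampleTable[mid].gb;
        }
        else if (kSampleTable[mid].unicode > unicodeKey)
        {
            end = mid - 1;
        }
        else
        {
            first = mid + 1;
        }
    }
    return 0;
}

/* Converts ASCII and 3-byte UTF-8 sequences; everything else is skipped.
 * Returns the number of bytes written to dst. */
size_t Utf8ToGbSketch(const unsigned char *src, size_t len,
                      unsigned char *dst, size_t dstSize)
{
    size_t i = 0;
    size_t j = 0;

    assert(src != NULL && dst != NULL);

    while (i < len)
    {
        if (src[i] < 0x80)
        {
            if (j + 1 > dstSize) break;          /* output buffer full */
            dst[j++] = src[i++];
        }
        else if ((src[i] & 0xF0) == 0xE0 && i + 2 < len)
        {
            unsigned short unicodeKey =
                (unsigned short)(((src[i] & 0x0F) << 12) |
                                 ((src[i + 1] & 0x3F) << 6) |
                                  (src[i + 2] & 0x3F));
            unsigned short gbKey = LookupSample(unicodeKey);

            if (gbKey != 0)
            {
                if (j + 2 > dstSize) break;      /* output buffer full */
                dst[j++] = (unsigned char)(gbKey >> 8);    /* high byte */
                dst[j++] = (unsigned char)(gbKey & 0xFF);  /* low byte  */
            }
            i += 3;
        }
        else
        {
            i++;  /* unsupported or malformed byte: skip it */
        }
    }
    return j;
}
```

With the input bytes 'A', 0xE4, 0xB8, 0x82, 'B' (the middle three encode U+4E02), the output is 'A', 0x81, 0x40, 'B'.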
