Recently, I used VC6 to scrape information from the web. I had started learning regular expressions only two days earlier, and extracting data from pages turned out to be easy. I used a MySQL database for back-end storage, mainly to make later web development convenient. However, I ran into some encoding problems with MySQL along the way. I am recording them here as a reference for anyone who hits the same issues.
Chinese web pages are currently encoded mainly in GB2312 or UTF-8. A GB2312 page needs no conversion when captured in VC6, because VC6 stores strings as multi-byte (ANSI) text by default, so nothing comes out garbled. If the page is UTF-8 encoded, however, the captured result is garbled, and you need to convert the UTF-8 data to multi-byte text before processing it in VC. For example:
//////////////////////////
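// strData: the captured UTF-8 bytes (a CString, judging by GetLength())
int n = MultiByteToWideChar(CP_UTF8, 0, strData, strData.GetLength(), NULL, 0);
WCHAR* pChar = new WCHAR[n + 1];
MultiByteToWideChar(CP_UTF8, 0, strData, strData.GetLength(), pChar, n);
pChar[n] = 0;
char szAnsi[1024]; // fixed size: the call fails if the converted text does not fit
WideCharToMultiByte(CP_ACP, WC_COMPOSITECHECK, pChar, -1, szAnsi, sizeof(szAnsi), NULL, NULL);
delete[] pChar; // release the temporary wide-character buffer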
///////////////////////////
szAnsi now holds a multi-byte string that can be used directly.
Once the strings have been processed inside VC6, encoding problems can also appear when they are stored in the MySQL database.
Note that if the data is not encoded according to the database's encoding settings when it is stored, the write may still succeed, but the data comes back wrong on retrieval. In most cases you just see a pile of garbled characters and have no idea how to deal with them.
Therefore, here we set the storage encoding of the database to utf8 (mainly for convenience in later applications; another encoding such as GB2312 would also work), and it is best to state the character set explicitly when creating the table.
//////////////////////////
CREATE TABLE IF NOT EXISTS XXX (ID INT(4) NOT NULL PRIMARY KEY AUTO_INCREMENT, ...) DEFAULT CHARSET = utf8;
When connecting to the database, set the read/write encoding after mysql_init(&g_mysql) and a successful connection:
mysql_query(&g_mysql, _T("SET NAMES 'utf8'"));
//////////////////////////
With that, the preparation is complete, and what remains is the encoding conversion at storage time. Strings in VC6 are multi-byte encoded (in practice GB2312), while the MySQL database stores UTF-8. If a string is not converted first, the insert fails with an error saying the characters cannot be recognized.
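As a rough illustration of why an unconverted string is rejected, here is a minimal sketch of a UTF-8 well-formedness check. It is not part of the original code: the name isValidUtf8 is made up for this example, and it uses std::string instead of VC6's CString so that it stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong forms,
// surrogates, and values above U+10FFFF.
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }            // ASCII byte

        std::size_t len;
        unsigned long cp;      // decoded code point
        unsigned long minCp;   // smallest value allowed for this length
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; minCp = 0x10000; }
        else return false;                           // not a valid lead byte

        if (i + len > n) return false;               // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp) return false;                // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}
```

For instance, the GB2312 bytes D6 D0 (the character 中) fail this check because D0 is not a continuation byte, while the UTF-8 bytes E4 B8 AD for the same character pass.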
Here we need a character encoding conversion function; refer to the conversion function GB2312ToUTF_8 in http://www.vckbase.com/document/viewdoc?id=1444:
//////////////////////////
// Convert GB2312 to UTF-8
char* GB2312ToUTF_8(char* pText, int pLen)
{
    int nULen = 1 + pLen * 2; // was: pLen + (pLen >> 2) + 2;
    char buf[4];
    char* rst = new char[nULen];
    memset(buf, 0, 4);
    memset(rst, 0, nULen);
    int i = 0;
    int j = 0;
    while (i < pLen)
    {
        // Copy single-byte (ASCII) characters directly
        if (*(pText + i) >= 0)
        {
            rst[j++] = pText[i++];
        }
        else
        {
            // A two-byte GB2312 character becomes three UTF-8 bytes
            WCHAR pBuffer;
            Gb2312ToUnicode(&pBuffer, pText + i);
            UnicodeToUTF_8(buf, &pBuffer);
            rst[j] = buf[0];
            rst[j + 1] = buf[1];
            rst[j + 2] = buf[2];
            j += 3;
            i += 2;
        }
    }
    rst[j] = '\0';
    return rst; // the caller must delete[] the returned buffer
}
Note that the original article had a small problem in how it calculated the output length; I changed it above. The function also returns a pointer to newly allocated memory, so remember to release it with delete[] after use.
The following helper functions are used (declare them before GB2312ToUTF_8 so that it compiles):
// Convert UTF-8 to Unicode (handles a three-byte sequence only)
void UTF_8ToUnicode(WCHAR* pOut, char* pText)
{
    char* uchar = (char*)pOut;
    uchar[1] = ((pText[0] & 0x0F) << 4) + ((pText[1] >> 2) & 0x0F);
    uchar[0] = ((pText[1] & 0x03) << 6) + (pText[2] & 0x3F);
    return;
}
// Convert Unicode to UTF-8 (always writes three bytes)
void UnicodeToUTF_8(char* pOut, WCHAR* pText)
{
    // Mind the byte order of WCHAR: the low byte comes first, the high byte second
    char* pchar = (char*)pText;
    pOut[0] = (0xE0 | ((pchar[1] & 0xF0) >> 4));
    pOut[1] = (0x80 | ((pchar[1] & 0x0F) << 2)) + ((pchar[0] & 0xC0) >> 6);
    pOut[2] = (0x80 | (pchar[0] & 0x3F));
    return;
}
// Convert Unicode to GB2312
void UnicodeToGB2312(char* pOut, unsigned short uData)
{
    WideCharToMultiByte(CP_ACP, 0, &uData, 1, pOut, sizeof(WCHAR), NULL, NULL);
    return;
}
// Convert GB2312 to Unicode
void Gb2312ToUnicode(WCHAR* pOut, char* gbBuffer)
{
    ::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, gbBuffer, 2, pOut, 1);
    return;
}
//////////////////////////
After the conversion above, the string can be stored in the UTF-8 encoded MySQL database:
strSql.Format("insert into XXX (name,...) values ('%s',...)", szName ...);
if (mysql_query(&g_mysql, strSql) != 0)
{
    cout << mysql_error(&g_mysql) << endl;
    errorFile.WriteString(strSql);
    delete[] szName; // release the converted string before skipping this record
    continue;
}
delete[] szName;
/////////////////////////////////////////
However, I still ran into a problem with the code above. The following UTF-8 byte sequences, when present in the string, can cause the SQL statement to fail:
\xE0\x84\x81
\xE0\x82\xB7
\xE0\x80\xBF
\xE0\x90\x96
\xE0\x8B\x8A
My initial idea is to replace these characters. If there is a better way, you are welcome to discuss it further.
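One possible explanation, offered as a hedged observation rather than a confirmed diagnosis: read as three-byte UTF-8, the five sequences above decode to U+0101, U+00B7, U+003F, U+0416 and U+02CA. All of these are below U+0800, where UTF-8 uses one or two bytes, so a conversion that always writes three bytes produces "overlong" sequences that a strict UTF-8 consumer rejects. The sketch below shows an encoder that picks the shortest form. The names decodeThreeBytes and encodeUtf8 are made up for this example, and std::string is used instead of the raw buffers in the code above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decode a 3-byte UTF-8-shaped sequence
// without checking whether it is the shortest form.
unsigned decodeThreeBytes(const unsigned char* p)
{
    return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
}

// Hypothetical helper: encode one code point in the range U+0000..U+FFFF
// (a single 16-bit WCHAR value) using the shortest UTF-8 form.
std::string encodeUtf8(unsigned cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this approach, U+00B7 (the middle dot) is written as C2 B7 instead of E0 82 B7, while a typical Chinese character such as U+4E2D still comes out as the three bytes E4 B8 AD. If this is indeed the cause, calling something like encodeUtf8 in place of the fixed three-byte UnicodeToUTF_8 would avoid having to replace the characters, though the output length per character then varies.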
--------------------------------------------
Ppzhang | giszhang@gmail.com |