While doing some text processing recently, I found character encoding to be a real headache: converting between encodings often produces garbled characters. So I wrote helper conversion programs in Perl and in C++, in the hope that they will be useful to others.
First, a quick overview of common character encodings. The earliest was ASCII (American Standard Code for Information Interchange). In the days when software was rarely distributed internationally, ASCII was enough to meet encoding needs. As software products became international, ASCII alone could not represent the character sets of many languages, and Unicode was born in this context.
Common Unicode encodings include UTF-8, UTF-16, and UTF-32. UTF-16 and UTF-32 each come in big-endian and little-endian variants, which differ only in the order in which the bytes are stored in memory; this is covered widely online. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character depending on the value of the code point, and it is currently a very popular format. This article uses the conversion of ASCII (here meaning text in the system's ANSI code page) to UTF-8 as its example; conversion between other encodings can be done by adapting the code given in this article.
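To make the "1 to 4 bytes" rule concrete, here is a minimal sketch of UTF-8 encoding for a single code point in portable C++. The function name encode_utf8 is made up for this illustration and is not part of any library; the byte layout follows the standard UTF-8 bit patterns.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for values that cannot be encoded
// (surrogates U+D800..U+DFFF and anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid code points to encode
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x41) yields the single byte 0x41, while encode_utf8(0x4E2D) (the character 中) yields the three bytes E4 B8 AD. This is why plain ASCII text is already valid UTF-8, and only characters outside ASCII grow in size.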
Windows provides two useful encoding conversion API functions: MultiByteToWideChar and WideCharToMultiByte. This article uses the former: it converts the source text into wide (UTF-16) characters, which are then written to a file as UTF-8. The API function prototype is as follows:
int MultiByteToWideChar(
    _In_      UINT   CodePage,
    _In_      DWORD  dwFlags,
    _In_      LPCSTR lpMultiByteStr,
    _In_      int    cbMultiByte,
    _Out_opt_ LPWSTR lpWideCharStr,
    _In_      int    cchWideChar
);
MSDN explains this function in detail. Note, however, that the first parameter specifies the encoding of the source string being converted, not the target encoding; the possible values are listed on MSDN. In this example the source text was written with Notepad on the local machine, so its encoding depends on the local system settings. For that reason CP_ACP, the default ANSI code page of the current Windows system, is specified.
The C++ example is short, so no project download is provided. The next post will provide a conversion tool written in Perl. Stay tuned!
#include "stdafx.h"
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>

int _tmain(int argc, _TCHAR* argv[])
{
    if (3 != argc)
    {
        printf_s("Parameter error!\n");
        printf_s("Usage: ascii2utf8.exe asciifilepath utf8filepath\n");
        return 1;
    }

    // Text mode ("rt") collapses CR/LF pairs into '\n' while reading.
    FILE* ansiFile = _tfopen(argv[1], _T("rt"));
    if (NULL == ansiFile)
    {
        printf_s("Open ASCII file failed!\n");
        return 1;
    }

    // Get the file size; it is an upper bound on what fread() returns below.
    long fileSize = -1;
    if (0 == fseek(ansiFile, 0, SEEK_END))
        fileSize = ftell(ansiFile);
    if (fileSize < 0 || 0 != fseek(ansiFile, 0, SEEK_SET))
    {
        printf_s("Get ASCII file size error!\n");
        fclose(ansiFile);
        return 1;
    }

    // Read the whole file at once so that no double-byte character
    // is cut in half at a block boundary.
    char* ansiBuf = (char*)malloc((size_t)fileSize + 1);
    int ansiLen = 0;
    if (NULL != ansiBuf)
        ansiLen = (int)fread(ansiBuf, sizeof(char), (size_t)fileSize, ansiFile);
    fclose(ansiFile);
    if (NULL == ansiBuf)
    {
        printf_s("Out of memory!\n");
        return 1;
    }

    // First call (no output buffer): ask how many wide characters are needed.
    // MultiByteToWideChar rejects zero-length input, so empty files skip it.
    int wideLen = (ansiLen > 0)
        ? MultiByteToWideChar(CP_ACP, 0, ansiBuf, ansiLen, NULL, 0)
        : 0;

    // calloc zero-fills the buffer, so the result is always NUL-terminated.
    wchar_t* wideBuf = (wchar_t*)calloc((size_t)wideLen + 1, sizeof(wchar_t));

    // Second call: perform the actual conversion into wideBuf.
    bool ok = (NULL != wideBuf) &&
        (0 == ansiLen ||
         (wideLen > 0 &&
          MultiByteToWideChar(CP_ACP, 0, ansiBuf, ansiLen, wideBuf, wideLen) > 0));
    free(ansiBuf);
    if (!ok)
    {
        printf_s("Parse multibyte to wide char failed!\n");
        free(wideBuf);
        return 1;
    }

    // "ccs=UTF-8" makes the CRT encode wide characters written to this
    // stream as UTF-8.
    FILE* utfFile = _tfopen(argv[2], _T("wt, ccs=UTF-8"));
    if (NULL == utfFile)
    {
        printf_s("Create utf8 file failed!\n");
        free(wideBuf);
        return 1;
    }

    fputws(wideBuf, utfFile);

    fclose(utfFile);
    free(wideBuf);
    return 0;
}
The above is the code for ASCII-to-Unicode transcoding. To convert from some encoding A to some encoding B, change the first parameter of MultiByteToWideChar to the code page of encoding A, and when creating the output file with _tfopen, name encoding B in the ccs flag of the second parameter, for example _tfopen(L"output.txt", L"wt, ccs=B"). Note that the ccs flag only accepts UNICODE, UTF-8, and UTF-16LE. For any other target encoding, open the file without ccs, convert the wide-character string with WideCharToMultiByte using the code page of B, and write the resulting bytes yourself. With that, the conversion from encoding A to encoding B is complete.
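The idea behind any A-to-B conversion is the same two stages: decode the source bytes into Unicode code points, then encode those code points in the target encoding. Here is a minimal portable sketch of that idea without the Windows API. It assumes, purely for illustration, that encoding A is ISO-8859-1 (chosen because each byte value is the code point itself) and that encoding B is UTF-8. The function name latin1_to_utf8 is hypothetical.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Stage 1 (decode): in Latin-1 every byte value equals its code point.
// Stage 2 (encode): code points up to U+00FF need at most two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char byte : latin1) {
        unsigned int cp = byte;  // stage 1: byte -> code point
        if (cp < 0x80) {
            // ASCII range: one byte, unchanged
            utf8 += static_cast<char>(cp);
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
}
```

For a real code page such as GBK, stage 1 needs a lookup table instead of a direct byte-to-code-point mapping. On Windows that table is what MultiByteToWideChar supplies.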