Convert an ASCII file to a Unicode file

Source: Internet
Author: User

During some recent text-processing work, I found character encoding to be a real headache: converting between encodings often produces garbled characters. I therefore wrote helper conversion programs in both Perl and C++, in the hope that they will help others.

First, a brief overview of common character encodings. The earliest was the American Standard Code for Information Interchange (ASCII). In the era before software was widely internationalized, ASCII was enough to meet encoding needs. As software products went global, however, ASCII alone could no longer represent the character sets of many languages, and Unicode was born in this context.

Common Unicode encodings include UTF-8, UTF-16, and UTF-32. UTF-16 and UTF-32 each come in big-endian and little-endian variants; the difference lies only in the byte order in which the data is stored, and it is covered extensively elsewhere on the Internet. UTF-8 is a variable-length encoding that uses 1 to 4 bytes depending on the character's value, and it is currently a very popular format. This article uses the conversion of ASCII to UTF-8 as its example; conversions between most other encodings can be handled by adapting the code provided here.
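To make the "1 to 4 bytes" rule concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is my own and is not part of the article's program; it only illustrates how the byte count depends on the character's value.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: plain ASCII is stored unchanged.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate: invalid
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x41) yields the single byte 0x41, while encode_utf8(0x4E2D) yields the three bytes E4 B8 AD. This is also why a pure ASCII file is already a valid UTF-8 file.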

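Windows provides two useful encoding conversion API functions: MultiByteToWideChar and WideCharToMultiByte. This article uses the former for the ASCII-to-UTF-8 conversion: it turns the input bytes into wide (UTF-16) characters, and the C runtime then writes them to a file opened in UTF-8 mode. The API function prototype is as follows: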

int MultiByteToWideChar(
    _In_  UINT   CodePage,
    _In_  DWORD  dwFlags,
    _In_  LPCSTR lpMultiByteStr,
    _In_  int    cbMultiByte,
    _Out_ LPWSTR lpWideCharStr,
    _In_  int    cchWideChar
);

MSDN documents this function in detail. Note, however, that the first parameter specifies the encoding of the source text, not the target encoding; the possible values are listed on MSDN. In this example the input is text written with Notepad on the local machine, so its encoding depends heavily on the local system environment. You can therefore pass CP_ACP, the default ANSI code page of the current Windows system.
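As a hedged illustration of how the parameters fit together, the sketch below calls MultiByteToWideChar twice: once with cchWideChar set to 0 to ask for the required length, and once to do the conversion. The helper name to_wide and its signature are assumptions of mine, not part of the article's program, and the sketch is Windows-only.

```cpp
#ifdef _WIN32  // Windows-only sketch; compiles to nothing on other platforms
#include <windows.h>
#include <string>

// Hypothetical helper: convert bytes in `sourceCodePage` to UTF-16.
// Returns false if MultiByteToWideChar reports an error.
bool to_wide(const std::string& src, UINT sourceCodePage, std::wstring& out)
{
    out.clear();
    if (src.empty()) return true;
    const int srcLen = static_cast<int>(src.size());

    // First call: cchWideChar == 0 returns the number of wide characters
    // needed, without writing anything.
    const int needed = MultiByteToWideChar(sourceCodePage, 0,
                                           src.data(), srcLen, NULL, 0);
    if (needed == 0) return false;

    // Second call: convert into a buffer of exactly that size.
    out.resize(needed);
    const int written = MultiByteToWideChar(sourceCodePage, 0,
                                            src.data(), srcLen,
                                            &out[0], needed);
    return written == needed;
}
#endif
```

Calling to_wide(text, CP_ACP, result) treats text as being in the system's default ANSI code page, which matches the Notepad scenario described above. Passing an explicit length instead of -1 means the output does not include a terminating null character.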

The C++ example is short, so no project download is provided. The next post will present a conversion tool written in Perl. Stay tuned!

#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <tchar.h>
#include <Windows.h>

// Build with the Unicode character set so that _tmain, _tfopen and
// _ftprintf map to their wide-character versions.
#define BLOCK_SIZE 1024
#define MAX_SIZE   4096   // capacity of the output buffer, in wide characters

int _tmain(int argc, _TCHAR* argv[])
{
    if (3 != argc)
    {
        printf_s("Parameter error!\n");
        printf_s("Usage: ascii2utf8.exe asciifilepath utf8filepath\n");
        return 1;
    }

    FILE* g_asciiFile = _tfopen(argv[1], L"rt");
    if (NULL == g_asciiFile)
    {
        printf_s("Open ASCII file failed!\n");
        return 1;
    }

    // Get file size
    long fileSize = 0;
    if (fseek(g_asciiFile, 0, SEEK_END))
    {
        printf_s("1 get ASCII file size error!\n");
        fclose(g_asciiFile);
        return 1;
    }
    fileSize = ftell(g_asciiFile);
    if (fseek(g_asciiFile, 0, SEEK_SET))
    {
        printf_s("2 get ASCII file size error!\n");
        fclose(g_asciiFile);
        return 1;
    }

    // The ccs=UTF-8 flag makes the C runtime encode wide output as UTF-8.
    FILE* g_utfFile = _tfopen(argv[2], L"wt, ccs=UTF-8");
    if (NULL == g_utfFile)
    {
        printf_s("Create utf8 file failed!\n");
        fclose(g_asciiFile);
        return 1;
    }

    long offset = 0;
    char asciiArr[BLOCK_SIZE];
    WCHAR utfArr[MAX_SIZE];

    while (1)
    {
        memset(asciiArr, 0, sizeof(asciiArr));
        memset(utfArr, 0, sizeof(utfArr));

        // Read one byte less than the buffer holds so the block stays
        // null-terminated. Note: with a double-byte code page, a character
        // split across two blocks will be converted incorrectly.
        size_t actualCount = fread(asciiArr, sizeof(char), BLOCK_SIZE - 1, g_asciiFile);
        if (actualCount == 0)
            break;

        // cbMultiByte = -1: convert up to and including the terminating null.
        if (0 == MultiByteToWideChar(CP_ACP, 0, asciiArr, -1, utfArr, MAX_SIZE))
        {
            printf_s("Parse multibyte to wide char failed!\n");
            fclose(g_asciiFile);
            fclose(g_utfFile);
            return 1;
        }

        _ftprintf(g_utfFile, L"%s", utfArr);
        offset += (long)actualCount;
        if (offset >= fileSize)
            break;
    }

    if (g_asciiFile)
    {
        fclose(g_asciiFile);
        g_asciiFile = NULL;
    }

    if (g_utfFile)
    {
        fclose(g_utfFile);
        g_utfFile = NULL;
    }

    return 0;
}

The above is the code for transcoding ASCII to Unicode. If you want to convert from encoding A to encoding B, change the first parameter of MultiByteToWideChar to the code page of encoding A, and when creating the output file with _tfopen, specify encoding B in the second parameter, for example _tfopen(L"output.txt", L"wt, ccs=B encoding"). Note that the ccs flag accepts only UNICODE, UTF-8, and UTF-16LE, so this approach works only when B is one of those; for other target encodings, WideCharToMultiByte can perform the second step. That completes the conversion from encoding A to encoding B.
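For a target encoding that the file's ccs flag cannot express, one option is to convert in memory, using UTF-16 as the pivot: MultiByteToWideChar for the A-to-UTF-16 step and WideCharToMultiByte for the UTF-16-to-B step. The following is a minimal sketch under that assumption. The function name convert_code_page is hypothetical, and the code is Windows-only.

```cpp
#ifdef _WIN32  // Windows-only sketch; compiles to nothing on other platforms
#include <windows.h>
#include <string>

// Hypothetical helper: re-encode `src` from code page `fromCp` (encoding A)
// to code page `toCp` (encoding B) by going through UTF-16.
// Returns an empty string on failure or for empty input.
std::string convert_code_page(const std::string& src, UINT fromCp, UINT toCp)
{
    if (src.empty()) return std::string();
    const int srcLen = static_cast<int>(src.size());

    // Step 1: encoding A -> UTF-16 (size query, then conversion).
    const int wideLen = MultiByteToWideChar(fromCp, 0, src.data(), srcLen, NULL, 0);
    if (wideLen == 0) return std::string();
    std::wstring wide(wideLen, L'\0');
    if (MultiByteToWideChar(fromCp, 0, src.data(), srcLen, &wide[0], wideLen) == 0)
        return std::string();

    // Step 2: UTF-16 -> encoding B (size query, then conversion).
    // The last two arguments must be NULL when toCp is CP_UTF8.
    const int outLen = WideCharToMultiByte(toCp, 0, wide.data(), wideLen,
                                           NULL, 0, NULL, NULL);
    if (outLen == 0) return std::string();
    std::string out(outLen, '\0');
    if (WideCharToMultiByte(toCp, 0, wide.data(), wideLen,
                            &out[0], outLen, NULL, NULL) == 0)
        return std::string();
    return out;
}
#endif
```

For instance, convert_code_page(text, CP_ACP, CP_UTF8) would re-encode text from the system ANSI code page to UTF-8, and the resulting bytes could then be written to a file opened in binary mode. Characters that encoding B cannot represent are replaced by a default character, so the round trip is not always lossless.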
