C++ Unicode file read/write

Last Update:2018-12-04 Source: Internet

Author: User


Read and write Unicode text in C++

With thanks to the original author: http://librawill.blogspot.com/2008/08/cunicode_2881.html
First, get familiar with the character types char, wchar_t, and TCHAR. char is the most familiar: a single-byte character type suited to ANSI encodings. wchar_t is a wide character type suited to Unicode text; it is two bytes on Windows, although its size is implementation-defined and it is four bytes on most Unix-like systems. TCHAR is a macro from the Windows headers: it is defined as char in an ANSI build and as wchar_t in a Unicode build.

How is a string represented? As a character array. In C and C++, an array passed around this way carries no length of its own; what you actually hold is a pointer to its first element. So const char *, const wchar_t *, and const TCHAR * represent strings in the different environments. The related Windows type names are: LPSTR, Long Pointer to STRing, equivalent to char *; LPCSTR, Long Pointer to Const STRing, equivalent to const char *; LPCWSTR, Long Pointer to Const Wide STRing, equivalent to const wchar_t *; and LPCTSTR, likewise equivalent to const TCHAR *. There is no need to memorize them: remember what each capital letter stands for and you can work out the meaning.
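The naming scheme can be made concrete with a small sketch. The aliases below only mirror what the Windows headers define; they live in a hypothetical sketch namespace so they cannot clash with the real names from <windows.h>, and UNICODE_BUILD is a made-up stand-in for the real UNICODE macro.

```cpp
#include <type_traits>

namespace sketch {
// Stand-in for TCHAR: wchar_t in a Unicode build, char otherwise.
#ifdef UNICODE_BUILD
using TCHAR = wchar_t;
#else
using TCHAR = char;
#endif

using LPSTR   = char *;           // Long Pointer to STRing
using LPCSTR  = const char *;     // Long Pointer to Const STRing
using LPCWSTR = const wchar_t *;  // Long Pointer to Const Wide STRing
using LPCTSTR = const TCHAR *;    // Long Pointer to Const TCHAR STRing
}  // namespace sketch

// Returns true when LPCTSTR resolves to the narrow-string pointer type.
constexpr bool tcharIsNarrow() {
    return std::is_same<sketch::LPCTSTR, sketch::LPCSTR>::value;
}
```

Compiled without UNICODE_BUILD, tcharIsNarrow() is true; defining the macro flips LPCTSTR to const wchar_t *, which is the whole point of the T names.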

Take a string such as "北京2008" ("Beijing 2008"). In ANSI code it is const char *cha = "北京2008"; and in Unicode code it is const wchar_t *wcha = L"北京2008";. In memory both are stored as binary. The ANSI (GBK) bytes are B1 B1 BE A9 32 30 30 38, and the Unicode (UTF-16, little-endian) bytes are 17 53 AC 4E 32 00 30 00 30 00 38 00.
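To check a byte sequence like this yourself, a small helper can print the little-endian bytes of each 16-bit unit. This is only an illustrative sketch: utf16leHex is a made-up name, and the string is written with escapes (\u5317\u4EAC is 北京, "Beijing") so the result does not depend on how the source file is saved.

```cpp
#include <cstdio>
#include <string>

// Formats the UTF-16LE byte sequence of a string as space-separated hex pairs.
std::string utf16leHex(const std::u16string &text) {
    std::string out;
    char buf[8];
    for (char16_t unit : text) {
        // Low byte first, then high byte: that is what little-endian means.
        std::snprintf(buf, sizeof buf, "%02X %02X ",
                      static_cast<unsigned>(unit & 0xFF),
                      static_cast<unsigned>((unit >> 8) & 0xFF));
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

Calling utf16leHex(u"\u5317\u4EAC2008") gives 17 53 AC 4E 32 00 30 00 30 00 38 00: two bytes per character, with a 00 high byte for each ASCII digit.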

Back to the earlier point: why can a pointer alone represent a string? Given the pointer, the computer only knows where the first character is. The answer is that a string ends with a terminator, '\0' (the byte 0x00 in ANSI or ASCII). Starting from the first character, the computer scans forward until it reaches 0x00 and takes that as the end of the string, which is why every stored string carries this extra terminator. Note that 0x00 is the terminator for single-byte strings. What about wide-character Unicode strings? There the terminator is 0x0000.
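The scan for the terminator can be sketched in a few lines. lengthUntilTerminator is a hypothetical name; in real code the standard library functions strlen and wcslen do the same job.

```cpp
#include <cstddef>

// Counts characters up to, but not including, the terminating zero.
// For char the terminator is one zero byte; for wchar_t it is one zero
// wide character, so the same loop serves both.
template <typename CharT>
std::size_t lengthUntilTerminator(const CharT *str) {
    std::size_t n = 0;
    while (str[n] != CharT(0)) {
        ++n;
    }
    return n;
}
```

The comparison is made per character, not per byte, which is why the 00 high bytes inside a wide string do not end it early.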

How do you represent a non-const string, and how can a char * buffer get its length at run time? One way is to allocate the memory manually with new. A better way is the string type: growing the string, tracking its length, and managing its memory all stop being your concern, because the C++ standard library handles them automatically.

How do you convert between the string types? Given const char *cha; and std::string str;, the assignment str = cha; converts from char * to string, and cha = str.c_str(); converts from string to const char *. The wide versions work the same way: given const wchar_t *wcha; and std::wstring wstr;, use wstr = wcha; and wcha = wstr.c_str();.
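A short sketch of these conversions follows. The function names are invented for illustration. One detail worth showing: c_str() returns a read-only pointer that is valid only while the string object lives, so getting a writable buffer means copying.

```cpp
#include <string>
#include <vector>

// const char * -> std::string: the characters are copied into the string.
std::string toString(const char *cha) { return std::string(cha); }

// const wchar_t * -> std::wstring: the same idea for wide characters.
std::wstring toWString(const wchar_t *wcha) { return std::wstring(wcha); }

// std::string -> writable buffer: copy the characters plus the '\0'.
std::vector<char> toWritableCopy(const std::string &str) {
    return std::vector<char>(str.c_str(), str.c_str() + str.size() + 1);
}
```

The vector owns its memory, so there is no new or delete[] to get wrong; buf.data() can be passed wherever a char * is expected.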

With string representation and conversion covered, let's look at file stream I/O in C++: fstream, ifstream, and ofstream. By default these work as byte streams, specifically ANSI byte streams intended for ANSI text. So how can Unicode text be read and written?

C++ does have a wfstream family. Unfortunately it behaves strangely. When I read Unicode text with wifstream, it reads one byte, pads it with 0x00, and then moves on to the next byte! Take a file containing the same "Beijing 2008" text (北京2008) again. Its Unicode bytes are 17 53 AC 4E 32 00 30 00 30 00 38 00, yet the characters that wifstream puts in memory are 17 00 53 00 AC 00 4E 00 ... which is no longer Unicode at all. The likely cause is that a wide stream converts the file's bytes through its locale, and the default "C" locale simply widens each byte to one wchar_t. I don't know how to use wfstream correctly. If you do, please leave a comment!

Since wfstream does not work, how can we read Unicode? We can borrow the way binary streams are read and written. To read a binary stream you need to know the layout of the stored unit, define it as a struct, and then read n bytes at a time in binary mode, where n is the size of the struct. Here there is no need to define a struct; wchar_t can be used directly. The code is as follows:
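std::ifstream fin;
fin.open(filename, std::ios::binary);
// Skip the first two bytes of the Unicode text, FF FE (the BOM, which identifies the Unicode encoding)
fin.seekg(2, std::ios::beg);
wchar_t wch = 0; // zeroed first, because wchar_t may be wider than 2 bytes
// Reading in the loop condition stops cleanly at end of file
while (fin.read((char *)(&wch), 2))
{
    // wch now holds one 2-byte character; process it here
}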
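A somewhat more defensive variant of the same idea might look like the sketch below. It is not the original author's code: readUtf16le is a hypothetical helper, it assumes the file is UTF-16 little-endian, and it collects the text in a std::u16string so the result does not depend on the size of wchar_t.

```cpp
#include <fstream>
#include <string>

// Reads a UTF-16LE file into a std::u16string, skipping a leading FF FE BOM.
// Returns an empty string if the file cannot be opened.
std::u16string readUtf16le(const std::string &path) {
    std::ifstream fin(path, std::ios::binary);
    std::u16string text;
    unsigned char pair[2];
    bool first = true;
    // read() is the loop condition, so a failed or partial read ends the loop.
    while (fin.read(reinterpret_cast<char *>(pair), 2)) {
        // Assemble the code unit by hand: low byte first, then high byte.
        char16_t unit = static_cast<char16_t>(pair[0] | (pair[1] << 8));
        if (first && unit == 0xFEFF) {  // BOM, honored only at the start
            first = false;
            continue;
        }
        first = false;
        text.push_back(unit);
    }
    return text;
}
```

Checking for the BOM instead of blindly seeking past two bytes means a file saved without one does not lose its first character.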

What if you want to read the data line by line? ifstream has the member function getline(cha, size), and the string library has the free function std::getline(fin, str). Do they work on Unicode text? The answer is no! Why? Because getline works on bytes: it looks for the ASCII line feed byte (0x0A), which in Windows text files follows a carriage return (0x0D). In Unicode text that byte can be half of an ordinary character. For example, the character U+4E0A is stored as 0A 4E, so getline would break the line in the middle of it. So what do the Unicode carriage return and line feed look like in binary? As double bytes they are 0D 00 and 0A 00, which getline cannot match as a unit either. What to do? Check for them manually:
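std::ifstream fin;
fin.open(filename, std::ios::binary);
std::wstring wstrLine;    // current line as wide characters
std::string strLineAnsi;  // current line converted to ANSI
size_t iLine = 0;         // number of lines read so far
size_t index = 2;         // start after the 2-byte BOM
while (true)
{
    fin.seekg(index, std::ios::beg);
    wchar_t wch = 0;      // zeroed first, because wchar_t may be wider than 2 bytes
    if (!fin.read((char *)(&wch), 2))
        break;            // end of file or read error
    if (wch == 0x000d)    // carriage return: the line is complete
    {
        strLineAnsi = ws2s(wstrLine);
        wstrLine.clear();
        iLine++;
        index += 4;       // skip the carriage return and the line feed
    }
    else
    {
        wstrLine.append(1, wch);
        index += 2;
    }
}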
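The line-splitting step can also be separated from the file reading. The sketch below assumes the text has already been decoded into 16-bit units (for example into a std::u16string); splitLines is a name made up for this illustration. Because it compares whole units, a character such as U+4E0A, whose low byte is 0x0A, is never mistaken for a line feed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits decoded UTF-16 text into lines on CR LF, bare CR, or bare LF.
std::vector<std::u16string> splitLines(const std::u16string &text) {
    std::vector<std::u16string> lines;
    std::u16string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char16_t ch = text[i];
        if (ch == u'\r' || ch == u'\n') {
            lines.push_back(current);
            current.clear();
            // Treat CR LF as a single line break.
            if (ch == u'\r' && i + 1 < text.size() && text[i + 1] == u'\n') {
                ++i;
            }
        } else {
            current.push_back(ch);
        }
    }
    if (!current.empty()) lines.push_back(current);  // last line without a break
    return lines;
}
```

Unlike the loop above, this version also keeps a final line that has no trailing line break.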

The program above can read Unicode, but how do we make sense of the text once it is read? That takes conversion between char * and wchar_t *. There is no simple formula: converting between ANSI and Unicode encodings can only be done by table lookup. The C standard library, available in C++, provides two functions for this. wcstombs(_dest, _source, _dsize) converts Unicode to ANSI, and mbstowcs(_dest, _source, _dsize) does the reverse. The parameters are the destination buffer, the source string, and the size of the destination: char *, const wchar_t *, and a length for wcstombs, and wchar_t *, const char *, and a length for mbstowcs. Both use the current C locale to decide which ANSI encoding is meant. Here are two functions found online that convert between string and wstring:
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>

std::string ws2s(const std::wstring &ws)
{
    std::string curLocale = setlocale(LC_ALL, NULL); // save the current locale, e.g. "C"
    setlocale(LC_ALL, "chs"); // Windows name for the Simplified Chinese locale
    const wchar_t *_source = ws.c_str();
    size_t _dsize = 2 * ws.size() + 1; // at most 2 bytes per character, plus the terminator
    char *_dest = new char[_dsize];
    memset(_dest, 0, _dsize);
    wcstombs(_dest, _source, _dsize);
    std::string result = _dest;
    delete[] _dest;
    setlocale(LC_ALL, curLocale.c_str()); // restore the saved locale
    return result;
}

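std::wstring s2ws(const std::string &s)
{
    std::string curLocale = setlocale(LC_ALL, NULL); // save the current locale
    setlocale(LC_ALL, "chs"); // Windows name for the Simplified Chinese locale
    const char *_source = s.c_str();
    size_t _dsize = s.size() + 1; // at most one wide character per byte, plus the terminator
    wchar_t *_dest = new wchar_t[_dsize];
    wmemset(_dest, 0, _dsize);
    mbstowcs(_dest, _source, _dsize);
    std::wstring result = _dest;
    delete[] _dest;
    setlocale(LC_ALL, curLocale.c_str()); // restore it rather than forcing "C"
    return result;
}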
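One weakness of the functions above is that they ignore the return value of wcstombs and mbstowcs, so a failed conversion silently yields a truncated string. The following is a hedged sketch of a checked variant. toMultibyte is a hypothetical name, and unlike ws2s it deliberately leaves the locale alone and converts in whatever C locale the program has already selected.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a wide string to a multibyte string in the current C locale.
// Throws if a character has no representation in that locale.
std::string toMultibyte(const std::wstring &ws) {
    // MB_CUR_MAX is the most bytes one character can need in this locale.
    std::vector<char> dest(ws.size() * MB_CUR_MAX + 1, '\0');
    std::size_t written = std::wcstombs(dest.data(), ws.c_str(), dest.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in this locale");
    }
    return std::string(dest.data(), written);
}
```

Sizing the buffer with MB_CUR_MAX avoids hard-coding the assumption of two bytes per character, and the vector releases its memory even when the function throws.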

With that, you can read Unicode text in C++. Writing works in a similar way.
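For completeness, the writing direction might be sketched as follows, mirroring the binary read: open the file in binary mode, write the BOM, then write each character as two bytes. writeUtf16le is a made-up helper name, and the sketch assumes little-endian UTF-16 output, as in the rest of the article.

```cpp
#include <fstream>
#include <string>

// Writes text as UTF-16LE with a leading FF FE BOM. Returns false on failure.
bool writeUtf16le(const std::string &path, const std::u16string &text) {
    std::ofstream fout(path, std::ios::binary);
    if (!fout) return false;
    auto putUnit = [&fout](char16_t unit) {
        // Low byte first, then high byte.
        fout.put(static_cast<char>(unit & 0xFF));
        fout.put(static_cast<char>((unit >> 8) & 0xFF));
    };
    putUnit(0xFEFF);  // BOM, stored as FF FE
    for (char16_t unit : text) putUnit(unit);
    fout.flush();
    return static_cast<bool>(fout);
}
```

Writing the bytes one at a time keeps the output identical regardless of the byte order of the machine running the program.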

