Reading ANSI, Unicode, Unicode Big Endian, and UTF-8 Text Files Line by Line in the VC ANSI Environment

Source: Internet
Author: User

1. Problem statement
MFC provides the file class CStdioFile, whose ReadString function reads a file line by line. It cannot, however, read text files of different encodings line by line. To solve this problem, I studied the basics of character encoding and, drawing on some online materials, implemented CStdioFileEx, an extension of CStdioFile that reads common text files line by line (note: this does not cover other document formats such as DOC and PDF).
I am grateful to the people who share their coding experience online. The class I wrote has not been rigorously tested. If you find errors, or if the approach is more complicated than it needs to be, please let me know.
2. Solution
(1) Four common text file encodings
ANSI, Unicode, Unicode big endian, and UTF-8 differ as follows:
ANSI encoding:
No file header (the marker bytes at the start of a file that identify its encoding).
Letters and digits take one byte each; Chinese characters take two bytes each.
The carriage return and line feed are single bytes, 0D 0A in hexadecimal.

Unicode encoding:
The file header is FF FE in hexadecimal.
Each character is encoded in two bytes, low byte first.
The carriage return and line feed are two bytes each: the 16-bit values 000D 000A, stored on disk as 0D 00 0A 00.

Unicode big endian encoding:
The file header is FE FF in hexadecimal.
Each character is stored with its high byte first and its low byte second, exactly the opposite of the Unicode encoding.
The carriage return and line feed are two bytes each, stored on disk as 00 0D 00 0A (which a little-endian machine reads as the 16-bit values 0D00 0A00).

UTF-8 encoding:
The file header is EF BB BF in hexadecimal.
UTF-8 is a variable-length encoding of Unicode. Digits, letters, the carriage return, and the line feed take one byte each; Chinese characters typically take three bytes each.
The carriage return and line feed are single bytes, 0D 0A in hexadecimal.
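The byte layouts described above can be checked with a small portable sketch. It is not part of the CStdioFileEx class; the names Bytes, encodeUtf16, and encodeUtf8 are hypothetical helpers introduced only for illustration, and the sketch covers code points up to U+FFFF only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper names; none of them appear in the original listing.
using Bytes = std::vector<std::uint8_t>;

// Encode one 16-bit character as it would appear on disk.
// bigEndian == false: "Unicode" (low byte first).
// bigEndian == true:  "Unicode big endian" (high byte first).
Bytes encodeUtf16(std::uint16_t unit, bool bigEndian) {
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
    return bigEndian ? Bytes{high, low} : Bytes{low, high};
}

// Encode one code point up to U+FFFF as UTF-8 (1, 2, or 3 bytes).
Bytes encodeUtf8(std::uint16_t cp) {
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    if (cp < 0x80) {
        return Bytes{static_cast<std::uint8_t>(cp)};
    }
    if (cp < 0x800) {
        return Bytes{static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                     static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    }
    return Bytes{static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                 static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
}
```

For instance, encodeUtf16(0x000D, false) yields 0D 00 and encodeUtf16(0x000D, true) yields 00 0D, while encodeUtf8 returns a single byte for the carriage return and three bytes for a code point such as U+4F60.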

Take the Chinese word for "hello" as an example. The hexadecimal bytes produced by each encoding can be viewed in EditPlus.
[Figure: hexadecimal bytes of the example word in each of the four encodings]

Corrections to the discussion above are welcome.
(2) Reading text files in the four formats line by line
Because each encoding has its own file header, the program first checks the header to determine the encoding, then calls the reading function that matches it.
[Figure: flowchart of the line-by-line reading process]
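The header check can be sketched in portable C++ as follows. This is only an illustration of the idea under stated assumptions: the enumeration mirrors the article's file types, but the names TextCode (as a scoped enum), detectTextCode, and bomLength are hypothetical, and the function works on a buffer already read from the start of the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors the four encodings discussed above (illustrative names).
enum class TextCode { Utf8, Unicode, UnicodeBigEndian, Ansi };

// Classify a file from its first bytes. `len` is the number of valid
// bytes in `buf`; short files are handled by checking `len` first.
TextCode detectTextCode(const std::uint8_t* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return TextCode::Utf8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return TextCode::Unicode;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return TextCode::UnicodeBigEndian;
    return TextCode::Ansi;  // no recognized header
}

// Number of header bytes to skip before the first character.
std::size_t bomLength(TextCode tc) {
    switch (tc) {
    case TextCode::Utf8:             return 3;
    case TextCode::Unicode:          return 2;
    case TextCode::UnicodeBigEndian: return 2;
    case TextCode::Ansi:             return 0;
    }
    return 0;
}
```

Note that a UTF-8 file saved without a header falls through to the ANSI case with this kind of check, so header detection alone cannot tell the two apart.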

The implementation is the CStdioFileEx class, which inherits from CStdioFile and overrides its BOOL ReadString(CString& rString) method to read files line by line.
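For the two Unicode formats, the heart of such a line reader is a loop over 16-bit units. The following is a minimal MFC-free sketch, assuming the file contents are already in memory; readUtf16Line and its parameters are hypothetical names, not the class's API. It reads one unit at a time so that a CR LF pair is recognized wherever it falls in the line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read one line of 16-bit units from `data`, starting at byte offset `pos`
// (the caller is assumed to have skipped the 2-byte file header already).
// The terminator (CR LF or a bare LF) is kept in `line`.
// Returns false when no complete unit is left to read.
bool readUtf16Line(const std::vector<std::uint8_t>& data, std::size_t& pos,
                   bool bigEndian, std::u16string& line) {
    assert(pos <= data.size());
    line.clear();
    while (pos + 1 < data.size()) {
        const std::uint8_t first = data[pos];
        const std::uint8_t second = data[pos + 1];
        pos += 2;
        // Big endian stores the high byte first, so the bytes are swapped
        // relative to the little-endian "Unicode" format.
        const char16_t unit = bigEndian
            ? static_cast<char16_t>((first << 8) | second)
            : static_cast<char16_t>((second << 8) | first);
        line.push_back(unit);
        if (unit == u'\n') break;  // a CR LF pair always ends with LF
    }
    return !line.empty();
}
```

A caller would set pos to 2 after detecting the header and call the function repeatedly until it returns false. A trailing odd byte, if any, is ignored.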
(3) Implementation of the CStdioFileEx class
Code listing:

// StdioFileEx.h: interface for the CStdioFileEx class.
//////////////////////////////////////////////////////////////////////
#if !defined(AFX_STDIOFILEEX_H__INCLUDED_)
#define AFX_STDIOFILEEX_H__INCLUDED_

#if _MSC_VER > 1000
#pragma once
#endif // _MSC_VER > 1000

// Purpose: read text files in common formats (ANSI, Unicode, Unicode big endian, UTF-8) line by line.
// Author: Wang Dingqiao, School of Computer Science and Technology, Hubei Normal University
// Core algorithm: CStdioFileEx inherits from CStdioFile and overrides BOOL ReadString(CString& rString).
//   The file header is checked to determine the file type, and the matching reading function is called.
//   The carriage return / line feed marks the end of a line; the end of the data marks the end of the file.
// Test results: TXT files in the four formats were tested with VC 6.0 on Windows 7.
// Not done: virtual LPTSTR ReadString(LPTSTR lpsz, UINT nMax) of CStdioFile is not overridden;
//   WriteString is not implemented; the class is not tested in a VC Unicode build.
// Date: 2012-04-19
// Copyright: the code is open for study and exchange; corrections and improvements are welcome.
//--------------------------------------------------------------------
#include "stdafx.h"

// Enumeration of text file encodings.
// Note: UNICODE is also a preprocessor macro in Unicode builds,
// so this enumerator name only compiles in an ANSI build.
typedef enum TextCodeType
{
    UTF8 = 0,
    UNICODE = 1,
    UNICODEBIGENDIAN = 2,
    ANSI = 3,
    FILEERROR = 4
} TextCode;

class CStdioFileEx : public CStdioFile
{
public:
    CStdioFileEx();
    CStdioFileEx(FILE* pOpenStream);
    CStdioFileEx(LPCTSTR lpszFileName, UINT nOpenFlags);
    virtual ~CStdioFileEx();
    virtual BOOL Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError = NULL);

public:
    // Convert the file type value to a string
    CString FileTypeToString();
    // Get the file type of the opened file
    TextCode GetFileType();
    // Read the file line by line
    BOOL ReadString(CString& rString);
    // Get the file type of a named file (static method)
    static TextCode GetFileType(LPCTSTR lpszFileName);

protected:
    TextCode m_FileType;              // file type
    const static int PREDEFINEDSIZE;  // space reserved for one line

protected:
    // Read one line from a UTF-8 file
    BOOL ReadStringFromUTF8File(CString& rString);
    // Read one line from an ANSI file
    BOOL ReadStringFromAnsiFile(CString& rString);
    // Read one line from a Unicode or Unicode big endian file
    BOOL ReadStringFromUnicodeFile(CString& rString);
    // Convert a UTF-8 string to a Unicode string
    CString UTF8ToUnicode(BYTE* szUTF8);
    // Adjust the open flags according to the file type
    UINT ProcessFlags(LPCTSTR lpszFileName, UINT& nOpenFlags, TextCode& tc);
};

#endif // !defined(AFX_STDIOFILEEX_H__INCLUDED_)
// StdioFileEx.cpp: implementation of the CStdioFileEx class.
//////////////////////////////////////////////////////////////////////
#include "stdafx.h"
#include "StdioFileEx.h"

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#define new DEBUG_NEW
#endif

//////////////////////////////////////////////////////////////////////
// Construction/Destruction
//////////////////////////////////////////////////////////////////////

/*static*/ const int CStdioFileEx::PREDEFINEDSIZE = 1024;

CStdioFileEx::CStdioFileEx() : CStdioFile()
{
    m_FileType = ANSI; // default type
}

CStdioFileEx::CStdioFileEx(FILE* pOpenStream) : CStdioFile(pOpenStream)
{
    CString filepath = pOpenStream->_tmpfname; // ? the FILE* structure is not well understood
    m_FileType = GetFileType(filepath);
}

CStdioFileEx::CStdioFileEx(LPCTSTR lpszFileName, UINT nOpenFlags)
    : CStdioFile(lpszFileName, ProcessFlags(lpszFileName, nOpenFlags, m_FileType))
{
}

CStdioFileEx::~CStdioFileEx()
{
}

//--------------------------------------------------------------------
// CStdioFileEx::GetFileType  static method that detects the text file type
//--------------------------------------------------------------------
/*static*/ TextCode CStdioFileEx::GetFileType(LPCTSTR lpszFileName)
{
    CFile file;
    BYTE buf[3] = { 0 };      // unsigned char; zeroed in case the file is shorter than 3 bytes
    TextCode tc = FILEERROR;  // returned if an exception interrupts detection
    try
    {
        if (file.Open(lpszFileName, CFile::modeRead | CFile::shareDenyNone | CFile::typeBinary))
        {
            file.Read(buf, 3);
            if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
                tc = UTF8;
            else if (buf[0] == 0xFF && buf[1] == 0xFE)
                tc = UNICODE;
            else if (buf[0] == 0xFE && buf[1] == 0xFF)
                tc = UNICODEBIGENDIAN;
            else
                tc = ANSI;
        }
        else
            tc = FILEERROR;
    }
    catch (CFileException* ex)  // MFC throws its exceptions as pointers
    {
        CString errormsg;
        errormsg.Format(_T("An exception occurred while operating on file %s!"), (LPCTSTR)ex->m_strFileName);
        AfxMessageBox(errormsg);
        ex->Delete();
    }
    return tc;
}

//--------------------------------------------------------------------
// CStdioFileEx::ReadString  read a text file line by line
// Calls the reading function that matches the file type
//--------------------------------------------------------------------
BOOL CStdioFileEx::ReadString(CString& rString)
{
    BOOL flag = FALSE;
    switch (m_FileType)
    {
    case ANSI:
        flag = ReadStringFromAnsiFile(rString);
        break;
    case UNICODE:
    case UNICODEBIGENDIAN:
        flag = ReadStringFromUnicodeFile(rString);
        break;
    case UTF8:
        flag = ReadStringFromUTF8File(rString);
        break;
    case FILEERROR:
        flag = FALSE;
        break;
    default:
        break;
    }
    return flag;
}

//--------------------------------------------------------------------
// CStdioFileEx::ReadStringFromAnsiFile  read one line from an ANSI file
//--------------------------------------------------------------------
BOOL CStdioFileEx::ReadStringFromAnsiFile(CString& rString)
{
    BOOL flag = FALSE;
    try
    {
        flag = CStdioFile::ReadString(rString);
        // CStdioFile::ReadString strips the newline; put it back so that
        // all four readers return the line terminator
        if (flag)
            rString += _T("\r\n");
    }
    catch (CFileException* ex)
    {
        CString errormsg;
        errormsg.Format(_T("An exception occurred while operating on file %s!"), (LPCTSTR)ex->m_strFileName);
        AfxMessageBox(errormsg);
        ex->Delete();
    }
    return flag;
}

//--------------------------------------------------------------------
// CStdioFileEx::ReadStringFromUTF8File  read one line from a UTF-8 file
// UTF-8 is a multi-byte encoding with variable character lengths, but
// CR (0x0D) and LF (0x0A) never occur inside a multi-byte character, so the
// file is scanned byte by byte and the line ends at the line feed. Reading
// one byte at a time also finds a CR LF pair wherever it falls in the line.
//--------------------------------------------------------------------
BOOL CStdioFileEx::ReadStringFromUTF8File(CString& rString)
{
    long index = 0;
    const BYTE lf = 0x0A;  // line feed
    BYTE tempbyte;
    BYTE* pbuf = new BYTE[PREDEFINEDSIZE + 1];
    memset(pbuf, 0, (PREDEFINEDSIZE + 1) * sizeof(BYTE));
    UINT readlen;
    try
    {
        // Skip the file header
        if (m_pStream && (GetPosition() == 0))
        {
            CStdioFile::Seek(3 * sizeof(BYTE), CFile::begin);
        }
        do
        {
            tempbyte = 0;
            // CStdioFile::Read behaves differently: it drops the carriage return 0x0D
            readlen = CFile::Read(&tempbyte, 1);
            if (!readlen)
                break;  // end of file
            pbuf[index++] = tempbyte;
            if (tempbyte == lf)
                break;  // end of line
        } while (index < PREDEFINEDSIZE);
        pbuf[index] = 0;
        rString = UTF8ToUnicode(pbuf);  // convert the UTF-8 bytes to Unicode
    }
    catch (CFileException* ex)
    {
        CString errormsg;
        errormsg.Format(_T("An exception occurred while operating on file %s!"), (LPCTSTR)ex->m_strFileName);
        AfxMessageBox(errormsg);
        ex->Delete();
    }
    delete[] pbuf;
    return (index > 0) ? TRUE : FALSE;
}

//--------------------------------------------------------------------
// Read one line from a Unicode or Unicode big endian file
// The loop ends at the line feed, at the end of the file (fewer bytes read
// than requested), or when the predefined space is used up.
// wchLine stores the characters of the line; wchtemp holds the character just read.
// For Unicode big endian, the high and low bytes are swapped to obtain Unicode.
//--------------------------------------------------------------------
BOOL CStdioFileEx::ReadStringFromUnicodeFile(CString& rString)
{
    long index = 0;
    UINT readlen;
    wchar_t wchLF = MAKEWORD(0x0A, 0x00);  // line feed; MAKEWORD(low byte, high byte)
    wchar_t* wchLine = new wchar_t[PREDEFINEDSIZE + 1];
    memset(wchLine, 0, (PREDEFINEDSIZE + 1) * sizeof(wchar_t));
    wchar_t wchtemp;
    BOOL flag = TRUE;
    try
    {
        // Skip the file header
        if (m_pStream && (GetPosition() == 0))
        {
            Seek(2 * sizeof(BYTE), CFile::begin);
        }
        do
        {
            wchtemp = 0;
            // CStdioFile::Read behaves differently
            readlen = CFile::Read(&wchtemp, sizeof(wchar_t));
            if (readlen < sizeof(wchar_t))
                break;  // end of file
            // Unicode big endian: swap the high and low bytes
            if (UNICODEBIGENDIAN == m_FileType)
            {
                unsigned char high, low;
                high = (wchtemp & 0xFF00) >> 8;
                low = wchtemp & 0x00FF;
                wchtemp = (low << 8) | high;
            }
            wchLine[index++] = wchtemp;
            if (wchtemp == wchLF)
                break;  // end of line
        } while (index < PREDEFINEDSIZE);
        wchLine[index] = 0;
        CString strtext(wchLine, index);
        rString = strtext;
        if (rString.IsEmpty())
            flag = FALSE;
    }
    catch (CFileException* ex)
    {
        CString errormsg;
        errormsg.Format(_T("An exception occurred while operating on file %s!"), (LPCTSTR)ex->m_strFileName);
        AfxMessageBox(errormsg);
        ex->Delete();
    }
    delete[] wchLine;
    return flag;
}

//--------------------------------------------------------------------
// CStdioFileEx::UTF8ToUnicode  convert a UTF-8 string to a Unicode string
//--------------------------------------------------------------------
CString CStdioFileEx::UTF8ToUnicode(BYTE* szUTF8)
{
    CString strret;
    strret = _T("");
    if (!szUTF8)
        return strret;
    // Get the length of the converted string
    int nLen = (int)strlen((char*)szUTF8);
    int wcsLen = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)szUTF8, nLen, NULL, 0);
    LPWSTR lpw = new WCHAR[wcsLen + 1];
    if (!lpw)
        return strret;
    memset(lpw, 0, (wcsLen + 1) * sizeof(wchar_t));
    // Convert
    MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)szUTF8, nLen, lpw, wcsLen);
    CString str(lpw);
    delete[] lpw;
    return str;
}

//--------------------------------------------------------------------
// CStdioFileEx::GetFileType  get the file type
//--------------------------------------------------------------------
TextCode CStdioFileEx::GetFileType()
{
    return m_FileType;
}

//--------------------------------------------------------------------
// CStdioFileEx::FileTypeToString  convert the file type value to a string
//--------------------------------------------------------------------
CString CStdioFileEx::FileTypeToString()
{
    CString strtype;
    switch (m_FileType)
    {
    case ANSI:
        strtype = _T("ANSI");
        break;
    case UTF8:
        strtype = _T("UTF8");
        break;
    case UNICODE:
        strtype = _T("Unicode");
        break;
    case UNICODEBIGENDIAN:
        strtype = _T("Unicode big endian");
        break;
    case FILEERROR:
        strtype = _T("FileError");
        break;
    default:
        break;
    }
    return strtype;
}

//--------------------------------------------------------------------
// CStdioFileEx::Open  override of the parent class's Open that changes
// the open mode according to the file type
//--------------------------------------------------------------------
BOOL CStdioFileEx::Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError)
{
    ProcessFlags(lpszFileName, nOpenFlags, m_FileType);
    return CStdioFile::Open(lpszFileName, nOpenFlags, pError);
}

//--------------------------------------------------------------------
// CStdioFileEx::ProcessFlags  choose the open mode for each file type
// ANSI files are read in text mode; Unicode, Unicode big endian, and
// UTF-8 files are read in binary mode
//--------------------------------------------------------------------
UINT CStdioFileEx::ProcessFlags(LPCTSTR lpszFileName, UINT& nOpenFlags, TextCode& tc)
{
    tc = CStdioFileEx::GetFileType(lpszFileName);
    // CFile::modeRead is 0, so compare the access bits instead of AND-ing with it
    UINT nAccess = nOpenFlags & (CFile::modeWrite | CFile::modeReadWrite);
    if (nAccess == CFile::modeReadWrite || nAccess == CFile::modeRead)
    {
        switch (tc)
        {
        case ANSI:
            nOpenFlags |= CFile::typeText;
            nOpenFlags &= ~CFile::typeBinary;
            break;
        case UTF8:
        case UNICODE:
        case UNICODEBIGENDIAN:
            nOpenFlags |= CFile::typeBinary;
            nOpenFlags &= ~CFile::typeText;
            break;
        case FILEERROR:
            break;
        default:
            break;
        }
    }
    nOpenFlags |= CFile::shareDenyNone;
    return nOpenFlags;
}
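The UTF8ToUnicode method in the listing relies on the Win32 function MultiByteToWideChar with CP_UTF8. To show what such a conversion involves, here is a portable sketch of a UTF-8 to UTF-16 decoder. It is an illustration only, not a drop-in replacement: the name utf8ToUtf16 is hypothetical, malformed input is replaced with U+FFFD as a simplifying assumption, and overlong encodings are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units (hypothetical helper).
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        std::uint32_t cp = 0xFFFD;  // replacement for invalid lead bytes
        std::size_t extra = 0;      // continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        ++i;
        for (; extra > 0; --extra) {
            // A missing or malformed continuation byte ends the sequence;
            // the offending byte is left to be decoded on its own.
            if (i >= in.size() ||
                (static_cast<std::uint8_t>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;
                break;
            }
            cp = (cp << 6) | (static_cast<std::uint8_t>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp >= 0x10000) {
            // Characters above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    assert(i == in.size());  // every input byte was consumed
    return out;
}
```

Because the carriage return and line feed are single bytes below 0x80, they pass through the first branch unchanged, which is why a UTF-8 line can be split on 0x0A before it is decoded.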

3. Test run
(1) Core test code
// Open the file
void CReadStringDlg::OnBtnOpen()
{
    // TODO: Add your control notification handler code here
    char szFilter[] = "Text Files (*.txt)|*.txt|All Files (*.*)|*.*||";
    CFileDialog fileDlg(TRUE, "txt", NULL, OFN_HIDEREADONLY | OFN_OVERWRITEPROMPT, szFilter, this);
    if (IDOK == fileDlg.DoModal())
    {
        m_strPath = fileDlg.GetPathName();
        UpdateData(FALSE);
        m_ctrlEdit.SetSel(0, -1);
        m_ctrlEdit.Clear();
        // Close a previously opened file before opening another one
        if (m_stdioFileEx.m_pStream)
            m_stdioFileEx.Close();
        if (m_stdioFileEx.Open(m_strPath, CFile::modeRead))
        {
            m_strFileType = m_stdioFileEx.FileTypeToString();
            UpdateData(FALSE);
        }
        else
        {
            MessageBox(_T("Failed to read the file!"));
        }
    }
}

// Read the file
void CReadStringDlg::OnBtnRead()
{
    // TODO: Add your control notification handler code here
    if (!ValidateInput())
        return;
    CString strRead, strTemp;
    m_ctrlEdit.GetWindowText(strRead);
    m_ctrlEdit.SetSel(0, -1);
    m_ctrlEdit.Clear();
    if (m_stdioFileEx.m_pStream)
    {
        int cnt = 0;
        strRead += "\r\n";
        // Read the requested number of lines, or stop at the end of the file
        while (cnt < m_lLineCnt)
        {
            if (m_stdioFileEx.ReadString(strTemp))
            {
                strRead += strTemp;
                cnt++;
            }
            else
            {
                AfxMessageBox(_T("Reached the end of the file!"));
                break;
            }
        }
        m_ctrlEdit.SetSel(0, -1);
        m_ctrlEdit.ReplaceSel(strRead);
    }
    else
    {
        MessageBox(_T("Failed to read the file!"));
    }
}

// Validate the input
BOOL CReadStringDlg::ValidateInput()
{
    UpdateData();
    if (m_strPath.IsEmpty())
    {
        MessageBox("The file path is empty. Please enter it!");
        return FALSE;
    }
    if (m_lLineCnt <= 0)
        return FALSE;
    return TRUE;
}

(2) Test results
The program passes the tests under Windows 7 with VC 6.0. The results for the four file types are shown in the following figures:
[Figure: ANSI file test result]
[Figure: Unicode file test result]
[Figure: Unicode big endian file test result]
[Figure: UTF-8 file test result]

4. Unresolved issues
(1) CStdioFileEx does not override the virtual LPTSTR ReadString(LPTSTR lpsz, UINT nMax) method of CStdioFile.
(2) The WriteString method of CStdioFileEx is not implemented.
(3) CStdioFileEx has not been tested in a VC Unicode build.
(4) Is there a simpler and safer method? I have not studied encoding in enough depth to give a better answer.
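As a starting point for item (2), the following sketch shows one way the bytes of a written line could be produced for the two Unicode formats. It is purely illustrative and makes several assumptions: appendUtf16Line is a hypothetical name, the output goes to an in-memory buffer rather than a file, the line is already available as 16-bit units, and ANSI output (which needs a code page conversion) is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append one line to `out` as "Unicode" (little endian) or
// "Unicode big endian" bytes, followed by CR LF.
// When `out` is empty, the matching file header is written first.
void appendUtf16Line(std::vector<std::uint8_t>& out,
                     const std::u16string& line, bool bigEndian) {
    assert(out.size() % 2 == 0);  // existing content must be whole 16-bit units
    auto put = [&](char16_t unit) {
        const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
        out.push_back(bigEndian ? high : low);
        out.push_back(bigEndian ? low : high);
    };
    if (out.empty()) put(0xFEFF);  // header: FF FE or FE FF on disk
    for (char16_t unit : line) put(unit);
    put(u'\r');
    put(u'\n');
}
```

Writing the value 0xFEFF through the same byte-ordering path as the text is what makes the header come out as FF FE for Unicode and FE FF for Unicode big endian.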
5. Acknowledgments
Thanks to those who shared the following materials online:
(1) An introductory article on encoding: http://www.live-in.org/archives/277.html
(2) CStdioFileEx class, download URL: http://download.pudn.com/downloads141/sourcecode/windows/file/21457326wutext.rar
(3) "The Complete Guide to C++ Strings, Part I: Win32 Character Encodings": http://www.vckbase.com/document/viewdoc?id=1082
