UTF-8 and Unicode Conversion

Source: Internet
Author: User

A while ago I needed to convert between UTF-8 and Unicode myself and decided to wrap the conversion in a class. I was tempted to be lazy and tried code I found on the Internet, but I ran into many problems. First, the interfaces differ from what I needed. Second, most of the methods found online are unreliable or no longer applicable.

The principles of converting between the two are covered widely online, so I will not go into detail here. Instead, I describe the problems I encountered.

The interface I wanted looks like this:

    class cstrconvertor
    {
    public:
        static int unicode2utf8(LPSTR cbuf, int& icbuf, LPCWSTR ubuf, int iubuf);
        static int utf82unicode(LPWSTR pdst, int ndstlen, LPSTR psrc, int nsclen);
    };

That is, the buffer that receives the conversion result is allocated by the caller, and the method only fills it in.

 

Question 1: How to check that the parameters are valid, in particular whether the result buffer is large enough. Because UTF-8 is variable-length, the space the result occupies is not known until the conversion ends, so it is hard to judge in advance. One might say: for unicode2utf8, just allocate 3 times the input length plus 1, and for utf82unicode, allocate 2k + 2 bytes (where k is the input length), and let the conversion function check that the result buffer reaches this size. That does work, but for a well-designed class it is not rigorous. Suppose that in a call to unicode2utf8 the result actually needs only 2k bytes and the caller has allocated exactly that much. The function would see a buffer smaller than 3k + 1 and report an error, even though the conversion would have succeeded. Checking against the worst-case size is therefore not a workable test.

Solution: 1: Some code on the Internet allocates the result buffer inside the conversion function, and the caller only declares a pointer. The buffer is then grown with new as the conversion proceeds. This solves the problem above, but allocation is itself fairly expensive, so the function is inefficient.

2: The method I used: traverse the input once before converting, work out how many bytes each character occupies after conversion, and accumulate the total. The parameter check is then based on this exact value. This costs an extra pass, but it is more rigorous.
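
The counting pass in method 2 can be sketched roughly as follows. This is only an illustration, not the class's own code: the name utf8_length_bmp is hypothetical, and the sketch assumes every input element is a single unit in the range U+0000 to U+FFFF (surrogate pairs are not combined).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of UTF-8 bytes needed to encode `count`
// units, assuming each unit lies in U+0000..U+FFFF.
// Returns -1 if a unit falls outside that range.
long utf8_length_bmp(const wchar_t* in, std::size_t count)
{
    assert(in != nullptr || count == 0);
    long total = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        unsigned long u = static_cast<unsigned long>(in[i]);
        if (u <= 0x7f)        total += 1;  // 0xxxxxxx
        else if (u <= 0x7ff)  total += 2;  // 110xxxxx 10xxxxxx
        else if (u <= 0xffff) total += 3;  // 1110xxxx 10xxxxxx 10xxxxxx
        else return -1;                    // outside the range handled here
    }
    return total;
}
```

A caller would compare the returned count with the size of the result buffer before converting. For the three characters U+0041, U+00E9 and U+4E2D the count is 6 bytes, while the worst-case rule would demand 3 × 3 + 1 = 10.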

 

 

Question 2: In utf82unicode, for example, we need to check whether the leading bits of each byte match the format defined by UTF-8. This means extracting the top few bits of a byte. Many methods on the Internet build the mask with a plain shift, such as 0xff << 5. However, this leads to a very basic error.

Stated abstractly: is 0xff << 5 the same as 0xe0? Only when the result is truncated to 8 bits. In C and C++ the operands of << are promoted to at least int before the shift, so the bits shifted out of the low byte are not lost: 0xff << 5 evaluates to 0x1fe0, not 0xe0. The two are equal only if the arithmetic is done at a width of 8 bits or the result is converted back to an 8-bit type. Many algorithms on the Internet overlook this. I suspect they were written with 8-bit arithmetic in mind and were not revised before being posted online. The discrepancy arises only with left shifts; right-shifting a non-negative value cannot produce it, because bits only fall off the low end. The lesson is to use shift operators sparingly: there is too much to keep in mind, and the code becomes harder to read.
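
The point about shift width can be checked directly, and literal masks avoid the issue altogether. The sketch below is illustrative only: check_masks and utf8_lead_length are hypothetical names, and the lead-byte helper covers just the 1-byte to 3-byte forms discussed in this article.

```cpp
#include <cassert>

// Restates the arithmetic from the text as runtime checks.
void check_masks()
{
    // The literal 0xff has type int, so the shift keeps the high bits.
    assert((0xff << 5) == 0x1fe0);
    // Only an explicit conversion to an 8-bit type yields 0xe0.
    assert(static_cast<unsigned char>(0xff << 5) == 0xe0);
    // A right shift of a non-negative value has no such surprise.
    assert((0xff >> 2) == 0x3f);
}

// Hypothetical helper: sequence length implied by a UTF-8 lead byte,
// using literal masks instead of shifts. Returns 0 for a continuation
// byte (10xxxxxx) or for a lead byte outside the 1..3-byte forms.
int utf8_lead_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xe0) == 0xc0) return 2;  // 110xxxxx
    if ((lead & 0xf0) == 0xe0) return 3;  // 1110xxxx
    return 0;
}
```

Taking the byte as unsigned char matters here: with a plain char that happens to be signed, bytes of 0x80 and above become negative and comparisons against hexadecimal constants can go wrong.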

 

------------------------------------------------------------

Background:

1: On Windows a wchar_t, which this article calls a Unicode character, occupies 2 bytes in memory. UTF-8 is a variable-length storage format; the original definition allowed sequences of up to 6 bytes, and RFC 3629 now limits it to 4. Three bytes are enough for every character in the range U+0000 to U+FFFF, which covers nearly all characters in common use. So for unicode2utf8 the worst case is that every Unicode character occupies 3 bytes after conversion, which gives 3 times the input length, plus one byte for the \0 terminator: at most 3k + 1 bytes. For utf82unicode the worst case is that every UTF-8 byte becomes one Unicode character, that is, each source byte becomes two bytes, plus two bytes for the final \0: at most 2k + 2 bytes.
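
The two worst-case bounds are simple enough to write down as helpers. The names max_utf8_bytes and max_unicode_bytes are hypothetical, and the second assumes a 2-byte wchar_t as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Worst case for unicode2utf8: 3 bytes per input character plus a
// 1-byte terminator, i.e. 3k + 1 bytes for k input characters.
std::size_t max_utf8_bytes(std::size_t k)
{
    assert(k <= (SIZE_MAX - 1) / 3);  // guard against overflow
    return 3 * k + 1;
}

// Worst case for utf82unicode: every input byte becomes one 2-byte
// character, plus a 2-byte terminator, i.e. 2k + 2 bytes for k bytes.
std::size_t max_unicode_bytes(std::size_t k)
{
    assert(k <= (SIZE_MAX - 2) / 2);  // guard against overflow
    return 2 * k + 2;
}
```

These are upper bounds for sizing a buffer generously; as discussed under Question 1, they are not suitable as the validity check inside the conversion function.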

------------------------------------------------------------

Below is the code of the encapsulated class. If you find an error, corrections are welcome.

    /**
     * Converts a Unicode array into UTF-8 format.
     * @param out        pointer to the region that stores the conversion result
     * @param outlength  size of the region that stores the conversion result
     * @param in         source string
     * @param inlength   size of the region where the source string is stored
     * @return the actual length of the conversion result in the target string;
     *         -1 if the conversion fails
     */
    int cstrconvertor::unicode2utf8(char* out, int& outlength, const wchar_t* in, int inlength)
    {
        // --------------------------------------------------
        // Parameter validity check
        if (out == NULL || in == NULL || inlength < 0)
        {
            return -1;
        }
        int totalnum = 0;
        for (int i = 0; i < inlength; i++) // calculate the actual length required for the result
        {
            wchar_t unicode = in[i];
            if (unicode >= 0x0000 && unicode <= 0x007f)
            {
                totalnum += 1;
            }
            else if (unicode >= 0x0080 && unicode <= 0x07ff)
            {
                totalnum += 2;
            }
            else if (unicode >= 0x0800 && unicode <= 0xffff)
            {
                totalnum += 3;
            }
            else
            {
                return -1; // value outside the range this function handles
            }
        }
        if (outlength < totalnum) // the result buffer is too small
        {
            return -1;
        }
        // --------------------------------------------------

        int outsize = 0; // counts the actual size of the output
        char* tmp = out;
        int i = 0;
        for (i = 0; i < inlength; i++)
        {
            if (outsize > outlength) // handle insufficient space
            {
                return -1;
            }
            wchar_t unicode = in[i];

            if (unicode >= 0x0000 && unicode <= 0x007f)
            {
                *tmp = (char)unicode;
                tmp += 1;
                outsize += 1;
            }
            else if (unicode >= 0x0080 && unicode <= 0x07ff)
            {
                *tmp = (char)(0xc0 | (unicode >> 6));
                tmp += 1;
                *tmp = (char)(0x80 | (unicode & (0xff >> 2)));
                tmp += 1;
                outsize += 2;
            }
            else if (unicode >= 0x0800 && unicode <= 0xffff)
            {
                *tmp = (char)(0xe0 | (unicode >> 12));
                tmp += 1;
                *tmp = (char)(0x80 | ((unicode >> 6) & 0x3f)); // middle 6 bits only
                tmp += 1;
                *tmp = (char)(0x80 | (unicode & (0xff >> 2)));
                tmp += 1;
                outsize += 3;
            }
        }
        return outsize;
    }

    // --------------------------------------------------

    /**
     * Converts a UTF-8 array into Unicode format.
     * The return value is the number of wchar_t elements occupied by the
     * Unicode data after conversion (not the total number of char).
     * @param out      pointer to the region that stores the conversion result
     * @param outsize  size of the region that stores the conversion result, in wchar_t
     * @param in       source string
     * @param insize   size of the source string, in bytes
     * @return length of the conversion result in the target string;
     *         -1 if the conversion fails
     */
    int cstrconvertor::utf82unicode(LPWSTR out, int outsize, LPSTR in, int insize)
    {
        // --------------------------------------------------
        // Parameter validity check
        if (out == NULL || in == NULL || insize < 0)
        {
            return -1;
        }

        // Work on unsigned bytes so that values of 0x80 and above compare correctly.
        const unsigned char* p = (const unsigned char*)in;

        int totalnum = 0;
        int i = 0;
        while (i < insize) // count the characters the result will contain
        {
            if ((p[i] & 0x80) == 0x00)      // highest bit is 0: a 1-byte sequence
            {
                i += 1;
            }
            else if ((p[i] & 0xe0) == 0xc0) // top three bits are 110: a 2-byte sequence
            {
                i += 2;
            }
            else if ((p[i] & 0xf0) == 0xe0) // top four bits are 1110: a 3-byte sequence
            {
                i += 3;
            }
            else
            {
                return -1; // not a lead byte this function handles
            }
            totalnum += 1;
        }
        if (i > insize || outsize < totalnum) // truncated input or result buffer too small
        {
            return -1;
        }
        // --------------------------------------------------

        int resultsize = 0;
        i = 0;
        while (i < insize)
        {
            if ((p[i] & 0x80) == 0x00) // 1-byte sequence
            {
                out[resultsize] = (wchar_t)p[i];
                i += 1;
            }
            else if ((p[i] & 0xe0) == 0xc0) // 2-byte sequence
            {
                wchar_t t1 = p[i] & 0x1f;     // low five bits of the lead byte (drop the 110 flag)
                wchar_t t2 = p[i + 1] & 0x3f; // low six bits of the next byte (drop the 10 flag)
                out[resultsize] = (wchar_t)((t1 << 6) | t2);
                i += 2;
            }
            else // 3-byte sequence
            {
                wchar_t t1 = p[i] & 0x0f;     // low four bits of the lead byte (drop the 1110 flag)
                wchar_t t2 = p[i + 1] & 0x3f;
                wchar_t t3 = p[i + 2] & 0x3f;
                out[resultsize] = (wchar_t)((t1 << 12) | (t2 << 6) | t3);
                i += 3;
            }
            resultsize += 1;
        }
        /* The terminator is not written. To append one, enable this segment
           (the result buffer must then have room for one more element):
        out[resultsize] = L'\0';
        resultsize += 1;
        */
        return resultsize;
    }
