PHP functions for converting between UTF-8 and Unicode
UTF-8 encodes characters in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000-007F                   0xxxxxxx
0080-07FF                   110xxxxx 10xxxxxx
0800-FFFF                   1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of the Chinese character 汉 is 6C49. 6C49 lies in the range 0800-FFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary and grouped to match the template (4 + 6 + 6 bits), 6C49 is 0110 110001 001001. Substituting these bits for the x positions in the template gives 11100110 10110001 10001001, that is, E6 B1 89.
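The template substitution described above can be sketched in PHP as follows. This is only an illustration and not part of the original script: the function name encodeCodePoint is hypothetical, and it assumes the input is a code value in the UCS-2 range 0000-FFFF.

```php
<?php
// Hypothetical helper (not from the original script): encode one UCS-2
// code value (0x0000-0xFFFF) as UTF-8 bytes using the three templates.
function encodeCodePoint(int $code): string
{
    if ($code <= 0x7F) {            // 0xxxxxxx
        return chr($code);
    }
    if ($code <= 0x7FF) {           // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    // 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($code >> 12))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

// 0x6C49 lies in 0800-FFFF, so the 3-byte template applies.
echo strtoupper(bin2hex(encodeCodePoint(0x6C49))), "\n"; // E6B189
```

The shifts pick out the 4 + 6 + 6 bit groups, and the OR with 0xE0 or 0x80 adds the fixed header bits of each template byte.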
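Finally, converting from UTF-8 back to Unicode.
Suppose the UTF-8 encoded character ch is 3 bytes long: xx yy zz.
Compute xx AND 1F to obtain a.
Compute yy AND 7F to obtain b.
Compute zz AND 7F to obtain c.
Then (64a + b) * 64 + c = ch (the Unicode code value).
For well-formed input these masks give the same result as the exact payload masks 0F (lead byte) and 3F (continuation bytes), because the extra bits they keep are always 0.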
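The reverse calculation can be sketched in PHP as well. The function name decodeThreeBytes is hypothetical, and the sketch assumes a well-formed 3-byte sequence. It uses the exact payload masks 0x0F and 0x3F, which for well-formed input give the same result as the 1F and 7F masks in the steps above.

```php
<?php
// Hypothetical helper (not from the original script): decode one
// well-formed 3-byte UTF-8 sequence into its Unicode code value.
function decodeThreeBytes(string $ch): int
{
    $a = ord($ch[0]) & 0x0F; // payload of the lead byte 1110xxxx
    $b = ord($ch[1]) & 0x3F; // payload of the first continuation byte 10xxxxxx
    $c = ord($ch[2]) & 0x3F; // payload of the second continuation byte 10xxxxxx
    return (64 * $a + $b) * 64 + $c;
}

// E6 B1 89 is the UTF-8 form worked out in the encoding example.
echo strtoupper(dechex(decodeThreeBytes("\xE6\xB1\x89"))), "\n"; // 6C49
```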
echo.php contains nothing special, just a few functions. It begins with some test code:
<?php
// $data holds the UTF-8 encoded input string.
// Write the unicode file
$ucs2data = utf8ToUnicode($data, "little");
// $endian = chr(0xFE) . chr(0xFF); // byte order mark for big endian
$endian = chr(0xFF) . chr(0xFE);    // byte order mark for little endian
$rt = file_put_contents("ucs2.txt", $endian . $ucs2data);
// 19:32. utf8ToUnicode works.
// 20:09. Found and solved a little endian vs. big endian problem.
// A unicode string stored in big endian mode is recognized by neither UE nor EditPlus;
// only Notepad recognizes it correctly.
$rt = file_put_contents("usc2ys_data.txt", $ucs2_ysdata);
// Write the utf8 file
$utf8data = unicodeToUtf8($ucs2data); // 20:52. Converting the string back to utf8 works.
$rt = file_put_contents("utf8.txt", $utf8data);
echo(urlencode($utf8data)); echo("");
$esc = utf8Escape($data);
echot($esc);
$esc = phpEscape($data);
echot($esc);
$unesc = phpUnescape($esc);
echot($unesc);
/**
* Converts a utf8 encoded string into a unicode (UCS-2) encoded string.
* $str: a UTF-8 encoded string.
* $order: whether the data is stored as big endian or little endian. The default unicode storage order is little.
* For example, the unicode code of "大" is 5927. Stored in little mode it becomes 27 59; in big mode the order is unchanged: 59 27.
* A file stored in little format must begin with FF FE, and a big file with FE FF. Otherwise serious confusion results.
* This function only converts characters and does not add the header.
* Strings converted with iconv are stored in big endian.
* Returns $ucs2string, the converted string.
* (Xuzuning)
*/
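function utf8ToUnicode($str, $order = "little")
{
    $ucs2string = "";
    $n = strlen($str);
    for ($i = 0; $i < $n; $i++) {
        $ord = ord($str[$i]);
        if ($ord <= 0x7F) { // 0xxxxxxx
            $ucsCode = $ord;
        } elseif ($ord >= 0xC0 && $ord < 0xE0 && $i + 1 < $n) { // 110xxxxx 10xxxxxx
            $ucsCode = (($ord & 0x3F) << 6) | (ord($str[$i + 1]) & 0x3F);
            $i += 1;
        } elseif ($ord >= 0xE0 && $ord < 0xF0 && $i + 2 < $n) { // 1110xxxx 10xxxxxx 10xxxxxx
            $ucsCode = (($ord & 0x1F) << 12) | ((ord($str[$i + 1]) & 0x3F) << 6) | (ord($str[$i + 2]) & 0x3F);
            $i += 2;
        } else {
            continue; // skip bytes that cannot be stored in UCS-2
        }
        // Emit the two bytes in the requested order.
        $h = chr($ucsCode >> 8);
        $l = chr($ucsCode & 0xFF);
        $ucs2string .= ($order == "little") ? $l . $h : $h . $l;
    }
    return $ucs2string;
} // End func
/**
* Converts a unicode encoded string into a utf8 encoded string.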
* $str: a unicode-encoded string.
* $order: the byte order in which the unicode string is stored, either big endian or little endian.
* Returns $utf8string, the converted string.
*
*/
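function unicodeToUtf8($str, $order = "little")
{
    $utf8string = "";
    $n = strlen($str);
    for ($i = 0; $i + 1 < $n; $i++) {
        // Combine the byte pair into one code value according to the byte order.
        $hi = ($order == "little") ? $str[$i + 1] : $str[$i];
        $lo = ($order == "little") ? $str[$i] : $str[$i + 1];
        $val = (ord($hi) << 8) | ord($lo);
        $i++; // Two bytes represent one unicode character.
        $c = "";
        if ($val <= 0x7F) { // 0000-007F
            $c .= chr($val);
        } elseif ($val < 0x800) { // 0080-07FF
            $c .= chr(0xC0 | ($val >> 6)) . chr(0x80 | ($val & 0x3F));
        } else { // 0800-FFFF
            $c .= chr(0xE0 | ($val >> 12)) . chr(0x80 | (($val >> 6) & 0x3F)) . chr(0x80 | ($val & 0x3F));
        }
        $utf8string .= $c;
    }
    return $utf8string;
} // End func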
/*
* Encodes a UTF-8 encoded string in the %uXXXX form, equivalent to the javascript escape function.
* Only utf8 input is accepted, because only utf8 can be converted to unicode by formula; other encodings need a code table lookup.
* I am not sure whether the regular expression for utf8 codes is completely correct.
* utf2ucs is called to calculate the code value of each character, which is inefficient. However, it keeps the code clear;
* embedding the calculation here would make the code hard to read.
*/
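function utf8Escape($str) {
    // Match 2-byte sequences, 3-byte sequences, or runs of ASCII.
    preg_match_all("/[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\x01-\x7F]+/", $str, $r);
    // prt($r);
    $ar = $r[0];
    foreach ($ar as $k => $v) {
        $ord = ord($v[0]);
        if ($ord <= 0x7F) { // ASCII run; percent-encoding it is an assumption, this branch did not survive in the listing
            $ar[$k] = rawurlencode($v);
        }
        elseif ($ord < 0xE0) { // 2-byte utf8 code
            $ar[$k] = "%u" . str_pad(utf2ucs($v), 4, "0", STR_PAD_LEFT);
        }
        elseif ($ord < 0xF0) { // 3-byte utf8 code
            $ar[$k] = "%u" . str_pad(utf2ucs($v), 4, "0", STR_PAD_LEFT);
        }
    } // foreach
    return join("", $ar);
}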
/**
*
* Converts a utf8 encoded character into its UCS-2 code.
* $str: a single UTF-8 encoded character.
* Returns the unicode code value of the character as a hexadecimal string. Once the code value is known, the character can be recovered from it.
*
* Principle: converting unicode to UTF-8 ORs a fixed header onto each byte.
* This function is the inverse of that process: the fixed header is removed with a bitwise AND.
*/
function utf2ucs($str) {
    $n = strlen($str);
    $ucsCode = 0;
    if ($n == 3) {
        $highCode = ord($str[0]);
        $midCode = ord($str[1]);
        $lowCode = ord($str[2]);
        $a = 0x1F & $highCode;
        $b = 0x7F & $midCode;
        $c = 0x7F & $lowCode;
        $ucsCode = (64 * $a + $b) * 64 + $c;
    }
    elseif ($n == 2) {
        $highCode = ord($str[0]);
        $lowCode = ord($str[1]);
        $a = 0x3F & $highCode; // 0x3F is the complement of 0xC0
        $b = 0x7F & $lowCode; // 0x7F is the complement of 0x80
        $ucsCode = 64 * $a + $b;
    }
    elseif ($n == 1) {
        $ucsCode = ord($str);
    }
    return dechex($ucsCode);
}
/*
* Usage: this function decodes strings encoded by the javascript escape function.
* I am not sure whether the key regular expression is correct.
* Parameter: a javascript-escaped string.
* For example: phpUnescape("%u5927") = "大"
* 2005-12-10
*
*/
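function phpUnescape($escstr) {
    // Tokens: %uXXXX, %XX, or a run of literal characters.
    preg_match_all("/%u[0-9A-Fa-f]{4}|%[0-9A-Fa-f]{2}|[^%]+/", $escstr, $matches); // prt($matches);
    $ar = $matches[0];
    $c = "";
    foreach ($ar as $val) {
        if (substr($val, 0, 1) != "%") { // literal ascii: letters, digits, + - _ . and so on
            $c .= $val;
        }
        elseif (substr($val, 1, 1) != "u") { // %XX: a single-byte ascii code
            $x = hexdec(substr($val, 1, 2));
            $c .= chr($x);
        }
        else { // %uXXXX: a code that may be greater than 0xFF
            $val = intval(substr($val, 2), 16);
            if ($val <= 0x7F) { // 0000-007F
                $c .= chr($val);
            } elseif ($val < 0x800) { // 0080-07FF
                $c .= chr(0xC0 | ($val >> 6)) . chr(0x80 | ($val & 0x3F));
            } else { // 0800-FFFF
                $c .= chr(0xE0 | ($val >> 12)) . chr(0x80 | (($val >> 6) & 0x3F)) . chr(0x80 | ($val & 0x3F));
            }
        }
    } // foreach
    return $c;
}
// Only the tail of phpEscape survives; the loop around the iconv call is a reconstruction (assumption).
// It expects a GBK encoded string and requires the iconv extension.
function phpEscape($str) {
    $chars = $str;
    $ar = "";
    for ($i = 0; $i < strlen($chars); $i++) {
        if (ord($chars[$i]) < 0x80) { // single-byte ascii
            $ar .= rawurlencode($chars[$i]);
        } elseif ($i + 1 < strlen($chars)) { // double-byte GBK character
            $ar .= "%u" . bin2hex(iconv('gbk', "UCS-2BE", $chars[$i] . $chars[$i + 1]));
            $i++;
        }
    } // for
    return $ar;
}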
?>
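The byte-order remarks in the utf8ToUnicode comment can be checked with PHP's built-in pack function. The following is a small sketch that is independent of the script above; the variable names are illustrative only.

```php
<?php
// The code value of "大" is 0x5927, as stated in the utf8ToUnicode comment.
$code = 0x5927;

$little = pack('v', $code); // 'v': unsigned 16-bit, little endian, bytes 27 59
$big    = pack('n', $code); // 'n': unsigned 16-bit, big endian, bytes 59 27

$bomLittle = "\xFF\xFE";    // header for a little endian file
$bomBig    = "\xFE\xFF";    // header for a big endian file

echo bin2hex($bomLittle . $little), "\n"; // fffe2759
echo bin2hex($bomBig . $big), "\n";       // feff5927
```

Comparing these bytes with the output of utf8ToUnicode for the same character is one way to confirm which order a given call produced.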