UTF encoding
UTF-8 encodes UCS in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:
UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000-007f                   0xxxxxxx
0080-07ff                   110xxxxx 10xxxxxx
0800-ffff                   1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of the character "汉" (Han) is 6c49. Because 6c49 lies in the range 0800-ffff, it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6c49 is 0110 110001 001001. Substituting these bits, in order, for the x positions in the template gives 11100110 10110001 10001001, that is, E6 B1 89.
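The template substitution described above can be sketched in PHP. This is a minimal sketch, not code from echo.php: the function name encodeUcs2ToUtf8 is hypothetical, and it assumes the input is a UCS-2 code value in the range 0000-ffff covered by the table.

```php
<?php
// Hypothetical helper (not part of echo.php): encode one UCS-2 code value
// (0x0000-0xFFFF) as UTF-8 bytes, following the three templates in the table.
function encodeUcs2ToUtf8(int $code): string
{
    if ($code < 0x80) {
        // 0000-007f -> 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // 0080-07ff -> 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    // 0800-ffff -> 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($code >> 12))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

// The worked example: 6c49 becomes E6 B1 89.
echo strtoupper(bin2hex(encodeUcs2ToUtf8(0x6C49))), "\n"; // E6B189
```

Shifts and masks are used here instead of string manipulation on the binary digits, since each group of x positions in a template is simply a 4-, 5-, or 6-bit slice of the code value.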
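Finally, the conversion in the other direction, from UTF-8 back to Unicode.
Suppose the UTF-8 encoded character ch is 3 bytes long: XX YY ZZ.
a = XX AND 0x1F
b = YY AND 0x7F
c = ZZ AND 0x7F
(64a + b) * 64 + c = ch (the Unicode code value)
For a well-formed sequence, the narrower masks 0x0F (for XX) and 0x3F (for YY and ZZ) give the same result, because the extra bits kept by 0x1F and 0x7F are always 0.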
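The three masking steps and the final formula can be tried out directly. The following is a minimal sketch under stated assumptions: decodeThreeByteUtf8 is a hypothetical name, not a function from echo.php, and the input is assumed to be a well-formed 3-byte sequence (the masks 0x1F and 0x7F are only safe in that case).

```php
<?php
// Hypothetical helper (not part of echo.php): recover the Unicode code value
// from a 3-byte UTF-8 sequence XX YY ZZ using the a/b/c steps above.
function decodeThreeByteUtf8(string $bytes): int
{
    if (strlen($bytes) !== 3) {
        throw new InvalidArgumentException("expected exactly 3 bytes");
    }
    $a = ord($bytes[0]) & 0x1F; // data bits of the lead byte 1110xxxx
    $b = ord($bytes[1]) & 0x7F; // data bits of the continuation byte 10xxxxxx
    $c = ord($bytes[2]) & 0x7F; // data bits of the continuation byte 10xxxxxx
    return (64 * $a + $b) * 64 + $c;
}

// E6 B1 89 is the UTF-8 form of 6c49 from the earlier example.
echo dechex(decodeThreeByteUtf8("\xE6\xB1\x89")), "\n"; // 6c49
```

Multiplying by 64 is the same as shifting left by 6 bits, which is why each continuation byte contributes exactly its six data bits.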
echo.php is nothing special; it is just a few functions.
<?php
// Write the Unicode (UCS-2) file
$ucs2data = Utf8tounicode($data, "little");
// $endian = chr(0xFE) . chr(0xFF); // file header for big endian
$endian = chr(0xFF) . chr(0xFE);    // file header for little endian
$rt = file_put_contents("Ucs2.txt", $endian . $ucs2data);
// 19:32 Utf8tounicode function OK.
// 20:09 Found the little endian / big endian problem and resolved it.
// A Unicode string stored as big endian is not recognized by UE or EditPlus;
// only Notepad recognizes it correctly.
// $rt = file_put_contents("Usc2ys_data.txt", $ucs2_ysdata);
// Write the UTF-8 file
$utf8data = UnicodeToUtf8($ucs2data); // 20:52 Converting the string back to UTF-8 OK.
$rt = file_put_contents("Utf8.txt", $utf8data);
echo(urlencode($utf8data)); echo("");
$esc = Utf8escape($data);
echot($esc);
$esc = Phpescape($data);
echot($esc);
$unesc = Phpunescape($esc);
echot($unesc);
/**
 * This function converts a UTF-8 encoded string to a Unicode (UCS-2) encoded string.
 * Parameter str: the UTF-8 encoded string.
 * Parameter order: the byte order of the stored data, big endian or little endian. The default Unicode storage order is little.
 * For example, the Unicode code of "大" is 5927. Little endian stores it as 27 59; big endian keeps the order of the code: 59 27.
 * A file in little endian format must begin with FF FE, and a file in big endian format with FE FF. Otherwise serious confusion will result.
 * This function only converts the characters and is not responsible for adding that header.
 * Strings converted by iconv are stored as big endian.
 * Returns ucs2string, the converted string.
 * Thanks to "nagging" (xuzuning).
 */
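function Utf8tounicode($str, $order = "little")
{
    // Body reconstructed from the surviving fragments of the listing.
    $ucs2string = "";
    $n = strlen($str);
    for ($i = 0; $i < $n; $i++) {
        $ord = ord($str[$i]);
        if ($ord <= 0x7F) { // 0xxxxxxx
            $ucsCode = $ord;
        } elseif ($ord >= 0xC0 && $ord < 0xE0 && $i + 1 < $n && ord($str[$i + 1]) >= 0x80) { // 110xxxxx 10xxxxxx
            $a = (ord($str[$i]) & 0x3F) << 6;
            $b = ord($str[$i + 1]) & 0x3F;
            $ucsCode = $a + $b;
            $i++;
        } elseif ($ord >= 0xE0 && $ord < 0xF0 && $i + 2 < $n && ord($str[$i + 1]) >= 0x80 && ord($str[$i + 2]) >= 0x80) { // 1110xxxx 10xxxxxx 10xxxxxx
            $a = (ord($str[$i]) & 0x1F) << 12;
            $b = (ord($str[$i + 1]) & 0x3F) << 6;
            $c = ord($str[$i + 2]) & 0x3F;
            $ucsCode = $a + $b + $c;
            $i += 2;
        } else {
            continue; // skip bytes that do not start a 1- to 3-byte sequence
        }
        $h = chr($ucsCode >> 8);   // high byte
        $l = chr($ucsCode & 0xFF); // low byte
        $ucs2string .= ($order == "little") ? $l . $h : $h . $l;
    }
    return $ucs2string;
}
/**
 * This function converts a Unicode (UCS-2) encoded string to a UTF-8 encoded string.
 * Parameter str: the Unicode encoded string.
 * Parameter order: the storage order of the Unicode string, big endian or little endian.
 * Returns utf8string, the converted string.
 *
 */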
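function UnicodeToUtf8($str, $order = "little")
{
    // Loop body reconstructed from the surviving fragments of the listing.
    $utf8string = "";
    $n = strlen($str);
    for ($i = 0; $i + 1 < $n; $i++) {
        $val = ($order == "little")
            ? ord($str[$i + 1]) * 256 + ord($str[$i]) // swap the two bytes back
            : ord($str[$i]) * 256 + ord($str[$i + 1]);
        $i++; // a Unicode character is represented by two bytes
        $c = "";
        if ($val < 0x80) { // 0000-007f
            $c = chr($val);
        } elseif ($val < 0x800) { // 0080-07ff
            $c = chr(0xC0 | ($val >> 6)) . chr(0x80 | ($val & 0x3F));
        } else { // 0800-ffff
            $c = chr(0xE0 | ($val >> 12)) . chr(0x80 | (($val >> 6) & 0x3F)) . chr(0x80 | ($val & 0x3F));
        }
        $utf8string .= $c;
    }
    return $utf8string;
} // end func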
/*
 * Encode a UTF-8 string into the %uXXXX Unicode pattern, equivalent to escape.
 * Only UTF-8 input is accepted, because only UTF-8 and Unicode can be converted into each other by formula; other encodings need a code table lookup.
 * I am not sure whether the regular expression finds the UTF-8 codes exactly right. Still puzzling over it.
 * Utf2ucs is called to compute the code value of each character, which is inefficient. However, the code stays clear; if that computation were embedded here,
 * the code would not be easy to read.
 */
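function Utf8escape($str) {
    // Lead bytes c0-df start a 2-byte sequence, e0-ef a 3-byte sequence.
    preg_match_all("/[\xc0-\xdf].|[\xe0-\xef]..|[\x01-\x7f]+/", $str, $r);
    // prt($r);
    $ar = $r[0];
    foreach ($ar as $k => $v) {
        $ord = ord($v[0]);
        if ($ord >= 0xC0 && $ord <= 0xDF) { // 2-byte UTF-8 code
            // pad to 4 hex digits, as the %uXXXX form requires
            $ar[$k] = "%u" . str_pad(Utf2ucs($v), 4, "0", STR_PAD_LEFT);
        }
        elseif ($ord >= 0xE0 && $ord <= 0xEF) { // 3-byte UTF-8 code
            $ar[$k] = "%u" . str_pad(Utf2ucs($v), 4, "0", STR_PAD_LEFT);
        }
    } // foreach
    return join("", $ar);
}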
/**
 *
 * Convert a UTF-8 encoded character to its UCS-2 code.
 * Parameter: the UTF-8 encoded character.
 * Returns the Unicode code value of the character as a hexadecimal string. Once the code value is known, the character itself can be produced from it.
 *
 * Principle: the Unicode to UTF-8 algorithm ORs the fixed header bits onto the data bits.
 * This function is the inverse of that process: it ANDs each byte with the complement of the header bits.
 */
function Utf2ucs($str) {
    $n = strlen($str);
    $ucsCode = 0;
    if ($n == 3) {
        $highCode = ord($str[0]);
        $midCode = ord($str[1]);
        $lowCode = ord($str[2]);
        $a = 0x1F & $highCode;
        $b = 0x7F & $midCode;
        $c = 0x7F & $lowCode;
        $ucsCode = (64 * $a + $b) * 64 + $c;
    }
    elseif ($n == 2) {
        $highCode = ord($str[0]);
        $lowCode = ord($str[1]);
        $a = 0x3F & $highCode; // 0x3F is the complement of 0xC0
        $b = 0x7F & $lowCode;  // 0x7F is the complement of 0x80
        $ucsCode = 64 * $a + $b;
    }
    elseif ($n == 1) {
        $ucsCode = ord($str);
    }
    return dechex($ucsCode);
}
/*
 * Purpose: this function decodes strings that were encoded by JavaScript's escape function.
 * The key part is the regular expression lookup; I am not sure whether it has problems.
 * Parameter: the JavaScript-encoded string.
 * For example: Phpunescape("%u5927") = "大"
 * 2005-12-10
 *
 */
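function Phpunescape($escstr) {
    preg_match_all("/%u[0-9A-Fa-f]{4}|%[0-9A-Fa-f]{2}|[^%]+/", $escstr, $matches); // prt($matches);
    $ar = &$matches[0];
    $c = "";
    foreach ($ar as $val) {
        if (substr($val, 0, 1) != "%") { // text left unencoded: letters, digits, and + - _ .
            $c .= $val;
        }
        elseif (substr($val, 1, 1) != "u") { // %XX: an ASCII code other than letters, digits, and + - _ .
            $x = hexdec(substr($val, 1, 2));
            $c .= chr($x);
        }
        else { // %uXXXX: a code greater than 0xFF
            // The rest of this branch is reconstructed; the listing is cut off here.
            $val = intval(substr($val, 2), 16);
            if ($val < 0x80) { // 0000-007f
                $c .= chr($val);
            } elseif ($val < 0x800) { // 0080-07ff
                $c .= chr(0xC0 | ($val >> 6)) . chr(0x80 | ($val & 0x3F));
            } else { // 0800-ffff
                $c .= chr(0xE0 | ($val >> 12)) . chr(0x80 | (($val >> 6) & 0x3F)) . chr(0x80 | ($val & 0x3F));
            }
        }
    } // foreach
    return $c;
}
// Phpescape: only the tail of this function survives in the listing.
// ... "%u" . bin2hex(iconv('GBK', "UCS-2", $chars[$i] . $chars[$i + 1]));
// $i++;
// }
// } // foreach
// return $ar;
// }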
?>
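The byte-order notes in the Utf8tounicode comment (code 5927 stored as 27 59 or 59 27, and the FF FE or FE FF file header) can be illustrated with PHP's built-in pack(). This is a minimal sketch, not part of echo.php: packUcs2 is a hypothetical name introduced here for illustration.

```php
<?php
// Hypothetical helper (not part of echo.php): pack one UCS-2 code value
// into two bytes in the requested byte order.
function packUcs2(int $code, string $order = "little"): string
{
    // pack() format "v" is unsigned 16-bit little endian, "n" is big endian.
    return pack($order === "little" ? "v" : "n", $code);
}

echo bin2hex(packUcs2(0x5927, "little")), "\n"; // 2759
echo bin2hex(packUcs2(0x5927, "big")), "\n";    // 5927

// A little endian file starts with FF FE, a big endian file with FE FF.
$littleFile = "\xFF\xFE" . packUcs2(0x5927, "little");
$bigFile    = "\xFE\xFF" . packUcs2(0x5927, "big");
echo bin2hex($littleFile), "\n"; // fffe2759
echo bin2hex($bigFile), "\n";    // feff5927
```

The header itself is the code FEFF written in the file's own byte order, which is how a reader can tell the two layouts apart.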