I. Encoding principles and implementation
Unicode is the common ground for converting between UTF-8 and the GB family of encodings (GB2312, GBK, GB18030). A direct UTF-8-to-GB lookup table is possible, but hardly anyone builds one, because UTF-8 is variable-length and therefore awkward to use as a table key. The usual approach is a Unicode-to-GB mapping table. Unicode (treated here as UCS-2) defines the code points, and UTF-8 is just one way of serializing them. The two correspond as follows:
Unicode code point range | UTF-8 byte sequence
U+0000 - U+007F | 0xxxxxxx
U+0080 - U+07FF | 110xxxxx 10xxxxxx
U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx
Because the characters considered here all fall within UCS-2, the 4- to 6-byte UTF-8 forms can be ignored. Likewise, in the reverse conversion, any UTF-8 sequence of four or more bytes can simply be treated as garbled, or converted to a numeric HTML entity (the "&#number;" form) and left to the browser or parser to handle. The following PHP function converts a Unicode code point to UTF-8:
/*
 * $c is the Unicode code point as an int. If the value was read as raw
 * binary data, convert it first with hexdec(bin2hex($bin_unichar)).
 */
function uni2utf8($c)
{
    if ($c < 0x80) {
        $utf8char = chr($c);
    } elseif ($c < 0x800) {
        $utf8char = chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    } elseif ($c < 0x10000) {
        $utf8char = chr(0xE0 | ($c >> 12))
            . chr(0x80 | (($c >> 6) & 0x3F))
            . chr(0x80 | ($c & 0x3F));
    } else {
        // UCS-2 is only two bytes wide, so this branch cannot be reached;
        // it is here only to show the numeric HTML entity form.
        $utf8char = "&#{$c};";
    }
    return $utf8char;
}
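The reverse conversion mentioned above can be sketched in the same style. The function name `utf8_to_codepoint` is hypothetical and not part of the original article. It covers only the 1- to 3-byte forms from the table, returns null for anything else, and makes no attempt to reject overlong encodings.

```php
<?php
// Hypothetical helper (not from the original article): decode the first
// UTF-8 character of $s into its Unicode code point.
// Returns null for 4+ byte sequences, stray continuation bytes,
// or truncated input.
function utf8_to_codepoint($s)
{
    if ($s === '') {
        return null;
    }
    $len = strlen($s);
    $b0 = ord($s[0]);

    if ($b0 < 0x80) {                              // 0xxxxxxx
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0 && $len >= 2) {      // 110xxxxx 10xxxxxx
        $b1 = ord($s[1]);
        if (($b1 & 0xC0) !== 0x80) {
            return null;
        }
        return (($b0 & 0x1F) << 6) | ($b1 & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0 && $len >= 3) {      // 1110xxxx 10xxxxxx 10xxxxxx
        $b1 = ord($s[1]);
        $b2 = ord($s[2]);
        if (($b1 & 0xC0) !== 0x80 || ($b2 & 0xC0) !== 0x80) {
            return null;
        }
        return (($b0 & 0x0F) << 12) | (($b1 & 0x3F) << 6) | ($b2 & 0x3F);
    }
    return null;
}
```

A caller that receives null could then fall back to treating the bytes as garbled or to emitting an entity, as described above.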
In this context, the UTF-8 character set can be treated as equal to Unicode (UCS-2). In principle, though, the main character sets nest as follows (strictly speaking, GB18030 also maps code points beyond UCS-2, so this chain is a simplification):
utf-8 > unicode(UCS-2) > gb18030 > gbk > gb2312
Therefore, provided the input is correctly encoded, a conversion in this direction:
gb2312 => gbk => gb18030 => unicode(UCS-2) => utf-8
is essentially lossless. The opposite direction:
utf-8 => unicode(UCS-2) => gb18030 => gbk => gb2312
is likely to hit characters that the target encoding cannot represent. If a system uses UTF-8, avoid converting back to the GB encodings wherever possible.
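One way to see the asymmetry is to test whether a string survives a round trip through the narrower encoding. The sketch below assumes the iconv extension is available with GB2312 support. The helper name `survives_round_trip` is made up for illustration.

```php
<?php
// Hypothetical helper: true if $utf8 can be converted to $charset and back
// without loss, false if any character cannot be represented.
function survives_round_trip($utf8, $charset)
{
    // iconv() returns false (and raises a notice) when a character has no
    // representation in the target charset; @ silences the notice.
    $encoded = @iconv('UTF-8', $charset, $utf8);
    if ($encoded === false) {
        return false;
    }
    $decoded = @iconv($charset, 'UTF-8', $encoded);
    return $decoded === $utf8;
}

// Common Chinese text is covered by GB2312, so this should pass.
var_dump(survives_round_trip('中文', 'GB2312'));
// The euro sign is not part of GB2312, so this is expected to fail.
var_dump(survives_round_trip('中文 €', 'GB2312'));
```

Running a check like this before a downward conversion makes the loss visible instead of silently producing garbled output.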
II. Another way to convert Unicode to UTF-8 in PHP:
function unescape($str)
{
    $str = rawurldecode($str);
    // Split into %uXXXX escapes, hex entities, decimal entities and single
    // bytes (U = ungreedy; s lets "." match newlines so they are not dropped).
    preg_match_all('/(?:%u.{4})|&#x.{4};|&#\d+;|.+/Us', $str, $r);
    $ar = $r[0];
    foreach ($ar as $k => $v) {
        if (substr($v, 0, 2) == "%u") {
            // %uXXXX: the last four characters are the hex code point
            $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, -4)));
        } elseif (substr($v, 0, 3) == "&#x") {
            // &#xXXXX; hexadecimal entity
            $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, 3, -1)));
        } elseif (substr($v, 0, 2) == "&#") {
            // &#NNNNN; decimal entity, packed as a big-endian 16-bit value
            $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("n", substr($v, 2, -1)));
        }
    }
    return join("", $ar);
}
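For comparison, much of the same decoding can be delegated to PHP's built-in html_entity_decode(). This is an alternative sketch, not the article's method. The name `decode_escapes` is hypothetical, and unlike unescape() it also decodes named entities such as &amp;amp; and does not call rawurldecode().

```php
<?php
// Hypothetical helper: decode %uXXXX escapes and numeric HTML entities
// into UTF-8 using built-in functions only.
function decode_escapes($str)
{
    // 1. Rewrite %uXXXX as &#xXXXX; so a single decoder handles every form.
    $str = preg_replace('/%u([0-9a-fA-F]{4})/', '&#x$1;', $str);
    // 2. html_entity_decode() converts numeric (and named) entities to UTF-8.
    return html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_escapes('%u4E2D&#x6587;&#20013;abc'), "\n";
```

Here U+4E2D is written once as a %u escape and once as the decimal entity &#20013;, so both forms can be compared in the output.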
UCS-2 is handled differently on Linux servers and on Windows. The following are the unwritten rules for UCS-2 on the two platforms:
1. UCS-2 is not the same as UTF-16. UCS-2 is a fixed-width encoding that uses exactly two bytes per character and covers only U+0000 to U+FFFF. UTF-16 is variable-width: it encodes those same characters identically in two bytes, but represents code points above U+FFFF with a four-byte surrogate pair.
2. On Windows, UCS-2 defaults to UCS-2LE. MultiByteToWideChar (or A2W) produces UCS-2LE. Windows Notepad can also save text as UCS-2BE, which amounts to an extra conversion step.
3. On Linux, UCS-2 defaults to UCS-2BE: iconv with plain "UCS-2" produces UCS-2BE. To convert UCS-2 data that came from Windows, specify UCS-2LE explicitly.
4. Windows and Linux therefore interpret UCS-2 differently (UCS-2LE versus UCS-2BE). Microsoft favors a leading byte order mark (bytes FF FE for UCS-2LE, FE FF for UCS-2BE) that identifies the following text as Unicode and distinguishes big-endian from little-endian. If data from Windows arrives with this prefix, don't panic.
5. Encoded output on Linux, whether written to a file or printed with printf, is displayed correctly only when the console encoding matches it (a mismatch usually traces back to the encoding the program was compiled with), and console input is converted according to the current system encoding. For example, a console set to UTF-8 displays UTF-8 text correctly but not GBK, and a console set to GBK displays GBK. Future systems may handle more conversions automatically, but for now terminals such as PuTTY still need their encoding set correctly to avoid garbled text.
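The byte-order rules above suggest checking for a byte order mark before converting, and always naming the byte order explicitly. A minimal sketch follows, assuming the iconv extension. The function name `ucs2_to_utf8` and its `$defaultOrder` parameter are hypothetical.

```php
<?php
// Hypothetical helper: convert a UCS-2 byte string to UTF-8, honouring a
// leading byte order mark and otherwise using $defaultOrder ('BE' or 'LE').
function ucs2_to_utf8($bytes, $defaultOrder = 'BE')
{
    $order = $defaultOrder;
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {            // little-endian BOM (typical for Windows)
        $order = 'LE';
        $bytes = (string) substr($bytes, 2);
    } elseif ($bom === "\xFE\xFF") {      // big-endian BOM
        $order = 'BE';
        $bytes = (string) substr($bytes, 2);
    }
    // Name the byte order explicitly instead of relying on how the
    // platform interprets plain "UCS-2".
    return iconv('UCS-2' . $order, 'UTF-8', $bytes);
}

// U+4E2D with a little-endian BOM, then the same character as bare big-endian bytes.
echo bin2hex(ucs2_to_utf8("\xFF\xFE\x2D\x4E")), "\n";
echo bin2hex(ucs2_to_utf8("\x4E\x2D")), "\n";
```

Both calls should yield the same UTF-8 bytes, since the BOM in the first input overrides the big-endian default.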
III. A pair of PHP functions for full Unicode encoding and decoding, for reference:
// Encode the content as \uXXXX Unicode escapes
function unicode_encode($name)
{
    // Name the byte order explicitly; plain 'UCS-2' is platform-dependent
    $name = iconv('UTF-8', 'UCS-2BE', $name);
    $len = strlen($name);
    $str = '';
    for ($i = 0; $i < $len - 1; $i = $i + 2) {
        $c = $name[$i];
        $c2 = $name[$i + 1];
        if (ord($c) > 0 || ord($c2) > 0x7F) {
            // Non-ASCII character: zero-pad each byte to two hex digits
            $str .= '\u' . sprintf('%02x%02x', ord($c), ord($c2));
        } else {
            // Plain ASCII is emitted unchanged
            $str .= $c2;
        }
    }
    return $str;
}

// Decode content produced by unicode_encode
function unicode_decode($name)
{
    // Convert each \uXXXX escape back to viewable UTF-8; all other
    // characters (including spaces and punctuation) pass through unchanged.
    return preg_replace_callback(
        '/\\\\u([0-9a-f]{4})/i',
        function ($m) {
            return iconv('UCS-2BE', 'UTF-8', pack('H4', $m[1]));
        },
        $name
    );
}
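PHP's JSON functions use the same \uXXXX escape form, so they can serve as a quick cross-check for the pair above. This is only a sketch under that assumption. The names `json_u_encode` and `json_u_decode` are hypothetical, and json_encode() additionally escapes characters such as double quotes and slashes, so its output is not identical to unicode_encode() for every input.

```php
<?php
// Hypothetical helper: escape non-ASCII characters as \uXXXX via json_encode().
function json_u_encode($utf8)
{
    // json_encode() wraps the string in double quotes; strip them.
    return substr(json_encode($utf8), 1, -1);
}

// Hypothetical helper: reverse of json_u_encode().
function json_u_decode($escaped)
{
    // Re-add the quotes so the input is a valid JSON string literal.
    return json_decode('"' . $escaped . '"');
}

$escaped = json_u_encode('中文abc');
echo $escaped, "\n";
echo json_u_decode($escaped), "\n";
```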
PHP implements Unicode and Utf-8 conversions to and from each other