Unicode is a mapping between characters and index numbers (code points); we write a code point as U+XXXX.
Confused about Unicode and UTF-8? Unicode is a standard character set. UTF-8 is one of its encodings, alongside UCS-2, UCS-4 and so forth, but it has become the standard way of encoding. Note one thing, though: for ASCII (English) characters, the two standards give the same value:
U-00000000 - U-0000007F: 0xxxxxxx
For many people, especially programmers, U-00000000 to U-0000007F is enough for daily use (English letters and some symbols), so to them there is no difference between the character set standard (Unicode) and the encoding standard (UTF-8). When they talk with you about it, you may get confused.
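To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. The class and method names (`CodePointVsBytes`, `utf8Hex`, `codePoint`) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // Formats the UTF-8 bytes of a string as space-separated uppercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats the first code point of a string in U+XXXX notation.
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        String ascii = "A";       // inside U+0000..U+007F
        String latin = "\u00f2";  // outside that range
        System.out.println(codePoint(ascii) + " -> " + utf8Hex(ascii)); // U+0041 -> 41
        System.out.println(codePoint(latin) + " -> " + utf8Hex(latin)); // U+00F2 -> C3 B2
    }
}
```

For the ASCII character the code point and the single UTF-8 byte have the same value, which is why the two notions are easy to mix up; for U+00F2 they clearly differ.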
Why UTF-8? Why not use UCS-4 or UCS-2? Is it because people like the number 8 (in Cantonese, it sounds like "getting rich")? The answer is no. Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes like '\0' or '/' which have a special meaning in filenames and in other C library function parameters.
(An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want a UCS-4 file, we have to insert three 0x00 bytes before every ASCII byte instead.)
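The problem with embedded special bytes can be checked with a small sketch. It assumes UTF-16BE as a stand-in for UCS-2 (the JDK has no separate UCS-2 charset, and the two agree for BMP characters); the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class WideEncodingBytes {
    // Returns true if the byte array contains the given unsigned byte value.
    static boolean containsByte(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+012F is a single non-ASCII letter; its 16-bit form is 01 2F,
        // so the second byte looks exactly like an ASCII '/'.
        String s = "\u012f";
        byte[] wide = s.getBytes(StandardCharsets.UTF_16BE); // 01 2F
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);    // C4 AF
        System.out.println(containsByte(wide, 0x2F)); // true
        System.out.println(containsByte(utf8, 0x2F)); // false

        // A plain ASCII letter gains a leading 0x00 byte in the 16-bit form.
        byte[] a = "A".getBytes(StandardCharsets.UTF_16BE);  // 00 41
        System.out.println(containsByte(a, 0x00));    // true
    }
}
```

A C function that treats 0x00 as a string terminator or 0x2F as a path separator would misread the 16-bit form, while the UTF-8 form contains neither byte.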
In addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. (In UTF-8, U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.)
------------ Prove that ASCII and UTF-8 are the same ------------
package Unicode;

public class Chartest {
    public static void main(String[] args) throws Exception {
        char[] chars = new char[]{'\u007f'};
        String str = new String(chars);
        System.out.println("Within 0000-007F: " + str);
        // For a character whose code point is below U+0080, encoding with
        // ISO-8859-1 or UTF-8 makes no difference. They are compatible.
        System.out.println("UTF-8 -> UTF-8: " + new String(str.getBytes("UTF-8"), "UTF-8"));

        chars = new char[]{'\u00f2'};
        str = new String(chars);
        // The above principle does not apply to characters above U+007F.
        System.out.println("Out of 0000-007F: " + str);
        System.out.println("UTF-8 -> ISO-8859-1: " + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("ISO-8859-1 -> UTF-8: " + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
    }
}
How long is a UTF-8 encoded character? Theoretically, it can be up to 6 bytes, but in practice 3 is enough for us, since BMP characters are never longer than 3 bytes. (The most commonly used characters, including all those found in major older encoding standards, have been placed in the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP).) Note that the current definition of UTF-8 in RFC 3629 stops at U+10FFFF, so a valid sequence is at most 4 bytes long.
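The lengths can be measured directly with the JDK encoder. This is only a sketch; `Utf8Length` and `utf8Length` are illustrative names, and the sample code points are arbitrary picks from each range.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Number of bytes UTF-8 uses for one code point, measured via the JDK encoder.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One ASCII character, one Latin-1 character, two BMP characters,
        // and one character outside the BMP.
        int[] samples = {0x0041, 0x00F2, 0x2260, 0xFFFD, 0x1F600};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X: %d byte(s)", cp, utf8Length(cp)));
        }
    }
}
```

The BMP samples need at most three bytes; only U+1F600, which lies outside the BMP, needs four.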
Important UTF-8 features:
1. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
2. All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (?? Further investigation is necessary; I can't explain this at the moment.)
4. All possible 2^31 UCS codes can be encoded.
5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.
6. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
------------ Prove the features (1, 2, 3) ------------
package Unicode;
public class Utf8features {
    public static void main(String[] args) throws Exception {
        // Why not write non-ASCII characters directly in the source?
        // Because the result would depend on your system's encoding
        // rather than on UTF-8 as you might imagine.
        char[] chars = new char[]{'\u007f'};
        String str = new String(chars);
        System.out.println("point 1: " + str);
        System.out.println("UTF-8 -> ISO-8859-1: " + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("ISO-8859-1 -> UTF-8: " + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{'\ue840'};
        str = new String(chars);
        System.out.println("point 2: " + str);
        // Just a sample; the same method can be used to verify more characters.
        System.out.println("No less than 7F: " + getHexString(str));

        chars = new char[]{'\u2260'};
        str = new String(chars);
        // Just a sample; the same method can be used to verify more characters.
        System.out.println("point 3: " + str);
        System.out.println("Range of 1st byte: " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuffer sb = new StringBuffer();
        // You must specify the encoding here, otherwise the default
        // encoding is used, which depends on your environment.
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            // Convert the signed byte to its unsigned value before printing it as hex.
            sb.append(Integer.toHexString(bytes[i] >= 0 ? bytes[i] : 256 + bytes[i]).toUpperCase() + " ");
        }
        return sb.toString();
    }
}
---------------------------------------------------------------------------------
Principle of representing a Unicode character using UTF-8:
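As a rough illustration of that principle, the sketch below encodes a code point by hand using the usual bit patterns (0xxxxxxx, 110xxxxx 10xxxxxx, and so on) and compares the result with the JDK's own encoder. It assumes the modern four-byte limit (U+0000 to U+10FFFF) and does not reject surrogates; `Utf8Sketch`, `encode`, `isContinuation` and `nextBoundary` are names invented for this example. The last helper shows what resynchronization means in practice: from any byte offset, skipping bytes in the range 0x80 to 0xBF lands on the start of the next character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Encodes one code point by hand (assumes 0 <= cp <= 0x10FFFF, surrogates not checked).
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};                  // 0xxxxxxx
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                  // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                 // 10xxxxxx
            };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                 // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),         // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                 // 10xxxxxx
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                 // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),        // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),         // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                 // 10xxxxxx
            };
        }
    }

    // True for continuation bytes 0x80..0xBF (bit pattern 10xxxxxx).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Resynchronization: skip continuation bytes until a character start (or the end) is reached.
    static int nextBoundary(byte[] utf8, int from) {
        int i = from;
        while (i < utf8.length && isContinuation(utf8[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x2260);
        byte[] jdk = "\u2260".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, jdk)); // true

        // Bytes of the three characters: 61 E2 89 A0 62
        byte[] text = "a\u2260b".getBytes(StandardCharsets.UTF_8);
        // Offset 2 is in the middle of the three-byte character.
        System.out.println(nextBoundary(text, 2)); // 4
    }
}
```

Because a lead byte and a continuation byte can never be confused, a reader that starts in the middle of a stream, or after a lost byte, loses at most one character before it is back in step.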