Bytes and Unicode

Source: Internet
Author: User

I. Byte and Unicode

The core of Java is Unicode, even in class files, but many media, including files and streams, are stored as byte streams. So Java has to convert these bytes. A char is Unicode, and a byte is just a byte. The byte/char conversion functions in Java live in the sun.io package. There, the ByteToCharConverter class acts as the dispatcher: you use it to obtain the converter you need. Its two most commonly used static functions are:

public static ByteToCharConverter getDefault();
public static ByteToCharConverter getConverter(String encoding);



If you do not specify a converter, the system automatically uses the current platform's encoding: GBK on a GB (Chinese) platform and 8859_1 on an English platform.

byte --> char:
The GB code for "你" ("you") is 0xc4e3; its Unicode value is 0x4f60.
String encoding = "gb2312";
byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
char c[] = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}
What is the result? 0x4f60
If encoding = "8859_1", what is the result? 0x00c4, 0x00e3



If the code changes to:

byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getDefault();
char c[] = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}



What will the result be?

This depends on the encoding of the platform.
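
The sun.io converters are internal classes and are not available in current JDKs. As a rough sketch, the same byte --> char conversion can be written with the public java.nio.charset API. The class and method names here (DecodeDemo, decodeToHex) are made up for illustration, and the sketch assumes the running JVM provides the GB2312 charset, which is not guaranteed on every runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class DecodeDemo {
    // Decode raw bytes with the named charset and return each char as hex,
    // separated by spaces. (Hypothetical helper, not part of any library.)
    static String decodeToHex(byte[] bytes, String charsetName) {
        CharBuffer chars = Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes));
        StringBuilder sb = new StringBuilder();
        while (chars.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(chars.get()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        // Each byte becomes one char under ISO-8859-1.
        System.out.println(decodeToHex(b, "ISO-8859-1")); // c4 e3
        // GB2312 is an optional charset, so check before using it.
        if (Charset.isSupported("GB2312")) {
            System.out.println(decodeToHex(b, "GB2312")); // 4f60
        }
    }
}
```

Passing the charset name explicitly, instead of relying on a default, is what makes the result independent of the platform.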

char --> byte:
String encoding = "gb2312";
char c[] = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
byte b[] = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    // Mask with 0xff so that a negative byte is not sign-extended.
    System.out.println(Integer.toHexString(b[i] & 0xff));
}
What is the result? 0xc4, 0xe3
If encoding = "8859_1", what is the result? 0x3f (that is, "?")
If the code changes to:
char c[] = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getDefault();
byte b[] = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    // Mask with 0xff so that a negative byte is not sign-extended.
    System.out.println(Integer.toHexString(b[i] & 0xff));
}

What will the result be? It depends on the platform's encoding.

Many Chinese-text problems stem from these two simple classes. However, many classes do not accept an encoding argument directly, which is inconvenient. Many programs also rarely specify an encoding and just use the default one, which makes porting them difficult.
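
To make the char --> byte direction concrete, here is a small sketch that always names the encoding instead of using the platform default. EncodeDemo and its methods are hypothetical names, and GB2312 is assumed to be installed in the JVM (it is checked at run time because it is an optional charset).

```java
import java.nio.charset.Charset;

class EncodeDemo {
    // Format bytes as two-digit hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", x & 0xff));
        }
        return sb.toString();
    }

    // Encode the text with an explicitly named charset.
    static String encodeToHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4f60";
        // ISO-8859-1 cannot represent U+4F60, so "?" (0x3f) is substituted.
        System.out.println(encodeToHex(s, "ISO-8859-1")); // 3f
        System.out.println(encodeToHex(s, "UTF-8"));      // e4 bd a0
        if (Charset.isSupported("GB2312")) {
            System.out.println(encodeToHex(s, "GB2312")); // c4 e3
        }
    }
}
```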

II. UTF-8

UTF-8 corresponds one-to-one with Unicode, and its implementation is simple:

7-bit Unicode:  0 _ _ _ _ _ _ _
11-bit Unicode: 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _
16-bit Unicode: 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _
21-bit Unicode: 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _



In most cases, only Unicode with 16 bits or less is used:

The GB code for "你" ("you") is 0xc4e3; its Unicode value is 0x4f60.
0xc4e3 in binary:
1100, 0100, 1110, 0011

Since we have only two bytes, we try to read them as a two-byte UTF-8 sequence, but this does not work: the second byte would have to match 1 0 _ _ _ _ _ _, and its 7th bit is not 0. So "?" is returned.

0x4f60 in binary:
0100, 1111, 0110, 0000
Filling these bits into the three-byte UTF-8 pattern gives:
1110, 0100, 1011, 1101, 1010, 0000
E4--BD--A0
So the result is 0xe4, 0xbd, 0xa0.
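
The bit-filling step can be sketched by hand as follows. This is an illustration only (Utf8Sketch is a made-up name): it does not validate its input, for example it does not reject surrogate values or code points above 0x10FFFF, and real code would normally use String.getBytes with UTF-8 instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encode one Unicode code point by filling its bits into the UTF-8 patterns.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // up to 7 bits: 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // up to 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4F60);
        byte[] library = "\u4f60".getBytes(StandardCharsets.UTF_8);
        // Both arrays hold the bytes E4 BD A0.
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```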



III. String and byte[]

At its core a String is a char[], but converting a byte[] into a String requires an encoding. String.length() is actually the length of the char array, so if a different encoding is used, the bytes are likely to be split at the wrong boundaries, producing broken and garbled characters. For example:

String encoding = ""; // set to "8859_1" or "gb2312"
byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
String str = new String(b, encoding);



If encoding is 8859_1, the string has two characters, but with gb2312 it has only one. This problem occurs frequently when paginating text.
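
A minimal sketch of how the character count changes with the encoding. LengthDemo and charCount are hypothetical names; the GB2312 case is only run if the JVM supports that charset.

```java
import java.nio.charset.Charset;

class LengthDemo {
    // Number of chars produced when the same bytes are decoded with the given charset.
    static int charCount(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).length();
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(charCount(gb, "ISO-8859-1")); // 2
        if (Charset.isSupported("GB2312")) {
            System.out.println(charCount(gb, "GB2312")); // 1
        }
        // The same effect with UTF-8 bytes of U+4F60.
        byte[] utf8 = {(byte) 0xE4, (byte) 0xBD, (byte) 0xA0};
        System.out.println(charCount(utf8, "UTF-8"));      // 1
        System.out.println(charCount(utf8, "ISO-8859-1")); // 3
    }
}
```

Code that pages through text by counting characters therefore has to count after decoding with the correct encoding, not on the raw bytes.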

IV. Reader, Writer / InputStream, OutputStream

The core of Reader and Writer is char; the core of InputStream and OutputStream is byte. The main purpose of Reader and Writer, however, is to read chars from an InputStream and write chars to an OutputStream. For example:

The file text.txt contains only the character "你", stored as the bytes 0xc4, 0xe3.
String encoding = "gb2312";
InputStreamReader reader = new InputStreamReader(new FileInputStream(
        "text.txt"), encoding);
char c[] = new char[10];
int length = reader.read(c);
for (int i = 0; i < length; i++) {
    System.out.println(c[i]);
}



What is the result? "你". If encoding = "8859_1", what is the result? "??": two characters, which means the text was not recognized. Try the reverse example yourself.
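
The same reading step can be tried without a file by wrapping the bytes in a ByteArrayInputStream. This is only a sketch: ReaderDemo and readAll are made-up names, and the GB2312 case assumes the JVM provides that charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ReaderDemo {
    // Read all chars from the byte data, decoding with the named charset.
    static String readAll(byte[] data, String charsetName) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(charsetName))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[10];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC4, (byte) 0xE3};
        // Two chars, U+00C4 and U+00E3, when read as ISO-8859-1.
        System.out.println(readAll(data, "ISO-8859-1").length()); // 2
        if (Charset.isSupported("GB2312")) {
            // One char, U+4F60, when read as GB2312.
            System.out.println(readAll(data, "GB2312").length()); // 1
        }
    }
}
```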
V. Understanding the Java compiler:

javac -encoding



We often do not use the encoding parameter, but it is important for cross-platform work. If encoding is not specified, the system's default encoding is used: gb2312 on a GB (Chinese) platform and ISO8859_1 on an English platform. The Java compiler actually calls the sun.tools.javac.Main class to compile files. The compile function of this class has an encoding variable, and the -encoding argument is passed directly to that variable. The compiler reads the Java source file according to this variable and then writes it into the class file in UTF-8 form. Example code:

String str = "你";
FileWriter writer = new FileWriter("text.txt");
writer.write(str);
writer.close();

If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.
If you compile with 8859_1, the source bytes are read as the two characters 00C4 00E3, in binary:
0000,0000, 1100,0100, 0000,0000, 1110,0011
Because each character needs more than 7 bits, it is encoded in the 11-bit form:
1100,0011, 1000,0100, 1100,0011, 1010,0011
C3--84--C3--A3
You will find C3 84 C3 A3 in the class file.
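
What the compiler does with the string literal can be imitated in a few lines: decode the source bytes with the chosen encoding, then encode the resulting chars as UTF-8. This is a sketch under assumptions, not the compiler's actual code. JavacEncodingSketch is a made-up name, and class files really use a modified UTF-8, which gives the same bytes as standard UTF-8 for the characters used here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JavacEncodingSketch {
    // Decode source bytes with the source encoding, re-encode as UTF-8, return hex.
    static String constantBytes(byte[] sourceBytes, String sourceEncoding) {
        String literal = new String(sourceBytes, Charset.forName(sourceEncoding));
        StringBuilder sb = new StringBuilder();
        for (byte x : literal.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", x & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal as it is stored in a GB2312 source file.
        byte[] source = {(byte) 0xC4, (byte) 0xE3};
        // Wrong source encoding: two Latin-1 chars, each needing two UTF-8 bytes.
        System.out.println(constantBytes(source, "ISO-8859-1")); // c3 84 c3 a3
        if (Charset.isSupported("GB2312")) {
            // Correct source encoding: one char, three UTF-8 bytes.
            System.out.println(constantBytes(source, "GB2312")); // e4 bd a0
        }
    }
}
```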



But we tend to ignore this parameter, so there are often cross-platform issues:

Compiling the sample code on a Chinese platform produces ZhClass.

Compiling the sample code on an English platform produces EnClass.

(1) ZhClass runs correctly on the Chinese platform, but not on the English platform.

(2) EnClass runs correctly on the English platform, but not on the Chinese platform.

The reasons are:

(1) When the code is compiled on the Chinese platform, the char[] of str at run time is 0x4f60. When it runs on the Chinese platform, FileWriter's default encoding is gb2312, so CharToByteConverter automatically uses the gb2312 converter, turns str into bytes and passes them to the FileOutputStream, so 0xc4, 0xe3 is written to the file. On the English platform, however, the default CharToByteConverter is 8859_1, so FileWriter automatically uses 8859_1 to convert str. That encoding cannot represent the character, so "?" is written instead.

(2) When the code is compiled on the English platform, the char[] of str at run time is 0x00c4 0x00e3. When it runs on the Chinese platform, these characters are not recognized by the Chinese encoding, so "??" appears. On the English platform, 0x00c4 --> 0xc4 and 0x00e3 --> 0xe3, so 0xc4, 0xe3 is written to the file.
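
One way to avoid depending on FileWriter's platform default is to wrap the output stream in an OutputStreamWriter with an explicit charset. The sketch below writes to a ByteArrayOutputStream so the resulting bytes can be inspected; in real code a FileOutputStream would take its place. WriterDemo and write are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class WriterDemo {
    // Write the text through a Writer that uses the named charset; return the bytes.
    static byte[] write(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) here, so all bytes are in the buffer.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+4F60 cannot be represented in ISO-8859-1: a single "?" (0x3f) is written.
        System.out.println(write("\u4f60", "ISO-8859-1").length); // 1
        // With UTF-8 the character is written as three bytes.
        System.out.println(write("\u4f60", "UTF-8").length);      // 3
    }
}
```

With the charset fixed in the code, the file contents no longer depend on which platform runs the program.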

VI. Other causes:

<%@ page contentType="text/html; charset=GBK" %>



This sets the encoding the browser uses to display the page. If the response data is actually UTF-8 encoded, the page will be garbled, but the cause is different from the ones described above.

VII. Where encoding conversion occurs:

1. From database to Java program: byte --> char

2. From Java program to database: char --> byte

3. From file to Java program: byte --> char

4. From Java program to file: char --> byte

5. From Java program to page display: char --> byte

6. From page form submission to Java program: byte --> char

7. From stream to Java program: byte --> char

8. From Java program to stream: char --> byte

You can configure a filter to solve the garbled Chinese text:

<web-app>
<filter>
<filter-name>RequestFilter</filter-name>
<filter-class>net.golden.uirs.util.RequestFilter</filter-class>
<init-param>
<param-name>charset</param-name>
<param-value>gb2312</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>RequestFilter</filter-name>
<url-pattern>*.jsp</url-pattern>
</filter-mapping>
</web-app>


public void doFilter(ServletRequest req, ServletResponse res,
        FilterChain fchain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    HttpSession session = request.getSession();
    String userId = (String) session.getAttribute("UserId");
    // Set the character set. This actually sets the byte --> char encoding
    // used to decode the request.
    req.setCharacterEncoding(this.filterConfig.getInitParameter("charset"));
    try {
        if (userId == null || userId.equals("")) {
            // Not logged in: redirect to the logon page unless already there.
            if (!request.getRequestURL().toString().matches(
                    ".*/uirs/logon/logon(Controller){0,1}\\x2ejsp$")) {
                session.invalidate();
                response.sendRedirect(request.getContextPath() +
                        "/uirs/logon/logon.jsp");
            }
        }
        else {
            // Check whether the user has permission for the information escalation system.
            if (!net.golden.uirs.util.UirsChecker.check(userId, "information escalation system",
                    net.golden.uirs.util.UirsChecker.ACTION_DO)) {
                if (!request.getRequestURL().toString().matches(
                        ".*/uirs/logon/logon(Controller){0,1}\\x2ejsp$")) {
                    response.sendRedirect(request.getContextPath() +
                            "/uirs/logon/logonController.jsp");
                }
            }
        }
    }
    catch (Exception ex) {
        response.sendRedirect(request.getContextPath() + "/uirs/logon/logon.jsp");
    }
    fchain.doFilter(req, res);
}
