I never believed that Java has a bug that keeps it from displaying several languages at once, so this weekend I studied the multilingual display problem in servlets and JSP, that is, the servlet multiple character set problem. I am not entirely clear on the concept of character sets, so what I write here is not necessarily accurate. This is how I understand character sets in Java: at run time, every String object is stored as Unicode. (I think every language stores strings in some encoding, because inside the computer a string is always represented by codes; but in most programming languages the string encoding is platform-dependent, while Java uses platform-independent Unicode.)
When Java reads a string from a byte stream, it converts the platform-dependent bytes into a platform-independent Unicode string. On output, Java converts the Unicode string back into a platform-dependent byte stream, and if a Unicode character does not exist in the target character set, a '?' is output instead. For example, on Chinese Windows, Java reads a GB2312-encoded file (it can be any stream) into memory and constructs a String object, converting the GB2312 text into a Unicode string; writing the string out converts the Unicode string back into a GB2312 byte stream or array: "中文测试" -----> "\u4E2D\u6587\u6D4B\u8BD5" -----> "中文测试".
// GBK-encoded bytes of "中文测试"
byte[] bytes = new byte[]{(byte)0xd6, (byte)0xd0, (byte)0xce,
        (byte)0xc4, (byte)0xb2, (byte)0xe2, (byte)0xca, (byte)0xd4};
java.io.ByteArrayInputStream bin = new java.io.ByteArrayInputStream(bytes);
// Decode the byte stream as GBK; the reader hands back Unicode characters
java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.InputStreamReader(bin, "GBK"));
String msg = reader.readLine(); // may throw java.io.IOException
System.out.println(msg);
If this program runs on a system (such as a Chinese system) whose character set contains the four characters "中文测试", they are printed correctly. The msg string holds the correct Unicode encoding of "中文测试", "\u4e2d\u6587\u6d4b\u8bd5". When it is printed, it is converted to the operating system's default character set, so whether it displays correctly depends on that character set. Only on systems that support the corresponding character set is our information output correctly; otherwise the result is garbage.
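The loss on output can be reproduced without changing the operating system. The following is a minimal sketch, not part of the original program: the class name LossyOutputDemo and the helper roundTrip are made up for illustration, and ISO-8859-1 simply stands in for "a character set that does not contain Chinese characters".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class LossyOutputDemo {
    // Encodes the text into bytes of the target charset and decodes them again,
    // mimicking a string that is written out and then read back.
    static String roundTrip(String text, Charset target) {
        byte[] out = text.getBytes(target); // unmappable characters are replaced
        return new String(out, target);
    }

    public static void main(String[] args) {
        String msg = "\u4e2d\u6587\u6d4b\u8bd5"; // the four characters from the example above
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(msg, StandardCharsets.UTF_8).equals(msg)); // true
        // ISO-8859-1 has no Chinese characters, so each one becomes '?'.
        System.out.println(roundTrip(msg, StandardCharsets.ISO_8859_1));        // ????
    }
}
```

Once the characters have been replaced by '?', the original text cannot be recovered from the output bytes.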
Now let's look at the multilingual problem in servlets and JSP. Our goal is that clients in any country can send information to the server through a form, the server stores it in the database, and the client still sees exactly what it sent when it retrieves the information later. In practice this means the SQL statements finally built on the server must contain the correct Unicode encoding of the text the client sent, and the encoding used to talk to the database must be able to represent that text. It is best to have JDBC communicate with the database directly in Unicode/UTF-8, which ensures that no information is lost. Likewise, the server should send information back to the client in an encoding that loses nothing, or simply in Unicode/UTF-8.
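Whether a given encoding "can represent the text" can be checked before any data is sent. This is only a sketch of that check: the class EncodingCoverageCheck and the method canHold are hypothetical names, and the sample string is an assumed example of mixed Chinese, Japanese, and Korean input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of any servlet or JDBC API.
class EncodingCoverageCheck {
    // Returns true if every character of the text can be represented in the charset.
    static boolean canHold(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Chinese, Japanese, and Korean samples in one string
        // (written as escapes so the source file itself stays ASCII).
        String mixed = "\u4e2d\u6587 \u65e5\u672c\u8a9e \ud55c\uad6d\uc5b4";
        System.out.println(canHold(mixed, StandardCharsets.UTF_8));      // true
        System.out.println(canHold(mixed, StandardCharsets.ISO_8859_1)); // false
    }
}
```

A link that uses UTF-8 passes this check for any input, which is why it is the safe choice for both the database connection and the response.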
If you do not specify the enctype attribute of the form, the form URL-encodes its input according to the character set of the current page, and the server receives a URL-encoded string. The resulting string depends on the encoding of the page. A GB2312-encoded page that submits "中文测试" sends "%d6%d0%ce%c4%b2%e2%ca%d4", where each "%" is followed by a two-digit hexadecimal number (one byte), while a UTF-8 page sends "%e4%b8%ad%e6%96%87%e6%b5%8b%e8%af%95", because a Chinese character occupies 16 bits in GB2312 but 24 bits in UTF-8. Chinese, Japanese, and Korean browsers from IE4 onward support UTF-8, and UTF-8 covers all three languages, so if our HTML pages use UTF-8 we can support at least these three languages.
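The two encoded forms can be reproduced on the Java side with java.net.URLEncoder. This is a rough sketch rather than what a browser literally does: the class FormEncodingDemo and its encode helper are invented for illustration, and it assumes the JDK in use ships the GB2312 charset (standard full JDKs normally do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; the names are illustrative only.
class FormEncodingDemo {
    // URL-encodes the text as a page in the given charset would submit it.
    static String encode(String text, String pageCharset) {
        try {
            return URLEncoder.encode(text, pageCharset);
        } catch (UnsupportedEncodingException e) {
            // Assumption: the charset name is known to this JDK.
            throw new IllegalArgumentException("Unsupported charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\u6d4b\u8bd5"; // the four characters from the example above
        // Two bytes per character in GB2312: %D6%D0%CE%C4%B2%E2%CA%D4
        System.out.println(encode(text, "GB2312"));
        // Three bytes per character in UTF-8: %E4%B8%AD%E6%96%87%E6%B5%8B%E8%AF%95
        System.out.println(encode(text, "UTF-8"));
    }
}
```

URLEncoder writes the hexadecimal digits in upper case; the case of the digits carries no meaning, so the strings match the lower-case ones quoted above.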
However, if our HTML/JSP pages use UTF-8, the application server may not know it. If the browser's request carries no charset information, the most the server can do is read the Accept-Language request header, and that header alone does not reveal which encoding the browser used, so the application server cannot parse the submitted content correctly. Why? Because all strings in Java are 16-bit Unicode, the job of HttpServletRequest.getParameter(String) is to convert the URL-encoded information submitted by the client into a Unicode string. Some servers can only assume that the client's encoding is the same as the server platform's and decode directly with URLDecoder.decode(String). If the client's encoding happens to match the server's, you get the correct string; otherwise, if the submitted string contains local (non-ASCII) characters, the result is garbage.
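The effect of guessing the wrong encoding can be shown with java.net.URLDecoder alone, outside any servlet container. In this sketch the class FormDecodingDemo and the decode helper are hypothetical, and ISO-8859-1 is only an assumed example of a server default that differs from the page's UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; the names are illustrative only.
class FormDecodingDemo {
    // Decodes a URL-encoded form value, interpreting the bytes in the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The value a UTF-8 page submits for the four-character example above.
        String submitted = "%e4%b8%ad%e6%96%87%e6%b5%8b%e8%af%95";
        String expected = "\u4e2d\u6587\u6d4b\u8bd5";

        // The server decodes with the charset the page really used: correct text.
        System.out.println(decode(submitted, "UTF-8").equals(expected));      // true
        // The server assumes a different charset: 12 unrelated characters, not the 4 sent.
        System.out.println(decode(submitted, "ISO-8859-1").equals(expected)); // false
        System.out.println(decode(submitted, "ISO-8859-1").length());         // 12
    }
}
```

The same bytes arrive in both cases; only the charset used to interpret them differs, which is why the server has to know the encoding of the page that produced the form.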