Garbled characters encountered during Java Development


To understand why garbled Chinese characters appear during JSP development, let's first take a look at Unicode encoding.
Unicode ("unified code"), as its name implies, is a collection of all the scripts in the world. It is promoted by the major U.S. computer manufacturers with the goal of establishing a single universal encoding system, so as to reduce the problems computer vendors run into when developing products for foreign markets.
In order to collect tens of thousands of characters into a common encoding mechanism, every character in Unicode is represented by two bytes, whether it belongs to an Eastern or a Western script (keeping the principle of economy in mind). This gives 2^16 = 65,536 different combinations, enough to meet the needs of most current use cases.
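As a side note, Java's built-in char type follows exactly this design: it is a 16-bit Unicode code unit. Below is a minimal sketch of my own (class and variable names are arbitrary) that makes the 2^16 range visible:

public class CharRange {
    public static void main(String[] args) {
        System.out.println(Character.SIZE);             // 16: a Java char is a 16-bit Unicode code unit
        System.out.println((int) Character.MAX_VALUE);  // 65535, i.e. 2^16 - 1
        char han = '\u4E2D';                            // the Chinese character "zhong" (中)
        System.out.println("U+" + Integer.toHexString(han).toUpperCase()); // prints U+4E2D
    }
}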
Fundamentally, computers only process numbers; they store letters and other characters by assigning a number to each of them. Before Unicode was created, there were hundreds of encoding systems for assigning these numbers, and no single encoding could contain enough characters: the European Union alone, for example, needs several different encodings to cover all of its languages. Even for a single language such as English, no one encoding covered all the letters, punctuation marks, and common technical symbols.
These encoding systems also conflict with one another: two encodings may use the same number for two different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, yet whenever data passes between encodings or platforms, it risks being corrupted.
Unicode provides a unique number for every character, regardless of platform, program, or language. The Unicode standard has been adopted by industry leaders such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and many other companies. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML, and it is the official way to implement ISO/IEC 10646. It is supported by many operating systems, all modern browsers, and many other products. The emergence of the Unicode standard, and the availability of tools that support it, is among the most important recent trends in software technology.
Incorporating Unicode into client-server or multi-tier applications and websites can cut costs compared with legacy character sets. Unicode allows a single software product or website to serve multiple platforms, languages, and countries without re-engineering, and data can be moved through many different systems without corruption.
The terms ISO 10646 and UCS are often seen in Unicode-related technical documents.
ISO is the International Organization for Standardization, headquartered in Switzerland.
UCS (Universal Character Set) is the universal character set defined by ISO standard 10646.
UCS uses four bytes per character, which is enough to hold every official and commercial character set in the world within a single scheme. Unicode has worked closely with ISO's UCS working group since 1991 to keep Unicode and ISO 10646 consistent; as a result, since version 2.0 Unicode has used the same code assignments as ISO 10646-1.
The Kangxi Dictionary alone contains some 40,000 Chinese characters. Add the simplified characters it does not include and the variant forms used in Japanese, and the space Unicode would have to allocate to Chinese characters alone would be far from enough, to say nothing of Thai or Arabic. To address this problem, Unicode and UCS adopted the CJK unification (Han unification) approach, in which Chinese, Japanese, and Korean characters with essentially the same form share a single code point.
The set of Han characters in Unicode after CJK unification is called Unihan.
The complete Unihan data for Unicode 4.0 can be downloaded from http://www.unicode.org/Public/UNIDATA/Unihan.txt.

UTF (Unicode/UCS Transformation Format): Unicode recommends two transformation formats, UTF-8 and UTF-16, where 8 and 16 refer to the number of bits rather than bytes.
UTF-16 is essentially the double-byte implementation of Unicode, plus an extension mechanism to meet future needs (rarely used).
UTF-8 is a variable-length encoding: English letters and digits (the ASCII range) are left completely untouched (so no conversion is needed), while Chinese character data must be converted by the program and "gets fatter", because each character needs one or two additional bytes.
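You can see this "getting fatter" directly in Java. The following is an illustrative sketch of my own (the sample strings are arbitrary):

import java.io.UnsupportedEncodingException;

public class Utf8Width {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("A".getBytes("UTF-8").length);       // 1 byte: ASCII is unchanged in UTF-8
        System.out.println("\u4E2D".getBytes("UTF-8").length);  // 3 bytes: the character 中 grows in UTF-8
        System.out.println("\u4E2D".getBytes("GB2312").length); // 2 bytes in GB2312, for comparison
    }
}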
The UCS character set defines the UCS-2 and UCS-4 encodings, where 2 and 4 refer to the number of bytes per character.
UCS-2 is essentially the same as Unicode's double-byte encoding.
UCS-4 represents each character with four bytes; prefixing each UCS-2 code with two zero bytes yields the corresponding UCS-4 code.

Unicode space allocation:
The Unicode code points below are given in hexadecimal notation.
The first 256 code points of Unicode (U+0000 to U+00FF) are exactly the same as ISO-8859-1 (Western European letters), and the first half of that range (U+0000 to U+007F) is ASCII. Prefixing each ISO-8859-1 code with a zero byte (0x00) gives the corresponding Unicode code.
Unihan is mainly distributed between U+3400 and U+FAFF, while the characters of GB2312 and Big5 mainly fall between U+4E00 and U+9FFF.
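A small sketch of my own (the sample characters are arbitrary) that shows these ranges from Java:

public class CodePoints {
    public static void main(String[] args) {
        char eAcute = '\u00E9'; // 'é': 0xE9 in ISO-8859-1, U+00E9 in Unicode (same value, zero byte in front)
        char han    = '\u4E2D'; // '中': falls inside the CJK block U+4E00 to U+9FFF
        System.out.println("U+" + Integer.toHexString(eAcute).toUpperCase()); // U+E9
        System.out.println("U+" + Integer.toHexString(han).toUpperCase());    // U+4E2D
    }
}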

The encoding principle and characteristics of UTF-8:
Now that we know where Western European characters and Chinese characters sit in Unicode, let's look at how UTF-8 encodes them.

U+0000 ~ U+007F    0xxxxxxx                      (7 free bits)
U+0080 ~ U+07FF    110xxxxx 10xxxxxx             (11 free bits)
U+0800 ~ U+FFFF    1110xxxx 10xxxxxx 10xxxxxx    (16 free bits)
You can check that the free bits (marked x) in each form are sufficient to represent the Unicode code points in the corresponding range.
So when a program processes a UTF-8 encoded file, how does it know where a character's boundary falls, and which of the three forms it is in?
Every UTF-8 encoded character, whether it occupies one, two, or three bytes, announces its total byte count at the front of its first byte. For example, 110 contains two leading 1 bits, which means the character is in the second form and consists of two bytes; 1110 contains three leading 1 bits, which means the character is in the third form and consists of three bytes.
Every multi-byte UTF-8 sequence also shares a common feature: the second and third bytes always start with the two bits 10. Because their highest bit is 1, they are easy to distinguish from ASCII characters, which occupy only one byte in UTF-8.
Because of these design features, UTF-8 and Unicode can be freely converted in both directions without losing any information.
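These byte patterns are easy to inspect from Java. The following sketch is my own illustration (not part of the original text):

import java.io.UnsupportedEncodingException;

public class Utf8Pattern {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = "\u4E2D".getBytes("UTF-8"); // '中' (U+4E2D) encodes to E4 B8 AD
        for (byte b : bytes) {
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
            boolean continuation = (b & 0xC0) == 0x80; // second and third bytes start with 10
            System.out.println(bits + (continuation ? "  (continuation byte)" : "  (lead byte)"));
        }
    }
}

Running it prints 11100100, 10111000, 10101101, matching the 1110xxxx 10xxxxxx 10xxxxxx form above.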

Solutions: I have not run into the garbled-character problem on the NT operating system, but it comes up frequently on Unix and Linux systems.
Because operating systems and runtime environments differ, garbled characters arise in different ways. However, once you have grasped the Unicode encoding principles above, careful analysis will let you solve most of these problems.
Below are several common examples.
1. If a web server such as Tomcat produces garbled Chinese characters, you can modify server.xml in the conf directory:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
    enableLookups="false" redirectPort="8443" acceptCount="100" debug="0"
    connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="GBK"/>
That is, set URIEncoding to GBK or GB2312.
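With the connector configured this way, query-string parameters reach the servlet already decoded as GBK. A hedged sketch (the servlet class and parameter names are made up for illustration):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class NewsServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // With URIEncoding="GBK", Tomcat decodes GET parameters such as ?news=... as GBK
        String news = request.getParameter("news");
        response.setContentType("text/html; charset=GBK"); // keep the response in the same encoding
        response.getWriter().println(news);
    }
}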
2. Chinese characters passed through a form or as a string: the text is entered correctly but comes out garbled after submission. Because form data is generally submitted as ISO-8859-1, it must be converted to GB2312 before display:

String s = new String(rs.getString("news").getBytes("ISO8859_1"), "GB2312");
// rs.getString("news") is the string to be converted
Then use the value of s.
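Applied to a submitted form field, the same conversion looks like this (a sketch; the parameter name "news" is only an example):

// The container hands the value over decoded as ISO-8859-1; reinterpret its bytes as GB2312
String raw = request.getParameter("news");
String s = new String(raw.getBytes("ISO8859_1"), "GB2312");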
3. In some server-side environments, setting the language/locale to Simplified Chinese avoids this problem.
4. Characters inserted into the database turn into garbled characters.
Check which encodings the database supports, and then apply a conversion similar to the one in item 2.
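For example, if the database is MySQL accessed through MySQL Connector/J (an assumption of mine; the article does not name a database), the connection URL can declare the client encoding so the driver converts strings correctly:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class GbkConnection {
    public static Connection open() throws SQLException {
        // useUnicode/characterEncoding tell Connector/J how to convert between Java strings and column bytes
        // (database name, user, and password below are placeholders)
        String url = "jdbc:mysql://localhost:3306/newsdb?useUnicode=true&characterEncoding=GBK";
        return DriverManager.getConnection(url, "user", "password");
    }
}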
5. In short, JSP-based development runs into garbled characters in many places. You need to analyze whether the garbling happens while reading or while writing. The conversion shown in item 2 solves most cases; sometimes you also need a conversion when writing, for example:
// Convert when writing:
String s = new String(rs.getString("news").getBytes("GB2312"), "ISO8859_1");
// Convert back when reading:
String s = new String(rs.getString("news").getBytes("ISO8859_1"), "GB2312");
Or swap ISO8859-1 and GB2312; with a little experimentation you will find the combination that solves the problem.
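If these conversions show up in many places, it may help to wrap them in a small utility class. A sketch of my own (names are arbitrary):

import java.io.UnsupportedEncodingException;

public class EncodingUtil {
    // Convert before writing: repack GB2312 text as an ISO-8859-1 string
    public static String toLatin1(String gb) throws UnsupportedEncodingException {
        return new String(gb.getBytes("GB2312"), "ISO8859_1");
    }
    // Convert after reading: re-decode an ISO-8859-1 string as GB2312
    public static String toGb2312(String latin1) throws UnsupportedEncodingException {
        return new String(latin1.getBytes("ISO8859_1"), "GB2312");
    }
}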
