Analysis and solution of Chinese character problems in Java programming

Source: Internet
Author: User

In Java programming we often run into problems processing and displaying Chinese characters. A screen full of garbled text is certainly not what we want to see, so how can we get Chinese characters to display correctly? Java represents text internally as Unicode, whereas the files and databases that Chinese users typically work with are encoded in GB2312 or Big5. How do we choose the right encoding and handle Chinese text correctly? This article starts with encoding basics and then uses Java examples to analyze these two questions and propose solutions.

The Java programming language is now widely used on the Internet, and Sun took support for non-English characters into account from the time it first developed the language. The Java Runtime Environment (JRE) that Sun publishes comes in an English edition and an international edition, and only the international edition supports non-English characters. In practice, however, Java's support for Chinese is not as complete as the JavaSoft specifications claim, because more than one Chinese character set exists and different operating systems support them differently. As a result, encoding problems crop up constantly during application development. Many answers to these problems are available, but most are piecemeal and do not satisfy developers who need a thorough fix, and systematic studies of Chinese handling in Java are rare. This article starts from encoding basics and analyzes the problem in the hope of helping readers solve it.

Encoding basics

As we know, English characters are generally represented in one byte, most commonly using ASCII. But one byte can distinguish at most 256 characters, and there are thousands of Chinese characters, so Chinese characters are encoded in two bytes. To keep them distinguishable from English characters, the highest bit of each byte is set to 1. Two unrestricted bytes can represent up to 64K values; with both high bits fixed at 1, at most 128 × 128 = 16,384 remain. The encodings we meet most often are GB2312, Big5, and Unicode. Readers interested in the details of each can consult the relevant references; here I only touch on GB2312 and Unicode, the two most relevant to us. GB2312, the Chinese character code for information interchange, is a national standard of the People's Republic of China for encoding Simplified Chinese, issued by the national standards administration. It is used in mainland China and Singapore and is commonly called the GB code. Each Chinese character is encoded in two bytes: the first (high) byte is the character's zone number plus 20H, and the second (low) byte is its position number plus 20H. Setting the high bit of each of these bytes (adding a further 80H) gives the form normally stored in files. Unicode is a character encoding standard, maintained by the Unicode Consortium, that was designed to cover the characters of all languages. In its original 16-bit form every character occupies two bytes, and it stays compatible with ASCII by placing a zero byte in front of each English character: the ASCII code for "A" is 0x41, and its Unicode code is 0x00, 0x41. With suitable tools, text can be converted from one encoding to another.
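The byte arithmetic above can be checked with a short Java sketch. The class and method names here are illustrative only, and the 0x8080 mask for the stored form is an assumption based on the common EUC-CN layout of GB2312 rather than something spelled out in this article.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBasics {
    // GB2312 national code: zone and position numbers (1..94), each offset by 0x20.
    public static int nationalCode(int zone, int position) {
        return ((zone + 0x20) << 8) | (position + 0x20);
    }

    // Stored (EUC-CN) form: the national code with the high bit of both bytes set.
    public static int internalCode(int zone, int position) {
        return nationalCode(zone, position) | 0x8080;
    }

    // Formats a byte array as space-separated two-digit hex values.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII))); // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        // Zone 16, position 1 holds the first Chinese character in GB2312.
        System.out.printf("%04X%n", nationalCode(16, 1)); // 3021
        System.out.printf("%04X%n", internalCode(16, 1)); // B0A1
    }
}
```

The first two lines of output show the zero byte that 16-bit Unicode puts in front of an ASCII character; the last two show how the same zone and position numbers yield either a 7-bit pair or a pair with both high bits set.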

A first look at Chinese character problems in Java

When we develop applications in Java, we inevitably have to handle Chinese text. Java works in Unicode internally, while the databases and files we normally use are encoded in GB2312, so we often run into situations like these: a JSP-based website displays garbled text in the browser; a file shows garbled content after it is opened and read; or database content modified by a Java program can no longer be used correctly by other applications.

String sEnglish = "apple";

String sChinese = "苹果";

String s = "苹果 apple";

Measured in GBK bytes, sEnglish has length 5 and sChinese has length 4, since each Chinese character takes two bytes, while s mixes the two kinds of character. (String.length() itself counts UTF-16 code units, so it returns 5 for sEnglish and 2 for sChinese.) For sEnglish, the Java classes offer solid support and it will certainly display correctly. For sChinese and s, however, although JavaSoft states that Java's core classes support multinational characters (through Unicode), the operating system's default encoding is not Unicode but the GB code. To get from Java source code to a correct result on screen, the text passes through "Java source code -> Java bytecode -> virtual machine -> operating system -> display device". Chinese character encoding has to be handled correctly at every step for the final display to be correct.
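The difference between character count and byte count can be seen in a small sketch. The class name and the byteLength helper are illustrative, the Chinese word for "apple" is written with Unicode escapes so that the source file stays pure ASCII, and GBK is assumed to be available in the runtime (it is checked before use, since only a handful of charsets are guaranteed on every Java platform).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    // Number of bytes the string occupies once encoded in the given charset.
    public static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String sEnglish = "apple";
        String sChinese = "\u82f9\u679c";       // two Chinese characters
        String s = "\u82f9\u679c apple";        // mixed text

        // String.length() counts UTF-16 code units, not bytes.
        System.out.println(sEnglish.length());  // 5
        System.out.println(sChinese.length());  // 2
        System.out.println(s.length());         // 8

        // Byte counts depend on the target encoding.
        System.out.println(byteLength(sChinese, StandardCharsets.UTF_16BE)); // 4
        System.out.println(byteLength(sChinese, StandardCharsets.UTF_8));    // 6
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println(byteLength(sChinese, gbk)); // 4
            System.out.println(byteLength(s, gbk));        // 10
        }
    }
}
```

The same string therefore has one length inside the virtual machine and a different size in every external encoding, which is why each stage of the pipeline has to agree on the encoding in use.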

"Java source code-> java bytecode", the standard Java compiler javac use of the character set is the system default character set, such as on the Chinese Windows operating system is GBK, and on the Linux operating system is iso-8859-1, So you will find that in the Linux operating system compiled classes in the source file in the Chinese characters are problematic, the solution is to compile the time to add encoding parameters, so as to be independent of the platform. Usage is

javac -encoding GBK
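What goes wrong when the compiler assumes the wrong encoding can be imitated in plain Java. This is only a sketch of the decoding step, not of javac itself: the class name and the decode helper are illustrative, and GBK is assumed to be present in the runtime (the program exits early if it is not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceDecodingDemo {
    // Mimics a compiler's first step: turning file bytes into characters
    // with whatever charset it was told, or defaulted, to use.
    public static String decode(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        if (!Charset.isSupported("GBK")) {
            System.out.println("GBK is not available in this runtime");
            return;
        }
        Charset gbk = Charset.forName("GBK");
        String literal = "\u82f9\u679c";
        byte[] onDisk = literal.getBytes(gbk); // the bytes a GBK editor would save

        // Decoded with the right charset, the literal survives intact.
        System.out.println(decode(onDisk, gbk).equals(literal)); // true
        // Decoded as ISO-8859-1, every byte becomes its own character.
        System.out.println(decode(onDisk, StandardCharsets.ISO_8859_1).length()); // 4
    }
}
```

With the wrong charset the two Chinese characters turn into four unrelated Latin-1 characters, and that damaged text is what ends up in the class file.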

"Java bytecode-> virtual machine-> operating System", the Java Runtime Environment (JRE) is in English and international editions, but only international editions support non-English characters. The Java Development Kit (JDK) certainly supports multinational characters, but not all computer users have JDK installed. Many operating systems and applications in order to better support Java, are embedded in the international version of the JRE, for their support of multinational characters to provide convenience.

"Operating system-> display device", for Chinese characters, the operating system must support and be able to display it. If the English operating system does not match the special application software, it is definitely not able to display Chinese.

Another issue is converting Chinese characters to the correct encoding inside the Java program itself. For example, when you write a Chinese string to a web page, whether you use

out.println(string);

or

<%= string %>

the string must be converted from Unicode to GBK, either manually or automatically. In JSP 1.0 you can declare an output character set so that the conversion happens automatically. The usage is

<%@ page contentType="text/html;charset=gb2312" %>
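Conceptually, declaring a charset for the page amounts to wrapping the response stream in a writer that encodes with that charset. The following is a rough stand-alone sketch using only java.io, not the servlet or JSP API; the class name and the render helper are illustrative, and GB2312 is assumed to be available in the runtime (it is checked before use).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputEncodingDemo {
    // Encodes the text the way a writer bound to the given charset would,
    // and returns the bytes that would travel to the browser.
    public static byte[] render(String text, Charset charset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(body, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u82f9\u679c"; // two Chinese characters
        System.out.println(render(text, StandardCharsets.UTF_8).length); // 6
        if (Charset.isSupported("GB2312")) {
            // Two bytes per character, matching charset=gb2312 in the page directive.
            System.out.println(render(text, Charset.forName("GB2312")).length); // 4
        }
    }
}
```

The browser decodes correctly only if the charset it is told about matches the one the writer actually used.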

However, some JSP versions (for example, JSP 0.92) do not support declaring an output character set, so the conversion has to be done by hand in code. There are many ways to do this; the most common is

String s1 = request.getParameter("keyword");

String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");
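Why this two-step conversion works can be shown without a servlet container. The sketch below assumes the container decoded the incoming bytes as ISO-8859-1, which maps every byte to exactly one character, so the original bytes can be recovered and decoded again. The class name and the repair helper are illustrative, and GBK is assumed to be available in the runtime (it is checked before use).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRepairDemo {
    // Recovers the raw bytes from a string that was wrongly decoded as
    // ISO-8859-1, then decodes them again with the charset actually used.
    public static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GBK")) {
            System.out.println("GBK is not available in this runtime");
            return;
        }
        Charset gbk = Charset.forName("GBK");
        String keyword = "\u82f9\u679c";

        // What the browser sends, and what a Latin-1 container makes of it.
        byte[] wire = keyword.getBytes(gbk);
        String s1 = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(s1.equals(keyword)); // false

        String s2 = repair(s1, gbk);
        System.out.println(s2.equals(keyword)); // true
    }
}
```

The repair is lossless only because ISO-8859-1 round-trips all 256 byte values; if the first decoding had used a charset that rejects or replaces some bytes, the original text could not be restored this way.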
