Analysis of JSP page encoding problems

Last Update:2017-01-13 Source: Internet

Author: User

Tags character set tomcat


Consider the following JSP page:
<%@ page contentType="text/html; charset=utf-8" %>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body>
China
</body>
</html>
Why does the text "China" (Chinese characters in the original page) come out garbled when this page runs?
Analysis
Key steps
To analyze this problem, we need to look at the life cycle of a JSP page request. It generally goes through the following stages:
1. The application server generates a .java file from the JSP page.
2. The application server invokes the Java compiler (javac) to compile the .java file into the class file of the corresponding servlet.
3. The user's browser requests the servlet that corresponds to the JSP. The web container starts a thread to execute the servlet and returns the data to the client browser.
4. The user's browser displays the result to the user based on the returned data.
Analysis of the key steps
To understand the encoding problem better, we go through the four stages above one by one and derive the final solution from the results.
1. The application server generates a .java file from the JSP page.

The application server reads the whole JSP page and writes it out as a new .java file. Both reading and writing files involve an encoding, so how does the application server decide which one to use? Studying the source code of the Tomcat application server shows that the pageEncoding attribute is central. ParserController reads this attribute from the JSP file and saves it; if it is absent, it reads the charset from the contentType attribute in the first line instead. If neither is found, a default pageEncoding is read from JspConfig, and if that is not set either, Tomcat falls back to ISO-8859-1 to read the original JSP file.
This covers how the application server chooses the encoding for reading JSP files. Because Java stores characters internally as Unicode, the generated .java file is written out in a Unicode encoding (Tomcat uses UTF-8, as step 2 notes).
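The lookup order described above can be sketched in a few lines of Java. This is only an illustration of the fallback chain, not Tomcat's actual code: the class PageEncodingResolver and its methods are hypothetical names, and the charset parsing is deliberately simplistic.

```java
import java.util.Locale;

// Hypothetical helper that mirrors the lookup order described in the text.
class PageEncodingResolver {

    // Returns the first encoding that is set: pageEncoding, then the charset
    // in contentType, then the configured default, then ISO-8859-1.
    static String resolve(String pageEncoding, String contentType, String configDefault) {
        if (pageEncoding != null) {
            return pageEncoding;
        }
        String fromContentType = charsetOf(contentType);
        if (fromContentType != null) {
            return fromContentType;
        }
        if (configDefault != null) {
            return configDefault;
        }
        return "ISO-8859-1";
    }

    // Extracts the value after "charset=" from a content type, or returns null.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (idx < 0) {
            return null;
        }
        String value = contentType.substring(idx + "charset=".length());
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "text/html; charset=utf-8", null));  // utf-8
        System.out.println(resolve("gbk", "text/html; charset=utf-8", null)); // gbk
        System.out.println(resolve(null, "text/html", null));                 // ISO-8859-1
    }
}
```

The last call shows the case that most often causes garbled output: nothing is declared, so a page saved in GBK or UTF-8 would be read as ISO-8859-1.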
2. The JDK compiles the .java file into a class file.
The -encoding option of javac specifies the encoding of the source file. It matters a great deal when compiling by hand, because it determines how the compiler decodes the .java file. In web applications we can ignore this step, because the application server handles it well. Take Tomcat as an example: the generated .java file is always written in UTF-8, and Tomcat also reads it back as UTF-8. In the source you can see reader = new InputStreamReader(hconn.getInputStream(), charset); with charset set to UTF-8. The application server therefore has this step under control, and it does not introduce encoding problems.
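To see why the charset passed to InputStreamReader matters, here is a small sketch that decodes the same bytes twice. It uses an in-memory byte array instead of a real source file, and the class name SourceReaderDemo is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceReaderDemo {

    // Decodes the bytes through an InputStreamReader with an explicit charset.
    static String read(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own storage encoding.
        byte[] utf8 = "\u4e2d\u56fd".getBytes(StandardCharsets.UTF_8);

        // Same charset for writing and reading: the text survives.
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals("\u4e2d\u56fd")); // true

        // Wrong charset: every byte becomes its own character, so 2 characters turn into 6.
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```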
3. The user's browser requests the servlet corresponding to the JSP.

If the previous steps introduced no encoding problem, the JVM holds the correct string "China" at run time, and throughout servlet execution it stays in memory as Unicode. The third step therefore comes down to how JspWriter returns data to the client browser. You can check that new String(str.getBytes("encoding"), "encoding") does not produce garbled characters at run time, provided the encoding can represent every character in str. In other words, getBytes() can turn the same string into byte arrays in different encodings (the underlying i18n.jar provides the byte2char and char2byte conversions).
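The getBytes() round trip mentioned above is easy to try out. The sketch below uses a hypothetical class RoundTripDemo; it also shows the one case where the round trip does lose data, namely when the chosen encoding has no representation for the characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {

    // Encodes the string to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd";

        // UTF-8 can represent both characters, so nothing is lost.
        System.out.println(roundTrip(china, StandardCharsets.UTF_8).equals(china)); // true

        // ISO-8859-1 cannot, so getBytes() substitutes '?' for each character.
        System.out.println(roundTrip(china, StandardCharsets.ISO_8859_1)); // ??
    }
}
```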
With that understood, the next question is which encoding JspWriter uses to output strings. Tomcat's Response.java class shows that the writer's encoding is derived from contentType. That is, the byte stream returned to the client is the byte array produced with the charset given in contentType.
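As a rough illustration of a writer whose encoding follows contentType, the sketch below builds an OutputStreamWriter from the charset named in the header value. It is not Tomcat's Response class; ResponseWriterDemo and render are hypothetical names, the code assumes the header always contains "charset=", and it assumes the JDK ships the GBK charset (standard JDK builds do).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class ResponseWriterDemo {

    // Encodes the text with the charset named in contentType.
    // Assumption: contentType contains "charset=" as its last parameter.
    static byte[] render(String text, String contentType) {
        int start = contentType.indexOf("charset=") + "charset=".length();
        Charset charset = Charset.forName(contentType.substring(start).trim());
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(body, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd";
        // The same two characters produce different byte streams per charset.
        System.out.println(render(china, "text/html; charset=utf-8").length); // 6
        System.out.println(render(china, "text/html; charset=gbk").length);   // 4
    }
}
```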
4. The browser displays the result based on the returned data.

The previous analysis shows that the "China" returned by the application server is encoded with the charset in contentType. As long as the browser knows which encoding to use to turn the received byte stream back into a string, and selects that encoding when rendering the page, the user sees the correct "China". Fortunately, current versions of IE and other browsers do exactly this.
Conclusion
The analysis shows that the part of JSP page encoding we actually have to get right is the step from JSP file to .java file, that is, the pageEncoding setting. Because pageEncoding is defined by the specification (Servlet 2.3 / JSP 1.2), the approaches below apply to many application servers. In my work I have ended up with the following:
1. Add the pageEncoding attribute to the JSP page, for example <%@ page contentType="text/html; charset=utf-8" pageEncoding="gbk" %>. The page file can then be saved in ANSI (GBK on a Chinese Windows system). In other words, whenever the encoding the page is stored in differs from the charset in contentType, add the pageEncoding attribute.
2. Some application servers (such as WebLogic) do not look at pageEncoding first, nor do they take the encoding from charset; they read it from another configuration file instead. For WebLogic, add the following to weblogic.xml:
<jsp-descriptor>
<jsp-param>
<param-name>compilerSupportsEncoding</param-name>
<param-value>true</param-value>
</jsp-param>
<jsp-param>
<param-name>encoding</param-name>
<param-value>gbk</param-value>
</jsp-param>
</jsp-descriptor>
(Tomcat 5.x offers a similar mechanism: add the following configuration items to the application's web.xml file.)
<jsp-config>
<jsp-property-group>
<url-pattern>*.jsp</url-pattern>
<el-ignored>true</el-ignored>
</jsp-property-group>
</jsp-config>
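What method 1 achieves can be pictured as a two-stage conversion: the container decodes the stored file with pageEncoding, and the response writer encodes the text with the charset in contentType. The sketch below is a simplified model of that, not container code; PageTranscodeDemo is a hypothetical name and the example assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PageTranscodeDemo {

    // Stage 1 decodes the stored page with pageEncoding,
    // stage 2 encodes the text with the response charset.
    static byte[] transcode(byte[] stored, Charset pageEncoding, Charset responseCharset) {
        String text = new String(stored, pageEncoding);
        return text.getBytes(responseCharset);
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+56FD as they are stored in a GBK file.
        byte[] stored = {(byte) 0xD6, (byte) 0xD0, (byte) 0xB9, (byte) 0xFA};
        byte[] sent = transcode(stored, Charset.forName("GBK"), StandardCharsets.UTF_8);

        byte[] expected = "\u4e2d\u56fd".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(sent, expected)); // true
    }
}
```

If stage 1 used the wrong charset (for instance UTF-8 for a GBK file), the string would already be damaged before the response is written, which is why pageEncoding has to match the file as stored on disk.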
Overview of common encodings

GB2312
The Chinese character information interchange code of the People's Republic of China. Its full name is "Chinese Character Coded Character Set for Information Interchange, Basic Set". It was published by the national standards administration, took effect on May 1, 1981, and is used throughout mainland China; Singapore and other places use it as well. GB2312-80 was drawn up in the early stage of Chinese character information technology in China. It contains most commonly used level-1 and level-2 Chinese characters plus the symbols of 9 zones. Almost all Chinese systems and internationalized software support this character set, which makes it the most basic Chinese character set. The encoding range is 0xA1-0xFE for the high byte and 0xA1-0xFE for the low byte. Chinese characters start at 0xB0A1 and end at 0xF7FE, and there are more than 6000 of them (not counting special characters).
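The byte ranges above can be checked directly. The following sketch tests whether an encoded character falls into the Chinese character area 0xB0A1-0xF7FE; Gb2312RangeDemo is a hypothetical name and the example assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;

class Gb2312RangeDemo {

    // True if the bytes are one GB2312 character from the Chinese character
    // area: high byte 0xB0-0xF7, low byte 0xA1-0xFE.
    static boolean isGb2312Hanzi(byte[] bytes) {
        if (bytes.length != 2) {
            return false;
        }
        int high = bytes[0] & 0xFF;
        int low = bytes[1] & 0xFF;
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        // U+4E2D is encoded as 0xD6 0xD0, inside the Chinese character area.
        System.out.println(isGb2312Hanzi("\u4e2d".getBytes(gb2312))); // true
        // ASCII stays a single byte and is not in that area.
        System.out.println(isGb2312Hanzi("A".getBytes(gb2312)));      // false
    }
}
```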
GBK
GBK is an extension of GB2312-80 and is upward compatible with it. The encoding range is 0x8140-0xFEFE, excluding positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. It is currently the default character set of Chinese Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because its authors do not fully understand what GBK is. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will have fulfilled its historical mission in the near future.
GBK is a newer Chinese encoding drawn up in mainland China as an extension of the national standard and aligned with UCS. It contains 22014 characters in total, including special characters.
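The relationship between the two character sets can be probed with CharsetEncoder.canEncode. In the sketch below, U+9555 serves as an example of a character that GBK covers but GB2312 does not; GbkSupersetDemo is a hypothetical name and the JDK is assumed to provide both charsets.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class GbkSupersetDemo {

    // True if the named charset has an encoding for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // A common character is in both sets and has the same bytes in both.
        byte[] inGb2312 = "\u4e2d".getBytes(Charset.forName("GB2312"));
        byte[] inGbk = "\u4e2d".getBytes(Charset.forName("GBK"));
        System.out.println(Arrays.equals(inGb2312, inGbk)); // true

        // U+9555 exists only in the extended set.
        System.out.println(canEncode("GB2312", '\u9555')); // false
        System.out.println(canEncode("GBK", '\u9555'));    // true
    }
}
```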
Unicode
Unicode uses a 16-bit encoding system, and its character set is identical to the Basic Multilingual Plane of ISO 10646. Unicode passed as a DIS (Draft International Standard) in June 1992. The version described here dates from 1996 and contains 6811 symbols, 20902 Chinese characters, 11172 Korean Hangul syllables, 6400 private-use positions, and 20249 reserved positions, for a total of 65534.
UTF-8
Commonly called the "universal code" (Wanguo code), it aims to express the writing systems of all countries under one unified encoding scheme. To cover more scripts, UTF-8 mixes 2-byte and 3-byte sequences. The range of Chinese characters it currently covers is smaller than that of GBK. Encoding Chinese characters in 3 bytes also raises compatibility issues: existing files in GBK, GB2312, or GB18030 cannot be processed correctly as they are. Its variable width makes programming harder, because even the most basic character handling functions have to examine each byte individually to find character boundaries. This slows processing down and calls for complex, error-prone code.
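The variable width described above shows up as soon as you count bytes. This small sketch compares the encoded length of a few characters; ByteLengthDemo is a hypothetical name and the JDK is assumed to provide the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {

    // Number of bytes the text occupies in the given charset.
    static int length(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(length("A", StandardCharsets.UTF_8));       // 1 (ASCII)
        System.out.println(length("\u00e9", StandardCharsets.UTF_8));  // 2 (Latin letter with accent)
        System.out.println(length("\u4e2d", StandardCharsets.UTF_8));  // 3 (Chinese character)
        System.out.println(length("\u4e2d", Charset.forName("GBK")));  // 2 (same character in GBK)
    }
}
```

The last two lines are the compatibility issue in miniature: the same character occupies different bytes in the two encodings, so a file has to be converted rather than reinterpreted.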
Appendix: encoding methods and relationships
Unicode:

The encoding scheme developed by unicode.org, intended to cover the commonly used scripts of the whole world.
Version 1.0 is a 16-bit code running from U+0000 to U+FFFF, where each 2-byte code corresponds to one character. Starting with 2.0 the 16-bit limit was dropped: the original 16 bits became the basic plane and 16 more planes were added, which is roughly equivalent to a 20-bit encoding with a range of 0 to 0x10FFFF.
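Java strings make the step beyond 16 bits visible, because a char is still 16 bits wide. The sketch below, with the hypothetical class name CodePointDemo, shows that a character from the basic plane fits into one char while a character from an added plane needs two.

```java
class CodePointDemo {

    // Builds a string that holds exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Upper end of the range described in the text.
        System.out.println(Character.MAX_CODE_POINT == 0x10FFFF); // true

        // Basic plane: one 16-bit char.
        System.out.println(fromCodePoint(0x4E2D).length()); // 1

        // Beyond the basic plane: two chars (a surrogate pair), still one code point.
        String beyond = fromCodePoint(0x1F600);
        System.out.println(beyond.length());                            // 2
        System.out.println(beyond.codePointCount(0, beyond.length()));  // 1
    }
}
```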

UCS:

The Universal Character Set defined by ISO in ISO 10646, which uses a 4-byte encoding.

Unicode and UCS:

ISO and unicode.org are two different organizations, so they initially developed different standards. Since Unicode 2.0, however, Unicode has used the same character repertoire and code points as ISO 10646-1, and ISO has promised that ISO 10646 will not assign UCS-4 codes above 0x10FFFF, so the two stay consistent.

The encoding forms of UCS:

UCS-2 is basically the same as the 2-byte encoding of Unicode.
UCS-4 is a 4-byte encoding; at present it is UCS-2 with two all-zero bytes added in front.

UTF: Unicode/UCS Transformation Format
UTF-8 is an 8-bit encoding. ASCII is left unchanged and all other characters use a variable-length encoding, 1 to 3 bytes per character in the basic plane (4 bytes for characters beyond it). It is usually used as an external code and has the following advantages:
* It does not depend on CPU byte order, so data can be exchanged between different platforms.
* High fault tolerance. If a byte is damaged, at most one character is lost and the error does not cascade (in other encodings a single bad byte can garble the entire line).
UTF-16 is a 16-bit variable-length encoding, roughly equivalent to a 20-bit code with values between 0 and 0x10FFFF, and is essentially a direct implementation of Unicode. It depends on CPU byte order, but because it saves the most space it is often used as an external code for network transmission.
UTF-16 is the preferred encoding of Unicode.
UTF-32 uses 32-bit code units restricted to the Unicode range (0 to 0x10FFFF), which makes it equivalent to a subset of UCS-4.
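The differences between these forms are easiest to see on a single character. The sketch below prints the bytes of U+4E2D in several UTF encodings as hexadecimal; UtfBytesDemo and hex are names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UtfBytesDemo {

    // Encodes the text and returns its bytes as lowercase hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            out.append(String.format("%02x", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String ch = "\u4e2d";
        System.out.println(hex(ch, StandardCharsets.UTF_8));       // e4b8ad
        // UTF-16 depends on byte order: the same code unit in both orders.
        System.out.println(hex(ch, StandardCharsets.UTF_16BE));    // 4e2d
        System.out.println(hex(ch, StandardCharsets.UTF_16LE));    // 2d4e
        // UTF-32 pads the code point to four bytes.
        System.out.println(hex(ch, Charset.forName("UTF-32BE")));  // 00004e2d
    }
}
```

The UTF-8 bytes are the same on every platform, while the two UTF-16 lines show why a byte order has to be agreed on (or marked) before 16-bit data is exchanged.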

UTF and Unicode:
Unicode is a character set and can be regarded as an internal code.
UTF is an encoding method; it exists because raw Unicode values are not suitable for direct transmission and processing in some scenarios. UTF-16 encodes Unicode values directly, without transformation, but the result contains 0x00 bytes: each of the first 256 code points has 0x00 as its first byte (in big-endian order). That byte has a special meaning in operating systems (it is the string terminator in C) and can cause problems. Converting Unicode to UTF-8 avoids this problem and brings some other advantages.
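The 0x00 problem can be demonstrated with plain ASCII text. The sketch below checks whether the encoded form of a string contains a zero byte; ZeroByteDemo is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ZeroByteDemo {

    // True if the encoded text contains at least one 0x00 byte.
    static boolean containsZeroByte(String text, Charset charset) {
        for (byte b : text.getBytes(charset)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // In UTF-16BE the letter A is 0x00 0x41, so code that treats 0x00
        // as a terminator would stop before the first character.
        System.out.println(containsZeroByte("A", StandardCharsets.UTF_16BE)); // true
        // In UTF-8 the same letter is the single byte 0x41.
        System.out.println(containsZeroByte("A", StandardCharsets.UTF_8));    // false
    }
}
```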

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
