Application of UTF-8 Character Processing in Web Development

Source: Internet
Author: User
Web applications must serve users in many languages. Users in different countries should be able to enter text in their own language, and a Web application should be able to display pages in the language that matches each user's regional settings. Historically, each language has had its own encodings: Chinese can be represented with GB2312, for example, and Japanese with Shift_JIS. UTF-8, by contrast, can encode the characters of almost every language. Using UTF-8 for both the input and the display of Web application data standardizes the exchange of information between Web applications and simplifies development.

Introduction to UTF-8 encoding

UTF-8 is a widely used encoding that aims to bring the world's languages into a single unified code; it already covers a number of Asian languages. UTF stands for UCS Transformation Format.

UTF-8 represents each character with a variable number of bytes. As originally designed, a sequence could be up to 6 bytes long; the current standard (RFC 3629) limits it to 4 bytes. UTF-8 is compatible with ASCII (0-127): the UTF-8 encoding of an ASCII character is the same single byte as in ASCII. A character that needs more than one byte is encoded as follows: the number of leading 1 bits in the first byte gives the total number of bytes in the sequence, and every following byte starts with 10. For example, the two-byte pattern is 110xxxxx 10xxxxxx.
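The leading-bit rule described above can be checked with the JDK's built-in UTF-8 encoder. The following is a minimal sketch, not part of the original article: the class name Utf8Bits and the helper utf8Bits are hypothetical, and the sample characters (the letter A, the copyright sign U+00A9, and the not-equal sign U+2260) are chosen only to show one-, two-, and three-byte sequences.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Utf8Bits {
    /** Returns the UTF-8 bytes of s as space-separated 8-bit binary groups. */
    static String utf8Bits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros so every byte prints as exactly 8 digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Bits("A"));      // 01000001
        System.out.println(utf8Bits("\u00A9")); // 11000010 10101001
        System.out.println(utf8Bits("\u2260")); // 11100010 10001001 10100000
    }
}
```

In each multi-byte result, the count of leading 1 bits in the first byte matches the number of bytes printed, and each continuation byte begins with 10.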

The three-byte pattern is 1110xxxx 10xxxxxx 10xxxxxx. Similarly, the six-byte pattern of the original design is 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx. The x positions are filled with the bits of the character's code. Only the shortest sequence that can represent a character may be used. For example, the Unicode character 00A9 (copyright sign) = 1010 1001 is encoded in UTF-8 as 11000010 10101001 = 0xC2 0xA9; the character 2260 (not-equal sign) = 0010 0010 0110 0000 is encoded as 11100010 10001001 10100000 = 0xE2 0x89 0xA0.

HTTP communication protocol

HTTP request

In HTTP communication, the request message sent by the client begins with the method, which tells the server what action the client is requesting. In the request headers the client can also send additional information, such as the browser in use and the content types the client can interpret. Server applications can use this information when generating a response. The following is an example of an HTTP request message:

Figure 1. HTTP request message header

    GET /intro.html HTTP/1.0
    User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
    Accept: image/gif, image/jpeg, text/*, */*
    Accept-Language: zh
    Accept-Charset: iso-8859-1

This request uses the GET method to obtain the resource /intro.html. User-Agent identifies the client browser, and Accept lists the media types the client can accept. Accept-Language gives the browser's preferred language, and Accept-Charset gives its preferred character set; the server program can use both to generate a suitable response. The preferred language can be configured in the browser. Taking Internet Explorer as an example:

Figure 2. Browser preferred-language settings

After the browser sends a request, the server can read the client browser's preferred language and country code with the following code.

Figure 3. Server-side servlet reading the browser's preferred country and language

    protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Locale that the container derives from the Accept-Language header
        Locale reqLocale = req.getLocale();
        System.out.println("the country is: " + reqLocale.getCountry());
        System.out.println("the language is: " + reqLocale.getLanguage());
    }

The server output is:

Figure 4. Result of the server-side servlet reading the browser's preferred country and language

    [06-3-10 14:56:32:516 CST] 6ce078f9 SystemOut     O the country is: CN
    [06-3-10 14:56:32:516 CST] 6ce078f9 SystemOut     O the language is: zh
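
The same country and language values can be obtained without a servlet container by parsing a language tag directly. This is a minimal sketch under stated assumptions: it assumes the browser sent the tag zh-CN, and AcceptLanguageDemo and toLocale are hypothetical names. A real container also handles several comma-separated tags and their q-values, which this sketch does not attempt.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class AcceptLanguageDemo {
    /** Converts a single language tag, such as "zh-CN", into a Locale. */
    static Locale toLocale(String languageTag) {
        return Locale.forLanguageTag(languageTag);
    }

    public static void main(String[] args) {
        Locale locale = toLocale("zh-CN");
        System.out.println("the country is: " + locale.getCountry());   // CN
        System.out.println("the language is: " + locale.getLanguage()); // zh
    }
}
```

A tag without a region, such as zh, yields an empty country string, so server code should not assume that getCountry() always returns a value.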

HTTP response

When the server receives a request, it processes the request and sends back a response. The server uses response headers to supply information such as the server software and the content type of the response. The following is an example of a response message header:

Figure 5. HTTP response message header

    Date: Saturday, 23-May-98 03:25:12 GMT
    Server: JavaWebServer/1.1.1
    MIME-Version: 1.0
    Content-Type: text/html; charset=UTF-8
    Content-Length: 1029
    Last-Modified: Thursday, 7-May-98 12:15:35 GMT

Content-Type gives the MIME type of the response message and the character set of the message body. The browser uses that character set to display the content. In the example above the character set is UTF-8, so the browser decodes the returned message body as UTF-8, and input entered on the page is also encoded in UTF-8. The content type that controls the display encoding of a page can be set in the following ways.

Setting the page encoding in HTML

For a static HTML page, the page encoding can be set as follows.

Figure 6. Static HTML file that sets the page encoding
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>

The tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> has the same effect for the browser as setting Content-Type in the response message header to "text/html; charset=UTF-8".

Setting the page display encoding in a servlet

In a servlet, the content type of the response message can be set as follows.

Figure 7. Servlet snippet that sets the page encoding
    protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Sets the Content-Type response header, including the character set
        resp.setContentType("text/html; charset=UTF-8");
    }

The line resp.setContentType("text/html; charset=UTF-8"); sets Content-Type in the response message header to "text/html; charset=UTF-8". The following example shows how to set the page encoding in a JSP page.

Figure 8. JSP page directive that sets the page encoding
    <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>

In this page directive, contentType="text/html; charset=UTF-8" sets Content-Type in the response message to "text/html; charset=UTF-8". pageEncoding only specifies the encoding of the JSP file itself, that is, the encoding in which the JSP page was saved. When the container reads the file, it converts the contents to the Unicode representation it uses internally. When the response is sent back to the browser, the container converts from that internal Unicode representation to the character set specified in Content-Type. If pageEncoding is not specified, the container uses the character set given in contentType to interpret the bytes of the JSP page.
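
The two conversions described above (file bytes to internal Unicode, then internal Unicode to the response character set) can be imitated in plain Java. This is only an illustrative sketch, not how any particular container is implemented: TranscodeDemo and transcode are hypothetical names, and the charsets are assumptions chosen for the example (a page saved as ISO-8859-1 and served as UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class TranscodeDemo {
    /** Decodes bytes using one charset and re-encodes the text in another. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Step 1: bytes -> Java's internal Unicode String (what pageEncoding governs)
        String text = new String(input, from);
        // Step 2: Unicode String -> bytes on the wire (what Content-Type's charset governs)
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] savedAsLatin1 = {(byte) 0xE9}; // the letter e with acute accent in ISO-8859-1
        byte[] sentAsUtf8 = transcode(savedAsLatin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : sentAsUtf8) {
            System.out.printf("%02X ", b & 0xFF); // C3 A9
        }
        System.out.println();
    }
}
```

The single byte in the file becomes two bytes in the response because the same character has a different representation in each character set; the text itself is unchanged.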

To display UTF-8-encoded characters correctly, two conditions must be met:

1. The browser is told which character set the response message uses.
2. The browser is configured with fonts that can display the UTF-8-encoded characters.

Web page input

An HTML form can accept characters from non-Western-European languages. To create a form that receives such characters, the browser must be told which character set to use for user input. This is done with the contentType attribute of the page directive.

When a form is submitted, the browser converts each form field value to the bytes of the specified character set and then encodes those bytes according to the standard HTTP URL-encoding scheme. With ISO-8859-1, any character other than a to z, A to Z, 0 to 9, and a few punctuation marks is converted to its hexadecimal byte value preceded by a percent sign (%). With a multi-byte character set the same is done for every byte: if the form character set is UTF-8, for example, the characters "中文" ("Chinese") are sent as "%E4%B8%AD%E6%96%87". To process the input, the container must know which character set the browser used to encode it. The problem is that most browsers do not send this information, so the application must supply it and tell the container which character set to use when decoding the input.
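
The percent-encoding step, and the reason the decoder must be told the character set, can be reproduced with the JDK's URLEncoder and URLDecoder. This is a minimal sketch assuming Java 10 or later (for the overloads that take a Charset); FormEncodingDemo, encode, and decode are hypothetical names, and the string literal uses Unicode escapes for the two characters 中文.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class FormEncodingDemo {
    /** Encodes a form value the way a browser would for the given charset. */
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    /** Decodes a submitted value, assuming it was encoded with the given charset. */
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String input = "\u4E2D\u6587"; // the two characters 中文
        String sent = encode(input, StandardCharsets.UTF_8);
        System.out.println(sent); // %E4%B8%AD%E6%96%87

        // Decoding with the charset the browser used restores the input.
        System.out.println(decode(sent, StandardCharsets.UTF_8).equals(input)); // true

        // Decoding with a different charset turns each byte into its own
        // character, so two characters come back as six unrelated ones.
        System.out.println(decode(sent, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

Nothing in the encoded string itself says which character set produced it, which is why the receiving side has to be configured explicitly.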

Page input encoding settings

The first part of this article described how to set the display encoding of a page. Setting the page encoding also determines how page input is encoded: if the page is displayed as UTF-8, everything the user enters on the page is encoded as UTF-8.

Encoding settings for page input and output processing

The server program must set the input encoding before it reads form input. Consider the following example, a JSP page that prompts the user for input:

Figure 9. JSP page used for input

    <%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
    <html>

The contentType attribute of the page directive has the value "text/html; charset=UTF-8", which tells the browser that the page is encoded in UTF-8; all user input submitted through the page is then also encoded in UTF-8. The servlet triggered by the form's action is shown in the following example.

Figure 10. Servlet that reads input and writes output as UTF-8
    protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Set the input encoding before reading any form parameter
        req.setCharacterEncoding("UTF-8");
        String test1 = req.getParameter("col2");
        // Set the content type before getWriter(); a charset set afterwards is ignored
        resp.setContentType("text/html; charset=UTF-8");
        PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("the input is " + test1);
        // (the rest of the listing is cut off in the source)
    }

Enter "中文" ("Chinese") on the form page and submit the form. The result is:

Figure 11. Result page displayed correctly

If we comment out the statement resp.setContentType("text/html; charset=UTF-8"), as in the following example:

Figure 12. Servlet that reads input and writes output without declaring UTF-8

    protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Set the input encoding before reading any form parameter
        req.setCharacterEncoding("UTF-8");
        String test1 = req.getParameter("col2");
        // resp.setContentType("text/html; charset=UTF-8");
        PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("the input is " + test1);
        // (the rest of the listing is cut off in the source)
    }

Enter "中文" ("Chinese") and submit the form. The result is:

Figure 13. Page displayed incorrectly
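
One plausible way to see how such a result arises, without a servlet container, is to encode the text with a character set that cannot represent it. This sketch rests on an assumption that is not confirmed by the article: that the container falls back to ISO-8859-1 when the servlet declares no character set, which is the default named by the Servlet specification. ResponseCharsetDemo and roundTrip are hypothetical names, and the actual page in Figure 13 may have looked different.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class ResponseCharsetDemo {
    /**
     * Encodes text as a response writer using writerCharset would, then
     * decodes the bytes as a browser assuming browserCharset would.
     */
    static String roundTrip(String text, Charset writerCharset, Charset browserCharset) {
        byte[] wire = text.getBytes(writerCharset);
        return new String(wire, browserCharset);
    }

    public static void main(String[] args) {
        String input = "\u4E2D\u6587"; // the two characters 中文

        // Charset declared: writer and browser both use UTF-8, text survives.
        System.out.println(roundTrip(input,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(input)); // true

        // Assumed fallback: ISO-8859-1 cannot represent these characters,
        // so each one is replaced by a question mark before it is sent.
        System.out.println(roundTrip(input,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)); // ??
    }
}
```

In the second case the information is lost on the server side, so no browser setting can recover the original characters.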

The page does not display the UTF-8-encoded characters correctly. When the triggered servlet calls resp.setContentType("text/html; charset=UTF-8"), it tells the browser that the output is encoded in UTF-8, and the browser displays the output using the correct character set. If the servlet does not explicitly call resp.setContentType("text/html; charset=UTF-8") to set the output character set, the browser cannot decode and display the output correctly.

Conclusion

This article has presented some methods for displaying and accepting UTF-8 characters in Web application development, as a reference for readers in their own development work.

Yin Jian, a software engineer at IBM CSDL, is currently engaged in the development of enterprise e-commerce applications.

 
