What is UTF-8?

Source: Internet
Author: User

Introduction to UTF-8 Encoding

UTF-8 is a widely used encoding of Unicode, a standard that aims to bring the world's languages into one unified character set and already includes several Asian languages. UTF stands for UCS Transformation Format.

UTF-8 represents each character with a variable number of bytes. In theory a character can take up to 6 bytes, although RFC 3629 now restricts UTF-8 to at most 4 bytes. UTF-8 is compatible with ASCII (0-127): the UTF-8 encoding of an ASCII character is identical to its ASCII encoding. The following encoding rules apply only to characters that need more than one byte:

The number of leading 1 bits in the first byte gives the number of bytes in the character's encoding. For example, a two-byte character has the form 110xxxxx 10xxxxxx, and a three-byte character has the form 1110xxxx 10xxxxxx 10xxxxxx. Similarly, a six-byte character has the form 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx. The x positions are filled with the bits of the character's code in binary. Only the shortest multi-byte sequence that can represent a character may be used. For example:

Unicode character 00A9 (copyright sign) = 1010 1001 is encoded in UTF-8 as 11000010 10101001 = 0xC2 0xA9; character 2260 (not-equal sign) = 0010 0010 0110 0000 is encoded in UTF-8 as 11100010 10001001 10100000 = 0xE2 0x89 0xA0.
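The bit patterns above can be sketched in Java as follows. This is only an illustration of the rules, not how the JDK implements them; the class name Utf8Sketch and the methods encode and hex are hypothetical, and the sketch assumes the modern 1- to 4-byte range (U+0000 to U+10FFFF) rather than the historical 6-byte forms.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point by hand using the prefix patterns
    // 0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx plus 10xxxxxx continuation bytes.
    // Assumption: surrogate code points are not rejected in this sketch.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Formats bytes as space-separated upper-case hex, e.g. "C2 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x00A9))); // C2 A9
        System.out.println(hex(encode(0x2260))); // E2 89 A0
        // Cross-check against the JDK's own UTF-8 encoder.
        byte[] jdk = new String(Character.toChars(0x2260)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(jdk, encode(0x2260))); // true
    }
}
```

In real code String.getBytes(StandardCharsets.UTF_8) does this work; the hand-written version is only useful for seeing where each bit ends up.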

HTTP Communication Protocol

HTTP Request

In HTTP communication, the client first specifies a method in the request message it sends. The method tells the server what action the client is requesting. In the request headers the client can also send additional information, such as the browser it is using and the content types it can interpret. Server applications can use this information to generate the response. The following is an example of an HTTP request message:

GET /intro.html HTTP/1.0
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Accept: image/gif, image/jpeg, text/*, */*
Accept-Language: zh
Accept-Charset: iso-8859-1
This request uses the GET method. User-Agent describes the client browser, and Accept lists the media types the client can accept. Accept-Language indicates the preferred language of the client browser, and Accept-Charset gives the browser's preferred character set. The server program can generate a suitable response based on these client preferences. The preferred language can be configured in the browser settings.

After the browser sends a request, the following code reads the preferred language and country code of the client browser:
protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    // getLocale() returns the client's preferred locale, based on the Accept-Language header
    Locale reqLocal = req.getLocale();
    System.out.println("the country is: " + reqLocal.getCountry());
    System.out.println("the language is: " + reqLocal.getLanguage());
}

The server output result is:

[06-3-10 14:56:32:516 CST] 6ce078f9 SystemOut O the country is: CN
[06-3-10 14:56:32:516 CST] 6ce078f9 SystemOut O the language is: zh
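Outside a servlet container, roughly the same result can be obtained by parsing an Accept-Language value directly. The sketch below uses the JDK's Locale.LanguageRange.parse; the class AcceptLanguageSketch, the method preferredLocale, and the sample header value are assumptions made for illustration, and a real container may apply additional rules.

```java
import java.util.List;
import java.util.Locale;

class AcceptLanguageSketch {
    // Returns the most preferred locale named in an Accept-Language value,
    // or the fallback when the header is missing, empty, or only "*".
    // Note: LanguageRange.parse throws IllegalArgumentException on malformed input.
    static Locale preferredLocale(String header, Locale fallback) {
        if (header == null || header.trim().isEmpty()) {
            return fallback;
        }
        // parse() returns the ranges ordered from highest to lowest weight (q value).
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
        for (Locale.LanguageRange range : ranges) {
            if (!"*".equals(range.getRange())) {
                return Locale.forLanguageTag(range.getRange());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Hypothetical header value; the request shown earlier sends only "zh".
        Locale locale = preferredLocale("zh-CN,zh;q=0.8,en;q=0.5", Locale.ENGLISH);
        System.out.println("the country is: " + locale.getCountry());
        System.out.println("the language is: " + locale.getLanguage());
    }
}
```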

HTTP Response

When the server receives the request, it processes it and responds accordingly. The server uses response headers to specify information such as the server software and the content type of the response. The following is an example of response message headers:

Date: Saturday, 23-May-98 03:25:12 GMT
Server: JavaWebServer/1.1.1
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Length: 1029
Last-Modified: Thursday, 7-May-98 12:15:35 GMT
Content-Type indicates the MIME type of the response message and the character set of the response body. The browser uses that character set to display the message content. In the preceding example the character set is UTF-8, so the browser decodes the returned message body as UTF-8. Input entered on the page is also encoded in UTF-8.
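As a rough illustration of what a client does with this header, the following sketch extracts the charset parameter from a Content-Type value and uses it to decode body bytes. The class ContentTypeSketch and the method charsetOf are hypothetical names, and the parsing is deliberately simplified compared with what a real browser does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeSketch {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback when none is present.
    // Note: Charset.forName throws if the name is illegal or unsupported.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the MIME type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String name = param.substring(eq + 1).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] body = { (byte) 0xC2, (byte) 0xA9 }; // UTF-8 bytes of U+00A9
        Charset cs = charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        String text = new String(body, cs);
        System.out.println(cs.name());              // UTF-8
        System.out.println(text.equals("\u00A9"));  // true
    }
}
```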

Web page display encoding

You can set the content type in the following ways.

Set the page encoding method in HTML

When a static HTML page is accessed, you can set the page encoding as follows.

Static HTML file that sets the page encoding
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>example.html</title>

Note that <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> makes the browser treat the page as if Content-Type in the response message header were set to "text/html; charset=UTF-8".

Set the page display encoding method in Servlet

In a servlet, we can set the content type of the response message in the following way.

Figure 7. Servlet snippet that sets the page encoding
protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    resp.setContentType("text/html; charset=UTF-8");
}

The line resp.setContentType("text/html; charset=UTF-8"); sets Content-Type in the response message header to "text/html; charset=UTF-8".

Set the page display encoding method in JSP

The following example shows how to set the page encoding format in JSP.
JSP page directive that sets the page encoding
<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>

In this page directive, contentType="text/html; charset=UTF-8" sets Content-Type in the response message to "text/html; charset=UTF-8".

The pageEncoding attribute only specifies the encoding of the JSP page itself, that is, the encoding used when the JSP file was saved. When the container reads the file, it converts the content to the Unicode representation it uses internally.
When the response is sent back to the browser, the container converts the internal Unicode to the character set specified in Content-Type.

If pageEncoding is not specified, the container uses the character set specified in contentType to interpret the bytes of the JSP page.
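The two conversions described above (file bytes to internal Unicode, then internal Unicode to the response character set) can be sketched with plain JDK calls. This is not container code; the class TranscodeSketch and the method transcode are hypothetical, and UTF-16BE is chosen as the output character set only to make the difference in bytes visible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    // Step 1: decode the stored bytes using the page encoding.
    // Step 2: encode the resulting String using the response character set.
    static byte[] transcode(byte[] stored, Charset pageEncoding, Charset responseCharset) {
        String internal = new String(stored, pageEncoding); // Java Strings are Unicode internally
        return internal.getBytes(responseCharset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word meaning "Chinese".
        byte[] saved = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] sent = transcode(saved, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(saved.length); // 6 bytes as saved in UTF-8
        System.out.println(sent.length);  // 4 bytes when sent as UTF-16BE
    }
}
```

If the first step uses the wrong page encoding, the internal text is already damaged and no choice of response character set can repair it.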

To display UTF-8 encoded characters correctly, the following two conditions must be met:

1. The browser is told which character set the response message uses.

2. The browser is configured with fonts that can display the UTF-8 encoded characters.

Web page input Encoding

HTML forms can accept non-Western European characters. To create a form that receives characters from non-Western European languages, you must tell the browser which character set to use for user input. You can do this through the contentType attribute of the page directive.

When the form is submitted, the browser converts the form field values to the bytes of the specified character set and then encodes those bytes with the standard HTTP URL encoding scheme. With ISO-8859-1, every character other than a to z, A to Z, and 0 to 9 is converted to its hexadecimal byte value preceded by a percent sign (%). For example, if the form character set is UTF-8, the characters "中文" ("Chinese") are transmitted as "%E4%B8%AD%E6%96%87". To process the input, the container must know which character set the browser used. The problem is that most browsers do not send this information, so you must supply it yourself and tell the container which character set to use when decoding the input.
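The encoding step, and the effect of decoding with the wrong character set, can be reproduced with the JDK's URLEncoder and URLDecoder. This is a minimal sketch of the idea rather than browser or container code; the class FormEncodingSketch and its methods are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingSketch {
    // Percent-encodes a form field value using the bytes of the given character set.
    static String encodeField(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset, e);
        }
    }

    // Decodes a percent-encoded value, interpreting the bytes in the given character set.
    static String decodeField(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String input = "\u4e2d\u6587"; // the two-character word meaning "Chinese"
        String wire = encodeField(input, "UTF-8");
        System.out.println(wire); // %E4%B8%AD%E6%96%87
        // Decoding with the same character set recovers the input.
        System.out.println(decodeField(wire, "UTF-8").equals(input));      // true
        // Decoding with a different character set yields six unrelated characters.
        System.out.println(decodeField(wire, "ISO-8859-1").equals(input)); // false
    }
}
```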

Setting the page input encoding

The section on display encoding above explains how to set the display encoding of a page. Setting the page encoding also determines the input encoding: if the page is displayed as UTF-8, everything the user enters on the page is encoded as UTF-8.

Encoding settings for the input and output processes on the page

The server program must set the input encoding before reading the form input. Let's take a look at the example.

The following is a JSP page that prompts the user for input:

JSP page used for interface input
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
<title>insertdb.jsp</title>
<form method="post" action="./insertdbprocds">
<td>name:</td>
<td><input type="text" name="col2" value=""></td>
<td><input type="submit" value="Submit"></td>
</form>

The contentType attribute of the page directive has the value "text/html; charset=UTF-8", which tells the browser that the page is encoded in UTF-8 and that all user input entered through the page will also be encoded in UTF-8. The servlet triggered by the form action is shown in the following example.

Servlet that reads input and writes output as UTF-8
protected void insertProc(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    // Set the input encoding before reading any request parameter
    req.setCharacterEncoding("UTF-8");
    String test1 = req.getParameter("col2");
    // Set the content type before calling getWriter(); afterwards the charset is ignored
    resp.setContentType("text/html; charset=UTF-8");
    PrintWriter out = resp.getWriter();
    out.println("<HTML>");
    out.println("the input is " + test1);
    out.println("</HTML>");
}

In the triggered servlet, calling resp.setContentType("text/html; charset=UTF-8") tells the browser that the output is encoded in UTF-8, so the browser displays the output with the correct character set. If the servlet does not explicitly call resp.setContentType("text/html; charset=UTF-8") to set the output character set, the browser cannot correctly decode and display the output.
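The effect of the output character set can be simulated without a container. The sketch below writes the same text through a writer bound to UTF-8 and through one bound to ISO-8859-1, which the Servlet specification names as the default response encoding when none is set. The class ResponseCharsetSketch and the method write are hypothetical stand-ins for the response writer, not servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseCharsetSketch {
    // Writes text through a writer bound to the given charset and
    // returns the bytes that would travel to the browser.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(body, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two-character word meaning "Chinese"

        // Writer and browser agree on UTF-8: the text survives the round trip.
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true

        // Writer bound to ISO-8859-1: the characters cannot be represented
        // and are replaced with question marks before they leave the server.
        byte[] latin1 = write(text, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ??
    }
}
```

The second case shows that the loss happens on the server side, so no browser setting can recover the original characters.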

This article presented some methods for displaying and inputting UTF-8 encoded characters in web application development, for readers to refer to in their own development work.
