"In-depth analysis of Java Web Technology Insider" reading notes: Chinese encoding

Source: Internet
Author: User

Why encoding is needed

The smallest unit of storage in a computer is one byte (8 bits), so a single byte can distinguish only 256 values (0-255). Human languages have far more symbols than that, so one byte cannot represent them all. Representing them requires a wider data type, char, and converting between char and byte requires an encoding.

Encoding formats

ASCII: 128 characters in total, represented by the low 7 bits of a byte. Codes 0-31 and 127 are control characters; 32-126 are printable characters.

ISO-8859-1: An extension of ASCII that covers most Western European languages. It is a single-byte encoding with 256 characters in total.

GB2312: A double-byte encoding containing 6,763 Chinese characters.

GBK: An extension of GB2312 that remains compatible with it and can represent 21,003 Chinese characters.

GB18030: A character may take one, two, or four bytes. It is not widely used.

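UTF-16: Defines how Unicode characters are stored and accessed in a computer. Characters in the Basic Multilingual Plane, which includes the commonly used Chinese characters, take one two-byte code unit; supplementary characters take a surrogate pair of four bytes, so the encoding is not strictly fixed-length. Java uses UTF-16 as the in-memory representation of char and String.

UTF-8: A variable-length encoding in which a character takes one to four bytes depending on its code point range. ASCII characters take one byte, and Chinese characters generally take three.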
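
The byte counts described above can be checked directly. The following is a minimal sketch, not taken from the book: the class and method names are made up for illustration, and the Chinese character is written as a Unicode escape so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    /** Number of bytes the text occupies once encoded with the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String chinese = "\u4e2d";        // the Chinese character U+4E2D
        String emoji = "\uD83D\uDE00";    // U+1F600, outside the BMP: two Java chars

        // ISO-8859-1 cannot represent the last two strings; getBytes() substitutes
        // a replacement byte, so its counts there do not describe a real encoding.
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": A=" + encodedLength(ascii, cs)
                    + " U+4E2D=" + encodedLength(chinese, cs)
                    + " U+1F600=" + encodedLength(emoji, cs));
        }

        // GBK is an extended charset and may be missing from a trimmed-down runtime.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK: U+4E2D=" + encodedLength(chinese, Charset.forName("GBK")));
        }
    }
}
```

UTF_16BE is used rather than Java's plain UTF_16 charset because the latter prepends a two-byte byte-order mark when encoding, which would inflate every count by two.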

Encoding operations in Java

I/O operations:

The InputStreamReader class is the bridge from bytes to characters: it handles the byte-to-character conversion when reading during I/O and delegates the actual decoding to StreamDecoder. The decoder needs a Charset, which the caller should specify. If none is given, the platform default charset is used (GBK on a Chinese-locale Windows system, for example; JDK 18 and later default to UTF-8).

OutputStreamWriter performs the reverse conversion, from characters to bytes. Its encoding format and default rules mirror those used for decoding.

Avoid relying on the operating system's default encoding. Doing so ties the application's encoding to its runtime environment and easily produces garbled text when the program moves between environments.

    import java.io.*;
    import org.junit.Test;

    @Test
    public void testCopyFile() {
        // Name the charset on both sides instead of relying on the platform default.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new FileInputStream("D:\\test.txt"), "UTF-8"));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("D:\\test_.txt"), "UTF-8"))) {
            String str = null;
            int i = 0;
            while ((str = reader.readLine()) != null) {
                // readLine() strips the line terminator, so write one between lines.
                if (i != 0) writer.write("\r\n");
                i++;
                writer.write(str);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
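
To see why the charset passed to InputStreamReader matters, here is a small sketch that decodes the same bytes twice, once with the charset they were written in and once with a different one. It works on an in-memory byte array rather than a file, and the class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderCharsetDemo {
    /** Decodes the bytes through an InputStreamReader that uses the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";   // two Chinese characters
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(right));   // true: same charset on both sides
        System.out.println(original.equals(wrong));   // false: mismatched charset
        System.out.println(wrong.length());           // 6: ISO-8859-1 yields one char per byte
    }
}
```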
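
In-memory operations:

    // String <-> byte[]: name the charset on both the encode and the decode side.
    String str = "Chinese string"; // stands for a literal containing Chinese characters
    byte[] bytes = str.getBytes("UTF-8");
    String decoded = new String(bytes, "UTF-8");

    // String -> ByteBuffer through java.nio.charset.Charset
    String string = "Chinese string";
    Charset charset = Charset.forName("UTF-8");
    ByteBuffer byteBuffer = charset.encode(string);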

Additional notes:

    • String → ByteBuffer: Charset.encode()
    • ByteBuffer → String: Charset.decode().toString()
    • CharBuffer → String: toString()
    • ByteBuffer → byte[]: array()
    • byte[] → ByteBuffer: ByteBuffer.wrap()
    • CharBuffer → char[]: array()
    • char[] → CharBuffer: CharBuffer.wrap()
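
The conversions in this list can be chained into a round trip. The sketch below is one way to do it; the helper names are made up for illustration. It copies bytes out with remaining() and get() instead of calling array() on the encoded buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BufferConversionDemo {
    /** String -> ByteBuffer -> byte[], copying only the bytes that were written. */
    static byte[] toBytes(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    /** byte[] -> ByteBuffer -> CharBuffer -> String. */
    static String toText(byte[] bytes, Charset charset) {
        CharBuffer chars = charset.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] bytes = toBytes("\u4e2d\u6587", utf8);
        System.out.println(bytes.length);
        System.out.println(toText(bytes, utf8).equals("\u4e2d\u6587"));

        // char[] <-> CharBuffer
        char[] raw = {'a', 'b', 'c'};
        CharBuffer wrapped = CharBuffer.wrap(raw);
        System.out.println(wrapped.toString());
        System.out.println(wrapped.array() == raw);   // wrap() shares the array, it does not copy
    }
}
```

The reason for the explicit copy: array() returns the buffer's whole backing array, which can be longer than the encoded content when the buffer was allocated with spare capacity, so the result should be trimmed to the buffer's position and limit before use.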

Comparison of several encoding formats

UTF-16 encodes efficiently: the conversion between characters and bytes is simple and fast, which makes it suitable for use between local disk and memory. It is less suitable for network transmission. The code units are laid out sequentially with no way to verify an individual character's encoding, so damage in the middle of a code value affects every code value after it. UTF-8 is better suited to network transmission in this respect.

UTF-8 differs from GBK and GB2312 in that its bytes are computed directly from the code point rather than looked up in a code table, so encoding is more efficient, which makes UTF-8 a good choice for storing Chinese characters. Corruption of a single character in UTF-8 does not affect the characters that follow, and its encoding efficiency lies between that of GBK and UTF-16, which makes it an ideal encoding for Chinese text.
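
The difference in error recovery can be illustrated with a small experiment. This sketch models one specific kind of damage, a single byte lost at the start of the stream, and compares what each decoder recovers; it is an illustration under that assumption, not a general proof, and the names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LostByteDemo {
    /** Encodes the text, drops the first byte to simulate damage, and decodes again. */
    static String decodeAfterLosingFirstByte(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        byte[] damaged = Arrays.copyOfRange(encoded, 1, encoded.length);
        return new String(damaged, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\u7f16\u7801";   // four Chinese characters
        String tail = text.substring(1);            // everything after the damaged one

        String utf8 = decodeAfterLosingFirstByte(text, StandardCharsets.UTF_8);
        String utf16 = decodeAfterLosingFirstByte(text, StandardCharsets.UTF_16BE);

        // UTF-8 lead bytes are distinguishable from continuation bytes, so the
        // decoder resynchronizes at the next character boundary.
        System.out.println(utf8.endsWith(tail));    // true
        // UTF-16 has no such marker: every later two-byte unit is misaligned.
        System.out.println(utf16.endsWith(tail));   // false
    }
}
```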

Encoding and decoding in Java Web applications

1. Data travels over the network as bytes, so all data must be serializable into bytes. In Java, a class whose objects are to be serialized must implement the Serializable interface.

2. If the integer 1234567 is stored as characters, it takes 7 bytes in UTF-8 and 14 bytes in UTF-16, but only 4 bytes as an int. Looking at the length of the characters alone is therefore meaningless: even the same characters occupy different amounts of storage under different encodings, so the encoding must be taken into account when going from characters to bytes.
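
The sizes quoted for 1234567 can be reproduced with a few lines. This is a minimal sketch with an illustrative helper name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NumberSizeDemo {
    /** Bytes needed to store the number's decimal digits in the given charset. */
    static int sizeAsText(int number, Charset charset) {
        return Integer.toString(number).getBytes(charset).length;
    }

    public static void main(String[] args) {
        int number = 1234567;

        System.out.println("UTF-8:  " + sizeAsText(number, StandardCharsets.UTF_8));     // 7
        // UTF_16BE writes no byte-order mark, which gives the 14 bytes quoted above.
        System.out.println("UTF-16: " + sizeAsText(number, StandardCharsets.UTF_16BE));  // 14
        System.out.println("int:    " + Integer.BYTES);                                  // 4
    }
}
```

Note that Java's plain UTF_16 charset prepends a two-byte byte-order mark when encoding, so the same string measures 16 bytes with it.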

Encoding and decoding of URLs

The character set used to decode the URI portion of a URL is defined by URIEncoding on the Tomcat connector: <Connector URIEncoding="UTF-8"/>. If it is not defined, the URI is parsed with the default encoding, ISO-8859-1 (this is the default in older Tomcat versions; Tomcat 8 and later default to UTF-8). When URLs contain Chinese characters, it is best to set URIEncoding to UTF-8.

For the query string: its decoding character set is either the charset defined in the ContentType header or ISO-8859-1. To use the charset defined in ContentType, set useBodyEncodingForURI: <Connector URIEncoding="UTF-8" useBodyEncodingForURI="true"/>. Despite its name, this option does not apply the body encoding to the entire URI; it applies only to decoding the query string.
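
The effect of decoding a URL with the wrong charset can be reproduced with the JDK's own java.net.URLEncoder and URLDecoder, without a server. The sketch below assumes Java 10 or later (for the Charset overloads); the class and method names are illustrative, and the repair step is a commonly used workaround rather than something the book prescribes.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UrlCharsetDemo {
    /** Percent-encodes the value the way a UTF-8 client would. */
    static String encodeUtf8(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Decodes the way a server configured for ISO-8859-1 would. */
    static String decodeAsLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    /** Undoes a wrong ISO-8859-1 decode: recover the raw bytes, then decode as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";
        String encoded = encodeUtf8(original);
        System.out.println(encoded);                          // %E4%B8%AD%E6%96%87

        String garbled = decodeAsLatin1(encoded);
        System.out.println(garbled.equals(original));         // false
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair works only because ISO-8859-1 maps every byte to exactly one character, so the wrong decode loses no information. Setting the correct decoding charset on the server is still the cleaner fix.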

Encoding and decoding of HTTP headers

Besides the URL, a client-initiated HTTP request may pass other parameters, such as cookies, in its headers. Header entries are decoded with ISO-8859-1 by default, and no other decoding format can be configured for them, so non-ASCII characters in a header will come out garbled. If such characters must be passed, encode them first with Tomcat's URLEncoder and then add the result to the header.
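
As a rough illustration of that workaround, the sketch below percent-encodes a value so that only ASCII reaches the header and decodes it again on the receiving side. It substitutes the JDK's java.net.URLEncoder for the Tomcat class mentioned above and assumes Java 10 or later; the cookie name `user` and the helper names are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class HeaderValueDemo {
    /** True when every character is in the 7-bit ASCII range. */
    static boolean isAscii(String value) {
        return value.chars().allMatch(c -> c < 128);
    }

    /** Percent-encodes a value so it can travel in a header as plain ASCII. */
    static String toHeaderValue(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    /** Reverses toHeaderValue() on the receiving side. */
    static String fromHeaderValue(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u5f20\u4e09";                       // a non-ASCII value
        String cookie = "user=" + toHeaderValue(name);      // hypothetical cookie name

        System.out.println(isAscii(name));                  // false
        System.out.println(isAscii(cookie));                // true
        String received = cookie.substring("user=".length());
        System.out.println(fromHeaderValue(received).equals(name));   // true
    }
}
```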

Encoding and decoding of POST forms

The parameters of a POST form are passed to the server in the HTTP body. On submit, the browser first encodes the form parameters with the charset given in ContentType, and the server decodes them with the same charset. On the server side this charset can be set with request.setCharacterEncoding(charset).

Note that the parameters of a POST form are decoded on the first call to getParameter, so request.setCharacterEncoding(charset) must be called before the first call to request.getParameter.

About uploaded files: the charset defined in ContentType is used here as well, but the uploaded file itself is transferred as a byte stream to a temporary directory on the server. That step involves no character encoding; the charset is applied only when the content describing the file is added to the parameters.
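
Why the order of the two calls matters can be shown with a toy model. The class below is a hypothetical stand-in for a servlet request, not Tomcat's implementation: it parses the body once, on the first getParameter call, with whatever charset is set at that moment. Unlike the real servlet method, its setCharacterEncoding takes a Charset rather than a String, and it assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class LazyFormRequest {
    private final String body;
    private Charset charset = StandardCharsets.ISO_8859_1;   // assumed default when nothing is set
    private Map<String, String> params;                      // filled on first access

    LazyFormRequest(String body) {
        this.body = body;
    }

    void setCharacterEncoding(Charset charset) {
        this.charset = charset;
    }

    String getParameter(String name) {
        if (params == null) {   // decoding happens once, on the first call
            params = new HashMap<>();
            for (String pair : body.split("&")) {
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(URLDecoder.decode(key, charset), URLDecoder.decode(value, charset));
            }
        }
        return params.get(name);
    }

    public static void main(String[] args) {
        String body = "name=%E4%B8%AD%E6%96%87";   // two Chinese characters, UTF-8 encoded

        LazyFormRequest early = new LazyFormRequest(body);
        early.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(early.getParameter("name").equals("\u4e2d\u6587"));   // true

        LazyFormRequest late = new LazyFormRequest(body);
        late.getParameter("name");                          // triggers decoding too early
        late.setCharacterEncoding(StandardCharsets.UTF_8);  // has no effect any more
        System.out.println(late.getParameter("name").equals("\u4e2d\u6587"));    // false
    }
}
```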

Encoding and decoding of the HTTP body

The charset for the response body is set with response.setCharacterEncoding and returned to the client in the Content-Type header. If the Content-Type header carries no charset, the browser decodes according to the charset in <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. If that is missing too, the browser falls back to its default encoding.
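
The fallback order described here (header, then meta tag, then default) can be written down as a small function. This is a deliberately simplified sketch: real browsers parse the document rather than searching for a `charset=` substring, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResponseCharsetDemo {
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the first supported charset named in the text, or null when there is none. */
    static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    /** Header first, then the page's meta tag, then the browser default. */
    static Charset resolve(String contentTypeHeader, String html, Charset browserDefault) {
        Charset fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        Charset fromMeta = find(html);
        return fromMeta != null ? fromMeta : browserDefault;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        Charset fallback = StandardCharsets.ISO_8859_1;   // stand-in for a browser default

        System.out.println(resolve("text/html; charset=US-ASCII", html, fallback));
        System.out.println(resolve("text/html", html, fallback));
        System.out.println(resolve(null, "<p>no meta tag</p>", fallback));
    }
}
```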

Additional note: when accessing data through JDBC, the connection encoding must match the database's built-in encoding. It can be specified in the JDBC URL: jdbc:mysql://localhost:3306/db?useUnicode=true&characterEncoding=utf-8
