Scenarios in Java where encoding is needed


Encoding involved in I/O operations

We know that encoding is usually involved where characters are converted to bytes or bytes to characters, and the scenario that requires this conversion arises mainly in I/O, which includes disk I/O and network I/O. The network I/O part will be covered later, mainly using Web applications as the example. Below are the interfaces that handle I/O in Java.

The Reader class is the parent class for reading characters in Java I/O, and the InputStream class is the parent class for reading bytes. The InputStreamReader class is the bridge that associates bytes with characters: it is responsible for the byte-to-character conversion during reading, and the actual decoding is delegated to StreamDecoder. During decoding, StreamDecoder requires the user to specify the Charset encoding format. Note that if you do not specify a Charset, the default character set of the local environment is used, for example GBK in a Chinese-locale environment.
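You can check which character set the current JVM falls back to when none is specified; a minimal sketch using the standard API (the output depends on your environment):

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The charset used when no Charset is specified explicitly
        System.out.println(Charset.defaultCharset());
        // The system property backing the default in older JDKs
        System.out.println(System.getProperty("file.encoding"));
    }
}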

The writing case is similar: the parent class for characters is Writer, the parent class for bytes is OutputStream, and OutputStreamWriter converts characters to bytes.

Likewise, the StreamEncoder class is responsible for encoding characters into bytes, and its encoding format and default-encoding rules are consistent with those for decoding.

The following code implements reading and writing a file:

Listing 1. Example of encoding involved in I/O

String file = "C:/stream.txt";
String charset = "UTF-8";
// Write: convert characters into a byte stream
FileOutputStream outputStream = new FileOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(outputStream, charset);
try {
    writer.write("This is the Chinese character to be saved");
} finally {
    writer.close();
}
// Read: convert bytes into characters
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader reader = new InputStreamReader(inputStream, charset);
StringBuffer buffer = new StringBuffer();
char[] buf = new char[64];
int count = 0;
try {
    while ((count = reader.read(buf)) != -1) {
        buffer.append(buf, 0, count);
    }
} finally {
    reader.close();
}

When our applications involve I/O operations, as long as we pay attention to specifying a single, consistent Charset for both encoding and decoding, garbling problems generally do not appear. If an application does not specify the character encoding, it takes the operating system's default encoding; when both encoding and decoding happen in the same Chinese-locale environment this usually works, but it is strongly discouraged, because it ties your application's encoding format to the runtime environment and is very likely to produce garbled text in a cross-environment situation.

Encoding in in-memory operations

In Java development, besides I/O, the most common place encoding is involved is the conversion between characters and bytes in memory. In Java, String is used to represent strings, so the String class provides methods to convert to bytes, as well as constructors that build a String back from bytes. For example:

String s = "This is a Chinese string";
byte[] b = s.getBytes("UTF-8");
String n = new String(b, "UTF-8");  // bytes back to a String via the constructor

There are also the now-deprecated ByteToCharConverter and CharToByteConverter classes, which provide convertAll methods for byte[] and char[] respectively, as the following code shows:

ByteToCharConverter charConverter = ByteToCharConverter.getConverter("UTF-8");
char[] c = charConverter.convertAll(byteArray);
CharToByteConverter byteConverter = CharToByteConverter.getConverter("UTF-8");
byte[] bytes = byteConverter.convertAll(c);

These two classes have been replaced by the Charset class. Charset provides encode and decode, corresponding to char[]-to-byte[] encoding and byte[]-to-char[] decoding respectively, as the following code shows:

Charset charset = Charset.forName("UTF-8");
ByteBuffer byteBuffer = charset.encode(string);
CharBuffer charBuffer = charset.decode(byteBuffer);

The codec character set is set via forName, and both encoding and decoding are done in a single class, which makes it easier to keep the encoding format consistent and is more convenient than the ByteToCharConverter and CharToByteConverter classes.

Java also has a ByteBuffer class that provides a soft conversion between char and byte. The conversion between them needs no encoding or decoding: a 16-bit char is simply split into two 8-bit bytes. The actual value is not modified; only the type of the data is converted. For example:

ByteBuffer heapByteBuffer = ByteBuffer.allocate(1024);
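A minimal sketch of this type-only conversion; the byte values shown assume ByteBuffer's default big-endian order:

import java.nio.ByteBuffer;

public class CharSplitDemo {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(2);
        buffer.putChar('山');  // '山' is the 16-bit value 0x5c71
        // The char is split into two bytes; no encoding or decoding happens
        System.out.printf("%02x %02x%n", buffer.get(0), buffer.get(1));  // prints: 5c 71
    }
}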

All of the above provide conversion between characters and bytes, and as long as we set the codec formats consistently, problems generally do not appear.

How to encode and decode in Java

Several common encoding formats were introduced above; here a practical example shows how encoding and decoding are implemented in Java. We take the string "I am Junshan" as an example and show how to encode it in Java in the ISO-8859-1, GB2312, GBK, UTF-16, and UTF-8 encoding formats.

Listing 2. String encoding

public static void encode() {
    String name = "I am Junshan";
    toHex(name.toCharArray());
    try {
        byte[] iso8859 = name.getBytes("ISO-8859-1");
        toHex(iso8859);
        byte[] gb2312 = name.getBytes("GB2312");
        toHex(gb2312);
        byte[] gbk = name.getBytes("GBK");
        toHex(gbk);
        byte[] utf16 = name.getBytes("UTF-16");
        toHex(utf16);
        byte[] utf8 = name.getBytes("UTF-8");
        toHex(utf8);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}
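The toHex helper is not shown in the listing; it simply prints its input in hexadecimal. A minimal sketch of what it might look like:

private static void toHex(char[] chars) {
    for (char c : chars) {
        System.out.printf("%x ", (int) c);  // print each char's 16-bit code value
    }
    System.out.println();
}

private static void toHex(byte[] bytes) {
    for (byte b : bytes) {
        System.out.printf("%02x ", b);  // print each byte as two hex digits
    }
    System.out.println();
}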

We encode the name string into a byte array in each of the encoding formats above and then print it in hexadecimal. Let us look at how Java performs the encoding.

The following is the class diagram of the classes involved in encoding in Java:

Figure 1. Java encoding class diagram

Charset.forName(charsetName) obtains the Charset class for the specified charset name, a CharsetEncoder object is then created from that Charset, and CharsetEncoder.encode encodes the string. Each encoding type corresponds to its own class, and the actual encoding process is done in those classes. The following is the sequence diagram of the String.getBytes(charsetName) encoding process:

Figure 2. Java encoding sequence diagram

As can be seen, the Charset class is found by charsetName, and a CharsetEncoder is generated from that character set. CharsetEncoder is the parent class of all character encoders; for each character set, its subclass defines how encoding is implemented. Once you have the CharsetEncoder object, you can call the encode method to perform the encoding. This is what the String.getBytes method does, and other paths such as StreamEncoder work in a similar way. Next, let us see how the different character sets encode the preceding string into byte arrays.
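A sketch of the same path invoked directly through the java.nio.charset API, roughly equivalent in effect to String.getBytes("UTF-8") for well-formed input:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncoderDemo {
    public static void main(String[] args) throws CharacterCodingException {
        Charset charset = Charset.forName("UTF-8");     // look up the Charset by name
        CharsetEncoder encoder = charset.newEncoder();  // implemented by a charset-specific subclass
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("I am Junshan"));
        while (bytes.hasRemaining()) {
            System.out.printf("%02x ", bytes.get());
        }
    }
}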

The char array of the string "I am Junshan" is 49 20 61 6d 20 541b 5c71 ("Junshan" stands for the two Chinese characters 君 U+541B and 山 U+5C71). Below, it is converted into the corresponding bytes in each encoding format.

Encoding with ISO-8859-1

Encoding the string "I am Junshan" with ISO-8859-1 gives the following result:

49 20 61 6d 20 3f 3f

The 7 chars are encoded by ISO-8859-1 into 7 bytes. ISO-8859-1 is a single-byte encoding, and each Chinese character of "Junshan" is converted to the byte value 3f, which is the "?" character. So when Chinese text turns into "?", the cause is most likely incorrect use of the ISO-8859-1 encoding. Encoding Chinese characters with ISO-8859-1 loses information; it is often called a "black hole" because it absorbs characters it does not know. Since the default character set encoding of most basic Java frameworks and systems is ISO-8859-1, garbling problems arise easily; later we will analyze how the different garbled forms appear.
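The loss is easy to reproduce; in this sketch the round trip cannot recover the Chinese characters:

import java.nio.charset.StandardCharsets;

public class BlackHoleDemo {
    public static void main(String[] args) {
        String name = "I am 君山";
        // Unmappable characters are replaced with 0x3f ('?') during encoding
        byte[] bytes = name.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));  // prints: I am ??
    }
}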

Encoding with GB2312

Encoding the string "I am Junshan" with GB2312 gives a result in which the five ASCII characters keep their single-byte values (49 20 61 6d 20) and each Chinese character becomes the two bytes looked up from the code table.

The Charset corresponding to GB2312 is sun.nio.cs.ext.EUC_CN, and the corresponding encoder class is sun.nio.cs.ext.DoubleByte. The GB2312 character set has a char-to-byte code table, and encoding consists of looking up this table to find the corresponding bytes for each character and assembling them into a byte array. The lookup rules are as follows:

If the looked-up code value is greater than 0xff, the character is double-byte; otherwise it is single-byte. For a double-byte value, the high 8 bits become the first byte and the low 8 bits the second byte, as the following code shows:

if (bb > 0xff) {    // DoubleByte
    if (dl - dp < 2)
        return CoderResult.OVERFLOW;
    da[dp++] = (byte) (bb >> 8);
    da[dp++] = (byte) bb;
} else {            // SingleByte
    if (dl - dp < 1)
        return CoderResult.OVERFLOW;
    da[dp++] = (byte) bb;
}
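For instance, taking a hypothetical code value of 0xC9BD looked up from the table:

int bb = 0xC9BD;                 // greater than 0xff, so double-byte
byte first = (byte) (bb >> 8);   // high 8 bits: 0xC9
byte second = (byte) bb;         // low 8 bits: 0xBD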

It can be seen that the first 5 characters are still 5 bytes after encoding, while each Chinese character is encoded as a double byte. As mentioned in the first section, GB2312 supports only 6,763 Chinese characters, so not all Chinese characters can be encoded with GB2312.

Encoding with GBK

Encoding the string "I am Junshan" with GBK gives exactly the same result as GB2312.

Yes, the GBK and GB2312 encoding results are the same here: GBK encoding is compatible with GB2312, and their encoding algorithms are the same. The difference is the length of their code tables; GBK's table contains more Chinese characters. Therefore, any character that can be encoded with GB2312 can be decoded with GBK, but the reverse is not necessarily true.
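A sketch demonstrating the one-way compatibility (the reverse direction can fail for characters outside GB2312's 6,763-character range):

public class GbkCompatDemo {
    public static void main(String[] args) throws Exception {
        String s = "君山";
        // Bytes produced by GB2312 decode correctly under GBK
        byte[] gb2312Bytes = s.getBytes("GB2312");
        String decoded = new String(gb2312Bytes, "GBK");
        System.out.println(s.equals(decoded));  // prints: true
    }
}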

Encoding with UTF-16

Encoding the string "I am Junshan" with UTF-16 gives the following result:

fe ff 00 49 00 20 00 61 00 6d 00 20 54 1b 5c 71

UTF-16 encoding doubles the size of the char array: characters in the single-byte range gain a high-order 0 and become two bytes, and Chinese characters also become two bytes. From the UTF-16 encoding rules, encoding merely splits a character's high and low 8 bits into two bytes, so the coding efficiency is very high and the rule is very simple. However, different processors treat 2-byte values differently, either big-endian (high byte first, low byte second) or little-endian (low byte first, high byte second), so to encode a string you must indicate whether it is big-endian or little-endian; that is why two bytes at the front hold the BYTE_ORDER_MARK value. UTF-16 uses a fixed-length 16 bits (2 bytes) to represent the UCS-2 or Unicode conversion format, and characters beyond the BMP are encoded with surrogate pairs.
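The byte order mark behavior can be seen by comparing the three UTF-16 charset names in Java (a sketch; StandardCharsets requires Java 7+):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BomDemo {
    public static void main(String[] args) {
        String s = "m";  // 'm' is U+006D
        // "UTF-16" writes a BOM (fe ff) and defaults to big-endian
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));    // [-2, -1, 0, 109]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));  // [0, 109]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));  // [109, 0]
    }
}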

Encoding with UTF-8

Encoding the string "I am Junshan" with UTF-8 gives the following result:

49 20 61 6d 20 e5 90 9b e5 b1 b1

Although UTF-16 has very high coding efficiency, it doubles characters in the single-byte range, which silently wastes storage space. In addition, UTF-16 encodes sequentially and cannot validate the code value of an individual character, so if one character's code value in the middle is damaged, all subsequent values are affected. UTF-8 has none of these problems: characters in the single-byte range are still represented with one byte, and Chinese characters are represented with three bytes. Its encoding rules are as follows:

Listing 3. UTF-8 encoding code snippet

private CoderResult encodeArrayLoop(CharBuffer src, ByteBuffer dst) {
    char[] sa = src.array();
    int sp = src.arrayOffset() + src.position();
    int sl = src.arrayOffset() + src.limit();
    byte[] da = dst.array();
    int dp = dst.arrayOffset() + dst.position();
    int dl = dst.arrayOffset() + dst.limit();
    int dlASCII = dp + Math.min(sl - sp, dl - dp);
    // ASCII only loop
    while (dp < dlASCII && sa[sp] < '\u0080')
        da[dp++] = (byte) sa[sp++];
    while (sp < sl) {
        char c = sa[sp];
        if (c < 0x80) {
            // Have at most seven bits
            if (dp >= dl)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            if (dl - dp < 2)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xc0 | (c >> 6));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        } else if (Character.isSurrogate(c)) {
            // Have a surrogate pair
            if (sgp == null)
                sgp = new Surrogate.Parser();
            int uc = sgp.parse(c, sa, sp, sl);
            if (uc < 0) {
                updatePositions(src, sp, dst, dp);
                return sgp.error();
            }
            if (dl - dp < 4)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xf0 | (uc >> 18));
            da[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
            da[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
            da[dp++] = (byte) (0x80 | (uc & 0x3f));
            sp++;  // 2 chars
        } else {
            // 3 bytes, 16 bits
            if (dl - dp < 3)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xe0 | (c >> 12));
            da[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        }
        sp++;
    }
    updatePositions(src, sp, dst, dp);
    return CoderResult.UNDERFLOW;
}
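As a quick check of the three-byte branch in Listing 3, the character '君' (U+541B) encodes to the bytes e5 90 9b seen in the result above:

char c = '君';                                // 0x541B
byte b1 = (byte) (0xe0 | (c >> 12));          // 0xE5
byte b2 = (byte) (0x80 | ((c >> 6) & 0x3f));  // 0x90
byte b3 = (byte) (0x80 | (c & 0x3f));         // 0x9B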

Unlike GBK and GB2312, UTF-8 does not need to look up a code table; the bytes are computed arithmetically, so UTF-8 is more efficient in encoding, which makes it an ideal format for storing Chinese characters.

Comparison of several encoding formats

All four of the above encoding formats can handle Chinese characters. GB2312 and GBK have similar encoding rules, but GBK's range is larger and it can handle all Chinese characters, so between GB2312 and GBK you should choose GBK. UTF-16 and UTF-8 both handle Unicode, though with different encoding rules. UTF-16 is relatively efficient: the character-to-byte conversion is simpler and string manipulation is better, so it is suitable between local disk and memory, where characters and bytes can be switched quickly; Java's in-memory encoding, for example, is UTF-16. However, it is not suitable for network transmission, because a network can easily damage the byte stream, and once the byte stream is damaged it is hard to recover. By comparison, UTF-8 is more suitable for network transmission: it stores ASCII characters in single bytes, damage to one character does not affect the others, and its encoding efficiency is between that of GBK and UTF-16. UTF-8 thus balances encoding efficiency against encoding safety and is the ideal Chinese encoding format.
