Author: gnuhpc
Source: http://www.cnblogs.com/gnuhpc/
1. ASCII code
In the 1960s, the United States developed a character encoding that defines the mapping between English characters and binary values. This is called ASCII, and it is still in use today.
ASCII defines 128 characters in total. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters: codes 0 to 31 and 127) occupy only the lower seven bits of a byte; the highest bit is always set to 0.
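The two values quoted above can be checked with a few lines of Java. This is only a minimal sketch; the class name AsciiSketch and the helper toBinary8 are made up for illustration.

```java
class AsciiSketch {
    // Returns the numeric code of a character as an 8-bit binary string.
    static String toBinary8(char c) {
        String bits = Integer.toBinaryString(c);
        // Left-pad with zeros so that a full byte is shown.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println((int) ' ' + " " + toBinary8(' ')); // 32 00100000
        System.out.println((int) 'A' + " " + toBinary8('A')); // 65 01000001
    }
}
```

In both results the leading bit is 0, as described above.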
2. Unicode
Imagine a single encoding that includes every symbol in the world and gives each one a unique code: the problem of garbled text would disappear. That is Unicode. As its name suggests, it is an encoding for all symbols.
Unicode is, of course, a large collection. Its code space has room for more than one million symbols (1,114,112 code points), and each symbol has its own code. For example, U+0639 represents the Arabic letter ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 ("strict"). You can look up specific symbols in the tables at unicode.org or in a dedicated Chinese character table.
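In Java, the code point of a character can be read with String.codePointAt. The sketch below prints the three code points mentioned above; the class name CodePointSketch and the helper describe are illustrative only, and the characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
class CodePointSketch {
    // Formats the first code point of s in the usual U+XXXX notation.
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(describe("\u0639")); // U+0639 (Arabic letter ain)
        System.out.println(describe("A"));      // U+0041
        System.out.println(describe("\u4e25")); // U+4E25
    }
}
```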
3. UTF-8
As the Internet became popular, a unified encoding was strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one way of implementing Unicode.
The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
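The variable length is easy to observe by encoding a few characters and counting the bytes. This is a small sketch; the class name Utf8LengthSketch is made up, and U+1F600 (an emoji) is added here only as an example of a 4-byte symbol.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1
        System.out.println(utf8Length("\u0639")); // 2
        System.out.println(utf8Length("\u4e25")); // 3
        // U+1F600 does not fit in one Java char, so build it from its code point.
        System.out.println(utf8Length(new String(Character.toChars(0x1F600)))); // 4
    }
}
```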
The UTF-8 encoding rules are simple; there are only two:
1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the Unicode code of the symbol. Therefore, for English letters, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
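The two rules can be turned into a hand-written encoder and compared with the JDK's own result. The following is a sketch, not a production encoder: the class and method names are hypothetical, and it assumes a valid code point (no check for surrogates or values above U+10FFFF).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeSketch {
    // Encodes one Unicode code point by hand, following the two rules.
    static byte[] encode(int cp) {
        if (cp < 0x80) {               // rule 1: 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {       // rule 2, n = 2: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {     // n = 3: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                       // n = 4: 11110xxx followed by three 10xxxxxx bytes
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Renders bytes as space-separated uppercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] library = "\u4e25".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(manual));                  // E4 B8 A5
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

For U+4E25 (binary 0100 1110 0010 0101), the 16 bits are distributed over the x positions of the 3-byte pattern, giving 11100100 10111000 10100101, that is, E4 B8 A5.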
4. Application:
In Java, if we read and write files with java.io.FileReader or java.io.FileWriter, we find that these classes let us get the encoding but (before Java 11 added constructors that take a Charset) not set it. The encoding they use therefore follows the platform default, so garbled characters appear easily when reading and writing files in multiple languages. The solution is to use java.io.FileInputStream with java.io.InputStreamReader, and java.io.FileOutputStream with java.io.OutputStreamWriter. InputStreamReader and OutputStreamWriter let you specify the encoding, so you can read and write UTF-8 files. For efficiency, wrap them in java.io.BufferedReader and java.io.BufferedWriter.
For example:
java.io.BufferedWriter writer = null;
java.io.FileOutputStream writerStream = new java.io.FileOutputStream(fileName);
// Wrap the byte stream in a writer that encodes characters as UTF-8
writer = new java.io.BufferedWriter(new java.io.OutputStreamWriter(writerStream, "UTF-8"));
// Do something
// Write the file
writer.close();
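The reading side works the same way with InputStreamReader. The sketch below writes a UTF-8 file and reads it back; the class and method names are hypothetical, and it uses try-with-resources (Java 7+) and StandardCharsets.UTF_8 in place of the "UTF-8" string, which is a stylistic choice rather than something the approach requires.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReadSketch {
    // Writes text to a file as UTF-8, independent of the platform default.
    static void writeUtf8(String fileName, String text) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(fileName), StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    // Reads a whole file, decoding its bytes as UTF-8.
    static String readUtf8(String fileName) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                content.append((char) ch);
            }
        }
        return content.toString();
    }

    // Writes text to a temporary file and checks that it reads back unchanged.
    static boolean roundTrip(String text) {
        try {
            File tmp = File.createTempFile("utf8-sketch", ".txt");
            tmp.deleteOnExit();
            writeUtf8(tmp.getPath(), text);
            return text.equals(readUtf8(tmp.getPath()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("A \u0639 \u4e25")); // true
    }
}
```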
You can also use the following approach.
For example, to convert a file's encoding from GBK to UTF-8 in Java:
// Requires: import java.io.*;
private static void transferFile(String srcFileName, String destFileName) throws IOException {
    String lineSeparator = System.getProperty("line.separator");
    FileInputStream fis = new FileInputStream(srcFileName);
    StringBuffer content = new StringBuffer();
    DataInputStream in = new DataInputStream(fis);
    // Decode the source bytes as GBK
    BufferedReader d = new BufferedReader(new InputStreamReader(in, "GBK"));
    String line = null;
    // Read the whole file into memory, line by line
    while ((line = d.readLine()) != null)
        content.append(line + lineSeparator);
    d.close();
    in.close();
    fis.close();
    // Encode the collected text as UTF-8 when writing
    Writer ow = new OutputStreamWriter(new FileOutputStream(destFileName), "UTF-8");
    ow.write(content.toString());
    ow.close();
}
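For large files, a streaming variant avoids holding the whole content in memory and keeps the original line endings. The sketch below is one possible way to do this with java.nio.file.Files (Java 7+); the class name, method names, and the generic Charset parameters are assumptions made for illustration. Unlike InputStreamReader, Files.newBufferedReader throws an exception on malformed input instead of substituting replacement characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class TranscodeSketch {
    // Copies src to dest, decoding with 'from' and encoding with 'to'.
    static void transcode(Path src, Charset from, Path dest, Charset to) throws IOException {
        try (Reader in = Files.newBufferedReader(src, from);
             Writer out = Files.newBufferedWriter(dest, to)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    // Converts a one-byte ISO-8859-1 file and checks the resulting UTF-8 bytes.
    static boolean selfCheck() {
        try {
            Path src = Files.createTempFile("transcode-src", ".txt");
            Path dest = Files.createTempFile("transcode-dest", ".txt");
            Files.write(src, new byte[] { (byte) 0xE9 }); // U+00E9 in ISO-8859-1
            transcode(src, StandardCharsets.ISO_8859_1, dest, StandardCharsets.UTF_8);
            byte[] result = Files.readAllBytes(dest);
            Files.delete(src);
            Files.delete(dest);
            // U+00E9 is a 2-byte symbol in UTF-8: C3 A9
            return Arrays.equals(result, new byte[] { (byte) 0xC3, (byte) 0xA9 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selfCheck()); // true
    }
}
```

For the GBK case above, the call would be transcode(src, Charset.forName("GBK"), dest, StandardCharsets.UTF_8), assuming the Java runtime in use provides the GBK charset.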