Java Character Set Handling, Part 2: File Character Sets


The previous article discussed how garbled text comes about: http://blog.csdn.net/xieyuooo/article/details/6919007

The main cause is a mismatch between encoding and decoding. If we know how content was encoded, decoding it is straightforward. For example, the contentType of an HTTP response tells the browser which encoding the body uses; without it, the browser falls back to its default character set. This article describes how Java can handle character sets when no such protocol header exists.
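As a reminder of the protocol-header case, here is a minimal sketch using the classic javax.servlet API (the class and method names are mine, for illustration only):

import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

public class EncodingHeaderExample {
    // Declare the response encoding explicitly so the browser does not have to guess.
    static void writeUtf8(HttpServletResponse response, String body) throws IOException {
        response.setContentType("text/html; charset=UTF-8"); // the protocol header carries the charset
        response.getWriter().write(body);
    }
}

Files have no such header, which is the problem this article addresses.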

This article is about file character sets. Before diving in, recall from the previous article the default character set, custom character sets, and the system character set. Which character set is the current environment using?

System.out.println(Charset.defaultCharset());

To list all character-set encodings the current Java runtime supports:

Set<String> charsetNames = Charset.availableCharsets().keySet();
for (String charsetName : charsetNames) {
    System.out.println(charsetName);
}

Java's streams offer no built-in way to learn a file's character set. Yet editors such as Windows Notepad, EditPlus, and UltraEdit can recognize text in many different character sets. How do they do it? And if an uploaded file has to be parsed, what should we do?


First, text files come in two kinds: with a BOM and without one. GBK files carry no BOM; UTF-8, UTF-16LE, UTF-16BE, UTF-32, and so on may carry one. The BOM (byte order mark) is a few bytes at the head of the file that indicate its character set, for example:

UTF-8: three header bytes, 0xEF, 0xBB, 0xBF.

UTF-16BE: two header bytes, 0xFE, 0xFF.

UTF-16LE: two header bytes, 0xFF, 0xFE.

UTF-32BE: four header bytes, 0x00, 0x00, 0xFE, 0xFF.

UTF-32LE: four header bytes, 0xFF, 0xFE, 0x00, 0x00.
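You can see these marks for yourself with a minimal sketch that dumps a file's first four bytes in hex (the file path is illustrative):

import java.io.FileInputStream;
import java.io.IOException;

public class BomDump {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream("test.txt"); // illustrative path
        try {
            byte[] head = new byte[4];
            int n = in.read(head);
            // Print the leading bytes in hex; compare them with the BOM table above.
            for (int i = 0; i < n; i++) {
                System.out.printf("0x%02x ", head[i]);
            }
            System.out.println();
        } finally {
            in.close();
        }
    }
}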

With the common character sets covered, we can answer most cases: our programs mostly deal with UTF-8 or GBK, and the remaining character sets are largely compatible with these (for example, GB2312 and GB18030 differ from GBK only in a few special characters).

Let's first consider the case where the file has a header. Here we don't need to read the entire file; the character set can be obtained cheaply from the first few bytes. On to the code:

From the description above it is not hard to write a processing class around an InputStream:

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class UnicodeInputStream extends InputStream {

    PushbackInputStream internalIn;
    boolean isInited = false;
    String defaultEnc;
    String encoding;
    private byte[] inputStreamBOMBytes;

    private static final int BOM_SIZE = 4;

    public UnicodeInputStream(InputStream in) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = "GBK"; // the default character set is GBK
        try {
            init();
        } catch (IOException ex) {
            IllegalStateException ise = new IllegalStateException("Init method failed.");
            ise.initCause(ex);
            throw ise;
        }
    }

    public UnicodeInputStream(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    public String getEncoding() {
        return encoding;
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are unread
     * back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (isInited) return;

        byte[] bom = new byte[BOM_SIZE];
        int n, unread;
        n = internalIn.read(bom, 0, bom.length);
        inputStreamBOMBytes = bom;

        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // no BOM found; leave encoding null and push everything back
            // encoding = defaultEnc; // the default character set is not applied here
            unread = n;
            // inputStreamBOMBytes = new byte[0];
        }
        // System.out.println("read=" + n + ", unread=" + unread);

        if (unread > 0) internalIn.unread(bom, (n - unread), unread);
        isInited = true;
    }

    public byte[] getInputStreamBOMBytes() {
        return inputStreamBOMBytes;
    }

    public void close() throws IOException {
        isInited = true;
        internalIn.close();
    }

    public int read() throws IOException {
        isInited = true;
        return internalIn.read();
    }
}
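A quick way to try the class (a sketch; the file path is illustrative): wrap a FileInputStream and print the encoding found from the BOM, or null when there is none:

import java.io.FileInputStream;
import java.io.IOException;

public class UnicodeInputStreamDemo {
    public static void main(String[] args) throws IOException {
        // the one-argument constructor runs init() and reads the BOM immediately
        UnicodeInputStream in = new UnicodeInputStream(new FileInputStream("test.txt"));
        try {
            // prints e.g. UTF-8 for a Notepad-saved UTF-8 file, or null when no BOM exists
            System.out.println(in.getEncoding());
        } finally {
            in.close();
        }
    }
}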


Okay, let's see whether it works. Open a file in Notepad, write some Chinese characters, and save it several times, once per character set:

[Screenshot omitted: the same content saved as several files, one per character set; the .sql and other extensions are just manually chosen names.]

Files without a header are covered later. For files with a header, the following code checks whether detection works (files saved with the Windows built-in Notepad or with UltraEdit carry a BOM; EditPlus writes none, so use the first two tools to create the test files):

Let's put this into a component class so it can be called from elsewhere. Say the class is called FileUtils. We define a method getFileStringByInputStream that takes the input stream plus a flag saying whether to close it (sometimes you must not close it yourself, because an outer framework does), and an overload without the second parameter that calls the first with true (close by default), as sketched below.
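The overload itself is a one-liner (a sketch; it would sit next to the two-argument method inside FileUtils):

// Overload: callers that don't pass the flag get the stream closed by default.
public static String getFileStringByInputStream(InputStream inputStream) throws IOException {
    return getFileStringByInputStream(inputStream, true);
}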

The code is as follows (closeStream is a self-written helper that closes a Closeable; its implementation is not shown here):

public static String getFileStringByInputStream2(InputStream inputStream, boolean isCloseInputStream) throws IOException {
    if (inputStream.available() < 2) return "";
    try {
        UnicodeInputStream in = new UnicodeInputStream(inputStream);
        String encoding = in.getEncoding();
        int available = inputStream.available();
        byte[] bomBytes = in.getInputStreamBOMBytes();
        int bomLength = bomBytes.length;
        byte[] last = new byte[available + bomLength];
        System.arraycopy(bomBytes, 0, last, 0, bomLength); // copy the header bytes in
        inputStream.read(last, bomBytes.length, available); // read the rest after the header
        String result = new String(last, encoding);
        if (encoding != null && encoding.startsWith("GB")) {
            return result;
        } else {
            return result.substring(1); // strip the BOM character
        }
    } finally {
        if (isCloseInputStream) closeStream(inputStream);
    }
}


At this point I tried a few files and everything worked, no matter which character set I saved them in. Just as I was getting happy, someone handed me an EditPlus file, and it turned out to have no header. Note that Java's default OutputStream writes no header either, unless you write one in yourself; and if a header for the wrong character set gets written, the approach above fails outright.


If the file has no BOM, this method can determine nothing, since there is no header to read. It is fair to say that no editor today can open every file without ever showing garbled text (we will demonstrate this shortly). EditPlus, for example, saves files without a header, and yet Notepad, UE, and EditPlus can all still recognize them. How? (This holds in the vast majority of cases, not all.)

With no header, there is only one option: read the file's bytes and match them against the encoding patterns of the various character sets. That sounds workable, but character sets overlap with one another, and where they conflict the guess can go wrong.

Let's run an experiment:

Open Notepad or EditPlus and, at the start of a new file, type exactly the two characters 联通 ("China Unicom"), then save in GBK format (note that on Windows, "ANSI" means GBK on a Chinese system; it is the default). Reopening the file now produces garbled text in every editor:

[Screenshots omitted: the file reopened in Notepad, EditPlus, and UE, each showing garbled text.]


A tragedy. And this is just one example: other character sequences can trip the detectors the same way. The reason is that the GBK bytes of these two characters also match the UTF-8 byte pattern, so the editors guess UTF-8. Write more Chinese characters into the file (rather than creating a new one with only these two) and it will be recognized correctly, because long runs of text rarely stay ambiguous; 联通 followed by more content opens fine.
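The conflict is easy to show in code (a minimal sketch): the GBK bytes of 联通 fit the two-byte UTF-8 bit pattern, which is exactly why header-less detectors lean toward UTF-8 here:

import java.io.UnsupportedEncodingException;

public class GbkUtf8ConflictDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] gbk = "联通".getBytes("GBK"); // 0xC1 0xAA 0xCD 0xA8
        // All four bytes fit the two-byte UTF-8 pattern 110xxxxx 10xxxxxx,
        // so a detector that sees only them has no way to pick the right answer;
        // decoding them as UTF-8 yields garbage instead of the original text.
        System.out.println(new String(gbk, "UTF-8"));
    }
}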


Back to our problem: since nothing can determine a character set with full certainty, how much can Java handle? Can we guess encodings the way Notepad does? Yes. There is a framework based on Mozilla's detector, called chardet (jchardet). The jar and source can be downloaded from http://sourceforge.net/projects/jchardet/files/; it handles a large number of character sets internally.


So how is it used? It needs to scan the entire file (note that we are not considering files larger than 2 GB).

As a simple example, the package ships with a test class called HtmlCharsetDetector, runnable through its main method. I have roughly tested it: character-set detection works for most text files after some slight adjustments. I will not post its code here; instead, based on that class, we combine chardet with the original header-based check.

First, write a processing method based on the third-party package:

// requires org.mozilla.intl.chardet.nsDetector, nsPSMDetector, nsICharsetDetectionObserver
/**
 * Use chardet to parse text content.
 * @param inputStream the input stream
 * @param bomBytes the header bytes: the header check runs first, so the bytes already
 *        read from the stream are passed in and must be put back into the result
 * @param bomLength the header length: even though the buffer is defined as 4 bytes,
 *        fewer may actually have been read, so the length is passed in from outside
 *        rather than taken from bomBytes.length
 * @param last the output buffer the data is copied into
 * @return the parsed string
 * @throws IOException on input/output errors, e.g. file not found
 */
private static String processEncodingByChardet(InputStream inputStream, byte[] bomBytes, int bomLength, byte[] last) throws IOException {
    byte[] buf = new byte[1024];
    nsDetector det = new nsDetector(nsPSMDetector.ALL);
    final String[] findCharset = new String[1];
    // A small trick: when the detector finds the character set,
    // the callback writes it into this outer variable.
    det.Init(new nsICharsetDetectionObserver() {
        public void Notify(String charset) {
            if (CHARSET_CONVERT.containsKey(charset)) {
                findCharset[0] = CHARSET_CONVERT.get(charset);
            }
        }
    });
    int len, allLength = bomLength;
    System.arraycopy(bomBytes, 0, last, 0, bomLength);
    boolean isAscii = det.isAscii(bomBytes, bomLength);
    boolean done = det.DoIt(bomBytes, bomLength, false);
    BufferedInputStream buff = new BufferedInputStream(inputStream);
    while ((len = buff.read(buf, 0, buf.length)) > 0) {
        System.arraycopy(buf, 0, last, allLength, len);
        allLength += len;
        if (isAscii) {
            isAscii = det.isAscii(buf, len);
        }
        if (!isAscii && !done) {
            done = det.DoIt(buf, len, false);
        }
    }
    det.DataEnd();
    if (isAscii) { // pure ASCII: the default character set will do
        return new String(last, Charset.defaultCharset());
    }
    if (findCharset[0] != null) {
        return new String(last, findCharset[0]);
    }
    String encoding = null;
    for (String charset : det.getProbableCharsets()) {
        // walk the list of probable character sets and take the first usable one
        encoding = CHARSET_CONVERT.get(charset);
        if (encoding != null) {
            break;
        }
    }
    if (encoding == null) encoding = Charset.defaultCharset().name(); // fall back to the default
    return new String(last, encoding);
}


CHARSET_CONVERT is defined as follows. Only the character sets we can reliably handle are mapped; everything else is ignored, because chardet's other answers are not always usable:

private final static Map<String, String> CHARSET_CONVERT = new HashMap<String, String>() {
    {
        put("GB2312", "GBK");
        put("GBK", "GBK");
        put("GB18030", "GB18030");
        put("UTF-16LE", "UTF-16LE");
        put("UTF-16BE", "UTF-16BE");
        put("UTF-8", "UTF-8");
        put("UTF-32BE", "UTF-32BE");
        put("UTF-32LE", "UTF-32LE");
    }
};

With this method written, we merge it with the original BOM-based method:

/**
 * Get the file content, with character-set detection.
 * @param inputStream the input stream
 * @param isCloseInputStream whether to close the input stream
 * @throws IOException on IO errors
 * @return the string content of the file
 */
public static String getFileStringByInputStream(InputStream inputStream, boolean isCloseInputStream) throws IOException {
    if (inputStream.available() < 2) return "";
    UnicodeInputStream in = new UnicodeInputStream(inputStream);
    try {
        String encoding = in.getEncoding(); // first try the BOM
        int available = inputStream.available(); // bytes remaining (files under 2 GB assumed)
        byte[] bomBytes = in.getInputStreamBOMBytes(); // the header bytes already read
        int bomLength = bomBytes.length; // length of the header
        byte[] last = new byte[available + bomLength]; // total length
        if (encoding == null) {
            // no BOM found: hand over to chardet
            return processEncodingByChardet(inputStream, bomBytes, bomLength, last);
        } else {
            System.arraycopy(bomBytes, 0, last, 0, bomLength); // copy the header in
            inputStream.read(last, bomBytes.length, available); // read the rest after the header
            String result = new String(last, encoding);
            if (encoding.startsWith("GB")) {
                return result;
            } else {
                return result.substring(1); // strip the BOM character
            }
        }
    } finally {
        if (isCloseInputStream) closeStream(in);
    }
}


Externally, the method is overloaded to control whether the input stream is closed, as sketched earlier;

Testing shows that the vast majority of files can now be parsed this way.
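Putting it all together (a sketch; the file name is illustrative): reading a file of unknown encoding into a String now takes one call:

import java.io.FileInputStream;
import java.io.IOException;

public class ReadAnyTextFile {
    public static void main(String[] args) throws IOException {
        // BOM files are decoded from their header; header-less files go through chardet.
        String content = FileUtils.getFileStringByInputStream(new FileInputStream("unknown.txt"));
        System.out.println(content);
    }
}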

Note the substring(1) above: when a file has a BOM, the header bytes (2 to 4 of them, depending on the encoding) decode into exactly one character at the front of the string, so one character is removed. Header-less GBK content is returned as-is.
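This is easy to verify (a minimal sketch): the three-byte UTF-8 BOM decodes to the single character U+FEFF, which is why one substring(1) suffices regardless of how many bytes the BOM occupied:

import java.io.UnsupportedEncodingException;

public class BomCharDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        String s = new String(withBom, "UTF-8");
        System.out.println(s.length());                       // 2: the 3-byte BOM became one char
        System.out.println(Integer.toHexString(s.charAt(0))); // feff
        System.out.println(s.substring(1));                   // A
    }
}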
