Use java.nio.charset.CharsetDecoder to automatically recognize character sets

Source: Internet
Author: User

I looked into the methods for automatic character set detection that can be found on the Internet. The ones that work rely on the third-party library jchardet; cpdetector is another option, but it relies on jchardet as well. I then discovered by accident that the JDK's own java.nio.charset.CharsetDecoder can be used to identify character sets.

I. Principles

An InputStreamReader is usually constructed in one of two ways:

InputStreamReader reader = new InputStreamReader(in, charsetName);

Or

InputStreamReader reader = new InputStreamReader(in, charset);

If the charset does not match the data, no error is raised and the reader simply outputs garbled characters.

There is a third constructor, which takes a CharsetDecoder:

CharsetDecoder cd = charset.newDecoder();
InputStreamReader reader = new InputStreamReader(in, cd);

If the decoder's charset does not match the data, an exception is thrown when reading:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
    ....

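This behavior can be used for character set detection: a candidate charset that decodes the input without throwing is a plausible match, and one that throws can be ruled out.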
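The contrast between the two kinds of constructor can be sketched as follows. This is only an illustration of the principle, not code from AutoCharsetReader; the class name StrictDecodeCheck and its methods are hypothetical, and the input is an in-memory byte array rather than a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class StrictDecodeCheck {

    /** Charset-based constructor: bytes that do not fit are silently replaced by U+FFFD. */
    static String lenientRead(byte[] data, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            return drain(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** CharsetDecoder-based constructor: bytes that do not fit raise an exception. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        // A decoder fresh from newDecoder() reports malformed and unmappable input.
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs.newDecoder())) {
            drain(r);
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. MalformedInputException: this charset is ruled out
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the reader to the end and returns everything it produced. */
    private static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        for (int n; (n = r.read(buf)) != -1; ) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] notUtf8 = {(byte) 0xD7, (byte) 0xD6}; // two bytes that are not valid UTF-8
        System.out.println(lenientRead(notUtf8, StandardCharsets.UTF_8).contains("\uFFFD")); // true
        System.out.println(decodesCleanly(notUtf8, StandardCharsets.UTF_8));                  // false
    }
}
```

The lenient reader gives no signal that anything went wrong, which is why only the decoder-based constructor is useful for detection.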

II. Use of AutoCharsetReader

AutoCharsetReader is a class modeled on InputStreamReader and built on the principle above. It extends Reader and can be thought of as a charset-adaptive InputStreamReader.

AutoCharsetReader ar = new AutoCharsetReader(in);
int c = ar.read(); // read() returns an int; -1 signals end of stream
...
char[] cbuf = new char[2000];
ar.read(cbuf);
...
BufferedReader br = new BufferedReader(ar);
br.readLine();
...

For example, the TextField that Lucene uses when building a full-text index takes a Reader parameter, so this class can be passed to it directly:

Field field = new TextField("content", new AutoCharsetReader(file));

Once the file has been read, you can query its charset. Note that the result is only available after reading.

Charset charset = ar.charset();

III. Candidate Character Sets

Because the character set is determined by trying charsets one after another, a list of candidates has to be provided. The default candidates in the current code are:

    private final static String[] _defaultCharsets = {
            "US-ASCII",
            "UTF-8",
            "GB2312",
            "BIG5",
            "GBK",
            "GB18030",
            "UTF-16BE",
            "UTF-16LE",
            "UTF-16",
            "UNICODE"};

You can also supply your own candidate list. For example:

AutoCharsetReader ar = new AutoCharsetReader(in).setCharset("ascii", "utf-8", "gbk");

The order of the candidates affects the result. For example, if GBK is listed before GB2312, the result can only ever be GBK and never GB2312, because GBK is a superset of GB2312: any input that decodes as GB2312 also decodes as GBK.
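The ordering effect described above can be reproduced with two charsets that every JVM is guaranteed to provide. US-ASCII and UTF-8 stand in for GB2312 and GBK here, since UTF-8 is likewise a superset of US-ASCII. The class CandidateOrderDemo and its firstMatch method are hypothetical names for this sketch, not part of AutoCharsetReader.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class CandidateOrderDemo {

    /** Returns the first candidate that decodes the whole input without error, or null. */
    static Charset firstMatch(byte[] data, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                // decode(ByteBuffer) throws CharacterCodingException on bad input
                cs.newDecoder().decode(ByteBuffer.wrap(data));
                return cs;
            } catch (CharacterCodingException e) {
                // this candidate is ruled out; try the next one
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.US_ASCII);
        // Subset first: the narrower charset is reported.
        System.out.println(firstMatch(plain, StandardCharsets.US_ASCII, StandardCharsets.UTF_8)); // US-ASCII
        // Superset first: the narrower charset can never be reported.
        System.out.println(firstMatch(plain, StandardCharsets.UTF_8, StandardCharsets.US_ASCII)); // UTF-8
    }
}
```

Listing narrower charsets before the charsets that contain them, as the default list does, keeps the most specific answer reachable.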

IV. Character Set Detection Only

The class can also be used for character set detection alone:

charset = AutoCharsetReader.quickDetect(file.toURI().toURL(), charsets);

Or

charset = AutoCharsetReader.deepDetect(file.toURI().toURL(), charsets, stops);

quickDetect does not read the whole input and is suited to files that use a single character set. For HTML, the whole file may have to be read before the charset is known; use deepDetect in that case. The charsets parameter may be null.

Suppose that for a group of files the possible character sets are known to be "ascii", "UTF-8", "gb2312", and "gbk". As soon as a file is detected as "UTF-8" or "gbk", the result can be returned immediately without reading the rest of the file. A file that has so far decoded as "ascii" or "gb2312", by contrast, may still turn out to be one of the wider charsets further on. In this case, pass {"UTF-8", "gbk"} as the stops parameter. If stops is null, all data must be read.
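The source of deepDetect is not reproduced in this article, so the following is only one possible reading of how the stops parameter could behave, written as a sketch over an in-memory byte array. The class EarlyStopDetector, its detect method, and the chunk parameter are all assumptions made for illustration. It uses charsets that every JVM provides instead of gb2312 and gbk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual deepDetect implementation.
class EarlyStopDetector {

    static Charset detect(byte[] data, int chunk, List<Charset> candidates, Set<Charset> stops) {
        if (chunk <= 0) throw new IllegalArgumentException("chunk must be positive");
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        int end = Math.min(chunk, data.length); // number of bytes "read" so far
        for (Charset cs : candidates) {
            CharsetDecoder decoder = cs.newDecoder(); // reports errors by default
            in.clear();    // rewind: a new candidate re-checks everything read so far
            in.limit(end);
            boolean ok = feed(decoder, in, out, end == data.length);
            while (ok) {
                // A stop charset is accepted at once; otherwise read on to the end.
                if (stops.contains(cs) || end == data.length) return cs;
                end = Math.min(end + chunk, data.length);
                in.limit(end); // expose the next chunk; an undecoded tail stays in place
                ok = feed(decoder, in, out, end == data.length);
            }
        }
        return null; // no candidate fits
    }

    /** Decodes the bytes currently exposed by the buffer; returns false on a decoding error. */
    private static boolean feed(CharsetDecoder d, ByteBuffer in, CharBuffer out, boolean last) {
        while (true) {
            out.clear(); // decoded characters are discarded, only validity matters
            CoderResult r = d.decode(in, out, last);
            if (r.isError()) return false;
            if (r.isUnderflow()) return true;
        }
    }

    public static void main(String[] args) {
        // 20 ASCII bytes, one 3-byte UTF-8 character, 9 ASCII bytes, then a byte invalid in UTF-8.
        byte[] text = "aaaaaaaaaaaaaaaaaaaa\u5b57aaaaaaaaa".getBytes(StandardCharsets.UTF_8);
        byte[] data = Arrays.copyOf(text, text.length + 1);
        data[data.length - 1] = (byte) 0xFF;
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // With UTF-8 as a stop, detection returns before the bad last byte is reached.
        System.out.println(detect(data, 16, candidates, Collections.singleton(StandardCharsets.UTF_8))); // UTF-8
        // Without stops, everything is read and UTF-8 is ruled out at the end.
        System.out.println(detect(data, 16, candidates, Collections.<Charset>emptySet())); // ISO-8859-1
    }
}
```

The two calls show the trade-off: stopping early saves reading, at the cost of trusting the part of the file seen so far.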

V. Others

To improve efficiency, this class keeps an internal buffer, so when decoding with the first charset fails, the next candidate can be tried without reading from the underlying I/O again. The default buffer size is 8192. A custom buffer size can be passed when the object is constructed; a value smaller than 16 is raised to 16.
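The idea of reading once and retrying candidates against the same buffer might look like the sketch below. It is an assumption about the general approach, not the class's actual code: BufferedSniffer, sniff, and prefixDecodes are hypothetical names, and only the first buffer of the stream is inspected. Decoding with endOfInput set to false keeps a multi-byte character that is cut off at the buffer boundary from being counted as an error.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual AutoCharsetReader implementation.
class BufferedSniffer {
    static final int MIN_BUFFER_SIZE = 16;

    /** Fills one buffer from the stream, then tries every candidate on that same buffer. */
    static Charset sniff(InputStream in, int bufferSize, Charset... candidates) {
        byte[] buf = new byte[Math.max(bufferSize, MIN_BUFFER_SIZE)]; // sizes below 16 become 16
        int len = fill(in, buf);
        for (Charset cs : candidates) {
            if (prefixDecodes(buf, len, cs)) return cs; // no further I/O per candidate
        }
        return null;
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int fill(InputStream in, byte[] buf) {
        int len = 0;
        try {
            while (len < buf.length) {
                int n = in.read(buf, len, buf.length - len);
                if (n < 0) break;
                len += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return len;
    }

    /** True if the first len bytes decode without error, tolerating a truncated final character. */
    static boolean prefixDecodes(byte[] buf, int len, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder(); // reports errors by default
        ByteBuffer in = ByteBuffer.wrap(buf, 0, len);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            out.clear();
            CoderResult r = decoder.decode(in, out, false); // false: more input may follow
            if (r.isError()) return false;
            if (r.isUnderflow()) return true;
        }
    }

    public static void main(String[] args) {
        // 15 ASCII bytes plus one 3-byte character: a 16-byte buffer cuts the character in two.
        byte[] data = "aaaaaaaaaaaaaaa\u5b57".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(new ByteArrayInputStream(data), 16,
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8)); // UTF-8
    }
}
```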

VI. Source Code

http://download.csdn.net/detail/u012994553/9777709

 
