Use java.nio.charset.CharsetDecoder to automatically recognize character sets
I looked into the methods for automatically recognizing character sets that can be found on the Internet. The ones that work rely on the third-party class library jchardet; cpdetector is another option, but it also uses jchardet. I then discovered by accident that the JDK's own java.nio.charset.CharsetDecoder can be used to identify character sets.
I. Principles
InputStreamReader is usually constructed in one of two ways:
InputStreamReader reader = new InputStreamReader(in, charsetName);
Or
InputStreamReader reader = new InputStreamReader(in, charset);
If the charset does not match the data, no error is raised; the reader simply outputs garbled characters.
There is also a third constructor, which takes a CharsetDecoder:
CharsetDecoder cd = charset.newDecoder();
InputStreamReader reader = new InputStreamReader(in, cd);
If the decoder's charset does not match the data, an exception is thrown instead:
java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
    ....
This behavior can be used for character set detection: try a candidate charset, and if no exception is thrown, the data is valid in that charset.
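The principle can also be tried out without any reader. The following is a minimal sketch, not the author's AutoCharsetReader: the class name StrictDecodeProbe and the method decodes are hypothetical names chosen for illustration, and only standard java.nio.charset calls are used. It decodes a byte array with a decoder set to report errors and treats a CharacterCodingException as "this charset does not fit".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original source code.
class StrictDecodeProbe {

    /** Returns true if the whole byte array is valid in the given charset. */
    static boolean decodes(byte[] data, Charset charset) {
        // Decoders from newDecoder() report errors by default;
        // the actions are set explicitly here to make that visible.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException and UnmappableCharacterException both end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Three Chinese characters, written as escapes so the source file encoding does not matter.
        byte[] utf8 = "\u5b57\u7b26\u96c6".getBytes(StandardCharsets.UTF_8);
        System.out.println("US-ASCII fits: " + decodes(utf8, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 fits:    " + decodes(utf8, StandardCharsets.UTF_8));
    }
}
```

Keep in mind that a single-byte charset such as ISO-8859-1 maps every byte value, so a check of this kind can never rule it out. It only tells you which candidates are impossible, not which one was actually intended.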
II. Use of AutoCharsetReader
AutoCharsetReader is a class modeled on InputStreamReader and built on the principle above. It extends Reader and can be seen as a charset-adaptive InputStreamReader.
AutoCharsetReader ar = new AutoCharsetReader(in);
int c = ar.read(); // as with any Reader, read() returns an int; -1 means end of stream
...
char[] cbuf = new char[2000];
ar.read(cbuf);
...
BufferedReader br = new BufferedReader(ar);
br.readLine();
...
For example, the TextField that Lucene uses to build a full-text index takes a Reader parameter, so this class can be passed in directly:
Field field = new TextField("content", new AutoCharsetReader(file));
Once the file has been read, you can get its charset. Note that this works only after reading.
Charset charset = ar.charset();
III. Alternative Character Sets
Because the character set is determined by trying several candidates in turn, a list of alternatives has to be provided. The default list in the current code is:
private final static String[] _defaultCharsets = { "US-ASCII", "UTF-8", "GB2312", "BIG5", "GBK", "GB18030", "UTF-16BE", "UTF-16LE", "UTF-16", "UNICODE"};
You can also change the alternative character set. For example:
AutoCharsetReader ar = new AutoCharsetReader(in).setCharset("ascii", "utf-8", "gbk");
The order affects the detection result. For example, if GBK comes before GB2312, the result can only ever be GBK, never GB2312, because GBK is a superset of GB2312.
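The effect of ordering can be reproduced with a small sketch. This is an illustration under assumptions, not the code of AutoCharsetReader: the class CandidateOrderDemo and the method firstMatch are hypothetical, and US-ASCII and UTF-8 stand in for GB2312 and GBK because they have the same subset/superset relation and are available on every JVM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the real AutoCharsetReader may work differently.
class CandidateOrderDemo {

    /** Returns the first candidate that decodes the data without error, or null if none does. */
    static Charset firstMatch(byte[] data, String... candidates) {
        for (String name : candidates) {
            Charset cs = Charset.forName(name);
            try {
                // A fresh decoder reports malformed input by throwing.
                cs.newDecoder().decode(ByteBuffer.wrap(data));
                return cs;
            } catch (CharacterCodingException e) {
                // This candidate does not fit; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        // Superset listed first: the narrower charset is never reported.
        System.out.println(firstMatch(ascii, "UTF-8", "US-ASCII"));
        // Subset listed first: it is reported whenever the data fits.
        System.out.println(firstMatch(ascii, "US-ASCII", "UTF-8"));
    }
}
```

Listing the narrower charset first gives the more specific answer, which appears to be why the default list puts US-ASCII before UTF-8 and GB2312 before GBK.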
IV. Character Set Detection Only
The class can also be used purely for character set detection:
charset = AutoCharsetReader.quickDetect(file.toURI().toURL(), charsets);
or:
charset = AutoCharsetReader.deepDetect(file.toURI().toURL(), charsets, stops);
quickDetect reads only once and is suitable for files in a single character set. For HTML, you may need to read the whole file before the charset is known; use deepDetect in that case. The charsets parameter can be null.
Suppose that, for a group of files, the possible character sets are known to be "ascii", "UTF-8", "gb2312", and "gbk". As soon as the character set of a file is detected as "UTF-8" or "gbk", the result can be returned immediately without reading the rest of the file. In this case, set the stops parameter to {"UTF-8", "gbk"}. If stops is null, all data must be read.
V. Others
To improve efficiency, this class keeps a buffer. If decoding with the first-choice character set fails, the next candidate is tried on the buffered bytes without re-reading from the underlying I/O. The default buffer size is 8192. A custom size can be passed when the object is constructed; a value smaller than 16 is raised to 16.
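The buffering idea can be sketched as follows. This is only an assumption about how such a retry might look, not the author's implementation: the class BufferedDetectSketch and the method detect are hypothetical, and the sketch inspects just the first buffer. It reads the bytes once, then tries each candidate against the same array. Passing endOfInput = false to the decoder means a multi-byte character cut off at the end of the buffer is not counted as an error.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch; names and structure are assumptions, not the original source.
class BufferedDetectSketch {

    /** Reads up to bufSize bytes once and returns the first candidate that decodes them, or null. */
    static Charset detect(InputStream in, int bufSize, String... candidates) {
        byte[] buf = new byte[Math.max(bufSize, 16)]; // sizes below 16 are raised to 16
        int n = 0;
        try {
            // Fill the buffer with a single pass over the stream.
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) break;
                n += r;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String name : candidates) {
            Charset cs = Charset.forName(name);
            CharsetDecoder decoder = cs.newDecoder();
            // Each attempt wraps the same bytes again; the stream itself is not touched.
            ByteBuffer bytes = ByteBuffer.wrap(buf, 0, n);
            CharBuffer chars = CharBuffer.allocate((int) (n * decoder.maxCharsPerByte()) + 1);
            // false: more input may follow, so a trailing partial character is tolerated.
            CoderResult result = decoder.decode(bytes, chars, false);
            if (!result.isError()) {
                return cs;
            }
        }
        return null;
    }
}
```

A call such as detect(new FileInputStream(path), 8192, "US-ASCII", "UTF-8", "GBK") would then read at most 8192 bytes from the file, no matter how many candidates are tried.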
VI. Source Code
http://download.csdn.net/detail/u012994553/9777709