1. Why encodings exist: our basic unit of storage is the byte, but a single byte cannot represent every character of human language. Characters therefore have to be split into, or mapped onto, sequences of bytes by some agreed translation scheme, and such a scheme is what we call an encoding.
2. Common encoding formats
ASCII: Uses the low 7 bits of a byte, 128 values in all: 0-31 are control characters and 32-126 are printable characters. The highest bit of the byte is reserved for parity checking. A parity check is a method used during code transmission to verify whether an error has occurred; it comes in two flavors, odd check and even check. Odd check rule: the number of 1 bits in a correct byte must be odd; if it is not, the highest bit is set to 1. Even check rule: the number of 1 bits in a correct byte must be even; if it is not, the highest bit is set to 1. If this still seems vague, take the letter A, whose ASCII code is 1000001: with even parity the result is 01000001, and with odd parity it is 11000001. The parity scheme is agreed in advance; when the CPU reads stored data, it counts the 1 bits of the stored 8 bits again and checks that the result is consistent with the chosen parity. To a certain extent this can detect memory errors, but parity can only detect errors, not correct them, and it cannot detect double-bit errors (though the probability of a double-bit error is quite low). When an error is detected, the receiver asks the sender to retransmit the data.
Example:
During serial data transmission, interference may flip a 0 bit into a 1 (why not the other way round? because interference arrives as a pulse). When this happens, we say an error has occurred. Detecting errors in transmission is called error checking; removing errors once they are found is called error correction. The simplest error-checking method is the parity check: odd parity is generally used in synchronous transmission, and even parity in asynchronous transmission. A minimal parity sketch follows.
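Below is a minimal parity sketch in Java (the class and method names are illustrative, not from any library): it places an even or odd parity bit in the highest bit of a 7-bit ASCII value, reproducing the letter-A example above.

public class ParityDemo {
    // Returns the 8-bit value with an even-parity bit in the most significant bit.
    static int evenParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F); // count the 1 bits of the low 7 bits
        return (ones % 2 == 0) ? ascii7 : (ascii7 | 0x80);
    }

    // Returns the 8-bit value with an odd-parity bit in the most significant bit.
    static int oddParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F);
        return (ones % 2 == 1) ? ascii7 : (ascii7 | 0x80);
    }

    public static void main(String[] args) {
        int a = 'A'; // 1000001 in binary: two 1 bits, already even
        System.out.println(Integer.toBinaryString(evenParity(a))); // 1000001 (i.e. 01000001)
        System.out.println(Integer.toBinaryString(oddParity(a)));  // 11000001
    }
}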
ISO-8859-1: Because 128 characters are not enough, ISO published a series of standards extending ASCII (ISO-8859-1 through ISO-8859-15). ISO-8859-1 is also the default charset Java uses when parsing in the web.
GB2312: Its full name is the "Chinese Character Coded Character Set for Information Interchange, Basic Set". Each character occupies two bytes; rows A1~A9 form the symbol area and rows B0~F7 form the Chinese character area.
GBK: Its full name is the "Chinese Character Internal Code Extension Specification". It is an extension of GB2312 and remains compatible with it, i.e. Chinese characters encoded with GB2312 can be decoded with GBK without mojibake; see the sketch after the GB18030 entry below.
GB18030: China's mandatory national standard. A character may occupy one, two or four bytes, and the encoding is compatible with GB2312, but it is not widely used.
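A minimal compatibility sketch, assuming the JRE ships the GB2312/GBK/GB18030 charsets (standard JDKs do): a character from the original GB2312 set encodes to the same bytes under all three, which is what "compatible" means here.

import java.util.Arrays;

public class GbFamilyDemo {
    public static void main(String[] args) throws Exception {
        String s = "汉"; // a character already present in GB2312 (code 0xBABA)
        System.out.println(Arrays.toString(s.getBytes("GB2312")));  // [-70, -70]
        System.out.println(Arrays.toString(s.getBytes("GBK")));     // [-70, -70], same bytes
        System.out.println(Arrays.toString(s.getBytes("GB18030"))); // [-70, -70], same bytes
    }
}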
UTF-16: Defines how Unicode (a "dictionary" through which all the languages of the world can be translated, and the basis of Java and XML) is stored in the computer. It is a conversion format that represents Unicode with two bytes, which makes processing simple and convenient, but it can occupy much more space than UTF-8, as the sketch below illustrates.
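A minimal sketch under a standard JDK comparing byte lengths, to illustrate the space trade-off just mentioned: UTF-16 costs two bytes even for ASCII, while CJK characters cost three bytes in UTF-8.

public class Utf16VsUtf8 {
    public static void main(String[] args) throws Exception {
        String ascii = "hello";
        String cjk = "汉字";
        System.out.println(ascii.getBytes("UTF-16BE").length); // 10 (2 bytes per char)
        System.out.println(ascii.getBytes("UTF-8").length);    // 5  (1 byte per char)
        System.out.println(cjk.getBytes("UTF-16BE").length);   // 4  (2 bytes per char)
        System.out.println(cjk.getBytes("UTF-8").length);      // 6  (3 bytes per char)
    }
}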
UTF-8: A variable-length encoding: characters in different code ranges have different encoded lengths, from 1 to 6 bytes per character.
Coding Principles for UTF-8:
If the character occupies a single byte, the highest bit is 0;
If a byte starts with 11, it is the first byte of a multi-byte character (the number of leading 1s equals the total number of bytes in the sequence);
If a byte starts with 10, it is not a first byte but a continuation byte; search backwards until you reach the first byte of the current character.
For example:
1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Description: UTF-8 encodes at most 31 bits of the actual character number, namely the bits marked x in the table above. After stripping the control bits (the leading marker bits of each byte), the x bits correspond one-to-one, in the same high-to-low order, with the bits of the character's Unicode code point. To convert Unicode to UTF-8, first remove the leading zeros, then choose the minimum number of UTF-8 bytes that can hold the remaining significant bits (standard ASCII needs only 7 bits, hence one byte). A worked sketch follows.
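As a worked example of the table, the sketch below (class and method names are illustrative) hand-encodes the three-byte case, which covers most CJK characters, and checks the result against the JDK's own encoder.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Encode {
    // 1110xxxx 10xxxxxx 10xxxxxx: distribute the 16 significant bits of a BMP code point
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        int cp = '张'; // U+5F20 needs 15 significant bits, so 3 UTF-8 bytes
        System.out.println(Arrays.toString(encode3(cp)));
        System.out.println(Arrays.toString("张".getBytes(StandardCharsets.UTF_8)));
        // Both print [-27, -68, -96], i.e. 0xE5 0xBC 0xA0
    }
}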
3. IO stream encoding and decoding
1) Read the file
If the file contains only ASCII, no mojibake occurs (this is also why Unicode escapes of the form \u5f20\u4e09 can be converted to any encoding without garbling). If the file uses some other encoding, say UTF-8, and no decoding charset is specified, Java decodes it with the platform default charset (GBK here) into Unicode (the JVM parses the GBK bytes and stores the result internally as UTF-16). Since our source files were written in UTF-8 but decoded as GBK, mojibake results. So in practice we should specify the charset explicitly when decoding, guaranteeing a correct Unicode (char) representation in memory; see the reading sketch below. For local files, the EncodingDetect jar package can be used to guess the encoding; it works on statistical conclusions, so it is not 100% correct, but its accuracy is above 99%. For web crawlers, cpdetector_1.0.10.jar can be used, which is likewise statistics-based.
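A minimal sketch of decoding with an explicit charset (the path is the same sample path used in the example below; everything else is standard JDK API):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void main(String[] args) throws Exception {
        String path = "F:/test/file1/file2.txt";
        // Decode explicitly as UTF-8 instead of relying on the platform default (e.g. GBK)
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // chars are now correct UTF-16 inside the JVM
            }
        }
    }
}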
Example:
Quick usage of EncodingDetect:
" F:/test/file1/file2.txt " ; // get the encoding format String encode = encodingdetect.getjavaencode (pathname
System. out . println (encode);
Encodingdetect.readfile (pathname, encode);
Obtaining the encoding with cpdetector (for web crawlers):
// Requires three third-party jars: antlr.jar, chardet.jar and cpdetector.jar;
// the detector classes come from info.monitorenter.cpdetector.io.*
public static String getFileEncode(String path) {
    /*
     * detector is the probe proxy: it delegates the detection task to concrete
     * instances of the detector implementation classes. cpdetector contains a
     * number of commonly used implementations that can be added through the add()
     * method, such as ParsingDetector, JChardetFacade, ASCIIDetector and
     * UnicodeDetector. The proxy returns the detected character set following the
     * rule "whichever detector first returns a non-null result wins".
     * cpdetector is based on statistical principles and is not guaranteed to be
     * completely correct.
     */
    CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
    /*
     * ParsingDetector can detect the encoding of HTML, XML and similar files or
     * character streams; the constructor parameter indicates whether the details
     * of the probing process are displayed (false = not displayed).
     */
    // detector.add(new ParsingDetector(false));
    /*
     * JChardetFacade wraps the jchardet library provided by Mozilla; it can detect
     * the encoding of most files, so this detector alone meets the needs of most
     * projects. If you are not at ease, you can add a few more detectors, such as
     * the ASCIIDetector and UnicodeDetector below.
     */
    detector.add(JChardetFacade.getInstance()); // needs antlr.jar and chardet.jar
    // ASCIIDetector is for ASCII detection
    // detector.add(ASCIIDetector.getInstance());
    // UnicodeDetector is for the Unicode family of encodings
    // detector.add(UnicodeDetector.getInstance());
    java.nio.charset.Charset charset = null;
    File f = new File(path);
    try {
        charset = detector.detectCodepage(f.toURI().toURL());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    if (charset != null)
        return charset.name();
    else
        return null;
}
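A hypothetical caller combining the detector above with an explicit-charset reader (the path is the sample path from earlier; the surrounding class is assumed):

String path = "F:/test/file1/file2.txt";
String enc = getFileEncode(path); // detect first
if (enc != null) {
    // then decode with the detected charset instead of the platform default
    BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), enc));
    // ... read as usual, then close the reader
}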
4. URL encoding and decoding: the principles are written up in the Java Web internals material; a minimal JDK sketch follows.
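A minimal sketch with the JDK's own URLEncoder/URLDecoder: each non-ASCII character is first converted to bytes in the given charset, and each byte is then written as %XX.

import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCodec {
    public static void main(String[] args) throws Exception {
        String s = URLEncoder.encode("张三", "UTF-8");
        System.out.println(s);                             // %E5%BC%A0%E4%B8%89
        System.out.println(URLDecoder.decode(s, "UTF-8")); // 张三
    }
}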