1. Why encodings exist: our basic unit of storage is the byte, but a single byte cannot represent every character of human language. Characters therefore have to be split into, or mapped onto, sequences of bytes by some agreed translation scheme, and such a scheme is what we call an encoding.
2. Common encoding formats
ASCII: Uses the low 7 bits of a byte, 128 values in all: 0-31 are control characters and 32-126 are printable characters. The highest bit of the byte is reserved for parity checking. A parity check is a method used during code transmission to verify whether an error has occurred; it comes in two flavors, odd check and even check. Odd check rule: the number of 1 bits in a correct byte must be odd; if it is not, the highest bit is set to 1. Even check rule: the number of 1 bits in a correct byte must be even; if it is not, the highest bit is set to 1. If this still seems vague, take the letter A, whose ASCII code is 1000001: with even parity the result is 01000001, and with odd parity it is 11000001. The parity scheme is agreed in advance; when the CPU reads stored data, it counts the 1 bits of the stored 8 bits again and checks that the result is consistent with the chosen parity. To a certain extent this can detect memory errors, but parity can only detect errors, not correct them, and it cannot detect double-bit errors (though the probability of a double-bit error is quite low). When an error is detected, the receiver asks the sender to retransmit the data.
Example:
During serial data transmission, interference may flip a 0 bit into a 1 (why not the other way round? because interference arrives as a pulse). When this happens, we say an error has occurred. Detecting errors in transmission is called error checking; removing errors once they are found is called error correction. The simplest error-checking method is the parity check: odd parity is generally used in synchronous transmission, and even parity in asynchronous transmission. A minimal parity sketch follows.
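Below is a minimal parity sketch in Java (the class and method names are illustrative, not from any library): it places an even or odd parity bit in the highest bit of a 7-bit ASCII value, reproducing the letter-A example above.

public class ParityDemo {
    // Returns the 8-bit value with an even-parity bit in the most significant bit.
    static int evenParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F); // count the 1 bits of the low 7 bits
        return (ones % 2 == 0) ? ascii7 : (ascii7 | 0x80);
    }

    // Returns the 8-bit value with an odd-parity bit in the most significant bit.
    static int oddParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F);
        return (ones % 2 == 1) ? ascii7 : (ascii7 | 0x80);
    }

    public static void main(String[] args) {
        int a = 'A'; // 1000001 in binary: two 1 bits, already even
        System.out.println(Integer.toBinaryString(evenParity(a))); // 1000001 (i.e. 01000001)
        System.out.println(Integer.toBinaryString(oddParity(a)));  // 11000001
    }
}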
ISO-8859-1: Because 128 characters are not enough, ISO published a series of standards extending ASCII (ISO-8859-1 through ISO-8859-15). ISO-8859-1 is also the default charset Java uses when parsing in the web.
GB2312: Its full name is the "Chinese Character Coded Character Set for Information Interchange, Basic Set". Each character occupies two bytes; rows A1~A9 form the symbol area and rows B0~F7 form the Chinese character area.
GBK: Its full name is the "Chinese Character Internal Code Extension Specification". It is an extension of GB2312 and remains compatible with it, i.e. Chinese characters encoded with GB2312 can be decoded with GBK without mojibake; see the sketch after the GB18030 entry below.
GB18030: China's mandatory national standard. A character may occupy one, two or four bytes, and the encoding is compatible with GB2312, but it is not widely used.
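A minimal compatibility sketch, assuming the JRE ships the GB2312/GBK/GB18030 charsets (standard JDKs do): a character from the original GB2312 set encodes to the same bytes under all three, which is what "compatible" means here.

import java.util.Arrays;

public class GbFamilyDemo {
    public static void main(String[] args) throws Exception {
        String s = "汉"; // a character already present in GB2312 (code 0xBABA)
        System.out.println(Arrays.toString(s.getBytes("GB2312")));  // [-70, -70]
        System.out.println(Arrays.toString(s.getBytes("GBK")));     // [-70, -70], same bytes
        System.out.println(Arrays.toString(s.getBytes("GB18030"))); // [-70, -70], same bytes
    }
}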
UTF-16: Defines how Unicode (a "dictionary" through which all the languages of the world can be translated, and the basis of Java and XML) is stored in the computer. It is a conversion format that represents Unicode with two bytes, which makes processing simple and convenient, but it can occupy much more space than UTF-8, as the sketch below illustrates.
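A minimal sketch under a standard JDK comparing byte lengths, to illustrate the space trade-off just mentioned: UTF-16 costs two bytes even for ASCII, while CJK characters cost three bytes in UTF-8.

public class Utf16VsUtf8 {
    public static void main(String[] args) throws Exception {
        String ascii = "hello";
        String cjk = "汉字";
        System.out.println(ascii.getBytes("UTF-16BE").length); // 10 (2 bytes per char)
        System.out.println(ascii.getBytes("UTF-8").length);    // 5  (1 byte per char)
        System.out.println(cjk.getBytes("UTF-16BE").length);   // 4  (2 bytes per char)
        System.out.println(cjk.getBytes("UTF-8").length);      // 6  (3 bytes per char)
    }
}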
UTF-8: A variable-length encoding: characters in different code ranges have different encoded lengths, from 1 to 6 bytes per character.
Coding Principles for UTF-8:
If the character occupies a single byte, the highest bit is 0;
If a byte starts with 11, it is the first byte of a multi-byte character (the number of leading 1s equals the total number of bytes in the sequence);
If a byte starts with 10, it is not a first byte but a continuation byte; search backwards until you reach the first byte of the current character.
For example:
1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Description: UTF-8 encodes at most 31 bits of the actual character number, namely the bits marked x in the table above. After stripping the control bits (the leading marker bits of each byte), the x bits correspond one-to-one, in the same high-to-low order, with the bits of the character's Unicode code point. To convert Unicode to UTF-8, first remove the leading zeros, then choose the minimum number of UTF-8 bytes that can hold the remaining significant bits (standard ASCII needs only 7 bits, hence one byte). A worked sketch follows.
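As a worked example of the table, the sketch below (class and method names are illustrative) hand-encodes the three-byte case, which covers most CJK characters, and checks the result against the JDK's own encoder.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Encode {
    // 1110xxxx 10xxxxxx 10xxxxxx: distribute the 16 significant bits of a BMP code point
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        int cp = '张'; // U+5F20 needs 15 significant bits, so 3 UTF-8 bytes
        System.out.println(Arrays.toString(encode3(cp)));
        System.out.println(Arrays.toString("张".getBytes(StandardCharsets.UTF_8)));
        // Both print [-27, -68, -96], i.e. 0xE5 0xBC 0xA0
    }
}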
3. IO stream encoding and decoding
1) Read the file
If the file contains only ASCII, no mojibake occurs (this is also why Unicode escapes of the form \u5f20\u4e09 can be converted to any encoding without garbling). If the file uses some other encoding, say UTF-8, and no decoding charset is specified, Java decodes it with the platform default charset (GBK here) into Unicode (the JVM parses the GBK bytes and stores the result internally as UTF-16). Since our source files were written in UTF-8 but decoded as GBK, mojibake results. So in practice we should specify the charset explicitly when decoding, guaranteeing a correct Unicode (char) representation in memory; see the reading sketch below. For local files, the EncodingDetect jar package can be used to guess the encoding; it works on statistical conclusions, so it is not 100% correct, but its accuracy is above 99%. For web crawlers, cpdetector_1.0.10.jar can be used, which is likewise statistics-based.
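A minimal sketch of decoding with an explicit charset (the path is the same sample path used in the example below; everything else is standard JDK API):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void main(String[] args) throws Exception {
        String path = "F:/test/file1/file2.txt";
        // Decode explicitly as UTF-8 instead of relying on the platform default (e.g. GBK)
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // chars are now correct UTF-16 inside the JVM
            }
        }
    }
}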
Example:
Quick usage of EncodingDetect:
" F:/test/file1/file2.txt " ; // get the encoding format String encode = encodingdetect.getjavaencode (pathname
System. out . println (encode);
Encodingdetect.readfile (pathname, encode);
Obtaining the encoding with cpdetector (for web crawlers):
// Requires three third-party jars: antlr.jar, chardet.jar and cpdetector.jar;
// the detector classes come from info.monitorenter.cpdetector.io.*
public static String getFileEncode(String path) {
    /*
     * detector is the probe proxy: it delegates the detection task to concrete
     * instances of the detector implementation classes. cpdetector contains a
     * number of commonly used implementations that can be added through the add()
     * method, such as ParsingDetector, JChardetFacade, ASCIIDetector and
     * UnicodeDetector. The proxy returns the detected character set following the
     * rule "whichever detector first returns a non-null result wins".
     * cpdetector is based on statistical principles and is not guaranteed to be
     * completely correct.
     */
    CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
    /*
     * ParsingDetector can detect the encoding of HTML, XML and similar files or
     * character streams; the constructor parameter indicates whether the details
     * of the probing process are displayed (false = not displayed).
     */
    // detector.add(new ParsingDetector(false));
    /*
     * JChardetFacade wraps the jchardet library provided by Mozilla; it can detect
     * the encoding of most files, so this detector alone meets the needs of most
     * projects. If you are not at ease, you can add a few more detectors, such as
     * the ASCIIDetector and UnicodeDetector below.
     */
    detector.add(JChardetFacade.getInstance()); // needs antlr.jar and chardet.jar
    // ASCIIDetector is for ASCII detection
    // detector.add(ASCIIDetector.getInstance());
    // UnicodeDetector is for the Unicode family of encodings
    // detector.add(UnicodeDetector.getInstance());
    java.nio.charset.Charset charset = null;
    File f = new File(path);
    try {
        charset = detector.detectCodepage(f.toURI().toURL());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    if (charset != null)
        return charset.name();
    else
        return null;
}
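A hypothetical caller combining the detector above with an explicit-charset reader (the path is the sample path from earlier; the surrounding class is assumed):

String path = "F:/test/file1/file2.txt";
String enc = getFileEncode(path); // detect first
if (enc != null) {
    // then decode with the detected charset instead of the platform default
    BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), enc));
    // ... read as usual, then close the reader
}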
4. URL encoding and decoding: the principles are written up in the Java Web internals material; a minimal JDK sketch follows.
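A minimal sketch with the JDK's own URLEncoder/URLDecoder: each non-ASCII character is first converted to bytes in the given charset, and each byte is then written as %XX.

import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCodec {
    public static void main(String[] args) throws Exception {
        String s = URLEncoder.encode("张三", "UTF-8");
        System.out.println(s);                             // %E5%BC%A0%E4%B8%89
        System.out.println(URLDecoder.decode(s, "UTF-8")); // 张三
    }
}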