Objective
Today, while working on the Test crawler project, I found a serious problem: when a crawled page is encoded as gb2312, the usual method of converting it to UTF-8 always produces garbled text. (Note: all crawled pages, whatever their original encoding, are converted to UTF-8 for storage.)
First, the problem arises
Using the method from the earlier article "Use HttpClient to crawl page information and save it locally", I crawled http://stock.10jqka.com.cn/zhuanti/hlw_list/ and found that the conversion method used until then (unknown encoding → UTF-8 encoding) always produced garbled output. I looked up a lot of material, but most of it did not apply. I finally worked out a solution, which I record here.
Second, the solution
1. Decode the gb2312 content as GBK (GBK is a superset of gb2312)
2. Re-encode the resulting string as UTF-8
The conversion uses GBK as an intermediate format, a bridge between gb2312 and UTF-8.
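The two steps above can be sketched as a small self-contained Java method. This is a minimal illustration of the bridge idea, not the project's actual crawler code; the class and method names are made up for this sketch:

```java
// Minimal sketch of the gb2312 -> GBK -> UTF-8 bridge described above.
// Class and method names are illustrative, not from the original project.
public class Gb2312ToUtf8 {
    public static byte[] convert(byte[] rawBytes) throws Exception {
        // Step 1: decode the raw page bytes with GBK (a superset of gb2312)
        String text = new String(rawBytes, "GBK");
        // Step 2: re-encode the decoded string as UTF-8
        return text.getBytes("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Simulate a crawled gb2312 page body
        byte[] gb2312Bytes = "汉字".getBytes("gb2312");
        byte[] utf8Bytes = Gb2312ToUtf8.convert(gb2312Bytes);
        System.out.println(new String(utf8Bytes, "UTF-8")); // prints 汉字
    }
}
```

Because every gb2312 byte sequence is also valid GBK, decoding with GBK never loses information here.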
Third, the specific approach
1. Open http://stock.10jqka.com.cn/zhuanti/hlw_list/ and view the page source: the declared encoding format is gb2312.
2. The conversion scheme used earlier in this project does not work for pages in gb2312 format. Its core source code was:
```java
public byte[] getContent(String url) throws Exception {
    this.get = new HttpGet(url);
    HttpResponse response = client.execute(this.get);
    HttpEntity entity = response.getEntity();
    byte[] bytes = EntityUtils.toByteArray(entity);
    String content = new String(bytes);
    // Default to UTF-8 encoding
    String charset = "utf-8";
    // Match the character encoding declared in the page; the original regex is
    // garbled in the source, so this is a plausible reconstruction that
    // captures the charset name in group 4
    Pattern pattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)?charset=([a-z0-9-]+)");
    Matcher matcher = pattern.matcher(content.toLowerCase());
    if (matcher.find()) {
        charset = matcher.group(4);
    }
    // Convert from the detected encoding to UTF-8
    String temp = new String(bytes, charset);
    byte[] contentData = temp.getBytes("utf-8");
    return contentData;
}
```
This scheme still produced garbled text for gb2312 pages. The core source code after the fix is:
```java
public byte[] getContent(String url) throws Exception {
    this.get = new HttpGet(url);
    HttpResponse response = client.execute(this.get);
    HttpEntity entity = response.getEntity();
    byte[] bytes = EntityUtils.toByteArray(entity);
    String content = new String(bytes);
    // Default to UTF-8 encoding
    String charset = "utf-8";
    // Match the character encoding declared in the page; the original regex is
    // garbled in the source, so this is a plausible reconstruction that
    // captures the charset name in group 4
    Pattern pattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)?charset=([a-z0-9-]+)");
    Matcher matcher = pattern.matcher(content.toLowerCase());
    if (matcher.find()) {
        charset = matcher.group(4);
        if (charset.equals("gb2312")) {
            // Decode via GBK (a superset of gb2312), then re-encode as UTF-8
            return new String(bytes, "GBK").getBytes("utf-8");
        }
    }
    // Convert from the detected encoding to UTF-8
    String temp = new String(bytes, charset);
    byte[] contentData = temp.getBytes("utf-8");
    return contentData;
}
```
With this change, the garbled text that appeared when converting gb2312-encoded pages to UTF-8 is gone.
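To make the charset-detection step testable on its own, here is a self-contained sketch of a sniffer helper. The regex here is my own simplified illustration (the article's original regex did not survive extraction), not production-grade HTML parsing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Self-contained sketch of the charset-sniffing step used by getContent.
// The regex is illustrative: it looks for a charset=... declaration
// anywhere in the (lowercased) page source.
public class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([a-z0-9-]+)");

    public static String sniff(String html) {
        Matcher m = CHARSET.matcher(html.toLowerCase());
        return m.find() ? m.group(1) : "utf-8"; // fall back to UTF-8
    }
}
```

For example, `sniff("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">")` yields `"gb2312"`, while a page with no declaration falls back to `"utf-8"`.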
Fourth, summary
Think more and dig deeper. This post only gives an engineering workaround and does not go into the underlying principles, but the problem suggests many interesting questions: how exactly do UTF-8, GBK, and gb2312 encode characters? Why does this particular conversion fix the garbling? These questions deserve deeper study. Since this article is mainly about the engineering solution, interested readers can explore them on their own. Thanks for reading!
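As one small, concrete data point for those questions (my own illustration, not from the original article): the same Chinese character occupies two bytes in gb2312 and GBK but three in UTF-8, which is why raw bytes decoded with the wrong charset turn into mojibake:

```java
// Byte lengths of one CJK character under the three encodings
public class EncodingLengths {
    public static void main(String[] args) throws Exception {
        String s = "汉";
        System.out.println(s.getBytes("gb2312").length); // 2 bytes
        System.out.println(s.getBytes("GBK").length);    // 2 bytes (GBK is a superset of gb2312)
        System.out.println(s.getBytes("UTF-8").length);  // 3 bytes
    }
}
```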
A previous article on the crawler garbling problem, "Web crawler garbled-text processing", covers this topic very well; fellow crawler developers who run into this kind of problem can use it as a good reference.