Crawler web page garbling solution (gb2312 → UTF-8)

Source: Internet
Author: User

Objective

Today, while testing a crawler project, I ran into a serious problem: when the crawled page is encoded in gb2312, converting it to UTF-8 with the usual method always produces garbled text. (Note: all crawled pages, regardless of their original encoding, are converted to UTF-8 before being stored.)

First, the problem arises

Using the method from my earlier post, "Use HttpClient to crawl pages and save them locally", I can fetch page content directly and save it to disk. However, when crawling http://stock.10jqka.com.cn/zhuanti/hlw_list/, the conversion method I had been using (unknown encoding → UTF-8) always produced garbled output. I looked up a lot of material, but most of it did not apply, and I eventually worked out a solution, which I record here.

Second, the solution

1. Convert the gb2312 content to GBK

2. Convert the GBK content to UTF-8

The conversion uses GBK as an intermediate encoding, a bridge between gb2312 and UTF-8, as sketched below.
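A minimal sketch of this two-step idea (my own illustration with explicit charsets, not the project's code; the method name gb2312ToUtf8 is made up):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Given the raw bytes of a page declared as gb2312,
    // decode them with GBK (a superset of gb2312), then re-encode as UTF-8.
    public static byte[] gb2312ToUtf8(byte[] rawBytes) {
        String text = new String(rawBytes, Charset.forName("GBK"));  // step 1: gb2312 bytes -> String via GBK
        return text.getBytes(StandardCharsets.UTF_8);                // step 2: String -> UTF-8 bytes
    }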

Third, the specific approach

1. Open http://stock.10jqka.com.cn/zhuanti/hlw_list/ and view the page source: the declared encoding is gb2312, as shown in the page's <meta> tag (charset=gb2312).

2. This project already had a conversion scheme, but it does not work for pages encoded in gb2312. Its core source code is shown below (the regex literal was mangled when this post was scraped, so the pattern shown is a reconstruction that captures the charset value as group 4):

    public byte[] getContent(String url) throws Exception {
        this.get = new HttpGet(url);
        HttpResponse response = client.execute(this.get);
        HttpEntity entity = response.getEntity();
        byte[] bytes = EntityUtils.toByteArray(entity);
        String content = new String(bytes);
        // The default is UTF-8 encoding
        String charset = "utf-8";
        // Match the character encoding declared in the page's <meta> tag
        // (reconstructed regex: the charset value is captured as group 4)
        Pattern pattern = Pattern.compile("(<meta[^>]*?)(charset)(\\s*=\\s*[\"']?)([a-z0-9_-]+)");
        Matcher matcher = pattern.matcher(content.toLowerCase());
        if (matcher.find()) {
            charset = matcher.group(4);
        }
        // Convert from the detected character encoding to UTF-8
        String temp = new String(bytes, charset);
        byte[] contentData = temp.getBytes("utf-8");
        return contentData;
    }

With that scheme, gb2312 pages still came out garbled. After applying the fix, the core source code is (using the same reconstructed regex as above):

    public byte[] getContent(String url) throws Exception {
        this.get = new HttpGet(url);
        HttpResponse response = client.execute(this.get);
        HttpEntity entity = response.getEntity();
        byte[] bytes = EntityUtils.toByteArray(entity);
        String content = new String(bytes);
        // The default is UTF-8 encoding
        String charset = "utf-8";
        // Match the character encoding declared in the page's <meta> tag
        // (reconstructed regex: the charset value is captured as group 4)
        Pattern pattern = Pattern.compile("(<meta[^>]*?)(charset)(\\s*=\\s*[\"']?)([a-z0-9_-]+)");
        Matcher matcher = pattern.matcher(content.toLowerCase());
        if (matcher.find()) {
            charset = matcher.group(4);
            if (charset.equals("gb2312")) {
                // Bridge through GBK: decode the bytes as GBK, then re-encode as UTF-8.
                // Note: getBytes() with no argument uses the JVM's default charset,
                // so this relies on the default being UTF-8.
                byte[] gbkBytes = new String(bytes, "GBK").getBytes();
                return new String(gbkBytes, "utf-8").getBytes();
            }
        }
        // Convert from the detected character encoding to UTF-8
        String temp = new String(bytes, charset);
        byte[] contentData = temp.getBytes("utf-8");
        return contentData;
    }
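For completeness, here is a rough end-to-end sketch of fetching that page with Apache HttpClient 4.x and saving the UTF-8 result locally. The class name SaveGb2312PageDemo and the output file name hlw_list.html are my own assumptions, and the GBK bridge is hard-coded only because this particular page declares gb2312 (it inlines the conversion rather than calling getContent, so it is self-contained):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class SaveGb2312PageDemo {
        public static void main(String[] args) throws Exception {
            String url = "http://stock.10jqka.com.cn/zhuanti/hlw_list/";
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                HttpEntity entity = response.getEntity();
                byte[] bytes = EntityUtils.toByteArray(entity);

                // The page declares gb2312, so decode via GBK and re-encode as UTF-8
                byte[] utf8 = new String(bytes, Charset.forName("GBK"))
                                      .getBytes(StandardCharsets.UTF_8);

                // Save the UTF-8 version locally
                Files.write(Paths.get("hlw_list.html"), utf8);
            }
        }
    }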

In this way, the garbling problem when converting gb2312 pages to UTF-8 is solved; decoding the bytes as GBK works because GBK is a superset of gb2312.
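A quick self-contained check of that point (my own example, not from the original post; it assumes the JDK's GB2312 and GBK charsets are available, as they are in standard OpenJDK builds):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Gb2312DecodeCheck {
        public static void main(String[] args) {
            String original = "互联网金融";                                      // sample Chinese text
            byte[] gb2312Bytes = original.getBytes(Charset.forName("GB2312"));  // bytes as a gb2312 page sends them

            // Decoding directly as UTF-8 produces mojibake
            System.out.println(new String(gb2312Bytes, StandardCharsets.UTF_8));

            // Decoding as GBK recovers the text, because GBK is a superset of gb2312
            System.out.println(new String(gb2312Bytes, Charset.forName("GBK")));
        }
    }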

Fourth, summary

Think more and dig deeper. What is given here is only an engineering fix, without going into the underlying principles, but the problem raises many interesting questions: how do UTF-8, GBK, and gb2312 actually encode characters? Why does this transformation solve the problem? These questions deserve careful study. Since this article is mainly about the engineering solution, interested readers can explore them on their own. Thanks to everyone for reading ~

Following up on the garbling topic, there is a very good article, "Web crawler garbled processing", that explains it well; friends doing crawler work who run into this kind of problem can use it as a reference.
