Objective
Today, while working on the Test crawler project, I found a serious problem: when a crawled page is encoded as gb2312, the usual method of converting it to UTF-8 always produces garbled text. (Note: all crawled pages, whatever their original encoding, are converted to UTF-8 for storage.)
First, the problem arises
Using the method from the earlier article "Use HttpClient to crawl page information and save it locally", I crawled http://stock.10jqka.com.cn/zhuanti/hlw_list/ and found that the conversion method used until then (unknown encoding → UTF-8 encoding) always produced garbled output. I looked up a lot of material, but most of it did not apply. I finally worked out a solution, which I record here.
Second, the solution
1. Decode the gb2312 content as GBK (GBK is a superset of gb2312)
2. Re-encode the resulting string as UTF-8
The conversion uses GBK as an intermediate format, a bridge between gb2312 and UTF-8.
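The two steps above can be sketched as a small self-contained Java method. This is a minimal illustration of the bridge idea, not the project's actual crawler code; the class and method names are made up for this sketch:

```java
// Minimal sketch of the gb2312 -> GBK -> UTF-8 bridge described above.
// Class and method names are illustrative, not from the original project.
public class Gb2312ToUtf8 {
    public static byte[] convert(byte[] rawBytes) throws Exception {
        // Step 1: decode the raw page bytes with GBK (a superset of gb2312)
        String text = new String(rawBytes, "GBK");
        // Step 2: re-encode the decoded string as UTF-8
        return text.getBytes("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Simulate a crawled gb2312 page body
        byte[] gb2312Bytes = "汉字".getBytes("gb2312");
        byte[] utf8Bytes = Gb2312ToUtf8.convert(gb2312Bytes);
        System.out.println(new String(utf8Bytes, "UTF-8")); // prints 汉字
    }
}
```

Because every gb2312 byte sequence is also valid GBK, decoding with GBK never loses information here.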
Third, the specific approach
1. Open http://stock.10jqka.com.cn/zhuanti/hlw_list/ and view the page source: the declared encoding format is gb2312.
2. The conversion scheme used earlier in this project does not work for pages in gb2312 format. Its core source code was:
```java
public byte[] getContent(String url) throws Exception {
    this.get = new HttpGet(url);
    HttpResponse response = client.execute(this.get);
    HttpEntity entity = response.getEntity();
    byte[] bytes = EntityUtils.toByteArray(entity);
    String content = new String(bytes);
    // Default to UTF-8 encoding
    String charset = "utf-8";
    // Match the character encoding declared in the page; the original regex is
    // garbled in the source, so this is a plausible reconstruction that
    // captures the charset name in group 4
    Pattern pattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)?charset=([a-z0-9-]+)");
    Matcher matcher = pattern.matcher(content.toLowerCase());
    if (matcher.find()) {
        charset = matcher.group(4);
    }
    // Convert from the detected encoding to UTF-8
    String temp = new String(bytes, charset);
    byte[] contentData = temp.getBytes("utf-8");
    return contentData;
}
```
This scheme still produced garbled text for gb2312 pages. The core source code after the fix is:
```java
public byte[] getContent(String url) throws Exception {
    this.get = new HttpGet(url);
    HttpResponse response = client.execute(this.get);
    HttpEntity entity = response.getEntity();
    byte[] bytes = EntityUtils.toByteArray(entity);
    String content = new String(bytes);
    // Default to UTF-8 encoding
    String charset = "utf-8";
    // Match the character encoding declared in the page; the original regex is
    // garbled in the source, so this is a plausible reconstruction that
    // captures the charset name in group 4
    Pattern pattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)?charset=([a-z0-9-]+)");
    Matcher matcher = pattern.matcher(content.toLowerCase());
    if (matcher.find()) {
        charset = matcher.group(4);
        if (charset.equals("gb2312")) {
            // Decode via GBK (a superset of gb2312), then re-encode as UTF-8
            return new String(bytes, "GBK").getBytes("utf-8");
        }
    }
    // Convert from the detected encoding to UTF-8
    String temp = new String(bytes, charset);
    byte[] contentData = temp.getBytes("utf-8");
    return contentData;
}
```
With this change, the garbled text that appeared when converting gb2312-encoded pages to UTF-8 is gone.
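To make the charset-detection step testable on its own, here is a self-contained sketch of a sniffer helper. The regex here is my own simplified illustration (the article's original regex did not survive extraction), not production-grade HTML parsing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Self-contained sketch of the charset-sniffing step used by getContent.
// The regex is illustrative: it looks for a charset=... declaration
// anywhere in the (lowercased) page source.
public class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([a-z0-9-]+)");

    public static String sniff(String html) {
        Matcher m = CHARSET.matcher(html.toLowerCase());
        return m.find() ? m.group(1) : "utf-8"; // fall back to UTF-8
    }
}
```

For example, `sniff("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">")` yields `"gb2312"`, while a page with no declaration falls back to `"utf-8"`.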
Fourth, summary
Think more and dig deeper. This post only gives an engineering workaround and does not go into the underlying principles, but the problem suggests many interesting questions: how exactly do UTF-8, GBK, and gb2312 encode characters? Why does this particular conversion fix the garbling? These questions deserve deeper study. Since this article is mainly about the engineering solution, interested readers can explore them on their own. Thanks for reading!
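As one small, concrete data point for those questions (my own illustration, not from the original article): the same Chinese character occupies two bytes in gb2312 and GBK but three in UTF-8, which is why raw bytes decoded with the wrong charset turn into mojibake:

```java
// Byte lengths of one CJK character under the three encodings
public class EncodingLengths {
    public static void main(String[] args) throws Exception {
        String s = "汉";
        System.out.println(s.getBytes("gb2312").length); // 2 bytes
        System.out.println(s.getBytes("GBK").length);    // 2 bytes (GBK is a superset of gb2312)
        System.out.println(s.getBytes("UTF-8").length);  // 3 bytes
    }
}
```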
A previous article on the crawler garbling problem, "Web crawler garbled-text processing", covers this topic very well; fellow crawler developers who run into this kind of problem can use it as a good reference.