Fixing garbled page titles when crawling with Java: a Jsoup example

Source: Internet
Author: User

A colleague was using Java to crawl the title of arbitrary web pages. The charset declared in the meta tag of each page's head varies widely; common values include utf-8, gbk, and gb2312.

He wrote his own handling code, but soon found there were too many cases to anticipate, and the crawled titles kept coming out garbled. The difficulties: a page may or may not have a meta tag; if it does, the tag may be uppercase or lowercase; and even once case is handled, there may be stray whitespace. In short, all kinds of surprises. Yet pages fetched by search engine crawlers do not come out garbled, so there must be a way.
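As a rough illustration of the cases listed above, here is a minimal sketch using only the Java standard library. It is not the colleague's code: the class name MetaCharsetSniffer, its methods, the 4096-byte scan window, and the UTF-8 fallback are all assumptions made for this example. It also ignores the HTTP Content-Type header and byte order marks, which a real crawler would need to check as well.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses a page's charset from its <meta> tag.
class MetaCharsetSniffer {
    // Finds charset=... inside a <meta> tag, ignoring case and stray whitespace.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Charset sniffCharset(byte[] html) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // stays readable whatever the real encoding of the page is.
        int len = Math.min(html.length, 4096); // assumed scan window
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return StandardCharsets.UTF_8; // assumed fallback when no meta is found
    }

    static String extractTitle(byte[] html) {
        String text = new String(html, sniffCharset(html));
        Matcher m = TITLE.matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        // Uppercase tag, extra whitespace around '=', GBK-encoded title.
        String page = "<html><head><META http-equiv=\"Content-Type\" "
                + "content=\"text/html; CHARSET = gbk\">"
                + "<title> \u4f60\u597d </title></head></html>";
        System.out.println(extractTitle(page.getBytes(Charset.forName("GBK"))));
    }
}
```

Even this small sketch needs one regular expression just to cover case and whitespace, and it still misses pages that declare their encoding only in the HTTP headers, which is why an existing library is attractive.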

Searching Baidu turned up basically no solution, and Bing was not much more reliable, so I had to switch on the "blue light" proxy and search Google, which suggested three options:

1. Check StackOverflow for a best answer.

http://stackoverflow.com/questions/10996726/encoding-of-response-is-incorrect-using-apache-httpclient

The StackOverflow answers said that neither Apache HttpComponents HttpClient nor Commons HttpClient supports this, but that Spring's RestTemplate can do it. I looked into it and was not convinced.

2. Parse the HTML into a DOM model and extract the element from the model.

http://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm

The examples in Oracle's XML Developer's Kit (XDK) include an AutoDetectEncoding.java class that operates on an XML DOM. I was very pleased to find it, but downloading the XDK and this example was a bit laborious. On closer comparison, XML and HTML declare their encodings in quite different ways. HTML can loosely be regarded as a special kind of XML, and both follow the DOM model, but the DOM has several levels and the waters run deep. This route turned out to be a detour.

3. Use an existing crawler or component that looks reliable, and it has to be Java.

http://www.huqiwen.com/2012/05/03/use-jsoup-analytics-html-document/

The author of that post said he originally used HTMLParser and then switched everything over to Jsoup, trading a shotgun for a cannon, so to speak. Along the way I also found a post on CSDN from a user offering his own open-source crawler on GitHub. Its description said it could do the job, but it would crash, so how could I use it? I would rather leave the problem unsolved than plant a landmine. I tried Jsoup and found it was exactly what I was looking for.
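A minimal sketch of the Jsoup approach, assuming the jsoup library is on the classpath. The class name JsoupTitle, the helper names, and the 10-second timeout are illustrative choices, not taken from the original post. Passing null as the charset to Jsoup.parse asks jsoup to work out the encoding itself from a byte order mark or the page's meta tag, falling back to UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical wrapper around jsoup for reading page titles.
class JsoupTitle {
    // Parses raw bytes; a null charset lets jsoup detect the encoding.
    static String titleOfBytes(byte[] html) {
        try {
            Document doc = Jsoup.parse(new ByteArrayInputStream(html), null, "");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches a live page; jsoup also consults the HTTP Content-Type header.
    static String titleOfUrl(String url) throws IOException {
        return Jsoup.connect(url)
                .timeout(10_000) // assumed timeout in milliseconds
                .get()
                .title();
    }

    public static void main(String[] args) {
        // Uppercase tag, extra whitespace around '=', GBK-encoded title.
        String page = "<html><head><META http-equiv=\"Content-Type\" "
                + "content=\"text/html; CHARSET = gbk\">"
                + "<title> \u4f60\u597d </title></head></html>";
        System.out.println(titleOfBytes(page.getBytes(Charset.forName("GBK"))));
    }
}
```

The case, whitespace, and missing-meta handling that made the hand-written code fragile is left to the library here, and the calling code shrinks to a couple of lines.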
