A colleague was using Java to crawl the title of arbitrary web pages. The trouble is that the charset declared in the meta tag in each page's head varies from page to page; utf-8, gbk, and gb2312 are all common.
Hand-written code to handle this quickly ran into more cases than anyone could anticipate, and the crawled titles kept coming out garbled. The challenges: a page may or may not have a meta tag; if it has one, the tag may be in uppercase or lowercase; even with case handled, there may be stray whitespace characters; in short, all kinds of unexpected variations. And yet the pages fetched by search engine crawlers can't be coming out garbled, can they?
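To make these pitfalls concrete, here is a rough sketch of what the hand-rolled approach tends to look like. It is not the colleague's actual code: the class name `CharsetSniffer`, the regular expression, the 4096-byte scan window, and the UTF-8 fallback are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Case-insensitive; tolerates whitespace around '=' and an optional quote.
    // Matches both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Guesses the charset from the meta tag near the start of the page. */
    static Charset sniff(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable whatever the page's real encoding is.
        String head = new String(html, 0, Math.min(html.length, 4096),
                StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return StandardCharsets.UTF_8; // assumed fallback when no usable meta tag
    }

    /** Decodes the whole page with the sniffed charset. */
    static String decode(byte[] html) {
        return new String(html, sniff(html));
    }

    public static void main(String[] args) {
        // Uppercase tags, single quotes, and spaces around '=' in one sample.
        byte[] page = ("<HTML><HEAD><META HTTP-EQUIV='Content-Type' "
                + "CONTENT='text/html; charset = GBK'>"
                + "<TITLE>\u4e2d\u6587</TITLE></HEAD></HTML>")
                .getBytes(Charset.forName("GBK"));
        System.out.println(sniff(page));
        System.out.println(decode(page));
    }
}
```

Even this version ignores the HTTP `Content-Type` header, byte order marks, and meta tags hidden inside comments or scripts, which is the kind of long tail that makes a maintained parser attractive.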
Searching Baidu turned up basically no solution, and Bing's results were mostly unreliable, so I had to switch on Lantern (the proxy tool) and go to Google, which suggested three options:
1. Check StackOverflow for a best answer.
http://stackoverflow.com/questions/10996726/encoding-of-response-is-incorrect-using-apache-httpclient
The StackOverflow answer said that neither the HttpComponents client nor Commons HttpClient supports this, but that Spring's RestTemplate can do the job. I looked into it and it seemed a bit dubious.
2. Model the HTML elements and extract the charset from the model.
http://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm
The Oracle XML Developer's Kit (XDK) ships an example class, AutoDetectEncoding.java, that works on the XML DOM, which made me very happy, although downloading the XDK and this example was a bit laborious. On closer comparison, though, the encoding declarations in XML and HTML, and the ways they are handled, are really different. HTML can be regarded as a special kind of XML, and both follow the DOM model, but the DOM levels run very deep, and I concluded this route was a detour.
3. Use a crawler or component that looks reliable, and it has to be Java.
http://www.huqiwen.com/2012/05/03/use-jsoup-analytics-html-document/
The author of this post also said that he originally used HTMLParser and then upgraded to jsoup, trading a bird gun for a cannon, as the saying goes. A cannon certainly beats a gun. Along the way I also found a post on CSDN by someone willing to share his own open source crawler on GitHub. The project page claimed it could do the job, but it crashes, so how could I use it? I can't bury a landmine in our code; I would rather leave the problem unsolved. I then tried jsoup and found it was exactly what I wanted.
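A minimal sketch of the jsoup approach follows, assuming the jsoup library is on the classpath. The class name `TitleFetcher`, the user agent string, and the timeout are illustrative choices, not taken from the original post. `Jsoup.connect(url).get()` fetches the page and chooses the charset from the HTTP header or the meta tag; `Jsoup.parse(InputStream, null, baseUri)` does the same detection on bytes already in hand when the charset argument is `null`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleFetcher {

    /** Fetches a live page and returns its title; jsoup picks the charset. */
    static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumed; some sites reject unknown agents
                .timeout(10_000)          // milliseconds
                .get();
        return doc.title();
    }

    /** Parses raw bytes; a null charset asks jsoup to detect it from the page. */
    static String parseTitle(byte[] html) throws IOException {
        Document doc = Jsoup.parse(new ByteArrayInputStream(html), null, "");
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        // Offline demo: a GBK-encoded page with an uppercase meta tag.
        byte[] page = ("<HTML><HEAD><META HTTP-EQUIV=\"Content-Type\" "
                + "CONTENT=\"text/html; charset=gbk\">"
                + "<TITLE>\u4e2d\u6587\u6807\u9898</TITLE></HEAD></HTML>")
                .getBytes(Charset.forName("GBK"));
        System.out.println(parseTitle(page));

        // Live fetch, only when a URL is supplied on the command line.
        if (args.length > 0) {
            System.out.println(fetchTitle(args[0]));
        }
    }
}
```

The byte-array variant is handy when the page has already been downloaded by another HTTP client and only the decoding and title extraction are needed.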
An example of using jsoup to fix garbled titles when crawling arbitrary web pages in Java