crawler4j crawls pages well out of the box, and because the results can be parsed with Jsoup, anyone familiar with jQuery-style selectors can work with them. However, when a response does not specify the page encoding, crawler4j parses the content into garbled text, which is very annoying. While stuck on this, I came across an old blog post that solves the problem: change how contentData is assigned in Page.load(). That was a great relief. The problem and its solution follow.
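To see why the text comes out garbled, here is a minimal sketch using only the JDK. It assumes the server sends GBK bytes and the crawler decodes them as UTF-8. The class and method names (GarbledDecodeDemo, decode) are hypothetical and are not part of crawler4j.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when GBK bytes are decoded as UTF-8.
final class GarbledDecodeDemo {

    /** Decodes raw response bytes with the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] raw = original.getBytes(gbk); // the bytes a GBK page would send

        // Wrong charset: the bytes are not valid UTF-8, so the text is garbled.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original)); // false
        // Right charset: the text round-trips intact.
        System.out.println(decode(raw, gbk).equals(original)); // true
    }
}
```

The bytes themselves are fine. Only the charset used to turn them back into text is wrong, which is why the fix targets the decoding step.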
The modified Page.load() method is as follows:
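// Types used: org.apache.http.Header, org.apache.http.HttpEntity,
// org.apache.http.entity.ContentType, org.apache.http.util.EntityUtils,
// java.nio.charset.Charset
public void load(HttpEntity entity) throws Exception {
    contentType = null;
    Header type = entity.getContentType();
    if (type != null) {
        contentType = type.getValue();
    }
    contentEncoding = null;
    Header encoding = entity.getContentEncoding();
    if (encoding != null) {
        contentEncoding = encoding.getValue();
    }
    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset != null) {
        contentCharset = charset.displayName();
    } else {
        contentCharset = "utf-8";
    }
    // Original code:
    // contentData = EntityUtils.toByteArray(entity);
    // Modified code: GBK is applied when the response declares no charset.
    // getBytes() re-encodes with the platform default charset, which should
    // match contentCharset (UTF-8 in that case).
    contentData = EntityUtils.toString(entity, Charset.forName("GBK")).getBytes();
}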
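The fix above hard-codes GBK and relies on the platform default charset when calling getBytes(). A slightly more defensive variant is sketched below, under two assumptions: the charset declared in the Content-Type header (if any) is available as a string, and the re-encoded bytes should always be UTF-8. ContentTranscoder and toUtf8 are hypothetical names for illustration and are not crawler4j or HttpClient APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes page bytes as UTF-8, whatever the platform default is.
final class ContentTranscoder {

    /**
     * Re-encodes raw page bytes as UTF-8.
     *
     * @param raw             bytes exactly as received from the server
     * @param declaredCharset charset name from the Content-Type header, or null if absent
     * @param fallback        charset to assume when none is declared (GBK in this article)
     */
    static byte[] toUtf8(byte[] raw, String declaredCharset, Charset fallback) {
        Charset source = fallback;
        if (declaredCharset != null) {
            try {
                source = Charset.forName(declaredCharset);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(gbk);

        // No charset declared, so the GBK fallback is used.
        byte[] utf8 = toUtf8(raw, null, gbk);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

If something like this were used inside Page.load(), contentCharset would also need to be set to "utf-8" so that later parsing decodes contentData with the same charset that produced it.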