When crawling a web site with Jsoup, fetching directly with `conn = Jsoup.connect(url).timeout(...).get();` works fine for some sites.
However, other sites respond with a 403 error. 403 is a common status code during site access: it means the server understood the client's request but refuses to fulfil it, usually because of permissions set on the file or directory on the server.
The fix almost always comes down to one of these request attributes: User-Agent, Referer, token, or Cookie.
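Where each of those four attributes goes in an HTTP request can be sketched with the JDK's own HttpURLConnection, without Jsoup. The URL, cookie value, and token below are placeholders, not taken from any real site; the request is only built and inspected, never actually sent:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical target; nothing is fetched here --
        // we only build the request and inspect the headers we set.
        URL url = new URL("http://example.com/page");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // The four usual suspects behind a 403:
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)");
        conn.setRequestProperty("Referer", "http://example.com/");     // some sites check the referring page
        conn.setRequestProperty("Cookie", "session=abc123");           // placeholder session cookie
        conn.setRequestProperty("Authorization", "Bearer some-token"); // placeholder token

        // The JDK hides security-sensitive fields (e.g. Authorization)
        // from getRequestProperty, so we only read back the plain ones.
        System.out.println(conn.getRequestProperty("User-Agent"));
        System.out.println(conn.getRequestProperty("Referer"));
    }
}
```

Jsoup exposes the same knobs through its own Connection API (`userAgent`, `referrer`, `cookie`, `header`), which the snippet below uses.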
So we add headers that mimic a real browser to the connection:
    conn = Jsoup.connect(url).timeout(5000);
    conn.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    conn.header("Accept-Encoding", "gzip, deflate, sdch");
    conn.header("Accept-Language", "zh-CN,zh;q=0.8");
    conn.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");
Then call `conn.get()` and we can retrieve the page normally.
For reference, this is the exception Jsoup throws when the server rejects the request:

    org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403