I recently started learning to write web crawlers in Java. There are plenty of tutorials online, but it took me a long time to understand other people's approaches.
I am writing down what I have learned recently to organize my own thinking.
The main tool is Jsoup. For details on how to use it, see http://blog.csdn.net/u012315428/article/details/51135640
Here is how to get all the hyperlinks in a web page:
package com.sohu;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Author: CJ
 * Find all hyperlinks
 */
public class Findallurl {
    public static void main(String[] args) {
        try {
            // Fetch and parse the Sohu news home page
            Document doc = Jsoup.connect("http://news.sohu.com/").get();
            // Select every <a> element that has an href attribute
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // "abs:href" returns the link as an absolute URL
                String strURL = link.attr("abs:href");
                // Print only URLs that start with the news.sohu.com prefix
                if (strURL.startsWith("http://news.sohu.com/")) {
                    System.out.println(strURL);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
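A note on the "abs:href" attribute key used in the listing: the "abs:" prefix asks Jsoup for the link as an absolute URL, resolved against the page's base address, so relative links such as "scroll/" come back in full. The sketch below only illustrates that idea with the standard library's java.net.URI. It is not how Jsoup does it internally, and the class name AbsHrefSketch and the method toAbsolute are hypothetical names chosen for this example.

```java
import java.net.URI;

class AbsHrefSketch {
    // Resolve an href against the page's base URL.
    // This is roughly the kind of result "abs:href" gives, shown with java.net.URI.
    static String toAbsolute(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://news.sohu.com/";
        // A relative path is appended to the base
        System.out.println(toAbsolute(base, "scroll/"));
        // A root-relative path replaces the base path
        System.out.println(toAbsolute(base, "/photo/"));
        // An already absolute URL is returned unchanged
        System.out.println(toAbsolute(base, "http://www.sohu.com/"));
    }
}
```

With plain link.attr("href") the crawler would get the value exactly as written in the page, so the startsWith check in the listing would miss every relative link.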
Running the program prints:
http://news.sohu.com/
http://news.sohu.com/mindiao/
http://news.sohu.com/scroll/
http://news.sohu.com/mindiao/
http://news.sohu.com/special.shtml
http://news.sohu.com/guoneixinwen.shtml
http://news.sohu.com/shehuixinwen.shtml
http://news.sohu.com/guojixinwen.shtml
http://news.sohu.com/matrix/
http://news.sohu.com/newsmaker_list/
http://news.sohu.com/photo/
http://news.sohu.com/wurenji/
http://news.sohu.com/#
http://news.sohu.com/#
http://news.sohu.com/#
http://news.sohu.com/20160414/n444127123.shtml
http://news.sohu.com/20160414/n444127800.shtml
http://news.sohu.com/20160414/n444193395.shtml
http://news.sohu.com/20160414/n444148450.shtml
http://news.sohu.com/20160414/n444133304.shtml
http://news.sohu.com/20160414/n444199124.shtml
http://news.sohu.com/20160413/n444107224.shtml
http://news.sohu.com/20160414/n444127800.shtml
http://news.sohu.com/20160413/n444105842.shtml
http://news.sohu.com/20160414/n444140620.shtml
http://news.sohu.com/20160414/n444126073.shtml
http://news.sohu.com/20160413/n444086783.shtml
http://news.sohu.com/20160414/n444187234.shtml
http://news.sohu.com/20160414/n444193015.shtml
http://news.sohu.com/20160414/n444207393.shtml
http://news.sohu.com/20160414/n444148450.shtml
http://news.sohu.com/20160414/n444193395.shtml
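
The output contains repeated links (for example http://news.sohu.com/mindiao/ appears twice) and several entries that differ from the home page only by a trailing "#". A crawler would normally want to visit each page once, so one possible next step is to collect the links into a set before using them. The following is a minimal sketch of that idea using only the standard library. The class name LinkFilter and the method keepUnique are hypothetical names for this example, and dropping the "#" fragment is my own assumption about what counts as the same page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LinkFilter {
    // Keep URLs that start with the given prefix, drop any "#fragment" part,
    // and remove duplicates while preserving the order of first appearance.
    static List<String> keepUnique(List<String> urls, String prefix) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            int hash = url.indexOf('#');
            String clean = hash >= 0 ? url.substring(0, hash) : url;
            if (clean.startsWith(prefix)) {
                seen.add(clean);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // A few links taken from the output above
        List<String> found = Arrays.asList(
                "http://news.sohu.com/",
                "http://news.sohu.com/mindiao/",
                "http://news.sohu.com/scroll/",
                "http://news.sohu.com/mindiao/",
                "http://news.sohu.com/#");
        for (String url : keepUnique(found, "http://news.sohu.com/")) {
            System.out.println(url);
        }
    }
}
```

In the crawler itself, the same effect could be had by adding each strURL to a LinkedHashSet inside the loop and printing the set afterwards.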