Notes on Writing a Web Crawler in Java (Part III: The Power of Jsoup)

In the previous part I downloaded pages with HttpClient; the next step is to extract the URLs from them. At first I used HtmlParser, but after a few days I discovered the jsoup package, which is very handy, so now I use Jsoup directly to fetch a page and extract the URLs inside it. Here is the code, shared below.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Note: don't name this class Jsoup, or it will shadow org.jsoup.Jsoup.
public class LinkExtractor {
    public static Set<String> extractLinks(String url) {
        Set<String> urls = new HashSet<>();
        try {
            // Download and parse the page
            Document doc = Jsoup.connect(url).get();
            // Select every anchor element that has an href attribute
            Elements links = doc.select("a[href]");
            System.out.println(links.size());
            for (Element link : links) {
                // "abs:href" resolves relative links against the page's base URL
                urls.add(link.attr("abs:href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return urls;
    }
}
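To try it out, you can add a small main method to the same class; this is just a minimal sketch, and the URL below is a placeholder, so substitute whatever page you want to crawl:

    public static void main(String[] args) {
        // Placeholder URL for illustration; replace with a real page
        Set<String> urls = LinkExtractor.extractLinks("https://example.com/");
        // Print every absolute URL found on the page
        for (String u : urls) {
            System.out.println(u);
        }
    }

Because extractLinks already resolves links with "abs:href", the printed URLs are absolute and can be fed straight back into the crawler as the next pages to download.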