Java Crawler in Practice (1): Crawl All the Links on a Website


Preface: Before writing this article, I read a few similar crawler write-ups. Some of them were built around a queue, which I did not find very intuitive, and others only made a single request and parsed that one page, without ever crawling further; can that really be called a crawler? So I wrote a simple crawler based on my own ideas. The test case is to automatically grab all the links on my blog site (http://www.zifangsky.cn).

One: Introduction to the Algorithm

In terms of approach, the program uses a breadth-first algorithm: it issues a GET request to each link that has not been traversed yet, parses the returned page with a regular expression, takes out any links that have not been seen before, adds them to the collection, and then repeats the process in the next pass.
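As a minimal, self-contained sketch of that per-page step (not the full program, which appears in section two), the snippet below issues a single GET request and pulls href values out of the response with a simplified regular expression. The class name SinglePageFetch, the 2500 ms timeouts, and the simplified pattern are illustrative assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePageFetch {
    public static void main(String[] args) throws Exception {
        // Illustrative page to fetch; any reachable URL would do
        URL url = new URL("http://www.zifangsky.cn");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(2500); // assumed timeout in milliseconds
        connection.setReadTimeout(2500);

        Set<String> links = new LinkedHashSet<String>();
        // Simplified href pattern; the full program uses a slightly stricter one
        Pattern hrefPattern = Pattern.compile("href=[\"']([^\"']+)[\"']");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // collect every href value found on this line
                Matcher matcher = hrefPattern.matcher(line);
                while (matcher.find()) {
                    links.add(matcher.group(1).trim());
                }
            }
        }
        System.out.println("Found " + links.size() + " raw links: " + links);
    }
}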

Concretely, the implementation uses a Map<String, Boolean>, whose key-value pairs are a link and a flag saying whether that link has been traversed. The program keeps two such maps, oldMap and newMap. The initial link goes into oldMap. For every link in oldMap whose flag is false, the program issues a request, parses the page, and uses a regular expression to pick out the links under <a> tags. If such a link appears in neither oldMap nor newMap, it is a new link; and if it also points to the target site, it is put into newMap. Once the page has been fully parsed, the value of the current link in oldMap is set to true to mark it as traversed. Finally, when every link in oldMap has been traversed, the program checks newMap: if it is not empty, this pass produced new links, so they are added to oldMap and the traversal recurses; if it is empty, the loop can no longer produce new links, the task ends, and the link collection oldMap is returned.
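To make the oldMap/newMap bookkeeping easier to follow, here is a stripped-down skeleton of that loop. The class name TwoMapSkeleton and the extractLinks(String url) helper are hypothetical placeholders for the HTTP-and-regex work, not code from the program in section two:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TwoMapSkeleton {
    // Hypothetical helper: fetch the page at `url` and return the links found on it
    static Set<String> extractLinks(String url) {
        throw new UnsupportedOperationException("illustration only");
    }

    static Map<String, Boolean> crawl(String host, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();
        for (Map.Entry<String, Boolean> entry : oldMap.entrySet()) {
            if (entry.getValue()) {
                continue; // already traversed
            }
            for (String found : extractLinks(entry.getKey())) {
                // keep only unseen links that belong to the target site
                if (!oldMap.containsKey(found) && !newMap.containsKey(found)
                        && found.startsWith(host)) {
                    newMap.put(found, false);
                }
            }
            entry.setValue(true); // mark the current link as traversed
        }
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);      // this pass produced new links
            return crawl(host, oldMap); // recurse until a pass finds nothing new
        }
        return oldMap; // no new links: the whole site has been covered
    }
}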

Two: Program Implementation

With the idea above laid out, and the key points commented in the code, there is not much more to say; the code is as follows:

package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawlerDemo {

    public static void main(String[] args) {
        WebCrawlerDemo webCrawlerDemo = new WebCrawlerDemo();
        webCrawlerDemo.myPrint("http://www.zifangsky.cn");
    }

    public void myPrint(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // key-value pairs: link -> whether it has been traversed
        String oldLinkHost = ""; // host
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*"); // for example: http://www.zifangsky.cn
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }
        oldMap.put(baseUrl, false);
        oldMap = crawlLinks(oldLinkHost, oldMap);
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            System.out.println("Link: " + mapping.getKey());
        }
    }

    /**
     * Crawls all the links a site can reach. The idea is a breadth-first algorithm:
     * keep issuing GET requests to links that have not been traversed yet, until a
     * full pass over the collection turns up no new link, which means no new links
     * can be found and the task ends.
     *
     * @param oldLinkHost the domain, e.g. http://www.zifangsky.cn
     * @param oldMap      the collection of links to traverse
     * @return the collection of all crawled links
     */
    private Map<String, Boolean> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();
        String oldLink = "";
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            System.out.println("Link: " + mapping.getKey() + " -------- Checked: " + mapping.getValue());
            // if it has not been traversed yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // issue a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500); // timeouts in milliseconds
                    connection.setReadTimeout(2500);
                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = Pattern.compile("<a.*?href=[\"']?((https?://)?/?[^\"']+)[\"']?.*?>(.+)</a>");
                        Matcher matcher = null;
                        while ((line = reader.readLine()) != null) {
                            matcher = pattern.matcher(line);
                            if (matcher.find()) {
                                String newLink = matcher.group(1).trim(); // the link
                                // String title = matcher.group(3).trim(); // the link title
                                // if the obtained link does not start with http, make it absolute
                                if (!newLink.startsWith("http")) {
                                    if (newLink.startsWith("/"))
                                        newLink = oldLinkHost + newLink;
                                    else
                                        newLink = oldLinkHost + "/" + newLink;
                                }
                                // strip a trailing / from the link
                                if (newLink.endsWith("/"))
                                    newLink = newLink.substring(0, newLink.length() - 1);
                                // de-duplicate, and discard links to other sites
                                if (!oldMap.containsKey(newLink) && !newMap.containsKey(newLink)
                                        && newLink.startsWith(oldLinkHost)) {
                                    // System.out.println("temp2: " + newLink);
                                    newMap.put(newLink, false);
                                }
                            }
                        }
                    }
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // new links appeared, continue traversing
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            oldMap.putAll(crawlLinks(oldLinkHost, oldMap)); // will not cause duplicate key-value pairs, because of how Map works
        }
        return oldMap;
    }
}
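To illustrate how the normalization steps inside crawlLinks behave, the short standalone snippet below applies the same three rules (prepend the host to relative links, strip a trailing slash, keep only same-host links) to a few made-up href values; the sample inputs and the class name LinkNormalizationDemo are assumptions for illustration only:

import java.util.Arrays;
import java.util.List;

public class LinkNormalizationDemo {
    public static void main(String[] args) {
        String host = "http://www.zifangsky.cn"; // same host as in the crawler above
        // Made-up href values of the kinds the regex can capture
        List<String> samples = Arrays.asList(
                "/about/",                              // site-relative
                "category/java",                        // page-relative
                "http://www.zifangsky.cn/archives/12",  // absolute, same host
                "https://example.org/external");        // absolute, other site

        for (String newLink : samples) {
            // 1. links that do not start with http get the host prepended
            if (!newLink.startsWith("http")) {
                if (newLink.startsWith("/"))
                    newLink = host + newLink;
                else
                    newLink = host + "/" + newLink;
            }
            // 2. strip a trailing /
            if (newLink.endsWith("/"))
                newLink = newLink.substring(0, newLink.length() - 1);
            // 3. only links on the target host would be added to newMap
            boolean kept = newLink.startsWith(host);
            System.out.println(newLink + " -> kept: " + kept);
        }
    }
}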

Three: Final Test Results


[Screenshot: test output (http://s3.51cto.com/wyfs02/M01/78/B9/wKiom1aCJ6LjWcotAAFyTOkuY9M691.png)]

PS: Using recursion this way is actually not ideal, because if the site has many pages, a long-running program will consume a lot of memory. But since my blog site does not have many pages, the results are acceptable.
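If the recursion is a concern, the passes can also be driven by a plain loop. The sketch below is just an illustration of that loop-based alternative, not part of the program above, and it again hides the HTTP-and-regex work behind a hypothetical extractLinks helper:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class IterativeCrawl {
    // Hypothetical helper, as in the earlier skeleton: fetch `url` and return its links
    static Set<String> extractLinks(String url) {
        throw new UnsupportedOperationException("illustration only");
    }

    static Map<String, Boolean> crawl(String host, String baseUrl) {
        Map<String, Boolean> visited = new LinkedHashMap<String, Boolean>();
        visited.put(baseUrl, false);
        boolean foundNew = true;
        while (foundNew) { // one iteration per pass, instead of one recursive call
            foundNew = false;
            Map<String, Boolean> discovered = new LinkedHashMap<String, Boolean>();
            for (Map.Entry<String, Boolean> entry : visited.entrySet()) {
                if (entry.getValue()) {
                    continue; // already traversed
                }
                for (String link : extractLinks(entry.getKey())) {
                    if (!visited.containsKey(link) && !discovered.containsKey(link)
                            && link.startsWith(host)) {
                        discovered.put(link, false);
                    }
                }
                entry.setValue(true);
            }
            if (!discovered.isEmpty()) {
                visited.putAll(discovered);
                foundNew = true; // another pass is needed for the new links
            }
        }
        return visited;
    }
}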

This article is from "Zifangsky's personal blog"; please keep this source when reposting: http://983836259.blog.51cto.com/7311475/1729513
