A basic Java web crawler implementation

Source: Internet
Author: User
Tags: readline, java, web, stringbuffer

This is a basic web-search program. The search criteria (the starting URL, the maximum number of URLs to process, and the string to search for) are entered on the command line. The program visits URLs one by one and outputs the pages that match the search criteria. It is adapted from the prototype in The Art of Java; to make it easier to analyze, the GUI part has been removed and the code slightly modified to run on JDK 1.5. Based on this program, you can write Internet "crawlers" that collect images, e-mail addresses, file downloads, and so on (a rough sketch of one such extension appears after the source code).

First, let's look at a sample run of the program:


D:\java>javac SearchCrawler.java    (compile the program)

D:\java>java SearchCrawler http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp java

Start searching...
Result:
searchString=java
http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/reply.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/learn.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/download.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/article.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/myexample/jlGUIOverview.htm
http://127.0.0.1:8080/zz3zcwbwebhome/myexample/Proxooldoc/index.html
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=301
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=297
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=291
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=286
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=285
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=284
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=276
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=272

Another example:
D:\java>java SearchCrawler http://www.sina.com java
Start searching...
Result:
searchString=java
http://sina.com
http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a2
http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a8
http://redirect.sina.com/WWW/sinaHK/www.sina.com.hk class=a2
http://redirect.sina.com/WWW/sinaTW/www.sina.com.tw class=a8
http://redirect.sina.com/WWW/sinaUS/home.sina.com class=a8
http://redirect.sina.com/WWW/smsCN/sms.sina.com.cn/class=a2
http://redirect.sina.com/WWW/smsCN/sms.sina.com.cn/class=a3
http://redirect.sina.com/WWW/sinaNet/www.sina.net/class=a3


D:\java>
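Note that some of the Sina results carry a trailing " class=a2" or similar fragment. This is a side effect of the link-extraction regular expression used in retrieveLinks() (shown in the source below): when an href value is not quoted, the non-greedy capture only stops at the next quote or ">", so trailing attributes end up inside the captured link. Here is a minimal sketch illustrating that behavior; the LinkRegexDemo class and the sample HTML string are made up for illustration and are not part of the original program.

import java.util.regex.*;

// Hypothetical demo class, not part of the crawler itself.
public class LinkRegexDemo {
    public static void main(String[] args) {
        // Same pattern as retrieveLinks() in the crawler source below.
        Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]",
                Pattern.CASE_INSENSITIVE);
        // An unquoted href: the capture stops only at '>', so "class=a2" is included.
        Matcher m = p.matcher(
                "<a href=http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a2>CN</a>");
        if (m.find()) {
            System.out.println(m.group(1)); // prints the URL plus " class=a2"
        }
    }
}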
The following is the source code for the program:

import java.util.*;
import java.net.*;
import java.io.*;
import java.util.regex.*;

// A basic web search crawler.
public class SearchCrawler implements Runnable {

    /* disallowListCache caches, per host, the URLs that robots are not allowed
     * to search. The robots exclusion protocol places a robots.txt file in the
     * root directory of a web site to specify which paths on the site are off
     * limits to crawlers; the crawler should skip these areas during the
     * search. An example robots.txt:
     *
     *   # robots.txt for http://somehost.com/
     *   User-agent: *
     *   Disallow: /cgi-bin/
     *   Disallow: /registration   # disallow the registration page
     *   Disallow: /login
     */
    private HashMap<String, ArrayList<String>> disallowListCache =
            new HashMap<String, ArrayList<String>>();

    ArrayList<String> errorList = new ArrayList<String>(); // error messages
    ArrayList<String> result = new ArrayList<String>();    // search results

    String startUrl;       // URL at which the search starts
    int maxUrl;            // maximum number of URLs to process
    String searchString;   // the string to search for (English)
    boolean caseSensitive = false;  // whether the search is case sensitive
    boolean limitHost = false;      // whether to restrict the search to the start host

    public SearchCrawler(String startUrl, int maxUrl, String searchString) {
        this.startUrl = startUrl;
        this.maxUrl = maxUrl;
        this.searchString = searchString;
    }

    public ArrayList<String> getResult() {
        return result;
    }

    // Entry point of the search thread.
    public void run() {
        crawl(startUrl, maxUrl, searchString, limitHost, caseSensitive);
    }

    // Check the URL format; only HTTP URLs are processed.
    private URL verifyUrl(String url) {
        if (!url.toLowerCase().startsWith("http://"))
            return null;

        URL verifiedUrl = null;
        try {
            verifiedUrl = new URL(url);
        } catch (Exception e) {
            return null;
        }
        return verifiedUrl;
    }

    // Check whether robots are allowed to access the given URL.
    private boolean isRobotAllowed(URL urlToCheck) {
        String host = urlToCheck.getHost().toLowerCase(); // host of the URL
        // System.out.println("host=" + host);

        // Get the cached list of paths this host disallows.
        ArrayList<String> disallowList = disallowListCache.get(host);

        // If it is not cached yet, download and cache it.
        if (disallowList == null) {
            disallowList = new ArrayList<String>();
            try {
                URL robotsFileUrl = new URL("http://" + host + "/robots.txt");
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(robotsFileUrl.openStream()));

                // Read the robots file and build the list of disallowed paths.
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.indexOf("Disallow:") == 0) { // line starts with "Disallow:"
                        String disallowPath =
                                line.substring("Disallow:".length()); // the disallowed path

                        // Strip any trailing comment.
                        int commentIndex = disallowPath.indexOf("#");
                        if (commentIndex != -1) {
                            disallowPath = disallowPath.substring(0, commentIndex);
                        }
                        disallowPath = disallowPath.trim();
                        disallowList.add(disallowPath);
                    }
                }
                // Cache the paths this host disallows.
                disallowListCache.put(host, disallowList);
            } catch (Exception e) {
                // The site has no robots.txt in its root directory: allow everything.
                return true;
            }
        }

        String file = urlToCheck.getFile();
        System.out.println("file getFile()=" + file);
        for (int i = 0; i < disallowList.size(); i++) {
            String disallow = disallowList.get(i);
            if (file.startsWith(disallow)) {
                return false;
            }
        }
        return true;
    }

    // Download the page at the given URL.
    private String downloadPage(URL pageUrl) {
        try {
            // Open a connection to the URL for reading.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(pageUrl.openStream()));

            // Read the page into a buffer.
            String line;
            StringBuffer pageBuffer = new StringBuffer();
            while ((line = reader.readLine()) != null) {
                pageBuffer.append(line);
            }
            return pageBuffer.toString();
        } catch (Exception e) {
        }
        return null;
    }

    // Remove "www" from the URL.
    private String removeWwwFromUrl(String url) {
        int index = url.indexOf("://www.");
        if (index != -1) {
            return url.substring(0, index + 3) + url.substring(index + 7);
        }
        return url;
    }

    // Parse the page and collect its links.
    private ArrayList<String> retrieveLinks(URL pageUrl, String pageContents,
            HashSet<String> crawledList, boolean limitHost) {
        // Compile the link-matching pattern.
        Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageContents);

        ArrayList<String> linkList = new ArrayList<String>();
        while (m.find()) {
            String link = m.group(1).trim();

            if (link.length() < 1) {
                continue;
            }
            // Skip links that point back into this page.
            if (link.charAt(0) == '#') {
                continue;
            }
            if (link.indexOf("mailto:") != -1) {
                continue;
            }
            if (link.toLowerCase().indexOf("javascript") != -1) {
                continue;
            }

            if (link.indexOf("://") == -1) {
                if (link.charAt(0) == '/') { // handle an absolute path
                    link = "http://" + pageUrl.getHost() + ":"
                            + pageUrl.getPort() + link;
                } else { // handle a relative path
                    String file = pageUrl.getFile();
                    if (file.indexOf('/') == -1) {
                        link = "http://" + pageUrl.getHost() + ":"
                                + pageUrl.getPort() + "/" + link;
                    } else {
                        String path = file.substring(0, file.lastIndexOf('/') + 1);
                        link = "http://" + pageUrl.getHost() + ":"
                                + pageUrl.getPort() + path + link;
                    }
                }
            }

            // Strip any fragment identifier.
            int index = link.indexOf('#');
            if (index != -1) {
                link = link.substring(0, index);
            }

            link = removeWwwFromUrl(link);

            URL verifiedLink = verifyUrl(link);
            if (verifiedLink == null) {
                continue;
            }

            /* If the search is limited to one host, skip URLs on other hosts. */
            if (limitHost && !pageUrl.getHost().toLowerCase()
                    .equals(verifiedLink.getHost().toLowerCase())) {
                continue;
            }

            // Skip links that have already been processed.
            if (crawledList.contains(link)) {
                continue;
            }

            linkList.add(link);
        }
        return linkList;
    }

    // Search the downloaded page contents for the search string.
    private boolean searchStringMatches(String pageContents, String searchString,
            boolean caseSensitive) {
        String searchContents = pageContents;
        if (!caseSensitive) { // if case-insensitive, compare in lower case
            searchContents = pageContents.toLowerCase();
        }

        Pattern p = Pattern.compile("[\\s]+");
        String[] terms = p.split(searchString);
        for (int i = 0; i < terms.length; i++) {
            if (caseSensitive) {
                if (searchContents.indexOf(terms[i]) == -1) {
                    return false;
                }
            } else {
                if (searchContents.indexOf(terms[i].toLowerCase()) == -1) {
                    return false;
                }
            }
        }
        return true;
    }

    // Perform the actual search.
    public ArrayList<String> crawl(String startUrl, int maxUrls,
            String searchString, boolean limitHost, boolean caseSensitive) {
        System.out.println("searchString=" + searchString);

        HashSet<String> crawledList = new HashSet<String>();
        LinkedHashSet<String> toCrawlList = new LinkedHashSet<String>();

        if (maxUrls < 1) {
            errorList.add("Invalid Max URL value.");
            System.out.println("Invalid Max URL value.");
        }
        if (searchString.length() < 1) {
            errorList.add("Missing Search String.");
            System.out.println("Missing Search String.");
        }
        if (errorList.size() > 0) {
            System.out.println("err!!!");
            return errorList;
        }

        // Remove "www" from the start URL.
        startUrl = removeWwwFromUrl(startUrl);
        toCrawlList.add(startUrl);

        while (toCrawlList.size() > 0) {
            if (maxUrls != -1) {
                if (crawledList.size() == maxUrls) {
                    break;
                }
            }

            // Get the first URL from the to-crawl list.
            String url = toCrawlList.iterator().next();

            // Remove the URL from the to-crawl list.
            toCrawlList.remove(url);

            // Convert the string URL to a URL object.
            URL verifiedUrl = verifyUrl(url);

            // Skip the URL if robots are not allowed to access it.
            if (!isRobotAllowed(verifiedUrl)) {
                continue;
            }

            // Add the processed URL to the crawled list.
            crawledList.add(url);

            String pageContents = downloadPage(verifiedUrl);

            if (pageContents != null && pageContents.length() > 0) {
                // Collect the valid links on the page.
                ArrayList<String> links =
                        retrieveLinks(verifiedUrl, pageContents, crawledList, limitHost);
                toCrawlList.addAll(links);

                if (searchStringMatches(pageContents, searchString, caseSensitive)) {
                    result.add(url);
                    System.out.println(url);
                }
            }
        }
        return result;
    }

    // Main method.
    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("Usage: java SearchCrawler startUrl maxUrl searchString");
            return;
        }
        int max = Integer.parseInt(args[1]);
        SearchCrawler crawler = new SearchCrawler(args[0], max, args[2]);
        Thread search = new Thread(crawler);
        System.out.println("Start searching...");
        System.out.println("Result:");
        search.start();
    }
}
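Because the class implements Runnable and exposes getResult(), it can also be embedded in another program rather than run from the command line. The following is a minimal sketch of that usage; the class name and methods come from the code above, while the CrawlerDemo wrapper, the URL, and the limit of 50 URLs are placeholders for illustration only.

// Hypothetical wrapper class, not part of the original program.
public class CrawlerDemo {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder arguments: start URL, max URLs to process, search string.
        SearchCrawler crawler = new SearchCrawler(
                "http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp", 50, "java");
        Thread search = new Thread(crawler);
        search.start();
        search.join(); // wait for the crawl to finish before reading the results
        for (String url : crawler.getResult()) {
            System.out.println("matched: " + url);
        }
    }
}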
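As mentioned in the introduction, the same skeleton can be turned into crawlers that harvest e-mail addresses, images, downloads, and so on. As an illustration only, here is one way the page-matching step could be supplemented: a small helper that pulls e-mail addresses out of a downloaded page. The method name extractEmails and the simplified regular expression are assumptions made for this sketch, not part of the original program.

// Hypothetical helper, not part of the original program: collect e-mail
// addresses from a downloaded page. The pattern below is a simplified
// approximation, not a full RFC 5322 validator.
private ArrayList<String> extractEmails(String pageContents) {
    ArrayList<String> emails = new ArrayList<String>();
    Pattern p = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    Matcher m = p.matcher(pageContents);
    while (m.find()) {
        emails.add(m.group()); // each match is one e-mail address
    }
    return emails;
}

In crawl(), the returned list could then be accumulated instead of (or in addition to) calling searchStringMatches() on each page.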
