Java Web Crawler


I remember that when I was first looking for a job, a classmate next door mentioned that he had built a web crawler for an interview, and at the time my admiration for him was boundless. Later, while working on image search, I needed a large number of test images, which gave me the idea of crawling book cover images from Amazon. Drawing on some earlier write-ups from the Internet, I put together a simple but usable crawler, described below.


A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web on behalf of a search engine and is an important component of one.

A traditional crawler starts from the URLs of one or more seed pages, continuously extracts new URLs from each page it visits, and adds them to the queue until a stop condition is met. For vertical search, a focused crawler, which only crawls pages related to a specific topic, is the better fit.
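The listings that follow assume a small CrawlerUrl class that pairs a URL string with its crawl depth. The original article never shows it, so here is a minimal sketch of what such a class might look like:

Java code

    // Hypothetical reconstruction of the CrawlerUrl holder assumed by the listings below.
    import java.net.MalformedURLException;
    import java.net.URL;

    public class CrawlerUrl {
        private final String urlString; // the raw URL text
        private final int depth;        // how many links away from the seed URL this is
        private URL url;                // parsed lazily on first use

        public CrawlerUrl(String urlString, int depth) {
            this.urlString = urlString;
            this.depth = depth;
        }

        public String getUrlString() {
            return urlString;
        }

        public int getDepth() {
            return depth;
        }

        public URL getUrl() {
            if (url == null) {
                try {
                    url = new URL(urlString);
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                }
            }
            return url;
        }

        @Override
        public String toString() {
            return urlString;
        }
    }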

The core code of this crawler is as follows:

Java code

 
 
    public void crawl() throws Throwable {
        while (continueCrawling()) {
            CrawlerUrl url = getNextUrl(); // get the next URL in the queue to be crawled
            if (url != null) {
                printCrawlInfo();
                String content = getContent(url); // fetch the text of the page

                // A focused crawler only keeps pages related to its topic;
                // simple regular-expression matching is used here.
                if (isContentRelevant(content, this.regexpSearchPattern)) {
                    saveContent(url, content); // save the page locally
                    // Extract the links in the page content and add them to the queue
                    Collection<String> urlStrings = extractUrls(content, url);
                    addUrlsToUrlQueue(url, urlStrings);
                } else {
                    System.out.println(url + " is not relevant, ignoring ...");
                }

                // Pause between requests to avoid being blocked by the target site
                Thread.sleep(this.delayBetweenUrls);
            }
        }
        closeOutputStream();
    }

The crawl loop relies on several core methods: getNextUrl, getContent, isContentRelevant, extractUrls, and addUrlsToUrlQueue. First, look at getNextUrl:

Java code

 
 
    private CrawlerUrl getNextUrl() throws Throwable {
        CrawlerUrl nextUrl = null;
        while ((nextUrl == null) && (!urlQueue.isEmpty())) {
            CrawlerUrl crawlerUrl = this.urlQueue.remove();

            // doWeHavePermissionToVisit: do we have permission to access this URL?
            // A polite crawler obeys the rules in the site's robots.txt file.
            // isUrlAlreadyVisited: has the URL already been visited? Large search
            // engines often use a Bloom filter for de-duplication; a HashMap is used here.
            // isDepthAcceptable: has the configured depth limit been reached? Crawlers
            // usually work breadth-first; some sites build crawler traps (automatically
            // generated invalid links that lead a crawler into an endless loop), and a
            // depth limit helps avoid them.
            if (doWeHavePermissionToVisit(crawlerUrl)
                    && (!isUrlAlreadyVisited(crawlerUrl))
                    && isDepthAcceptable(crawlerUrl)) {
                nextUrl = crawlerUrl;
                // System.out.println("Next URL to be visited is " + nextUrl);
            }
        }
        return nextUrl;
    }
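The helper methods referenced in the comments above, isUrlAlreadyVisited, markUrlAsVisited (called from getContent below), and isDepthAcceptable, are not shown in the original article. A minimal sketch, assuming a HashMap-based visited set (visitedUrls) and a maxDepth field, might look like this:

Java code

    // Assumed fields, not part of the original listing:
    //     private Map<String, CrawlerUrl> visitedUrls = new HashMap<String, CrawlerUrl>();
    //     private int maxDepth;

    private boolean isUrlAlreadyVisited(CrawlerUrl url) {
        return visitedUrls.containsKey(url.getUrlString());
    }

    private void markUrlAsVisited(CrawlerUrl url) {
        visitedUrls.put(url.getUrlString(), url);
    }

    private boolean isDepthAcceptable(CrawlerUrl url) {
        // Stop following links once the configured depth limit is reached
        return url.getDepth() <= maxDepth;
    }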

For more details about robots.txt, refer to the following article:

http://www.bloghuman.com/post/67/
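The doWeHavePermissionToVisit method is not shown in the original article either. A very rough sketch of a polite check is given below; it fetches /robots.txt once per host, caches the Disallow rules in a hypothetical disallowedPaths map, and ignores User-agent sections and other subtleties of the real protocol:

Java code

    // Hypothetical, simplified robots.txt check (assumes java.net.URL, java.io.*,
    // java.util.* are imported and a field:
    //     private Map<String, List<String>> disallowedPaths = new HashMap<String, List<String>>();)
    private boolean doWeHavePermissionToVisit(CrawlerUrl crawlerUrl) {
        try {
            URL url = new URL(crawlerUrl.getUrlString());
            String host = url.getHost();
            if (!disallowedPaths.containsKey(host)) {
                disallowedPaths.put(host, loadDisallowRules(host));
            }
            String path = url.getPath();
            for (String disallowed : disallowedPaths.get(host)) {
                if (!disallowed.isEmpty() && path.startsWith(disallowed)) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            return false; // when in doubt, do not crawl
        }
    }

    private List<String> loadDisallowRules(String host) {
        List<String> rules = new ArrayList<String>();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream(), "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    rules.add(line.substring("disallow:".length()).trim());
                }
            }
            reader.close();
        } catch (IOException e) {
            // No robots.txt (or it is unreadable): treat everything as allowed
        }
        return rules;
    }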

getContent internally uses Apache HttpClient 4.1 to fetch the page content. The code is as follows:

Java code

 
 
    private String getContent(CrawlerUrl url) throws Throwable {
        // The HttpClient 4.1 API differs from earlier versions.
        HttpClient client = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url.getUrlString());
        StringBuffer strBuf = new StringBuffer();
        HttpResponse response = client.execute(httpGet);
        if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(entity.getContent(), "UTF-8"));
                String line = null;
                if (entity.getContentLength() > 0) {
                    strBuf = new StringBuffer((int) entity.getContentLength());
                    while ((line = reader.readLine()) != null) {
                        strBuf.append(line);
                    }
                }
            }
            if (entity != null) {
                entity.consumeContent();
            }
        }
        // Mark the URL as visited
        markUrlAsVisited(url);
        return strBuf.toString();
    }

For vertical applications, data accuracy usually matters more than coverage. The defining feature of a focused crawler is that it collects only topic-related data, and that is the job of the isContentRelevant method. Classification or prediction techniques could be used here; for simplicity, regular-expression matching is used instead. The main code is as follows:

Java code

 
 
    public static boolean isContentRelevant(String content, Pattern regexpPattern) {
        boolean retValue = false;
        if (content != null) {
            // Check whether the content matches the regular expression
            Matcher m = regexpPattern.matcher(content.toLowerCase());
            retValue = m.find();
        }
        return retValue;
    }
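As a usage illustration (this example is not from the original article), the pattern can simply be the topic keyword compiled with java.util.regex.Pattern; since the content is lower-cased before matching, a lower-case keyword is enough:

Java code

    // Hypothetical usage example for isContentRelevant
    Pattern regexpSearchPattern = Pattern.compile("java");
    boolean relevant = isContentRelevant(
            "<html><body>Teach Yourself Java in 21 Days</body></html>",
            regexpSearchPattern);
    System.out.println("Relevant? " + relevant); // prints "Relevant? true"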

extractUrls gathers more URLs from a page, covering both internal and external links. The code is as follows:

Java code

 
 
    public List<String> extractUrls(String text, CrawlerUrl crawlerUrl) {
        Map<String, String> urlMap = new HashMap<String, String>();
        extractHttpUrls(urlMap, text);
        extractRelativeUrls(urlMap, text, crawlerUrl);
        return new ArrayList<String>(urlMap.keySet());
    }

    // Process absolute (external) links
    private void extractHttpUrls(Map<String, String> urlMap, String text) {
        Matcher m = httpRegexp.matcher(text);
        while (m.find()) {
            String url = m.group();
            String[] terms = url.split("a href=\"");
            for (String term : terms) {
                // System.out.println("Term = " + term);
                if (term.startsWith("http")) {
                    int index = term.indexOf("\"");
                    if (index > 0) {
                        term = term.substring(0, index);
                    }
                    urlMap.put(term, term);
                    System.out.println("Hyperlink: " + term);
                }
            }
        }
    }

    // Process relative (internal) links
    private void extractRelativeUrls(Map<String, String> urlMap, String text,
            CrawlerUrl crawlerUrl) {
        Matcher m = relativeRegexp.matcher(text);
        URL textUrl = crawlerUrl.getUrl();
        String host = textUrl.getHost();
        while (m.find()) {
            String url = m.group();
            String[] terms = url.split("a href=\"");
            for (String term : terms) {
                if (term.startsWith("/")) {
                    int index = term.indexOf("\"");
                    if (index > 0) {
                        term = term.substring(0, index);
                    }
                    String s = "http://" + host + term;
                    urlMap.put(s, s);
                    System.out.println("Relative URL: " + s);
                }
            }
        }
    }
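The httpRegexp and relativeRegexp patterns and the addUrlsToUrlQueue method used above are not defined in the article. A plausible sketch, assuming the patterns capture href attributes and that child URLs are queued one level deeper than their parent (reusing the visitedUrls map assumed earlier), is:

Java code

    // Hypothetical definitions, consistent with the split("a href=\"") logic above:
    // each match is an anchor fragment such as <a href="http://host/page" and the
    // split/startsWith/indexOf code then trims it down to the bare URL.
    private static final Pattern httpRegexp =
            Pattern.compile("<a href=\"http.*?\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern relativeRegexp =
            Pattern.compile("<a href=\"/.*?\"", Pattern.CASE_INSENSITIVE);

    // Each extracted link goes back into the queue one level deeper than its parent.
    private void addUrlsToUrlQueue(CrawlerUrl url, Collection<String> urlStrings) {
        int depth = url.getDepth() + 1;
        for (String urlString : urlStrings) {
            if (!visitedUrls.containsKey(urlString)) {
                urlQueue.add(new CrawlerUrl(urlString, depth));
            }
        }
    }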

With that, a simple web crawler is complete. It can be tested with the following program:

Java code

 
 
    public static void main(String[] args) {
        try {
            String url = "http://www.amazon.com";
            Queue<CrawlerUrl> urlQueue = new LinkedList<CrawlerUrl>();
            String regexp = "java";
            urlQueue.add(new CrawlerUrl(url, 0));
            NaiveCrawler crawler = new NaiveCrawler(urlQueue, 100, 5, 1000L,
                    regexp);
            // boolean allowCrawl = crawler.areWeAllowedToVisit(url);
            // System.out.println("Allowed to crawl: " + url + " " + allowCrawl);
            crawler.crawl();
        } catch (Throwable t) {
            System.out.println(t.toString());
            t.printStackTrace();
        }
    }

Of course, you can extend it with more advanced features, such as multithreading, smarter focus logic, and indexing with Lucene. For more demanding cases, consider open-source crawlers such as Nutch or Heritrix, which are beyond the scope of this article.
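For instance, a multithreaded variant could have several workers share a thread-safe queue, each running the same fetch/parse/enqueue cycle as crawl(). The following rough sketch is not from the original article and leaves out relevance checking and the synchronization a shared visited set would need:

Java code

    // Hypothetical sketch of a multithreaded crawl loop using a fixed thread pool.
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelCrawlSketch {

        public static void main(String[] args) throws InterruptedException {
            final Queue<String> urlQueue = new ConcurrentLinkedQueue<String>();
            urlQueue.add("http://www.amazon.com");

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        String url;
                        while ((url = urlQueue.poll()) != null) {
                            // fetch the page, check relevance, save it, and add the
                            // extracted links back into urlQueue (as in crawl() above)
                            System.out.println(Thread.currentThread().getName() + " -> " + url);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }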

Reprinted: http://developer.51cto.com/art/201103/248141.htm
