Trying Java to develop a search engine crawler

We should also know that Baidu search results include a "Baidu snapshot": a copy of the page served from a cache server, which lets us browse the page's information quickly through the snapshot. So what is the connection between this cache server and the crawler?

Let's take a general look at the basic principles of crawlers (my personal understanding; corrections are welcome). First, a search engine does not produce content itself; its information is gathered by a crawler. The crawler fetches a page's source code from its URL, stores the page content on the cache server, and indexes it. The URL of each downloaded page is placed in a URL list and recorded, to avoid repeated crawls. Newly discovered URLs are checked against this list; any URL that has not been crawled yet is put into the queue of URLs to be crawled, and its page is downloaded in the next round of scheduling.
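
To make the queue-and-record loop above concrete, here is a minimal sketch in Java using the same Jsoup library this article relies on. The class name TinyCrawler, the seed URL, and the ten-page limit are illustrative choices, not part of the original article:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** A minimal sketch of the queue-plus-visited-set loop described above. */
public class TinyCrawler {
    public static void main(String[] args) {
        Queue<String> toCrawl = new ArrayDeque<String>(); // URLs waiting to be downloaded
        Set<String> seen = new HashSet<String>();         // URLs already recorded, to avoid repeats
        toCrawl.add("http://www.4399.com/");
        seen.add("http://www.4399.com/");

        int limit = 10; // stop after a few pages for the demo
        while (!toCrawl.isEmpty() && limit-- > 0) {
            String url = toCrawl.poll();
            try {
                // Download the page (in a real engine this is where caching and indexing happen)
                Document doc = Jsoup.connect(url).get();
                System.out.println("Fetched: " + url + " (" + doc.title() + ")");
                // Extract out-links and queue any URL we have not seen yet
                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href");
                    if (!next.isEmpty() && seen.add(next)) {
                        toCrawl.add(next);
                    }
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}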

First we need one jar package: jsoup-1.7.2.jar.
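
If you manage dependencies with Maven instead of adding the jar by hand (an assumption about your build setup; the article itself just drops the jar on the classpath), the equivalent declaration is:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>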


package com.html;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Using Java's Jsoup to develop a search engine crawler.
 * Creator: youshangdetudoudou
 * Time: September 9, 2014, 10:55:27 AM
 * @version 1.0.0
 */
public class HtmlJsoup {

    /**
     * Gets the source code of a web page for a given URL and character encoding.
     *
     * @param url      the URL to download
     * @param encoding the page's character encoding
     * @return the page source as a String
     */
    public static String getHtmlResourceByUrl(String url, String encoding) {
        // Container that accumulates the page source
        StringBuffer buffer = new StringBuffer();
        InputStreamReader in = null;
        try {
            // Establish the network connection
            URL urlObj = new URL(url);
            URLConnection uc = urlObj.openConnection();
            // Wrap the input stream with the requested encoding
            in = new InputStreamReader(uc.getInputStream(), encoding);
            BufferedReader reader = new BufferedReader(in);
            // Read the stream line by line and append to the buffer
            String tempLine;
            while ((tempLine = reader.readLine()) != null) {
                buffer.append(tempLine).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Connection timeout...");
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return buffer.toString();
    }

    /**
     * Downloads an image to the local disk, given its address.
     *
     * @param imgURL   the image URL (assumed to be absolute)
     * @param filePath the directory to save the image into
     */
    public static void downImages(String imgURL, String filePath) {
        // Use the last path segment of the URL as the file name
        String fileName = imgURL.substring(imgURL.lastIndexOf("/"));
        try {
            // Create the target directory if it does not exist
            File files = new File(filePath);
            if (!files.exists()) {
                files.mkdir();
            }
            // Connect to the image address
            URL url = new URL(imgURL);
            HttpURLConnection uc = (HttpURLConnection) url.openConnection();
            // Copy the image bytes from the connection to the file
            InputStream is = uc.getInputStream();
            File file = new File(filePath + fileName);
            FileOutputStream out = new FileOutputStream(file);
            int i;
            while ((i = is.read()) != -1) {
                out.write(i);
            }
            is.close();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Java entry point
    public static void main(String[] args) {
        // Get the page source for the URL with the given encoding
        String htmlResource = getHtmlResourceByUrl("http://www.4399.com/", "GBK");
        // Parse the source code
        Document document = Jsoup.parse(htmlResource);
        // Get the <img> tags of the web page
        Elements elements = document.getElementsByTag("img");
        for (Element element : elements) {
            String imgSrc = element.attr("src");
            System.out.println("Picture address: " + imgSrc);
            downImages(imgSrc, "f:\\xfmovie\\images");
            System.out.println("Download complete!");
        }
        // Parse whatever content section we need to download
    }
}

The code above fetches the source of the http://www.4399.com/ web page and downloads every image it finds.
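
As a side note, Jsoup can also fetch the page itself, so the hand-written getHtmlResourceByUrl method could be replaced by a single connect call. This is a minimal sketch, not the author's code; it assumes the target server declares its charset (such as GBK) in the HTTP headers or a meta tag, which Jsoup then honors automatically:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFetchDemo {
    public static void main(String[] args) throws Exception {
        // Fetch and parse in one step; Jsoup detects the page encoding itself
        Document document = Jsoup.connect("http://www.4399.com/").get();
        // Select only <img> tags that actually carry a src attribute
        Elements elements = document.select("img[src]");
        for (Element element : elements) {
            // abs:src resolves relative paths against the page URL
            System.out.println("Picture address: " + element.attr("abs:src"));
        }
    }
}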

[Screenshot: http://s3.51cto.com/wyfs02/M02/49/16/wKiom1QOr-yBeFt_AAPmf6Dd_X0960.jpg]

The screenshot above shows part of the parsed source code of the web page.

[Screenshot: http://s3.51cto.com/wyfs02/M00/49/18/wKioL1QOsB_DOPUpAAT1PgdmW0s176.jpg]

The screenshot above shows the images downloaded from the page. The crawl succeeded.


This is a relatively simple crawler; when time allows I will continue to improve it and keep learning. Thank you.

This article is from the "Sad Potato Cake" blog; reprinting is declined.
