Trying Java to develop a search engine crawler

We should also know that Baidu search results include a "Baidu snapshot": a copy of the page served from a cache server, which lets us browse the page's information quickly through the snapshot. So what is the connection between this cache server and the crawler?

Let's take a general look at the basic principles of crawlers (my personal understanding; corrections are welcome). First, a search engine does not produce content itself; its information is gathered by a crawler. The crawler fetches a page's source code from its URL, stores the page content on the cache server, and indexes it. The URL of each downloaded page is placed in a URL list and recorded, to avoid repeated crawls. Newly discovered URLs are checked against this list; any URL that has not been crawled yet is put into the queue of URLs to be crawled, and its page is downloaded in the next round of scheduling.
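
To make the queue-and-record loop above concrete, here is a minimal sketch in Java using the same Jsoup library this article relies on. The class name TinyCrawler, the seed URL, and the ten-page limit are illustrative choices, not part of the original article:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** A minimal sketch of the queue-plus-visited-set loop described above. */
public class TinyCrawler {
    public static void main(String[] args) {
        Queue<String> toCrawl = new ArrayDeque<String>(); // URLs waiting to be downloaded
        Set<String> seen = new HashSet<String>();         // URLs already recorded, to avoid repeats
        toCrawl.add("http://www.4399.com/");
        seen.add("http://www.4399.com/");

        int limit = 10; // stop after a few pages for the demo
        while (!toCrawl.isEmpty() && limit-- > 0) {
            String url = toCrawl.poll();
            try {
                // Download the page (in a real engine this is where caching and indexing happen)
                Document doc = Jsoup.connect(url).get();
                System.out.println("Fetched: " + url + " (" + doc.title() + ")");
                // Extract out-links and queue any URL we have not seen yet
                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href");
                    if (!next.isEmpty() && seen.add(next)) {
                        toCrawl.add(next);
                    }
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}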

First we need one jar package: jsoup-1.7.2.jar.
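
If you manage dependencies with Maven instead of adding the jar by hand (an assumption about your build setup; the article itself just drops the jar on the classpath), the equivalent declaration is:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>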


package com.html;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Using Java's Jsoup to develop a search engine crawler.
 * Creator: youshangdetudoudou
 * Time: September 9, 2014, 10:55:27 AM
 * @version 1.0.0
 */
public class HtmlJsoup {

    /**
     * Gets the source code of a web page for a given URL and character encoding.
     *
     * @param url      the URL to download
     * @param encoding the page's character encoding
     * @return the page source as a String
     */
    public static String getHtmlResourceByUrl(String url, String encoding) {
        // Container that accumulates the page source
        StringBuffer buffer = new StringBuffer();
        InputStreamReader in = null;
        try {
            // Establish the network connection
            URL urlObj = new URL(url);
            URLConnection uc = urlObj.openConnection();
            // Wrap the input stream with the requested encoding
            in = new InputStreamReader(uc.getInputStream(), encoding);
            BufferedReader reader = new BufferedReader(in);
            // Read the stream line by line and append to the buffer
            String tempLine;
            while ((tempLine = reader.readLine()) != null) {
                buffer.append(tempLine).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Connection timeout...");
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return buffer.toString();
    }

    /**
     * Downloads an image to the local disk, given its address.
     *
     * @param imgURL   the image URL (assumed to be absolute)
     * @param filePath the directory to save the image into
     */
    public static void downImages(String imgURL, String filePath) {
        // Use the last path segment of the URL as the file name
        String fileName = imgURL.substring(imgURL.lastIndexOf("/"));
        try {
            // Create the target directory if it does not exist
            File files = new File(filePath);
            if (!files.exists()) {
                files.mkdir();
            }
            // Connect to the image address
            URL url = new URL(imgURL);
            HttpURLConnection uc = (HttpURLConnection) url.openConnection();
            // Copy the image bytes from the connection to the file
            InputStream is = uc.getInputStream();
            File file = new File(filePath + fileName);
            FileOutputStream out = new FileOutputStream(file);
            int i;
            while ((i = is.read()) != -1) {
                out.write(i);
            }
            is.close();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Java entry point
    public static void main(String[] args) {
        // Get the page source for the URL with the given encoding
        String htmlResource = getHtmlResourceByUrl("http://www.4399.com/", "GBK");
        // Parse the source code
        Document document = Jsoup.parse(htmlResource);
        // Get the <img> tags of the web page
        Elements elements = document.getElementsByTag("img");
        for (Element element : elements) {
            String imgSrc = element.attr("src");
            System.out.println("Picture address: " + imgSrc);
            downImages(imgSrc, "f:\\xfmovie\\images");
            System.out.println("Download complete!");
        }
        // Parse whatever content section we need to download
    }
}

The code above fetches the source of the http://www.4399.com/ web page and downloads every image it finds.
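
As a side note, Jsoup can also fetch the page itself, so the hand-written getHtmlResourceByUrl method could be replaced by a single connect call. This is a minimal sketch, not the author's code; it assumes the target server declares its charset (such as GBK) in the HTTP headers or a meta tag, which Jsoup then honors automatically:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFetchDemo {
    public static void main(String[] args) throws Exception {
        // Fetch and parse in one step; Jsoup detects the page encoding itself
        Document document = Jsoup.connect("http://www.4399.com/").get();
        // Select only <img> tags that actually carry a src attribute
        Elements elements = document.select("img[src]");
        for (Element element : elements) {
            // abs:src resolves relative paths against the page URL
            System.out.println("Picture address: " + element.attr("abs:src"));
        }
    }
}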

[Screenshot: http://s3.51cto.com/wyfs02/M02/49/16/wKiom1QOr-yBeFt_AAPmf6Dd_X0960.jpg]

The screenshot above shows part of the parsed source code of the web page.

[Screenshot: http://s3.51cto.com/wyfs02/M00/49/18/wKioL1QOsB_DOPUpAAT1PgdmW0s176.jpg]

The screenshot above shows the images downloaded from the page. The crawl succeeded.


This is a relatively simple crawler; when time allows I will continue to improve it and keep learning. Thank you.

This article is from the "Sad Potato Cake" blog; reprinting is declined.
