We should also know that Baidu search results include a "Baidu snapshot", a copy of the page served from a cache server, which lets us quickly browse the page's content through the snapshot. So what is the connection between this cache server and the crawler?
Let's take a general look at the basic principles of crawlers (my personal understanding; corrections welcome). First, a search engine does not produce content itself; it gathers information through a crawler. The crawler fetches a page's source code from its URL, stores the page content on the cache server, and indexes it. The URL of each downloaded page is placed in a record of crawled URLs to avoid repeated crawls. New URLs found on a page are checked against this record, and any that have not yet been crawled are put into the to-be-crawled queue; the crawler then downloads the corresponding pages in the next scheduling round.
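The scheduling loop above can be sketched in a few lines. This is a minimal illustration, not the article's actual code: it uses an in-memory map of page-to-outgoing-links as a stand-in for real downloads, and the names (`crawl`, `seed`, `web`) are my own.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the crawl scheduling described above: a queue of URLs to crawl
// plus a record of already-downloaded URLs to avoid repeated crawls.
class CrawlQueueSketch {
    static Set<String> crawl(String seed, Map<String, List<String>> web) {
        Set<String> downloaded = new LinkedHashSet<>(); // record of crawled URLs
        Deque<String> toCrawl = new ArrayDeque<>();     // queue of URLs to crawl
        toCrawl.add(seed);
        while (!toCrawl.isEmpty()) {
            String url = toCrawl.poll();
            if (downloaded.contains(url)) {
                continue;                               // skip repeated crawls
            }
            downloaded.add(url);                        // "download" the page
            // Put any URL we have not crawled yet into the to-be-crawled queue
            for (String link : web.getOrDefault(url, List.of())) {
                if (!downloaded.contains(link)) {
                    toCrawl.add(link);
                }
            }
        }
        return downloaded;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://a/", List.of("http://b/", "http://c/"),
            "http://b/", List.of("http://a/"));
        System.out.println(crawl("http://a/", web));
    }
}
```

Even on a cyclic link graph (a links to b, b links back to a), the loop terminates because every URL is downloaded at most once.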
First we need one jar on the classpath: jsoup-1.7.2.jar.
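If you manage the project with Maven rather than adding the jar by hand, the same library is available from Maven Central as:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>
```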
package com.html;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * A search-engine crawler developed with Java and Jsoup.
 * Creator: youshangdetudoudou
 * Time: September 9, 2014, 10:55:27 AM
 * @version 1.0.0
 */
public class HtmlJsoup {

    /**
     * Get the source code of a web page from its URL, using the given character encoding.
     *
     * @param url      the URL to download
     * @param encoding the page's character encoding
     * @return the page source as a String
     */
    public static String getHtmlResourceByUrl(String url, String encoding) {
        // Container that stores the page source
        StringBuffer buffer = new StringBuffer();
        InputStreamReader in = null;
        try {
            // Establish the network connection
            URL urlObj = new URL(url);
            // Open the connection
            URLConnection uc = urlObj.openConnection();
            // Build the input stream with the requested encoding
            in = new InputStreamReader(uc.getInputStream(), encoding);
            // Buffered reader over the stream
            BufferedReader reader = new BufferedReader(in);
            String tempLine = null;
            // Loop over the stream, appending each line
            while ((tempLine = reader.readLine()) != null) {
                buffer.append(tempLine + "\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Connection timeout...");
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return buffer.toString();
    }

    /**
     * Bulk-download images to the local disk, given an image address.
     *
     * @param imgUrl   the image URL
     * @param filePath the directory to save into
     */
    public static void downImages(String imgUrl, String filePath) {
        String fileName = imgUrl.substring(imgUrl.lastIndexOf("/"));
        try {
            // Create the target directory if it does not exist
            File files = new File(filePath);
            if (!files.exists()) {
                files.mkdir();
            }
            // Connect to the image address
            URL url = new URL(imgUrl);
            HttpURLConnection uc = (HttpURLConnection) url.openConnection();
            // Get the connection's input stream
            InputStream is = uc.getInputStream();
            // Create the output file and write the bytes to it
            File file = new File(filePath + fileName);
            FileOutputStream out = new FileOutputStream(file);
            int i = 0;
            while ((i = is.read()) != -1) {
                out.write(i);
            }
            is.close();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Java entry point
    public static void main(String[] args) {
        // Get the page source for the URL, using the GBK encoding
        String htmlResource = getHtmlResourceByUrl("http://www.4399.com/", "GBK");
        // System.out.println(htmlResource);
        // Parse the source
        Document document = Jsoup.parse(htmlResource);
        // Get the page's images
        Elements elements = document.getElementsByTag("img");
        for (Element element : elements) {
            String imgSrc = element.attr("src");
            System.out.println("Picture address: " + imgSrc);
            downImages(imgSrc, "f:\\xfmovie\\images");
            System.out.println("Download complete!");
        }
        // Parse out the content we want to download
    }
}
The above is the source code for fetching the http://www.4399.com/ page.
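One caveat: the `src` attributes pulled out of the page may be relative paths (e.g. `/upload/a.png`), which `downImages` cannot open directly. A minimal sketch of resolving them against the page URL with the standard `java.net.URL(context, spec)` constructor; the `resolve` helper name is my own, not from the code above:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Resolve a possibly-relative image src against the page's base URL.
class UrlResolveSketch {
    static String resolve(String pageUrl, String src) {
        try {
            // new URL(context, spec): an absolute spec passes through unchanged,
            // a relative one is joined to the base URL.
            return new URL(new URL(pageUrl), src).toString();
        } catch (MalformedURLException e) {
            return src; // fall back to the raw attribute value
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.4399.com/", "/upload/a.png"));
    }
}
```

Jsoup can also do this for you: if the document is parsed with a base URI (e.g. `Jsoup.parse(html, "http://www.4399.com/")`), then `element.absUrl("src")` returns the absolute address.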
[Screenshot Qq20140909154409.jpg: http://s3.51cto.com/wyfs02/M02/49/16/wKiom1QOr-yBeFt_AAPmf6Dd_X0960.jpg]
The above shows part of the parsed page source.
[Screenshot Qq20140909154509.jpg: http://s3.51cto.com/wyfs02/M00/49/18/wKioL1QOsB_DOPUpAAT1PgdmW0s176.jpg]
The above shows the images downloaded from the page. Crawl successful!
This is a relatively simple crawl; when I have time I will keep improving it and keep learning. Thank you.
This article is from the "Sad Potato Cake" blog; reprinting declined.
Trying Java to develop a search-engine crawler