Web crawler: grab what you want.

Source: Internet
Author: User

Recently, a friend said he wanted to pull some key pieces of information off certain pages, such as telephone numbers and addresses. Hunting for them page by page is very tedious, so why not use a crawler to grab exactly what you want and save yourself the trouble? So today let's talk a little about crawlers.


I had also read a bit about crawlers myself, and with some free time these past few days I put together a small one.


It is written in Java. First, what we will be using: JDK 1.6, htmlparser.jar (the classic Java library for parsing HTML pages), httpclient-3.01.jar, logging-1.1.jar, and codec-1.4.jar.


Now for the basic idea behind the crawl. It is simple: breadth-first traversal. You may be more familiar with depth-first traversal, but breadth-first is easy to picture. Think of your own machine as one node and the Internet as a graph to walk. Starting from your node, you first visit the nodes nearest to you, then the next ring outward, and so on, level by level, much like the graph traversal taught in school. We can then treat sites such as Baidu or Sina as starting nodes and dig outward from there.
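
To make that concrete, here is a bare sketch of breadth-first traversal over links. The helper getLinksOnPage is only a placeholder; the real crawler below does the link extraction with htmlparser.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class BfsSketch {
	// Breadth-first walk: visit the seed, then every page one link away, then two, and so on.
	static void bfs(String seed) {
		LinkedList<String> toVisit = new LinkedList<String>();
		Set<String> visited = new HashSet<String>();
		toVisit.add(seed);
		while (!toVisit.isEmpty()) {
			String url = toVisit.removeFirst();       // first in, first out
			if (!visited.add(url)) continue;          // skip pages we have already seen
			for (String link : getLinksOnPage(url)) { // placeholder helper, see HtmlParserTool below
				if (!visited.contains(link)) toVisit.addLast(link);
			}
		}
	}

	static List<String> getLinksOnPage(String url) {
		// Placeholder; the real crawler uses htmlparser to pull the href values out of the page.
		return new LinkedList<String>();
	}
}

The to-visit queue and the visited set in this sketch are exactly the two pieces the classes below implement.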


With the idea in place, the first step is to define the classes we need. First of all, we have to remember which URLs have already been visited and which have not. Then we need a first-in, first-out queue for the URLs still waiting to be visited, and a class to manage that queue.

import java.util.LinkedList;

// A simple first-in, first-out queue that holds the URLs waiting to be visited.
public class Queue {
	private LinkedList queue = new LinkedList();

	public void enQueue(Object t) {
		queue.addLast(t);
	}

	public Object deQueue() {
		return queue.removeFirst();
	}

	public boolean isQueueEmpty() {
		return queue.isEmpty();
	}

	public boolean contains(Object t) {
		return queue.contains(t);
	}
}

import java.util.HashSet;
import java.util.Set;

// Tracks the URLs that have already been visited (a Set) and the ones still
// waiting to be visited (the FIFO Queue above).
public class LinkQueue {
	private static Set visitedUrl = new HashSet();
	private static Queue unVisitedUrl = new Queue();

	public static Queue getUnVisitedUrl() {
		return unVisitedUrl;
	}

	public static void addVisitedUrl(String url) {
		visitedUrl.add(url);
	}

	public static void removeVisitedUrl(String url) {
		visitedUrl.remove(url);
	}

	public static Object unVisitedUrlDeQueue() {
		return unVisitedUrl.deQueue();
	}

	// Only enqueue a URL that is non-empty and has not been seen before,
	// either in the visited set or in the waiting queue.
	public static void addUnvisitedUrl(String url) {
		if (url != null && !url.trim().equals("")
				&& !visitedUrl.contains(url)
				&& !unVisitedUrl.contains(url)) {
			unVisitedUrl.enQueue(url);
		}
	}

	public static int getVisitedUrlNum() {
		return visitedUrl.size();
	}

	public static boolean unVisitedUrlsEmpty() {
		return unVisitedUrl.isQueueEmpty();
	}
}

With the queue management written, we come to the most important class: the one that downloads a page, gets the information from it, and processes it.

Here I just grab the title and the URL and write them to a file; you can change what is extracted and how it is handled to suit your own requirements.



import java.io.FileWriter;
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

	// Build a file name for the page from its URL and the Content-Type header.
	public String getFileNameByUrl(String url, String contentType) {
		if (contentType.indexOf("html") != -1) {
			url = url.replaceAll("[\\?/:*|<>\"]", "");
			return url;
		} else {
			return url.replaceAll("[\\?/:*|<>\"]", "") + "."
					+ contentType.substring(contentType.lastIndexOf("/") + 1);
		}
	}

	// Append the extracted data to a local text file.
	private void saveToLocal(String data) {
		try {
			FileWriter fw = new FileWriter("/users/vika/desktop/url.txt", true);
			fw.write(data);
			fw.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	// Download the page at the given URL, pull out its <title>, and save it.
	public String downloadFile(String url) {
		String filePath = null;
		HttpClient httpClient = new HttpClient();
		// Connection timeout: 5 seconds.
		httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
		GetMethod getMethod = new GetMethod(url);
		getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
		getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler());
		try {
			int code = httpClient.executeMethod(getMethod);
			if (code != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + getMethod.getStatusLine());
				filePath = null;
			}
			// Read the page and keep only the <title> part, if there is one.
			String content = getMethod.getResponseBodyAsString();
			if (content.contains("title")) {
				content = content.substring(content.indexOf("<title>"),
						content.indexOf("</title>"));
				content = URLDecoder.decode(content, "UTF-8");
			}
			// Build the file name from the URL, then write "file name + title" to disk.
			filePath = getFileNameByUrl(url,
					getMethod.getResponseHeader("Content-Type").getValue());
			saveToLocal(filePath + content + "\n");
			System.out.println(filePath);
			System.out.println(content);
		} catch (HttpException e) {
			System.out.println("Please check your provided http address!");
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			getMethod.releaseConnection();
		}
		return filePath;
	}
}
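
The class above only keeps the <title>. Since the original motivation was phone numbers and addresses, here is a minimal sketch of how the extraction step inside downloadFile could be swapped for a regular expression instead; the class name PhoneExtractor and the pattern are only illustrations, not part of the project.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneExtractor {
	// Very rough pattern for numbers such as 010-12345678 or 13812345678;
	// adjust it to whatever key information the pages actually contain.
	private static final Pattern PHONE = Pattern.compile("(\\d{3,4}-)?\\d{7,11}");

	// Could replace the <title> substring logic inside DownLoadFile.downloadFile().
	public static String extractPhones(String content) {
		StringBuilder sb = new StringBuilder();
		Matcher m = PHONE.matcher(content);
		while (m.find()) {
			sb.append(m.group()).append("\n");
		}
		return sb.toString();
	}
}

The next class, HtmlParserTool, extracts the links from each downloaded page so the breadth-first loop can keep moving outward.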



import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

	// Collect every link on the page at url that the given LinkFilter accepts.
	public static Set<String> extracLinks(String url, LinkFilter filter) {
		Set<String> links = new HashSet<String>();
		try {
			Parser parser = new Parser(url);
			parser.setEncoding("UTF-8");
			// Accept every node; <frame src="..."> handling could be added here.
			final NodeFilter frameFilter = new NodeFilter() {
				public boolean accept(Node node) {
					return true;
				}
			};
			// Match <a> tags, or whatever the frame filter lets through.
			OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
			NodeList list = parser.extractAllNodesThatMatch(linkFilter);
			for (int i = 0; i < list.size(); i++) {
				Node tag = list.elementAt(i);
				if (tag instanceof LinkTag) {
					LinkTag link = (LinkTag) tag;
					String linkUrl = link.getLink();
					if (filter.accept(linkUrl)) {
						links.add(linkUrl);
					}
				} else {
					// Non-anchor nodes: fall back to the raw tag text.
					String frame = tag.getText();
					if (filter.accept(frame)) {
						links.add(frame);
					}
				}
			}
		} catch (ParserException e) {
			e.printStackTrace();
		}
		return links;
	}
}


A small interface is also used here to decide which links are worth following.

public interface LinkFilter {
	public boolean accept(String url);
}


With all of that done, we can write the entry class. Here I just defined it casually, using Test as the main class.

The connection timeout is set to 5 seconds, and the keyword is youku: I was grabbing information from www.youku.com, so any link whose URL contains the youku keyword gets followed. You can change the keyword to crawl whatever you prefer.

import java.util.Set;

public class Test {

	public static void main(String[] args) {
		Test test = new Test();
		test.crawling(new String[] { "http://www.youku.com/" });
	}

	// Seed the unvisited queue with the starting URLs.
	private void initCrawlerWithSeeds(String[] seeds) {
		for (int i = 0; i < seeds.length; i++)
			LinkQueue.addUnvisitedUrl(seeds[i]);
	}

	// Breadth-first crawl: download each page, mark it visited, then enqueue its links.
	public void crawling(String[] seeds) {
		// Only follow links whose URL contains the keyword "youku".
		LinkFilter filter = new LinkFilter() {
			public boolean accept(String url) {
				return url.contains("youku");
			}
		};

		initCrawlerWithSeeds(seeds);
		while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100000) {
			String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
			if (visitUrl == null)
				continue;
			DownLoadFile downLoadFile = new DownLoadFile();
			downLoadFile.downloadFile(visitUrl);
			LinkQueue.addVisitedUrl(visitUrl);

			Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
			for (String link : links) {
				LinkQueue.addUnvisitedUrl(link);
			}
		}
	}
}
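
The keyword is hard-coded inside crawling(). If you want to point the crawler somewhere else, one option is a small overload that takes the keyword from the caller; this overload is not part of the original code, just a sketch of how it could look inside the Test class.

	// Hypothetical overload: same breadth-first loop, but the keyword comes from the caller,
	// e.g. test.crawling(new String[] { "http://www.sina.com.cn/" }, "sina");
	public void crawling(String[] seeds, final String keyword) {
		LinkFilter filter = new LinkFilter() {
			public boolean accept(String url) {
				return url.contains(keyword);
			}
		};
		initCrawlerWithSeeds(seeds);
		while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100000) {
			String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
			if (visitUrl == null)
				continue;
			new DownLoadFile().downloadFile(visitUrl);
			LinkQueue.addVisitedUrl(visitUrl);
			for (String link : HtmlParserTool.extracLinks(visitUrl, filter)) {
				LinkQueue.addUnvisitedUrl(link);
			}
		}
	}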




