Web crawler: grab what you want.

Source: Internet
Author: User

Recently, a friend said he wanted to pull some key pieces of information off certain pages, such as telephone numbers and addresses. Hunting for them page by page is very tedious, so why not use a crawler to grab exactly what you want and save yourself the trouble? So today let's talk a little about crawlers.


I had also read a bit about crawlers myself, and with some free time these past few days I put together a small one.


It is written in Java. First, what we will be using: JDK 1.6, htmlparser.jar (the classic Java library for parsing HTML pages), httpclient-3.01.jar, logging-1.1.jar, and codec-1.4.jar.


Now for the basic idea behind the crawl. It is simple: breadth-first traversal. You may be more familiar with depth-first traversal, but breadth-first is easy to picture. Think of your own machine as one node and the Internet as a graph to walk. Starting from your node, you first visit the nodes nearest to you, then the next ring outward, and so on, level by level, much like the graph traversal taught in school. We can then treat sites such as Baidu or Sina as starting nodes and dig outward from there.
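
To make that concrete, here is a bare sketch of breadth-first traversal over links. The helper getLinksOnPage is only a placeholder; the real crawler below does the link extraction with htmlparser.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class BfsSketch {
	// Breadth-first walk: visit the seed, then every page one link away, then two, and so on.
	static void bfs(String seed) {
		LinkedList<String> toVisit = new LinkedList<String>();
		Set<String> visited = new HashSet<String>();
		toVisit.add(seed);
		while (!toVisit.isEmpty()) {
			String url = toVisit.removeFirst();       // first in, first out
			if (!visited.add(url)) continue;          // skip pages we have already seen
			for (String link : getLinksOnPage(url)) { // placeholder helper, see HtmlParserTool below
				if (!visited.contains(link)) toVisit.addLast(link);
			}
		}
	}

	static List<String> getLinksOnPage(String url) {
		// Placeholder; the real crawler uses htmlparser to pull the href values out of the page.
		return new LinkedList<String>();
	}
}

The to-visit queue and the visited set in this sketch are exactly the two pieces the classes below implement.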


With the idea in place, the first step is to define the classes we need. First of all, we have to remember which URLs have already been visited and which have not. Then we need a first-in, first-out queue for the URLs still waiting to be visited, and a class to manage that queue.

import java.util.LinkedList;

// A simple first-in, first-out queue that holds the URLs waiting to be visited.
public class Queue {
	private LinkedList queue = new LinkedList();

	public void enQueue(Object t) {
		queue.addLast(t);
	}

	public Object deQueue() {
		return queue.removeFirst();
	}

	public boolean isQueueEmpty() {
		return queue.isEmpty();
	}

	public boolean contains(Object t) {
		return queue.contains(t);
	}
}

import java.util.HashSet;
import java.util.Set;

// Tracks the URLs that have already been visited (a Set) and the ones still
// waiting to be visited (the FIFO Queue above).
public class LinkQueue {
	private static Set visitedUrl = new HashSet();
	private static Queue unVisitedUrl = new Queue();

	public static Queue getUnVisitedUrl() {
		return unVisitedUrl;
	}

	public static void addVisitedUrl(String url) {
		visitedUrl.add(url);
	}

	public static void removeVisitedUrl(String url) {
		visitedUrl.remove(url);
	}

	public static Object unVisitedUrlDeQueue() {
		return unVisitedUrl.deQueue();
	}

	// Only enqueue a URL that is non-empty and has not been seen before,
	// either in the visited set or in the waiting queue.
	public static void addUnvisitedUrl(String url) {
		if (url != null && !url.trim().equals("")
				&& !visitedUrl.contains(url)
				&& !unVisitedUrl.contains(url)) {
			unVisitedUrl.enQueue(url);
		}
	}

	public static int getVisitedUrlNum() {
		return visitedUrl.size();
	}

	public static boolean unVisitedUrlsEmpty() {
		return unVisitedUrl.isQueueEmpty();
	}
}

With the queue management written, we come to the most important class: the one that downloads a page, gets the information from it, and processes it.

Here I just grab the title and the URL and write them to a file; you can change what is extracted and how it is handled to suit your own requirements.



import java.io.FileWriter;
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

	// Build a file name for the page from its URL and the Content-Type header.
	public String getFileNameByUrl(String url, String contentType) {
		if (contentType.indexOf("html") != -1) {
			url = url.replaceAll("[\\?/:*|<>\"]", "");
			return url;
		} else {
			return url.replaceAll("[\\?/:*|<>\"]", "") + "."
					+ contentType.substring(contentType.lastIndexOf("/") + 1);
		}
	}

	// Append the extracted data to a local text file.
	private void saveToLocal(String data) {
		try {
			FileWriter fw = new FileWriter("/users/vika/desktop/url.txt", true);
			fw.write(data);
			fw.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	// Download the page at the given URL, pull out its <title>, and save it.
	public String downloadFile(String url) {
		String filePath = null;
		HttpClient httpClient = new HttpClient();
		// Connection timeout: 5 seconds.
		httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
		GetMethod getMethod = new GetMethod(url);
		getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
		getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler());
		try {
			int code = httpClient.executeMethod(getMethod);
			if (code != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + getMethod.getStatusLine());
				filePath = null;
			}
			// Read the page and keep only the <title> part, if there is one.
			String content = getMethod.getResponseBodyAsString();
			if (content.contains("title")) {
				content = content.substring(content.indexOf("<title>"),
						content.indexOf("</title>"));
				content = URLDecoder.decode(content, "UTF-8");
			}
			// Build the file name from the URL, then write "file name + title" to disk.
			filePath = getFileNameByUrl(url,
					getMethod.getResponseHeader("Content-Type").getValue());
			saveToLocal(filePath + content + "\n");
			System.out.println(filePath);
			System.out.println(content);
		} catch (HttpException e) {
			System.out.println("Please check your provided http address!");
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			getMethod.releaseConnection();
		}
		return filePath;
	}
}
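
The class above only keeps the <title>. Since the original motivation was phone numbers and addresses, here is a minimal sketch of how the extraction step inside downloadFile could be swapped for a regular expression instead; the class name PhoneExtractor and the pattern are only illustrations, not part of the project.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneExtractor {
	// Very rough pattern for numbers such as 010-12345678 or 13812345678;
	// adjust it to whatever key information the pages actually contain.
	private static final Pattern PHONE = Pattern.compile("(\\d{3,4}-)?\\d{7,11}");

	// Could replace the <title> substring logic inside DownLoadFile.downloadFile().
	public static String extractPhones(String content) {
		StringBuilder sb = new StringBuilder();
		Matcher m = PHONE.matcher(content);
		while (m.find()) {
			sb.append(m.group()).append("\n");
		}
		return sb.toString();
	}
}

The next class, HtmlParserTool, extracts the links from each downloaded page so the breadth-first loop can keep moving outward.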



import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

	// Collect every link on the page at url that the given LinkFilter accepts.
	public static Set<String> extracLinks(String url, LinkFilter filter) {
		Set<String> links = new HashSet<String>();
		try {
			Parser parser = new Parser(url);
			parser.setEncoding("UTF-8");
			// Accept every node; <frame src="..."> handling could be added here.
			final NodeFilter frameFilter = new NodeFilter() {
				public boolean accept(Node node) {
					return true;
				}
			};
			// Match <a> tags, or whatever the frame filter lets through.
			OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
			NodeList list = parser.extractAllNodesThatMatch(linkFilter);
			for (int i = 0; i < list.size(); i++) {
				Node tag = list.elementAt(i);
				if (tag instanceof LinkTag) {
					LinkTag link = (LinkTag) tag;
					String linkUrl = link.getLink();
					if (filter.accept(linkUrl)) {
						links.add(linkUrl);
					}
				} else {
					// Non-anchor nodes: fall back to the raw tag text.
					String frame = tag.getText();
					if (filter.accept(frame)) {
						links.add(frame);
					}
				}
			}
		} catch (ParserException e) {
			e.printStackTrace();
		}
		return links;
	}
}


A small interface is also used here to decide which links are worth following.

public interface LinkFilter {
	public boolean accept(String url);
}


With all of that done, we can write the entry class. Here I just defined it casually, using Test as the main class.

The connection timeout is set to 5 seconds, and the keyword is youku: I was grabbing information from www.youku.com, so any link whose URL contains the youku keyword gets followed. You can change the keyword to crawl whatever you prefer.

import java.util.Set;

public class Test {

	public static void main(String[] args) {
		Test test = new Test();
		test.crawling(new String[] { "http://www.youku.com/" });
	}

	// Seed the unvisited queue with the starting URLs.
	private void initCrawlerWithSeeds(String[] seeds) {
		for (int i = 0; i < seeds.length; i++)
			LinkQueue.addUnvisitedUrl(seeds[i]);
	}

	// Breadth-first crawl: download each page, mark it visited, then enqueue its links.
	public void crawling(String[] seeds) {
		// Only follow links whose URL contains the keyword "youku".
		LinkFilter filter = new LinkFilter() {
			public boolean accept(String url) {
				return url.contains("youku");
			}
		};

		initCrawlerWithSeeds(seeds);
		while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100000) {
			String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
			if (visitUrl == null)
				continue;
			DownLoadFile downLoadFile = new DownLoadFile();
			downLoadFile.downloadFile(visitUrl);
			LinkQueue.addVisitedUrl(visitUrl);

			Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
			for (String link : links) {
				LinkQueue.addUnvisitedUrl(link);
			}
		}
	}
}
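
The keyword is hard-coded inside crawling(). If you want to point the crawler somewhere else, one option is a small overload that takes the keyword from the caller; this overload is not part of the original code, just a sketch of how it could look inside the Test class.

	// Hypothetical overload: same breadth-first loop, but the keyword comes from the caller,
	// e.g. test.crawling(new String[] { "http://www.sina.com.cn/" }, "sina");
	public void crawling(String[] seeds, final String keyword) {
		LinkFilter filter = new LinkFilter() {
			public boolean accept(String url) {
				return url.contains(keyword);
			}
		};
		initCrawlerWithSeeds(seeds);
		while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100000) {
			String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
			if (visitUrl == null)
				continue;
			new DownLoadFile().downloadFile(visitUrl);
			LinkQueue.addVisitedUrl(visitUrl);
			for (String link : HtmlParserTool.extracLinks(visitUrl, filter)) {
				LinkQueue.addUnvisitedUrl(link);
			}
		}
	}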




