Recently, a friend told me he wanted to pull some key information off a set of web pages — things like phone numbers and addresses — and that hunting for it page by page was very tedious. That made me think: why not use a "crawler" to grab exactly what you want and save yourself the trouble? So today let's talk a bit about crawlers.
I had read some material on crawlers myself, and since I had a few free days recently, I put together a small one.
It is written in Java. First, the environment we will use: JDK 1.6, htmlparser.jar (the classic Java library for parsing HTML pages), httpclient-3.01.jar, logging-1.1.jar, and codec-1.4.jar.
Now for the basic idea of the crawl. It is simple: breadth-first traversal. You may be more familiar with depth-first traversal, but breadth-first is just as easy to understand. Think of your computer as one site, and the Internet as a network of sites to walk. Starting from your own site, you first visit the sites nearest to you, then move outward ring by ring, visiting the next level each time — much like the graph traversal concepts you learned in school. We can then treat a site such as Baidu or Sina as a starting point and dig outward from it.
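To make the idea concrete before we build the real crawler, here is a minimal sketch of breadth-first traversal over a tiny hand-built "link graph". The graph, class name, and URLs below are made up for illustration; a real crawler would discover links by downloading pages, as we do later.

```java
import java.util.*;

public class BfsSketch {

    // Breadth-first traversal: visit the seed, then everything one link away,
    // then everything two links away, and so on.
    public static List<String> bfs(Map<String, List<String>> graph, String seed) {
        List<String> order = new ArrayList<String>();        // pages in visit order
        Set<String> visited = new HashSet<String>();         // pages already seen
        LinkedList<String> queue = new LinkedList<String>(); // FIFO frontier
        queue.addLast(seed);
        visited.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst(); // take the oldest (nearest) entry
            order.add(url);                   // "download" it
            List<String> links = graph.containsKey(url)
                    ? graph.get(url) : Collections.<String>emptyList();
            for (String link : links) {
                if (visited.add(link)) {      // enqueue each unseen neighbour once
                    queue.addLast(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("A", Arrays.asList("B", "C")); // A links to B and C
        graph.put("B", Arrays.asList("D"));
        graph.put("C", Arrays.asList("D"));
        System.out.println(bfs(graph, "A"));     // nearest pages first: [A, B, C, D]
    }
}
```

Note how D is reached through both B and C but is visited only once — exactly the deduplication our crawler's queue will need.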
With the idea in place, the first step is to define the classes we need. First, we have to remember which URLs have been visited and which have not. For the unvisited ones we need a first-in, first-out queue, so we start with a class to manage that queue.
import java.util.LinkedList;

public class Queue {
    // backing store; addLast/removeFirst gives first-in, first-out order
    private LinkedList<Object> queue = new LinkedList<Object>();

    public void enQueue(Object t) {
        queue.addLast(t);
    }

    public Object deQueue() {
        return queue.removeFirst();
    }

    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    public boolean contains(Object t) {
        return queue.contains(t);
    }
}
import java.util.HashSet;
import java.util.Set;

public class LinkQueue {
    // URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // URLs waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // enqueue only URLs that are non-empty and not already in either structure
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isQueueEmpty();
    }
}
With the queue management written, we move on to the most critical classes: one that downloads pages, and one that extracts and processes the page information.
Here I grab the title and URL and write them to a file; you can change what information is extracted, and how it is processed, to suit your own needs.
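For instance, the title extraction in the download class below uses raw indexOf calls, which throw if a page has no <title> tag. As a hedged alternative sketch (not part of the original code), the same extraction can be done with a regular expression that degrades gracefully; the class name and sample HTML here are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Pull the text between <title> and </title> out of downloaded HTML.
    // Returns "" instead of throwing when the page has no title.
    public static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<html><title> Demo Page </title></html>")); // Demo Page
        System.out.println(extractTitle("<html><body>no title</body></html>").isEmpty()); // true
    }
}
```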
import java.io.FileWriter;
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

    // derive a local file name from the URL and the Content-Type header
    public String getFileNameByUrl(String url, String contentType) {
        if (contentType.indexOf("html") != -1) {
            // HTML page: just strip characters illegal in file names
            return url.replaceAll("[\\?/:*|<>\"]", "");
        } else {
            // other resource: append the type suffix, e.g. ".jpg"
            return url.replaceAll("[\\?/:*|<>\"]", "")
                    + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    // append the extracted data to a local file
    private void saveToLocal(String data) {
        try {
            FileWriter fw = new FileWriter("/users/vika/desktop/url.txt", true);
            fw.write(data);
            fw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String downloadFile(String url) {
        String filePath = null;
        HttpClient httpClient = new HttpClient();
        // connection timeout: 5 seconds
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);

        GetMethod getMethod = new GetMethod(url);
        // read timeout: 5 seconds, with the default retry handler
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        try {
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                filePath = null;
            }
            // keep only the <title> part of the page, if there is one
            String content = getMethod.getResponseBodyAsString();
            if (content.contains("title")) {
                content = content.substring(content.indexOf("<title>"),
                        content.indexOf("</title>"));
                content = URLDecoder.decode(content, "UTF-8");
            }
            filePath = getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(filePath + content + "\n");
            System.out.println(filePath);
            System.out.println(content);
        } catch (HttpException e) {
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

    // extract from the page at url every link accepted by filter
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("UTF-8");
            // filter matching <frame src="..."> tags
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // match either <a> link tags or <frame> tags
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    // <a> tag: take its href directly
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink();
                    if (filter.accept(linkUrl)) {
                        links.add(linkUrl);
                    }
                } else {
                    // <frame> tag: pull the URL out of the src attribute
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
A small interface is also used here to decide which links to keep.

public interface LinkFilter {
    boolean accept(String url);
}
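To see how the interface is meant to be used, here is a self-contained demo; the interface is repeated inside the demo class so the snippet compiles on its own, and the "youku" keyword mirrors the filter we build in the main class below.

```java
public class LinkFilterDemo {

    // copy of the one-method interface from the article, nested for self-containment
    public interface LinkFilter {
        boolean accept(String url);
    }

    public static void main(String[] args) {
        // anonymous implementation: keep only links that stay on the target site
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.contains("youku");
            }
        };
        System.out.println(filter.accept("http://www.youku.com/index.html")); // true
        System.out.println(filter.accept("http://www.example.com/"));         // false
    }
}
```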
With all of these done, we can write our main class. Here I just defined it casually, using Test as the main class.
The timeout is set to 5 s, and the keyword I chose is "youku", because I crawled information from www.youku.com: any URL containing the keyword "youku" gets collected. You can change this to crawl whatever you prefer.
import java.util.Set;

public class Test {

    public static void main(String[] args) {
        Test test = new Test();
        test.crawling(new String[] { "http://www.youku.com/" });
    }

    // seed the queue of unvisited URLs
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        // keep only URLs containing the keyword "youku"
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.contains("youku");
            }
        };
        initCrawlerWithSeeds(seeds);
        // breadth-first loop: stop when the queue is empty or 100000 pages are visited
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 100000) {
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null) {
                continue;
            }
            DownLoadFile downLoadFile = new DownLoadFile();
            downLoadFile.downloadFile(visitUrl);
            LinkQueue.addVisitedUrl(visitUrl);
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }
}