When it comes to web crawlers, Java's own URLConnection can handle some basic page-fetching functionality, but for more advanced features such as redirect handling and stripping HTML tags, URLConnection alone is not enough.
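For comparison, here is a minimal sketch of that basic fetch using nothing but java.net.URLConnection; the URL and the output file name are just placeholders:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public class SimpleFetch {
    public static void main(String[] args) throws Exception {
        // Open a plain URLConnection to the target page (placeholder URL)
        URLConnection connection = new URL("http://www.baidu.com").openConnection();
        InputStream input = connection.getInputStream();
        OutputStream output = new FileOutputStream("page.html");
        // Copy the raw response body byte by byte
        int b;
        while ((b = input.read()) != -1) {
            output.write(b);
        }
        input.close();
        output.close();
    }
}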
Here we can bring in the third-party HttpClient jar package instead.
Next, let's use HttpClient to write a simple demo that crawls Baidu's home page:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
/**
 * @author callmewhy
 */
public class Spider {
    private static HttpClient httpClient = new HttpClient();

    /**
     * @param path
     *            link to the target page
     * @return a boolean indicating whether the target page was downloaded successfully
     * @throws Exception
     *             IO exception thrown while reading the page stream or writing the local file stream
     */
    public static boolean downloadPage(String path) throws Exception {
        // Define the input and output streams
        InputStream input = null;
        OutputStream output = null;
        // Build the GET method
        GetMethod getMethod = new GetMethod(path);
        // Execute the request and get the status code
        int statusCode = httpClient.executeMethod(getMethod);
        // Handle the status code
        // For simplicity, only a status code of 200 is processed
        if (statusCode == HttpStatus.SC_OK) {
            input = getMethod.getResponseBodyAsStream();
            // Derive the file name from the URL
            String filename = path.substring(path.lastIndexOf('/') + 1)
                    + ".html";
            // Open the file output stream
            output = new FileOutputStream(filename);
            // Copy the response to the file
            int tempByte = -1;
            while ((tempByte = input.read()) != -1) {
                output.write(tempByte);
            }
            // Close the input stream
            if (input != null) {
                input.close();
            }
            // Close the output stream
            if (output != null) {
                output.close();
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            // Crawl the Baidu home page and write it to a file
            Spider.downloadPage("http://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
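If the download succeeds, the page is saved to a file named after the last path segment of the URL plus a .html suffix, so crawling http://www.baidu.com produces a local file called www.baidu.com.html in the working directory.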
But such a basic crawler cannot satisfy the needs of the wide variety of real crawling tasks.
Let's start by introducing the breadth-first (width-first) crawler.
Breadth-first traversal should be familiar to most readers, and it makes the breadth-first crawler easy to understand.
We can think of the Internet as a very large directed graph: the links on each page are edges, and each file, or each plain page without any links, is a terminal node of the graph:
A breadth-first crawler is a crawler that walks this directed graph, starting from the root node and crawling the data of newly discovered nodes level by level.
The breadth-first traversal algorithm works as follows:
(1) Put vertex V into the queue.
(2) While the queue is not empty, continue executing; otherwise the algorithm ends.
(3) Dequeue the head node V, visit vertex V, and mark V as visited.
(4) Find the first adjacent vertex col of vertex V.
(5) If the adjacent vertex col of V has not been visited, put col into the queue.
(6) Continue looking for the other adjacent vertices col of V and go to step (5); if all adjacent vertices of V have been visited, go to step (2).
Following this breadth-first traversal algorithm, the traversal order of the graph above is a->b->c->d->e->f->h->g->i, moving down the graph level by level; a minimal code sketch of this traversal follows.
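Here is a small, self-contained sketch of that traversal in Java. The adjacency list is made-up example data, not the graph from the figure, so the visit order it prints is simply a, b, c, d, e, f:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BfsDemo {
    public static void main(String[] args) {
        // A small made-up directed graph as an adjacency list (example data only)
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("d", "e"));
        graph.put("c", Arrays.asList("f"));

        LinkedList<String> queue = new LinkedList<String>(); // vertices waiting to be visited
        Set<String> visited = new HashSet<String>();         // vertices already visited

        // (1) put the start vertex into the queue
        queue.add("a");
        // (2) keep going while the queue is not empty
        while (!queue.isEmpty()) {
            // (3) dequeue the head vertex, visit it and mark it as visited
            String v = queue.poll();
            visited.add(v);
            System.out.println("visit " + v);
            // (4)-(6) enqueue every adjacent vertex that has not been visited and is not already queued
            List<String> neighbors = graph.get(v);
            if (neighbors == null) {
                continue;
            }
            for (String col : neighbors) {
                if (!visited.contains(col) && !queue.contains(col)) {
                    queue.add(col);
                }
            }
        }
        // prints a, b, c, d, e, f: one level of the graph at a time
    }
}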
A breadth-first crawler actually crawls outward from a series of seed nodes, and is essentially the same as this graph traversal.
We can place the URLs of the pages that need to be crawled in a TODO table, and the pages that have already been visited in a visited table:
The basic flow of the breadth-first crawler is as follows:
(1) Compare each parsed link with the links in the visited table; if the link does not exist in the visited table, it has not been visited yet.
(2) Put the link into the TODO table.
(3) After processing, take a link from the TODO table and move it directly into the visited table.
(4) Continue the above process for the page this link points to, and repeat the cycle.
Let's now build a breadth-first crawler step by step.
First, we design a data structure to store the TODO table. Since we need first-in-first-out behavior, a queue is the natural choice, so we write a custom Queue class:
package model;

import java.util.LinkedList;

/**
 * Custom Queue class used to save the TODO table
 */
public class Queue {
    /**
     * The queue itself, implemented with a LinkedList
     */
    private LinkedList<Object> queue = new LinkedList<Object>();

    /**
     * Add t to the queue
     */
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    /**
     * Remove the first item from the queue and return it
     */
    public Object deQueue() {
        return queue.removeFirst();
    }

    /**
     * Return whether the queue is empty
     */
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    /**
     * Determine and return whether the queue contains t
     */
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    /**
     * Determine and return whether the queue is empty
     */
    public boolean empty() {
        return queue.isEmpty();
    }
}
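Just to illustrate how this Queue is meant to be used, here is a throwaway snippet (it assumes it sits in the same package as the Queue class above and is not part of the crawler):
package model;

public class QueueDemo {
    public static void main(String[] args) {
        Queue todo = new Queue();
        todo.enQueue("http://www.baidu.com");
        System.out.println(todo.contains("http://www.baidu.com")); // true
        String url = (String) todo.deQueue();
        System.out.println(url);                 // http://www.baidu.com
        System.out.println(todo.isQueueEmpty()); // true
    }
}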
A data structure is also needed to record the URLs that have already been visited, that is, the visited table.
Considering the role of this table: whenever we are about to visit a URL, we first look it up in this data structure, and if the URL is already there, the task is discarded.
This data structure must contain no duplicates and must support fast lookup, so we choose a HashSet to store it.
Putting this together, we build another SpiderQueue class to hold the visited table and the TODO table:
package model;

import java.util.HashSet;
import java.util.Set;

/**
 * Custom class that saves the visited table and the unvisited table
 */
public class SpiderQueue {
    /**
     * The collection of URLs that have been visited, that is, the visited table
     */
    private static Set<Object> visitedUrl = new HashSet<>();

    /**
     * Add a URL to the visited queue
     */
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    /**
     * Remove a URL that has been visited
     */
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    /**
     * Get the number of URLs that have been visited
     */
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    /**
     * The collection of URLs waiting to be visited, that is, the unvisited table
     */
    private static Queue unVisitedUrl = new Queue();

    /**
     * Get the unvisited queue
     */
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    /**
     * Dequeue one URL from unVisitedUrl
     */
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    /**
     * Add a URL to unVisitedUrl only if it has never been seen, so that each URL is visited only once
     */
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    /**
     * Determine whether the unvisited URL queue is empty
     */
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
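Again, a tiny illustrative snippet showing how the two tables work together; it is not part of the final crawler:
package model;

public class SpiderQueueDemo {
    public static void main(String[] args) {
        SpiderQueue.addUnvisitedUrl("http://www.baidu.com");
        SpiderQueue.addUnvisitedUrl("http://www.baidu.com");   // ignored: already in the TODO table
        String url = (String) SpiderQueue.unVisitedUrlDeQueue();
        SpiderQueue.addVisitedUrl(url);
        SpiderQueue.addUnvisitedUrl(url);                      // ignored: already in the visited table
        System.out.println(SpiderQueue.getVisitedUrlNum());    // 1
        System.out.println(SpiderQueue.unVisitedUrlsEmpty());  // true
    }
}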
That takes care of the custom data classes. Next comes a utility class for downloading web pages, which we define as the DownTool class:
package controller;

import java.io.*;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.*;
public class DownTool {
    /**
     * Generate the file name used to save the web page, based on the URL and the content type,
     * removing characters from the URL that are not allowed in file names
     */
    private String getFileNameByUrl(String url, String contentType) {
        // Remove the seven characters of "http://"
        url = url.substring(7);
        // Make sure the crawled page is of type text/html
        if (contentType.indexOf("html") != -1) {
            // Convert the special symbols in the URL to underscores
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Save the page byte array to a local file; filePath is the relative path of the file to be saved
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(
                    new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Download the web page that the URL points to
    public String downloadFile(String url) {
        String filePath = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5s
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);
        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        // 3. Execute the GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code of the request
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: "
                        + getMethod.getStatusLine());
                filePath = null;
            }
            // 4. Handle the HTTP response content
            byte[] responseBody = getMethod.getResponseBody(); // read the response as a byte array
            // Generate the file name for saving, based on the page URL
            filePath = "temp\\"
                    + getFileNameByUrl(url,
                            getMethod.getResponseHeader("Content-Type")
                                    .getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception occurred: either the protocol is wrong or something is wrong with the returned content
            System.out.println("Please check whether your HTTP address is correct");
            e.printStackTrace();
        } catch (IOException e) {
            // A network exception occurred
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
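One thing worth noting when using DownTool: downloadFile() saves into a temp directory relative to the working directory, and saveToLocal() does not create that directory, so it has to exist beforehand. A hypothetical usage snippet:
package controller;

public class DownToolDemo {
    public static void main(String[] args) {
        // saveToLocal() does not create directories, so make sure "temp" exists first
        new java.io.File("temp").mkdirs();
        DownTool downTool = new DownTool();
        // For a text/html response this should save something like temp\www.baidu.com.html
        String filePath = downTool.downloadFile("http://www.baidu.com");
        System.out.println("saved to: " + filePath);
    }
}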
Next we need an HtmlParserTool class to handle the HTML tags and extract links:
package controller;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import model.LinkFilter;
public class HtmlParserTool {
    // Get the links on a page; filter is used to filter the links
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter the <frame> tags so that the src attribute of the frame tag can be extracted
            NodeFilter frameFilter = new NodeFilter() {
                private static final long serialVersionUID = 1L;

                @Override
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter that combines the filters for <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(
                    LinkTag.class), frameFilter);
            // Get all the tags selected by the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) // <a> tag
                {
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else // <frame> tag
                {
                    // Extract the link in the src attribute of the frame, such as <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
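Both HtmlParserTool above and the crawler class below reference a model.LinkFilter type that is not listed in the article. Judging from how it is used (a single accept(String) callback), a minimal version of it would look something like this:
package model;

/**
 * Callback used to decide whether an extracted link should be kept
 */
public interface LinkFilter {
    boolean accept(String url);
}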
Finally, we write the crawler class itself, which ties together the classes and functions encapsulated above:
package controller;

import java.util.Set;

import model.LinkFilter;
import model.SpiderQueue;
public class BfsSpider {
    /**
     * Initialize the URL queue with the seeds
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            SpiderQueue.addUnvisitedUrl(seeds[i]);
    }

    // Define a filter that only keeps links starting with http://www.baidu.com
    public void crawling(String[] seeds) {
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.baidu.com"))
                    return true;
                else
                    return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // Loop condition: the queue of links to crawl is not empty and no more than 1000 pages have been crawled
        while (!SpiderQueue.unVisitedUrlsEmpty()
                && SpiderQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = (String) SpiderQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownTool downTool = new DownTool();
            // Download the page
            downTool.downloadFile(visitUrl);
            // Put the URL into the visited table
            SpiderQueue.addVisitedUrl(visitUrl);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Enqueue the new, unvisited URLs
            for (String link : links) {
                SpiderQueue.addUnvisitedUrl(link);
            }
        }
    }

    // Main method entry point
    public static void main(String[] args) {
        BfsSpider crawler = new BfsSpider();
        crawler.crawling(new String[] { "http://www.baidu.com" });
    }
}
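One practical note: to compile and run the crawler, the Commons HttpClient jar (together with its commons-logging and commons-codec dependencies) and the HtmlParser jar need to be on the classpath.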
Run it and you will see that the crawler crawls all of the pages reachable from the Baidu home page.
That is the whole process of crawling web content in Java with the HttpClient toolkit and a breadth-first crawler. It is a little more involved, so think it through carefully; I hope it helps.