I was recently looking for a small Java project to write on my own, but couldn't find anything suitable, so I started learning by writing a little crawler, and crawlers turned out to be quite interesting. I followed a tutorial I found; this version crawls both over a raw socket and over HTTP.
Project structure:
(1) The SystemControl class, which drives the whole crawler: it schedules and runs the crawl tasks.
package com.simple.control;

import com.simple.Level.TaskLevel;
import com.simple.manger.CrawlerManger;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.util.ArrayList;

/**
 * Created by Lewis on 2016/10/15.
 */
public class SystemControl {
    public static void main(String[] args) {
        ArrayList<UrlPojo> urlPojoArrayList = new ArrayList<>();
        urlPojoArrayList.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        urlPojoArrayList.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        int count = 0;
        for (UrlPojo urlPojo : urlPojoArrayList) {
            CrawlerManger crawlerManger = new CrawlerManger(false);
            CrawlResultPojo crawlResultPojo = crawlerManger.crawl(urlPojo);
            System.out.println(crawlResultPojo.getPageContent());
            count++;
            System.out.println("Already crawled: " + count + " page(s)");
        }
    }
}
(2) The ICrawl interface is the common contract for the two crawl modes; both crawl implementations implement it.
package com.simple.Icrawl;

import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

/**
 * Interface implemented by both crawl modes.
 * Created by Lewis on 2016/10/15.
 */
public interface ICrawl {
    CrawlResultPojo crawl(UrlPojo urlPojo);
}
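Any additional crawl mode only has to implement this single method. As an illustration only (this stub is my own and not part of the tutorial), a third implementation would plug into the same machinery:

package com.simple.crawImpl;

import com.simple.Icrawl.ICrawl;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

// Hypothetical stub crawler, shown only to illustrate the ICrawl extension point.
public class StubCrawlerImpl implements ICrawl {
    @Override
    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        CrawlResultPojo result = new CrawlResultPojo();
        result.setSuccess(true);
        result.setPageContent("<html>canned content for " + urlPojo.getUrl() + "</html>");
        return result;
    }
}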
(3) A priority level for each task.
package com.simple.Level;

/**
 * Crawl task priority levels.
 * Created by Lewis on 2016/10/15.
 */
public enum TaskLevel {
    HIGH, MIDDLE, LOW
}
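The tutorial assigns a level to every task but never acts on it when scheduling. A minimal sketch of how the levels could be used, assuming HIGH tasks should be crawled first (the demo class is hypothetical and relies on the declaration order HIGH, MIDDLE, LOW):

package com.simple.control;

import com.simple.Level.TaskLevel;
import com.simple.pojos.UrlPojo;

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical scheduler sketch: poll HIGH tasks before MIDDLE and LOW ones.
public class PriorityDemo {
    public static void main(String[] args) {
        PriorityQueue<UrlPojo> queue = new PriorityQueue<>(
                Comparator.comparing(UrlPojo::getTaskLevel));  // orders by enum ordinal
        queue.add(new UrlPojo("https://www.taobao.com/", TaskLevel.LOW));
        queue.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        System.out.println(queue.poll().getTaskLevel());  // prints HIGH
    }
}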
(4) The crawler's task class and result class.
1. The task class the crawler works from, holding the URL to crawl, the task priority, and so on.
package com.simple.pojos;

import com.simple.Level.TaskLevel;

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

/**
 * @author Lewis
 * URL task class.
 * Created by Lewis on 2016/10/15.
 */
public class UrlPojo {
    private String url;                              // page URL
    private TaskLevel taskLevel = TaskLevel.MIDDLE;  // URL priority level

    public UrlPojo(String url) {
        this.url = url;
    }

    public UrlPojo(String url, TaskLevel taskLevel) {
        this(url);
        this.taskLevel = taskLevel;
    }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public TaskLevel getTaskLevel() { return taskLevel; }
    public void setTaskLevel(TaskLevel taskLevel) { this.taskLevel = taskLevel; }

    public String getHost() {  // extract the host name from the URL
        URL url = null;
        try {
            url = new URL(this.url);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        return url.getHost();
    }

    public HttpURLConnection getConnection() {
        URL url = null;
        try {
            url = new URL(this.url);
            URLConnection conn = url.openConnection();
            if (conn instanceof HttpURLConnection) {
                return (HttpURLConnection) conn;
            }
            throw new Exception("Failed to open connection");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
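One caveat: getConnection() hands the connection back untouched, so a crawl can hang indefinitely on a slow host. A minimal sketch of adding timeouts before the connection is used (the 5-second values are my own assumption, not from the tutorial):

package com.simple.pojos;

import java.net.HttpURLConnection;

// Hypothetical demo: set timeouts on the connection before reading from it.
// openConnection() does not yet connect, so this is still early enough.
public class TimeoutDemo {
    public static void main(String[] args) {
        UrlPojo urlPojo = new UrlPojo("https://www.taobao.com/");
        HttpURLConnection conn = urlPojo.getConnection();
        if (conn != null) {
            conn.setConnectTimeout(5000);  // give up connecting after 5 s
            conn.setReadTimeout(5000);     // give up on a stalled read after 5 s
        }
    }
}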
2. The result class: everything a crawl fetches is saved in it.
package com.simple.pojos;

/**
 * Encapsulates a crawl result.
 * Created by Lewis on 2016/10/15.
 */
public class CrawlResultPojo {
    private boolean isSuccess;    // whether the crawl succeeded
    private String pageContent;   // the page content
    private int httpStatuCode;    // the HTTP status code

    public boolean isSuccess() {
        return isSuccess;
    }

    public void setSuccess(boolean success) {
        isSuccess = success;
    }

    public String getPageContent() {
        return pageContent;
    }

    public void setPageContent(String pageContent) {
        this.pageContent = pageContent;
    }

    public int getHttpStatuCode() {
        return httpStatuCode;
    }

    public void setHttpStatuCode(int httpStatuCode) {
        this.httpStatuCode = httpStatuCode;
    }
}
(5) The crawler manager, which chooses the crawl mode and retrieves the crawl results.
package com.simple.manger;

import com.simple.Icrawl.ICrawl;
import com.simple.crawImpl.CrawlerImpl;
import com.simple.crawImpl.HttpUrlConnectionCrawlerImpl;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

/**
 * @author Lewis
 * Business logic of the crawler manager: selects a crawl implementation.
 * Created by Lewis on 2016/10/15.
 */
public class CrawlerManger {
    private ICrawl crawler;

    public CrawlerManger(boolean isSocket) {
        if (isSocket) {
            this.crawler = new CrawlerImpl();
        } else {
            this.crawler = new HttpUrlConnectionCrawlerImpl();
        }
    }

    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        return this.crawler.crawl(urlPojo);
    }
}
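The isSocket constructor flag is the only switch between the two modes. A minimal usage sketch (the demo class is mine, not part of the tutorial):

package com.simple.manger;

import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

// Hypothetical demo of choosing a crawl mode through the manager.
public class ManagerDemo {
    public static void main(String[] args) {
        CrawlerManger socketCrawler = new CrawlerManger(true);  // socket mode: CrawlerImpl
        CrawlerManger httpCrawler = new CrawlerManger(false);   // HTTP mode: HttpUrlConnectionCrawlerImpl
        CrawlResultPojo result = httpCrawler.crawl(new UrlPojo("https://www.taobao.com/"));
        System.out.println(result != null && result.isSuccess());
    }
}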
(6) The two crawl modes:
1) Socket mode:
package com.simple.crawImpl;

import com.simple.Icrawl.ICrawl;
import com.simple.Level.TaskLevel;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.io.*;
import java.net.Socket;

/**
 * Socket crawl mode.
 * Created by Lewis on 2016/10/15.
 */
public class CrawlerImpl implements ICrawl {

    @Override
    public CrawlResultPojo crawl(UrlPojo urlPojo) {  // crawl the URL's content and return the result
        CrawlResultPojo crawlResultPojo = new CrawlResultPojo();
        if (urlPojo == null || urlPojo.getUrl() == null) {  // the UrlPojo or its URL is null
            crawlResultPojo.setPageContent(null);
            crawlResultPojo.setSuccess(false);
            return crawlResultPojo;
        }
        String host = urlPojo.getHost();
        BufferedWriter bw = null;
        BufferedReader br = null;
        Socket socket = null;
        if (host != null) {
            try {
                /**
                 * The usual steps of socket programming:
                 * (1) create the socket;
                 * (2) open the input/output streams connected to it;
                 * (3) read/write the socket according to some protocol;
                 * (4) close the socket.
                 *
                 * In the Socket constructors, address/host and port are the IP
                 * address, host name, and port number of the remote end of the
                 * connection; stream selects a stream or datagram socket;
                 * localPort is the local host's port; localAddr and bindAddr
                 * are the local machine's address (the ServerSocket host address).
                 */
                socket = new Socket(host, 80);
                bw = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));
                /**
                 * HTTP/1.1 supports persistent connections. With HTTP/1.0, once
                 * the request is answered the TCP connection is released; under
                 * HTTP/1.1 it stays open, so the read below would block.
                 * That is why the request is sent as HTTP/1.0.
                 */
                bw.write("GET " + urlPojo.getUrl() + " HTTP/1.0\r\n");  // blocking occurs with HTTP/1.1
                bw.write("HOST:" + host + "\r\n");
                bw.write("\r\n");  // a \r\n with no data before it marks the end of the HTTP headers
                bw.flush();        // flush the buffer
                br = new BufferedReader(new InputStreamReader(socket.getInputStream(), "utf-8"));
                String line = null;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = br.readLine()) != null) {
                    stringBuilder.append(line + "\n");
                }
                crawlResultPojo.setSuccess(true);
                crawlResultPojo.setPageContent(stringBuilder.toString());
                return crawlResultPojo;
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (socket != null) socket.close();  // null check prevents a NullPointerException
                    if (br != null) br.close();          // release resources to prevent leaks
                    if (bw != null) bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                    System.out.println("Failed to close the streams");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CrawlerImpl cl = new CrawlerImpl();
        System.out.println(cl.crawl(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH)).getPageContent());
        System.out.println("Done");
    }
}
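Note that socket mode stores the raw response, so pageContent contains the status line and headers in front of the body; and since the taobao URL is https while the socket talks plain HTTP on port 80, the server will most likely answer with a redirect rather than the page itself. If only the body is wanted, one option is to split at the first blank line, which is what separates the headers from the body; a sketch (my own addition):

package com.simple.crawImpl;

import com.simple.Level.TaskLevel;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

// Hypothetical post-processing: separate headers from body in the raw response.
public class SplitDemo {
    public static void main(String[] args) {
        CrawlResultPojo result = new CrawlerImpl().crawl(
                new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        if (result == null || result.getPageContent() == null) return;
        String raw = result.getPageContent();
        int split = raw.indexOf("\n\n");  // CrawlerImpl appends "\n" per line, so a blank line is "\n\n"
        String headers = (split >= 0) ? raw.substring(0, split) : raw;
        String body = (split >= 0) ? raw.substring(split + 2) : "";
        System.out.println(headers);
    }
}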
2) HTTP mode:
package com.simple.crawImpl;

import com.simple.Icrawl.ICrawl;
import com.simple.Level.TaskLevel;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;

/**
 * HttpURLConnection crawl mode.
 * Created by Lewis on 2016/10/15.
 */
public class HttpUrlConnectionCrawlerImpl implements ICrawl {

    @Override
    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        CrawlResultPojo crawlResultPojo = new CrawlResultPojo();
        if (urlPojo == null || urlPojo.getUrl() == null) {  // the UrlPojo or its URL is null
            crawlResultPojo.setPageContent(null);
            crawlResultPojo.setSuccess(false);
            return crawlResultPojo;
        }
        HttpURLConnection httpURLConnection = urlPojo.getConnection();
        if (httpURLConnection != null) {
            BufferedReader bufferedReader = null;
            try {
                bufferedReader = new BufferedReader(
                        new InputStreamReader(httpURLConnection.getInputStream(), "utf-8"));
                String line = null;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = bufferedReader.readLine()) != null) {
                    stringBuilder.append(line + "\n");
                }
                crawlResultPojo.setPageContent(stringBuilder.toString());
                crawlResultPojo.setSuccess(true);
                return crawlResultPojo;
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (bufferedReader != null) bufferedReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(new HttpUrlConnectionCrawlerImpl()
                .crawl(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH)).getPageContent());
        System.out.println("Done");
    }
}
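One loose end: CrawlResultPojo declares an httpStatuCode field, but neither implementation ever sets it. In the HttpURLConnection mode it could be filled in from getResponseCode(); a sketch of the extra lines (my addition, not part of the tutorial):

// Sketch: inside HttpUrlConnectionCrawlerImpl.crawl(), after obtaining the connection.
HttpURLConnection httpURLConnection = urlPojo.getConnection();
if (httpURLConnection != null) {
    try {
        // getResponseCode() sends the request if necessary and returns e.g. 200 or 302
        crawlResultPojo.setHttpStatuCode(httpURLConnection.getResponseCode());
    } catch (IOException e) {
        e.printStackTrace();
    }
    // ... then read the body exactly as in the implementation above ...
}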