Two Ways for a Java Crawler to Crawl (HTTP | Socket): A Simple Demo (Part 1)


I was recently looking for a small Java project to write for fun, but could not find a suitable one, so I started learning a bit about crawlers instead, and I find them quite interesting. Below is a tutorial I worked through; this installment covers crawling over a raw socket and over HTTP.


Project structure chart: (figure omitted)



(1) The SystemControl class drives the whole crawler: it schedules the crawl tasks and runs each crawl.


package com.simple.control;

import com.simple.Level.TaskLevel;
import com.simple.manger.CrawlerManger;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.util.ArrayList;

/**
 * Created by Lewis on 2016/10/15.
 */
public class SystemControl {
    public static void main(String[] args) {

        ArrayList<UrlPojo> urlPojoArrayList = new ArrayList<>();

        urlPojoArrayList.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        urlPojoArrayList.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));

        int count = 0;

        for (UrlPojo urlPojo : urlPojoArrayList) {
            CrawlerManger crawlerManger = new CrawlerManger(false);   // false selects the HTTP mode
            CrawlResultPojo crawlResultPojo = crawlerManger.crawl(urlPojo);
            System.out.println(crawlResultPojo.getPageContent());
            count++;
            System.out.println("Already crawled: " + count + " page(s)");
        }
    }
}



(2) The ICrawl interface is the uniform contract for the two crawl modes; both crawl implementations implement it.


package com.simple.Icrawl;

import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

/**
 * Common interface for the crawl implementation classes
 * Created by Lewis on 2016/10/15.
 */
public interface ICrawl {
    CrawlResultPojo crawl(UrlPojo urlPojo);
}



(3) A priority level for each task:

package com.simple.Level;

/**
 * Crawl task priority levels
 * Created by Lewis on 2016/10/15.
 */
public enum TaskLevel {
    HIGH, MIDDLE, LOW
}
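
A note on this enum: the demo stores a priority on every task but never actually consults it when scheduling. If you wanted the levels to matter, a minimal sketch (my own addition, not part of the tutorial) is to drain the tasks from a PriorityQueue ordered by level, since enum constants compare in declaration order (HIGH before MIDDLE before LOW):

import com.simple.Level.TaskLevel;
import com.simple.pojos.UrlPojo;

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical scheduling helper, not in the original project.
public class TaskQueueDemo {
    public static void main(String[] args) {
        // Enum constants compare by declaration order, so HIGH polls first.
        PriorityQueue<UrlPojo> queue = new PriorityQueue<>(
                Comparator.comparing(UrlPojo::getTaskLevel));
        queue.add(new UrlPojo("https://www.taobao.com/", TaskLevel.LOW));
        queue.add(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().getTaskLevel());   // HIGH, then LOW
        }
    }
}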


(4) The crawler's task class and result class

1. The task class a crawl needs, holding the URL to fetch, the task priority, and so on:


package com.simple.pojos;

import com.simple.Level.TaskLevel;

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

/**
 * @author Lewis
 * URL task class
 * Created by Lewis on 2016/10/15.
 */
public class UrlPojo {
    private String url;                                 // page URL
    private TaskLevel taskLevel = TaskLevel.MIDDLE;     // URL priority level

    public UrlPojo(String url) {
        this.url = url;
    }

    public UrlPojo(String url, TaskLevel taskLevel) {
        this(url);
        this.taskLevel = taskLevel;
    }

    public String getUrl() { return url; }

    public void setUrl(String url) { this.url = url; }

    public TaskLevel getTaskLevel() { return taskLevel; }

    public void setTaskLevel(TaskLevel taskLevel) { this.taskLevel = taskLevel; }

    public String getHost() {   // extract the host name from the URL
        URL url = null;
        try {
            url = new URL(this.url);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        return url.getHost();
    }

    public HttpURLConnection getConnection() {
        URL url = null;
        try {
            url = new URL(this.url);
            URLConnection conn = url.openConnection();
            if (conn instanceof HttpURLConnection) {
                return (HttpURLConnection) conn;
            } else {
                throw new Exception("Opening the connection failed");
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
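
Since getConnection() can return null and sets no request headers, it is worth hardening the connection a little before use. This is my own suggestion rather than the tutorial's code: some servers reject requests that carry no User-Agent, and timeouts keep the crawler from hanging on a dead host.

import com.simple.pojos.UrlPojo;

import java.net.HttpURLConnection;

// Hypothetical hardening of the connection setup, not in the original demo.
public class ConnectionSetupDemo {
    public static void main(String[] args) {
        HttpURLConnection conn = new UrlPojo("https://www.taobao.com/").getConnection();
        if (conn != null) {   // getConnection() returns null when opening fails
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (SimpleCrawler demo)");
            conn.setConnectTimeout(5000);   // milliseconds
            conn.setReadTimeout(5000);
        }
    }
}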


2. The result class of a crawl; everything a crawl produces is saved in this class:


package com.simple.pojos;

/**
 * Crawl result encapsulation
 * Created by Lewis on 2016/10/15.
 */
public class CrawlResultPojo {
    private boolean isSuccess;      // whether the crawl succeeded
    private String pageContent;     // web page content
    private int httpStatuCode;      // HTTP status code

    public boolean isSuccess() {
        return isSuccess;
    }

    public void setSuccess(boolean success) {
        isSuccess = success;
    }

    public String getPageContent() {
        return pageContent;
    }

    public void setPageContent(String pageContent) {
        this.pageContent = pageContent;
    }

    public int getHttpStatuCode() {
        return httpStatuCode;
    }

    public void setHttpStatuCode(int httpStatuCode) {
        this.httpStatuCode = httpStatuCode;
    }
}
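
Notice that httpStatuCode is declared here but neither crawler ever fills it in. A minimal sketch of how it could be populated from the HTTP connection (my assumption, not the original code):

import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical snippet, not part of the original classes: record the status
// code so callers can tell a 200 apart from a 404 or 500.
public class StatusCodeDemo {
    public static void main(String[] args) throws IOException {
        CrawlResultPojo result = new CrawlResultPojo();
        HttpURLConnection conn = new UrlPojo("https://www.taobao.com/").getConnection();
        if (conn != null) {
            result.setHttpStatuCode(conn.getResponseCode());
        }
        System.out.println(result.getHttpStatuCode());
    }
}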



(5) The crawler manager: chooses the crawl mode and returns the crawl results.


package com.simple.manger;

import com.simple.Icrawl.ICrawl;
import com.simple.crawImpl.CrawlerImpl;
import com.simple.crawImpl.HttpUrlConnectionCrawlerImpl;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

/**
 * @author Lewis
 * Crawl manager containing the business logic
 * Created by Lewis on 2016/10/15.
 */
public class CrawlerManger {
    private ICrawl crawler;

    public CrawlerManger(boolean isSocket) {
        if (isSocket) {
            this.crawler = new CrawlerImpl();
        } else {
            this.crawler = new HttpUrlConnectionCrawlerImpl();
        }
    }

    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        return this.crawler.crawl(urlPojo);
    }
}
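
Usage is symmetric for the two modes: true in the constructor selects the socket crawler, false the HttpURLConnection crawler. A short example, mirroring how SystemControl uses the manager:

import com.simple.Level.TaskLevel;
import com.simple.manger.CrawlerManger;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

public class ManagerDemo {
    public static void main(String[] args) {
        UrlPojo task = new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH);
        CrawlResultPojo viaHttp = new CrawlerManger(false).crawl(task);    // HTTP mode
        CrawlResultPojo viaSocket = new CrawlerManger(true).crawl(task);   // socket mode
        // Both implementations may return null on failure, so check first.
        System.out.println(viaHttp != null && viaHttp.isSuccess());
        System.out.println(viaSocket != null && viaSocket.isSuccess());
    }
}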


(6) The two crawl modes:


1). Socket mode:

package com.simple.crawImpl;

import com.simple.Icrawl.ICrawl;
import com.simple.Level.TaskLevel;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.io.*;
import java.net.Socket;

/**
 * Implements the ICrawl interface: the socket crawl mode
 * Created by Lewis on 2016/10/15.
 */
public class CrawlerImpl implements ICrawl {

    @Override
    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        // Crawl the URL's content and return the result object.
        CrawlResultPojo crawlResultPojo = new CrawlResultPojo();
        if (urlPojo == null || urlPojo.getUrl() == null) {   // empty task or empty URL
            crawlResultPojo.setPageContent(null);
            crawlResultPojo.setSuccess(false);
            return crawlResultPojo;
        }
        String host = urlPojo.getHost();
        BufferedWriter bw = null;
        BufferedReader br = null;
        Socket socket = null;
        if (host != null) {
            try {
                /*
                 * General steps of socket programming:
                 * (1) create the socket;
                 * (2) open the input/output streams connected to it;
                 * (3) read/write the socket according to some protocol;
                 * (4) close the socket.
                 *
                 * In the Socket constructors, address/host/port identify the
                 * remote end of the connection, stream selects a stream or
                 * datagram socket, localPort is the local port, and
                 * localAddr/bindAddr bind the local (ServerSocket) address.
                 */
                socket = new Socket(host, 80);
                bw = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));
                /*
                 * HTTP/1.1 supports persistent connections, while under
                 * HTTP/1.0 the TCP connection is released after each
                 * request/response pair. With HTTP/1.1 the server keeps the
                 * connection open, so readLine() below would block; the
                 * request is therefore sent as HTTP/1.0.
                 */
                bw.write("GET " + urlPojo.getUrl() + " HTTP/1.0\r\n");
                bw.write("HOST:" + host + "\r\n");
                bw.write("\r\n");   // a bare \r\n marks the end of the request head
                bw.flush();         // flush the buffer
                br = new BufferedReader(new InputStreamReader(socket.getInputStream(), "utf-8"));
                String line = null;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = br.readLine()) != null) {
                    stringBuilder.append(line + "\n");
                }
                crawlResultPojo.setSuccess(true);
                crawlResultPojo.setPageContent(stringBuilder.toString());
                return crawlResultPojo;
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (socket != null) socket.close();   // release resources to prevent leaks
                    if (br != null) br.close();
                    if (bw != null) bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                    System.out.println("Stream shutdown failed");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CrawlerImpl cl = new CrawlerImpl();
        System.out.println(cl.crawl(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH)).getPageContent());
        System.out.println("Done");
    }
}
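
One caveat worth flagging: this implementation opens a plain socket to port 80 and speaks unencrypted HTTP, while the demo URL is an https:// address, so the server will typically answer with a redirect to the HTTPS site rather than the page itself. If you wanted to fetch an HTTPS page over a raw socket, one option (a sketch of my own, not the tutorial's code) is to let SSLSocketFactory perform the TLS handshake and connect to port 443:

import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.*;

// Hypothetical HTTPS variant of the socket fetch, not in the original demo.
public class SslSocketDemo {
    public static void main(String[] args) throws IOException {
        String host = "www.taobao.com";
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, 443);
             BufferedWriter bw = new BufferedWriter(
                     new OutputStreamWriter(socket.getOutputStream()));
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), "utf-8"))) {
            bw.write("GET / HTTP/1.0\r\n");
            bw.write("HOST:" + host + "\r\n");
            bw.write("\r\n");
            bw.flush();
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}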


2). HTTP mode:

package com.simple.crawImpl;

import com.simple.Icrawl.ICrawl;
import com.simple.Level.TaskLevel;
import com.simple.pojos.CrawlResultPojo;
import com.simple.pojos.UrlPojo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;

/**
 * The HTTP crawl mode
 * Created by Lewis on 2016/10/15.
 */
public class HttpUrlConnectionCrawlerImpl implements ICrawl {

    @Override
    public CrawlResultPojo crawl(UrlPojo urlPojo) {
        CrawlResultPojo crawlResultPojo = new CrawlResultPojo();
        if (urlPojo == null || urlPojo.getUrl() == null) {   // empty task or empty URL
            crawlResultPojo.setPageContent(null);
            crawlResultPojo.setSuccess(false);
            return crawlResultPojo;
        }
        HttpURLConnection httpURLConnection = urlPojo.getConnection();

        if (httpURLConnection != null) {
            BufferedReader bufferedReader = null;
            try {
                bufferedReader = new BufferedReader(
                        new InputStreamReader(httpURLConnection.getInputStream(), "utf-8"));
                String line = null;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = bufferedReader.readLine()) != null) {
                    stringBuilder.append(line + "\n");
                }
                crawlResultPojo.setPageContent(stringBuilder.toString());
                crawlResultPojo.setSuccess(true);
                return crawlResultPojo;
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (bufferedReader != null) bufferedReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(new HttpUrlConnectionCrawlerImpl()
                .crawl(new UrlPojo("https://www.taobao.com/", TaskLevel.HIGH)).getPageContent());
        System.out.println("Done");
    }
}
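
Compared with the socket version, HttpURLConnection negotiates TLS for the https URL itself, so this mode retrieves the page directly. One caveat from my side, not from the tutorial: HttpURLConnection follows redirects within the same protocol by default, but not across protocols (for example http to https), so checking the response code can save some confusion:

import com.simple.pojos.UrlPojo;

import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical redirect check, not part of the original implementation.
public class RedirectCheckDemo {
    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = new UrlPojo("http://www.taobao.com/").getConnection();
        if (conn != null) {
            int code = conn.getResponseCode();
            if (code == HttpURLConnection.HTTP_MOVED_PERM
                    || code == HttpURLConnection.HTTP_MOVED_TEMP) {
                // Cross-protocol redirects (http -> https) are not followed
                // automatically, so resolve the Location header by hand.
                System.out.println("Redirected to: " + conn.getHeaderField("Location"));
            }
        }
    }
}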



