An Introduction to the WebMagic Crawler (II): A Complete Example of Crawling the Anime Home (dmzj) Site


(i) Foreword

My previous post explained how to crawl the animation data from a single page. This post focuses on a complete crawler example.

Compared with the previous article, this one extracts more fields, such as the animation type and the Japanese name.

I recommend this blog post on crawling: http://blog.csdn.net/qq598535550/article/details/51287630

I learned from it as well.

The tools used are IntelliJ IDEA, MySQL, WebMagic 0.7.3, and so on.


(ii) Detailed process

The crawled fields are: title, animation type, Japanese name, alias, premiere time, playback status, plot type, original author, director (supervision), production company, official website, short synopsis, full synopsis, score, number of ratings, related URL, and so on.

The page being crawled looks like this:

[Screenshot of the page being crawled]

(1) First, create the MySQL table dmzjanimation

CREATE TABLE `dmzjanimation` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `hahawebname` varchar(127) DEFAULT NULL,
  `antag` varchar(255) DEFAULT NULL,
  `japanname` varchar(255) DEFAULT NULL,
  `allname` varchar(255) DEFAULT NULL,
  `year` varchar(255) DEFAULT NULL,
  `state` varchar(255) DEFAULT NULL,
  `tag` varchar(643) DEFAULT NULL,
  `original` varchar(255) DEFAULT NULL,
  `screenwriter` varchar(255) DEFAULT NULL,
  `company` varchar(255) DEFAULT NULL,
  `website` varchar(511) DEFAULT NULL,
  `content` varchar(2559) DEFAULT NULL,
  `contentdetail` varchar(10240) DEFAULT NULL,
  `goal` varchar(255) DEFAULT NULL,
  `mentotal` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

Viewed in Navicat for MySQL, the table looks like this:


(2) The second step is to write the Java code. It consists mainly of DmzjAnimationProcessor (the core class), DmzjAnimationDao (the class that connects Java to MySQL), and DmzjAnimation (the data entity class).


(2.1) DmzjAnimation, the corresponding entity class

package webmagic.donghua.dmzj.com;

/**
 * Created by Mo
 * on 2017/10/23
 * 12:08.
 */
public class DmzjAnimation {
    private int id;
    private String hahawebname;   // title
    private String antag;         // animation type
    private String japanname;     // Japanese name
    private String allname;       // alias
    private String year;          // premiere time
    private String state;         // playback status
    private String tag;           // plot type
    private String original;      // original author
    private String screenwriter;  // director (supervision)
    private String company;       // production company
    private String website;       // official website
    private String content;       // short synopsis
    private String contentdetail; // full synopsis
    private String goal;          // score
    private String mentotal;      // number of ratings
    private String url;           // related URL

    @Override
    public String toString() {
        return "DmzjAnimation{" +
                "id=" + id +
                ", hahawebname='" + hahawebname + '\'' +
                ", antag='" + antag + '\'' +
                ", japanname='" + japanname + '\'' +
                ", allname='" + allname + '\'' +
                ", year='" + year + '\'' +
                ", state='" + state + '\'' +
                ", tag='" + tag + '\'' +
                ", original='" + original + '\'' +
                ", screenwriter='" + screenwriter + '\'' +
                ", company='" + company + '\'' +
                ", website='" + website + '\'' +
                ", content='" + content + '\'' +
                ", contentdetail='" + contentdetail + '\'' +
                ", goal=" + goal +
                ", mentotal=" + mentotal +
                ", url='" + url + '\'' +
                '}';
    }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getHahawebname() { return hahawebname; }
    public void setHahawebname(String hahawebname) { this.hahawebname = hahawebname; }
    public String getAntag() { return antag; }
    public void setAntag(String antag) { this.antag = antag; }
    public String getJapanname() { return japanname; }
    public void setJapanname(String japanname) { this.japanname = japanname; }
    public String getAllname() { return allname; }
    public void setAllname(String allname) { this.allname = allname; }
    public String getYear() { return year; }
    public void setYear(String year) { this.year = year; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
    public String getTag() { return tag; }
    public void setTag(String tag) { this.tag = tag; }
    public String getOriginal() { return original; }
    public void setOriginal(String original) { this.original = original; }
    public String getScreenwriter() { return screenwriter; }
    public void setScreenwriter(String screenwriter) { this.screenwriter = screenwriter; }
    public String getCompany() { return company; }
    public void setCompany(String company) { this.company = company; }
    public String getWebsite() { return website; }
    public void setWebsite(String website) { this.website = website; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
    public String getContentdetail() { return contentdetail; }
    public void setContentdetail(String contentdetail) { this.contentdetail = contentdetail; }
    public String getGoal() { return goal; }
    public void setGoal(String goal) { this.goal = goal; }
    public String getMentotal() { return mentotal; }
    public void setMentotal(String mentotal) { this.mentotal = mentotal; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}



(2.2) DmzjAnimationProcessor, the crawler logic and core class

package webmagic.donghua.dmzj.com;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;

import java.util.List;

/**
 * Created by Mo
 * on 2017/10/23
 * 12:09.
 */
public class DmzjAnimationProcessor implements PageProcessor {
    int myid = 0;

    // Site-level crawl configuration: character set, crawl interval, retry count, and so on.
    private Site site = Site.me().setRetryTimes(1000).setSleepTime(1000).setCharset("UTF8");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        DmzjAnimation dmzjAnimation = new DmzjAnimation();
        Html html = page.getHtml();
        myid++;
        int id = myid;
        dmzjAnimation.setId(id);

        String hahawebname = html.xpath("//div[@class=\"odd_anim_title_tnew\"]/div[@class=\"tvversion\"]/a/span[@class=\"anim_title_text\"]/h1/text()").get(); // title
        dmzjAnimation.setHahawebname(hahawebname);
        String goal = html.xpath("//div[@class=\"anim_star\"]/ul/li[@id=\"anim_score_info\"]/span[@class=\"points_text\"]/text()").get(); // score
        dmzjAnimation.setGoal(goal);
        String mentotalold = html.xpath("//div[@class=\"anim_star\"]/ul/li[@id=\"score_statistics\"]/span[@id=\"score_count_span\"]/text()").get(); // number of ratings
        // Guard against a missing element before stripping the suffix.
        String mentotal = mentotalold == null ? null : mentotalold.replaceAll("Person score", "");
        dmzjAnimation.setMentotal(mentotal);
        String content = html.xpath("//div[@class=\"odd_anim_title_mnew\"]/p/span[@id=\"gamedescshort\"]/text()").get(); // short synopsis
        dmzjAnimation.setContent(content);
        String contentdetail = html.xpath("//div[@class=\"odd_anim_title_mnew\"]/p/span[@id=\"gamedescall\"]/text()").get(); // full synopsis
        dmzjAnimation.setContentdetail(contentdetail);

        System.out.println("hahawebname: " + hahawebname);
        System.out.println("goal: " + goal);
        System.out.println("mentotal: " + mentotal);
        System.out.println("content: " + content);
        System.out.println("contentdetail: " + contentdetail);

        List<Selectable> nodes = html.xpath("//div[@class=\"anim_attributenew_text\"]/ul/li").nodes();
        // The label and placeholder literals below are English renderings; they must
        // match the text that the page actually displays.
        for (Selectable item : nodes) {
            String tmp = item.get();
            // Animation kind, e.g. theatrical version
            if (tmp.contains("Animation kind")) {
                String antag11 = tmp.replaceAll("</?[^>]+>", ""); // strip HTML tags
                String antag = antag11.replaceAll("Animation kind:", "");
                System.out.println(antag);
                dmzjAnimation.setAntag(antag);
            }
            // Japanese name (may be the "none" placeholder)
            if (tmp.contains("Japanese name")) {
                String japanname11 = tmp.replaceAll("</?[^>]+>", "");
                String japanname = japanname11.replaceAll("Japanese name:", "");
                if (japanname.contains("no")) { japanname = null; }
                System.out.println(japanname);
                dmzjAnimation.setJapanname(japanname);
            }
            // Alias, e.g. The Monkey King
            if (tmp.contains("Alias")) {
                String allname11 = tmp.replaceAll("</?[^>]+>", "");
                String allname = allname11.replaceAll("Alias:", "");
                if (allname.contains("no")) { allname = null; }
                System.out.println(allname);
                dmzjAnimation.setAllname(allname);
            }
            // Premiere time
            if (tmp.contains("Premiere time")) {
                String year11 = tmp.replaceAll("</?[^>]+>", "");
                String year = year11.replaceAll("Premiere time:", "");
                if (year.contains("no")) { year = null; }
                System.out.println(year);
                dmzjAnimation.setYear(year);
            }
            // Playback state
            if (tmp.contains("Playback state")) {
                String state11 = tmp.replaceAll("</?[^>]+>", "");
                String state = state11.replaceAll("Playback state:", "");
                if (state.contains("no")) { state = null; }
                System.out.println(state);
                dmzjAnimation.setState(state);
            }
            // Plot type: join the individual types with "/"
            if (tmp.contains("Plot type")) {
                String tag11 = tmp.replaceAll("</?[^>]+>", "");
                String tag1111 = tag11.replaceAll("Plot type:", "");
                String tag = tag1111.replaceAll(" ", "/");
                System.out.println(tag);
                dmzjAnimation.setTag(tag);
            }
            // Original author
            if (tmp.contains("Original")) {
                String original11 = tmp.replaceAll("</?[^>]+>", "");
                String original = original11.replaceAll("Original:", "");
                if (original.contains("no")) { original = null; }
                System.out.println(original);
                dmzjAnimation.setOriginal(original);
            }
            // Supervision (director), e.g. Wan Ming/Tang Xing
            if (tmp.contains("Supervision")) {
                String screenwriter11 = tmp.replaceAll("</?[^>]+>", "");
                String screenwriter = screenwriter11.replaceAll("Supervision:", "");
                if (screenwriter.contains("no")) { screenwriter = null; }
                dmzjAnimation.setScreenwriter(screenwriter);
                System.out.println(screenwriter);
            }
            // Production company, e.g. Shanghai Art Film Studio
            if (tmp.contains("Production company")) {
                String company11 = tmp.replaceAll("</?[^>]+>", "");
                String company1111 = company11.replaceAll("Production company:", "");
                company1111 = company1111 + "/" + company1111 + "Company";
                String company = company1111;
                if (company.contains("no")) { company = null; }
                System.out.println(company);
                dmzjAnimation.setCompany(company);
            }
            // Official website: keep only what sits between href= and target
            if (tmp.contains("Official website")) {
                String website = tmp.replaceAll(".*?href=|target(.*)", "");
                if (website.contains("no")) { website = null; }
                System.out.println(website);
                dmzjAnimation.setWebsite(website);
            }
        }
        // Use the URL that was actually fetched; a separate counter drifts
        // whenever a page is skipped.
        String url = page.getUrl().get();
        dmzjAnimation.setUrl(url);
        new DmzjAnimationDao().add(dmzjAnimation);
    }

    public static void main(String[] args) {
        int username = 10;
        DmzjAnimationProcessor my = new DmzjAnimationProcessor();
        long startTime, endTime;
        System.out.println("Start crawling ...");
        for (; username <= 15000; username++) {
            startTime = System.currentTimeMillis();
            Spider.create(my).addUrl("http://donghua.dmzj.com/donghua_info/" + username + ".html").thread(5).run();
            endTime = System.currentTimeMillis();
            System.out.println("Crawl finished, took about " + ((endTime - startTime) / 1000) + " seconds");
        }
    }
}
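
Each branch of the loop above repeats the same three steps: strip the HTML tags, remove the label, and treat a placeholder value as missing. A minimal sketch of how those steps could be pulled into one helper follows. The class name `FieldText`, the method `valueOf`, and the sample label and placeholder strings are assumptions for illustration and are not part of the original code. The real labels must match whatever text the site actually shows.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not in the original post): extracts the value of one
// "<li>label value</li>" attribute row using only the JDK.
final class FieldText {
    // Same tag-stripping pattern the processor uses.
    private static final Pattern TAGS = Pattern.compile("</?[^>]+>");

    private FieldText() {
    }

    /**
     * Returns the text that follows label in liHtml, or null when the row
     * does not carry that label or its value equals the placeholder.
     */
    static String valueOf(String liHtml, String label, String placeholder) {
        if (liHtml == null || !liHtml.contains(label)) {
            return null;
        }
        String text = TAGS.matcher(liHtml).replaceAll("");
        int start = text.indexOf(label);
        if (start < 0) {
            return null; // the label only appeared inside a tag
        }
        String value = text.substring(start + label.length()).trim();
        if (value.isEmpty() || value.equals(placeholder)) {
            return null;
        }
        return value;
    }
}
```

A call such as `FieldText.valueOf(tmp, "Alias:", "none")` would then replace one whole branch. The sketch compares the value with the placeholder using `equals` rather than `contains`, so a real title that merely contains the placeholder text is not discarded.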

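
The official-website branch keeps whatever sits between `href=` and `target`, so with the pattern as shown any quotes around the attribute value stay in the result. As an alternative sketch, a capture group can pull out just the attribute value. `HrefExtractor` and the sample markup in the usage note are assumptions for illustration, and the real markup on the page may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in the original post): reads the first quoted href
// value out of an HTML fragment with a capture group.
final class HrefExtractor {
    // Matches href="..." or href='...' and captures the text between the quotes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    private HrefExtractor() {
    }

    /** Returns the first quoted href value in the fragment, or null if there is none. */
    static String firstHref(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = HREF.matcher(html);
        if (!m.find()) {
            return null;
        }
        String value = m.group(2).trim();
        return value.isEmpty() ? null : value;
    }
}
```

For a row like `<li>Official website: <a href="http://example.com/" target="_blank">site</a></li>` the helper returns the bare URL, and for a row without a link it returns null, which also removes the need for a separate placeholder check. Within WebMagic itself, an XPath that ends in `/@href` is another option worth trying.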

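
The `main` method above creates one `Spider` per page, so each run has a single URL and the five threads have nothing to share. A possible alternative is to build the whole URL list first and hand it to one spider, since `Spider.addUrl` accepts several URLs. The sketch below only builds the list with the JDK. `UrlRange` and `detailUrls` are assumed names for illustration.

```java
// Hypothetical helper (not in the original post): builds the detail-page URLs
// for a closed range of ids.
final class UrlRange {
    private static final String PREFIX = "http://donghua.dmzj.com/donghua_info/";

    private UrlRange() {
    }

    /** Returns PREFIX + id + ".html" for every id from first to last, inclusive. */
    static String[] detailUrls(int first, int last) {
        if (last < first) {
            return new String[0];
        }
        String[] urls = new String[last - first + 1];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = PREFIX + (first + i) + ".html";
        }
        return urls;
    }
}
```

With WebMagic on the classpath, the result could be passed along the lines of `Spider.create(processor).addUrl(UrlRange.detailUrls(10, 15000)).thread(5).run();`. In that setup pages no longer arrive in id order, so the page URL is best read from `page.getUrl()` inside `process`, and the id counter would need to be thread-safe (for example an `AtomicInteger`).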

(2.3) DmzjAnimationDao, the class that connects Java to MySQL

package webmagic.donghua.dmzj.com;

import java.sql.*;

/**
 * Created by Mo
 * on 2017/10/23
 * 12:08.
 */
public class DmzjAnimationDao {
    private Connection conn = null;
    private Statement stmt = null;

    public DmzjAnimationDao() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            // "spider" is the database name, followed by the user name, password, and encoding settings
            String url = "jdbc:mysql://localhost:3306/spider?user=root&password=xiemo&useUnicode=true&characterEncoding=UTF8";
            conn = DriverManager.getConnection(url);
            stmt = conn.createStatement();
            // Report success only after the connection has really been opened.
            System.out.println("Connected to the database");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public int add(DmzjAnimation dmzjAnimation) {
        // dmzjanimation is the table name; 17 columns need 17 placeholders
        String sql = "INSERT INTO `spider`.`dmzjanimation` (`id`, `hahawebname`, `antag`, `japanname`, "
                + "`allname`, `year`, `state`, `tag`, `original`, `screenwriter`, `company`, `website`, "
                + "`content`, `contentdetail`, `goal`, `mentotal`, `url`) "
                + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);";
        // try-with-resources closes the statement after each insert
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, dmzjAnimation.getId());
            ps.setString(2, dmzjAnimation.getHahawebname());
            ps.setString(3, dmzjAnimation.getAntag());
            ps.setString(4, dmzjAnimation.getJapanname());
            ps.setString(5, dmzjAnimation.getAllname());
            ps.setString(6, dmzjAnimation.getYear());
            ps.setString(7, dmzjAnimation.getState());
            ps.setString(8, dmzjAnimation.getTag());
            ps.setString(9, dmzjAnimation.getOriginal());
            ps.setString(10, dmzjAnimation.getScreenwriter());
            ps.setString(11, dmzjAnimation.getCompany());
            ps.setString(12, dmzjAnimation.getWebsite());
            ps.setString(13, dmzjAnimation.getContent());
            ps.setString(14, dmzjAnimation.getContentdetail());
            ps.setString(15, dmzjAnimation.getGoal());
            ps.setString(16, dmzjAnimation.getMentotal());
            ps.setString(17, dmzjAnimation.getUrl());
            return ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return -1;
    }
}
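
The `INSERT` statement above lists seventeen columns and needs exactly seventeen `?` placeholders, which is easy to miscount by hand. One way to keep the two in step is to generate the statement from a single column list. This is a sketch under assumptions: `InsertSql`, `build`, and `dmzjColumns` are made-up names, and the identifiers are assumed to be trusted constants rather than user input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper (not in the original post): derives the column list and
// the matching number of "?" placeholders from one source of truth.
final class InsertSql {
    private InsertSql() {
    }

    /** Builds "INSERT INTO `table` (`c1`, `c2`) VALUES (?, ?)" for the given columns. */
    static String build(String table, List<String> columns) {
        if (columns == null || columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringJoiner names = new StringJoiner(", ", "(", ")");
        StringJoiner marks = new StringJoiner(", ", "(", ")");
        for (String column : columns) {
            names.add("`" + column + "`");
            marks.add("?");
        }
        return "INSERT INTO `" + table + "` " + names + " VALUES " + marks;
    }

    /** The seventeen columns of the dmzjanimation table, in insert order. */
    static List<String> dmzjColumns() {
        return Arrays.asList("id", "hahawebname", "antag", "japanname", "allname",
                "year", "state", "tag", "original", "screenwriter", "company",
                "website", "content", "contentdetail", "goal", "mentotal", "url");
    }
}
```

The generated string can be passed to `conn.prepareStatement(...)` in place of the hand-written literal. The parameter indexes used with `setInt` and `setString` still have to follow the same column order.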



(2.4) Results

In IntelliJ IDEA:



