Crawler code implementation, part three: connecting the crawler project's download, parse, and store flow

Source: Internet
Author: User

1. Create a new storage interface, IStoreService

package com.dajiangtai.djt_spider.service;

import com.dajiangtai.djt_spider.entity.Page;

/**
 * Data storage interface
 * @author Administrator
 */
public interface IStoreService {
    public void store(Page page);
}

2. Create an implementation class of the storage interface, ConsoleStoreService (it prints the stored data to the console)

package com.dajiangtai.djt_spider.service.impl;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IStoreService;

public class ConsoleStoreService implements IStoreService {

    // Print the parsed values held by the page to the console
    public void store(Page page) {
        System.out.println("Total number of plays: " + page.getAllnumber());
        System.out.println("Number of comments: " + page.getCommentnumber());
        System.out.println("Likes: " + page.getSupportnumber());
    }

}
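Because callers depend only on the storage interface, the console implementation can later be replaced without touching the rest of the crawler. The following is a minimal sketch of that idea, not part of the original project: InMemoryStoreService is a hypothetical implementation, and Page here is a simplified stand-in with a single field.

```java
import java.util.ArrayList;
import java.util.List;

public class StoreSketch {

    // Simplified stand-in for the project's Page entity (assumption: one field only).
    static class Page {
        private String allnumber;
        public String getAllnumber() { return allnumber; }
        public void setAllnumber(String allnumber) { this.allnumber = allnumber; }
    }

    // Same shape as the storage interface defined in step 1.
    interface IStoreService {
        void store(Page page);
    }

    // Hypothetical alternative implementation: keeps pages in a list instead of printing them.
    static class InMemoryStoreService implements IStoreService {
        private final List<Page> pages = new ArrayList<>();

        @Override
        public void store(Page page) {
            pages.add(page);
        }

        public List<Page> getPages() {
            return pages;
        }
    }

    // Stores one page through the interface and returns how many pages the store now holds.
    static int storeOne(String allnumber) {
        InMemoryStoreService memory = new InMemoryStoreService();
        IStoreService service = memory; // the caller sees only the interface
        Page page = new Page();
        page.setAllnumber(allnumber);
        service.store(page);
        return memory.getPages().size();
    }

    public static void main(String[] args) {
        System.out.println(storeOne("17,015,726,387"));
    }
}
```

The calling code would only need a different argument to its setter to switch from one implementation to the other.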

3. Refactor StartDSJCount:

1. Add the storeService property and its get/set methods

2. Call dsj.setStoreService(new ConsoleStoreService()); to supply an implementation of the interface

3. Add a method that stores the page information: public void storePageInfo(Page page)

4. Test whether the storePageInfo(Page page) method works.

package com.dajiangtai.djt_spider.start;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IDownLoadService;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.service.IStoreService;
import com.dajiangtai.djt_spider.service.impl.ConsoleStoreService;
import com.dajiangtai.djt_spider.service.impl.HttpClientDownLoadService;
import com.dajiangtai.djt_spider.service.impl.YOUKUProcessService;

/**
 * Entry class for the TV series crawler
 * @author Administrator
 */
public class StartDSJCount {

    // Page download interface
    private IDownLoadService downLoadService;
    // Page parsing interface
    private IProcessService processService;
    // Data storage interface
    private IStoreService storeService;

    public static void main(String[] args) {
        StartDSJCount dsj = new StartDSJCount();
        dsj.setDownLoadService(new HttpClientDownLoadService());
        dsj.setProcessService(new YOUKUProcessService());
        dsj.setStoreService(new ConsoleStoreService());
        String url = "http://list.youku.com/show/id_z9cd2277647d311e5b692.html?spm=a2h0j.8191423.sMain.5~5~A!2.iCUyO9";
        // Download the page
        Page page = dsj.downloadPage(url);
        // Parse the page
        dsj.processPage(page);
        // Store the page information
        dsj.storePageInfo(page);
    }


    // Download the page
    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    // Parse the page
    public void processPage(Page page) {
        this.processService.process(page);
    }

    // Store the page information
    public void storePageInfo(Page page) {
        this.storeService.store(page);
    }

    public IDownLoadService getDownLoadService() {
        return downLoadService;
    }

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public IProcessService getProcessService() {
        return processService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }

    public IStoreService getStoreService() {
        return storeService;
    }

    public void setStoreService(IStoreService storeService) {
        this.storeService = storeService;
    }

}

Console output:

Total number of plays: null
Number of comments: null
Likes: null

Why are all three values null? Because the corresponding properties of the page object were never set: the process method of YOUKUProcessService parses the values but does not save them anywhere. The process method therefore needs to be refactored so that each value it parses is stored in the matching property of the page. The code is as follows:

package com.dajiangtai.djt_spider.service.impl;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.util.HtmlUtil;
import com.dajiangtai.djt_spider.util.LoadPropertyUtil;
import com.dajiangtai.djt_spider.util.RegexUtil;

public class YOUKUProcessService implements IProcessService {

    // The page text reads, for example, "total Played: 16,931,628,832", so regular expressions are used to extract the numbers.
    // The label inside each lookbehind must match the text of the page exactly.
    private String allnumberRegex = "(?<=total Played:)[\\d,]+";
    private String commentnumberRegex = "(?<=comment:)[\\d,]+";
    private String supportnumberRegex = "(?<=top:)[\\d,]+";
    // XPath expressions that locate each field
    private String parseAllNumber = "/body/div/div/div/div/div/ul/li[11]";
    private String parseCommentNumber = "//div[@class=\"p-base\"]/ul/li[12]";
    private String parseSupportNumber = "//div[@class=\"p-base\"]/ul/li[13]";

    public void process(Page page) {
        String content = page.getContent();
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        // Use HtmlCleaner to parse the web page and get the root node
        TagNode rootNode = htmlCleaner.clean(content);

        try {
            // /html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]
            // Adjust the XPath above until it is valid; if it is not, debug mode shows that evaluateXPath returns []
            // Total number of plays
            // String allnumber = HtmlUtil.getFieldByRegex(rootNode, parseAllNumber, allnumberRegex);
            String allnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseAllNumber"), LoadPropertyUtil.getYOUKY("allnumberRegex"));
            System.out.println("Total number of plays: " + allnumber);
            page.setAllnumber(allnumber);

            // Total number of comments
            String commentnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseCommentNumber"), LoadPropertyUtil.getYOUKY("commentnumberRegex"));
            System.out.println("Total number of comments: " + commentnumber);
            page.setCommentnumber(commentnumber);

            // Total number of likes
            String supportnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseSupportNumber"), LoadPropertyUtil.getYOUKY("supportnumberRegex"));
            System.out.println("Total number of likes: " + supportnumber);
            page.setSupportnumber(supportnumber);

            // Fields not parsed from this page default to "0"
            page.setDaynumber("0");
            page.setAgainstnumber("0");
            page.setCollectnumber("0");

        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
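The regular expressions in this class rely on a lookbehind: (?<=label) requires the label to appear immediately before the match without making it part of the match, so only the digits and commas are returned. A minimal sketch using only java.util.regex follows; the sample text and the firstMatch helper are illustrative assumptions, not the project's HtmlUtil.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindSketch {

    // Returns the first match of regex in text, or null when nothing matches.
    static String firstMatch(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample text; the real label depends on the page being crawled.
        String text = "Total plays:16,931,628,832";
        // The lookbehind anchors on the label; [\d,]+ then captures digits and commas.
        System.out.println(firstMatch(text, "(?<=Total plays:)[\\d,]+")); // prints 16,931,628,832
    }
}
```

If the label in the lookbehind differs from the page text by even one character (including whitespace), find() fails and the helper returns null, which is one way a field can end up empty.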

Refactor ConsoleStoreService so that it also prints the new fields:

package com.dajiangtai.djt_spider.service.impl;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IStoreService;

public class ConsoleStoreService implements IStoreService {

    // Print every field of the page to the console
    public void store(Page page) {
        System.out.println("Total number of plays: " + page.getAllnumber());
        System.out.println("Number of comments: " + page.getCommentnumber());
        System.out.println("Likes: " + page.getSupportnumber());
        System.out.println("Dislikes: " + page.getAgainstnumber());
        System.out.println("Favorites: " + page.getCollectnumber());
        System.out.println("Daily play increment: " + page.getDaynumber());
    }

}

To test, run the main method of StartDSJCount. Console output:

Total number of plays: 17,015,726,387
Number of comments: 1,256,223
Likes: 13,835,376
Dislikes: 0
Favorites: 0
Daily play increment: 0
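The download, parse, store flow exercised above can be condensed into a self-contained sketch. It is a simplified illustration under stated assumptions rather than the project's code: Page is a two-field stand-in, the downloader returns canned text instead of making an HTTP request, and the parserStoresResult flag exists only to contrast a parser that discards its result with one that saves it into the page.

```java
public class PipelineSketch {

    // Simplified stand-in for the project's Page entity.
    static class Page {
        String content;
        String allnumber;
    }

    // Same roles as the project's three service interfaces.
    interface IDownLoadService { Page download(String url); }
    interface IProcessService { void process(Page page); }
    interface IStoreService { void store(Page page); }

    // Runs download, process, store in order and returns the resulting page.
    static Page demo(boolean parserStoresResult) {
        // Stub downloader: returns canned content instead of fetching the URL.
        IDownLoadService download = url -> {
            Page page = new Page();
            page.content = "Total plays:17,015,726,387";
            return page;
        };
        // Stub parser: extracts the value and, optionally, saves it into the page.
        IProcessService process = page -> {
            String value = page.content.substring(page.content.indexOf(':') + 1);
            if (parserStoresResult) {
                page.allnumber = value;
            }
        };
        // Console store: prints whatever the page currently holds.
        IStoreService store = page ->
                System.out.println("Total number of plays: " + page.allnumber);

        Page result = download.download("http://example.invalid/show"); // placeholder URL
        process.process(result);
        store.store(result);
        return result;
    }

    public static void main(String[] args) {
        demo(false); // prints "Total number of plays: null"
        demo(true);  // prints "Total number of plays: 17,015,726,387"
    }
}
```

The first call reproduces the null output seen earlier: the store stage can only print what the parse stage has written into the page.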
