1. Create a new storage interface, IStoreService
package com.dajiangtai.djt_spider.service;

import com.dajiangtai.djt_spider.entity.Page;

/**
 * Data storage interface
 * @author Administrator
 *
 */
public interface IStoreService {
    public void store(Page page);
}
2. Create an implementation class of the storage interface, ConsoleStoreService (prints the stored data to the console)
package com.dajiangtai.djt_spider.service.impl;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IStoreService;

public class ConsoleStoreService implements IStoreService {
    public void store(Page page) {
        System.out.println("Total number of Plays:" + page.getAllnumber());
        System.out.println("Number of comments:" + page.getCommentnumber());
        System.out.println("Likes:" + page.getSupportnumber());
    }
}
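The reason for putting storage behind an interface is that the caller never needs to know which implementation it is talking to. The following is a minimal, self-contained sketch of that idea, not the project's code: the nested Page is a cut-down stand-in for the real entity, and MemoryStoreService and save are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: everything in this class is a simplified stand-in for the project's types.
class StoreSketch {

    // Cut-down Page with a single field (assumption: the real entity has more).
    static class Page {
        private String allnumber;
        String getAllnumber() { return allnumber; }
        void setAllnumber(String allnumber) { this.allnumber = allnumber; }
    }

    // Same shape as the storage interface in this step.
    interface IStoreService {
        void store(Page page);
    }

    // Hypothetical alternative to ConsoleStoreService: keeps the pages in memory.
    static class MemoryStoreService implements IStoreService {
        private final List<Page> pages = new ArrayList<>();

        @Override
        public void store(Page page) {
            pages.add(page);
        }

        List<Page> getPages() {
            return pages;
        }
    }

    // The caller depends only on the interface, so any implementation can be passed in.
    static void save(IStoreService storeService, Page page) {
        storeService.store(page);
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.setAllnumber("16,931,628,832");

        MemoryStoreService memory = new MemoryStoreService();
        save(memory, page);
        System.out.println(memory.getPages().size());
    }
}
```

Swapping the console implementation for another one then only requires changing the object passed to the setter; the code that calls store stays the same.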
3. Refactor StartDSJCount:
1. Add the storeService property and generate its get/set methods
2. Instantiate the interface: dsj.setStoreService(new ConsoleStoreService());
3. Add a method that stores the page information: public void storePageInfo(Page page)
4. Test whether the storePageInfo(Page page) method works.
package com.dajiangtai.djt_spider.start;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IDownLoadService;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.service.IStoreService;
import com.dajiangtai.djt_spider.service.impl.ConsoleStoreService;
import com.dajiangtai.djt_spider.service.impl.HttpClientDownLoadService;
import com.dajiangtai.djt_spider.service.impl.YOUKUProcessService;

/**
 * Entry class of the TV series crawler
 * @author Administrator
 *
 */
public class StartDSJCount {
    // Page download interface
    private IDownLoadService downLoadService;
    // Page parsing interface
    private IProcessService processService;
    // Data storage interface
    private IStoreService storeService;

    public static void main(String[] args) {
        StartDSJCount dsj = new StartDSJCount();
        dsj.setDownLoadService(new HttpClientDownLoadService());
        dsj.setProcessService(new YOUKUProcessService());
        dsj.setStoreService(new ConsoleStoreService());
        String url = "http://list.youku.com/show/id_z9cd2277647d311e5b692.html?spm=a2h0j.8191423.sMain.5~5~A!2.iCUyO9";
        // Download the page
        Page page = dsj.downloadPage(url);
        // Parse the page
        dsj.processPage(page);
        // Store the page information
        dsj.storePageInfo(page);
    }

    // Downloads the page
    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    // Parses the page
    public void processPage(Page page) {
        this.processService.process(page);
    }

    // Stores the page information
    public void storePageInfo(Page page) {
        this.storeService.store(page);
    }

    public IDownLoadService getDownLoadService() {
        return downLoadService;
    }

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public IProcessService getProcessService() {
        return processService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }

    public IStoreService getStoreService() {
        return storeService;
    }

    public void setStoreService(IStoreService storeService) {
        this.storeService = storeService;
    }
}
Console output:
Total number of Plays:null
Number of comments:null
Likes:null
Why are all three values null? The only explanation is that the corresponding properties of the page object were never set. The process method of YOUKUProcessService therefore needs to be refactored so that the information it parses is stored in the properties of the page. The code is as follows:
package com.dajiangtai.djt_spider.service.impl;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.util.HtmlUtil;
import com.dajiangtai.djt_spider.util.LoadPropertyUtil;
import com.dajiangtai.djt_spider.util.RegexUtil;

public class YOUKUProcessService implements IProcessService {
    // The page shows the total number of plays as, for example, 16,931,628,832, so use regular expressions to extract the numbers.
    // The label inside each lookbehind must match the text that precedes the number on the page.
    private String allnumberRegex = "(?<=total Played:)[\\d,]+";
    private String commentnumberRegex = "(?<=comment:)[\\d,]+";
    private String supportnumberRegex = "(?<=top:)[\\d,]+";
    //
    private String parseAllNumber = "/body/div/div/div/div/div/ul/li[11]";
    private String parseCommentNumber = "//div[@class=\"p-base\"]/ul/li[12]";
    private String parseSupportNumber = "//div[@class=\"p-base\"]/ul/li[13]";

    public void process(Page page) {
        String content = page.getContent();
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        // Use HtmlCleaner to parse the web page and get the root node
        TagNode rootNode = htmlCleaner.clean(content);
        try {
            // /html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]
            // Adjust the XPath until it is valid; if it is not, debug mode shows that evaluateXPath returns []
            // Total number of plays
            // String allnumber = HtmlUtil.getFieldByRegex(rootNode, parseAllNumber, allnumberRegex);
            String allnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseAllNumber"), LoadPropertyUtil.getYOUKY("allnumberRegex"));
            System.out.println("Total number of Plays:" + allnumber);
            page.setAllnumber(allnumber);

            // Number of comments
            String commentnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseCommentNumber"), LoadPropertyUtil.getYOUKY("commentnumberRegex"));
            System.out.println("Total number of comments:" + commentnumber);
            page.setCommentnumber(commentnumber);

            // Number of likes
            String supportnumber = HtmlUtil.getFieldByRegex(rootNode, LoadPropertyUtil.getYOUKY("parseSupportNumber"), LoadPropertyUtil.getYOUKY("supportnumberRegex"));
            System.out.println("Total number of likes:" + supportnumber);
            page.setSupportnumber(supportnumber);

            // These values are not parsed from the page yet, so default them to "0"
            page.setDaynumber("0");
            page.setAgainstnumber("0");
            page.setCollectnumber("0");
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
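The regular expressions in this class rely on a lookbehind: (?<=label) asserts that the label comes immediately before the match without becoming part of it, so only the digits and commas are returned. The following is a minimal sketch using only java.util.regex; LookbehindSketch and firstMatch are hypothetical names, and the sample text is an assumption standing in for the text of the list item that the XPath selects.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: shows how a lookbehind pattern isolates the number after a label.
class LookbehindSketch {

    // Returns the first match of regex in text, or null when nothing matches.
    static String firstMatch(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample text; the real text comes from the parsed page.
        String text = "total Played:16,931,628,832";

        // The label is required before the match but is not included in it.
        System.out.println(firstMatch(text, "(?<=total Played:)[\\d,]+"));

        // A label that does not occur in the text yields no match at all.
        System.out.println(firstMatch(text, "(?<=comment:)[\\d,]+"));
    }
}
```

The second call prints null, which is one way a field can end up empty: if the label in the pattern differs from the text on the page, nothing is extracted.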
Refactor ConsoleStoreService:
package com.dajiangtai.djt_spider.service.impl;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IStoreService;

public class ConsoleStoreService implements IStoreService {
    public void store(Page page) {
        System.out.println("Total number of Plays:" + page.getAllnumber());
        System.out.println("Number of comments:" + page.getCommentnumber());
        System.out.println("Likes:" + page.getSupportnumber());
        System.out.println("Dislikes:" + page.getAgainstnumber());
        System.out.println("Favorites:" + page.getCollectnumber());
        System.out.println("Daily play Increment:" + page.getDaynumber());
    }
}
Test: run the main method of StartDSJCount. Console output:
Total number of Plays:17,015,726,387
Number of comments:1,256,223
Likes:13,835,376
Dislikes:0
Favorites:0
Daily play Increment:0
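The difference between the two runs comes down to whether the parsing step writes its results back to the page object before the storage step reads them. The following is a minimal, self-contained sketch of that behaviour, not the project's code: NullDemo, its nested Page, and the two process methods are hypothetical stand-ins.

```java
// Sketch only: a cut-down Page and two parsers, one that forgets to store its result.
class NullDemo {

    static class Page {
        private String content;
        private String allnumber; // stays null until a parser calls setAllnumber

        String getContent() { return content; }
        void setContent(String content) { this.content = content; }
        String getAllnumber() { return allnumber; }
        void setAllnumber(String allnumber) { this.allnumber = allnumber; }
    }

    // Extracts a value but only keeps it in a local variable, so the page is unchanged.
    static void processWithoutStoring(Page page) {
        String allnumber = page.getContent().trim();
    }

    // Extracts the same value and writes it back to the page.
    static void processAndStore(Page page) {
        page.setAllnumber(page.getContent().trim());
    }

    // Builds the same kind of line that the console store prints.
    static String describe(Page page) {
        return "Total number of Plays:" + page.getAllnumber();
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.setContent(" 16,931,628,832 ");

        processWithoutStoring(page);
        System.out.println(describe(page)); // the field was never set

        processAndStore(page);
        System.out.println(describe(page)); // the field now holds the parsed value
    }
}
```

Concatenating a null reference with a string produces the text "null", which is why the first run printed null rather than failing.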
Crawler code implementation, part three: connecting the download, parsing, and storage stages of the crawler project