What if an XPath does not work? The XPath was obtained by pressing F12, locating the child node in the browser's inspector, and copying its XPath, so in theory it should be correct. Why, then, does it fail once it is put into the code?
Premise: the work in "Youku TV series crawler, code implementation part 1: downloading and parsing video site pages (2)" has been completed. This post builds on that foundation and refines the code further.
1. Create the page-parsing interface.
package com.dajiangtai.djt_spider.service;

import com.dajiangtai.djt_spider.entity.Page;

/**
 * Page parsing interface
 * @author Administrator
 *
 */
public interface IProcessService {
    public void process(Page page);
}
2. Create the page-parsing implementation class.
package com.dajiangtai.djt_spider.service.impl;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.util.HtmlUtil;
import com.dajiangtai.djt_spider.util.LoadPropertyUtil;
import com.dajiangtai.djt_spider.util.RegexUtil;

/**
 * Youku page parsing implementation class
 * @author Administrator
 *
 */
public class YOUKUProcessService implements IProcessService {
    // XPath of the total play count
    private String parseAllNumber = "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]";

    public void process(Page page) {
        String content = page.getContent();
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        // Use HtmlCleaner to parse the web page and get the root node
        TagNode rootNode = htmlCleaner.clean(content);
        try {
            Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber);
            if (evaluateXPath.length > 0) {
                // Locate the child node via the XPath and print its text
                TagNode node = (TagNode) evaluateXPath[0];
                System.out.println(node.getText().toString());
            }
        } catch (XPatherException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
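The implementation class imports Matcher and Pattern but only prints the raw node text. As a possible next step, here is a minimal sketch of pulling the number out of a string such as "Total Played: 16,960,061,208". The class name PlayCountExtractor and the method extractCount are hypothetical and not part of the original project, whose RegexUtil may well do this differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
public class PlayCountExtractor {
    // Matches a run of digits with optional thousands separators, e.g. 16,960,061,208.
    private static final Pattern NUMBER = Pattern.compile("\\d[\\d,]*");

    /** Returns the first number found in the text as a long, or -1 if there is none. */
    public static long extractCount(String nodeText) {
        Matcher matcher = NUMBER.matcher(nodeText);
        if (!matcher.find()) {
            return -1L;
        }
        // Remove the separators before converting to a number.
        return Long.parseLong(matcher.group().replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(extractCount("Total Played: 16,960,061,208"));
    }
}
```

Returning -1 for "no number found" is just one simple convention; an exception or an Optional would work equally well.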
3. Create the TV series crawler entry class StartDSJCount. Declare downLoadService and processService fields, generate their get/set methods, and instantiate the two interfaces through the set methods. Wrap the page download and page parsing work from the earlier steps in methods, then test the page parsing again.
package com.dajiangtai.djt_spider.start;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IDownLoadService;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.service.IStoreService;
import com.dajiangtai.djt_spider.service.impl.ConsoleStoreService;
import com.dajiangtai.djt_spider.service.impl.HttpClientDownLoadService;
import com.dajiangtai.djt_spider.service.impl.YOUKUProcessService;

/**
 * TV series crawler entry class
 * @author Administrator
 *
 */
public class StartDSJCount {
    // Page download interface
    private IDownLoadService downLoadService;
    // Page parsing interface
    private IProcessService processService;

    public static void main(String[] args) {
        StartDSJCount dsj = new StartDSJCount();
        // HttpClientDownLoadService implements the IDownLoadService interface
        dsj.setDownLoadService(new HttpClientDownLoadService());
        // YOUKUProcessService implements the IProcessService interface
        dsj.setProcessService(new YOUKUProcessService());
        String url = "http://list.youku.com/show/id_z9cd2277647d311e5b692.html?spm=a2h0j.8191423.sMain.5~5~A!2.iCUyO9";
        // Download the page
        Page page = dsj.downloadPage(url);
        // Parse the page
        dsj.processPage(page); // test
    }

    // Page download method
    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    // Page parsing method
    public void processPage(Page page) {
        this.processService.process(page);
    }

    public IDownLoadService getDownLoadService() {
        return downLoadService;
    }

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public IProcessService getProcessService() {
        return processService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }
}
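Because the entry class receives both services through set methods, the parsing step can be exercised without touching the network. The following is only a rough sketch of that idea: WiringSketch, its nested Page class, and the DownloadService and ProcessService interfaces are hypothetical stand-ins for the project's own types, and the URL is a placeholder.

```java
// Hypothetical, self-contained stand-in for the entry class and its collaborators.
public class WiringSketch {
    /** Stand-in for the project's Page entity; only the content is modelled. */
    public static class Page {
        private final String content;
        public Page(String content) { this.content = content; }
        public String getContent() { return content; }
    }

    /** Stand-in for the download interface. */
    public interface DownloadService { Page download(String url); }

    /** Stand-in for the parsing interface. */
    public interface ProcessService { void process(Page page); }

    private DownloadService downloadService;
    private ProcessService processService;

    public void setDownloadService(DownloadService service) { this.downloadService = service; }
    public void setProcessService(ProcessService service) { this.processService = service; }

    public Page downloadPage(String url) { return downloadService.download(url); }
    public void processPage(Page page) { processService.process(page); }

    public static void main(String[] args) {
        WiringSketch crawler = new WiringSketch();
        // Stub downloader: returns fixed HTML instead of making an HTTP request.
        crawler.setDownloadService(url -> new Page("<html><body><p>stub for " + url + "</p></body></html>"));
        // Recording parser: collects whatever content it is handed.
        StringBuilder seen = new StringBuilder();
        crawler.setProcessService(page -> seen.append(page.getContent()));
        crawler.processPage(crawler.downloadPage("http://example.invalid/show"));
        System.out.println(seen);
    }
}
```

With a stub downloader that returns a saved copy of the page, the same XPath can be retried repeatedly while debugging, without re-downloading.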
4. Run the main method. If parsing works, the console should print the field highlighted in red on the page, that is: Total Played: 16,960,061,208
However, the console is empty:
In other words, the XPath is invalid and parsing fails. Stepping through the code in the debugger shows that after Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber); the value of evaluateXPath is []. What causes the parsing to fail? The XPath in question is "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]", which is clearly an absolute path. I have not pinned down exactly why it fails, so for now I attribute it to the use of an absolute path. From my own experience, there are currently two solutions:
1. If the copied XPath does not parse correctly, the simplest fix is to keep only the index on the last tag, delete all the other indexes, and start the path from body.
2. Rewrite it as a relative path: "//div[@class=\"p-base\"]/ul/li[11]". This follows the XPath usage described in the blog post at http://www.cnblogs.com/miercler/p/5599465.html.
Either of these approaches solves the invalid XPath problem!
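To make the two rewrites concrete, here is a small sketch. One plausible reason a copied absolute path breaks (an assumption, not something verified in this post) is that the tree shown in F12 differs from the tree the parser builds from the downloaded HTML. The sketch uses the JDK's built-in javax.xml.xpath as a stand-in for HtmlCleaner, which behaves differently in places, so the "/body/..." form from option 1 is only generated, not evaluated. The class XPathRewriteSketch, its methods, the sample markup, and the "copied" path in main are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; uses the JDK XPath engine, not HtmlCleaner.
public class XPathRewriteSketch {
    /** Option 1: drop every index except the one on the last step, and start from body. */
    public static String keepLastIndexOnly(String absoluteXPath) {
        int lastSlash = absoluteXPath.lastIndexOf('/');
        String head = absoluteXPath.substring(0, lastSlash).replaceAll("\\[\\d+\\]", "");
        String tail = absoluteXPath.substring(lastSlash);
        if (head.startsWith("/html/")) {
            head = head.substring("/html".length());
        }
        return head + tail;
    }

    /** Counts the nodes an XPath selects in a well-formed XML/XHTML string. */
    public static int countMatches(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath evaluator = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) evaluator.evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the HTML the crawler actually downloads.
        String served = "<html><body><div class=\"p-base\"><ul><li>a</li><li>b</li></ul></div></body></html>";
        // Made-up path "copied from the browser", where a script added an extra div.
        String copied = "/html/body/div[2]/ul/li[2]";
        System.out.println("absolute matches: " + countMatches(served, copied));
        // Option 2: anchor on the class attribute instead of positions.
        System.out.println("relative matches: " + countMatches(served, "//div[@class='p-base']/ul/li[2]"));
        // Option 1 applied to the path used in this post.
        System.out.println(keepLastIndexOnly("/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]"));
    }
}
```

The attribute-based path keeps working as long as the class name stays the same, which is why it tends to survive layout changes better than a chain of positional indexes.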
Youku TV series crawler, code implementation part 1: downloading and parsing video site pages (3). Supplementary note: what to do when an XPath is invalid.