Youku TV Series Crawler, Code Implementation Part 1: Downloading and Parsing Video Site Pages (3). Supplementary Knowledge: What to Do When an XPath Fails

Source: Internet
Author: User
Tags xpath

What should you do when an XPath does not work? The XPath was obtained by pressing F12, locating the child node, and copying its XPath, so in theory it is correct. Why does it fail once it is put into the code?

Prerequisite: the work in "Youku TV series crawler, code implementation part 1: downloading and parsing video site pages (2)" has been completed. This part builds on that foundation and refines the code further.

1. Create a new page parsing interface.

package com.dajiangtai.djt_spider.service;

import com.dajiangtai.djt_spider.entity.Page;

/**
 * Page parsing interface
 * @author Administrator
 *
 */
public interface IProcessService {

    public void process(Page page);
}

2. Create the page parsing implementation class.

package com.dajiangtai.djt_spider.service.impl;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.util.HtmlUtil;
import com.dajiangtai.djt_spider.util.LoadPropertyUtil;
import com.dajiangtai.djt_spider.util.RegexUtil;

/**
 * Youku page parsing implementation class
 * @author Administrator
 *
 */
public class YOUKUProcessService implements IProcessService {

    // Total play count:
    private String parseAllNumber = "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]";

    public void process(Page page) {

        String content = page.getContent();
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        // Use HtmlCleaner to parse the web page and get the root node
        TagNode rootNode = htmlCleaner.clean(content);
        try {
            Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber);
            if (evaluateXPath.length > 0) {

                // Locate the child node through the XPath and print its text
                TagNode node = (TagNode) evaluateXPath[0];
                System.out.println(node.getText().toString());
            }
        } catch (XPatherException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

}

3. Create the TV series crawler entry class StartDSJCount. Define the downLoadService and processService fields, generate their get/set methods, and instantiate the two interfaces through the set methods. Reuse the previous work to wrap the page download and parsing methods, then test page parsing again.

package com.dajiangtai.djt_spider.start;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IDownLoadService;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.service.IStoreService;
import com.dajiangtai.djt_spider.service.impl.ConsoleStoreService;
import com.dajiangtai.djt_spider.service.impl.HttpClientDownLoadService;
import com.dajiangtai.djt_spider.service.impl.YOUKUProcessService;

/**
 * TV series crawler entry class
 * @author Administrator
 *
 */
public class StartDSJCount {

    // Page download interface
    private IDownLoadService downLoadService;
    // Page parsing interface
    private IProcessService processService;

    public static void main(String[] args) {
        StartDSJCount dsj = new StartDSJCount();

        // HttpClientDownLoadService implements the IDownLoadService interface
        dsj.setDownLoadService(new HttpClientDownLoadService());

        // YOUKUProcessService implements the IProcessService interface
        dsj.setProcessService(new YOUKUProcessService());
        String url = "http://list.youku.com/show/id_z9cd2277647d311e5b692.html?spm=a2h0j.8191423.sMain.5~5~A!2.iCUyO9";
        // Download the page
        Page page = dsj.downloadPage(url);
        // Parse the page
        dsj.processPage(page); // test

    }

    // Page download method
    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    // Page parsing method
    public void processPage(Page page) {
        this.processService.process(page);
    }


    public IDownLoadService getDownLoadService() {
        return downLoadService;
    }

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public IProcessService getProcessService() {
        return processService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }


}
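Because the entry class receives both services through setters, the download-then-parse flow can also be exercised without touching the network. The following is only a minimal sketch under stated assumptions: the nested Page, IDownLoadService and IProcessService types are simplified stand-ins for the tutorial's classes (their real shapes may differ), and SetterWiringSketch, runWithStubs and the example.invalid URL are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SetterWiringSketch {

    // Simplified stand-ins for the tutorial's types (assumed shapes, not the real classes).
    static class Page {
        private String content;
        public String getContent() { return content; }
        public void setContent(String content) { this.content = content; }
    }

    interface IDownLoadService { Page download(String url); }

    interface IProcessService { void process(Page page); }

    private IDownLoadService downLoadService;
    private IProcessService processService;

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }

    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    public void processPage(Page page) {
        this.processService.process(page);
    }

    // Runs the download-then-parse flow with stub services and
    // returns the page contents that the stub parser received.
    static List<String> runWithStubs() {
        List<String> seen = new ArrayList<>();
        SetterWiringSketch spider = new SetterWiringSketch();

        // Stub downloader: returns canned HTML instead of making an HTTP request.
        spider.setDownLoadService(url -> {
            Page page = new Page();
            page.setContent("<html><body>stub for " + url + "</body></html>");
            return page;
        });

        // Stub parser: records the content it was handed.
        spider.setProcessService(page -> seen.add(page.getContent()));

        Page page = spider.downloadPage("http://example.invalid/show");
        spider.processPage(page);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runWithStubs());
    }
}
```

Swapping the stub downloader for a real implementation requires no change to the entry class, which is the point of injecting the services through the set methods.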

4. Run the main method. If everything is correct, the output should be the field marked in red on the page: Total Played: 16,960,061,208

However, the console is empty:

That is, the XPath matched nothing and parsing failed. Stepping through the code in the debugger shows that after Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber), the value of evaluateXPath is []. What causes the parsing to fail? The XPath in question is "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]", which is clearly an absolute path. I do not know exactly why it fails, so for now I attribute the failure to its being an absolute path. From my own experience, there are currently two solutions:

1. If the copied XPath does not match, the simplest fix is to keep the index only on the last tag, delete all the other indices, and start the path from body.

2. Rewrite it as a relative path: "//div[@class=\"p-base\"]/ul/li[11]". This XPath follows the one in the blog post at http://www.cnblogs.com/miercler/p/5599465.html.
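The rewrite described in the first solution can be done mechanically. The sketch below is one possible reading of it, not the author's code: XPathIndexStripper and keepOnlyLastIndex are hypothetical names, and interpreting "start from body" as a leading "//body" is an assumption. It only transforms the string; whether the result matches still depends on the page and on the XPath engine in use.

```java
class XPathIndexStripper {

    // Removes every positional index except the one on the last step,
    // and anchors the result at <body> instead of the document root.
    static String keepOnlyLastIndex(String absoluteXPath) {
        int lastSlash = absoluteXPath.lastIndexOf('/');
        if (lastSlash < 0) {
            return absoluteXPath; // a single step; nothing to rewrite
        }
        // Strip "[n]" from every step before the last one.
        String head = absoluteXPath.substring(0, lastSlash).replaceAll("\\[\\d+\\]", "");
        String lastStep = absoluteXPath.substring(lastSlash);

        // Assumption: "start from body" means searching for body with "//body".
        int bodyPos = head.indexOf("/body");
        if (bodyPos >= 0) {
            head = "/" + head.substring(bodyPos);
        }
        return head + lastStep;
    }

    public static void main(String[] args) {
        String copied = "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]";
        System.out.println(keepOnlyLastIndex(copied));
        // prints: //body/div/div/div/div/div/ul/li[11]
    }
}
```

Dropping the intermediate indices makes each step match any sibling with that tag name, so the path tolerates an element being added or removed higher up, at the cost of possibly matching more than one node.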

Both of these methods solve the problem of the invalid XPath.
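One plausible reason a class-anchored path holds up where the copied path does not is that the DOM shown in the F12 tools differs from the HTML the crawler downloads, for example because a script inserts an element. This is an assumption; the source does not confirm the cause. The sketch below illustrates the effect with the JDK's built-in javax.xml.xpath engine rather than HtmlCleaner, and the two markup strings, the shortened paths, and the name XPathBrittlenessDemo are all made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBrittlenessDemo {

    // Hypothetical DOM as seen in F12: a script has injected an extra <div> first.
    static final String BROWSER_DOM =
        "<html><body><div class='ad'></div>"
        + "<div class='p-base'><ul><li>Total Played: 123</li></ul></div></body></html>";

    // Hypothetical raw HTML as the crawler downloads it: no injected <div>.
    static final String DOWNLOADED_HTML =
        "<html><body>"
        + "<div class='p-base'><ul><li>Total Played: 123</li></ul></div></body></html>";

    // Absolute path as it would be copied from the browser DOM above.
    static final String ABSOLUTE = "/html/body/div[2]/ul/li[1]";

    // Relative path anchored on the class attribute.
    static final String RELATIVE = "//div[@class='p-base']/ul/li[1]";

    // Returns how many nodes the expression selects in the given markup.
    static int countMatches(String markup, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("absolute, browser DOM:     " + countMatches(BROWSER_DOM, ABSOLUTE));
        System.out.println("absolute, downloaded HTML: " + countMatches(DOWNLOADED_HTML, ABSOLUTE));
        System.out.println("relative, browser DOM:     " + countMatches(BROWSER_DOM, RELATIVE));
        System.out.println("relative, downloaded HTML: " + countMatches(DOWNLOADED_HTML, RELATIVE));
    }
}
```

In this toy case the absolute path selects one node in the browser markup and none in the downloaded markup, while the relative path selects one node in both, which mirrors the empty array seen in the debugger.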
