Youku TV Series Crawler, Code Implementation Part 1: Downloading and Parsing Video Site Pages (3). Supplementary Knowledge: What to Do When an XPath Does Not Work

Last Update:2017-01-13 Source: Internet

Author: User

Tags xpath


What should you do when an XPath does not work? The XPath was obtained by pressing F12, locating the child node, and copying its XPath, which in theory is correct. So why does it fail once it is put into the code?

Prerequisite: the work in "Youku TV Series Crawler, Code Implementation Part 1: Downloading and Parsing Video Site Pages (2)" has been completed. The code is refined further on that basis.

1. Create a new page parsing interface.

package com.dajiangtai.djt_spider.service;

import com.dajiangtai.djt_spider.entity.Page;

/**
 * Page parsing interface
 * @author Administrator
 *
 */
public interface IProcessService {

    public void process(Page page);
}

2. Create the page parsing implementation class.

package com.dajiangtai.djt_spider.service.impl;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.util.HtmlUtil;
import com.dajiangtai.djt_spider.util.LoadPropertyUtil;
import com.dajiangtai.djt_spider.util.RegexUtil;

/**
 * Youku page parsing implementation class
 * @author Administrator
 *
 */
public class YOUKUProcessService implements IProcessService {

    // XPath of the total play count
    private String parseAllNumber = "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]";

    public void process(Page page) {

        String content = page.getContent();
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        // Use HtmlCleaner to parse the web page and get the root node
        TagNode rootNode = htmlCleaner.clean(content);
        try {
            Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber);
            if (evaluateXPath.length > 0) {

                // Use the XPath to locate the child node and print its text
                TagNode node = (TagNode) evaluateXPath[0];
                System.out.println(node.getText().toString());
            }
        } catch (XPatherException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

}

3. Create the TV series crawler entry class StartDSJCount. Define the downLoadService and processService fields, generate their get/set methods, and use the set methods to supply an implementation for each of the two interfaces. The page download and page parsing methods wrap the earlier work. Then test the page parsing again.

package com.dajiangtai.djt_spider.start;

import com.dajiangtai.djt_spider.entity.Page;
import com.dajiangtai.djt_spider.service.IDownLoadService;
import com.dajiangtai.djt_spider.service.IProcessService;
import com.dajiangtai.djt_spider.service.IStoreService;
import com.dajiangtai.djt_spider.service.impl.ConsoleStoreService;
import com.dajiangtai.djt_spider.service.impl.HttpClientDownLoadService;
import com.dajiangtai.djt_spider.service.impl.YOUKUProcessService;

/**
 * TV series crawler entry class
 * @author Administrator
 *
 */
public class StartDSJCount {

    // Page download interface
    private IDownLoadService downLoadService;
    // Page parsing interface
    private IProcessService processService;

    public static void main(String[] args) {
        StartDSJCount dsj = new StartDSJCount();

        // HttpClientDownLoadService implements the IDownLoadService interface
        dsj.setDownLoadService(new HttpClientDownLoadService());

        // YOUKUProcessService implements the IProcessService interface
        dsj.setProcessService(new YOUKUProcessService());
        String url = "http://list.youku.com/show/id_z9cd2277647d311e5b692.html?spm=a2h0j.8191423.sMain.5~5~A!2.iCUyO9";
        // Download the page
        Page page = dsj.downloadPage(url);
        // Parse the page (test)
        dsj.processPage(page);

    }

    // Page download method
    public Page downloadPage(String url) {
        return this.downLoadService.download(url);
    }

    // Page parsing method
    public void processPage(Page page) {
        this.processService.process(page);
    }

    public IDownLoadService getDownLoadService() {
        return downLoadService;
    }

    public void setDownLoadService(IDownLoadService downLoadService) {
        this.downLoadService = downLoadService;
    }

    public IProcessService getProcessService() {
        return processService;
    }

    public void setProcessService(IProcessService processService) {
        this.processService = processService;
    }

}

4. Run the main method to test. If everything is correct, it should print the field highlighted in red on the page, namely: Total Played: 16,960,061,208

However, the console prints nothing.
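One way to investigate an empty result like this is to evaluate the path one step at a time and see where it stops matching. The following is only a minimal sketch of that idea: it uses the JDK's built-in javax.xml.xpath API on a small, well-formed sample document rather than HtmlCleaner, and the class name XPathPrefixDebugger, the method names, and the sample markup are all made up for illustration. It also assumes a simple path whose predicates contain no slashes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the crawler project.
class XPathPrefixDebugger {

    /** Parses a small, well-formed XHTML string into a DOM document. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    /**
     * Returns the shortest prefix of a simple absolute XPath that matches
     * no nodes, or null if the whole path matches.
     */
    static String firstFailingPrefix(Document doc, String absoluteXPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        StringBuilder prefix = new StringBuilder();
        for (String step : absoluteXPath.split("/")) {
            if (step.isEmpty()) {
                continue; // skip the empty token produced by the leading slash
            }
            prefix.append('/').append(step);
            NodeList nodes = (NodeList) xpath.evaluate(
                    prefix.toString(), doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return prefix.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup with a single div under body.
        Document doc = parse(
                "<html><body><div><ul><li>a</li><li>b</li></ul></div></body></html>");
        System.out.println(firstFailingPrefix(doc, "/html/body/div[2]/ul/li[1]")); // /html/body/div[2]
        System.out.println(firstFailingPrefix(doc, "/html/body/div[1]/ul/li[2]")); // null
    }
}
```

The first path fails at div[2] because the sample body has only one div, which is the kind of index mismatch that can make a copied absolute path return nothing.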

This means the XPath did not match and parsing failed. Stepping through the code in the debugger shows that after Object[] evaluateXPath = rootNode.evaluateXPath(parseAllNumber), the value of evaluateXPath is [], an empty array. What causes the parsing to fail? The XPath in question is "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]", which is clearly an absolute path. I do not know the specific reason for the failure; for now I attribute it to the use of an absolute path. From my own experience, there are currently two workarounds:

1. If the copied XPath does not parse correctly, the simplest fix is to keep the index on the last tag only, delete the indexes on all the other tags, and start the path from body.

2. Rewrite it as a relative path: "//div[@class=\"p-base\"]/ul/li[11]". This follows the XPath usage described in the blog post at http://www.cnblogs.com/miercler/p/5599465.html.
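The first workaround can be applied mechanically. Here is a rough sketch of my reading of it: drop the leading /html step so the path starts from body, and strip the numeric indexes from every step except the last. The class XPathIndexStripper and its method are hypothetical names, and I have not confirmed that the rewritten path matches the Youku page. As a side note, HtmlCleaner's XPather documentation lists examples such as /body/*[1]/@type, which suggests that it evaluates paths relative to the html root node; treat that as an assumption to verify, not as the confirmed cause of the failure.

```java
// Hypothetical helper, not part of the crawler project.
class XPathIndexStripper {

    /**
     * Drops a leading /html step and removes numeric predicates such as [4]
     * from every step except the last one.
     */
    static String keepLastIndexOnly(String absoluteXPath) {
        String path = absoluteXPath.startsWith("/html/")
                ? absoluteXPath.substring("/html".length())
                : absoluteXPath;
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash < 0) {
            return path; // no steps to rewrite
        }
        String head = path.substring(0, lastSlash);
        String lastStep = path.substring(lastSlash);
        // Only purely numeric predicates in the leading steps are removed.
        return head.replaceAll("\\[\\d+\\]", "") + lastStep;
    }

    public static void main(String[] args) {
        String copied = "/html/body/div[4]/div/div[1]/div[2]/div[2]/ul/li[11]";
        System.out.println(keepLastIndexOnly(copied));
        // prints /body/div/div/div/div/div/ul/li[11]
    }
}
```

Keep in mind that a path without indexes is looser than the original: it can match any chain of nested div elements with that shape, so it is worth checking that the first match is the node you want.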

Either of these two methods solves the problem of the XPath not working!
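To illustrate why a relative path anchored on a class attribute tends to hold up better, here is a small sketch under stated assumptions. It again uses the JDK's javax.xml.xpath API instead of HtmlCleaner, and the two sample documents, the banner div, and the play count value are invented for the demonstration; they are not taken from the real Youku page. The idea is that the markup seen in the browser and the markup the crawler downloads may differ by an extra element, which shifts the indexes of an absolute path but not an attribute match.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class with made-up sample markup.
class RelativeXPathDemo {

    // Markup as the browser might show it, with an extra div before the target.
    static final String BROWSER_DOM =
            "<html><body><div class='banner'>ad</div><div class='p-base'><ul>"
            + "<li>Genre: drama</li><li>Total Played: 123</li></ul></div></body></html>";

    // Markup as the crawler might download it, without the extra div.
    static final String DOWNLOADED_HTML =
            "<html><body><div class='p-base'><ul>"
            + "<li>Genre: drama</li><li>Total Played: 123</li></ul></div></body></html>";

    static final String ABSOLUTE = "/html/body/div[2]/ul/li[2]";
    static final String RELATIVE = "//div[@class='p-base']/ul/li[2]";

    /** Returns the text of the first match, or an empty string if nothing matches. */
    static String textAt(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("[" + textAt(BROWSER_DOM, ABSOLUTE) + "]");     // [Total Played: 123]
        System.out.println("[" + textAt(DOWNLOADED_HTML, ABSOLUTE) + "]"); // []
        System.out.println("[" + textAt(BROWSER_DOM, RELATIVE) + "]");     // [Total Played: 123]
        System.out.println("[" + textAt(DOWNLOADED_HTML, RELATIVE) + "]"); // [Total Played: 123]
    }
}
```

The absolute path only works on the document it was copied from, while the relative path finds the same list item in both. The li[11] index in the real expression is still positional, so it would need adjusting if the list itself changes.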


