Java-xpath Parsing Crawl Content

Source: Internet
Author: User

We have plenty of choices when it comes to crawling and parsing content.
For example, many people feel that jsoup alone can solve everything:
HTTP requests, DOM manipulation, and CSS selector filtering are all very handy with it.

The catch is the selector: an expression only filters you down to the nodes.
If I want a text value or an attribute value, I still have to pull it out of the returned element object with another call.
And I happen to have an interesting requirement that a single expression fits perfectly: for every news page, extract exactly what I want — a headline, a link, and so on — in one expression.
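For instance, with jsoup alone the selector narrows things down, but the attribute still takes a second call on the returned element (a minimal sketch; the HTML string and the firstHref helper are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectDemo {
    // Hypothetical helper: the CSS selector finds the element,
    // but the href still needs a separate attr() call on it.
    static String firstHref(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("h2 > a").first();
        return link.attr("href");
    }

    public static void main(String[] args) {
        String html = "<h2><a href='/daily/1'>Headline</a></h2>";
        System.out.println(firstHref(html));
    }
}
```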

 
XPath is a good fit for this, as in the following example:

static void crawlByXPath(String url, String xpathExp)
        throws IOException, ParserConfigurationException, SAXException, XPathExpressionException {
    String html = Jsoup.connect(url).post().html();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse() expects a URI or stream, not a raw string, so wrap the HTML in an InputSource
    Document document = builder.parse(new InputSource(new StringReader(html)));
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    XPathExpression expression = xPath.compile(xpathExp);
    // evaluate against the parsed document, not the HTML string
    System.out.println(expression.evaluate(document));
}

   
Unfortunately, very few sites serve HTML that DocumentBuilder.parse can get through in this code:
XPath requires a well-formed DOM, and real-world HTML rarely is.
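How strict the JDK parser is can be seen with a pure-JDK sketch (the HTML snippets and the parses helper here are made up for illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StrictParseDemo {
    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean parses(String html) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A well-formed fragment is fine.
        System.out.println(parses("<div><p>ok</p></div>"));  // true
        // Typical real-world HTML: an unclosed <br> and a bare & break the parser.
        System.out.println(parses("<div><br>R&D</div>"));    // false
    }
}
```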
To clean the HTML first, I added this dependency:

    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.9</version>
    </dependency>

 
HtmlCleaner solves this problem for me, and it even supports XPath on its own.
A single HtmlCleaner.clean call is all it takes:

public static void main(String[] args) throws IOException, XPatherException {
    String url = "http://zhidao.baidu.com/daily";
    String contents = Jsoup.connect(url).post().html();
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(contents);
    String xpath = "//h2/a/@href";
    Object[] objects = tn.evaluateXPath(xpath);
    System.out.println(objects.length);
}

 
But HtmlCleaner brought a new problem: when I wrote the expression "//h2/a[contains(@href, 'daily')]/@href", it told me the contains function is not supported.
The thing is, javax.xml.xpath does support XPath functions.
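A minimal pure-JDK check confirms this: contains() is a standard XPath 1.0 function, and javax.xml.xpath accepts it as long as the input is well-formed (the XML snippet and the countMatches helper below are hypothetical, not from the article):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ContainsDemo {
    // Evaluates an XPath expression against a small well-formed document
    // and returns the number of matching nodes.
    static int countMatches(String xml, String exp) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(exp, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ul>"
                + "<li><a href='/daily/1'>one</a></li>"
                + "<li><a href='/weekly/2'>two</a></li>"
                + "</ul>";
        // contains() filters down to the one href containing 'daily'.
        System.out.println(countMatches(xml, "//a[contains(@href, 'daily')]/@href")); // 1
    }
}
```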
So how to combine the two? HtmlCleaner provides a DomSerializer that converts a TagNode into an org.w3c.dom.Document:

Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);

 
This way, each library plays to its strengths:

public static void main(String[] args)
        throws IOException, XPatherException, ParserConfigurationException, XPathExpressionException {
    String url = "http://zhidao.baidu.com/daily";
    String exp = "//h2/a[contains(@href, 'daily')]/@href";
    String html = null;
    try {
        Connection connect = Jsoup.connect(url);
        html = connect.get().body().html();
    } catch (IOException e) {
        e.printStackTrace();
    }
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(html);
    Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
    XPath xPath = XPathFactory.newInstance().newXPath();
    Object result = xPath.evaluate(exp, dom, XPathConstants.NODESET);
    if (result instanceof NodeList) {
        NodeList nodeList = (NodeList) result;
        System.out.println(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            System.out.println(node.getNodeValue() == null
                    ? node.getTextContent()
                    : node.getNodeValue());
        }
    }
}
