There is no shortage of choices for crawling and parsing content.
For example, many people feel that jsoup can solve every problem:
HTTP requests, DOM manipulation, and CSS-selector filtering are all very convenient.
The catch is the selector: an expression only filters down to the nodes themselves.
If I want a node's text or an attribute value, I still have to extract it from the returned Element object.
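For instance, scraping headlines with jsoup is a two-step affair: select the elements first, then pull the text and the attribute out of each one. A minimal sketch (the selector h2 > a is my assumption about the page structure):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTwoStep {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://zhidao.baidu.com/daily").get();
        // Step 1: the CSS selector only gets you the Element objects...
        for (Element a : doc.select("h2 > a")) {
            // Step 2: ...text and attribute values take extra calls on each one.
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}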
And I happen to have an interesting requirement that is exactly this: pull out whatever I want, a headline, a link, and so on, from every news page with a single expression.
XPath is a good fit for that, for example:
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

static void crawlByXPath(String url, String xpathExp)
        throws IOException, ParserConfigurationException, SAXException, XPathExpressionException {
    String html = Jsoup.connect(url).get().html();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse() treats a bare String as a URI, so wrap the HTML in an InputSource
    Document document = builder.parse(new InputSource(new StringReader(html)));
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    XPathExpression expression = xPath.compile(xpathExp);
    expression.evaluate(document);
}
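Hypothetically, grabbing every headline link from a news page would then be a single call (the URL and the expression are the ones this article uses later):

// Hypothetical call to the helper above.
crawlByXPath("http://zhidao.baidu.com/daily", "//h2/a/@href");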
Unfortunately, very few real-world sites make it through DocumentBuilder.parse in this code.
XPath needs a proper DOM, and the W3C parser only accepts well-formed XML, which HTML in the wild almost never is.
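A minimal sketch of the failure (the HTML string is made up): a single unclosed tag is already enough to make the strict parser throw.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class StrictParseDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><body>line one<br>line two</body></html>"; // <br> is never closed
        DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                // throws SAXParseException: element type "br" must be terminated
                .parse(new InputSource(new StringReader(html)));
    }
}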
So the HTML has to be cleaned first, and for that I added this dependency:
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.9</version>
</dependency>
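(For completeness: jsoup itself, which every snippet here relies on, ships as org.jsoup:jsoup; the version below is just an example.)

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>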
HtmlCleaner solves this problem for me, and it even ships with its own XPath support.
A single call to HtmlCleaner.clean does the job:
public static void main(String[] args) throws IOException, XPatherException {
    String url = "http://zhidao.baidu.com/daily";
    String contents = Jsoup.connect(url).get().html();
    HtmlCleaner hc = new HtmlCleaner();
    // clean() repairs the markup and returns the root TagNode
    TagNode tn = hc.clean(contents);
    String xpath = "//h2/a/@href";
    Object[] objects = tn.evaluateXPath(xpath);
    System.out.println(objects.length);
}
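To see the matched values instead of just the count, it should be enough to append a loop to the main method above (a sketch: I am assuming evaluateXPath hands back the href attribute values for this expression, so it just prints each entry as-is):

// Appended to the main method above: print each matched entry.
for (Object o : objects) {
    System.out.println(o);
}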
But HtmlCleaner brings a new problem of its own: when I wrote the expression //h2/a[contains(@href, 'daily')]/@href, it told me that the contains() function is not supported.
javax.xml.xpath, on the other hand, handles XPath functions just fine.
So how do you combine the two? HtmlCleaner provides DomSerializer, which converts a TagNode into an org.w3c.dom.Document, like so:
Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
That way each library plays to its strengths: HtmlCleaner repairs the markup, and javax.xml.xpath evaluates the full expression.
public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException {
    String url = "http://zhidao.baidu.com/daily";
    String exp = "//h2/a[contains(@href, 'daily')]/@href";
    String html = null;
    try {
        Connection connect = Jsoup.connect(url);
        html = connect.get().body().html();
    } catch (IOException e) {
        e.printStackTrace();
    }
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(html);
    // Convert HtmlCleaner's TagNode into a W3C DOM for javax.xml.xpath
    Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
    XPath xPath = XPathFactory.newInstance().newXPath();
    Object result = xPath.evaluate(exp, dom, XPathConstants.NODESET);
    if (result instanceof NodeList) {
        NodeList nodeList = (NodeList) result;
        System.out.println(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            // Attribute nodes carry their value; fall back to text content otherwise
            System.out.println(node.getNodeValue() == null ? node.getTextContent() : node.getNodeValue());
        }
    }
}