HTML Information Retrieval (Part 1): Using JDOM, TagSoup, and XPath

Introduction

This article describes how to use JDOM with TagSoup to parse HTML into a DOM document object model, retrieve information with XPath, and export the document to XHTML.

Information Acquisition

The Internet is rich with content through which people share their interests and knowledge. However, until the Semantic Web becomes widespread, unless a site provides an API for accessing its resources, the only way to obtain its information is to parse its HTML.

Unstructured HTML (Malformed and Invalid HTML)

However, even with the XHTML standard in place, the Web is still full of pages that do not conform to it. This situation even has its own name: tag soup.

Notably, the industry chose the backward-compatible HTML5 over XHTML 2.0, which had tried to stamp out malformed markup by strict means; HTML5 instead embraces the status quo. So if you want to parse HTML, your program probably cannot escape the shadow of tag soup.

In retrospect, XHTML was designed primarily with ease of programmatic processing in mind, while HTML5 was developed for maximum compatibility with real-world usage. Is there a way to let people write web pages as they actually do, while still making them easy for programs to parse?

The answer is right in front of you: the browser. The browser, the protagonist of that era's browser wars, has a lot to do with the existence of tag soup in the first place. Looked at the other way, though, the browser is also the solution to tag soup: browsers manage to display all the non-standard pages that flood the Web because browser programmers do everything they can to guess what the page designer probably meant. However badly formed the HTML a designer writes, a sufficiently lenient parser can still parse it.

Today, with RIA prevalent and JavaScript widely used, client-side programs can easily use frameworks such as jQuery to access the DOM that the browser has already parsed and extract the information they need.


But what about the server side? Short of embedding the Mozilla or WebKit browser core, is there a simpler way?

To extract information from an HTML page on the server, what if we could do as the browser does and convert the page into a DOM, turning any HTML page (standards-conformant or not) into a well-formed XML or XHTML document? We could then access the information with existing APIs and tools, using technologies such as XPath and XPointer.

HTML Parsers

Yes, people thought of this long ago. Here are some examples of HTML parsers:

  • CyberNeko HTML Parser
  • HTML Parser
  • JTidy
  • TagSoup

TagSoup is built on the SAX2 standard of XML parsing and implements the XMLReader interface, so it is highly interoperable and can be used with mainstream tools such as JAXP, JDOM, XOM, and dom4j.
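Because TagSoup presents itself as a standard SAX2 XMLReader, any SAX-driven toolchain can use it. The sketch below shows the XMLReader interface in plain Java, using the JDK's built-in parser as a stand-in for TagSoup (which would require its jar on the classpath); with TagSoup you would instantiate org.ccil.cowan.tagsoup.Parser instead:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    public static void main(String[] args) throws Exception {
        // Obtain a SAX2 XMLReader. With TagSoup on the classpath you would
        // instead write: XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        StringBuilder tags = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                tags.append(qName).append(' ');   // record each opened element
            }
        });

        // Parse a tiny (well-formed, for the JDK parser's sake) document
        reader.parse(new InputSource(new StringReader("<html><body><p>hi</p></body></html>")));
        System.out.println(tags.toString().trim());
    }
}
```

The key point is that the consuming code only sees the XMLReader interface, so swapping the JDK parser for TagSoup changes nothing downstream.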

Parse HTML Using JDOM + TagSoup

Next, I will show how to use JDOM with TagSoup to parse HTML into a DOM document object model, retrieve information with XPath, and export the document to XHTML. (I will cover the use of Groovy's built-in XmlSlurper with TagSoup in a later article.)

Preliminary Setup

First, download and install JDOM:

Download jdom-1.1.1.zip and unzip it; take jdom.jar (JDOM itself) and jaxen.jar (the XPath implementation) and add them to your classpath, put them under the JDK's JRE\lib\ext directory, or put them under Groovy's lib directory.

Next, download TagSoup:

Download tagsoup-1.2.jar and, as with the JDOM jars, add it directly to the classpath.
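Assuming the jars were downloaded into a lib directory next to your script (a hypothetical layout), the classpath can be assembled like this before running:

```shell
# Hypothetical layout: the three downloaded jars sit in ./lib
CP="lib/jdom.jar:lib/jaxen.jar:lib/tagsoup-1.2.jar"

# Run the Groovy script with the jars on the classpath, e.g.:
#   groovy -cp "$CP" gnews.groovy
echo "$CP"
```

(On Windows, use `;` instead of `:` as the path separator.)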

Use SAXBuilder to Create a DOM Document

First, set up JDOM's SAXBuilder. SAXBuilder's constructor accepts the fully qualified class name of the parser to use. Simply pass in TagSoup's parser class, org.ccil.cowan.tagsoup.Parser, and TagSoup will do the parsing:

def builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser")

Once the SAXBuilder is created, call its build() method to parse the HTML and create the DOM object.

Although SAXBuilder provides several build() overloads for different inputs, including build(String systemId), which lets you pass a URI directly, that version parses the resource using the system default encoding; it does not detect the character set the HTML page itself declares. On a Traditional Chinese system, for example, the page would be decoded as Big5. If the target page does not actually use Big5 (http://news.google.com.tw/, for instance, is UTF-8), the result is a pile of mojibake. You must therefore use the build(org.xml.sax.InputSource in) overload instead: create an InputSource, set the correct encoding on it, and then hand it to SAXBuilder's build() method:

def is = new InputSource("http://news.google.com.tw/")
is.setEncoding("UTF-8")
def doc = builder.build(is)
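Why the encoding matters: decoding a page's bytes with the wrong charset turns every multi-byte character into garbage. A minimal, self-contained sketch of that failure mode (in Java, since JDOM and TagSoup are Java libraries; the sample string is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String original = "新聞";                                 // two Chinese characters, as on the Google News page
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // 6 bytes on the wire (3 per character)

        // Decoding with the right charset round-trips cleanly...
        String right = new String(utf8, StandardCharsets.UTF_8);

        // ...but decoding the same bytes with a single-byte charset yields mojibake:
        // each of the 6 bytes becomes its own bogus character.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original));   // true
        System.out.println(wrong.equals(original));   // false
        System.out.println(wrong.length());           // 6
    }
}
```

This is exactly what happens when build(String) decodes a UTF-8 page with the platform default charset, and why setEncoding("UTF-8") on the InputSource is needed.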

With that, you have the DOM object. Note, however, that the object created by SAXBuilder's build() method is an org.jdom.Document. If you need an org.w3c.dom.Document instead (for example, to process the DOM further through the standard Java interfaces), you must convert it with org.jdom.output.DOMOutputter:

org.w3c.dom.Document w3cdoc = new DOMOutputter().output(doc)

Similarly, to export the JDOM document as XML/XHTML, use org.jdom.output.XMLOutputter:

String xhtml = new org.jdom.output.XMLOutputter().outputString(doc)

Or write it directly to a file:

new org.jdom.output.XMLOutputter().output(doc, new FileWriter("output.html"))
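Once you have converted the result to an org.w3c.dom.Document with DOMOutputter, you can also serialize it with the JDK's standard javax.xml.transform API instead of XMLOutputter. A self-contained sketch (the tiny document is built from scratch here; in the article it would come from DOMOutputter):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        // Build a tiny org.w3c.dom.Document; in the article this would be
        // the w3cdoc produced by new DOMOutputter().output(doc)
        Document w3cdoc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = w3cdoc.createElement("html");
        Element body = w3cdoc.createElement("body");
        body.setTextContent("hello");
        html.appendChild(body);
        w3cdoc.appendChild(html);

        // Serialize with the standard transform API, the JDK analogue of XMLOutputter
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(w3cdoc), new StreamResult(out));
        String xml = out.toString();
        System.out.println(xml);
    }
}
```

Pointing the StreamResult at a FileWriter instead of a StringWriter writes the markup to disk, just like the XMLOutputter one-liner above.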

Use XPath to Obtain Information

Once you have the DOM object, you can use XPath to retrieve the information. The basics of XPath are simple, but the topic is sizable, so no XPath tutorial is given here; please search for one if you need it.

Here I will show how to retrieve all the news headlines on the Google News front page. Looking at http://news.google.com.tw/, you can see that each headline is presented in the following format, for example:

<span class="titletext">Zeng Yanni's magic data and pink colors</span>

As the markup shows, we want to select every HTML element whose class attribute contains titletext. The XPath expression is as follows:

//*[contains(@class,'titletext')]
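To try this expression independently of JDOM and TagSoup, here is a sketch using the JDK's built-in XPath engine on a small well-formed stand-in for the Google News markup (the fragment and headline texts are made up):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // A tiny well-formed stand-in for the Google News page
        String page = "<html><body>"
                + "<span class='titletext'>headline one</span>"
                + "<span class='other'>not a headline</span>"
                + "<span class='titletext'>headline two</span>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));

        // The same expression used in the article
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[contains(@class,'titletext')]", doc, XPathConstants.NODESET);

        int count = hits.getLength();   // matches only the titletext spans
        for (int i = 0; i < count; i++) {
            System.out.println(hits.item(i).getTextContent());
        }
    }
}
```

Note that contains() does a substring match on the whole class attribute; for multi-class attributes a stricter test is sometimes needed, but it suffices for this page.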

With the XPath expression decided, you can use it to fetch the headlines and print them out:

def xpath = XPath.newInstance("//*[contains(@class,'titletext')]")
def result = xpath.selectNodes(doc)
result.each { println it.text }

The complete program list is as follows:

gnews.groovy

import org.jdom.*
import org.jdom.input.*
import org.jdom.xpath.*
import org.jdom.output.*
import org.xml.sax.*

def builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser")
def xpath = XPath.newInstance("//*[contains(@class,'titletext')]")
def is = new InputSource("http://news.google.com.tw/")
is.setEncoding("UTF-8")
def doc = builder.build(is)
def result = xpath.selectNodes(doc)
result.each { println it.text }

To run the above Groovy program, enter the following at the command line:

groovy gnews

The latest headlines from Google News will be printed to the screen.

That's it! Simple, isn't it?
