Try web-harvest

Source: Internet
Author: User
Tags: xquery

Web-harvest is an open-source Java tool for web data extraction. It can fetch specified web pages and extract useful data from them, relying mainly on technologies such as XSLT, XQuery, and regular expressions to perform text/XML operations.
I find the tool's design concept quite good: a hand-written XML script converts the specified HTML into XML, and an XML parser then extracts the information from it. Written this way, a web-page information extraction tool no longer has to worry that a change in the page format will break the extraction results, because the whole extraction is driven by the configured script; we only need to modify the script, not the program code.
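The idea above can be illustrated with a small sketch in plain Python (standard library only, not web-harvest itself): the extraction rules live in a configuration structure rather than in the code, so a page redesign only requires editing the rules. The page markup and rule names here are invented for illustration.

```python
# Illustrative sketch only (plain Python, not web-harvest): extraction
# rules are kept as data, so a page-format change means editing the
# rules, not the program code.
import xml.etree.ElementTree as ET

# XPath-style rules, analogous to a web-harvest script (hypothetical layout).
RULES = {
    "item": ".//li",   # one search result per list item
    "link": "./a",     # result link: text is the name, href is the target
}

def extract(xhtml):
    """Parse XML-cleaned HTML and apply the configured rules."""
    root = ET.fromstring(xhtml)  # assumes HTML has already been cleaned to XML
    results = []
    for item in root.findall(RULES["item"]):
        link = item.find(RULES["link"])
        results.append({
            "name": (link.text or "").strip(),
            "src": link.get("href", ""),
        })
    return results

page = """<ol>
  <li><a href="http://example.org/a">First hit</a></li>
  <li><a href="http://example.org/b">Second hit</a></li>
</ol>"""

print(extract(page))
```

If the site later wraps each result in an extra `<div>`, only the strings in `RULES` need to change, which is exactly the property the web-harvest script gives us.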
The following script, adapted from an example shipped with this open-source tool, extracts search results from Yahoo. Running it with web-harvest pulls out the results returned by the Yahoo search engine for the keyword "KMS".

Script XML:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">

    <include path="functions.xml"/>

    <var-def name="search">kms</var-def>
    <var-def name="url">
        <template>http://search.yahoo.com/search?p=${search}</template>
    </var-def>

    <!-- collects all list items for individual results -->
    <var-def name="products">
        <call name="download-multipage-list">
            <call-param name="pageUrl"><var name="url"/></call-param>
            <call-param name="nextXPath">//big[.='Next']/a/@href</call-param>
            <call-param name="itemXPath">//ol/li</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extracts desired data -->
    <file action="write" path="myyahoo.xml" charset="UTF-8">
        <![CDATA[ <yahoo> ]]>
        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        let $name := data($item//div[1]/a[1])
                        let $src  := data($item//div[1]/a[1]/@href)
                        let $abs  := data($item//div[2])
                        return
                            <product>
                                <name>{normalize-space($name)}</name>
                                <src>{normalize-space($src)}</src>
                                <abs>{normalize-space($abs)}</abs>
                            </product>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </yahoo> ]]>
    </file>

</config>


Result XML:
<yahoo>
    <product>
        <name>kms research</name>
        <src>g8dzqrsa1dtmqrwb3mdmqrzzwmdc3iednrpzam-/sig=11fph2b7-/exp=1167795403/**http%3a//</src>
        <abs>learn about each subbrand which has its own purpose and look to support your way of life, mood, or whim.</abs>
    </product>
    <product>
        <name>summer - kms promotional items</name>
        <src>g8dzqrsa1dtmqrwb3mdmtawbhnlywnzcgr2dglkaw--/sig=11q4tb45p/exp=1167795451/**http%3a//</src>
        <abs>kms design. special designs. onpacks and inpacks ... kms presents the smallest solar charger available ... the kms soft frisbee - this UFO is foldable! ...</abs>
    </product>
</yahoo>
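To show what the XQuery step does to each collected item, here is a rough Python equivalent using only the standard library (the sample `<li>` markup below is invented for illustration, not real Yahoo output): it pulls the same three fields (name, src, abs) from one item and collapses whitespace the way `normalize-space()` does.

```python
# Rough Python equivalent of the script's XQuery step, for illustration
# only (the sample item markup is invented, not real Yahoo output).
import xml.etree.ElementTree as ET

item = ET.fromstring("""<li>
  <div><a href="http://www.kmsresearch.com/">kms  research</a></div>
  <div>learn about each subbrand which has
       its own purpose.</div>
</li>""")

def normalize_space(s):
    """Same effect as XQuery's normalize-space(): collapse whitespace runs."""
    return " ".join(s.split())

divs = item.findall("div")                  # $item//div[1], $item//div[2]
link = divs[0].find("a")                    # div[1]/a[1]
product = {
    "name": normalize_space("".join(link.itertext())),    # data(.../a[1])
    "src":  link.get("href", ""),                         # .../a[1]/@href
    "abs":  normalize_space("".join(divs[1].itertext())), # data(...//div[2])
}
print(product)
```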

If you are familiar with XML, XPath, and XQuery, and have read the web-harvest documentation, the script above should not be hard to understand.
During the trial I also found some problems with web-harvest. For example, it uses TagSoup to clean HTML pages, which can cause data loss on some non-standard pages (such as Google's search page); I hope the web-harvest developers can improve this. After all, few web pages strictly comply with the HTML 4.0 specification, and most pages in existence predate XML. XML is currently the most suitable technology for web information extraction, and web-harvest has already established a workable extraction model, so lossless conversion of the large number of non-standard web pages to XML will be the key to whether this tool can be used in practice.
In addition (though my testing was limited), I did not encounter any garbled text while using web-harvest to extract Chinese web pages.
The purpose of this article is to attract more attention to web-harvest. It has many advanced uses I have not yet studied, and there is still much room for improvement, but it has at least shown me one thing: fully script-driven extraction of structured information from dynamic web pages is achievable and not difficult.
XPath Tutorial:
XQuery Tutorial:
