Try Web-Harvest
Web-Harvest is an open-source web data extraction tool written in Java. It can crawl specified web pages and extract useful data from them. Web-Harvest mainly relies on technologies such as XSLT, XQuery, and regular expressions to perform its text/XML operations.
I personally like this tool's design concept: a hand-written XML script converts the specified HTML into XML, and an XML parser then extracts the information from it. Built this way, a web-page information extraction tool does not have to worry that changes in a page's format will break the extraction results, because the entire extraction process is defined by the corresponding script; when a page changes, we only modify the script, never the program code.
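To make this concrete, here is a minimal sketch of how such a script might be executed from Java, following the Scraper API shown in the Web-Harvest manual; the script name and working directory are made up for illustration:

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    public class RunExtraction {
        public static void main(String[] args) throws Exception {
            // All extraction logic lives in the external XML script; when a
            // page layout changes, only that file needs editing, not this class.
            ScraperConfiguration config = new ScraperConfiguration("yahoo-search.xml");

            // The second argument is the working directory for output files.
            Scraper scraper = new Scraper(config, "./work");
            scraper.setDebug(true);
            scraper.execute();
        }
    }

Once a wrapper like this exists, adapting the extractor to a new site or layout is purely a matter of editing the XML script.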
The following script, based on an example provided with this open-source tool, extracts Yahoo search information. Executing it with Web-Harvest extracts the results returned by the Yahoo search engine for the keyword "KMS".
Script XML:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
    <include path="functions.xml"/>

    <var-def name="search">kms</var-def>
    <var-def name="url">
        <template>http://search.yahoo.com/search?p=${search}</template>
    </var-def>

    <!-- collects all tables for individual products -->
    <var-def name="products">
        <call name="download-multipage-list">
            <call-param name="pageUrl"><var name="url"/></call-param>
            <call-param name="nextXPath">//big[.='Next']/a/@href</call-param>
            <call-param name="itemXPath">//ol/li</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extracts desired data -->
    <file action="write" path="myyahoo.xml" charset="UTF-8">
        <![CDATA[ <yahoo> ]]>
        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        let $name := data($item//div[1]/a[1])
                        let $src := data($item//div[1]/a[1]/@href)
                        let $abs := data($item//div[2])
                        return
                            <product>
                                <name>{normalize-space($name)}</name>
                                <src>{normalize-space($src)}</src>
                                <abs>{normalize-space($abs)}</abs>
                            </product>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </yahoo> ]]>
    </file>
</config>
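As a side note, the two XPath expressions handed to download-multipage-list can be tried out independently of Web-Harvest with the standard Java XPath API. A small sketch, assuming the results page has already been cleaned into well-formed XML and saved under the hypothetical name results.xml:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathCheck {
        public static void main(String[] args) throws Exception {
            // Parse the cleaned (well-formed) copy of the results page.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("results.xml");
            XPath xpath = XPathFactory.newInstance().newXPath();

            // itemXPath: each <li> inside an <ol> is one search result.
            NodeList items = (NodeList) xpath.evaluate("//ol/li",
                    doc, XPathConstants.NODESET);
            System.out.println("results on this page: " + items.getLength());

            // nextXPath: the link labelled 'Next' leads to the following page.
            String next = xpath.evaluate("//big[.='Next']/a/@href", doc);
            System.out.println("next page: " + next);
        }
    }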
Result XML:
<yahoo>
    <product>
        <name>kmsresearch</name>
        <src>http://rds.yahoo.com/_ylt=A0geuodL05lFpaEArQxXNyoA;_ylu=X3oDMTB2b2gzdDdtBGNvbG8dzqrsa1dtmqrwb3mdmqrzzwmdc3iednrpzam-/SIG=11fph2b7-/EXP=1167795403/**http%3A//www.kmshaircare.com/</src>
        <abs>learn about each subbrand which has its own purpose and look to support your way of life, mood, or whim.</abs>
    </product>
    ...
    <product>
        <name>summer-kms promotional items</name>
        <src>http://rds.yahoo.com/_ylt=A0geupZ705lFwVkAMQZXNyoA;_ylu=X3oDMTExYm1vY2p0BGNvbG8dzqrsa1dtmqrwb3mdmtawbhnlywnzcgr2dglkaw--/SIG=11q4tb45p/EXP=1167795451/**http%3A//kms-fra.com/en/products/sommer/</src>
        <abs>kms design. Special Designs. onpacks and inpacks ... kms presents the smallest solar charger available ... the kms softfrisbee - this UFO is foldable! ...</abs>
    </product>
</yahoo>
If you are familiar with XML, XPath, and XQuery, and have read the Web-Harvest manual (http://web-harvest.sourceforge.net/manual.php), the script above should not be hard to understand.
During the trial I also found some problems with Web-Harvest. For example, it uses TagSoup to clean HTML pages, which can lose data on some non-standard pages (such as Google's search results page); I hope the Web-Harvest developers can improve this. After all, few web pages strictly comply with the HTML 4.0 specification; most of them predate the emergence of XML. XML technology is currently the most suitable foundation for web information extraction, and Web-Harvest has already established a workable extraction model; how to achieve lossless XML conversion for the large number of non-standard web pages will be the key to whether this tool can be used in practice.
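For reference, the cleaning step that Web-Harvest delegates to TagSoup can be reproduced in isolation, which makes it easy to inspect exactly what the cleaner keeps or drops from a problematic page. The sketch below uses the common TagSoup pattern of a SAX parse fed through an identity transform (input and output file names are made up):

    import java.io.FileReader;
    import java.io.FileWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class CleanHtml {
        public static void main(String[] args) throws Exception {
            // TagSoup is a SAX parser that accepts messy real-world HTML
            // and reports it as if it were well-formed XHTML.
            XMLReader tagsoup = new Parser();

            // An identity transform serializes the SAX events back out,
            // yielding well-formed XML that XPath/XQuery can operate on.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(
                    new SAXSource(tagsoup, new InputSource(new FileReader("page.html"))),
                    new StreamResult(new FileWriter("page.xml")));

            // Diffing page.html against page.xml shows what the cleaner
            // changed or dropped.
        }
    }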
In addition, limited as my testing has been, I have not yet encountered any garbled-character problems when using Web-Harvest to extract Chinese web pages.
The purpose of this article is to draw more attention to the Web-Harvest tool. Web-Harvest has many advanced uses that I have not yet studied, and it still has plenty of room for improvement, but it has given me at least one insight: fully structured extraction of dynamic web page information is achievable, and not difficult.
References:
Web-harvest: http://web-harvest.sourceforge.net/
XPath Tutorial: http://www.zvon.org/xxl/XPathTutorial/Output_chi/introduction.html
XQuery Tutorial: http://www.w3pop.com/tech/school/xquery/default.asp