Summary of Using the Web-Harvest Web Crawler

Source: Internet
Author: User

Web-Harvest is an open-source Web data extraction tool written in Java. It fetches the Web pages you specify and extracts useful data from them. It works as follows: driven by a predefined configuration file, it retrieves the full content of a page with HttpClient (covered in earlier articles on this blog), then filters the text/XML content with techniques such as XPath, XQuery, and regular expressions to pick out exactly the data you need. The vertical search sites that were popular a couple of years ago (such as Cool news) were built on a similar principle. The key to using Web-Harvest is understanding and writing the configuration file; the remaining work is the Java code that processes the extracted data. Before the crawler starts, you can also inject Java variables into the configuration file to make the configuration dynamic.
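To make the "fetch, then filter" idea concrete, here is a minimal sketch of the filtering half using only the standard java.util.regex package. It is not Web-Harvest code: the class name RegexFilterSketch, the method extractLinks, and the example.com URLs are all hypothetical, and the page body is a hard-coded string standing in for what an HTTP client would return.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Web-Harvest.
public class RegexFilterSketch {

    // Matches <a ... href="...">text</a>; group 1 is the URL, group 2 the link text.
    // Deliberately simple: assumes double-quoted href values and no nested <a> tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns URL -> trimmed link text, in document order. */
    public static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the page body that an HTTP client would return.
        String page = "<ul>"
                + "<li><a href=\"http://example.com/a.shtml\" id=\"x\"> Board A </a></li>"
                + "<li><a href=\"http://example.com/b.shtml\"> Board B </a></li>"
                + "</ul>";
        extractLinks(page).forEach((url, text) -> System.out.println(text + " -> " + url));
    }
}
```

A regular expression like this is fragile on real-world HTML, which is one reason a tool such as Web-Harvest also offers XPath and XQuery for the filtering step.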

The following example captures all the boards of the Tianya forum to show how Web-Harvest is used, with a focus on its configuration file.

The Tianya forum map page is: http://www.tianya.cn/bbs/index.shtml

[Part of the Tianya board list]

Our goal is to capture information about every board, including the parent-child relationships between boards.

First, look at the source code of the forum map page and look for a pattern:

<div class="Backrgoundcolor">
  <div class="Bankuai_list">
    <ul>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/free.shtml" id="Item Tianya">Tianya talk</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/worldlook.shtml" id="Item International Observation">International observation</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/news.shtml" id="Item Tianya Time and Space">Tianya</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/no06.shtml" id="Item Media Arena">Media Arena</a></li>
      ... // omitted
    </ul>
  </div>
  <div class="Clear"></div>
</div>
<div class="Nobackrgoundcolor">
  <div class="Bankuai_list">
    <ul>
      <li><a href="http://www.tianya.cn/techforum/articleslist/0/16.shtml" id="item Lotus nonsense">Lotus nonsense</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/no05.shtml" id="Item boiling Wine theory history">Cooking wine History</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/culture.shtml" id="Item wordsmith">Wordsmith</a></li>
      ... // omitted
    </ul>
  </div>
  <div class="Clear"></div>
</div>
... // omitted
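
The pattern in this listing is that every board is an a element inside li inside ul, under a div whose class is Bankuai_list. As a rough sketch of how an XPath expression can pick those links out, the following uses the JDK's own javax.xml.xpath API rather than Web-Harvest itself. The class name BoardLinkXPathSketch, the method extractBoards, and the wrapping root element are hypothetical. The class value Bankuai_list is copied as printed in the listing (the live page's casing may differ), and the input is assumed to be well-formed XML, which raw HTML generally is not until it has been cleaned up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class, not part of Web-Harvest.
public class BoardLinkXPathSketch {

    // Selects every board link; the class value is taken from the listing as printed.
    private static final String BOARD_LINKS = "//div[@class='Bankuai_list']/ul/li/a";

    /** Returns one {name, url} pair per board link found in a well-formed fragment. */
    public static List<String[]> extractBoards(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xpath.evaluate(BOARD_LINKS, doc, XPathConstants.NODESET);

            List<String[]> boards = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                boards.add(new String[] { a.getTextContent().trim(), a.getAttribute("href") });
            }
            return boards;
        } catch (Exception e) {
            // Parsing fails on input that is not well-formed XML.
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // Two entries copied from the listing, wrapped in a single root element
        // (the wrapper is an assumption so that the fragment parses as XML).
        String fragment = "<root>"
                + "<div class=\"Backrgoundcolor\"><div class=\"Bankuai_list\"><ul>"
                + "<li><a href=\"http://www.tianya.cn/publicforum/articleslist/0/free.shtml\""
                + " id=\"Item Tianya\"> Tianya talk </a></li>"
                + "<li><a href=\"http://www.tianya.cn/publicforum/articleslist/0/no06.shtml\""
                + " id=\"Item Media Arena\"> Media Arena </a></li>"
                + "</ul></div><div class=\"Clear\"></div></div>"
                + "</root>";
        for (String[] board : extractBoards(fragment)) {
            System.out.println(board[0] + " -> " + board[1]);
        }
    }
}
```

The sketch only recovers each board's name and URL. How the parent-child relationship between boards is encoded is not visible in the truncated listing, so it is left out here.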
