Web-Harvest is an open-source Web data extraction tool written in Java. It collects specified Web pages and extracts useful data from them. It works by fetching the full content of a page with httpclient according to a predefined configuration file (httpclient has been covered in earlier articles on this blog), and then filtering the text/XML content with techniques such as XPath, XQuery, and regular expressions to select exactly the data you want. The vertical search engines that were popular a couple of years ago (such as Cool news) were built on a similar principle. The key to using Web-Harvest is understanding and writing the configuration file; the other part is deciding how your Java code handles the extracted data. Before the crawler starts, you can also populate the configuration file with Java variables to make the configuration dynamic.
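To make the "fetch, then filter with XPath" principle concrete, here is a minimal sketch that uses only the JDK's built-in XPath support, not Web-Harvest itself. The class name XPathFilterSketch and the method select are made up for illustration, and the sample markup is assumed to be well-formed XML already (real pages usually need to be cleaned up into XML before XPath can be applied).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: illustrates XPath filtering with the JDK, not the Web-Harvest API.
class XPathFilterSketch {

    // Returns the text value of every node (element or attribute) matched by the expression.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Covers both a bad expression and input that is not well-formed XML.
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for page content that would normally be downloaded first.
        String page = "<ul>"
                + "<li><a href=\"http://www.tianya.cn/publicforum/articleslist/0/free.shtml\">Tianya talk</a></li>"
                + "<li><a href=\"http://www.tianya.cn/publicforum/articleslist/0/news.shtml\">Tianya</a></li>"
                + "</ul>";
        // Select only the link targets and ignore the rest of the markup.
        for (String href : select(page, "//li/a/@href")) {
            System.out.println(href);
        }
    }
}
```

Changing the expression to //li/a would return the link text instead of the URLs, which is the same kind of selection a Web-Harvest configuration file describes declaratively.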
Now let's take crawling all the boards of the Tianya forum as an example to introduce the use of Web-Harvest, especially its configuration file.
The Tianya forum map page is: http://www.tianya.cn/bbs/index.shtml
[Screenshot: part of the Tianya board list page]
Our goal is to capture information about all the boards, including the parent-child relationships between boards.
First, look at the source code of the map page and find the pattern:
<div class="Backrgoundcolor">
  <div class="Bankuai_list">
    <ul>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/free.shtml" id="Item Tianya">Tianya talk</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/worldlook.shtml" id="Item International Observation">International observation</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/news.shtml" id="Item Tianya Time and Space">Tianya</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/no06.shtml" id="Item Media Arena">Media Arena</a></li>
      ... // omitted
    </ul>
  </div>
  <div class="Clear"></div>
</div>
<div class="Nobackrgoundcolor">
  <div class="Bankuai_list">
    <ul>
      <li><a href="http://www.tianya.cn/techforum/articleslist/0/16.shtml" id="item Lotus nonsense">Lotus nonsense</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/no05.shtml" id="Item boiling Wine theory history">Cooking wine History</a></li>
      <li><a href="http://www.tianya.cn/publicforum/articleslist/0/culture.shtml" id="Item wordsmith">Wordsmith</a></li>
      ... // omitted
    </ul>
  </div>
  <div class="Clear"></div>
</div>
... // omitted
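The pattern in this source is that every board is an <a> inside an <li>, and the boards are grouped into "Bankuai_list" blocks. As a rough sketch of that observation in plain Java (regular expressions only, not the Web-Harvest configuration itself), the following pulls out the URL, id, and name of each board. The class name BoardLinkSketch, the Board type, and the group field are made up for illustration. The excerpt does not show where a parent board's name is stored, so the group index is only a placeholder for the parent, and matching the class name case-insensitively is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts board links from markup shaped like the excerpt above.
class BoardLinkSketch {

    // One extracted board; group is the 0-based index of the enclosing list block.
    static final class Board {
        final int group;
        final String url;
        final String id;
        final String name;

        Board(int group, String url, String id, String name) {
            this.group = group;
            this.url = url;
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return group + " | " + name + " | " + id + " | " + url;
        }
    }

    // Captures everything from a "Bankuai_list" div up to the end of its <ul>.
    private static final Pattern BLOCK = Pattern.compile(
            "<div\\s+class=\\s*\"bankuai_list\"\\s*>(.*?)</ul>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Captures href, id, and link text; tolerates stray spaces as seen in the excerpt.
    private static final Pattern LINK = Pattern.compile(
            "<li>\\s*<a\\s+href=\\s*\"([^\"]+)\"\\s+id=\\s*\"([^\"]*)\"\\s*>(.*?)<\\s*/a\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<Board> extract(String html) {
        List<Board> boards = new ArrayList<>();
        Matcher block = BLOCK.matcher(html);
        for (int group = 0; block.find(); group++) {
            Matcher link = LINK.matcher(block.group(1));
            while (link.find()) {
                boards.add(new Board(group, link.group(1), link.group(2), link.group(3).trim()));
            }
        }
        return boards;
    }

    public static void main(String[] args) {
        String html = "<div class=\"Backrgoundcolor\"><div class=\"Bankuai_list\"><ul>"
                + "<li><a href=\"http://www.tianya.cn/publicforum/articleslist/0/free.shtml\" id=\"Item Tianya\"> Tianya talk </a></li>"
                + "</ul></div></div>"
                + "<div class=\"Nobackrgoundcolor\"><div class=\"Bankuai_list\"><ul>"
                + "<li><a href=\"http://www.tianya.cn/techforum/articleslist/0/16.shtml\" id=\"item Lotus nonsense\"> Lotus nonsense </a></li>"
                + "</ul></div></div>";
        for (Board board : extract(html)) {
            System.out.println(board);
        }
    }
}
```

Regular expressions are brittle against markup changes, which is one reason to describe the extraction with XPath in a configuration file instead; the sketch is only meant to show which pieces of the page carry the data.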