Spiderman: another Java web spider/crawler
Spiderman is a web spider built on a micro-kernel plus plug-in architecture. Its goal is to offer a simple way to crawl complex target web pages and parse them into the business data you need.
Key Features
* Flexible and scalable micro-kernel + plug-in architecture. Spiderman provides up to 10 extension points that span the entire life cycle of a spider thread.
* With simple configuration, complex web content can be parsed into the business data you need without writing any code.
* Multithreading.
How to use?
- First, decide on your target site and target page (that is, the kind of page you want to extract data from, such as a news article page on NetEase News).
- Then, open the target page, analyze its HTML structure, and work out the XPath of the data you want. How to obtain that XPath is described further down.
- Finally, fill in the parameters in an XML configuration file and run Spiderman!
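To make the second step concrete, here is a minimal sketch of what "get the XPath of the data you want" amounts to. It uses only the JDK's built-in `javax.xml.xpath` API, not Spiderman's own classes, and the field names, the XPath expressions, and the sample page are all hypothetical. In Spiderman the expressions would go into the XML configuration file instead. The sample is well-formed XHTML; real-world HTML usually needs cleaning before an XML parser will accept it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFieldSketch {

    // Hypothetical mapping from a business field name to the XPath that locates it.
    static Map<String, String> fieldXPaths() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "//h1[@id='title']");
        fields.put("author", "//span[@class='author']");
        return fields;
    }

    // Parses the page once, then evaluates every XPath against the same DOM.
    static Map<String, String> extract(String xhtml, Map<String, String> fields) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> field : fields.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match,
                // or an empty string when nothing matches.
                result.put(field.getKey(), xpath.evaluate(field.getValue(), doc).trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page or evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<h1 id='title'>Sample headline</h1>"
                + "<span class='author'>Jane Doe</span>"
                + "</body></html>";
        // prints {title=Sample headline, author=Jane Doe}
        System.out.println(extract(page, fieldXPaths()));
    }
}
```

The point of the sketch is the shape of the work: one XPath per field you care about, evaluated against the parsed target page.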
Here is an article that walks through a crawl example: http://my.oschina.net/laiweiwei/blog/100866
How to get the XPath?
These steps cover the Chrome browser only. Other browsers are probably similar, but the plug-in will differ.
- First, download the Xpathonclick plug-in: https://chrome.google.com/webstore/search/xpathonclick
- Once installation is complete, open Chrome and you will see an "X Path" icon in the upper-right corner.
- Open your target page in the browser, click the icon in the upper-right corner, then click the page element whose XPath you want, such as a title.
- Now press F12 to open the JS console and scroll to the bottom to see the XPath string.
- Remember that this XPath is not guaranteed to work as-is. You may need to adjust it, so it is worth learning the XPath syntax.
- Where to learn the XPath syntax: http://www.w3school.com.cn/xpath/index.asp
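As an illustration of why a generated XPath may need adjusting, the sketch below compares a position-based absolute path with an attribute-based one on two versions of the same page. It again uses only the JDK's `javax.xml.xpath` API; the class name, the sample markup, and both expressions are made up for the example and are not output from the plug-in.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathRobustnessSketch {

    // Returns the string value of the first node matched by expr, or "" if none matches.
    static String eval(String xhtml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page or evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // The same page before and after the site adds a banner above the article.
        String before = "<html><body>"
                + "<div><h1 id='title'>Headline</h1></div>"
                + "</body></html>";
        String after = "<html><body>"
                + "<div>Ad banner</div>"
                + "<div><h1 id='title'>Headline</h1></div>"
                + "</body></html>";

        String absolute = "/html/body/div[1]/h1"; // depends on element positions
        String relative = "//h1[@id='title']";    // depends only on the id attribute

        System.out.println("[" + eval(before, absolute) + "]"); // prints [Headline]
        System.out.println("[" + eval(after, absolute) + "]");  // prints []
        System.out.println("[" + eval(after, relative) + "]");  // prints [Headline]
    }
}
```

The absolute path stops matching as soon as the layout shifts, while the attribute-based path keeps working, which is the kind of change the last tip has in mind.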