Spiderman: a Java web spider/crawler

Source: Internet
Author: User
Tags: xpath

Spiderman: another Java web spider/crawler

Spiderman is a web spider built on a micro-kernel plus plug-in architecture. Its goal is to offer a simple way to crawl complex target web pages and parse them into the business data you need.

Key Features
    • Flexible and extensible: a micro-kernel plus plug-in architecture. Spiderman provides up to 10 extension points, spanning the entire life cycle of a spider thread.
    • Simple configuration: complex web content can be parsed into the business data you need without writing any code.
    • Multithreading.
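
To make the micro-kernel plus plug-in idea concrete, here is a minimal Java sketch. It is not Spiderman's actual API: the names `Kernel`, `PageHook`, and `afterFetch` are hypothetical, and only a single extension point is modeled. The kernel itself does nothing except run whatever plug-ins have been registered, in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a micro-kernel with one plug-in extension point.
// None of these names are taken from Spiderman itself.
class MicroKernelSketch {

    /** One extension point: invoked after a page has been fetched. */
    interface PageHook {
        String afterFetch(String url, String html);
    }

    /** The kernel only knows how to run the registered hooks in order. */
    static class Kernel {
        private final List<PageHook> hooks = new ArrayList<>();

        void register(PageHook hook) {
            hooks.add(hook);
        }

        String process(String url, String html) {
            String current = html;
            for (PageHook hook : hooks) {
                current = hook.afterFetch(url, current);
            }
            return current;
        }
    }

    /** Registers two small plug-ins and runs them on an in-memory page. */
    static String demo() {
        Kernel kernel = new Kernel();
        kernel.register((url, html) -> html.trim());                  // plug-in 1: strip outer whitespace
        kernel.register((url, html) -> html.replaceAll("\\s+", " ")); // plug-in 2: collapse whitespace runs
        return kernel.process("http://example.com/news/1", "  <h1>Hello   world</h1>  ");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

New behavior is added by registering another hook rather than by changing the kernel, which is the property the "extension points" feature describes.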
How to use it
    • First, decide on your target site and target page (that is, the kind of page you want to extract data from, such as a news article page on NetEase News).
    • Then open the target page, analyze its HTML structure, and work out the XPath for each piece of data you want. How to obtain an XPath is described below.
    • Finally, fill in these parameters in an XML configuration file and run Spiderman!
A worked crawl example is described in this article: http://my.oschina.net/laiweiwei/blog/100866
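
The three steps above can be sketched in plain Java using only the JDK's built-in XPath support. This is an illustration of the idea, not Spiderman's configuration format or API: the `Target` class, its `field` method, the URL pattern, and the sample page are all hypothetical, and the page is assumed to be well-formed XHTML (real-world HTML usually needs cleaning first).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: a target page type plus one XPath per business field.
class TargetConfigSketch {

    /** Stand-in for one target entry in a configuration file (hypothetical). */
    static class Target {
        final Pattern urlPattern;
        final Map<String, String> fieldXPaths = new LinkedHashMap<>();

        Target(String urlRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
        }

        Target field(String name, String xpath) {
            fieldXPaths.put(name, xpath);
            return this;
        }
    }

    /** Returns field name -> extracted text, or an empty map if the URL is not a target page. */
    static Map<String, String> extract(Target target, String url, String xhtml) {
        Map<String, String> data = new LinkedHashMap<>();
        if (!target.urlPattern.matcher(url).matches()) {
            return data;
        }
        try {
            // Parse once, then evaluate every configured XPath against the same DOM.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> field : target.fieldXPaths.entrySet()) {
                data.put(field.getKey(), xpath.evaluate(field.getValue(), doc).trim());
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page or evaluate XPath", e);
        }
        return data;
    }

    static Map<String, String> demo() {
        // Step 1: which pages are targets (hypothetical URL pattern).
        Target news = new Target("http://news\\.example\\.com/\\d+\\.html")
                // Step 2: one XPath per piece of data.
                .field("title", "//h1[@id='title']")
                .field("body", "//div[@class='content']/p[1]");
        String page = "<html><body><h1 id=\"title\">Sample headline</h1>"
                + "<div class=\"content\"><p>First paragraph.</p></div></body></html>";
        // Step 3: run the extraction.
        return extract(news, "http://news.example.com/12345.html", page);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In Spiderman the equivalent information lives in the XML configuration file rather than in code; the sketch only shows what that configuration has to express.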

Tips for getting an XPath

These steps apply to the Chrome browser only. Other browsers probably work in a similar way, but with a different plug-in.

    • First, download the Xpathonclick plug-in: https://chrome.google.com/webstore/search/xpathonclick
    • Once installation is complete, open Chrome and you will see an "X Path" icon in the upper-right corner.
    • Open your target page in the browser, click the icon in the upper-right corner, then click the page element whose XPath you want, such as a title.
    • Now press F12 to open the JS console and scroll to the bottom to see the XPath string.
    • Remember that this XPath is not guaranteed to be usable as-is. You may need to adjust it, so it is worth learning XPath syntax.
    • XPath syntax tutorial: http://www.w3school.com.cn/xpath/index.asp
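
As an illustration of why a generated XPath may need adjusting, the following Java sketch compares a position-based absolute path with an attribute-based one on two versions of a page. The pages and both expressions are made up for this example, and the pages are assumed to be well-formed XHTML so the JDK's XPath engine can parse them directly.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch: a positional XPath breaks when the layout shifts,
// while an attribute-based XPath keeps matching.
class XPathRobustnessSketch {

    /** Evaluates an XPath against a well-formed XHTML string; no match yields "". */
    static String eval(String expression, String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath or unparsable page", e);
        }
    }

    static final String ORIGINAL =
            "<html><body><div><h1 class=\"title\">Headline</h1></div></body></html>";

    // Same page after the site inserts an extra block above the article.
    static final String CHANGED =
            "<html><body><div>Ad</div><div><h1 class=\"title\">Headline</h1></div></body></html>";

    static final String ABSOLUTE = "/html/body/div[1]/h1";    // the kind of path a tool might generate
    static final String BY_ATTRIBUTE = "//h1[@class='title']"; // hand-adjusted version

    public static void main(String[] args) {
        System.out.println("absolute, original: [" + eval(ABSOLUTE, ORIGINAL) + "]");
        System.out.println("absolute, changed:  [" + eval(ABSOLUTE, CHANGED) + "]");
        System.out.println("by attr,  original: [" + eval(BY_ATTRIBUTE, ORIGINAL) + "]");
        System.out.println("by attr,  changed:  [" + eval(BY_ATTRIBUTE, CHANGED) + "]");
    }
}
```

On the changed page the absolute path selects the first `div`, which no longer contains an `h1`, so it returns an empty string, while the attribute-based path still finds the headline.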

Project repository: Self-wind/spiderman. A powerful Java crawler supporting list paging, detail-page paging, and Ajax, with a highly extensible micro-kernel and flexible configuration.
