Learning Web Crawlers (1)
The following is a summary of resources that I have found useful. All of them were collected from the Internet.
Programming language: Java
Web crawler: Spiderman
Spiderman is an open-source Java tool for Web data extraction. It collects specified Web pages and extracts useful data from them.
Spiderman mainly relies on XPath and regular expressions to extract the actual data.
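To give a feel for how XPath and regular expressions work together when extracting data, here is a minimal sketch. It uses only the standard JDK APIs (javax.xml.xpath and java.util.regex), not Spiderman's own API. The sample page, the class name XPathRegexSketch, and the helper methods selectAll and firstGroup are all made up for illustration. It also assumes the page is already well-formed XML, whereas real HTML usually has to be cleaned first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRegexSketch {

    // Hypothetical, well-formed sample page. A real crawler would download the
    // page and clean the HTML into well-formed XML before this step.
    static final String PAGE =
        "<html><body>"
      + "<div class=\"question\"><a href=\"/question/101\">How do I parse HTML?</a></div>"
      + "<div class=\"question\"><a href=\"/question/202\">What is XPath?</a></div>"
      + "</body></html>";

    /** Evaluates an XPath expression and returns the text of every matched node. */
    static List<String> selectAll(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    /** Returns the first capture group of the regex, or null if nothing matches. */
    static String firstGroup(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        // XPath locates the nodes of interest in the page structure...
        List<String> titles = selectAll(PAGE, "//div[@class='question']/a/text()");
        List<String> links = selectAll(PAGE, "//div[@class='question']/a/@href");
        for (int i = 0; i < titles.size(); i++) {
            // ...and a regular expression pulls a field out of the matched string.
            String id = firstGroup(links.get(i), "/question/(\\d+)");
            System.out.println(id + " -> " + titles.get(i));
        }
    }
}
```

XPath suits selecting by page structure (here, every link inside a div with class "question"), while the regular expression picks out a fragment of the selected string (here, the numeric ID in the link).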
Spiderman on Open Source China (OSChina), including documentation and downloads: http://www.oschina.net/p/spiderman
Spiderman Java crawler example: http://my.oschina.net/laiweiwei/blog/99937
[Latest update: supports pagination of channels and articles] Capturing the Q&A data of OSC to demonstrate the capabilities of a vertical crawler: http://my.oschina.net/laiweiwei/blog/100866
Summary of XPath usage in Java, with sample code: http://www.open-open.com/lib/view/open1397717612656.html
W3School XPath tutorial: http://www.w3school.com.cn/xpath/index.asp