1 Documentation:

URL: http://code.google.com/p/crawler4j/

Crawler4j is an open-source Java crawler that provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes!

Note: version 3.0 is deprecated and should not be used. Please use the latest version, 3.3.

You need to write a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded pages. A sample implementation is given at the end of this post.

2 Simple instructions:

1. Inherit from the WebCrawler class and override two methods:
   shouldVisit: decides whether a given URL should be crawled; return true to crawl it, false to skip it.
   visit: called after the content of a URL has been downloaded successfully; from it you can easily get the page's URL, text, links, HTML, and unique id.
2. Write a controller class that specifies the crawl seeds, the folder in which intermediate crawl data should be stored, and the number of concurrent threads (a controller sketch also appears at the end of this post).

Code: see SVN: http://192.9.117.75/svncore/crawler/platform/Java/opensourcecrawler

3 Summary:

I personally think this crawler is simple and easy to use, and it is very convenient for collection on a single machine. Its characteristics:

1. Multi-threaded collection.
2. A built-in URL filtering mechanism that uses Berkeley DB to filter (deduplicate) URLs.
3. It can be extended to support structured extraction of web page fields, so it can be used for vertical collection.
4. It does not support dynamic content, such as the Ajax parts of a web page.
5. It does not support distributed collection; you can consider using it as the collection component of each client within a distributed crawler.
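For reference, here is a minimal sketch of such a crawler class, written against the crawler4j 3.x API; the URL filter pattern and the www.ics.uci.edu domain restriction are illustrative placeholders that you would replace with your own rules.

import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip URLs that point to static/binary resources.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|zip|gz))$");

    // Return true for URLs that should be crawled, false otherwise.
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    // Called after a page has been downloaded and parsed successfully.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        int docid = page.getWebURL().getDocid(); // the unique id mentioned above

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();                // extracted plain text
            String html = htmlParseData.getHtml();                // raw HTML
            List<WebURL> links = htmlParseData.getOutgoingUrls(); // outgoing links

            System.out.println("Visited: " + url + " (docid " + docid + "), "
                    + links.size() + " outgoing links");
        }
    }
}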
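And a minimal controller sketch, again following the crawler4j 3.x API; the storage folder, seed URL, and thread count are arbitrary example values.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root"; // where intermediate crawl data is stored
        int numberOfCrawlers = 7;                       // number of concurrent crawler threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URLs: the pages where the crawl starts.
        controller.addSeed("http://www.ics.uci.edu/");

        // Blocks until the crawl is finished.
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Running this starts the given number of crawler threads that share a frontier stored under crawlStorageFolder, which is where the Berkeley DB-based URL deduplication mentioned in the summary lives.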