The author intends to use crawler4j for some single-client collection applications.

Source: Internet
Author: User
1 Documentation

URL: http://code.google.com/p/crawler4j/

Crawler4j is an open-source Java crawler that provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes.

Note: version 3.0 is deprecated and should not be used. Please use the latest version, 3.3.

2 Simple instructions

You need to write a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles each downloaded page. Two methods are overridden:

shouldVisit: determines whether a given URL should be collected; return true to collect it, false to skip it.
visit: called after a URL's content has been downloaded successfully; inside it you can easily obtain the page's URL, text, links, HTML, and unique document ID.

You should also implement a controller class that specifies the crawl seeds, the folder in which intermediate crawl data is stored, and the number of concurrent threads.

Code: see SVN: http://192.9.117.75/svncore/crawler/platform/Java/opensourcecrawler

3 Summary

I personally think this crawler is simple and easy to use, and very convenient for single-client collection:

1 Multi-threaded collection.
2 A built-in URL filtering mechanism, which uses Berkeley DB to filter URLs.
3 It can be extended to support structured extraction of web-page fields, so it can be used for vertical collection.
4 Dynamic web-page content is not supported, such as the Ajax parts of a page.
5 Distributed collection is not supported, but it can be used as the client-side collection component of a distributed crawler.
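The crawler class and controller described above can be sketched as follows. This is a minimal sketch based on the crawler4j 3.x API; the filter pattern, the site restriction (http://www.example.com/), the storage folder, the thread count, and the shouldFollow helper are illustrative assumptions, not values from the original article.

```java
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip common binary/static resources; this pattern is an illustrative assumption.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|zip|gz))$");

    // Hypothetical site restriction for this sketch.
    private static final String DOMAIN = "http://www.example.com/";

    // Hypothetical helper so the filtering rule can be tested in isolation.
    static boolean shouldFollow(String href) {
        String lower = href.toLowerCase();
        return !FILTERS.matcher(lower).matches() && lower.startsWith(DOMAIN);
    }

    // Decides whether a given URL is collected: true = collect, false = skip.
    @Override
    public boolean shouldVisit(WebURL url) {
        return shouldFollow(url.getURL());
    }

    // Called after a URL's content has been downloaded successfully.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();  // the page's URL
        int docid = page.getWebURL().getDocid(); // the unique document ID
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            String text = htmlData.getText();                // plain text
            String html = htmlData.getHtml();                // raw HTML
            List<WebURL> links = htmlData.getOutgoingUrls(); // outgoing links
            System.out.println("Visited: " + url + " (docid " + docid
                    + ", " + links.size() + " links)");
        }
    }

    // Controller: seed URLs, intermediate-data folder, and number of threads.
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root"); // intermediate crawl data

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed(DOMAIN);           // seed URL
        controller.start(MyCrawler.class, 5); // 5 concurrent crawler threads
    }
}
```

The Berkeley DB store mentioned in the summary sits behind this API: crawler4j records visited URLs in the crawl storage folder, which is why shouldVisit only needs to express the filtering policy and not de-duplication.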
