1 Documentation:

URL: http://code.google.com/p/crawler4j/

Crawler4j is an open-source Java crawler that provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes!

Note: version 3.0 is deprecated and should not be used. Please use the latest version, 3.3.

You need to write a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded pages. A sample implementation is given at the end of this post.

2 Simple instructions:

1. Inherit from the WebCrawler class and override two methods:
   shouldVisit: decides whether a given URL should be crawled; return true to crawl it, false to skip it.
   visit: called after the content of a URL has been downloaded successfully; from it you can easily get the page's URL, text, links, HTML, and unique id.
2. Write a controller class that specifies the crawl seeds, the folder in which intermediate crawl data should be stored, and the number of concurrent threads (a controller sketch also appears at the end of this post).

Code: see SVN: http://192.9.117.75/svncore/crawler/platform/Java/opensourcecrawler

3 Summary:

I personally think this crawler is simple and easy to use, and it is very convenient for collection on a single machine. Its characteristics:

1. Multi-threaded collection.
2. A built-in URL filtering mechanism that uses Berkeley DB to filter (deduplicate) URLs.
3. It can be extended to support structured extraction of web page fields, so it can be used for vertical collection.
4. It does not support dynamic content, such as the Ajax parts of a web page.
5. It does not support distributed collection; you can consider using it as the collection component of each client within a distributed crawler.
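For reference, here is a minimal sketch of such a crawler class, written against the crawler4j 3.x API; the URL filter pattern and the www.ics.uci.edu domain restriction are illustrative placeholders that you would replace with your own rules.

import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip URLs that point to static/binary resources.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|zip|gz))$");

    // Return true for URLs that should be crawled, false otherwise.
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    // Called after a page has been downloaded and parsed successfully.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        int docid = page.getWebURL().getDocid(); // the unique id mentioned above

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();                // extracted plain text
            String html = htmlParseData.getHtml();                // raw HTML
            List<WebURL> links = htmlParseData.getOutgoingUrls(); // outgoing links

            System.out.println("Visited: " + url + " (docid " + docid + "), "
                    + links.size() + " outgoing links");
        }
    }
}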
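And a minimal controller sketch, again following the crawler4j 3.x API; the storage folder, seed URL, and thread count are arbitrary example values.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root"; // where intermediate crawl data is stored
        int numberOfCrawlers = 7;                       // number of concurrent crawler threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URLs: the pages where the crawl starts.
        controller.addSeed("http://www.ics.uci.edu/");

        // Blocks until the crawl is finished.
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Running this starts the given number of crawler threads that share a frontier stored under crawlStorageFolder, which is where the Berkeley DB-based URL deduplication mentioned in the summary lives.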