How can you work with big data if you have no data? Here are 33 open source crawlers to choose from.
A crawler, or web crawler, is a program that automatically fetches web content. It is an important part of a search engine, so search engine optimization is to a large extent optimization for the crawler.
A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is one of its key components. A traditional crawler starts from the URLs of one or more seed pages, collects the URLs found on those pages, and, as it crawls, keeps extracting new URLs from the current page and adding them to a queue until the system's stop condition is met. The workflow of a focused crawler is more complex: it uses a page analysis algorithm to filter out links that are irrelevant to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats the process until a stop condition is reached. In addition, the system stores every crawled page, then analyzes, filters, and indexes it for later query and retrieval. For a focused crawler, the results of this analysis may also provide feedback that guides subsequent crawling.
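The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `crawl`, `fetch_links`, `is_relevant`, `max_pages`, and the `FAKE_WEB` link table are hypothetical names invented here, an in-memory dictionary stands in for real HTTP fetching and HTML parsing, and a simple page limit plays the role of the stop condition. Passing an `is_relevant` function turns the traditional crawler into a very rough focused crawler.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    is_relevant: Optional[Callable[[str], bool]] = None,
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque(dict.fromkeys(seeds))  # URLs waiting to be crawled
    seen = set(queue)                    # URLs already queued once
    visited: List[str] = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            # A focused crawler drops links judged irrelevant to the topic.
            if is_relevant is not None and not is_relevant(link):
                continue
            seen.add(link)
            queue.append(link)
    return visited


# A tiny in-memory "web" (hypothetical data) replaces real page downloads.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/news", "http://a.example/sports"],
    "http://a.example/news": ["http://a.example/", "http://a.example/news/1"],
    "http://a.example/sports": ["http://a.example/sports/1"],
}


def fake_fetch_links(url: str) -> List[str]:
    """Return the outgoing links of a page in the fake web."""
    return FAKE_WEB.get(url, [])
```

Calling `crawl(["http://a.example/"], fake_fetch_links)` visits all five pages in breadth-first order, while adding `is_relevant=lambda u: "news" in u` keeps only the seed and the news pages.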
There are hundreds of crawler programs in the world. This article surveys the better-known and more common open source ones, grouped by development language. Search engines also have crawlers, but this summary covers only crawler software, not large, complex search engines, because many readers just want to crawl data rather than run a search engine.
Java crawler
1, Arachnid
Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can parse an input stream containing HTML content. By subclassing Arachnid, you can develop a simple web spider and add a few lines of code that are called after each page of a website is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
Features: Micro crawler framework with a small HTML parser
License: GPL
2, Crawlzilla
Crawlzilla is free software that helps you easily build a search engine. With it, you no longer have to rely on a commercial company's search engine or worry about indexing your company's internal site data.
It is built around the Nutch project, integrates additional related packages, and adds an installation and management UI, which makes it easier for users to get started.
In addition to crawling basic HTML, Crawlzilla can analyze files found on web pages in many formats, such as doc, pdf, ppt, ooo, and rss, so that your search engine is not just a web search engine but a complete index of the site's data.
It supports Chinese word segmentation, which makes your searches more accurate.
Crawlzilla's most important feature and goal is to give users a search platform that is convenient and easy to install.
License Agreement: Apache License 2
Development language: Java, JavaScript, Shell
Operating system: Linux
Project home: https://github.com/shunfa/crawlzilla
Download Address: http://sourceforge.net/projects/crawlzilla/
Features: Easy to install, supports Chinese word segmentation
3, Ex-crawler
Ex-crawler is a web crawler written in Java. The project has two parts: a daemon and a flexible, configurable web crawler. It uses a database to store web page information.
License Agreement: GPLv3
Development language: Java
Operating systems: cross-platform
Features: Runs as a daemon, uses a database to store web page information
4, Heritrix
Heritrix is an open source web crawler written in Java that users can use to fetch the resources they want from the web. Its best feature is its extensibility, which lets users implement their own crawl logic.
Heritrix has a modular design. The modules are coordinated by a controller class (the CrawlController class), which is the core of the whole system.
Code hosting: https://github.com/internetarchive/heritrix3
License Agreement: Apache
Development language: Java
Operating systems: cross-platform
Features: Strictly honors robots.txt exclusion directives and META robots tags
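To make the robots exclusion idea concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` module. It is not Heritrix code and does not show how Heritrix implements the check. The robots.txt content, the `demo-bot` user agent in the usage note, and the helper names `build_robot_rules` and `allowed` are assumptions made up for illustration; a real crawler would download robots.txt from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed(rules, "demo-bot", "http://example.com/private/data.html")` is rejected, while a URL outside `/private/` is permitted. A polite crawler runs this check before adding a URL to its queue.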
5, Heydr
Heydr is a lightweight, open source, multi-threaded vertical search crawler framework written in Java and released under the GNU GPL v3 license.
Users can build their own vertical resource crawlers with Heydr to prepare data for vertical search engines.
License Agreement: GPLv3
Development language: Java
Operating systems: cross-platform
Features: Lightweight, open source, multi-threaded vertical search crawler framework
6, Itsucks
Itsucks is an open source Java web spider (web bot, crawler) project. Download rules can be defined with download templates and regular expressions. It provides a Swing GUI.
Features: Provides a Swing GUI
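As a rough idea of what regular-expression download rules look like, the sketch below filters a list of URLs with allow and deny patterns. It does not reproduce the Itsucks rule format, which is not described here. The patterns, the `ALLOW_PATTERNS` and `DENY_PATTERNS` lists, and the `select_downloads` function are hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, List

# Hypothetical rules: accept image files, but never from a /thumbs/ directory.
ALLOW_PATTERNS = [re.compile(r"\.(?:jpe?g|png|gif)$", re.IGNORECASE)]
DENY_PATTERNS = [re.compile(r"/thumbs/")]


def select_downloads(urls: Iterable[str]) -> List[str]:
    """Keep URLs that match an allow pattern and no deny pattern."""
    selected = []
    for url in urls:
        if any(p.search(url) for p in DENY_PATTERNS):
            continue  # deny rules take priority over allow rules
        if any(p.search(url) for p in ALLOW_PATTERNS):
            selected.append(url)
    return selected
```

Letting deny rules win over allow rules is a design choice of this sketch. It keeps the rule set easy to reason about, because a single deny pattern can carve an exception out of a broad allow pattern.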
7, Jcrawl
Jcrawl is a small, high-performance web crawler that can fetch various types of files from web pages, based on user-defined symbols such as email and QQ.
License Agreement: Apache
Development language: Java
Operating systems: cross-platform
Features: Lightweight, excellent performance, can fetch various types of files from web pages
8, Jspider
Jspider is a web spider implemented in Java. It is invoked in the following format:
Jspider [URL] [ConfigName]
The URL must include the protocol name, such as http://, or an error is reported. If ConfigName is omitted, the default configuration is used.
Jspider's behavior is controlled by configuration files: which plug-ins to use, how results are stored, and so on are set in the Conf[configname] directory. The default configuration is minimal and of limited use. However, Jspider is very easy to extend, and you can build powerful web crawling and data analysis tools on top of it. To do so, you need a deep understanding of how Jspider works, and then you develop plug-ins for your needs and write a configuration file.
License Agreement: LGPL
Development language: Java
Operating systems: cross-platform
Features: Powerful, easy to extend
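The requirement that a start URL carry its protocol name is common to most crawlers. The following sketch shows one way to check it up front using Python's standard `urllib.parse`. It is not part of Jspider. The function name `require_protocol` and the decision to accept only http and https are assumptions made for this example.

```python
from urllib.parse import urlparse


def require_protocol(url: str) -> str:
    """Return url unchanged, or raise ValueError if it lacks http:// or https://."""
    parsed = urlparse(url)
    # Assumption of this sketch: only http and https start URLs are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"URL must start with a protocol such as http://: {url!r}")
    return url
```

For instance, `require_protocol("www.example.com")` raises `ValueError`, whereas `require_protocol("http://www.example.com")` returns the URL unchanged.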
9, Leopdo
A web search engine and crawler written in Java, including full-text search, categorized vertical search, and a word segmentation system.
License Agreement: Apache
Development language: Java
Operating systems: cross-platform
Features: Includes full-text and categorized vertical search and a word segmentation system
10, Metaseeker
Metaseeker is a complete suite for web content crawling, formatting, data integration, storage management, and search.
There are many ways to implement a web crawler. Classified by where they are deployed, crawlers fall into two groups:
1, server side: Usually a multi-threaded program that downloads multiple target HTML pages at once. It can be written in PHP, Java, Python (currently very popular), and so on, and can be made fast; crawlers for general-purpose search engines are usually built this way. However, if the target site dislikes crawlers, it is likely to block your IP, and a server's IP is not easy to change. The bandwidth consumed is also quite expensive. Beautiful Soup is worth a look.
2, client side: Generally used to implement topic-specific or focused crawlers. Building a comprehensive search engine this way is unlikely to succeed, but vertical search, price comparison services, and recommendation engines are relatively easy. This kind of crawler does not fetch every page; it fetches only the pages you care about, and only the content you care about on those pages, for example yellow pages information, commodity prices, or competitors' advertising information (search for Spyfu, it is very interesting). Crawlers of this type can be deployed in large numbers and can be very aggressive, and they are hard for the target site to block.
The web crawler in Metaseeker belongs to the latter.
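The server-side approach listed above, a multi-threaded program that downloads several pages at once, can be sketched with Python's standard `concurrent.futures` thread pool. This is a simplified illustration, not Metaseeker code. The names `download_all`, `fetch`, and `workers` are invented here, and `fake_fetch` is a hypothetical stand-in that returns placeholder content instead of performing a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    workers: int = 4,
) -> Dict[str, str]:
    """Fetch every URL on a thread pool and map each URL to its content."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        pages = list(pool.map(fetch, url_list))
    return dict(zip(url_list, pages))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; returns hypothetical page content."""
    return f"<html>{url}</html>"
```

In a real crawler, `fetch` would perform the HTTP request, and the pool size would be tuned so the crawler does not overload the target site or get its IP blocked.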
The Metaseeker toolkit builds on the Mozilla platform, so it can extract anything that Firefox can see.
The Metaseeker toolkit is free to use. Download address: www.gooseeker.com/cn/node/download/front
Features: Web crawling, information extraction, data extraction toolkit, easy to operate
11, Playfish
Playfish is a web crawler written in Java that combines several open source Java components and uses XML configuration files to make the crawler highly customizable and extensible.
The open source JAR packages it uses include httpclient (content fetching), dom4j (configuration file parsing), and jericho (HTML parsing). They are already in the lib directory of the WAR package.
The project is still quite immature, but its functionality is basically complete. Users need to be familiar with XML and regular expressions. The tool can currently crawl all kinds of forums, post bars, and CMS systems. Articles in forums and blogs such as Discuz! and phpBB can be crawled easily with it. Crawl definitions are written entirely in XML, which suits Java developers.
How to use it: 1. Download the .war package and import it into Eclipse. 2. Use the Wcc.sql file under Webcontent/sql to create the sample database. 3. Edit DbConfig.txt in Wcc.core under the src package, setting the username and password to your own MySQL username and password. 4. Run Systemcore. It runs in the console: with no arguments it executes the default example.xml configuration file, and with an argument it uses that argument as the configuration file name.
The system ships with three examples: Baidu.xml crawls Baidu Knows (Baidu Zhidao), Example.xml crawls my JavaEye blog, and bbs.xml crawls the content of a forum that uses Discuz.