83 open-source web crawlers

Last Update:2014-07-26 Source: Internet

Author: User

1. http://www.oschina.net/project/tag/64/spider?lang=0&os=0&sort=view&

Search engine: Nutch

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. Although web search is a basic requirement for navigating the Internet, the number of existing web search engines is declining, and this is likely to evolve into a single company monopolizing almost all web...

Latest update: [One blog post per day] A study of the regular-expression URL filtering mechanism in Nutch was published 20 days ago.

Web crawler: Grub Next Generation

Grub Next Generation is a distributed web crawler system consisting of clients and servers that can be used to maintain web page indexes.

Latest update: Grub Next Generation 1.0 was released three years ago.

Website data collection software: Network Miner Collector (formerly Soukey Picking)

Soukey Picking is open-source website data collection software built on the .NET platform. It is also the only open-source product in the website data collection category. Being open source does not limit its functionality, which is in some respects richer than that of commercial software. Soukey Picking currently provides the following main functions: 1. Multi-task, multi-line...

Web crawler and search engine in PHP: PHPDig

PHPDig is a web crawler and search engine developed in PHP. It builds a vocabulary by indexing dynamic and static pages. When a query is submitted, it displays a results page containing the keywords, ordered by its ranking rules. PHPDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. PHPDig is suited to more...

Website content collector: Snoopy

Snoopy is a powerful website content collector (crawler). It provides functions such as fetching web page content and submitting forms.

Java web crawler: JSpider

JSpider is a web spider implemented in Java. It is invoked as follows: jspider [url] [configname]. The URL must include the protocol name, such as http://; otherwise, an error is reported. If configname is omitted, the default configuration is used. JSpider's behavior is controlled by its configuration files, such as which plug-ins are used, how results are stored...
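The argument rules described above (a required URL that must carry a protocol, and an optional configuration name) can be sketched in Python. This is only an illustration, not JSpider's own code: the function name parse_jspider_args and the fallback name "default" are assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical name for the fallback configuration; the source only says
# that "the default configuration" is used when configname is omitted.
DEFAULT_CONFIG = "default"


def parse_jspider_args(argv):
    """Validate arguments of the form: [url] [configname]."""
    if not 1 <= len(argv) <= 2:
        raise ValueError("usage: jspider [url] [configname]")
    url = argv[0]
    parsed = urlparse(url)
    # Reject URLs without an explicit protocol, e.g. "www.example.com".
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"URL must include a protocol such as http://: {url!r}")
    config = argv[1] if len(argv) == 2 else DEFAULT_CONFIG
    return url, config
```

Calling parse_jspider_args(["http://example.com"]) falls back to the assumed default configuration name, while a URL without a protocol raises ValueError.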

Web crawler: NWebCrawler

NWebCrawler is an open-source C# web crawler.

Web crawler: Heritrix

Heritrix is an open-source, extensible web crawler project. Users can use it to capture the resources they want from the Internet. Heritrix is designed to strictly respect the exclusion directives in robots.txt files and META robots tags. Its outstanding advantage is its extensibility, which lets users implement their own crawling logic easily. Heritrix is a crawler framework that organizes contents...
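To illustrate what respecting robots.txt exclusion directives means in practice, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not Heritrix code (Heritrix is written in Java); the robots.txt content, the user agent name "example-bot", and the URLs are all made up for the example, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "example-bot") -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "http://example.com/index.html"))
print(allowed(parser, "http://example.com/private/data.html"))
```

A polite crawler would run a check like allowed() before every request and skip any URL for which it returns False.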

Web crawler framework: Scrapy

Scrapy is an asynchronous processing framework based on Twisted, and a crawler framework implemented purely in Python. You only need to write a few custom modules to implement a crawler, which makes it very convenient to capture web page content and images.

Latest update: "Use Scrapy to create a website" was published six months ago.

Vertical crawler: WebMagic

WebMagic is a crawler framework that requires no configuration and is easy to extend. It provides simple, flexible APIs, so a crawler can be implemented with only a small amount of code. The following code crawls the oschina blog: Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/#/blog/*").t...

Latest update: WebMagic 0.5.2, a Java crawler framework, was released one month ago.

OpenWebSpider

OpenWebSpider is an open-source multi-threaded web spider (robot, crawler) and search engine with many interesting features.

Java multi-threaded web crawler: crawler4j

crawler4j is an open-source Java class library that provides a simple interface for capturing web pages. It can be used to build a multi-threaded web crawler. Sample code: import java.util.ArrayList; import java.util.regex.Pattern; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.cr...

Web page capture/information extraction software: MetaSeeker

MetaSeeker (GooSeeker) V4.11.2, a software toolkit for web page capture, information extraction, and data extraction, has been officially released. The online version is free to download and use, and the source code can be read. Since its launch, it has been very popular in its main application field, vertical search (also known as professional search). High-speed, massive, and precise crawling is the topic web crawler DataScrap...

Java web spider/web crawler: Spiderman

Spiderman is another Java web spider/crawler. It is built on a microkernel plus plug-in architecture, and its goal is to capture complex target web pages and parse them into the business data you need in a simple way. Main features: * flexible and highly extensible microkernel plus plug-in architecture; Spiderman provides up to...

Web crawler: Methanol

Methanol is a modular, customizable web crawler. Its main advantage is its high speed.

Web crawler/web spider: Larbin

Larbin is an open-source web crawler/web spider developed independently by the young French developer Sébastien Ailleret. Larbin aims to follow the URLs found in pages to extend its crawl, ultimately providing a broad data source for search engines. Larbin is only a crawler; that is, it only fetches web pages. As for how to parse them, that is up to the user...
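Following the URLs found in a page starts with extracting its links. The Python sketch below shows one simple way to do this with the standard library; it is not Larbin's implementation (Larbin is a C++ program), and the class name LinkExtractor and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list:
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


page = '<a href="/about">About</a> <a href="next.html">Next</a>'
print(extract_links("http://example.com/dir/page.html", page))
```

A crawler would add each extracted URL to its queue of pages to fetch, which is how the crawl expands from a starting page.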

Crawler: Sinawler

The first crawler program in China targeting Weibo data. Its original name was "Sina Weibo crawler". After logging in, you can specify a user as the starting point and use that user's followees and fans as leads to collect basic user information, Weibo posts, and comments along those connections. The data obtained by this application can support scientific research and development related to Sina Weibo, but it may not be used for commercial...

[Free] Dead link checker: Xenu

Xenu Link Sleuth may be the smallest yet most powerful tool you have seen for checking websites for dead links. You can open a local web page file to check its links, or enter any URL to check it. It lists the live and dead links of a website separately and analyzes redirected links clearly. It supports multiple threads and can check the link...

Web-Harvest

Web-Harvest is an open-source Java web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.

Web crawling tool: Playfish

Playfish is a web crawling tool built with Java that integrates multiple open-source Java components and uses XML configuration files to make web page capture highly customizable and extensible. The open-source JAR packages it uses include HttpClient (content reading), dom4j (configuration file parsing), and Jericho (HTML parsing), which are already in the lib directory of the WAR package. This...

Easy-to-use Network Data Collection System

The system is built with the mainstream PHP programming language and a MySQL database. You can write custom collection rules, or download shared rules from my website, to collect the data you need from individual websites or groups of websites. You can also share your collection rules with everyone. Use the data explorer and editor to edit the collected data. All the code of this system is completely open source,...

Web crawler: YaCy

YaCy is a P2P-based distributed web search engine. It is also an HTTP caching proxy server. The project is a new approach to building a P2P web index network. It can search your own index or the global index, crawl your own web pages, or start distributed crawling.

Latest update: YaCy 1.4, a distributed web search engine, was released one year ago.

Web crawler framework: Smart and Simple Web Crawler

Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and provides two traversal modes: maximum iteration and maximum depth. You can set filters to limit which links are crawled. By default, three filters are provided: serverfilter, beginningpathfilter, and regulare...
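The two traversal limits and the link filters can be sketched as a small breadth-first crawl in Python. This is an illustration under stated assumptions, not the framework's Java API: the in-memory LINKS graph stands in for real pages, "maximum iteration" is interpreted here as a cap on the number of pages visited, and the functions server_filter and beginning_path_filter are hypothetical analogues of the filters named above.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: each URL maps to the links found on that page.
LINKS = {
    "http://a.example/": ["http://a.example/docs/", "http://b.example/"],
    "http://a.example/docs/": ["http://a.example/docs/intro", "http://a.example/blog/"],
    "http://a.example/docs/intro": ["http://a.example/docs/deep"],
    "http://a.example/blog/": [],
    "http://a.example/docs/deep": [],
    "http://b.example/": [],
}


def server_filter(start_url):
    """Accept only links on the same host as the start URL."""
    host = urlparse(start_url).netloc
    return lambda url: urlparse(url).netloc == host


def beginning_path_filter(prefix):
    """Accept only links whose path starts with the given prefix."""
    return lambda url: urlparse(url).path.startswith(prefix)


def crawl(start_url, max_depth, link_filter, max_pages=None):
    """Breadth-first crawl limited by depth and, optionally, page count."""
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen and link_filter(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


start = "http://a.example/"
print(crawl(start, max_depth=2, link_filter=server_filter(start)))
```

With the server filter, links to b.example are never queued; with beginning_path_filter("/docs/"), only pages under /docs/ are followed from the start page.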

Web crawler: Crawlzilla

Crawlzilla is free software that helps you build a search engine. With it, you do not have to rely on the search engines of commercial companies, and you do not need to worry about your company's internal website data being indexed. The project uses Nutch as its core, integrates additional related packages, and provides an installation and management UI designed to make it easier for users to get started. In addition to crawling basic...

Simple HTTP crawler: HttpBot

HttpBot is a simple wrapper around the java.net.HttpURLConnection class. It can easily fetch web page content, manage sessions automatically, and handle 301 redirects automatically. Although it is not as powerful as HttpClient and does not support the complete HTTP protocol, it is very flexible and meets all my current needs...

News collector: NZBGet

NZBGet is a news collector that downloads material from newsgroups using files in the NZB format. It can be used in standalone and server/client modes. In standalone mode, the NZB file is passed as a command-line parameter to download the file. The server and client share a single executable file, "nzbget". Functions and features: console interface using plain text, color text, or...

Web crawler: Ex-Crawler

Ex-Crawler is a web crawler developed in Java. The project is divided into two parts: a daemon process and a flexible, configurable web crawler. It uses a database to store web page information.

Recruitment information crawler: JobHunter

JobHunter is designed to automatically obtain recruitment information from some large sites, such as ChinaHR, 51job, and Zhaopin. JobHunter finds the email address for each job posting and automatically sends the application text to that address.

Web crawler framework: HiSpider

HiSpider is a fast, high-performance spider. Strictly speaking, it is only a framework for a spider system, without detailed requirements worked out. Currently, it can only extract URLs, deduplicate URLs, resolve DNS asynchronously, and queue tasks. It supports distributed downloading across N machines and targeted downloading of specific websites (the whitelist in hispiderd.ini must be configured). Features...
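URL deduplication, one of the functions listed above, is commonly done by normalizing each URL and remembering the normalized form. The Python sketch below shows that general idea; it is not HiSpider's implementation, and the normalization rules chosen here (lowercase scheme and host, drop default ports and fragments, treat an empty path as "/") are assumptions for the example.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str) -> str:
    """Reduce equivalent spellings of a URL to one canonical string."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class UrlDeduplicator:
    """Remember normalized URLs and report whether a URL is new."""

    def __init__(self):
        self._seen = set()

    def add(self, url: str) -> bool:
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedup = UrlDeduplicator()
print(dedup.add("HTTP://Example.com"))
print(dedup.add("http://example.com:80/#top"))
```

An in-memory set is enough for a sketch; a crawler handling very large URL sets would typically need a more compact structure or persistent storage.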

Perl crawler: Combine

Combine is an open, extensible web resource crawler developed in Perl.

Web crawler: Jcrawl

Jcrawl is a small web crawler with excellent performance. It can capture various kinds of content from web pages based on user-defined patterns, such as email addresses and QQ numbers.
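Extracting items such as email addresses from page text is usually done with pattern matching. The Python sketch below assumes a regular-expression approach, which the source does not confirm for Jcrawl; the pattern is a deliberately simplified email matcher, not a full address validator, and the function name extract_matches is made up for the example.

```python
import re

# Simplified email pattern for illustration; real addresses can be
# more varied than this expression allows.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_matches(text: str, pattern: re.Pattern) -> list:
    """Return the unique matches of a pattern in sorted order."""
    return sorted(set(pattern.findall(text)))


page = '<a href="mailto:alice@example.com">Alice</a> or bob@example.org'
print(extract_matches(page, EMAIL_PATTERN))
```

Swapping in a different compiled pattern lets the same function capture other user-defined items.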

Distributed web crawler: Ebot

Ebot is a scalable, distributed web crawler developed in Erlang. URLs are saved in a database and can be queried through RESTful HTTP requests.

Multi-threaded web crawler: Spidernet

Spidernet is a multi-threaded web crawler based on a recursive tree. It supports retrieving text/html resources. You can set the crawling depth and the maximum number of bytes to download. It supports Gzip decoding and resources encoded in GBK (GB2312) and UTF-8, and stores them in SQLite data files. In the source code, TODO: markers describe unfinished features. Code contributions are welcome...

ItSucks

ItSucks is an open-source Java web spider (web robot, crawler) project. Download rules can be defined through download templates and regular expressions. It provides a Swing GUI.

Web search crawler: Blueleech

Blueleech is an open-source program that starts from a specified URL and searches for all available links, and for the links on the pages those links lead to. While searching, it can download all of the content the links point to, or only predefined content.

URL monitoring script: urlwatch

urlwatch is a Python script that monitors specified URLs. When the content at a monitored URL changes, it notifies you by email. Basic functions: easy to configure; URLs are specified in a text file, one URL per line; easily hackable (clean Python implementation); can run as a cron job and m...

Latest update: urlwatch 1.8 was released four years ago.
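The monitoring idea described for urlwatch (read a list of URLs, one per line, and report when the content behind a URL changes) can be sketched in Python as follows. This is a simplified illustration and not urlwatch's actual code: it compares SHA-256 fingerprints of the content rather than producing diffs, page contents are passed in as bytes instead of being downloaded, and the function names are hypothetical.

```python
import hashlib


def parse_url_list(text: str) -> list:
    """Read URLs from text that holds one URL per line, skipping blanks."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fingerprint(content: bytes) -> str:
    """Summarize page content as a SHA-256 hex digest."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous: dict, pages: dict):
    """Compare stored fingerprints with freshly fetched content.

    previous maps URL -> fingerprint from the last run;
    pages maps URL -> content bytes from this run.
    Returns (changed URLs, updated fingerprint map).
    """
    changed = []
    updated = dict(previous)
    for url, content in pages.items():
        digest = fingerprint(content)
        # A URL seen for the first time is recorded but not reported.
        if url in previous and previous[url] != digest:
            changed.append(url)
        updated[url] = digest
    return changed, updated


urls = parse_url_list("http://a.example/\n\nhttp://b.example/\n")
_, state = detect_changes({}, {url: b"version 1" for url in urls})
changed, state = detect_changes(
    state, {urls[0]: b"version 2", urls[1]: b"version 1"}
)
print(changed)
```

Run from a scheduler such as cron, a script built this way would persist the fingerprint map between runs and send a notification for each changed URL.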

Methabot

Methabot is a high-speed, configurable crawler for the web, FTP, and local file systems.

Web search and crawler: Leopdo

A web search engine and crawler written in Java, including full-text search, categorized vertical search, and a word segmentation system.

Web crawler tool: NCrawler

NCrawler is a web crawler tool that lets developers easily build applications with web crawling capabilities. It is extensible, so developers can add support for other types of resources (such as PDF, Word, and Excel documents). NCrawler uses multiple threads (...

Ajax crawling and testing: Crawljax

Crawljax is an open-source Java tool for automated crawling and testing of modern Ajax web applications.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
