Open source web crawler Summary

Awesome-crawler-cn

A summary of Internet crawlers, spiders, data collectors, and web parsers. Because new technologies keep evolving and new frameworks keep appearing, this article will be updated continuously...

Discussion
    1. You are welcome to recommend open source web crawlers and web extraction frameworks that you know of.
    2. Open source web crawler QQ exchange group: 322937592
    3. Email address: liinux at qq.com
Python
    • Scrapy - An efficient screen scraping and web data extraction framework (a minimal spider sketch follows this list).
      • Django-dynamic-scraper - A crawler built on the Scrapy core and managed through the Django web framework.
      • Scrapy-redis - A crawler based on the Scrapy core that uses Redis components for distributed crawling.
      • Scrapy-cluster - A distributed crawler framework based on the Scrapy core, built with Redis and Kafka.
      • Distribute_crawler - A distributed crawler framework based on the Scrapy core, using Redis and MongoDB.
    • Pyspider - A powerful data acquisition system written in pure Python.
    • Cola - A distributed crawler framework.
    • Demiurge - A micro crawler framework based on PyQuery.
    • Scrapely - A pure-Python HTML page scraping library.
    • Feedparser - A universal feed parser.
    • You-get - A silent downloader that scrapes websites.
    • Grab - A site scraping framework.
    • Mechanicalsoup - A Python library for automating interaction with websites.
    • Portia - A visual data extraction framework based on Scrapy.
    • Crawley - A Python crawler framework based on non-blocking I/O (NIO).
    • Robobrowser - A simple Pythonic library for browsing the web without a standalone web browser.
    • Mspider - A Python crawler based on gevent (a coroutine networking library).
    • Brownant - A lightweight web data extraction framework.
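
A minimal Scrapy spider shows the pattern shared by many of the frameworks above: fetch a page, extract items with selectors, follow links. This is only a sketch; the spider name, the field names, and the target site (quotes.toscrape.com, a public scraping practice site) are illustrative choices, not part of any listed project.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: extract items from each page, then follow pagination."""
    name = "quotes"  # illustrative name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block using CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this can be run without a full Scrapy project via `scrapy runspider quotes_spider.py -o quotes.json`.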
Java
    • Apache Nutch - A highly extensible, highly scalable web crawler for production environments.
      • Anthelion - A plugin for Apache Nutch that crawls semantic annotations within HTML pages.
    • CRAWLER4J - A simple and lightweight web crawler.
    • Jsoup - Collects, parses, manipulates, and cleans HTML pages.
    • Websphinx - Website-specific processing and information extraction for HTML.
    • Open Search Server - A full set of search features; set your own indexing strategy. Parsers analyze and extract full-text data, and this framework can index everything.
    • Gecco - An easy-to-use, lightweight web crawler.
    • Webcollector - Simple crawling interfaces; you can deploy a multi-threaded web crawler in under five minutes.
    • Webmagic - An extensible crawler framework.
    • Spiderman - A scalable, multi-threaded web crawler.
      • Spiderman2 - A distributed web crawler framework with support for JavaScript rendering.
    • HERITRIX3 - A scalable, large-scale web crawler project.
    • Seimicrawler - An agile, distributed crawler framework.
    • Stormcrawler - An open source collection of resources for building low-latency web crawlers on Apache Storm.
    • Spark-crawler - A web crawler based on Apache Nutch that can run on Spark.
C#
    • Ccrawler - A simple web content categorization scheme that can distinguish web pages based on their content; built on C# 3.5.
    • Simplecrawler - A simple multi-threaded web crawler based on regular expressions.
    • Dotnetspider - A lightweight, cross-platform web crawler developed in C#.
    • Abot - A C# web crawler with good efficiency and extensibility.
    • Hawk - A web crawler developed in C#/WPF with simple ETL functionality.
    • Skyscraper - A web crawler that supports asynchronous networking and has good extensibility.
JavaScript
    • Scraperjs - A full-featured web crawler based on JavaScript.
    • Scrape-it - A web crawler based on Node.js.
    • Simplecrawler - An event-driven web crawler (a sketch of the non-blocking idea follows this list).
    • Node-crawler - A crawler that provides a simple API, well suited to further development.
    • Js-crawler - A web crawler supporting HTTP(S), based on Node.js.
    • X-ray - A web crawler with support for pagination.
    • Node-osmosis - A Node.js-based web crawler well suited to parsing HTML structures.
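
Several entries here and in the Python section (Simplecrawler above; Crawley and Mspider there) are built on event-driven, non-blocking I/O: one loop multiplexes many in-flight requests instead of one thread per request. A short sketch using Python's standard asyncio with the third-party aiohttp library illustrates the idea; the URLs and function names are invented for the example.

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # Issue a non-blocking GET; the event loop runs other fetches meanwhile.
    async with session.get(url) as response:
        return url, await response.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule all fetches concurrently on one event loop (no threads).
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, body in results:
            print(f"{url}: {len(body)} characters")

if __name__ == "__main__":
    asyncio.run(crawl(["https://example.com/", "https://example.org/"]))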
PHP
    • Goutte - A PHP-based screen scraping and web crawling library.
      • Laravel-goutte - A Laravel 5 wrapper for Goutte.
    • Dom-crawler - A component that eases extraction of data from HTML/XML DOM documents.
    • Pspider - A concurrent web crawler based on PHP.
    • Php-spider - A highly extensible web crawler based on PHP.
C++
    • Open-source-search-engine - A web crawler and search engine developed in C++.
C
    • HTTrack - A tool for copying (mirroring) entire websites.
Ruby
    • Upton - An easy-to-learn collection of crawler frameworks with support for CSS selectors.
    • Wombat - A Ruby-native, DSL-enabled web crawler that makes it easy to extract web page data.
    • Rubyretriever - A Ruby-based website data collector and whole-web harvester.
    • SPIDR - Whole-site data collection, with support for collecting an unlimited number of site links.
    • Cobweb - A very flexible, easily extensible web crawler that can be deployed standalone.
    • Mechanize - A framework for automating interaction with websites and capturing site data.
R
    • Rvest - A simple web scraping framework based on R.
Erlang
    • Ebot-a distributed, highly scalable web crawler.
Perl
    • Web-scraper - A web scraping toolkit that uses HTML and CSS selectors or XPath expressions (a selector sketch follows this section).
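
The two selector styles Web-scraper mentions (and that Upton and Scrapy also rely on) are language-independent. As an assumed illustration in Python, lxml, with the cssselect package installed for CSS support, can address the same element both ways; the sample HTML is invented.

```python
from lxml import html  # requires lxml; CSS selectors also need cssselect

# A tiny invented document for illustration.
page = html.fromstring("""
    <html><body>
      <div class="quote"><span class="text">Hello, crawler.</span></div>
    </body></html>
""")

# XPath expression: navigate by structure and attributes.
print(page.xpath('//div[@class="quote"]/span/text()'))  # ['Hello, crawler.']

# CSS selector: the same element, addressed the way a stylesheet would.
print(page.cssselect("div.quote span.text")[0].text)    # Hello, crawler.
```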
Go
    • Pholcus - A distributed web crawler with support for high concurrency.
    • Gocrawl - A highly concurrent, lightweight, polite web crawler.
    • Fetchbot - A lightweight web crawler that complies with robots.txt and crawl-delay rules (a robots.txt sketch follows this list).
    • Go_spider - A very good, highly concurrent web crawler.
    • DHT - A web crawler that supports the DHT protocol.
    • Ants-go - A highly parallel web crawler based on Golang.
    • Scrape - A simple web crawler that provides a good interface for development.
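
Gocrawl's politeness and Fetchbot's robots.txt compliance come down to the same protocol: fetch a site's robots.txt once, then consult it before each request. Python's standard-library urllib.robotparser is enough to sketch that behavior; the user-agent string and URLs below are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # illustrative user-agent string

# Download and parse the site's robots.txt once per host.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch(USER_AGENT, url):
    # A polite crawler also honors any Crawl-delay directive.
    delay = rp.crawl_delay(USER_AGENT)  # None if not specified
    print(f"OK to fetch {url}; crawl delay: {delay}")
else:
    print(f"robots.txt disallows fetching {url}")
```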
Scala
    • Crawler - A web crawler based on a Scala DSL.
    • Scrala - A web crawler developed in Scala, modeled on Scrapy.
    • Ferrit - A web crawler developed in Scala, based on Akka, Spray, and Cassandra.
