# Awesome-crawler-cn

A collection of internet crawlers, spiders, data collectors, and web parsers. Because new technologies keep evolving and new frameworks keep emerging, this list will be updated continuously...

## Discussion

- You are welcome to recommend open-source web crawlers and web extraction frameworks that you know of.
- QQ group for open-source web crawler discussion: 322937592
- Email: liinux at qq.com

## Python

- Scrapy - An efficient screen-scraping and web data extraction framework (a minimal spider sketch follows this list).
- Django-dynamic-scraper - A crawler built on the Scrapy core and managed through the Django web framework.
- Scrapy-redis - A crawler built on the Scrapy core that uses Redis components for distribution.
- Scrapy-cluster - A distributed crawler framework built on the Scrapy core using Redis and Kafka.
- Distribute_crawler - A distributed crawler framework built on the Scrapy core using Redis and MongoDB.
- Pyspider - A powerful data collection system written in pure Python.
- Cola - A distributed crawler framework.
- Demiurge - A micro crawler framework based on PyQuery.
- Scrapely - A pure-Python HTML page scraping library.
- Feedparser - A universal feed parser.
- You-get - A silent site scraper and downloader.
- Grab - A site-scraping framework.
- MechanicalSoup - A Python library for automating interaction with websites.
- Portia - A visual data extraction framework based on Scrapy.
- Crawley - A Pythonic crawler framework based on non-blocking I/O.
- RoboBrowser - A simple, pure-Python library for browsing the web without a standalone web browser.
- MSpider - A simple Python crawler based on gevent (a coroutine-based networking library).
- Brownant - A lightweight web data extraction framework.
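
Most of the Python frameworks above share the same callback-driven model: you declare start URLs, parse each response into items, and queue follow-up requests. As a rough illustration, here is a minimal Scrapy spider; the target site (quotes.toscrape.com, a public scraping sandbox) and the CSS selectors are assumptions chosen for demonstration, not taken from any project above.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal sketch of a Scrapy spider; site and selectors are examples."""

    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if the page has one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as `quotes_spider.py`, this can be run without creating a full project via `scrapy runspider quotes_spider.py -o quotes.json`.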

## Java

- Apache Nutch - A highly extensible, highly scalable web crawler for production environments.
- Anthelion - A plugin for Apache Nutch that crawls semantic annotations within HTML pages.
- Crawler4j - A simple and lightweight web crawler.
- Jsoup - A library that scrapes, parses, manipulates, and cleans HTML.
- Websphinx - Website-specific processors for HTML information extraction.
- Open Search Server - A full set of search functions; define your own indexing strategy, parse and extract full-text data, and index almost anything.
- Gecco - An easy-to-use, lightweight web crawler.
- WebCollector - Simple crawling interfaces; you can deploy a multi-threaded web crawler in under five minutes.
- WebMagic - An extensible crawler framework.
- Spiderman - A scalable, multi-threaded web crawler.
- Spiderman2 - A distributed web crawler framework with support for JavaScript rendering.
- Heritrix3 - An extensible, large-scale web crawler project.
- SeimiCrawler - An agile, distributed crawler framework.
- StormCrawler - An open-source framework for building low-latency web resource collectors on Apache Storm.
- Spark-crawler - A web crawler based on Apache Nutch that runs on Spark.

## C#

- Ccrawler - A simple web content classifier that categorizes web pages by their content; built on C# 3.5.
- SimpleCrawler - A simple multi-threaded web crawler based on regular expressions.
- DotnetSpider - A lightweight, cross-platform web crawler written in C#.
- Abot - An efficient and extensible C# web crawler.
- Hawk - A web crawler built with C#/WPF that includes simple ETL functionality.
- Skyscraper - A web crawler with asynchronous networking support and good extensibility.

## JavaScript

- Scraperjs - A full-featured JavaScript web crawler.
- Scrape-it - A Node.js-based web scraper.
- SimpleCrawler - An event-driven web crawler.
- Node-crawler - A web crawler that provides a simple API for building your own crawlers on top of it.
- Js-crawler - A Node.js web crawler with HTTP(S) support.
- X-ray - A web crawler with support for pagination.
- Node-osmosis - A Node.js web crawler well suited to parsing HTML structures.

## PHP

- Goutte - A PHP library for screen scraping and web crawling.
- Laravel-goutte - A web crawler based on Laravel 5.
- Dom-crawler - A component that eases DOM navigation and data extraction for HTML and XML documents.
- Pspider - A concurrent web crawler written in PHP.
- Php-spider - A highly extensible web crawler written in PHP.

## C++

- Open-source-search-engine - A web crawler and search engine written in C++.

## C

- HTTrack - A tool for mirroring entire websites.

## Ruby

- Upton - A framework collection for easy web scraping, with support for CSS selectors.
- Wombat - A Ruby web crawler with a natural, DSL-based interface that makes it easy to extract body data from pages.
- RubyRetriever - A Ruby-based website data collector and whole-web harvester.
- Spidr - Whole-site data collection, with support for collecting an unlimited number of site links.
- Cobweb - A very flexible, easily extensible web crawler that can be deployed standalone.
- Mechanize - A framework for automatically collecting website data.

## R

- Rvest - A simple web scraper written in R.

## Erlang

- Ebot - A distributed, highly scalable web crawler.

## Perl

- Web-scraper - A web scraping toolkit that uses HTML and CSS selectors or XPath expressions.

## Go

- Pholcus - A distributed web crawler that supports high concurrency.
- Gocrawl - A polite, lightweight, highly concurrent web crawler.
- Fetchbot - A lightweight web crawler that honors robots.txt rules and crawl-delay directives (see the sketch after this list).
- Go_spider - An excellent, highly concurrent web crawler.
- DHT - A web crawler that supports the DHT protocol.
- Ants-go - A highly parallel web crawler written in Golang.
- Scrape - A simple web crawler that provides a pleasant interface for development.
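
The politeness features advertised by Fetchbot and Gocrawl boil down to two checks before each request: whether robots.txt allows the URL, and how long to wait between requests. Below is a minimal sketch of that logic, written in Python to match the earlier example; the URL and user-agent string are illustrative assumptions.

```python
import time
import urllib.robotparser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "ExampleCrawler/1.0"  # assumed user-agent string
url = "https://example.com/some/page"

if rp.can_fetch(user_agent, url):
    # Honor the site's declared crawl delay, defaulting to one second.
    delay = rp.crawl_delay(user_agent) or 1.0
    time.sleep(delay)
    # ... fetch the page here ...
else:
    print("Disallowed by robots.txt:", url)
```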

## Scala

- Crawler - A Scala DSL for web crawling.
- Scrala - A web crawler written in Scala, modeled on the Scrapy core.
- Ferrit - A web crawler written in Scala, using Akka, Spray, and Cassandra.

## Open-Source Web Crawler Summary