Open source web crawler Summary

Awesome-crawler-cn

A summary of Internet crawlers, spiders, data collectors, and web parsers. Because new technologies keep evolving and new frameworks keep appearing, this article will be updated continuously...

Discussion
    1. You are welcome to recommend open source web crawlers and web extraction frameworks that you know of.
    2. Open source web crawler QQ exchange group: 322937592
    3. Email address: liinux at qq.com
Python
    • Scrapy - An efficient screen scraping and web data extraction framework (a minimal spider sketch follows this list).
      • Django-dynamic-scraper - A crawler built on the Scrapy core and managed through the Django web framework.
      • Scrapy-redis - A crawler based on the Scrapy core that uses Redis components for distributed crawling.
      • Scrapy-cluster - A distributed crawler framework based on the Scrapy core, built with Redis and Kafka.
      • Distribute_crawler - A distributed crawler framework based on the Scrapy core, using Redis and MongoDB.
    • Pyspider - A powerful data acquisition system written in pure Python.
    • Cola - A distributed crawler framework.
    • Demiurge - A micro crawler framework based on PyQuery.
    • Scrapely - A pure-Python HTML page scraping library.
    • Feedparser - A universal feed parser.
    • You-get - A silent downloader that scrapes websites.
    • Grab - A site scraping framework.
    • Mechanicalsoup - A Python library for automating interaction with websites.
    • Portia - A visual data extraction framework based on Scrapy.
    • Crawley - A Python crawler framework based on non-blocking I/O (NIO).
    • Robobrowser - A simple Pythonic library for browsing the web without a standalone web browser.
    • Mspider - A Python crawler based on gevent (a coroutine networking library).
    • Brownant - A lightweight web data extraction framework.
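
A minimal Scrapy spider shows the pattern shared by many of the frameworks above: fetch a page, extract items with selectors, follow links. This is only a sketch; the spider name, the field names, and the target site (quotes.toscrape.com, a public scraping practice site) are illustrative choices, not part of any listed project.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: extract items from each page, then follow pagination."""
    name = "quotes"  # illustrative name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block using CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this can be run without a full Scrapy project via `scrapy runspider quotes_spider.py -o quotes.json`.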
Java
    • Apache Nutch - A highly extensible, highly scalable web crawler for production environments.
      • Anthelion - A plugin for Apache Nutch that crawls semantic annotations within HTML pages.
    • CRAWLER4J - A simple and lightweight web crawler.
    • Jsoup - Collects, parses, manipulates, and cleans HTML pages.
    • Websphinx - Website-specific processing and information extraction for HTML.
    • Open Search Server - A full set of search features; set your own indexing strategy. Parsers analyze and extract full-text data, and this framework can index everything.
    • Gecco - An easy-to-use, lightweight web crawler.
    • Webcollector - Simple crawling interfaces; you can deploy a multi-threaded web crawler in under five minutes.
    • Webmagic - An extensible crawler framework.
    • Spiderman - A scalable, multi-threaded web crawler.
      • Spiderman2 - A distributed web crawler framework with support for JavaScript rendering.
    • HERITRIX3 - A scalable, large-scale web crawler project.
    • Seimicrawler - An agile, distributed crawler framework.
    • Stormcrawler - An open source collection of resources for building low-latency web crawlers on Apache Storm.
    • Spark-crawler - A web crawler based on Apache Nutch that can run on Spark.
C#
    • Ccrawler - A simple web content categorization scheme that can distinguish web pages based on their content; built on C# 3.5.
    • Simplecrawler - A simple multi-threaded web crawler based on regular expressions.
    • Dotnetspider - A lightweight, cross-platform web crawler developed in C#.
    • Abot - A C# web crawler with good efficiency and extensibility.
    • Hawk - A web crawler developed in C#/WPF with simple ETL functionality.
    • Skyscraper - A web crawler that supports asynchronous networking and has good extensibility.
JavaScript
    • Scraperjs - A full-featured web crawler based on JavaScript.
    • Scrape-it - A web crawler based on Node.js.
    • Simplecrawler - An event-driven web crawler (a sketch of the non-blocking idea follows this list).
    • Node-crawler - A crawler that provides a simple API, well suited to further development.
    • Js-crawler - A web crawler supporting HTTP(S), based on Node.js.
    • X-ray - A web crawler with support for pagination.
    • Node-osmosis - A Node.js-based web crawler well suited to parsing HTML structures.
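
Several entries here and in the Python section (Simplecrawler above; Crawley and Mspider there) are built on event-driven, non-blocking I/O: one loop multiplexes many in-flight requests instead of one thread per request. A short sketch using Python's standard asyncio with the third-party aiohttp library illustrates the idea; the URLs and function names are invented for the example.

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # Issue a non-blocking GET; the event loop runs other fetches meanwhile.
    async with session.get(url) as response:
        return url, await response.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule all fetches concurrently on one event loop (no threads).
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, body in results:
            print(f"{url}: {len(body)} characters")

if __name__ == "__main__":
    asyncio.run(crawl(["https://example.com/", "https://example.org/"]))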
PHP
    • Goutte - A PHP-based screen scraping and web crawling library.
      • Laravel-goutte - A Laravel 5 wrapper for Goutte.
    • Dom-crawler - A component that eases extraction of data from HTML/XML DOM documents.
    • Pspider - A concurrent web crawler based on PHP.
    • Php-spider - A highly extensible web crawler based on PHP.
C++
    • Open-source-search-engine - A web crawler and search engine developed in C++.
C
    • HTTrack - A tool for copying (mirroring) entire websites.
Ruby
    • Upton - An easy-to-learn collection of crawler frameworks with support for CSS selectors.
    • Wombat - A Ruby-native, DSL-enabled web crawler that makes it easy to extract web page data.
    • Rubyretriever - A Ruby-based website data collector and whole-web harvester.
    • SPIDR - Whole-site data collection, with support for collecting an unlimited number of site links.
    • Cobweb - A very flexible, easily extensible web crawler that can be deployed standalone.
    • Mechanize - A framework for automating interaction with websites and capturing site data.
R
    • Rvest - A simple web scraping framework based on R.
Erlang
    • Ebot-a distributed, highly scalable web crawler.
Perl
    • Web-scraper - A web scraping toolkit that uses HTML and CSS selectors or XPath expressions (a selector sketch follows this section).
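
The two selector styles Web-scraper mentions (and that Upton and Scrapy also rely on) are language-independent. As an assumed illustration in Python, lxml, with the cssselect package installed for CSS support, can address the same element both ways; the sample HTML is invented.

```python
from lxml import html  # requires lxml; CSS selectors also need cssselect

# A tiny invented document for illustration.
page = html.fromstring("""
    <html><body>
      <div class="quote"><span class="text">Hello, crawler.</span></div>
    </body></html>
""")

# XPath expression: navigate by structure and attributes.
print(page.xpath('//div[@class="quote"]/span/text()'))  # ['Hello, crawler.']

# CSS selector: the same element, addressed the way a stylesheet would.
print(page.cssselect("div.quote span.text")[0].text)    # Hello, crawler.
```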
Go
    • Pholcus - A distributed web crawler with support for high concurrency.
    • Gocrawl - A highly concurrent, lightweight, polite web crawler.
    • Fetchbot - A lightweight web crawler that complies with robots.txt and crawl-delay rules (a robots.txt sketch follows this list).
    • Go_spider - A very good, highly concurrent web crawler.
    • DHT - A web crawler that supports the DHT protocol.
    • Ants-go - A highly parallel web crawler based on Golang.
    • Scrape - A simple web crawler that provides a good interface for development.
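
Gocrawl's politeness and Fetchbot's robots.txt compliance come down to the same protocol: fetch a site's robots.txt once, then consult it before each request. Python's standard-library urllib.robotparser is enough to sketch that behavior; the user-agent string and URLs below are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # illustrative user-agent string

# Download and parse the site's robots.txt once per host.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch(USER_AGENT, url):
    # A polite crawler also honors any Crawl-delay directive.
    delay = rp.crawl_delay(USER_AGENT)  # None if not specified
    print(f"OK to fetch {url}; crawl delay: {delay}")
else:
    print(f"robots.txt disallows fetching {url}")
```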
Scala
    • Crawler - A web crawler based on a Scala DSL.
    • Scrala - A web crawler developed in Scala, modeled on Scrapy.
    • Ferrit - A web crawler developed in Scala, based on Akka, Spray, and Cassandra.
