Global 100 Big Data Tools summary (Top 50)

Source: Internet
Author: User


Talend Open Studio

Talend was the first open source software vendor in the data integration (ETL: extract, transform, load) tool market. Its open source software, which provides data integration capabilities, has been downloaded more than 2 million times. Its users include AIG, Comcast, eBay, GE, Samsung, Ticketmaster, Verizon, and other corporate organizations.
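
To make the extract, transform, load steps concrete, here is a minimal sketch in plain Python. It is not Talend's API: the sample data, the `payments` table, and all function names are hypothetical, and only the standard library is used.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has a missing amount, one has stray spaces.
RAW_CSV = """name,amount
alice, 10.5
bob,
carol,7
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, normalize names and types."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """Run the three stages in order and return what ended up in the table."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT name, amount FROM payments ORDER BY name"
    ).fetchall()
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` loads only the two complete rows. A real integration tool adds connectors, scheduling, and error handling on top of this basic shape.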


The independently developed Dyson Intelligent Analysis System covers the full cycle of big data collection, analysis, and processing. It specializes in capturing, processing, analyzing, and mining Internet data: it can flexibly and quickly grab information scattered across web pages and, through its processing functions, accurately extract the required data. It is described as the web capture tool with the largest number of users.


YARN

A new Hadoop resource manager: a general-purpose resource management system that provides unified resource management and scheduling for upper-level applications and addresses the performance bottlenecks of the old MapReduce framework. Its basic idea is to split resource management and job scheduling/monitoring into separate daemons.

Mesos

Open source cluster management software first developed by AMPLab at the University of California, Berkeley. It supports frameworks such as Hadoop, Elasticsearch, Spark, Storm, and Kafka. It turns a data center into a single resource pool by abstracting CPU, memory, storage, and other computing resources away from physical or virtual machines, which makes it easy to build and efficiently run fault-tolerant, elastic distributed systems.


A Hadoop-based big data platform development kit developed by Discovery Technology, built on the RAI big data application platform architecture.


Ambari

Part of the Hadoop ecosystem, it provides an intuitive web-based interface for configuring, managing, and monitoring Hadoop clusters. Most Hadoop components are currently supported, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.


ZooKeeper

A distributed application coordination service and an important component of Hadoop and HBase. It provides consistency services for distributed applications, allowing the nodes within a Hadoop cluster to coordinate with each other. ZooKeeper is now a top-level Apache project that provides efficient, reliable, and easy-to-use coordination services for distributed systems.


Thrift

In 2007 Facebook contributed Thrift to the Apache Foundation as an open source project. Facebook created Thrift to handle the large volume of data exchanged between its internal systems, which ran on different platforms and were written in different languages.


Chukwa

An open source data collection system for monitoring large distributed systems. It is built on the HDFS/MapReduce framework and inherits Hadoop's scalability and reliability. It also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing the collected results.

Lustre

A large-scale, secure, highly available cluster file system developed and maintained by Sun. The project's main goal is to develop the next generation of cluster file systems; it currently supports more than 10,000 nodes and petabytes of data storage.


HDFS

The Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system suitable for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications that work on large datasets.


GlusterFS

A clustered file system that supports petabyte-scale data volumes. GlusterFS uses RDMA and TCP/IP to aggregate storage space distributed across different servers into one large networked parallel file system.


Alluxio

Formerly known as Tachyon, a memory-centric distributed file system with high performance and fault tolerance. It provides reliable file sharing at memory speed for cluster frameworks such as Spark and MapReduce.


Ceph

A new-generation open source distributed file system. Its main goal is a POSIX-based distributed file system with no single point of failure, improved data fault tolerance, and seamless replication.


PVFS

A high-performance, open source parallel file system, used primarily in parallel computing environments. PVFS is designed for large numbers of clients and servers, and its modular structure makes it easy to add support for new hardware and algorithms.


QFS

The Quantcast File System (QFS) is a high-performance, fault-tolerant, distributed file system for applications that use MapReduce processing or need to read and write large files sequentially.


Logstash

A platform for transporting, processing, managing, and searching application logs and events. It can be used to collect and manage application logs in one place, and it provides a web interface for queries and statistics.


Scribe

Scribe is Facebook's open source log collection system. It collects logs from a variety of log sources and stores them on a central storage system (NFS, a distributed file system, and so on) for centralized statistical analysis.


Flume

A highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume lets you plug customized data senders into the logging system to collect data. It can also apply simple processing to the data and write it to a variety of (customizable) data receivers.


RabbitMQ

A popular message broker that is typically used to pass messages between applications or between different components of one program. RabbitMQ provides reliable application messaging, is easy to use, and supports all major operating systems and a large number of developer platforms.


ActiveMQ

Described by Apache as the "most popular, most powerful" open source messaging and Integration Patterns server. ActiveMQ is fast, supports a wide range of cross-language clients and protocols, makes Enterprise Integration Patterns and many advanced features easy to use, and is a JMS provider that fully supports the JMS 1.1 and J2EE 1.4 specifications.


Kafka

A high-throughput distributed publish-subscribe messaging system that can handle all of the activity stream data of a consumer-scale website. It is now a leading choice for asynchronous, distributed messaging in big data systems.
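
The publish-subscribe idea can be sketched with an in-memory toy: producers append to a per-topic log, and each consumer group tracks its own read position. This is only an illustration of the pattern under simplifying assumptions (single process, no partitions, no persistence). `ToyBroker` and its methods are hypothetical names, not a real client API.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (illustrative only)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only message log
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return the next unread messages for this consumer group."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch
```

Because messages stay in the log and each group keeps its own offset, two independent subscribers (say, hypothetical groups `"analytics"` and `"audit"`) can each read the full stream at their own pace.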


Spark

A fast, general-purpose big data processing engine. It has the advantages of Hadoop MapReduce, with one key difference: the intermediate output of a job can be kept in memory, which removes the need to write it to and read it back from HDFS. Spark is therefore better suited to algorithms that require iterative MapReduce-style passes, such as data mining and machine learning. It can be used with Hadoop and Apache Mesos, or it can run standalone.
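
The benefit of keeping data in memory across iterations can be shown with a toy comparison. This is not Spark code: `CountingSource` is a hypothetical stand-in for slow storage, and `refine` is an arbitrary iterative update chosen only to give the loop something to do.

```python
class CountingSource:
    """Stand-in for slow storage; counts how often the data is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._values)

def refine(estimate, data):
    """One iteration: move the estimate halfway toward the data mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)

def iterate_rereading(source, steps):
    """Re-read the input from storage on every iteration."""
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, source.read())
    return estimate

def iterate_cached(source, steps):
    """Read the input once, keep it in memory, and iterate over the cached copy."""
    data = source.read()
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, data)
    return estimate
```

Both functions compute the same result, but the cached version touches storage once instead of once per iteration, which is where the savings for iterative algorithms come from.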


Amazon Kinesis

Lets you build custom applications that process or analyze streaming data to meet specific needs. Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, financial transactions, social media feeds, IT logs, and location tracking events.


Hadoop

An open source framework that runs on commodity hardware and supports a simple programming model for distributed processing of large datasets across clusters, scaling out from a single server to thousands of servers. Apache's Hadoop project, which is almost synonymous with big data, has grown into a complete ecosystem with many open source tools for highly scalable distributed computing. Efficient, reliable, and scalable, it provides YARN, HDFS, and the infrastructure you need for data storage projects, and it runs the major big data services and applications.

Spark Streaming

Spark Streaming implements micro-batching. Its goal is to make it easy to build scalable, fault-tolerant streaming applications; it supports Java, Scala, and Python and integrates seamlessly with Spark. Spark Streaming can read data from HDFS, Flume, Kafka, Twitter, and ZeroMQ, as well as from custom sources.
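
Micro-batching means cutting a continuous stream into small time windows and running an ordinary batch computation on each one. The sketch below illustrates only that idea in plain Python. It does not use the Spark Streaming API, and the function names and the `(timestamp, value)` event format are assumptions made for the example.

```python
from collections import Counter

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time windows.

    Windows that received no events are simply skipped in this sketch.
    """
    batches = {}
    for ts, value in events:
        window = int(ts // batch_interval)
        batches.setdefault(window, []).append(value)
    return [batches[w] for w in sorted(batches)]

def word_counts_per_batch(events, batch_interval):
    """Run a small batch job (a word count) on every micro-batch."""
    return [dict(Counter(batch)) for batch in micro_batches(events, batch_interval)]
```

With a one-second interval, events stamped 0.1 and 0.7 land in the first batch and events stamped 1.2 and 1.9 in the second, and each batch is counted independently.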


Trident

A higher-level abstraction on top of Storm. In addition to providing an easy-to-use stream processing API, it processes data in batches (sets of tuples), which makes some kinds of processing simpler and more efficient.


Flink

This year Flink became a top-level Apache open source project. It is fully compatible with HDFS, provides Java and Scala APIs, and is an efficient, distributed, general-purpose big data analytics engine. More importantly, Flink supports incremental iterative computation, which lets the system process data-intensive, iterative tasks quickly.


Samza

A distributed stream processing framework that originated at LinkedIn and is built on top of Kafka. It is a top-level Apache open source project. It uses Kafka and Hadoop YARN directly to provide fault tolerance, process isolation, security, and resource management.


Storm

Storm is a real-time data processing framework, similar in spirit to Hadoop, that was open sourced by Twitter. Its programming model is simple, which significantly lowers the difficulty of real-time processing, and it is currently one of the most popular stream computing frameworks. Compared with other computing frameworks, Storm's greatest advantage is its millisecond-level latency.

Yahoo S4 (Simple Scalable Streaming System)

A general-purpose, distributed, scalable, fault-tolerant, pluggable stream computing platform that lets programmers easily develop applications for processing continuous unbounded streams of data. Its goal is to fill the gap between complex proprietary systems and batch-oriented open source products, and to provide a high-performance computing platform that hides the complexity of concurrent processing systems.


HaLoop

A modified version of the Hadoop MapReduce framework whose goal is to efficiently support iterative and recursive data analysis tasks such as PageRank, HITS, K-means, and SSSP.
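
PageRank is a typical example of such an iterative task: the same link structure is scanned again on every pass while only the rank values change. Below is a minimal single-machine sketch in plain Python. It does not use any MapReduce framework, and the damping factor of 0.85 and the even redistribution of rank from pages without outgoing links are common conventions assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each pass with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

Each pass of the outer loop corresponds to what would be one full job in a batch framework, which is why frameworks that avoid reloading the unchanged link data between passes help with this class of algorithm.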


Presto

An open source, distributed SQL query engine for interactive analytic queries that enables fast, interactive analysis of data sets of 250 PB and more. Presto was designed and written to bring the speed of interactive analysis to commercial data warehouses at the scale of Facebook's. Facebook says Presto performs more than 10 times better than Hive and MapReduce.

Apache Drill

Launched by Apache in August 2012, Drill lets users query Hadoop, NoSQL databases, and cloud storage services with SQL-based queries. It can run on a cluster of thousands of nodes and can process petabytes of data or trillions of records in seconds. It can be used for data mining and ad hoc queries, and it supports a wide range of data stores, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, and Swift.


Phoenix

A Java middle tier that lets developers run SQL queries on Apache HBase. Phoenix is written entirely in Java and provides a JDBC driver that clients can embed. The Phoenix query engine translates a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set.


Pig

A programming language that simplifies common Hadoop tasks. Pig can load data, transform it, and store the final results. Its main role is to provide a scripting layer over the MapReduce framework, similar in spirit to the SQL statements most of us are familiar with.

Hive

A Hadoop-based data warehousing tool that maps structured data files onto database tables and provides simple SQL query functionality by converting SQL statements into MapReduce tasks. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
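
To see how a SQL statement can become a MapReduce task, consider a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept`. The sketch below spells out the map, shuffle, and reduce phases for that one query by hand in plain Python. It is an illustration of the idea, not Hive's actual query planner, and the `employees` rows and `dept` column are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1

def shuffle(pairs):
    """Shuffle: bring together all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: apply the aggregate (COUNT, here a sum of ones) to each group."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_dept(rows):
    """Equivalent of: SELECT dept, COUNT(*) FROM employees GROUP BY dept."""
    return reduce_phase(shuffle(map_phase(rows)))
```

The user writes only the one-line SQL statement; the three phases are what would otherwise have to be written as a dedicated MapReduce program.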


Spark SQL

The successor to Shark. Spark SQL discarded the original Shark code but kept some of its advantages, such as in-memory columnar storage and Hive compatibility. Because it no longer depends on Hive, Spark SQL has become much easier to work with in terms of data compatibility, performance optimization, and component extension.


Stinger

Originally called Tez, this is the next generation of Hive. Its development is led by Hortonworks, and it runs on a DAG computing framework on YARN. In some tests Stinger improves performance by up to 10 times, while also letting Hive support more SQL.


Tajo

The goal is to build a reliable, distributed data warehouse system that supports relational data on top of HDFS. It focuses on low-latency, scalable ad hoc queries and online data aggregation, and it also provides tools for more traditional ETL.


Impala

Cloudera claims that the SQL-based Impala database is "the leading open source analytics database for Apache Hadoop." It can be downloaded as a standalone product and is also part of Cloudera's commercial big data products. Cloudera Impala provides fast, interactive SQL queries directly on Hadoop data stored in HDFS or HBase.


Elasticsearch

A Lucene-based search server. It provides a distributed, multi-tenant-capable full-text search engine with a RESTful web interface. Elasticsearch is a popular enterprise-class search engine developed in Java and released as open source under the Apache license. Designed for cloud computing, it offers real-time search and is stable, reliable, fast, and easy to install and use.


Solr

A highly reliable, highly scalable enterprise search platform based on Apache Lucene. Well-known users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg, and Travelocity.

Shark

Shark is essentially Hive on Spark. It uses Hive's HQL parser to translate HQL into RDD operations on Spark, uses Hive's metadata to look up table information in the database, and reads the actual HDFS data and files into Spark for processing. Shark is fast and fully compatible with Hive. In shell mode, an API such as rdd2sql() lets you keep working with an HQL result set in the Scala environment, so you can write your own simple machine learning or analysis functions to process HQL results further.


Lucene

The Java-based Lucene library can perform full-text searches very quickly. According to the official website, it can index more than 150 GB of data per hour on modern hardware, and it offers powerful and efficient search algorithms.
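
Full-text search engines of this kind are generally built around an inverted index, which maps each term to the documents that contain it. The following plain-Python sketch shows the idea at toy scale. It is not Lucene's API, and the tokenizer (lowercase alphanumeric runs) and AND-only query semantics are simplifying assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Answering a query then costs a few set lookups and intersections rather than a scan over every document, which is what makes searches fast once the index has been built.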


Terracotta

Terracotta claims that its BigMemory technology is "the world's premier in-memory data management platform". It supports simple, scalable, real-time messaging, and the company claims 2.1 million developers in 190 countries and deployments at 1,000 companies around the world.


Ignite

A high-performance, integrated, distributed in-memory platform for real-time computation and processing of large-scale datasets, several orders of magnitude faster than traditional disk-based or flash-based technology. The platform includes a data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, a file system, messaging, events, and data structures.

GemFire

Pivotal announced that it would open the source code for key components of its big data suite, including the GemFire in-memory NoSQL database. It submitted a proposal to the Apache Software Foundation to manage the core engine of the GemFire database under the name "Geode".


GridGain

GridGain, powered by Apache Ignite, provides in-memory data structures for fast processing of big data, along with a Hadoop accelerator based on the same technology.


MongoDB

A database based on distributed file storage, written in C++. It is designed to provide scalable, high-performance data storage for web applications. It sits between relational and non-relational databases: among non-relational databases it is the most feature-rich and the one that most resembles a relational database.

Source: Aisnil
