The basic processing flow of big data does not differ much from the traditional data processing flow. The main difference is that, because big data involves large volumes of unstructured data, parallel processing can be applied at every stage. At present, distributed frameworks such as Hadoop, MapReduce, and Spark have become the common tools throughout big data processing.
Hadoop is a distributed computing platform that users can easily build and use; on it, users can readily develop and run applications that process massive amounts of data. Hadoop is a data management system: as the core of data analysis, it gathers structured and unstructured data spread across every layer of the traditional enterprise data stack. Hadoop is also a massively parallel processing framework with enormous computing power, positioned to drive the execution of enterprise-level applications. Hadoop is, furthermore, an open source community that mainly provides tools and software for solving big data problems. Although Hadoop offers many functions, it is best viewed as a Hadoop ecosystem made up of multiple components, covering data storage, data integration, data processing, and other specialized tools for data analysis. The Hadoop ecosystem consists mainly of core components such as HDFS, MapReduce, HBase, ZooKeeper, Oozie, Pig, and Hive, and also includes frameworks such as Sqoop and Flume for integrating with other enterprise systems. The ecosystem continues to grow as well, adding Mahout, Ambari, Whirr, BigTop, and other projects that provide newer functionality.
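To make the division of roles in the ecosystem concrete, the sketch below writes a small file to HDFS and reads it back through Hadoop's FileSystem API, called here from Scala. The NameNode address and the file path are placeholder assumptions, not values from any particular cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    // Assumed NameNode address; replace with the cluster's actual fs.defaultFS.
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)

    // Write a small file; HDFS transparently replicates its blocks across DataNodes.
    val path = new Path("/tmp/hello.txt")
    val out = fs.create(path, true)
    out.writeBytes("hello hadoop\n")
    out.close()

    // Read the file back.
    val in = fs.open(path)
    val buf = new Array[Byte](in.available())
    in.readFully(buf)
    println(new String(buf, "UTF-8"))
    in.close()
    fs.close()
  }
}
```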
Low cost, high reliability, high scalability, high efficiency, and high fault tolerance have made Hadoop the most popular big data analysis system. However, the HDFS and MapReduce components on which it is built once put it in a difficult position: batch processing is suited only to offline data processing and is of little use in scenarios that demand real-time responses. Various tools built on Hadoop therefore emerged. To reduce management costs and improve resource utilization, there are now many unified resource management and scheduling systems, such as Apache Mesos (used at Twitter), Apache YARN, Google's Borg, Tencent's Torca, and Facebook's Corona (open source). Apache Mesos, an open source project from the Apache incubator, uses ZooKeeper for fault-tolerant replication, uses Linux Containers to isolate tasks, and supports the allocation of multiple resource types (memory and CPU). It provides efficient resource isolation and sharing across distributed applications and frameworks, supporting Hadoop, MPI, Hypertable, Spark, and others. YARN, also known as MapReduce 2.0, draws on Mesos and introduces the container as its resource isolation mechanism, providing isolation of Java virtual machine memory. Compared with MapReduce 1.0, the ResourceManager, ApplicationMaster, and NodeManager replace the JobTracker and TaskTracker at the core of the original framework. Multiple computing frameworks, such as MapReduce, Tez, Storm, and Spark, can run on the YARN platform.
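As a rough illustration of how a framework-side client interacts with YARN's ResourceManager, the sketch below uses the YarnClient API to list NodeManagers and the applications currently running in containers. It assumes a yarn-site.xml on the classpath that points at the cluster's ResourceManager; it is an illustrative sketch, not part of any particular framework.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object YarnInfoSketch {
  def main(args: Array[String]): Unit = {
    // Reads yarn-site.xml from the classpath; the ResourceManager address comes from there.
    val conf = new YarnConfiguration()
    val yarn = YarnClient.createYarnClient()
    yarn.init(conf)
    yarn.start()

    // Each NodeManager reports the resources used by its containers to the ResourceManager.
    for (node <- yarn.getNodeReports().asScala)
      println(s"${node.getNodeId}: ${node.getUsed} used of ${node.getCapability}")

    // Applications (MapReduce, Tez, Spark, and so on) appear here with their current state.
    for (app <- yarn.getApplications().asScala)
      println(s"${app.getApplicationId} ${app.getName} ${app.getYarnApplicationState}")

    yarn.stop()
  }
}
```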
Driven by the real-time requirements of the business, there are Storm and Cloudera Impala, which support online processing; Spark, which supports iterative computation; and the stream processing framework S4.
Storm is a distributed, fault-tolerant real-time computing system developed by BackType and later acquired by Twitter. Storm is a stream processing platform, mostly used for real-time computation and for updating databases. Storm can also be used for "continuous computation", running standing queries over data streams and emitting results to users as a stream while the computation proceeds, and for "distributed RPC", running expensive operations in parallel.
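A minimal sketch of a Storm topology is shown below, assuming the org.apache.storm package layout of recent Storm releases: a toy spout stands in for a real data feed, and a bolt splits each sentence into words, illustrating the spout-and-bolt wiring behind continuous computation. The spout, bolt, and sentence text are illustrative stand-ins rather than a production design.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import org.apache.storm.tuple.{Fields, Tuple, Values}
import java.util.{Map => JMap}

// Toy spout standing in for a real feed: emits the same sentence over and over.
class SentenceSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _
  override def open(conf: JMap[String, Object], ctx: TopologyContext, out: SpoutOutputCollector): Unit =
    collector = out
  override def nextTuple(): Unit = {
    Thread.sleep(100)
    collector.emit(new Values("storm processes streams of tuples"))
  }
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("sentence"))
}

// Bolt: splits each incoming sentence into words and emits them downstream.
class SplitBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    input.getString(0).split("\\s+").foreach(w => collector.emit(new Values(w)))
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("word"))
}

object StormSketch {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder()
    builder.setSpout("sentences", new SentenceSpout, 1)
    builder.setBolt("split", new SplitBolt, 2).shuffleGrouping("sentences")
    // Run locally for testing; StormSubmitter.submitTopology would deploy to a cluster.
    new LocalCluster().submitTopology("word-split", new Config, builder.createTopology())
  }
}
```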
Cloudera Impala, developed by Cloudera, is an open source massively parallel processing (MPP) query engine. It shares the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, and provides fast, interactive SQL queries directly on data in HDFS or HBase. Inspired by Dremel, Impala does not use the slow Hive-on-MapReduce batch path; instead, through a distributed query engine (composed of a Query Planner, Query Coordinator, and Query Exec Engine) similar to those in commercial parallel relational databases, it lets users query data directly from HDFS or HBase with SELECT, JOIN, and statistical functions, greatly reducing latency.
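Because Impala speaks the same protocol as HiveServer2, a client can issue interactive SQL to it over JDBC. The sketch below assumes the Hive JDBC driver, the default impalad port 21050, and an unsecured cluster; the host, database, tables, and columns are placeholders.

```scala
import java.sql.DriverManager

object ImpalaQuerySketch {
  def main(args: Array[String]): Unit = {
    // Assumed setup: Hive JDBC driver connecting to an impalad on the default port 21050.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/default;auth=noSasl")
    val stmt = conn.createStatement()

    // SELECT + JOIN + aggregate, executed directly over data stored in HDFS or HBase.
    val rs = stmt.executeQuery(
      """SELECT c.region, COUNT(*) AS orders
        |FROM orders o JOIN customers c ON o.customer_id = c.id
        |GROUP BY c.region""".stripMargin)
    while (rs.next())
      println(s"${rs.getString("region")}: ${rs.getLong("orders")}")

    rs.close(); stmt.close(); conn.close()
  }
}
```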
The Hadoop community is working hard to extend the existing computing models, frameworks, and platform in order to remedy the many deficiencies of existing versions in computing performance, computing models, system architecture, and processing capability. This is precisely the goal of YARN, introduced in Hadoop 2.0. Various computing modes can also be combined with in-memory computing to achieve highly real-time query, computation, and analysis of big data.
The leading example of the hybrid computing model is the Spark ecosystem developed by UC Berkeley AMPLab. Spark is an open source, Hadoop MapReduce-like general-purpose cluster computing framework for data analysis, used to build large-scale, low-latency data analysis applications on top of HDFS. Spark provides a powerful in-memory computing engine that covers almost all typical big data computing modes, including iterative computing, batch computing, in-memory computing, streaming computing (Spark Streaming), data query and analysis (Shark), and graph computing (GraphX).
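For the batch and in-memory computing modes, a minimal Spark word count in Scala looks like the sketch below; the HDFS input path is a placeholder, and "local[*]" is used only so the example can run without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs on all local cores for testing; on a cluster the master would be YARN or Mesos.
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    // Placeholder HDFS input path.
    val counts = sc.textFile("hdfs:///data/logs/*.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // shuffled and aggregated in memory across the cluster

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```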
Spark uses Scala as its application framework, works on distributed in-memory data sets, and is optimized for iterative workloads and interactive queries.
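The benefit for iterative workloads comes from keeping a distributed data set cached in memory, so that each pass avoids re-reading HDFS. A small illustrative sketch, with a placeholder input path and an arbitrary per-iteration filter:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[*]"))

    // Parse once, then keep the distributed data set in memory; every later pass reuses it
    // instead of re-reading HDFS, which is where Spark's advantage for iteration comes from.
    val points = sc.textFile("hdfs:///data/points.csv") // placeholder path
      .map(_.split(",").map(_.toDouble))
      .cache()

    for (i <- 1 to 10) {
      val threshold = i.toDouble // a new cutoff each pass
      // Each iteration is a cheap in-memory pass over the cached RDD.
      val kept = points.filter(p => p.sum > threshold).count()
      println(s"iteration $i: $kept points have coordinate sum above $threshold")
    }
    sc.stop()
  }
}
```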
Unlike Hadoop, Spark is tightly integrated with Scala, which lets distributed data sets be manipulated like local collection objects. Spark supports iterative tasks on distributed data sets and can run alongside Hadoop on Hadoop file systems (through YARN, Mesos, and similar resource managers). In addition, research on performance, compatibility, and data types has produced other open source solutions such as Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari. For a considerable time to come, the mainstream Hadoop platform is expected to coexist and integrate with these new computing models and systems, forming a new generation of big data processing systems and platforms.