A large number of new technologies emerge in the big data field every year, providing effective means for storing, processing, analyzing, and visualizing big data. Big data technology can uncover the information and knowledge hidden in large-scale data, provide a basis for social and economic activities, improve operational efficiency in many fields, and even raise the degree of intensification of the whole social economy.
The underlying infrastructure consists of computing resources, memory and storage, and network interconnects, embodied in compute nodes, clusters, enclosures, and data centers. Above this sits data storage and management, including file systems, databases, and resource management systems such as YARN. Next comes the computational processing layer, such as Hadoop, MapReduce, and Spark, and the various computing paradigms built on top of it, such as batch processing, stream processing, and graph computation, including computational models derived from programming models such as BSP and GAS. Data analysis and visualization build on the computational processing layer. Analysis ranges from simple query analysis and stream analysis to more complex analysis (such as machine learning and graph computation). Query analysis is based on table structures and relational functions; stream analysis is based on data and event streams and simple statistics; complex analysis relies on richer data structures and methods such as graphs, matrices, iterative computation, and linear algebra. In the general sense, visualization is a presentation of the results of analysis, but interactive visualization also lets users pose exploratory questions, obtain new clues for analysis, and form an iterative loop of analysis and visualization. Real-time interactive visual analysis over large-scale data, and the introduction of automation into this process, are current research hotspots.
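As a concrete illustration of the BSP-style graph computation mentioned above, the sketch below computes single-source shortest paths with GraphX's Pregel operator, where each superstep merges incoming messages at the vertices and then sends improved distances along the edges. The tiny edge list, vertex IDs, and local master are made up so the snippet is self-contained; this is only a minimal sketch, not a prescribed implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// BSP-style graph computation: single-source shortest paths via GraphX's Pregel operator.
object BspShortestPath {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BspSSSP").setMaster("local[*]"))

    // Toy weighted edge list (placeholder data).
    val edges  = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
    val source = 1L
    val graph  = Graph.fromEdges(edges, Double.PositiveInfinity)
      .mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)

    // Each superstep: vertices merge incoming distances, then send improved distances to neighbours.
    val sssp = graph.pregel(Double.PositiveInfinity)(
      (_, dist, msg) => math.min(dist, msg),                        // vertex program
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) // send message
        else Iterator.empty,
      (a, b) => math.min(a, b))                                     // merge messages

    sssp.vertices.collect().foreach(println)
    sc.stop()
  }
}
```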
Two areas cut vertically across the layers above and need to be viewed holistically and collaboratively. One is programming and management tools; the trend here is for the machine to optimize itself automatically through learning, requiring as little programming and as little complicated configuration as possible. The other is data security, which likewise runs through the entire technology stack. Besides these two vertical areas, some technical directions span multiple layers; for example, "in-memory computing" actually covers the whole stack.
2. Big data technology ecosystem
The basic processing flow of big data does not differ much from the traditional data processing flow. The main difference is that, because big data must handle large volumes of unstructured data, parallel processing can be applied at every step. Distributed processing frameworks such as Hadoop MapReduce and Spark have become the common approach for every stage of big data processing.
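As a minimal sketch of this parallel, MapReduce-style processing pattern, the Scala/Spark word count below reads lines in parallel, maps them to (word, 1) pairs, and reduces by key. The input and output paths and the local master are placeholders, not values from the text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal word-count sketch of the map/reduce parallel-processing pattern.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    sc.textFile("input.txt")              // read lines in parallel (local or HDFS path)
      .flatMap(_.split("\\s+"))           // map: split each line into words
      .map(word => (word, 1))             // emit (word, 1) pairs
      .reduceByKey(_ + _)                 // reduce: sum the counts per word
      .saveAsTextFile("word-counts")      // each partition is written in parallel

    sc.stop()
  }
}
```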
Hadoop is a distributed computing platform that users can easily build on and use; applications that process massive amounts of data can be developed and run on it with little effort. Hadoop is a data management system that, as the core of data analysis, brings together the structured and unstructured data scattered across every level of the traditional enterprise data stack. It is also a massively parallel processing framework with supercomputing-class capability, positioned to drive the execution of enterprise applications. Hadoop is, in addition, an open source community providing tools and software for solving big data problems. Although Hadoop itself offers many capabilities, it is better regarded as a Hadoop ecosystem of many components covering data storage, data integration, data processing, and specialized data analysis tools. The ecosystem is mainly composed of core components such as HDFS, MapReduce, HBase, ZooKeeper, Oozie, Pig, and Hive, together with frameworks such as Sqoop and Flume for integration with other enterprise systems. The ecosystem also keeps growing, adding projects such as Mahout, Ambari, Whirr, and BigTop.
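A small sketch of programmatic access to HDFS, the ecosystem's storage layer, using the Hadoop FileSystem API from Scala. The file path is a placeholder, and fs.defaultFS is assumed to be configured (via core-site.xml on the classpath) to point at the cluster's NameNode; this is only an illustrative sketch.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write a small file to HDFS and read it back.
object HdfsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)

    val path = new Path("/tmp/hdfs-demo.txt")   // placeholder path
    val out  = fs.create(path, true)            // overwrite if it already exists
    out.writeBytes("hello hdfs\n")
    out.close()

    val status = fs.getFileStatus(path)
    val buf    = new Array[Byte](status.getLen.toInt)
    val in     = fs.open(path)
    in.readFully(buf)
    in.close()
    println(new String(buf, "UTF-8"))
    fs.close()
  }
}
```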
Low cost, high reliability, high scalability, high efficiency, and high fault tolerance have made Hadoop the most popular big data analysis system. However, the HDFS and MapReduce components it relies on have also put it in a difficult position: the batch style of execution makes it suitable only for offline data processing and of little use in scenarios that demand real-time responses. As a result, a variety of Hadoop-based tools have emerged. To reduce management cost and improve resource utilization, a number of unified resource management and scheduling systems have appeared, such as Apache Mesos (used heavily at Twitter), Apache YARN, Google's Borg, Tencent's Torca, and Facebook's Corona (open source). Apache Mesos, an open source project from the Apache incubator, uses ZooKeeper for fault-tolerant replication and Linux Containers to isolate tasks, supports multi-resource scheduling (memory and CPU), and provides efficient resource isolation and sharing across distributed applications and frameworks, supporting Hadoop, MPI, Hypertable, Spark, and more. YARN, also known as MapReduce 2.0, drew on Mesos and introduced the Container as its resource isolation mechanism, providing isolation of Java virtual machine memory. Compared with MapReduce 1.0, developers work with a ResourceManager, ApplicationMaster, and NodeManager instead of the JobTracker and TaskTracker at the core of the original framework. Multiple computing frameworks such as MapReduce, Tez, Storm, and Spark can run on the YARN platform.
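To make the ResourceManager/NodeManager split concrete, the sketch below uses the YarnClient API to ask the ResourceManager for the NodeManagers it knows about and the resources (memory, vcores) each one offers. It assumes a reachable YARN cluster whose address is taken from yarn-site.xml; it is a sketch, not a complete application submission.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

// List the NodeManagers registered with the ResourceManager and their capacities.
object YarnNodes {
  def main(args: Array[String]): Unit = {
    val conf   = new YarnConfiguration()        // reads yarn-site.xml for the ResourceManager address
    val client = YarnClient.createYarnClient()
    client.init(conf)
    client.start()

    for (node <- client.getNodeReports().asScala) {
      println(s"${node.getNodeId} " +
        s"memory=${node.getCapability.getMemory}MB vcores=${node.getCapability.getVirtualCores}")
    }
    client.stop()
  }
}
```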
Driven by business needs for real-time processing, there are Storm and Cloudera Impala for online processing, Spark for iterative computation, and the stream processing framework S4. Storm is a distributed, fault-tolerant real-time computation system developed by BackType, which was later acquired by Twitter. As a stream processing platform, Storm is mostly used for real-time computation and updating databases. It can also be used for "continuous computation", running continuous queries over data streams and emitting the results to users as a stream, and for "distributed RPC", running expensive operations in parallel. Cloudera Impala is an open source massively parallel processing (MPP) query engine developed by Cloudera. Sharing the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, it provides fast, interactive SQL queries directly over data in HDFS or HBase. Inspired by Dremel, Impala no longer uses the slow Hive-on-MapReduce batch path but a distributed query engine similar to those found in commercial parallel relational databases (composed of a Query Planner, a Query Coordinator, and a Query Exec Engine), so data in HDFS or HBase can be queried directly with SELECT, JOIN, and aggregate functions, greatly reducing latency.
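A hedged sketch of the kind of interactive SQL Impala serves: because Impala speaks the HiveServer2 protocol, it can be queried over JDBC with the Hive driver. The host, the port (21050 is Impala's usual HiveServer2-compatible port), and the table and column names below are illustrative assumptions, not details taken from the text.

```scala
import java.sql.DriverManager

// Interactive SQL against Impala over JDBC (hypothetical host, database, and tables).
object ImpalaQuery {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")   // Hive JDBC driver, HiveServer2 protocol
    val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/default;auth=noSasl")
    val stmt = conn.createStatement()

    // SELECT + JOIN + aggregation evaluated directly over data stored in HDFS/HBase.
    val rs = stmt.executeQuery(
      """SELECT u.region, COUNT(*) AS clicks
        |FROM clicks c JOIN users u ON c.user_id = u.id
        |GROUP BY u.region""".stripMargin)
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")

    rs.close(); stmt.close(); conn.close()
  }
}
```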
The Hadoop community is working hard to extend the existing computing models, frameworks, and platform to address the shortcomings of the current version in computing performance, computing model, system architecture, and processing capability; this is the goal of the Hadoop 2.0 "YARN" release. Various computing modes can also be combined with in-memory computing to achieve highly real-time big data query and analysis. The representative of this mixed computing model is the Spark ecosystem developed by UC Berkeley's AMPLab. Spark is an open source, general-purpose cluster computing framework in the spirit of Hadoop MapReduce, used to build large-scale, low-latency data analysis applications on top of HDFS. Spark provides a powerful in-memory computing engine that covers almost all typical big data computing models, including iterative computation, batch computation, in-memory computation, stream computation (Spark Streaming), data query and analysis (Shark), and graph computation (GraphX). Spark is written in Scala and uses it as its application framework, with a memory-resident distributed dataset abstraction that optimizes iterative workloads and interactive queries. Unlike Hadoop, Spark is tightly integrated with Scala, which lets distributed datasets be manipulated like local collection objects. Spark supports iterative jobs on distributed datasets and can run alongside Hadoop on Hadoop file systems (via YARN, Mesos, etc.). In addition, driven by work on performance, compatibility, and data types, there are other open source solutions such as Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari. For a long time to come, the mainstream Hadoop platform is expected to keep improving and to coexist with various new computing models and systems, together forming the new generation of big data processing systems and platforms.
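To illustrate why a memory-resident distributed dataset helps iterative workloads, the sketch below fits a one-parameter model by gradient descent over a cached RDD: the data is materialized once, and every subsequent pass reads it from memory rather than from storage. The synthetic data, learning rate, and iteration count are made up for illustration; this is a minimal sketch of the pattern, not a production algorithm.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

// Iterative computation on a cached in-memory dataset: toy 1-D fit of y ≈ w * x.
object IterativeFit {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeFit").setMaster("local[*]"))

    // Synthetic (x, y) points with y = 3x plus noise, cached once and reused every iteration.
    val points = sc.parallelize(1 to 100000).map { _ =>
      val x = Random.nextDouble()
      (x, 3.0 * x + Random.nextGaussian() * 0.1)
    }.cache()
    val n = points.count()

    var w = 0.0
    for (_ <- 1 to 20) {                                   // each pass hits the cached RDD
      val grad = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= 1.0 * grad                                      // simple gradient-descent step
    }
    println(s"fitted w = $w (expected around 3.0)")
    sc.stop()
  }
}
```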
3. Big data collection and preprocessing
In the life cycle of big data, data collection is the first step. Classified by the application systems that generate the data, big data collection has four main sources: management information systems, Web information systems, physical information systems, and scientific experiment systems. Different data sets may exhibit heterogeneity in structure and schema, appearing for example as flat files, XML trees, or relational tables. Multiple heterogeneous data sets therefore require further integration or consolidation: data from the different sources are collected, sorted, cleaned, and converted into a new data set that provides a uniform data view for subsequent query and analysis. A great deal of research has been done on heterogeneous database integration for management information systems, entity resolution and Deep Web integration for Web information systems, and data fusion for sensor networks, and substantial progress has been made; a variety of data cleaning and quality control tools have also appeared, for example DataFlux from SAS, DataStage from IBM, and Informatica PowerCenter from Informatica.
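A sketch of the collect-sort-clean-convert step under assumed inputs: two heterogeneous customer sources (a CSV export from a management information system and JSON records from a Web information system, with made-up paths and field names) are mapped onto one schema, cleaned, deduplicated, and written out as a uniform data view. Spark SQL is used here only as one convenient way to express the integration; it is not prescribed by the text.

```scala
import org.apache.spark.sql.SparkSession

// Integrate two heterogeneous sources into one cleaned, uniform customer view.
object IntegrateSources {
  case class Customer(id: String, name: String, email: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IntegrateSources").master("local[*]").getOrCreate()
    import spark.implicits._

    // Source 1: CSV export (hypothetical path and columns).
    val fromCsv = spark.read.option("header", "true").csv("data/crm_export.csv")
      .select($"cust_id".as("id"), $"full_name".as("name"), $"email")
      .as[Customer]

    // Source 2: JSON records (hypothetical path and nested fields).
    val fromJson = spark.read.json("data/web_signups.json")
      .select($"userId".cast("string").as("id"),
              $"displayName".as("name"),
              $"contact.email".cast("string").as("email"))
      .as[Customer]

    // Collect, clean, convert, and deduplicate into one uniform data view.
    val unified = fromCsv.union(fromJson)
      .filter(c => c.email != null && c.email.contains("@"))   // basic cleaning rule
      .map(c => c.copy(email = c.email.trim.toLowerCase))      // normalization
      .dropDuplicates("id")

    unified.write.mode("overwrite").parquet("data/customers_unified")
    spark.stop()
  }
}
```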