4. Big data storage and management
Traditional data storage and management were built around structured data, for which relational database management systems (RDBMSs) could meet the needs of most applications. Big data, by contrast, is dominated by semi-structured and unstructured data, with structured data playing a supplementary role, and big data applications typically involve content retrieval, cross-comparison, deep mining, and comprehensive analysis across these different data types. Faced with such requirements, traditional databases fall short both technically and functionally, and recent years have therefore seen the coexistence of OldSQL, NoSQL, and NewSQL systems. In general, the storage and management of big data follow different technical routes depending on the data type, which can be roughly divided into three categories.
(1) The first category targets large scale structured data. For such data, a new type of database cluster is typically used: column storage or hybrid row-column storage with coarse-grained indexing, combined with the efficient distributed computing of the MPP (Massively Parallel Processing) architecture, realizes the storage and management of PB-scale data. Such clusters offer high performance and high scalability and have been widely adopted in enterprise analytics applications.
(2) The second category targets semi-structured and unstructured data. For these scenarios, platforms based on the open source Hadoop architecture are the better fit; they store and manage semi-structured and unstructured data by extending and encapsulating the Hadoop ecosystem.
(3) The third category targets mixed structured and unstructured big data. Here, a hybrid of MPP parallel database clusters and Hadoop clusters is used to store and manage data at the scale of hundreds of petabytes up to exabytes. MPP manages the high-quality structured data and provides powerful SQL and OLTP-type services, while Hadoop handles the semi-structured and unstructured data to support new applications such as content retrieval, deep mining, and comprehensive analysis. This hybrid model is likely to be the future of big data storage and management.
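To make the column-storage idea above concrete, the toy sketch below (purely illustrative, not any particular MPP product's format) contrasts a row-oriented layout with a column-oriented one: an analytic query that scans a single attribute only has to touch one contiguous array in the columnar layout.

```python
# Toy contrast of row-oriented vs column-oriented storage layouts.
# Purely illustrative; real MPP databases add compression, coarse-grained
# indexes, and distribute these blocks across many nodes.

rows = [  # row store: one full record per entry
    {"id": 1, "region": "east", "sales": 120.0},
    {"id": 2, "region": "west", "sales": 340.0},
    {"id": 3, "region": "east", "sales": 75.5},
]

columns = {  # column store: one contiguous array per attribute
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [120.0, 340.0, 75.5],
}

# Analytic query: total sales. The row store must touch every full record;
# the column store scans only the "sales" array.
total_row_store = sum(r["sales"] for r in rows)
total_col_store = sum(columns["sales"])
assert total_row_store == total_col_store
```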
5. Big data computing models and systems
The emergence of the MapReduce computing model strongly promoted the development of big data technologies and applications, and it became the most successful and most widely accepted mainstream computing model for big data processing. However, real-world big data processing problems are complex and diverse, and it is difficult for any single computing model to cover all the different big data computing needs. Research and practical applications have shown that MapReduce is mainly suited to offline batch processing of big data and performs poorly on problems that demand low latency, complex data relationships, or complex computation. Therefore, in recent years, academia and industry have continuously researched and introduced a variety of different big data computing models.
A big data computing model is a high-level abstraction built from the diverse computing problems and requirements of big data, based on the distinct data characteristics and computational characteristics involved. Examples include the parallel computing abstraction of MapReduce, the distributed memory abstraction RDD (Resilient Distributed Dataset) in the well-known Spark system from the University of California, Berkeley, and the graph-parallel abstraction in Carnegie Mellon University's well-known graph computing system GraphLab. Traditional parallel computing methods mainly define lower-level parallel abstractions and models from the perspective of architecture and programming language; big data processing problems, however, exhibit many high-level data and computational characteristics, so big data processing calls for computing models that capture these high-level characteristics.
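As a concrete illustration of what such an abstraction looks like from the programmer's side, the sketch below (a minimal single-process imitation, not a distributed implementation) expresses word counting through the two MapReduce primitives: the user supplies only a map function and a reduce function, and the framework owns the shuffle and grouping.

```python
from collections import defaultdict

# Minimal single-process imitation of the MapReduce abstraction:
# the user writes map_fn() and reduce_fn(); the "framework" below does
# the grouping (shuffle) that a real cluster performs across nodes.

def map_fn(line):
    # Emit (key, value) pairs: one (word, 1) per word.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Combine all values that share a key.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:              # map phase
        for key, value in map_fn(record):
            groups[key].append(value)   # shuffle: group by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

lines = ["big data computing", "big data storage", "data analysis"]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('big', 2), ('data', 3), ('computing', 1), ('storage', 1), ('analysis', 1)]
```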
Driven by the diversity of big data processing needs along the feature dimensions above, a variety of typical and important big data computing models have emerged, each matched by corresponding big data computing systems and tools. Since the description of a computing model alone is rather abstract, the typical systems and tools corresponding to each model are given alongside it, as shown in Table 1. This helps in understanding both the computing models and the current state of the technology, and further facilitates selecting suitable computing technologies and system tools in actual big data processing applications.
6. Big data analysis and visualization
In the era of big data, people are eager to run high-performance, machine-learning-based data analysis on large scale clusters of commodity machines, providing services and guidance for real business and ultimately monetizing the data. Unlike traditional online analytical processing (OLAP), the in-depth analysis of big data relies mainly on large scale machine learning. In general, the training process of a machine learning model can be summarized as optimizing an objective function over large scale training data, implemented through a loop-iterative algorithm, as shown in Figure 4. Compared with traditional OLAP, big data analysis based on machine learning therefore has its own distinctive characteristics.
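As a minimal sketch of this loop-iterative training pattern (ordinary batch gradient descent on a least-squares objective; the data and step size are toy values chosen for illustration):

```python
# Minimal sketch of loop-iterative training: minimize the least-squares
# objective f(w) = sum_i (w * x_i - y_i)^2 by batch gradient descent.
# Data, step size, and stopping rule are toy values for illustration.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x_i, y_i) pairs
w, step, tol = 0.0, 0.05, 1e-6

for iteration in range(1000):
    grad = sum(2 * (w * x - y) * x for x, y in data)  # d f / d w
    w_new = w - step * grad                            # descend the gradient
    if abs(w_new - w) < tol:   # parameters approach the optimum gradually;
        break                  # there is no closed-form, one-shot solution
    w = w_new

print(f"converged after {iteration} iterations, w = {w:.4f}")
```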
(1) Iterative: since the optimization problem usually has no closed-form solution, the model parameters are not determined in one step; loop iterations are needed to gradually approach the optimal point.
(2) Fault tolerance: the algorithm design and model evaluation of machine learning tolerate non-optimal intermediate points, and the multi-iteration character of training allows some errors during the loop without affecting the final convergence of the model.
(3) Non-uniform parameter convergence: some model parameters stop changing after only a few iterations, while others take a long time to converge, as sketched below.
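To illustrate characteristic (3), the toy sketch below (hypothetical; real systems use far more sophisticated update scheduling) tracks convergence per parameter during coordinate-wise gradient descent and stops updating coordinates that have individually converged, so the remaining work goes to the slow coordinates:

```python
# Toy illustration of non-uniform parameter convergence: minimize
# f(w) = sum_i c_i * (w_i - t_i)^2 by per-coordinate gradient descent.
# High-curvature coordinates converge quickly; a per-parameter check
# lets us stop touching them early. All values are made up.

targets = [1.0, -2.0, 0.5]      # t_i: each coordinate's optimum
curvatures = [9.0, 1.0, 0.1]    # c_i: controls convergence speed
w = [0.0, 0.0, 0.0]
step, tol = 0.1, 1e-8
active = [True, True, True]     # parameters still being updated
frozen_at = [None, None, None]

for it in range(1, 2001):
    for i in range(len(w)):
        if not active[i]:
            continue
        update = step * 2 * curvatures[i] * (w[i] - targets[i])
        w[i] -= update
        if abs(update) < tol:   # this coordinate has individually converged
            active[i] = False
            frozen_at[i] = it
    if not any(active):
        break

print(frozen_at)  # high-curvature coordinates freeze after far fewer rounds
```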
These characteristics mean that the design of an ideal big data analysis system differs greatly from that of other computing systems. If a traditional distributed computing system is applied directly to big data analysis, a large proportion of resources is wasted on non-effective computation such as communication, waiting, and coordination.
The traditional distributed computing framework MPI (Message Passing Interface) is flexible and powerful, but because its programming interface is complex and its fault-tolerance support is weak, it cannot well support complex operations on large scale data. A series of distributed computing frameworks with simpler interfaces and better fault tolerance have therefore been developed to serve big data analysis algorithms, including MapReduce, Spark, and the parameter server (Parameter Server).
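For a sense of why MPI's interface is considered low-level, the fragment below (using the real mpi4py binding; launched with, e.g., `mpirun -n 2 python script.py`) shows that even a trivial exchange requires the programmer to orchestrate explicit point-to-point sends and receives, with no built-in recovery if a rank fails:

```python
from mpi4py import MPI  # Python binding for the MPI standard

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Even a trivial exchange is spelled out as explicit point-to-point
# messages; if either rank dies mid-run, MPI offers no automatic recovery.
if rank == 0:
    comm.send({"partial_sum": 42.0}, dest=1, tag=11)
    result = comm.recv(source=1, tag=12)
    print("rank 0 received:", result)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    comm.send({"ack": data["partial_sum"] * 2}, dest=0, tag=12)
```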
The distributed computing framework MapReduce abstracts data processing into two primitive operations, Map and Reduce, which simplifies the programming interface and improves the fault tolerance of the system. However, MapReduce suffers from this oversimplified data operation abstraction and does not support loop iteration, so its support for machine learning algorithms is poor. Mahout, a distributed machine learning library built on MapReduce, has to decompose an iterative algorithm into multiple consecutive Map and Reduce operations, passing each round's results to the next by reading and writing HDFS files. In this process a large share of the training time goes to disk reads and writes, and training is very inefficient. To solve these problems of MapReduce, Spark defines, on top of the RDD abstraction, a richer set of data operations that includes Map and Reduce. Unlike MapReduce, intermediate job outputs and results can be kept in memory, eliminating the need to read and write HDFS between rounds. These features make Spark better suited to iterative big data analysis algorithms such as data mining and machine learning. MLlib, the machine learning library built on Spark, has shown clear advantages over Mahout and is widely used in practical applications.
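A brief PySpark fragment (assuming a local Spark installation; the data and iteration count are toy values) shows the pattern that makes this possible: the training set is cached in memory once and then reused across loop iterations without touching HDFS.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# Toy training data: (x, y) pairs. cache() keeps the RDD in memory so
# each iteration below rereads it from RAM, not from HDFS.
points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]).cache()

w = 0.0
for _ in range(20):  # the loop iteration lives in the driver program
    grad = points.map(lambda p: 2 * (w * p[0] - p[1]) * p[0]) \
                 .reduce(lambda a, b: a + b)
    w -= 0.05 * grad

print("fitted w:", w)
sc.stop()
```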
In recent years, with the rapid growth of the scale of data to be analyzed, the number of model parameters has also grown rapidly, challenging existing big data analysis systems. For example, with the large scale topic model LDA, people expect to train a million or more topics, so the training process may need to update billions or even hundreds of billions of model parameters, far beyond the processing capability of a single node. To solve this problem, researchers proposed the concept of the parameter server (Parameter Server), as shown in Figure 5. In a parameter server system, the large scale model parameters are stored centrally on a distributed server cluster, while the large scale training data are spread across the worker nodes; each worker node needs to keep only the small set of parameters that its computation depends on, which effectively solves the training problem of ultra large scale big data analysis models. Current parameter server implementations include Petuum and ps-lite from Carnegie Mellon University.
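The toy sketch below (single-process and hypothetical; real systems such as Petuum shard the parameter table across many server nodes and communicate over the network) mimics the parameter server interaction pattern: workers pull only the parameters their data shard touches, compute updates locally, and push them back.

```python
# Toy single-process mimic of the parameter-server pattern. Real systems
# shard parameters across server nodes and exchange network messages;
# here the "server" is just a dictionary.

class ParameterServer:
    def __init__(self):
        self.params = {}                     # global model parameters

    def pull(self, keys):                    # worker fetches needed params
        return {k: self.params.get(k, 0.0) for k in keys}

    def push(self, updates):                 # worker sends back increments
        for k, delta in updates.items():
            self.params[k] = self.params.get(k, 0.0) + delta

def worker(server, shard):
    # Each worker touches only the parameters its data shard depends on.
    keys = {k for k, _ in shard}
    local = server.pull(keys)
    updates = {k: 0.1 * (target - local[k]) for k, target in shard}
    server.push(updates)

server = ParameterServer()
shards = [[("w1", 1.0), ("w2", -2.0)], [("w2", -2.0), ("w3", 0.5)]]
for _ in range(100):                         # iterative training rounds
    for shard in shards:
        worker(server, shard)
print(server.params)  # parameters approach their target values
```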
In big data analysis applications, visualization helps people explore and understand complex data through interactive visual representations. Visualization and visual analytics can quickly and effectively simplify and refine data streams, help users interactively sift through large amounts of data, and help users obtain new findings from complex data faster and better; they have become an indispensable means for understanding complex data and conducting in-depth analysis. The visualization of large scale data relies mainly on parallel algorithm design, making rational use of limited computing resources to efficiently process and analyze the characteristics of specific data sets. Typically, large scale data visualization techniques are combined with multi-resolution representations to achieve adequate interactive performance. Parallel visualization of large scale scientific data mainly involves four basic techniques: data streamlining, task parallelism, pipeline parallelism, and data parallelism. Microsoft has developed a large scale machine learning visualization platform, Azure Machine Learning, on its cloud computing platform Azure: a big data analysis task is organized as a directed acyclic graph and presented to users as a data flow diagram, with good interactive effect. In China, Alibaba's big data analysis platform Imperial Kitchen adopts a similar approach, providing an interactive big data analysis platform for business personnel.
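As a closing illustration of the data-flow-graph idea mentioned above, this toy sketch (hypothetical node names and functions; not the API of Azure Machine Learning or any real platform) represents an analysis task as a directed acyclic graph and executes it in topological order:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Toy data-flow DAG for an analysis task. Node names and functions are
# made up for illustration; platforms like Azure Machine Learning present
# a comparable graph visually and run each node when its inputs are ready.

def load(_):
    return [3, 1, 4, 1, 5, 9, 2, 6]

def clean(inputs):
    return [x for x in inputs["load"] if x > 1]

def stats(inputs):
    xs = inputs["clean"]
    return {"n": len(xs), "mean": sum(xs) / len(xs)}

def report(inputs):
    s = inputs["stats"]
    return f"report over {s['n']} rows, mean={s['mean']:.2f}"

nodes = {"load": load, "clean": clean, "stats": stats, "report": report}
deps = {"load": set(), "clean": {"load"}, "stats": {"clean"}, "report": {"stats"}}

results = {}
for name in TopologicalSorter(deps).static_order():  # predecessors run first
    results[name] = nodes[name]({d: results[d] for d in deps[name]})

print(results["report"])
```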