Although data analysis sits hidden behind the business system, it plays a very important role: the results of data analysis are pivotal to decision-making and business development.
With the development of big data technology, terms such as data mining and data exploration appear more and more often. But long before Hadoop-style big data analysis systems became popular, data analysis had already developed considerably; in particular, data analysis based on BI systems already had a very mature and stable technical solution and ecosystem.
In a BI system, the core module is the Cube. A Cube is a higher-level abstraction of the business model, and it supports various operations such as drilling up, drilling down, and slicing.
Most BI systems are built on relational databases, which are operated through SQL, but SQL is relatively weak at multi-dimensional operations and analysis, so the Cube has its own query language, MDX.
MDX expressions are much more expressive for multi-dimensional analysis. As a result, Cube-centered analysis systems account for a large share of statistical data analysis, and most database vendors ship packaged BI software that makes it easy to stand up an OLAP analysis system.
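To make the Cube operations mentioned above more concrete, here is a minimal sketch that emulates rolling up, drilling down, and slicing with pandas. It is not part of any BI product, and the table, column names, and values are all made up for illustration.

```python
import pandas as pd

# Hypothetical fact table: each row is one sale (all names and values invented).
sales = pd.DataFrame({
    "year":    [2020, 2020, 2020, 2021, 2021, 2021],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "amount":  [100, 150, 120, 130, 170, 160],
})

# Drill up: aggregate the measure at a coarser level -- per year.
by_year = sales.groupby("year")["amount"].sum()

# Drill down: move to a finer level -- per year and quarter.
by_year_quarter = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension to a single member -- only the North region.
north_by_year = sales[sales["region"] == "North"].groupby("year")["amount"].sum()

print(by_year, by_year_quarter, north_by_year, sep="\n\n")
```

A real Cube precomputes and stores many of these aggregates so that such queries do not have to scan the detail data every time.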
Over time, however, the problems of BI gradually became apparent:
BI systems focus on analyzing high-density, high-value structured business data and are very weak at processing unstructured and semi-structured data, such as the storage and analysis of images, text, and audio.
Since the data warehouse uses structured storage, the step in which data enters the warehouse from other systems is usually called the ETL process. ETL is strongly tied to the business, and a dedicated ETL team is usually needed to work with the business side and decide how data should be cleaned and transformed.
As heterogeneous data sources increase, for example video, text, and image sources whose content must be analyzed before it can enter the warehouse, very complicated ETL programs are required, and ETL becomes large and bloated.
When the data volume grows too large, performance becomes a bottleneck; traditional solutions clearly struggle at the TB/PB scale.
Constraint rules such as database normalization focus on eliminating data redundancy in order to guarantee consistency. A data warehouse, however, does not need to modify data or guarantee consistency in that sense; in principle the raw data in a warehouse is read-only, so these constraints instead become factors that hurt performance.
Because ETL makes assumptions about the data and pre-processes it, the machine learning part ends up working on data shaped by those assumptions, so the results are often not ideal. For example, to mine anomalies from a data warehouse, the features to extract must be defined up front when the data is loaded through ETL, otherwise the data cannot be structured into the warehouse; yet in most cases the features need to be extracted from the heterogeneous raw data itself.
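To make this last point concrete, here is a minimal ETL sketch using only the Python standard library; the source records, field names, and the extracted feature are all hypothetical. The point it illustrates is that the target schema, and therefore which features survive, must be decided before loading.

```python
import sqlite3

# Hypothetical raw records from an upstream system (semi-structured).
raw_records = [
    {"user": "alice", "event": "login failed", "ts": "2021-01-01T10:00:00"},
    {"user": "bob",   "event": "login ok",     "ts": None},  # dirty row
    {"user": "alice", "event": "login failed", "ts": "2021-01-01T10:00:05"},
]

def transform(record):
    """Clean one record and extract the feature fixed by the warehouse schema."""
    if record["ts"] is None:          # cleaning rule: drop rows without a timestamp
        return None
    return (
        record["user"],
        record["ts"],
        1 if "failed" in record["event"] else 0,   # feature decided at ETL time
    )

# Load: the warehouse table only has room for the features chosen up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, ts TEXT, login_failed INTEGER)")
rows = [t for t in map(transform, raw_records) if t is not None]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Any later analysis (for example anomaly mining) can only see what ETL kept.
print(conn.execute(
    "SELECT user, SUM(login_failed) FROM events GROUP BY user"
).fetchall())
```

Anything the transform step discards, such as the raw event text, is no longer available to downstream machine learning.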
Given this series of problems, big data analysis platforms led by the Hadoop ecosystem gradually showed their advantages, and the ecosystem around Hadoop kept growing. The Hadoop stack fundamentally solves the bottlenecks of the traditional data warehouse, but it also brings a series of new problems:
There is no smooth evolution path from a data warehouse to a big data architecture; the upgrade is essentially a rebuild from scratch.
Distributed storage in the big data world emphasizes that data is read-only, so storage layers such as Hive and HDFS do not support updates, and an HDFS file only allows a single writer rather than concurrent writes. These characteristics impose certain limitations.
A data analysis platform built on a big data architecture attacks the bottlenecks of the traditional data warehouse along the following dimensions:
Distributed computing: The idea is to let multiple nodes compute in parallel while emphasizing data locality, moving data across the network as little as possible. Spark, for example, expresses the computation over data as RDDs, on which a series of optimizations can be applied to reduce data transfer (a short Spark sketch follows these three points).
Distributed storage: A large file is split into N blocks, and each block is placed independently on a machine. This involves operations such as replication, sharding, and block management, and this is where the main storage optimizations happen.
Combination of retrieval and storage: In early big data components, storage and computation were kept relatively simple, but the current direction is to do more on the storage side so that querying and computation become more efficient. For computation, efficiency largely comes down to finding and reading data quickly, so modern storage formats keep not only the data itself but also a lot of metadata, such as index information; parquet and carbondata follow this idea.
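As an illustration of the distributed-computing point, here is a minimal PySpark sketch; the input data is invented and the cluster is assumed to be a local two-core SparkContext. The reduceByKey step pre-aggregates inside each partition before shuffling, which is one concrete example of reducing data transfer.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-sketch")  # assumed local two-core setup

# Hypothetical event log: (region, amount) pairs spread over two partitions.
events = sc.parallelize(
    [("north", 100), ("south", 150), ("north", 120), ("south", 170)],
    numSlices=2,
)

# The RDD only describes the computation; nothing runs until an action is called.
totals = (
    events
    .filter(lambda kv: kv[1] > 0)     # narrow transformation: stays on each node
    .reduceByKey(lambda a, b: a + b)  # combines locally per partition, then shuffles
)

print(totals.collect())  # action: triggers the distributed computation
sc.stop()
```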
In general, the current big data architectures surrounding the Hadoop system are roughly as follows:
Traditional big data architecture
It is called the traditional big data architecture because it is positioned to solve traditional BI problems. Put simply, the data analysis business itself has not changed, but the old system can no longer cope with the data volume and performance requirements and needs to be upgraded; this type of architecture solves exactly that problem. It still retains the ETL step, and data enters storage through ETL.
Advantages: Simple and easy to understand. Compared with a BI system, the basic approach has not changed; the only change is the technology selection, replacing the BI components with big data components.
Disadvantages: In the big data world there is nothing as complete as the Cube architecture in BI. Kylin exists, but its limitations are very obvious and it is far from the flexibility and stability of a BI Cube, so business support is not flexible enough; scenarios with large numbers of reports or complex drilling require too much manual customization. In addition, the architecture is still batch-oriented and lacks real-time support.
Applicable scenarios: Data analysis needs are still dominated by BI scenarios, but data volume and performance issues mean the existing system can no longer meet daily use.
Streaming architecture
Compared with the traditional big data architecture, the streaming architecture is quite radical: batch processing is removed entirely and data is processed as streams end to end. There is therefore no ETL on the ingestion side; it is replaced by a data channel. The results of stream processing are pushed directly to consumers as messages. Although a storage component exists, results are mostly kept in windows, so storage happens not in a data lake but in peripheral systems.
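Here is a minimal sketch of window-based stream processing, written in plain Python rather than a real streaming engine; the event stream and window size are invented. Each event is counted into a tumbling window and only the current window's state is kept, which is exactly why history outside the window is lost.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size

def window_key(ts):
    """Map an event timestamp (in seconds) to the start of its tumbling window."""
    return ts - ts % WINDOW_SECONDS

# Hypothetical stream of (timestamp, metric) events arriving in order.
stream = [(0, "error"), (10, "ok"), (65, "error"), (70, "error"), (130, "ok")]

counts = defaultdict(int)
current_window = None

for ts, metric in stream:
    key = window_key(ts)
    if current_window is not None and key != current_window:
        # Window closed: push the result to consumers and drop the old state.
        print(f"window {current_window}-{current_window + WINDOW_SECONDS}: {dict(counts)}")
        counts.clear()
    current_window = key
    counts[metric] += 1

print(f"window {current_window}-{current_window + WINDOW_SECONDS}: {dict(counts)}")
```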
Advantages: There is no bloated ETL process, and data freshness is very high.
Disadvantages: With no batch layer, data replay and historical statistics are poorly supported; for offline analysis, only what falls inside the window can be analyzed.
Applicable scenarios: Alerting, monitoring, and cases where the data is only useful for a limited period of time.
Lambda architecture
The Lambda architecture occupies a pivotal position among big data systems; most architectures are essentially Lambda or variants of it.
Lambda's data channel splits into two branches: a real-time stream and an offline path. The real-time stream follows the streaming architecture to guarantee low latency, while the offline path is mainly batch processing and guarantees eventual consistency.
What does this mean? To stay fast, the streaming channel relies mainly on incremental computation and serves as an auxiliary view, while the batch layer recomputes over the full data set to guarantee eventual consistency. The outermost layer of Lambda therefore has a step that merges the real-time layer and the offline layer, and this merge is one of the most important parts of Lambda.
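A minimal sketch of that merge step, with made-up view contents: the batch layer periodically produces a full, authoritative view, the speed layer keeps incremental counts for data the last batch run has not covered yet, and the serving side combines the two at query time.

```python
# Batch view: full recomputation over all historical data (authoritative, but stale).
batch_view = {"page_a": 1000, "page_b": 400}

# Real-time view: incremental counts for events after the last batch run.
realtime_view = {"page_a": 7, "page_c": 3}

def query(page):
    """Serving layer: merge the offline and real-time results at query time."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 1007: batch total plus recent increments
print(query("page_c"))  # 3: only seen by the speed layer so far
```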
Advantages: It covers both real-time and offline needs, serving data analysis scenarios very well.
Disadvantages: Although the offline layer and the real-time stream face different scenarios, their internal processing logic is the same, so there are a lot of redundant and repeated modules.
Applicable scenarios: There are both real-time and offline requirements.
Kappa architecture
The Kappa architecture optimizes Lambda by merging the real-time and batch parts and replacing the data channel with a message queue. Kappa therefore still centers on stream processing, but the data is also retained at the data lake level; when offline analysis or recomputation is needed, the data in the data lake is replayed through the message queue once more.
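A minimal sketch of the replay idea, deliberately using a plain in-memory list in place of a real message queue (no specific broker API is implied): the same processing function serves both the live stream and, when recomputation is needed, a replay of the retained log from the beginning.

```python
# Stand-in for a retained message log (a real system would use a message queue).
log = []

def append(event):
    log.append(event)

def process(events):
    """Single processing logic, used for both live traffic and replays."""
    total = 0
    for e in events:
        total += e["amount"]
    return total

# Live path: events are appended and processed as they arrive.
for amount in (10, 20, 30):
    append({"amount": amount})
live_total = process(log[-1:])  # latest event only

# Recomputation path: replay the whole retained log through the same code.
recomputed_total = process(log)

print(live_total, recomputed_total)  # 30 60
```

Because there is only one code path, Kappa avoids the duplicated batch and speed implementations of Lambda.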
Advantages: Kappa removes the redundant part of the Lambda architecture. It is designed around the clever idea of replaying data, and the whole architecture is very simple.
Disadvantages: Although the Kappa architecture looks concise, it is relatively difficult to implement, especially for the data replay part.
Applicable scenarios: Similar to Lambda, since this architecture is an optimization of Lambda.
Unifield architecture
The architectures above center on massive-scale data processing, while the Unifield architecture is more radical and merges machine learning with data processing. At its core, Unifield is still based on Lambda, but the stream processing layer is transformed: a machine learning layer is added on top of it. After data enters the data lake through the data channel, a new model-training part comes into play, and the trained models are used in the streaming layer; the streaming layer not only applies the models but also feeds their continuous training.
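A minimal sketch of that extra machine learning layer, assuming scikit-learn is available; the feature vectors, labels, and batch sizes are all made up. A model trained on data from the lake scores events in the streaming layer, and the same stream periodically feeds back into incremental training.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Model-training part: initial fit on (hypothetical) historical data from the lake.
model = SGDClassifier()
X_hist = np.array([[0.1, 1.0], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]])
y_hist = np.array([0, 1, 0, 1])
model.partial_fit(X_hist, y_hist, classes=[0, 1])

# Streaming layer: score incoming events with the current model...
stream_batch = np.array([[0.15, 0.9], [0.85, 0.15]])
print(model.predict(stream_batch))

# ...and continue training incrementally as labels for streamed events arrive.
labels = np.array([0, 1])
model.partial_fit(stream_batch, labels)
```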
Advantages: Unifield provides an architectural solution that combines data analysis and machine learning, and it answers the question of how to bring machine learning into the data platform very well.
Disadvantages: Unifield is more complicated to implement. The machine learning stack, from software packages to hardware deployment, differs greatly from a data analysis platform, so implementation is considerably harder.
Applicable scenarios: There is a large amount of data to analyze, combined with a strong current need, or a concrete plan, for machine learning.
Summary
The above are several architectures commonly used in the data processing field today. There are of course many others, but their ideas are more or less similar. Data and machine learning will keep evolving, and the ideas above may eventually become obsolete; we can only keep pace with the times and keep updating our knowledge.