Although data analysis is hidden behind the business system, it plays a very important role: its results feed directly into decision-making and business development. With the rise of big data technology, terms such as data mining and data exploration have become ever more visible. However, long before big data analysis systems such as the Hadoop ecosystem became popular, data analysis had already undergone considerable development; in particular, data analysis built on BI systems already had a very mature and stable technical solution and ecosystem.
In a BI system, the core module is the Cube, a higher-level abstraction of the business model. A variety of operations can be performed on a Cube, such as drilling up, drilling down, and slicing. Most BI systems are built on relational databases.
Relational databases are operated through SQL, but SQL is relatively weak at multi-dimensional operations and analysis, so the Cube has its own query language, MDX, which is far more expressive for multi-dimensional analysis. As a result, analysis systems built around the Cube came to dominate a large share of statistical data analysis, and most database vendors ship packaged BI software with which an OLAP analysis system can be built easily. Over time, however, the problems of BI gradually emerged:
BI systems focus on analyzing high-density, high-value structured business data and are very weak at storing and analyzing unstructured and semi-structured data such as images, text, and audio.
Because a data warehouse stores structured data, moving data into it from other systems goes through what is usually called the ETL process. ETL is strongly bound to the business, so a dedicated ETL team is usually required to work with the business and decide how data should be cleaned and transformed.
As heterogeneous data sources increase, for example when videos, text, and pictures must be analyzed and loaded into the data warehouse, very complicated ETL programs are needed, and the ETL layer becomes overly large and bloated.
When the data volume becomes too large, performance turns into a bottleneck; at the TB/PB scale the strain is obvious.
Database normal forms and other constraint rules aim to eliminate data redundancy and guarantee consistency, but in a data warehouse we neither modify the data nor need to enforce consistency; in principle the raw data in a data warehouse is read-only, so these constraints only become a factor that hurts performance.
Because ETL pre-shapes and pre-processes the data, the machine learning part only ever sees data that already embeds those assumptions, so the results are often not ideal. For example, to mine abnormal data from a data warehouse, the feature data to be extracted must be defined explicitly when the data is loaded through ETL, otherwise it cannot be structured into the warehouse; yet in most cases the features need to be extracted from the heterogeneous raw data itself.
Against this backdrop, big data analysis platforms led by the Hadoop system gradually showed their advantages, and the ecosystem around Hadoop kept growing. Hadoop fundamentally solved the bottlenecks of the traditional data warehouse, but it also brought a new set of problems:
There is no smooth evolution path from a data warehouse to a big data architecture; the upgrade is essentially a rip-and-replace rebuild.
Distributed storage in big data systems emphasizes the read-only nature of data, so storage layers such as Hive and HDFS do not support updates, and HDFS does not support concurrent writers to the same file, which imposes certain limitations.
Distributed computing: The idea of distributed computing is to let multiple nodes compute in parallel while emphasizing data locality, moving data across the network as little as possible. Spark, for example, uses RDDs to express the computation logic over data, and a series of optimizations can be applied on the RDD to reduce data transfer (a short sketch follows this list).
Distributed storage: So-called distributed storage splits a large file into N blocks and places each block independently on a machine. This involves operations such as replication, sharding, and block management, and most of the optimization effort in distributed storage goes into this area.
Combination of retrieval and storage: In early big data components, storage and computation were kept relatively separate and simple, but the current direction is to do more work on the storage side so that query and computation become more efficient. What matters for computation is finding data fast and reading it fast, so modern storage formats keep not only the data itself but also a lot of metadata, such as index information. Parquet and CarbonData follow this idea.
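As a rough illustration of the last two points, here is a minimal PySpark sketch (the dataset path, column names, and filter value are made up). Reading a Parquet dataset lets Spark prune columns and skip row groups using the file's metadata, and the subsequent aggregation combines partial results within each partition before shuffling, so very little data moves across the network.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("locality-and-pushdown-sketch").getOrCreate()

# Column pruning + predicate pushdown: only the selected columns are read,
# and row groups whose min/max statistics cannot match the filter are skipped.
events = (
    spark.read.parquet("/data/events")          # hypothetical dataset path
         .select("user_id", "event_type")
         .filter(F.col("event_type") == "click")
)

# groupBy/count aggregates within each partition first (map-side combine),
# so only small partial counts are shuffled between nodes.
clicks_per_user = events.groupBy("user_id").count()
clicks_per_user.show()
```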
Traditional big data architecture
It is called the traditional big data architecture because it is positioned to solve traditional BI problems. Put simply, the data analysis business itself has not changed, but the old system can no longer cope because of data volume and performance, so it must be upgraded and re-architected; this type of architecture exists to solve exactly that problem. It still retains the ETL step, and data enters storage through ETL.
Advantages: Simple and easy to understand. Compared with a BI system the basic idea has not changed; the only change is the choice of technology, replacing the BI components with big data components.
Disadvantages: The big data world has nothing as complete as the Cube architecture in BI. Kylin exists, but its limitations are obvious and it is far from the flexibility and stability of a BI Cube, so business support is not flexible enough: scenarios with a large number of reports or complex drill-downs require too much manual customization. The architecture also remains batch-oriented and lacks real-time support.
Applicable scenarios: Data analysis needs are still dominated by BI-style scenarios, but the existing system can no longer handle daily use because of data volume and performance.
Streaming architecture
Building on the traditional big data architecture, the streaming architecture is very radical: batch processing is removed entirely and data is processed as streams throughout. There is no ETL at the ingestion end; it is replaced by a data channel, and the results of stream processing are pushed directly to consumers as messages. A storage component may still exist, but data is kept mainly in the form of windows, so storage happens not in a data lake but in peripheral systems.
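As a minimal sketch of the idea, assuming the events arrive on a Kafka topic named events and Spark Structured Streaming is the stream processor (broker address, topic, and window sizes are placeholders), the job below keeps only windowed state and pushes results downstream as they are produced:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-window-sketch").getOrCreate()

# Ingest from the data channel (a Kafka topic) instead of running ETL.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
         .option("subscribe", "events")                       # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS value", "timestamp")
)

# Aggregate inside 5-minute windows; state lives only as long as the
# window plus the watermark, there is no long-term batch store here.
counts = (
    events.withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .count()
)

# Push results to consumers as they are produced (console sink for the sketch).
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```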
Advantages: There is no bloated ETL process, and the data is highly timely.
Disadvantages: Since there is no batch layer, data replay and historical statistics are poorly supported; for offline analysis, only what falls inside the window can be analyzed.
Applicable scenarios: early warning, monitoring, and cases where data is only valid for a limited period.
Lambda architecture
The Lambda architecture can be regarded as a pivotal architecture in big data systems; most architectures in use are Lambda or one of its variants. Lambda splits the data channel into two branches: a real-time streaming branch and an offline branch. The streaming branch follows the streaming architecture to guarantee timeliness, while the offline branch is batch processing that guarantees eventual consistency. What does that mean? To keep the streaming channel responsive, it mainly performs incremental computation and serves as a fast auxiliary view, while the batch layer recomputes over the full data set to guarantee eventual consistency. The outermost layer of Lambda therefore merges the results of the real-time layer and the offline layer, and this merge step is a very important part of Lambda (a minimal sketch of it follows).
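As a minimal sketch of that merge step, using plain Python dictionaries as stand-ins for the batch view and the speed-layer view (all names and numbers are illustrative):

```python
# Minimal sketch of the Lambda "merge" step: serve the batch view plus the
# increments accumulated by the speed layer since the last batch run.

def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch layer: full recomputation over all historical data (eventually consistent).
batch_view = {"page_a": 1_000_000, "page_b": 250_000}

# Speed layer: incremental counts for events that arrived after the batch run.
realtime_view = {"page_a": 42, "page_c": 7}

print(merge_views(batch_view, realtime_view))
# {'page_a': 1000042, 'page_b': 250000, 'page_c': 7}
```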
Advantages: Covers both real-time and offline needs, handling the full range of data analysis scenarios very well.
Disadvantages: Although the offline layer and the real-time stream face different scenarios, their internal processing logic is the same, so there are many redundant and duplicated modules.
Applicable scenarios: There are both real-time and offline requirements.
Kappa architecture
The Kappa architecture is an optimization of Lambda: it merges the real-time and batch parts and replaces the data channel with a message queue. Stream processing therefore remains the mainstay, but the data is persisted at the data-lake level; when offline analysis or recomputation is needed, the data in the data lake is replayed through the message queue once more.
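A minimal replay sketch, assuming the history lives in a Kafka topic named events and the kafka-python client is used (broker address, topic, and group id are placeholders): recomputation is simply the same processing logic run over the topic from the earliest offset under a fresh consumer group.

```python
from kafka import KafkaConsumer, TopicPartition

def process(payload: bytes) -> None:
    # Placeholder for the real stream-processing logic.
    print(payload)

# Fresh consumer group -> independent offsets, so replay does not disturb
# the live pipeline; the same code handles both old and new data.
consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",   # placeholder broker
    group_id="reprocess-v2",           # placeholder group for the replay run
    enable_auto_commit=False,
    auto_offset_reset="earliest",
)

partitions = [TopicPartition("events", p)
              for p in consumer.partitions_for_topic("events")]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)   # rewind to the start of the retained history

for record in consumer:
    process(record.value)
```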
Advantages: The Kappa architecture eliminates the redundant parts of Lambda. It is built around the elegant idea of replaying data, and the overall architecture is very simple.
Disadvantages: Although the Kappa architecture looks simple, it is relatively difficult to implement, especially the data replay part.
Applicable scenarios: Similar to Lambda; this architecture is an optimization of Lambda.
Unifield architecture
All the architectures above are centered on massive data processing, while the Unifield architecture goes further and combines machine learning with data processing. At its core, Unifield is still based on Lambda, but the processing layer is extended with a machine learning layer: after data enters the data lake through the data channel, a model training stage is added, and the trained model is used in the streaming layer. The streaming layer, in turn, not only applies the model but also keeps training it continuously.
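A minimal sketch of that loop, using scikit-learn's SGDClassifier as a stand-in model and a generator as a stand-in for mini-batches arriving from the stream (all of this is illustrative, not part of any specific Unifield implementation): each batch is scored with the current model and then folded back into it via partial_fit, which is the "continuous training" idea.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()          # incremental model used by the streaming layer
classes = np.array([0, 1])

def stream_batches():
    """Stand-in for mini-batches delivered by the streaming layer."""
    rng = np.random.default_rng(0)
    for _ in range(10):
        X = rng.normal(size=(32, 4))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

for X, y in stream_batches():
    # Score with the current model (skipped on the very first batch,
    # before any training has happened) ...
    if hasattr(model, "coef_"):
        preds = model.predict(X)
    # ... then fold the batch back into the model: continuous training.
    model.partial_fit(X, y, classes=classes)
```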
Advantages: The Unifield architecture offers an architectural solution that combines data analysis with machine learning, answering very well the question of how to bring machine learning onto a data platform.
Disadvantages: The Unifield architecture is more complicated to implement; a machine learning stack differs greatly from a data analysis platform, from software packages to hardware deployment, so implementation is considerably harder.
Applicable scenarios: There is a large amount of data to analyze, and there is a strong demand for, or plan to adopt, machine learning.
Summary
The architectures above are the ones commonly used in the data processing field today. Of course, there are many other architectures, but their ideas are more or less similar. The fields of data and machine learning will keep developing, and the ideas above may eventually become obsolete.