In recent years, the construction and security of big data platforms at Internet companies have been hot topics. I plan to contribute two articles to the discussion: one on architecture and one on security. This article does not describe the platform architecture of any particular company; instead, it introduces the overall architecture of a big data platform in plain language.
Let's start with two questions:
What is a big data platform? It is the layer that connects Internet products with back-end big data systems: data generated by the application systems is imported into the big data platform, processed, and the results are exported back to the application systems.
Why is the big data platform important in the Internet industry? It integrates Internet applications with big data products and connects real-time data with offline data, enabling larger-scale joint computation that extracts greater value from the data and makes the business data-driven. At the same time, the platform is what puts big data technologies into practice and lets them realize their value.
Generally speaking, a big data platform can be divided into four parts: data collection, data processing, data output, and task scheduling management.
First, data collection
By data source, collection can be divided into the following four categories:
1. Database data
At present, the commonly used database import tools are Sqoop and Canal.
Sqoop is a batch import and export tool for databases: it can import relational database data into Hadoop in batches, or export Hadoop data back to relational databases.
Sqoop is suitable for batch imports of relational data. If you need to import relational database data in real time, you can use Canal instead. Canal is a MySQL binlog collection tool open-sourced by Alibaba. The binlog is MySQL's binary log, which records data changes and is used for master-slave replication; Canal pretends to be a MySQL replica and pulls the binlog from MySQL.
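As a concrete illustration, the sketch below launches a typical Sqoop batch import from Python; the connection string, table name, and HDFS path are placeholders, not values taken from this article.

```python
import subprocess

# Batch-import one MySQL table into HDFS with Sqoop (placeholder connection details).
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/shop",  # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.mysql_pwd",             # keep the password off the command line
    "--table", "orders",                                    # relational table to import
    "--target-dir", "/warehouse/ods/orders",                # HDFS destination directory
    "--num-mappers", "4",                                   # parallel map tasks for the import
]

subprocess.run(sqoop_cmd, check=True)
```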
2. Log data
Logs are one of the important data sources of a big data platform. Application logs record both the execution status of programs and the operation trails of users. Flume is a commonly used log collection tool in big data; it was originally developed by Cloudera and later donated to the Apache Software Foundation, where it continues as an open source project.
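Flume itself is configured through agent definitions (source, channel, sink) rather than written in Python, but the toy sketch below illustrates the tail-and-forward pattern that a log collection agent automates: read newly appended log lines (source), buffer them (channel), and hand them off to storage (sink). The file paths are made up for illustration.

```python
import time

LOG_FILE = "/var/log/app/access.log"      # hypothetical application log (the "source")
SINK_FILE = "/data/collected/access.log"  # stand-in for HDFS or a downstream collector (the "sink")

def follow(path):
    """Yield lines appended to a file, similar to `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

buffer = []  # in-memory "channel"
with open(SINK_FILE, "a") as sink:
    for line in follow(LOG_FILE):
        buffer.append(line)
        if len(buffer) >= 100:  # flush in small batches
            sink.writelines(buffer)
            sink.flush()
            buffer.clear()
```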
3. Front-end event tracking (buried points)
So-called front-end tracking means collecting data in the application front end for statistics and analysis.
Certain user behaviors in the front end do not generate back-end requests, such as the time spent on a page, browsing speed, and clicks or cancellations. This information is valuable for analyzing user behavior, but it can only be obtained through front-end tracking. Some Internet companies treat front-end tracking data as the main source of their big data: all front-end user behavior is collected via tracking points, combined with other data sources to build the company's data warehouse, and then analyzed and mined.
For an Internet application, the "front end" may refer to any of the following:
App, such as an iOS or Android application installed on the user's phone or tablet;
PC Web front end, opened with a desktop browser;
H5 front end, opened with a mobile browser;
WeChat mini program, opened inside WeChat.
These front ends are developed in different languages and run on different devices, and each needs to solve its own tracking problem.
The main tracking methods are manual tracking, automatic tracking, and visual tracking.
Manual tracking means front-end developers write code that sends the data to be collected to the back-end collection system. Usually the company develops an SDK for front-end data reporting; front-end engineers call the SDK wherever a tracking point is needed and pass in parameters according to the interface specification, such as ID, name, page, and control, plus business data. The SDK then sends these data to the back-end server over HTTP.
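A real tracking SDK is usually written in the front end's own language (JavaScript, Swift, Kotlin), but the Python sketch below shows the shape of such a report call: the event name, page, control, and business parameters are packaged into a payload and POSTed to the collection server. The endpoint and field names are hypothetical.

```python
import json
import time
import urllib.request

COLLECT_URL = "https://collect.example.com/track"  # hypothetical collection endpoint

def track(user_id, event, page, control, **business_params):
    """Report one manually instrumented event to the back-end collection service."""
    payload = {
        "user_id": user_id,
        "event": event,               # e.g. "click", "page_view"
        "page": page,
        "control": control,
        "ts": int(time.time() * 1000),
        "params": business_params,    # business-specific fields
    }
    req = urllib.request.Request(
        COLLECT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2)

# Example: report that a user tapped the "buy" button on the product page.
track("u_10086", "click", "product_detail", "buy_button", sku="SKU-123", price=199.0)
```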
Automatic tracking uses a front-end SDK to automatically collect all user operation events and upload them to the back-end server. Automatic tracking is sometimes called "codeless tracking", meaning no tracking code needs to be written; in reality it is full tracking, that is, every user operation is collected. Its advantage is low development effort and a unified data format; its disadvantage is that the volume of collected data is large and much of it may never prove useful, wasting computing resources. This is a particular problem for traffic-sensitive mobile users: automatic collection and upload consume a lot of bandwidth, which can even become a reason to uninstall the application, so the gain may not be worth the loss. In practice, automatic tracking is sometimes enabled for only a subset of users, and the data is sampled for statistical analysis.
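The sampling mentioned above can be made concrete: a common approach is to decide per user, deterministically, whether their automatically collected events are kept, so the same user is always in or out of the sample. A minimal sketch, with the sampling rate chosen arbitrarily:

```python
import hashlib

SAMPLE_PERCENT = 5  # keep full auto-tracked events for roughly 5% of users (arbitrary choice)

def in_sample(user_id: str) -> bool:
    """Deterministically place a user inside or outside the sample."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < SAMPLE_PERCENT

events = [
    {"user_id": "u_1", "event": "scroll"},
    {"user_id": "u_2", "event": "click"},
]
kept = [e for e in events if in_sample(e["user_id"])]
```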
Between manual and automatic tracking sits visual tracking: which front-end operations to track is configured through a visual tool, and data is then collected according to the configuration. Visual tracking is essentially automatic tracking with manual intervention.
4. Crawler system
External data can be obtained through web crawlers to support industry analysis and management decision-making. Since this involves sensitive topics, it is not expanded on here.
Second, data processing
The core of the big data platform is data processing, which is divided into offline computing and real-time computing.
1. Offline computing
Offline processing is carried out by engines such as MapReduce, Hive, and Spark.
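As an example of what an offline job can look like, here is a minimal PySpark sketch that aggregates daily order counts per user from files on HDFS; the paths and column names are assumptions, not values from this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_order_stats").getOrCreate()

# Read the raw orders imported by the collection layer (hypothetical path and schema).
orders = spark.read.parquet("/warehouse/ods/orders")

# Offline aggregation: order count and total amount per user per day.
daily_stats = (
    orders
    .withColumn("dt", F.to_date("created_at"))
    .groupBy("dt", "user_id")
    .agg(F.count("*").alias("order_cnt"),
         F.sum("amount").alias("total_amount"))
)

daily_stats.write.mode("overwrite").parquet("/warehouse/dws/daily_order_stats")
spark.stop()
```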
2. Real-time computing
Real-time processing is handled by streaming engines such as Storm and Spark Streaming, which can complete calculations within seconds or even milliseconds.
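For contrast, here is a minimal Spark Structured Streaming sketch (the newer API of the Spark Streaming mentioned above) that keeps a running count of events per type; it reads lines from a local socket as a stand-in for a real source such as Kafka or Flume.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime_event_count").getOrCreate()

# Read a stream of event lines from a local socket (stand-in for Kafka or Flume).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Running count per event type, updated as new data arrives.
counts = lines.groupBy(F.col("value").alias("event")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```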
Third, data output
The results of big data processing are written to HDFS, but application programs do not read data from HDFS directly, so the data must be exported from HDFS into databases. Besides serving data to users, the big data platform also needs to provide various statistics to operations and decision makers; these data are likewise written into databases and accessed by the corresponding back-end systems.
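One common way to export computed results from HDFS to a relational database is a JDBC write at the end of the Spark job (Sqoop export is another option). A sketch, with the connection details made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export_daily_stats").getOrCreate()

# Load the result produced by the offline job.
daily_stats = spark.read.parquet("/warehouse/dws/daily_order_stats")

# Push the result into MySQL so application and back-office systems can query it.
(daily_stats.write
 .format("jdbc")
 .option("url", "jdbc:mysql://db.example.com:3306/report")  # hypothetical reporting database
 .option("dbtable", "daily_order_stats")
 .option("user", "report_writer")
 .option("password", "***")
 .option("driver", "com.mysql.cj.jdbc.Driver")
 .mode("overwrite")
 .save())

spark.stop()
```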
Fourth, task scheduling management
What ties the three parts above together and keeps them running is the task scheduling management system. Its main functions are:
Schedule the various MapReduce and Spark jobs sensibly so that resources are used most efficiently;
Execute urgent ad-hoc tasks as soon as possible;
Provide functions such as job submission, progress tracking, and data viewing.
The simplest task scheduling system for a big data platform is essentially a Crontab-like timer that launches different big data job scripts at preset times. Task scheduling on a complex big data platform must also consider the dependencies between jobs. A well-known open source big data scheduler is Oozie, which can also be extended.
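The dependency handling mentioned above can be illustrated in a few lines of Python: jobs and their upstream dependencies form a DAG, and the scheduler runs the job scripts in topological order. The script names and the dependency graph are made up for illustration.

```python
import subprocess
from graphlib import TopologicalSorter  # Python 3.9+

# Each job maps to the set of jobs it depends on (a small, made-up DAG).
jobs = {
    "import_orders.sh": set(),
    "import_users.sh": set(),
    "daily_order_stats.sh": {"import_orders.sh", "import_users.sh"},
    "export_to_mysql.sh": {"daily_order_stats.sh"},
}

# Run every job script once its upstream jobs have finished.
for script in TopologicalSorter(jobs).static_order():
    print(f"running {script}")
    subprocess.run(["bash", f"/opt/jobs/{script}"], check=True)
```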