Build your own big data platform product based on Ambari
Currently, there are two mainstream enterprise-level big data platform products on the market: CDH from Cloudera and HDP from Hortonworks. HDP uses the open-source Ambari as its management and monitoring tool, while CDH uses the proprietary Cloudera Manager; in China there are also proprietary big data platforms from vendors such as Transwarp. Our company initially used a CDH environment. Recently, management asked me to build our own data platform product based on Ambari. At first I wanted to turn the task down: mature data platform products already exist, and it seemed to me that for a small company this would waste manpower and material resources and come far too late. Later I realized that having our own data platform product would demonstrate the company's technical strength to customers, and that building it would let me study the components of the big data ecosystem at the source-code level.
In my view, building a company data platform consists of three parts: infrastructure construction, big data platform construction, and data interfaces for the business systems. In the initial stage of infrastructure construction, Linux servers serve as the hardware foundation; container technology can be introduced later for better resource allocation. The big data platform consists of a data access module, data storage module, data computing module, resource scheduling module, and cluster monitoring module, and is designed to meet requirements such as data storage, stream computing, batch processing, and interactive analysis. The business-system data interfaces expose appropriate interfaces to serve data according to the needs of the different business systems. This article focuses on the construction plan for the big data platform itself.
1. Data Platform Architecture
In terms of data processing methods, the platform supports both stream processing and batch processing. Stream processing uses the Storm computing framework; for now we recommend running only simple processing logic there, with results used solely for real-time data presentation, and a real-time machine learning module can be added once the technology matures. Batch processing uses the Hadoop Distributed File System (HDFS) as the underlying storage layer, which receives data from collection programs, business-system connectors, business-system log collection, and other sources; on top of it, data warehouses are built for multi-dimensional analysis according to the various business needs. For data computing we recommend the current mainstream engine, Spark: non-relational data is processed through code logic, and relational data is processed with SQL through components such as Spark SQL, Hive, and Kylin. Computation results are written to a database that supports fast reads by background applications. Computing tasks are uniformly scheduled and executed by the task scheduling system. The platform's security mechanism is implemented by configuring Kerberos on the hosts. Cluster resources are monitored by the self-developed XJManager, whose pages should include component name and status statistics, host health information, user management, and other modules, and which allows the big data platform to be installed and configured from a web page. The following figure shows the overall project architecture:
2. Platform Modules
The following describes each module:
2.1. Data Access Module
This module covers access for sensor data collection programs, collection of business-system log data with Flume, and connections to the databases of other business systems. Kafka is used as a buffer for real-time data collection. If the business systems to be connected have operational data, an ODS layer can be built. The data intended for analysis (both collected data and integrated business data) is used to build the data warehouse on Hadoop. A minimal sketch of writing into the Kafka buffer follows.
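The snippet below is only an illustrative sketch, not part of the original design: it pushes one sensor reading into the Kafka buffer using the kafka-python client, assuming a broker at localhost:9092 and a hypothetical topic name sensor-events.

```python
# Hedged sketch: push one sensor reading into the Kafka collection buffer.
# Assumptions (not from the article): kafka-python client, broker at
# localhost:9092, hypothetical topic "sensor-events".
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"sensor_id": "s-001", "ts": int(time.time()), "value": 23.7}
producer.send("sensor-events", reading)  # Kafka decouples collection from compute
producer.flush()
```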
2.2. Data Warehouse Module
The data warehouse is built on Hadoop, with data coming from multiple data sources. Different base tables are designed for different business needs. The warehouse uses a denormalized design that deliberately introduces redundancy, tailored to the data analysis needs of different dimensions. A sketch of such a denormalized table follows.
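As an illustration of the denormalized design, here is a minimal Spark SQL sketch that builds a wide fact table; all database, table, and column names are hypothetical.

```python
# Hedged sketch: build a denormalized (wide) warehouse table with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dw-build").enableHiveSupport().getOrCreate()

# Redundantly embed customer and product attributes in the fact table so that
# multi-dimensional analysis needs no joins at query time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.fact_orders_wide
    STORED AS PARQUET AS
    SELECT o.order_id, o.order_ts, o.amount,
           c.customer_id, c.customer_region,
           p.product_id, p.product_category
    FROM ods.orders o
    JOIN ods.customers c ON o.customer_id = c.customer_id
    JOIN ods.products  p ON o.product_id  = p.product_id
""")
```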
2.3. Stream Computing Module
Storm is used as the stream computing framework because of its low latency. If the data throughput is large and there is no strict timeliness requirement, Spark Streaming can be used instead, as sketched below.
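The following is a minimal sketch of the Spark Streaming alternative mentioned above, assuming a plain text source on localhost:9999 (for example, `nc -lk 9999`); the source and the word-count logic are illustrative only.

```python
# Hedged sketch: Spark Streaming micro-batch word count over a socket source.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-demo")
ssc = StreamingContext(sc, 5)  # micro-batches every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # in the platform, results would feed real-time presentation

ssc.start()
ssc.awaitTermination()
```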
2.4. Offline Computing Module
The offline processing module handles structured data with SQL statements and unstructured data with custom code. For computing large data volumes with SQL, Spark SQL is recommended; other commonly used SQL-based computing components include the traditional Hive and the open-source Kylin (Impala is not integrated because it is not open source). A Spark SQL sketch follows.
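Below is a minimal sketch of the two offline paths, structured data via SQL and unstructured data via code logic; the paths, table, and schema are hypothetical.

```python
# Hedged sketch: offline batch processing with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offline-batch").enableHiveSupport().getOrCreate()

# Structured data: plain SQL over a warehouse table.
daily = spark.sql("""
    SELECT customer_region, SUM(amount) AS total_amount
    FROM dw.fact_orders_wide
    GROUP BY customer_region
""")

# Unstructured data: code logic over raw text stored in HDFS.
logs = spark.sparkContext.textFile("hdfs:///data/raw/app.log")
error_count = logs.filter(lambda line: "ERROR" in line).count()

# Results are written where background applications can read them quickly.
daily.write.mode("overwrite").parquet("hdfs:///data/serving/daily_amount")
```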
2.5. Task Scheduling Module
Oozie is integrated with Ext JS: Oozie's web UI is deployed automatically, job dependencies are configured through XML, runtime parameters are configured through property files, and web-page monitoring is provided through Ext JS. Job status can also be checked programmatically through Oozie's REST API, as sketched below.
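A minimal sketch of polling a workflow through Oozie's REST API; the host and job id are hypothetical placeholders (11000 is Oozie's default port).

```python
# Hedged sketch: query a workflow job's status via the Oozie REST API.
import requests

OOZIE = "http://oozie-host:11000/oozie"
job_id = "0000001-200101000000000-oozie-oozi-W"  # placeholder workflow id

resp = requests.get(f"{OOZIE}/v1/job/{job_id}", params={"show": "info"})
resp.raise_for_status()
print(resp.json()["status"])  # e.g. RUNNING, SUCCEEDED, KILLED
```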
2.6. Platform Security Module
Platform security is implemented with Kerberos: each host in the cluster is configured for Kerberos authentication so that only authenticated principals can access cluster services.
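As an illustration, a scheduled job could authenticate non-interactively before touching HDFS; the keytab path and principal below are hypothetical, while `kinit -kt` is standard MIT Kerberos usage.

```python
# Hedged sketch: obtain a Kerberos ticket from a keytab before cluster access.
import subprocess

KEYTAB = "/etc/security/keytabs/etl.keytab"        # hypothetical keytab path
PRINCIPAL = "etl/gateway.example.com@EXAMPLE.COM"  # hypothetical principal

# Obtain a ticket-granting ticket non-interactively from the keytab.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# Optionally verify the ticket cache.
subprocess.run(["klist"], check=True)
```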
2.7. Cluster Monitoring Module
- Localization of the Ambari pages
- Modify the Ambari monitoring page style
- One-click installation and deployment of Ambari

The monitoring pages can pull cluster state from Ambari's REST API, as sketched below.
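A minimal sketch of reading service status through the Ambari REST API, assuming default admin credentials, a hypothetical cluster name demo, and Ambari's default port 8080.

```python
# Hedged sketch: list cluster services via the Ambari REST API.
import requests

AMBARI = "http://ambari-host:8080/api/v1"
auth = ("admin", "admin")               # default credentials; change in production
headers = {"X-Requested-By": "ambari"}  # header the Ambari API expects

resp = requests.get(f"{AMBARI}/clusters/demo/services", auth=auth, headers=headers)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["ServiceInfo"]["service_name"])
```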
3. Problems in the Initial Construction
The problems to be solved during the initial construction are as follows:
3.1. Understanding the Ambari Source Code
Ambari Source Code address: https://github.com/apache/ambari
The main modifications are in the ambari-web and ambari-views modules.
3.2. Modifying the Ambari Style
Modify the Ambari page style, including the logo, page menus, operation buttons, and prompt messages.
The following figure shows the original Ambari style:
3.3. Ambari Component Integration
Ambari is similar to Cloudera's Cloudera Manager. After compiling from source, only online installation of components is available, which is slow, unstable, and prone to installation failures. We therefore recommend integrating and packaging the commonly used components, including HDFS, MapReduce2, YARN, Hive, Sqoop, Oozie, ZooKeeper, Storm, Kafka, Flume, and Spark, so that components of the corresponding versions can be installed and deployed in advance.
3.4. One-Click Deployment Script Writing
Currently, installing Ambari offline requires preparing three packages, ambari, HDP, and HDP-UTILS, to build a local yum repository; then ambari-server is installed through yum, and a relational database is installed and configured. This process is too complex for ordinary users, so writing a one-click installation script is recommended: once the servers are prepared (passwordless SSH, firewall disabled, time synchronized), running the script alone installs and deploys everything, as sketched below.
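A minimal sketch of what such a one-click driver could look like, assuming a reachable local yum repository; the repo URL and host names are hypothetical, while `ambari-server setup -s` and `ambari-server start` are the standard commands.

```python
# Hedged sketch: one-click Ambari install driver over a prepared local repo.
import subprocess

def run(cmd):
    """Run a shell command and fail fast on errors."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. Point yum at the local Ambari repository (hypothetical URL).
run("wget -O /etc/yum.repos.d/ambari.repo http://repo-host/ambari/ambari.repo")

# 2. Install ambari-server and run its setup non-interactively (-s = silent).
run("yum install -y ambari-server")
run("ambari-server setup -s")

# 3. Start the server; the cluster is then deployed from the web UI.
run("ambari-server start")
```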