The heat of big data continues to rise, and big data has become another popular star after cloud computing. We are not going to debate whether big data suits your company or organization; on the Internet, at least, it has been touted as an omnipotent super battleship.
The big data platform architecture follows the classic layered design: the services the platform provides are divided into module layers by function, and each layer interacts only with the layers directly above and below it (through interfaces at the layer boundaries), avoiding cross-layer interaction. The benefit of this design is that each functional module is highly cohesive internally while the modules are loosely coupled to one another, which makes the platform easier to keep reliable, scalable, and maintainable. For example, when we need to scale up the Hadoop cluster, we simply add new Hadoop node servers to the infrastructure layer; no changes are needed in the other layers, and the expansion is completely transparent to users.
By function, the entire big data platform is divided into five module layers, from bottom to top:
Operating Environment layer:
The operating environment layer provides the runtime environment for the infrastructure layer and consists of two components: the operating system and the language runtimes.
(1) Operating system: we recommend installing RHEL 5.0 or above (64-bit). In addition, to increase disk I/O throughput, avoid configuring RAID; instead, spread the distributed file system's data directories across different disk partitions to improve disk I/O performance.
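Spreading the data directories across disks is done in the HDFS configuration. A minimal sketch for the Hadoop 1.x era this article describes (the mount points below are examples, not prescriptions):

```xml
<!-- hdfs-site.xml: list one data directory per physical disk so the
     DataNode round-robins block writes across spindles. -->
<property>
  <name>dfs.data.dir</name>
  <value>/data1/hdfs/data,/data2/hdfs/data,/data3/hdfs/data</value>
</property>
```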
(2) The specific requirements of the runtime environment are as follows:
- JDK 1.6 or later (required): Hadoop needs a Java runtime environment, so the JDK must be installed.
- gcc/g++ 3.x or later (optional): the GCC compiler is needed when running MapReduce tasks through Hadoop Pipes.
- Python 2.x or later (optional): the Python runtime is needed when running MapReduce tasks through Hadoop Streaming.
Infrastructure Layer:
The infrastructure layer consists of two parts: the ZooKeeper cluster and the Hadoop cluster. It provides infrastructure services to the base platform layer, such as the naming service, the distributed file system, MapReduce, and so on.
(1) The ZooKeeper cluster acts as the naming server for the Hadoop cluster, maintaining the name mappings. Through this naming service, the Task Scheduler console in the base platform layer can look up the NameNode of the Hadoop cluster, and the lookup has failover capability.
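The lookup-with-failover idea can be sketched as follows. This is only an illustration: the registry contents, hostnames, and ports are invented, and a real console would query the ZooKeeper ensemble through a client library rather than a plain Python list.

```python
# Hypothetical sketch of the naming-service lookup described above.
# The registry stands in for what ZooKeeper would return; each entry
# pairs a NameNode address with a simulated liveness flag.

NAMENODE_REGISTRY = [
    ("namenode-master1:8020", False),   # primary is down
    ("namenode-master2:8020", True),    # standby takes over
]

def resolve_active_namenode(registry):
    """Return the first live NameNode, which gives failover for free."""
    for address, is_alive in registry:
        if is_alive:
            return address
    raise RuntimeError("no live NameNode registered")

print(resolve_active_namenode(NAMENODE_REGISTRY))
```

Because the console always resolves the address at submission time, a master failover is invisible to the layers above it.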
(2) The Hadoop cluster is the core of the big data platform and the infrastructure of the base platform layer. It provides services such as HDFS, MapReduce, JobTracker, and TaskTracker. At present we use a dual master node mode to avoid a single point of failure in the Hadoop cluster.
Base platform Layer:
The base platform layer consists of three parts: the Task Scheduler console, HBase, and Hive. It provides the basic service invocation interfaces for the user gateway layer.
(1) The Task Scheduler console is the dispatch center for MapReduce tasks; it assigns the order and priority in which the various tasks execute. Users submit job tasks through the console and receive their execution results through the Hadoop client in the user gateway layer. The console works as follows:
1. When the Task Scheduler console receives a job submitted by a user, it matches the job against its scheduling algorithm;
2. It asks ZooKeeper for the JobTracker node address of an available Hadoop cluster;
3. It submits the MapReduce job;
4. It polls to check whether the job has completed;
5. When the job finishes, it sends a notification and invokes the callback function;
6. It proceeds to the next job task.
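The six steps above can be sketched as a small scheduling loop. Everything here is hypothetical scaffolding: the class name, the injected `resolve_jobtracker`/`submit`/`poll` hooks, and the priority scheme are invented to make the control flow visible, not taken from the platform's actual implementation.

```python
import heapq
import time

class TaskSchedulerConsole:
    """Toy model of the dispatch loop: queue by priority, resolve the
    JobTracker, submit, poll until done, fire the callback, repeat."""

    def __init__(self, resolve_jobtracker, submit, poll):
        self.queue = []                      # priority queue of pending jobs
        self._resolve = resolve_jobtracker   # step 2: ask the naming service
        self._submit = submit                # step 3: submit the MapReduce job
        self._poll = poll                    # step 4: completion check
        self._seq = 0                        # tie-breaker for equal priorities

    def enqueue(self, job, priority=0):      # step 1: receive a user job
        heapq.heappush(self.queue, (priority, self._seq, job))
        self._seq += 1

    def run(self, callback):
        finished = []
        while self.queue:                    # step 6: next job task
            _, _, job = heapq.heappop(self.queue)
            tracker = self._resolve()        # step 2
            handle = self._submit(tracker, job)   # step 3
            while not self._poll(handle):    # step 4: poll for completion
                time.sleep(0)                # placeholder back-off
            callback(job, handle)            # step 5: notify via callback
            finished.append(job)
        return finished

# Usage with stub hooks (a real console would talk to the cluster):
done = []
sched = TaskSchedulerConsole(
    resolve_jobtracker=lambda: "jobtracker-1:9001",
    submit=lambda tracker, job: {"job": job, "tracker": tracker},
    poll=lambda handle: True,
)
sched.enqueue("daily-report", priority=1)
sched.enqueue("fraud-scan", priority=0)      # lower number = higher priority
order = sched.run(callback=lambda job, handle: done.append(job))
print(order)
```

The hooks are injected so the loop itself stays independent of how step 2 and step 3 are actually wired to ZooKeeper and the JobTracker.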
To round out the Hadoop cluster, the task scheduling console should as far as possible be developed in-house, which gives much greater flexibility and control.
(2) HBase is a Hadoop-based columnar database that provides users with table-based data access services.
(3) Hive is a query service on top of Hadoop. Users submit SQL-like (HQL) query requests through the Hive client in the user gateway layer and view the returned query results through the client's UI. Hive provides the data department with a near-real-time data query and statistics service.
User Gateway Layer:
The user gateway layer provides end users with personalized invocation interfaces and user identity authentication; it is the only part of the big data platform visible to users. End users can interact with the platform only through the interfaces the gateway layer provides. The gateway layer currently offers three such interfaces:
(1) The Hadoop client is the portal through which users submit MapReduce jobs; the returned processing results can be viewed in its UI.
(2) The Hive client is the portal through which users submit HQL queries; the query results can be viewed in its UI.
(3) Sqoop is the interface for exchanging data between a relational database and HBase or Hive. It can import data from a relational database into HBase or Hive as required, so that users can query it through HQL, and it can also export data from HBase, Hive, or HDFS back to the relational database for further analysis by other analysis systems.
The user gateway layer can be extended as needed to meet the requirements of different users.
Customer Application Layer:
The customer application layer comprises the various terminal applications, which can include relational databases, reports, transaction behavior analysis, statements, clearing, and so on.
The applications I can currently see landing on the big data platform are:
1. Behavioral analysis: import transaction data from the relational database into the Hadoop cluster, then write a MapReduce job implementing the data-mining algorithm and submit it to the JobTracker for distributed computation, putting the results into Hive. The end user then submits HQL queries through the Hive client to analyze the statistical results.
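The shape of such a job can be illustrated with a toy map/reduce over transactions. The field names (`merchant`, `amount`) and the aggregation are invented for the example; a real job would be written against the Hadoop MapReduce API and run on the cluster via the JobTracker.

```python
from collections import defaultdict

# Sample input standing in for transaction rows imported from the
# relational database.
transactions = [
    {"merchant": "A", "amount": 10.0},
    {"merchant": "B", "amount": 5.0},
    {"merchant": "A", "amount": 7.5},
]

def map_phase(records):
    """Map: emit a (key, value) pair per transaction."""
    for r in records:
        yield r["merchant"], r["amount"]

def reduce_phase(pairs):
    """Shuffle + reduce: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

result = reduce_phase(map_phase(transactions))
print(result)
```

In the platform described above, `result` would land in a Hive table instead of being printed, ready for HQL queries.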
2. Statements: import transaction data from the relational database into the Hadoop cluster, then write a MapReduce job implementing the business rules and submit it to the JobTracker for distributed computation; end users retrieve the statement result files via the Hadoop client (Hadoop is itself a distributed file system with the usual file access capabilities).
3. Clearing: import the UnionPay files into HDFS, then run a MapReduce job that matches them against the POSP transaction data previously imported from the relational database (the reconciliation operation), feed the results into another MapReduce job that calculates the fees and splits (the settlement operation), and finally export the computed results back to the relational database, where the user triggers the merchant debit (the debit operation).
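The two chained jobs can be sketched in miniature. All record fields, transaction IDs, and the fee rate below are invented for illustration; the real jobs run as MapReduce on the cluster against the UnionPay and POSP data sets.

```python
# Stand-ins for the two input data sets: transaction id -> amount.
unionpay = {"t1": 100.0, "t2": 50.0}
posp     = {"t1": 100.0, "t2": 50.0, "t3": 9.9}

def reconcile(a, b):
    """Job 1 (reconciliation): keep only transactions present in both
    sources with matching amounts."""
    return {tid: amt for tid, amt in a.items() if b.get(tid) == amt}

def settle(matched, fee_rate=0.006):
    """Job 2 (settlement): deduct a hypothetical fee rate and compute
    each transaction's net amount for the merchant."""
    return {tid: round(amt * (1 - fee_rate), 2) for tid, amt in matched.items()}

net = settle(reconcile(unionpay, posp))
print(net)
```

The output of the second job is what would be exported back to the relational database to drive the debit operation.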
Deployment Architecture Design
Key point Description:
1. At present, the entire Hadoop cluster is hosted in the UnionPay Online machine room.
2. The Hadoop cluster has 2 master nodes and 5 slave nodes, and the 2 master nodes are backed up via ZooKeeper to provide failover. Both master nodes share all the slave nodes, ensuring that the distributed file system's replicas exist across all the DataNodes. All hosts in the Hadoop cluster must be on the same network segment and placed in the same rack to guarantee the cluster's I/O performance.
3. The ZooKeeper cluster should have at least 3 hosts (an odd number, since ZooKeeper requires a majority quorum) to avoid a single point of failure in the naming service. With ZooKeeper we no longer need an F5 to do load balancing: the Task Scheduler console achieves load-balanced access to the Hadoop name nodes directly through ZooKeeper.
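A minimal three-node ensemble configuration looks like the following; the hostnames and paths are examples only.

```
# zoo.cfg (same file on every ensemble member)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Each host additionally stores its own id (1, 2, or 3) in a `myid` file under `dataDir`.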
4. All servers must be configured for passwordless SSH access.
5. External and internal users alike must go through the gateway to access the Hadoop cluster; the gateway provides services only after authentication, ensuring secure access to the Hadoop cluster.
(From the "Big Data Essentials" series: Implementing a Hadoop-Based Big Data Platform, Overall Architecture Design)