Big data keeps getting hotter and has become the next star after cloud computing. Leaving aside whether big data suits your company or organization, on the internet at least it has been touted as an all-powerful super battleship. It seems that overnight we jumped from the Internet era into the big data era! As for what big data actually is, honestly, just as with cloud computing, I have always felt lost in the clouds and fog. The companies selling big data to you may well paint a utopian picture, but you should at least keep a clear head and ask yourself carefully: does our company really need big data?
For a third-party payment company, data is indeed the most important core asset. Our company was founded not long ago, and as the business has grown rapidly, transaction data has grown geometrically and the systems have been overwhelmed as a result. Business units, managers, and even the group's bosses call all day for reports and analysis to improve competitiveness. All the R&D department can do is run an unimaginably complicated SQL statement, and then the system goes on strike: memory overflows, outages ... It's a nightmare. OMG! Please release me!!!
In fact, the pressure on the data department is hard to imagine: aggregating all the scattered data into a valuable report can take a few weeks or longer. This is clearly incompatible with the fast response that the business units demand. As the saying goes, to do a good job one must first sharpen one's tools. It is time to trade in the shotgun for a cannon ...
Plenty of articles on the Internet describe the benefits of big data, and plenty of people tirelessly share their own big data experience. But I would like to ask: in the end, how many people and organizations are really doing big data? What are the actual results? Does it really bring value to the company? Can that value be quantified? I rarely see these questions discussed. Perhaps big data is still too new (in fact the underlying concepts are not new; it is old wine in new bottles), so people are still immersed in all kinds of wonderful fantasies.
As rigorous technical people, after a brief period of blind worship we should quickly move on to studying how to put the technology into practice. That is the essential difference between an architect who walks on the "clouds" and an architect who rides a bicycle. Take these complaints as venting or as attention-seeking; in short, the point I want to make is very simple: do not be dazzled by new things, do not blindly worship them, and do not follow the crowd. For those of us who do research, that is absolutely unacceptable.
Having said all that, it is time to get to the point. The company's top management has decided to formally implement a big data platform within the group (and has specially invited some experts from the community, which I am very much looking forward to ...). For a third-party payment company, implementing a big data platform is understandable, so I am actively taking part in the project. Having just finished my research on an OSGi-based enterprise framework, I would like to use the CSDN platform to document the implementation of this big data platform. I hope it can serve as a useful reference for other individuals or companies with similar ideas!
I. Overall architecture design of the big data platform
- Software Architecture Design
The architecture of the big data platform follows a layered design. The services the platform needs to provide are divided by function into module layers, and each layer interacts only with the layer directly above or below it, through the interface at the layer boundary, so there is no cross-layer interaction. The advantage of this design is high cohesion inside each functional module and loose coupling between modules. This makes the platform easier to keep reliable, scalable, and maintainable. For example, to scale out the Hadoop cluster we simply add a new Hadoop node server to the infrastructure layer; no other layer changes, and the expansion is completely transparent to users.
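The adjacent-layers-only rule can be illustrated with a minimal sketch. The layer names are the five levels this section goes on to describe; the function name `may_interact` is an assumption made for illustration and is not part of the platform.

```python
# The five module layers of the platform, listed from bottom to top.
LAYERS = [
    "operating environment",
    "infrastructure",
    "base platform",
    "user gateway",
    "customer application",
]


def may_interact(caller: str, callee: str) -> bool:
    """Return True only when the two layers are directly adjacent."""
    distance = abs(LAYERS.index(caller) - LAYERS.index(callee))
    return distance == 1


if __name__ == "__main__":
    # Adjacent layers: allowed by the layered design.
    print(may_interact("user gateway", "base platform"))
    # Skips the base platform and user gateway layers: not allowed.
    print(may_interact("customer application", "infrastructure"))
```

A check like this could sit in a code review script or a test to catch a module that reaches across layers.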
The entire big data platform is divided into five module levels, from bottom to top, according to its functions:
Operating Environment layer:
The operating environment layer provides the runtime environment for the infrastructure layer. It consists of 2 parts: the operating system and the runtime environment.
(1) Operating system: we recommend RHEL 5.0 or above (64-bit). In addition, to increase disk IO throughput, avoid setting up RAID; instead, spread the data directories of the distributed file system across different disk partitions.
(2) The runtime environment requirements are listed in the following table:
Name | Version | Description
JDK | 1.6 or later | Hadoop requires a Java runtime environment, so the JDK must be installed.
gcc/g++ | 3.x or later | The GCC compiler, needed when MapReduce tasks are run through Hadoop Pipes. Optional.
Python | 2.x or later | The Python runtime, needed when MapReduce tasks are run through Hadoop Streaming. Optional.
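With the optional Python runtime in place, a MapReduce task for Hadoop Streaming can be written as small scripts that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below is only an illustration, written in Python 3 syntax: the input layout (transaction id, merchant id, amount, separated by tabs) and the function names are assumptions, and both stages live in one file here so they can be tried locally without a cluster.

```python
from decimal import Decimal
from itertools import groupby


def map_line(line):
    """Mapper stage: turn one transaction record into 'merchant_id<TAB>amount'."""
    _txn_id, merchant_id, amount = line.rstrip("\n").split("\t")
    return f"{merchant_id}\t{amount}"


def reduce_lines(sorted_lines):
    """Reducer stage: sum the amounts per merchant.

    Hadoop Streaming hands mapper output to the reducer sorted by key,
    so lines with the same merchant id arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for merchant_id, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(Decimal(amount) for _, amount in group)
        yield f"{merchant_id}\t{total}"


if __name__ == "__main__":
    # Local dry run; sorted() stands in for the shuffle-and-sort phase.
    records = ["t1\tm1\t10.50", "t2\tm2\t3.00", "t3\tm1\t1.25"]
    for line in reduce_lines(sorted(map_line(r) for r in records)):
        print(line)
```

On a real cluster the two stages would be separate scripts, each looping over `sys.stdin`, passed to the streaming jar as the mapper and the reducer.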
Infrastructure layer:
The infrastructure layer consists of 2 parts: the ZooKeeper cluster and the Hadoop cluster. It provides infrastructure services to the base platform layer, such as the naming service, the distributed file system, and MapReduce.
(1) The ZooKeeper cluster provides name mapping and acts as the name server for the Hadoop cluster. Through this name server, the Task Scheduler console in the base platform layer can reach the NameNode of the Hadoop cluster, with failover.
(2) The Hadoop cluster is the core of the big data platform and the foundation of the base platform layer. It provides services such as HDFS, MapReduce, JobTracker, and TaskTracker. At present we run two master nodes to avoid a single point of failure in the Hadoop cluster.
Base platform Layer:
The base platform layer consists of 3 parts: the Task Scheduler console, HBase, and Hive. It provides the basic service invocation interfaces for the user gateway layer.
(1) The Task Scheduler console is the dispatch center for MapReduce tasks; it determines the order and priority in which tasks run. Users submit jobs through the console, and the results are returned through the Hadoop client in the user gateway layer. It works as follows:
1. When the console receives a job submitted by a user, it matches the job against the scheduling algorithm;
2. It asks ZooKeeper for the JobTracker node address of an available Hadoop cluster;
3. It submits the MapReduce job;
4. It polls until the job has completed;
5. When the job finishes, it sends a message and invokes a callback function;
6. It proceeds to the next job.
For a complete Hadoop cluster implementation, it is best to develop the Task Scheduler console in-house as far as possible, which gives more flexibility and control.
(2) HBase is a column-oriented database built on Hadoop that provides users with table-based data access.
(3) Hive is a query service on top of Hadoop. Users submit SQL-like queries through the Hive client in the user gateway layer and view the results in the client's UI. This gives the data department a near-real-time query and statistics service.
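The Task Scheduler console's steps in item (1) can be sketched as a simple loop. This is a minimal illustration, not the platform's implementation: the ZooKeeper lookup, job submission, and completion check are passed in as plain functions (all hypothetical names), and sorting by a `priority` field merely stands in for whatever scheduling algorithm the console actually uses.

```python
import time


def run_jobs(jobs, resolve_jobtracker, submit, is_complete, on_done,
             poll_interval=5.0, sleep=time.sleep):
    """Run the submitted jobs one after another."""
    # Step 1: apply the scheduling algorithm (here: lowest priority number first).
    for job in sorted(jobs, key=lambda j: j["priority"]):
        # Step 2: ask the naming service for an available JobTracker address.
        address = resolve_jobtracker()
        # Step 3: submit the MapReduce job.
        job_id = submit(address, job)
        # Step 4: poll until the job has completed.
        while not is_complete(address, job_id):
            sleep(poll_interval)
        # Step 5: send a message / invoke the callback.
        on_done(job, job_id)
        # Step 6: the loop moves on to the next job.


if __name__ == "__main__":
    # In-memory stand-ins so the loop can be tried without a cluster.
    finished = []
    run_jobs(
        jobs=[{"name": "report", "priority": 2},
              {"name": "clearing", "priority": 1}],
        resolve_jobtracker=lambda: "master1:9001",
        submit=lambda address, job: f"{job['name']}@{address}",
        is_complete=lambda address, job_id: True,
        on_done=lambda job, job_id: finished.append(job_id),
    )
    print(finished)
```

Resolving the address inside the loop, rather than once up front, means each job picks up whichever master the naming service currently reports as available.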
User Gateway Layer:
The user gateway layer provides end customers with tailored calling interfaces and user authentication. It is the only entry point to the big data platform that is visible to users: end users can interact with the platform only through the interfaces this layer provides. The gateway layer currently offers 3 interfaces:
(1) The Hadoop client is the entry point for submitting MapReduce jobs; users can view the returned results in its UI.
(2) The Hive client is the entry point for submitting HQL queries; users can view the query results in its UI.
(3) Sqoop is the interface for exchanging data between a relational database and HBase or Hive. Data can be imported from a relational database into HBase or Hive as needed so that users can query it with HQL. Data in HBase, Hive, or HDFS can also be exported back to a relational database for further analysis by other systems.
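As a rough illustration of the two directions Sqoop covers, the helper functions below assemble command lines for an import into Hive and an export from HDFS. The JDBC URL, user, table names, and paths are placeholders and the helper names are assumptions; only commonly documented Sqoop options are used (`--connect`, `--username`, `-P`, `--table`, `--hive-import`, `--hive-table`, `--num-mappers`, `--export-dir`).

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, hive_table, mappers=4):
    """Build the command for: relational table -> Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password on the console
        "--table", table,
        "--hive-import",
        "--hive-table", hive_table,
        "--num-mappers", str(mappers),
    ]


def sqoop_export_args(jdbc_url, username, table, export_dir):
    """Build the command for: files under an HDFS directory -> relational table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # Placeholder connection details, for illustration only.
    url = "jdbc:mysql://db.example.com/payments"
    for args in (
        sqoop_import_args(url, "report", "transactions", "transactions_hive"),
        sqoop_export_args(url, "report", "settlement_result", "/results/settlement"),
    ):
        print(" ".join(shlex.quote(a) for a in args))
```

Returning the arguments as a list keeps them safe to pass to `subprocess.run` without shell quoting problems.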
The user gateway layer can be extended with further interfaces as actual requirements dictate, to meet the needs of different users.
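One way to picture a gateway that authenticates every call and can take on new interfaces is sketched below. Everything here is hypothetical: the `Gateway` class, the per-user token scheme, and the handler signature are assumptions chosen to keep the example small, not a description of the platform's actual authentication.

```python
import hmac


class Gateway:
    """Single entry point: authenticate first, then dispatch to an interface."""

    def __init__(self, tokens):
        self._tokens = dict(tokens)  # user name -> secret token (assumed scheme)
        self._interfaces = {}        # interface name -> handler function

    def register(self, name, handler):
        """Add a new calling interface without touching the existing ones."""
        self._interfaces[name] = handler

    def call(self, user, token, interface, request):
        expected = self._tokens.get(user)
        # compare_digest avoids leaking information through comparison timing.
        if expected is None or not hmac.compare_digest(
                expected.encode(), token.encode()):
            raise PermissionError("authentication failed")
        if interface not in self._interfaces:
            raise KeyError(f"unknown interface: {interface}")
        return self._interfaces[interface](request)


if __name__ == "__main__":
    gateway = Gateway({"alice": "s3cret"})
    # Stand-in handler; a real one would forward the query to the Hive client.
    gateway.register("hive", lambda query: f"submitted: {query}")
    print(gateway.call("alice", "s3cret", "hive", "SELECT COUNT(*) FROM txn"))
```

Because interfaces are registered by name, adding a fourth one is a single `register` call and leaves the authentication path unchanged.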
Customer Application Layer:
The customer application layer consists of the various end applications, which can include relational databases, reports, transaction behavior analysis, statements, clearing, and so on.
The applications I can currently think of that could be deployed on the big data platform are:
1. Behavioral analysis: import transaction data from a relational database into the Hadoop cluster, write a MapReduce job based on a data mining algorithm and submit it to the JobTracker for distributed computation, then load the results into Hive. End users submit HQL queries through the Hive client to retrieve the statistical analysis results.
2. Statements: import transaction data from a relational database into the Hadoop cluster, then write a MapReduce job based on business rules and submit it to the JobTracker for distributed computation. End users fetch the statement result files through the Hadoop client (Hadoop itself is also a distributed file system with the usual file access capabilities).
3. Clearing: import the UnionPay file into HDFS and run a MapReduce computation against the POSP transaction data previously imported from the relational database (the reconciliation step). Feed the results into another MapReduce job for the rate and split calculation (the settlement step). Finally, export the results back to the relational database, where the user triggers the merchant debit (the debit step).
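The reconciliation and settlement steps of the clearing flow can be illustrated on a single machine before thinking about MapReduce. The sketch below rests on stated assumptions: both data sets are reduced to a mapping from transaction id to amount, a transaction reconciles when both sides agree on the amount, the fee is a flat rate, and the fee is split with one partner. None of these rules come from the source; they only make the two stages concrete.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def reconcile(unionpay, posp):
    """Return (matched, mismatched_ids) for two {txn_id: amount} mappings."""
    matched = {txn: amt for txn, amt in posp.items() if unionpay.get(txn) == amt}
    # Anything missing on one side or differing in amount needs manual review.
    mismatched = sorted((set(unionpay) | set(posp)) - set(matched))
    return matched, mismatched


def settle(amount, fee_rate, partner_share):
    """Return (fee, partner_cut, merchant_net) for one matched transaction."""
    fee = (amount * fee_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    partner_cut = (fee * partner_share).quantize(CENT, rounding=ROUND_HALF_UP)
    return fee, partner_cut, amount - fee


if __name__ == "__main__":
    unionpay = {"t1": Decimal("100.00"), "t2": Decimal("50.00")}
    posp = {"t1": Decimal("100.00"), "t2": Decimal("49.00"), "t3": Decimal("20.00")}
    matched, mismatched = reconcile(unionpay, posp)
    print("needs review:", mismatched)
    for txn, amount in matched.items():
        print(txn, settle(amount, Decimal("0.006"), Decimal("0.3")))
```

In a MapReduce version, the transaction id would be the shuffle key, so records from both sources meet in the same reducer, and `Decimal` keeps monetary rounding explicit.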
- Deployment Architecture Design
Key points:
1. At present, the entire Hadoop cluster is placed in the silver online room.
2. The Hadoop cluster has 2 master nodes and 5 slave nodes. The 2 master nodes back each other up through ZooKeeper to provide failover. Both master nodes share all the slave nodes, which ensures that replicas of the distributed file system exist across all DataNodes. All hosts in the Hadoop cluster must be on the same network segment and in the same rack to guarantee the cluster's IO performance.
3. The ZooKeeper cluster is configured with at least 2 hosts to avoid a single point of failure in the naming service (since ZooKeeper needs a majority of its servers running, 3 or more hosts are needed to actually survive the loss of one). With ZooKeeper we no longer need F5 for load balancing; the Task Scheduler console goes through ZooKeeper directly to get load-balanced access to the Hadoop name nodes.
4. All servers must be configured for passwordless (key-based) SSH access.
5. Both external and internal users must go through a gateway to access the Hadoop cluster. The gateway serves requests only after authentication, which protects access to the Hadoop cluster.
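When sizing the ZooKeeper cluster in point 3, it may help to keep in mind that ZooKeeper keeps serving only while a majority of its servers are up, so the number of hosts determines how many failures the naming service survives. A minimal sketch of that arithmetic (the function names are illustrative):

```python
def quorum_size(servers: int) -> int:
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be lost while a majority remains."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd host counts are the usual choice.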
"Big Data dry" implementation of big data platform based on Hadoop--Overall architecture design