How to Build the Qiniu Data Platform with Hadoop/Spark


The data platform in most companies is a supporting platform, much like the operations team (please hold the jokes). So when choosing technology, the priority is ready-made tools and quick results, with no need to take on an avoidable technical burden. In the early days we took a detour: we thought there was not much work involved and built the collection, storage, and computation pieces ourselves, which turned out to be thankless. Starting in the first half of last year, we embraced open-source tools and built our own data platform on top of them.

1. Data Platform Design Concept

The company's primary data sources are semi-structured logs scattered across business servers, such as system logs, program logs, access logs, and audit logs. Logs are the most primitive record of what happened; anything short of the log loses information. A simple example: if the requirement is to count the traffic of each domain name served by Nginx, a simple Nginx module is enough, but counting that traffic by source as well is then impossible. So the full, raw logs are needed.

A tempting shortcut is to have the business program send logs directly over the network, but this is undesirable: the network and the receiving end are not completely reliable, and when something goes wrong it either affects the business or loses logs. The most natural way to minimize intrusion into the business is therefore to write logs to the local disk first.

2. Data Platform Design Architecture


2.1 Agent Design Requirements

The agent needs to be light enough, which mainly shows in two aspects: operations and logic. Agents are deployed on every machine, so both the operational cost and the integration cost must be considered. The agent should not parse logs, filter, compute statistics, or do anything similar; that logic belongs to the data consumers. The more logic the agent carries, the less likely it is ever to be finished, and there will always be another upgrade to push out.

2.2 Data collection process

As for the technology choices in data collection: the agent is developed in-house in Go, the message middleware is Kafka, and the data-transfer tool is Flume. Flume and Kafka are often compared when data collection comes up, but in my view they play different roles: Flume leans toward data transport itself, while Kafka is a typical message middleware used to decouple producers and consumers.

Specifically, the agent does not send data directly to Kafka; in front of Kafka there is a forwarding layer built with Flume. There are two reasons for this:

1. The Kafka API is unfriendly to non-JVM languages, while the forward layer exposes a more general HTTP interface.

2. The forward layer can handle routing logic, choosing the Kafka topic and the Kafka partition key, which further reduces the logic on the agent side.

The forward layer holds no state and can be scaled out horizontally, so there is no need to worry about it becoming a bottleneck. For high availability, forward usually runs as more than one instance, which introduces a log-ordering problem: the agent picks a forward instance according to some rule (round-robin, failover, etc.), so even with a Kafka partition key, the order in which data finally lands in Kafka may differ from the order in which the agent sent it. We tolerate this disorder, because the businesses generating the logs are themselves distributed and the ordering of logs from a single machine means little. If a business does require ordering, it should send data directly to Kafka and choose the partition key carefully; Kafka can only guarantee ordering at the partition level.
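For that last case, here is a minimal sketch of a keyed Kafka producer in Scala; the broker address, topic name, and the choice of the hostname as the key are illustrative assumptions, not details from the talk. All records sharing a key hash to the same partition, which is the only level at which Kafka preserves order.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KeyedLogProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafka1:9092") // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // Records that share a key land in the same partition, so their relative
        // order is preserved; here the sending host's name serves as the key.
        val key = java.net.InetAddress.getLocalHost.getHostName
        producer.send(new ProducerRecord[String, String]("nginx-logs", key, "GET /index.html 200 512"))
        producer.close()
      }
    }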

2.3 Collection across data centers

In a multi-datacenter setup, the flow above first lands data in the local data center's Kafka cluster, which is then aggregated into the Kafka cluster in the core data center, where consumers finally use it. Because Kafka's mirroring tool does not handle the cross-datacenter network well, we chose the simpler Flume to move data between data centers. Flume is flexible about transferring data between different kinds of sources and sinks, but there are several points to note:

1. The memory channel is efficient but risks losing data; the file channel is safe but not fast. We use the memory channel with the capacity set small, so that as little data as possible sits in memory and little is lost on an unexpected restart or power failure. I personally tend to rule out the file channel: efficiency is one reason, and the other is that Flume is expected to be a transport layer, and introducing the file channel shifts its role toward storage, which does not fit the overall pipeline. The sink side of our Flume agents is usually Kafka or HDFS, both reasonably available and scalable systems, so there is little worry about data backing up.

2. The default HTTP source does not use a thread pool and has performance problems; if you need it, you have to modify the code yourself.

3. When a single sink cannot keep up, multiple sinks are needed. In situations such as cross-datacenter transfer, where high network latency caps the throughput of a single RPC sink, or where the HDFS sink is inefficient, we attach more than ten sinks to a single channel; see the configuration sketch after this list.
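To make points 1 and 3 concrete, here is a minimal Flume configuration sketch with a deliberately small memory channel and several sinks draining the same channel; the component names, port, and HDFS path are placeholders, not our production settings.

    # HTTP source -> small memory channel -> several HDFS sinks
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1 k2 k3

    a1.sources.r1.type = http
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 8080
    a1.sources.r1.channels = c1

    # Small capacity: little data sits in memory, so an unexpected restart loses little
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 1000

    # Several sinks pull from the same channel in parallel when one cannot keep up
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/logs
    # k2 and k3 are configured the same way and are omitted here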

2.4 Kafka usage notes

Kafka does well on both performance and scalability; here are a few points to note:

1. On splitting topics: big topics are friendlier to producers and cheaper to maintain, while small topics are friendlier to consumers. If the data sources are completely unrelated and the number of topics will not explode, splitting into separate topics is preferred.

2. Kafka's unit of parallelism is the partition, and total throughput is directly related to the number of partitions, but more partitions is not always better: roughly 3 partitions can saturate the I/O of an ordinary hard disk. So the number of partitions is determined by the data volume; in the end it is the disks that have to absorb the load.

3. A poorly chosen partition key can cause data skew, so only use a partition key when the data genuinely requires it. When no partition key is specified, the Kafka producer SDK writes to a single partition for a period of time, so having fewer producers than partitions also causes skew; increasing the number of producers solves this.
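An alternative mitigation, not described in the talk but sketched here under assumptions (topic name and broker address are placeholders): when records have no natural key, attach a rotating synthetic key so that successive records spread across partitions even when only a few producer processes are running.

    import java.util.Properties
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object SpreadingProducer {
      private val counter = new AtomicLong(0)

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafka1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // Rotate through 64 synthetic keys so consecutive records hash to
        // different partitions instead of sticking to one.
        def send(value: String): Unit = {
          val syntheticKey = (counter.getAndIncrement() % 64).toString
          producer.send(new ProducerRecord[String, String]("raw-logs", syntheticKey, value))
        }

        send("example log line")
        producer.close()
      }
    }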

2.5 Offline and real-time computation

After the data reaches Kafka, one path synchronizes it to HDFS for offline statistics, and another path goes to real-time computation. Since time is limited today, I can only share some of our experience with real-time computation.

For real-time computation we chose Spark Streaming. At the moment we only have statistical requirements and no iterative computation, so we use Spark Streaming rather conservatively: data is read from Kafka and written into MongoDB, and the intermediate state kept in the job is very small. The benefit is high overall system throughput with few memory-related issues.
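A minimal sketch of such a pipeline, assuming a receiver-based Kafka stream (Spark 1.x) and a per-domain byte count upserted into MongoDB with the MongoDB Java driver; the topic, hosts, and log format are assumptions made for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.mongodb.MongoClient
    import com.mongodb.client.model.{Filters, UpdateOptions, Updates}

    object TrafficStats {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("traffic-stats")
        val ssc  = new StreamingContext(conf, Seconds(10))          // 10s batch interval

        // Receiver-based Kafka stream; values are raw log lines.
        val lines = KafkaUtils.createStream(
          ssc, "zk1:2181", "traffic-stats-group", Map("nginx-logs" -> 2)).map(_._2)

        // Assumed log format: "<domain> <bytes> ..."; sum bytes per domain per batch.
        val bytesPerDomain = lines
          .map { line => val f = line.split(" "); (f(0), f(1).toLong) }
          .reduceByKey(_ + _)

        bytesPerDomain.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            val mongo = new MongoClient("mongo1")                   // one client per partition
            val coll  = mongo.getDatabase("stats").getCollection("traffic")
            records.foreach { case (domain, bytes) =>
              coll.updateOne(Filters.eq("domain", domain),
                             Updates.inc("bytes", java.lang.Long.valueOf(bytes)),
                             new UpdateOptions().upsert(true))
            }
            mongo.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }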

Spark Streaming puts a high TPS requirement on the database that stores its results. For example, with 100,000 domain names to track, a 10-second batch interval, and 4 statistics per domain, that is roughly 400,000 writes per batch, or about 40,000 TPS on average, and the peak can be higher. A MongoDB instance on SSD can only withstand about 10,000 TPS, so later we will consider using Redis to absorb this write load.

A task with external state is not logically re-entrant, so turning on the speculative-execution parameter can make the results inaccurate. A simple example: if such a task is re-run, more than the actual amount ends up written into MongoDB.

The life cycle of a stateful object is also hard to manage; you cannot create a new object for every task. Our strategy is to keep a single object per JVM and do concurrency control at the code level, similar to the following:
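The original code sample did not survive, so here is a hedged reconstruction of the idea: a lazily created, per-JVM singleton client guarded by double-checked locking (MongoClient and the host name are just illustrative; the same pattern applies to any stateful object).

    import com.mongodb.MongoClient

    // One shared client per executor JVM, instead of one per task.
    object SharedMongo {
      @volatile private var client: MongoClient = _

      def get(host: String): MongoClient = {
        if (client == null) {
          // Code-level concurrency control: only one task initializes the client.
          this.synchronized {
            if (client == null) {
              client = new MongoClient(host)
            }
          }
        }
        client
      }
    }

Tasks running in the same executor then call SharedMongo.get(...) rather than constructing their own client.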


Spark 1.3 and later introduced the Kafka direct API in an attempt to solve the data-accuracy problem. Using the direct API does alleviate the accuracy problem to a certain degree, but consistency issues remain. Why? The direct API hands the management of the Kafka consumer offsets over to you (previously they were committed asynchronously to ZooKeeper), and accuracy is only guaranteed if the computed results and the offsets are saved in one transaction.

There are two ways to do this: save the results and the offsets in a database that supports transactions, such as MySQL, or implement a two-phase commit yourself. Both are expensive to implement in a streaming computation; a rough sketch of the first approach follows. The direct API also has a performance cost, because it actually reads the data from Kafka at compute time, which has a noticeable impact on overall throughput.
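The sketch below assumes the spark-streaming-kafka module for Spark 1.3+; the aggregation and the transactional save are placeholders, since the talk does not describe the actual storage layer.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    object DirectWithOffsets {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("direct-offsets"), Seconds(10))

        val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("nginx-logs"))

        stream.foreachRDD { rdd =>
          // The direct API exposes the exact offset range each batch covers.
          val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

          val count = rdd.map(_._2).count()   // stand-in for the real aggregation

          // Exactly-once needs `count` and `offsetRanges` written in ONE transaction
          // (e.g. a single MySQL transaction); this placeholder only prints them.
          println(s"batch count=$count offsets=${offsetRanges.mkString(",")}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }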

3. The scale of the Qiniu data platform


That is all I wanted to share. Finally, the scale of our production deployment: Flume + Kafka + Spark on 8 high-spec machines, around 50 billion records per day, with a peak of 800,000 TPS.

