Reposted from: http://www.csdn.net/article/2015-10-20/2825962
[MDCC 2015] Umeng Data Platform Lead Shaman: Architecture and Practice of a Mobile Big Data Platform
"Csdn Live Report" October 14-16th, "2015 Mobile Developers Conference · China "(Mobile Developer Conference 2015, abbreviated as MDCC 2015) is held at Crowne Plaza Hotel Beijing New Yunnan. The Conference by the world's largest Chinese it community csdn and China's most concerned all-round entrepreneurial Platform innovation Workshop jointly hosted by the "Internet of all things, mobile first" as the theme, inviting domestic and foreign industry leaders and technical experts on mobile development hotspot, in practice to analyze technical solutions and trends.
Friend Alliance data platform Head Shaman
Mobile internet is everywhere to ripen the big data platform, while the Chinese Internet is facing from the IT era to the DT era of change, mobile internet and big data is almost a kind of relationship. Back to app development, data and operations are especially needed at a later stage. Since 2010, the league has focused on moving big data, and over the past 5 years has not only accumulated a large number of data, but also has a wealth of technology and experience, then the friend of the big data platform has what kind of architecture and practice? Here to share with you today.
I. Architectural Ideas
Umeng's architecture mainly references the Lambda architecture proposed by Nathan Marz of Twitter. As shown in the figure, at the bottom is the speed layer: new data is computed there first; the volume is relatively small, so it can be finished quickly, producing real-time views. At the same time, new data is appended to the full (master) dataset and processed in batch, producing batch views. In this way the system has both low-latency real-time processing and the capacity for offline processing of very large data. A serving layer then merges the two views to answer queries.
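To make the serving-layer idea concrete, here is a minimal sketch, not Umeng's actual code: a query merges a precomputed batch view with a small real-time view, so results cover both historical and fresh data. The dicts and key scheme are assumptions for illustration; in production these views would live in stores like HBase and MongoDB.

```python
# Minimal Lambda serving-layer sketch: batch view covers all data up to
# the last batch run; the real-time view covers everything since then.

def merged_count(batch_view, realtime_view, key):
    """Answer a metric query by combining both views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = {("app42", "launches"): 1000000}   # produced by the batch layer
realtime_view = {("app42", "launches"): 1532}   # produced by the speed layer
print(merged_count(batch_view, realtime_view, ("app42", "launches")))
```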
The overall architecture of the data platform
According to Umeng's business characteristics, the data platform divides, bottom-up, into these parts: at the base is log collection; above it sit offline computing and real-time analysis; computed results and the valuable data produced by data mining flow into the data warehouse. On top of that, a REST-based data service powers a variety of data applications such as reports, data analysis reports, and data downloads. Alongside these layers are auxiliary functions, including task scheduling and monitoring management.
Data pipeline
Combining the business architecture with the Lambda architecture, the final system looks like this: at the far left is the data acquisition layer. Umeng provides SDKs for phones, tablets, and set-top boxes that apps integrate; through the SDK, apps send logs to the Umeng platform. Logs first reach Nginx, which load-balances them to log receivers built on the Finagle framework, and from there the data enters the access layer.
The data access layer is carried by a Kafka cluster, consumed behind it by Storm and stored in MongoDB; through Kafka's mirroring feature, two Kafka clusters are kept in sync so load can be separated. Computation splits into real-time and offline parts: real-time is Storm, offline is Hadoop. Data mining uses Hive, and analysis tasks are being migrated from Pig to Spark. Large intermediate results are stored on HDFS, with final results in HBase; Elasticsearch provides multi-level indexes to make up for HBase's lack of secondary indexes.
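As a rough illustration of the access-to-storage path, here is a simplified consumer sketch. In production this consumption is done by Storm; this stand-in uses kafka-python and pymongo instead, and the topic name, server addresses, and field names are all assumptions.

```python
# Sketch: consume app logs from Kafka and upsert per-device counters
# into MongoDB (Storm plays this role in the real pipeline).
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer("app-logs", bootstrap_servers=["kafka1:9092"])
events = MongoClient("mongodb://mongo1:27017")["analytics"]["events"]

for msg in consumer:
    event = json.loads(msg.value)
    events.update_one(
        {"app_id": event["app_id"], "device_id": event["device_id"]},
        {"$inc": {"event_count": 1}},   # increment this device's counter
        upsert=True,
    )
```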
II. Practice
The introduction above should give a preliminary picture of the whole big data platform's architecture and ideas. But as Linus Torvalds, the father of Linux, famously said: "Talk is cheap. Show me the code!" Understanding is relatively easy; the hard part is implementation. So next, I'll share some of the experience Umeng has gained in practice.
Data acquisition
Start with data collection, which faces great challenges, above all heavy traffic, high concurrency, and scalability. Umeng's collection layer has gone through an evolution. At the start in 2010, to launch quickly, we built on Ruby on Rails, with Resque handling offline processing in the background. As the mobile internet took off, this architecture faced enormous data pressure and soon could not cope.
Next, we switched to a log server based on Finagle, Twitter's open-source asynchronous server framework, which is ideal for mobile internet access patterns: high concurrency with small payloads. After the switch, single-server throughput improved greatly. At the same time, the log collection service is stateless, so it scales horizontally: under very high pressure, we can simply add temporary servers.
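The production receiver is a Finagle (JVM) service; purely as a stand-in, here is what a stateless log endpoint looks like using only Python's standard library. "Stateless" means each request is handled independently and handed off immediately, which is exactly what makes adding servers behind Nginx a sufficient answer to load spikes. The port and hand-off function are hypothetical.

```python
# Stand-in for the Finagle log receiver: accept a POSTed log and hand
# it off; no per-client state lives in this process.
from http.server import BaseHTTPRequestHandler, HTTPServer

class LogReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        forward_to_access_layer(body)   # hypothetical hand-off
        self.send_response(200)
        self.end_headers()

def forward_to_access_layer(payload: bytes) -> None:
    pass  # in production this would produce to the Kafka access layer

HTTPServer(("0.0.0.0", 8080), LogReceiver).serve_forever()
```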
Data cleansing
One characteristic of big data is diversity, and without cleansing the data is chaotic. We spent a great deal of energy on data cleansing and stepped into plenty of pitfalls.
For data analysis, the first thing needed is a unique device identifier. On Android, the common candidates are the IMEI, the MAC address, and the Android ID. But because of Android fragmentation, the APIs for these fields often return nothing at all.
There are other anomalies too: many counterfeit ("shanzhai") devices have no legal IMEI, so many devices end up sharing one IMEI, causing duplicates; some ROMs change the MAC address after flashing, so MACs repeat; and most TV boxes have no IMEI at all. Under these conditions, simply using the IMEI, the MAC, or the Android ID as the identifier causes problems.
Our approach is a dedicated service that computes identity uniformly. In the background, offline jobs find identifiers with abnormally high repetition rates and add them to a blacklist. During computation, blacklisted identifiers are skipped directly, and a different algorithm is used instead.
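A sketch of that two-step scheme follows; the threshold, field names, and fallback order are assumptions. An offline pass flags identifier values shared by too many devices, and the online path skips blacklisted or missing values, falling back through a fixed priority order.

```python
# Offline: blacklist identifiers that repeat suspiciously often
# (cloned IMEIs, fixed MACs). Online: resolve a device ID with fallback.
from collections import Counter

def build_blacklist(logs, threshold=1000):
    counts = Counter(log["imei"] for log in logs if log.get("imei"))
    return {imei for imei, n in counts.items() if n > threshold}

def device_id(log, blacklist):
    """Prefer IMEI, then MAC, then Android ID, skipping bad values."""
    for field in ("imei", "mac", "android_id"):
        value = log.get(field)
        if value and value not in blacklist:
            return field, value
    return None, None   # fall through to another identity algorithm
```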
We also did a lot of work on data normalization. Take "device model": you cannot simply use the model field as reported. The Xiaomi Mi 3, for example, shipped in many versions, and different batches report different model strings. Without unified normalization, any per-model statistics are bound to be wrong.
There are also model-name collisions. One device, the M1, released in 2011, suddenly showed a burst of active devices three years later. Investigation found that a rival manufacturer had launched a popular product at the end of 2014 whose model field was also "M1". So device models must be mapped, through dedicated means, to their product names and normalized.
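A sketch of that mapping is below; the table entries are invented for illustration. Raw model strings from different batches, and colliding names from different manufacturers, are mapped to one canonical product name before any statistics run.

```python
# Normalize raw model strings to canonical product names. Using the
# (manufacturer, model) pair first breaks name collisions like "M1".
MODEL_MAP = {
    "MI 3": "Xiaomi Mi 3",
    "MI 3W": "Xiaomi Mi 3",              # different batch, same product
    ("vendor_a", "M1"): "Vendor A M1",   # same model string,
    ("vendor_b", "M1"): "Vendor B M1",   # different products
}

def normalize_model(manufacturer, raw_model):
    return (MODEL_MAP.get((manufacturer, raw_model))
            or MODEL_MAP.get(raw_model)
            or raw_model)   # keep unknown values for later review
```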
In the course of normalization we also ran into the problem of geographic identification, which is derived from the IP address. Because IP address management in China is not very standardized, a device often appears to be in Beijing one second and in Shenzhen the next. To solve this, we pick the device's most frequent IP address of the day as its geographic identity for that day.
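That day-level rule fits in a few lines; `ip_to_region` is a stand-in for a real IP geolocation lookup.

```python
# Pick each device's most frequent IP of the day as its daily location.
from collections import Counter

def daily_region(ip_addresses, ip_to_region):
    """ip_addresses: all IPs one device reported during one day."""
    most_common_ip, _ = Counter(ip_addresses).most_common(1)[0]
    return ip_to_region(most_common_ip)
```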
There are also "time-identification", which is also a big problem. The first thing we used was client time. However, the customer time is very arbitrary, the user's error setting, will lead to inconsistent time, and some cottage opportunities have bugs, the machine restarts, the time has become a direct January 1, 1970; There is also the possibility that when the data is generated, the log will be reported to the platform when the network is re-connected. This results in a delay in the data.
In order to resolve these time inconsistencies, we use server-side time uniformly. But this brings new problems: the difference between statistical time and real time, but this difference is observed from a small window of time (e.g. one hours, or a day) and is correct from a large time window.
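As a sketch of the stamping rule (field names assumed; the anomaly flag is an addition for illustration, not something the talk described): trust the server's receive time for statistics, while keeping the client time so obviously broken clocks, like the 1970 reboot bug, can still be detected.

```python
# Stamp each event with authoritative server time; flag suspect clocks.
import time

def stamp_event(event):
    event["server_ts"] = int(time.time())           # used for statistics
    client_ts = event.get("client_ts", 0)
    event["clock_suspect"] = client_ts < 946684800  # before 2000-01-01
    return event
```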
Normalization of data formats
The Umeng SDK has evolved through many versions, so logs arrive in several formats: JSON in the early days, Thrift later. Switching between the two inside the data platform was cumbersome, so before processing we unify everything into Protocol Buffers for downstream computation.
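A sketch of that unification step: `event_pb2` is assumed to be generated by protoc from a hypothetical `event.proto` with `device_id`, `app_id`, and `ts` fields, and the Thrift branch is elided since it mirrors the JSON one.

```python
# Convert JSON- and Thrift-era logs into one protobuf message.
import json
import event_pb2  # hypothetical protoc-generated module

def to_protobuf(raw: bytes, fmt: str) -> bytes:
    event = event_pb2.Event()
    if fmt == "json":
        data = json.loads(raw)
        event.device_id = data["device_id"]
        event.app_id = data["app_id"]
        event.ts = data["ts"]
    elif fmt == "thrift":
        # deserialize with the legacy Thrift class and copy the same
        # fields; elided because it mirrors the JSON branch
        raise NotImplementedError
    return event.SerializeToString()
```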
Data calculation
In computation, tasks are divided by how much latency each service can tolerate: real-time computation, offline computation, and quasi-real-time computation.
One challenge of real-time computation is timeliness, because it is sensitive to latency at the millisecond level. Placing an inappropriate computation in this path, such as CPU-intensive work, directly delays real-time results, so the architecture must consider which parts are suitable for real-time and which are not. In addition, real-time computation tends to hit IO latency when writing the database, so the real-time store needs special optimization. We chose MongoDB to store real-time results and continually optimize its write path to solve this problem.
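The talk does not say how the MongoDB writes were optimized; one common technique, shown as a sketch here, is buffering real-time updates and flushing them as a single unordered bulk write, which cuts per-event round trips. The flush threshold is an assumption.

```python
# Buffer real-time counter updates; flush as one unordered bulk write.
from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://mongo1:27017")["analytics"]["realtime"]
buffer = []

def record(app_id, metric, delta=1):
    buffer.append(UpdateOne({"app_id": app_id, "metric": metric},
                            {"$inc": {"value": delta}}, upsert=True))
    if len(buffer) >= 500:
        coll.bulk_write(buffer, ordered=False)  # unordered lets the
        buffer.clear()                          # server parallelize it
```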
Another challenge is burst traffic. App usage is uneven across the day, with a sharp evening peak; the 10 pm to midnight window in particular puts great pressure on our system. Thanks to the architecture design, crossing a certain threshold triggers an alarm, and the ops team temporarily scales out to absorb the burst.
Because real-time computation is usually incremental, there is a problem of error accumulation. The Lambda architecture makes real-time and offline two separate computing systems, so discrepancies are inevitable, and the longer real-time results are relied on, the larger the error grows. Our solution is to keep the real-time window small, no more than a day; results older than that are cleaned up. The offline part is computed daily, and once the offline computation finishes, its data overwrites the real-time data, eliminating the accumulated error.
Offline computation faces its own problems, the most common being data skew. Skew is almost a law of nature: some big apps have data volumes vastly larger than small apps, so offline computation routinely shows a long tail, with parallel MapReduce jobs always having one or two straggler tasks, sometimes exceeding a single machine's capacity.
Data skew has many causes, each with its own remedy. The most common cause is that the partition granularity is too coarse; for example, partitioning by app ID alone easily skews. Umeng's solution is a finer-grained division, such as partitioning by app ID and device ID, then aggregating the partial results, which greatly reduces skew, as the sketch below shows.
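Here is that two-stage aggregation in PySpark (the article's jobs ran on MapReduce/Pig/Spark; the names and toy data are assumptions). Stage 1 reduces by (app ID, device ID) so one huge app's records spread across many reducers; stage 2 re-aggregates per app.

```python
# Two-stage aggregation to defeat skew on the app_id key.
from pyspark import SparkContext

sc = SparkContext(appName="skew-demo")
logs = sc.parallelize([("bigapp", "dev1"), ("bigapp", "dev2"),
                       ("smallapp", "dev3")])

per_device = (logs.map(lambda x: ((x[0], x[1]), 1))
                  .reduceByKey(lambda a, b: a + b))       # stage 1
per_app = (per_device.map(lambda kv: (kv[0][0], kv[1]))
                     .reduceByKey(lambda a, b: a + b))    # stage 2
print(per_app.collect())
```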
The second problem is data compression. In offline computation the inputs and outputs are often enormous, so we pay attention to compression at all times, trading CPU time for storage savings. This cuts the IO spent on data transfer and reduces overall job completion time.
Next is the difficulty of resource scheduling, because tasks have different priorities: some key metrics must be ready by a fixed time, while other tasks can arrive several hours later. Hadoop's built-in schedulers, whether the Fair Scheduler or the Capacity Scheduler, could not meet our needs, so we modified Hadoop's scheduler code.
Another class of tasks is latency-sensitive but unsuited to the real-time path; we call these quasi-real-time tasks. The report download service is an example: it is IO-intensive and does not belong in the real-time path, but it is time-sensitive; a user will accept waiting three to five minutes, but not one or two hours. We used to run these by reserving a fixed amount of resources for MapReduce; now we use Spark Streaming specifically for them, as sketched below.
Quasi-real-time computation also has a resource occupancy problem: reserving resources leads to low utilization, and how to balance this is an open question. Second, many quasi-real-time tasks also use the incremental computation model, so the incremental error accumulation problem must be solved; we compensate for this defect with scheduled full recomputation.
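Roughly how intermediate and final compression get switched on, here when launching a Hadoop streaming job from Python: the `-D` property names are standard Hadoop 2 settings with Snappy as the example codec, while the paths, mapper, and reducer are placeholders.

```python
# Enable map-output and job-output compression on a streaming job.
import subprocess

subprocess.run([
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "-D", "mapreduce.map.output.compress=true",
    "-D", "mapreduce.map.output.compress.codec="
          "org.apache.hadoop.io.compress.SnappyCodec",
    "-D", "mapreduce.output.fileoutputformat.compress=true",
    "-input", "/logs/raw", "-output", "/logs/daily",
    "-mapper", "mapper.py", "-reducer", "reducer.py",
])
```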
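A minimal Spark Streaming sketch in the spirit of that quasi-real-time tier: the topic, servers, and the 60-second batch interval are assumptions, and `KafkaUtils.createStream` is the Spark 1.x receiver API matching the article's era. The point is the latency class: micro-batches of minutes, not milliseconds.

```python
# Micro-batch processing of report requests with Spark Streaming.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="report-export")
ssc = StreamingContext(sc, 60)   # 60-second micro-batches

stream = KafkaUtils.createStream(ssc, "zk1:2181", "report-group",
                                 {"report-requests": 1})
stream.map(lambda kv: kv[1]).foreachRDD(
    lambda rdd: rdd.foreach(print))  # stand-in for building the report

ssc.start()
ssc.awaitTermination()
```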
Data storage
Data storage, matching the computation modes above, also divides into online and offline parts. Real-time results live mainly in MongoDB, where write IO must be optimized. Offline results are generally stored in HBase. However, HBase lacks secondary indexes, so we introduced Elasticsearch to handle index-related work for HBase.
Data caching solves the hot/cold data problem in the data service. Umeng's cache uses Redis, with twemproxy for load balancing. One lesson from experience is the need to preload data: every morning after the computation finishes, and before users actually arrive, we preload the results so that by the time users access them, they are already in memory.
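A sketch of the Elasticsearch-as-secondary-index pattern (index name, fields, and table name are assumptions): Elasticsearch answers the attribute query that HBase cannot index, returning HBase row keys, and HBase then serves the full records by key.

```python
# Query ES for matching row keys, then fetch full rows from HBase.
from elasticsearch import Elasticsearch
import happybase

es = Elasticsearch(["es1:9200"])
hbase = happybase.Connection("hbase-master")
table = hbase.table("device_stats")

def find_devices(city):
    hits = es.search(index="devices",
                     body={"query": {"term": {"city": city}}})
    return [table.row(h["_source"]["rowkey"].encode())
            for h in hits["hits"]["hits"]]
```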
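The preloading step might look like the sketch below (the key scheme and report source are assumptions). Note the client connects to twemproxy exactly as if it were a single Redis, which is the point of the proxy.

```python
# After the nightly batch, push finished reports into the cache so the
# first morning request is served from memory.
import json
import redis

cache = redis.Redis(host="twemproxy-host", port=22121)

def preload(daily_reports):
    for app_id, report in daily_reports.items():
        cache.set("report:%s" % app_id, json.dumps(report), ex=86400)
```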
Adding value to data
The most valuable part of the whole big data system lies in adding value to the data, and Umeng currently pursues this in two major directions. The first is linking internal data: based on user events, combined with user profiles, plus extra dimensions from our cooperation with Alibaba's Baichuan platform, we provide developers with more precise push targeting. For example, a car e-commerce app can target the segment of users who own cars with promotions for parts, and target users without cars with information on car sales.
In addition, a lot of work has gone into data mining: computing user profiles over the existing devices, so that a user's attributes and interests are understood, which makes later cross-linking of the data easier. At the same time, we designed a device-rating product aimed at cheating behavior.
Through the data platform's statistical and machine-learning algorithms, every existing device is rated, so that junk devices and real devices can be told apart reliably. If developers need it, we can provide device-rating metrics to help them evaluate their promotion channels and determine which are trustworthy and which are not.
For more coverage, please follow Sina Weibo @CSDN Mobile and the live stream of the 2015 Mobile Developer Conference.
"MDCC 2015" friend Alliance data Platform leader Shaman: Architecture and practice of mobile big data platform