Spark Technology in Practice on the NetEase Big Data Platform


By Wang Jian Zong

NetEase's real-time computing requirements

For most big data applications, timeliness is an essential attribute. Information should be delivered and acquired in real time, because its value is greatest at the moment it arrives. On an e-commerce website, for example, the recommendation system is expected to analyze a customer's purchase intent in real time from that customer's click behavior.

Real-time computing refers to the immediate acquisition and computation of read-only data, also known as online computing. Online computing can be divided into three levels by latency: real-time (milliseconds to seconds), near real-time (minutes to hours), and batch (days). For batch processing, MapReduce (MR) has proven to be the most effective tool. With the popularity of big data analytics built on Hadoop, the open-source implementation of MR, its ability to handle big data has been widely recognized. However, it is best suited to batch processing of large data sets on clusters and is not suited to processing large-scale streaming data in real time. To meet real-time requirements, stream computing frameworks and real-time computing frameworks built on data warehouses are emerging, and real-time optimization techniques around MR are developing rapidly. Representative systems include Google Dremel, Twitter Storm, and Yahoo S4.
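The three latency levels above can be sketched as a small classifier. This is only an illustration: the article gives rough scales, not exact boundaries, so the cut-offs `REAL_TIME_LIMIT` and `NEAR_REAL_TIME_LIMIT` and the function name `latency_level` are assumptions made for this example.

```python
from datetime import timedelta

# Assumed cut-offs: the article only names the rough scale of each level
# (milliseconds/seconds, minutes/hours, days), not exact boundaries.
REAL_TIME_LIMIT = timedelta(minutes=1)
NEAR_REAL_TIME_LIMIT = timedelta(days=1)


def latency_level(latency: timedelta) -> str:
    """Map a response latency to one of the three levels named in the text."""
    if latency < REAL_TIME_LIMIT:
        return "real-time"
    if latency < NEAR_REAL_TIME_LIMIT:
        return "near real-time"
    return "batch"


if __name__ == "__main__":
    samples = (
        timedelta(milliseconds=200),
        timedelta(minutes=30),
        timedelta(days=2),
    )
    for sample in samples:
        print(sample, "->", latency_level(sample))
```

A different platform might reasonably draw the lines elsewhere; the point is only that the level is determined by how long a result takes to arrive.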

Big data applications fall into two main types: batch processing and stream processing. Batch processing stores data first and processes it later (store-then-process), whereas stream processing handles data directly as it arrives (straight-through processing), which improves the response time of business intelligence. The widely adopted big data processing frameworks, such as MR and Dryad, focus on large-scale data analysis and are mainly batch oriented, so in practice they cannot meet real-time demands. Popular applications such as online recommendation, web click analysis, sensor networks, traffic analysis, and high-frequency trading in finance have a significant demand for real-time analytic processing (RTAP). NetEase is one of the largest portals in China, and real-time capability is likewise an important attribute that the company's current Internet products should have.
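The difference between store-then-process and straight-through processing can be sketched in plain Python. This is a toy illustration, not code from any of the frameworks named above; the function names and the click events are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_click_counts(stored_clicks: List[str]) -> Counter:
    """Store-then-process: every click is already stored, so count in one pass."""
    return Counter(stored_clicks)


def streaming_click_counts(click_stream: Iterable[str]) -> Iterator[Counter]:
    """Straight-through processing: update the counts as each click arrives
    and yield a snapshot right away, without waiting for the full data set."""
    counts: Counter = Counter()
    for page in click_stream:
        counts[page] += 1
        yield counts.copy()


if __name__ == "__main__":
    clicks = ["shoes", "bags", "shoes"]  # hypothetical click events
    print("batch:", batch_click_counts(clicks))
    for snapshot in streaming_click_counts(clicks):
        print("stream:", snapshot)
```

The batch function can only answer after all clicks have been collected, while the streaming generator has an up-to-date answer after every event, which is what a real-time recommendation system needs.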

Application of Spark technology in NetEase big data

Spark represents a new direction for data processing. It is a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark implements distributed computing on the MapReduce model and retains the advantages of Hadoop MapReduce. Unlike MapReduce, however, a job's intermediate output and results can be kept in memory, so there is no need to read and write HDFS between steps. This makes Spark better suited to algorithms that iterate over MapReduce-style steps, such as those used in data mining and machine learning.
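Why keeping data in memory helps iterative algorithms can be shown with a small simulation. This sketch does not use the Spark or Hadoop APIs: `CountingStore` is a hypothetical stand-in for HDFS that counts reads, `refine_mean` is a toy iterative algorithm, and `cached` mimics the idea of loading a data set once and reusing it from memory.

```python
from typing import Callable, List


class CountingStore:
    """Hypothetical stand-in for HDFS that counts how often the data is read."""

    def __init__(self, values: List[float]) -> None:
        self._values = values
        self.reads = 0

    def load(self) -> List[float]:
        self.reads += 1
        return list(self._values)


def refine_mean(load: Callable[[], List[float]], iterations: int) -> float:
    """Toy iterative algorithm: each pass moves the estimate halfway
    toward the mean of the data it obtains from load()."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def cached(load: Callable[[], List[float]]) -> Callable[[], List[float]]:
    """Wrap a loader so the data is read once and then served from memory."""
    cache: List[List[float]] = []

    def loader() -> List[float]:
        if not cache:
            cache.append(load())
        return cache[0]

    return loader


if __name__ == "__main__":
    disk_store = CountingStore([1.0, 2.0, 3.0])
    refine_mean(disk_store.load, iterations=10)
    print("reads without caching:", disk_store.reads)

    memory_store = CountingStore([1.0, 2.0, 3.0])
    refine_mean(cached(memory_store.load), iterations=10)
    print("reads with caching:", memory_store.reads)
```

Both runs compute the same estimate, but the uncached loop touches storage once per iteration, while the cached loop touches it only once. On a real cluster each of those reads is a pass over distributed storage, which is where the saving comes from.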

On the NetEase big data platform, data is stored in HDFS, and Hive provides data warehouse computation and queries. To improve data processing performance and reach real-time levels, NetEase combines two real-time technologies, Impala and Shark. Cloudera Impala is an open-source real-time query engine for Hadoop that is reported to be 3 to 90 times faster than Hive. It is essentially modeled on Google Dremel but goes beyond it in SQL functionality. Shark is a Spark-based SQL implementation. According to the Shark paper, it can run queries up to 40 times faster than Hive and machine learning programs up to 25 times faster, and it is fully compatible with Hive.

Figure 1 and Figure 2 show the results of preliminary tests of computing performance and real-time query performance, respectively. In the big data real-time query system on the NetEase real-time computing platform, Impala processed data 3 to 30 times faster than Hive, and Shark 1.5 to 15 times faster than Hive. Comparing the two engines, Impala is usually somewhat faster than Shark. That raises a question: if Impala performs so well in real time, why is Shark needed?
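The speedup figures quoted here are ratios of running times: the time Hive takes for a query divided by the time the other engine takes for the same query. A minimal sketch of that arithmetic follows. The timings are made up for illustration and are not measurements from the NetEase tests; the function name `speedup` is likewise an assumption.

```python
def speedup(baseline_seconds: float, engine_seconds: float) -> float:
    """Return how many times faster an engine is than the baseline
    for the same query (baseline time divided by engine time)."""
    if baseline_seconds <= 0 or engine_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / engine_seconds


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT results from the article's tests.
    timings = {"hive": 120.0, "impala": 8.0, "shark": 12.0}
    for engine in ("impala", "shark"):
        ratio = speedup(timings["hive"], timings[engine])
        print(f"{engine}: {ratio:.1f}x faster than Hive")
```

A range such as "3 to 30 times" then simply reflects the smallest and largest ratio observed across the different queries in a test suite.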

When designing the big data platform, we found that Impala performed well but was incompatible with data from the existing Hive deployment. Because many current big data applications are organized around Hive, and Shark is fully compatible with that existing data, the current architecture has to use a mixed data processing model. Hive and Impala will work side by side for some time: Hive mainly handles predefined queries and batch-related jobs, while Impala handles interactive (ad-hoc) queries. This lets the big data system support both OLTP and OLAP and so reach the level of real-time analytic processing (RTAP).
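One way to picture the mixed processing model is as a routing rule that picks an engine per query. The sketch below is an assumption-laden illustration rather than NetEase's actual dispatcher: the `Query` fields and `choose_engine` are hypothetical, and sending ad-hoc queries over legacy Hive data to Shark is an inference from the compatibility point above, not something the article spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    sql: str
    predefined: bool        # scheduled or batch-style job
    legacy_hive_data: bool  # reads tables from the existing Hive deployment


def choose_engine(query: Query) -> str:
    """Pick an engine name following the division of labor described above."""
    if query.predefined:
        # Predefined queries and batch jobs stay on Hive.
        return "hive"
    if query.legacy_hive_data:
        # Assumption: interactive queries over Hive-organized data use Shark,
        # because Shark is described as fully compatible with that data.
        return "shark"
    # Remaining interactive (ad-hoc) queries go to Impala.
    return "impala"


if __name__ == "__main__":
    examples = [
        Query("SELECT COUNT(*) FROM daily_report", True, True),
        Query("SELECT * FROM old_logs LIMIT 10", False, True),
        Query("SELECT AVG(price) FROM new_orders", False, False),
    ]
    for q in examples:
        print(choose_engine(q), "<-", q.sql)
```

The table names in the usage example are invented; only the three-way split between Hive, Shark, and Impala comes from the discussion above.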

Figure 1. NetEase big data platform performance test (COUNT/SUM/AVG operations)

Figure 2. NetEase big data platform performance test (JOIN/ad-hoc query operations)

Summary

If one had to pick the IT industry buzzword of 2012-2013, it would undoubtedly be "big data". Return on investment (ROI) has evolved into return on information, which has become an important indicator for Internet companies. If the vast amount of data available is just a pile of "rubbish" with no gold to mine, there is little point in talking about big data. An important factor in improving this return is timeliness, and improving data response time requires technology to support and guarantee it.

As one of China's top Internet companies, NetEase was an early pioneer in big data, especially in real-time computing, and adopted the latest technologies such as Impala and Shark early to provide its services. NetEase's big data system can flexibly choose among real-time computing engines, and the real-time processing capacity of the overall system can be improved by 2 to 15 times, which has a significant effect on the company's production efficiency. Follow-up work is expected to raise the real-time level further. At present only second-level latency can be achieved; reaching the millisecond or even microsecond level is a direction for future research and development. In short, for companies with massive data computation and strong real-time demands, adopting Spark is a good choice.

References

[1] Storm: Distributed and fault-tolerant realtime computation.

[2] Leonardo Neumeyer, Bruce Robbins, Anish Nair, Anand Kesari. S4: Distributed Stream Computing Platform. IEEE International Conference on Data Mining Workshops (ICDMW).

[3] Cloudera Impala. https://github.com/cloudera/impala

[4] Reynold S. Xin, Josh Rosen, et al. Shark: SQL and Rich Analytics at Scale. SIGMOD Conference 2013.
