Architecture practices: from Hadoop to Spark

Tags: shuffle, hadoop, mapreduce, spark, mllib

Abstract: This article describes how TalkingData gradually introduced Spark while building its big data platform, and how it built a mobile big data platform on top of Hadoop YARN and Spark.

Spark has now been widely recognized and supported in China: at the 2014 Spark Summit China in Beijing, the venue was packed; in the same year, Spark Meetups were held in four cities, Beijing, Shanghai, Shenzhen, and Hangzhou, with Beijing alone hosting five of them, covering many areas including Spark Core, Spark Streaming, Spark MLlib, Spark SQL, and more. As an early adopter of Spark for its mobile internet big data services, TalkingData has been actively involved in the activities of the domestic Spark community and has repeatedly shared its experience with Spark at meetups. This article describes how TalkingData introduced Spark into its big data platform and built a mobile big data platform based on Hadoop YARN and Spark.

First encounter with Spark

As a pioneer in the field of mobile internet big data, it is essential for the company's technical team to keep track of developments in big data technology. While going through the public handouts from Strata 2013, a tutorial titled "An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark, Spark Streaming, and Shark" caught the attention of the entire technical team and sparked much discussion. Spark's memory-based RDD model, its support for machine learning algorithms, its unified model for real-time and offline processing across the whole technology stack, and Shark all came into view. At the same time we were also watching Impala, but by comparison, Impala can be understood as an upgrade to Hive, while Spark attempts to build an entire big data processing ecosystem around the RDD. For a startup whose data volume is growing rapidly, whose business centers on big data processing, and whose requirements keep changing, the latter was clearly more worthy of further attention and research.

Getting started with Spark

In mid-2013, with the rapid growth of the business, more and more mobile device-side data was being collected by the various business platforms. Did this data contain more value than just the operational metrics each business needed? To better explore the potential value of the data, we decided to build our own data center to bring together data from every business platform, and to process, analyze, and mine the device-level data there, so as to discover its value. The primary functions planned for the initial data center were as follows:

1. Cross-market aggregation of Android application rankings;

2. Application recommendations based on user interest.

Based on the technology we had mastered at the time and the functional requirements, the technical architecture used in the data center is shown in Figure 1.

The entire system was built on Hadoop 2.0 (Cloudera CDH 4.3), using the most basic big data computing architecture. A log collection program aggregates the logs of the different business platforms into the data center, and after ETL the data is formatted and stored in HDFS. The ranking and recommendation algorithms were implemented with MapReduce; the system only performed offline batch computation, and offline task scheduling was handled by an Azkaban-based scheduling system.

The first version of the data center architecture was basically designed to serve the "most basic use of the data". However, as the value of the data was explored further, more and more real-time analysis requirements appeared, and more machine learning algorithms were needed to support different data mining needs. For real-time data analysis, it was clearly not feasible to "develop a separate MapReduce job for every analysis requirement", so introducing Hive was a simple and straightforward choice. Given that the traditional MapReduce model does not support iterative computation well, we needed a better parallel computing framework to support machine learning algorithms, and that is exactly Spark's strong point, which we had been watching closely; with its friendly support for iterative computation, Spark was certainly the right choice. At the end of September 2013, with the release of Spark 0.8.0, we decided to evolve the original architecture, introducing Hive as the basis for ad-hoc queries and the Spark computing framework for machine learning style computation, and to verify whether Spark, the new computing framework, could replace the traditional MapReduce-based computing framework. Figure 2 shows the architectural evolution of the whole system.
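
To make the point about iterative computation concrete, here is a minimal sketch (not TalkingData's actual code; the input path, parsing, feature dimension, and learning rate are all assumptions) of gradient-descent logistic regression on an RDD. The training set is cached in memory once and reused on every iteration, which is where Spark's advantage over MapReduce comes from:

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeLRSketch {
      case class Point(features: Array[Double], label: Double)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lr-sketch"))

        // Hypothetical input: "label,f1,f2,...,f10" per line
        val points = sc.textFile("hdfs:///data/lr/points")
          .map { line =>
            val fields = line.split(",").map(_.toDouble)
            Point(fields.tail, fields.head)
          }
          .cache() // the key point: the dataset stays in memory across iterations

        var w = Array.fill(10)(0.0) // assumed feature dimension
        for (_ <- 1 to 20) {        // every iteration reuses the cached RDD
          val gradient = points.map { p =>
            val margin = (w, p.features).zipped.map(_ * _).sum
            val scale = (1.0 / (1.0 + math.exp(-p.label * margin)) - 1.0) * p.label
            p.features.map(_ * scale)
          }.reduce((a, b) => (a, b).zipped.map(_ + _))
          w = (w, gradient).zipped.map((wi, gi) => wi - 0.1 * gi)
        }

        println("weights: " + w.mkString(" "))
        sc.stop()
      }
    }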

In this architecture, we deployed Spark 0.8.1 on YARN and used separate queues to isolate the Spark-based machine learning tasks from the daily MapReduce ranking computation and the Hive-based ad-hoc analysis tasks.

To introduce Spark, the first step was to obtain a Spark package that supported our Hadoop environment. Our Hadoop environment was Cloudera's CDH 4.3, and the default Spark release packages did not include a version supporting CDH 4.3, so we had to compile it ourselves. The official Spark documentation recommends compiling with Maven, but the compilation did not go as smoothly as one might hope: for well-known reasons, various dependencies could not be downloaded successfully from some of the central repositories. So we took the simplest, most direct approach and compiled on an AWS cloud host. Note that before compiling, be sure to follow the documentation's advice and set the Maven options (the values below are reconstructed from the Spark build documentation of that era and may differ for your version):
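
    # give Maven enough heap and PermGen space for the Spark build
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"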

Otherwise, you will run into out-of-memory errors during compilation. For CDH 4.3, the Maven build parameters are roughly as follows (flags reconstructed from the Spark documentation of that era; check the build instructions for your exact Spark version):
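
    mvn -Pyarn -Dhadoop.version=2.0.0-cdh4.3.0 -Dyarn.version=2.0.0-cdh4.3.0 -DskipTests clean package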

Once the required Spark package has been compiled successfully, deploying and running Spark in a Hadoop environment is easy. Package and compress the compiled Spark directory, extract it on any machine that can run a Hadoop client, and you can run Spark from there. To verify that Spark works correctly against the target Hadoop environment, you can run the SparkPi example as described in the official Spark documentation, along these lines (the jar paths and version strings below are illustrative and depend on your build):
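
    SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.0-cdh4.3.0.jar \
      ./spark-class org.apache.spark.deploy.yarn.Client \
        --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar \
        --class org.apache.spark.examples.SparkPi \
        --args yarn-standalone \
        --num-workers 2 \
        --worker-memory 2g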

After the Spark deployment was complete, the rest was to develop Spark-based programs. Although Spark supports Java and Python, the most suitable language for developing Spark programs is Scala. After a period of exploration, we mastered the functional style of the Scala language and came to appreciate the benefits of developing Spark applications in it: the same functionality that takes hundreds of lines of MapReduce code can be implemented in just a few dozen lines of Scala on Spark. At run time, the same computation can run dozens of times faster on Spark than on MapReduce. For machine learning algorithms that require iteration, the advantage of Spark's RDD model over MapReduce is even more obvious, not to mention the basic support provided by MLlib. After several months of practice, all data mining work was migrated to Spark, and we implemented an LR (logistic regression) algorithm on Spark that is more efficient for our datasets.
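
As a hedged illustration of that conciseness (this is not the actual ranking job; the input layout and paths are invented), a cross-market aggregation of application download counts comes down to a handful of RDD transformations:

    // assumes an existing SparkContext `sc`
    // (on older Spark versions, also: import org.apache.spark.SparkContext._)
    // each input line: market,appId,downloads
    val rankings = sc.textFile("hdfs:///logs/app-rankings/*")
      .map(_.split(","))
      .map(fields => (fields(1), fields(2).toLong))   // (appId, downloads)
      .reduceByKey(_ + _)                             // sum across markets
      .map { case (app, total) => (total, app) }
      .sortByKey(ascending = false)                   // overall ranking

    rankings.saveAsTextFile("hdfs:///output/overall-ranking")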

Full embrace of Spark

Entering 2014, the company's business had made considerable progress and, compared with when the data center platform was first built, the volume of data processed daily had also doubled. The daily ranking computation was taking longer and longer, and the Hive-based ad-hoc analysis could only support day-scale computations; at week scale the run time became hard to endure, and at month scale the computation basically could not finish. Based on our knowledge of and accumulated experience with Spark, it was time to migrate the entire data center onto Spark.

In April 2014, Spark Summit China was held in Beijing. With learning in mind, our technical team attended this Spark event. Through it we learned that many domestic peers had already started building their big data platforms with Spark, and that Spark had become one of the most active projects in the ASF. In addition, more and more big data products were gradually integrating with Spark or migrating to it. Spark would undoubtedly develop a better ecosystem than Hadoop MapReduce. This conference made us even more determined to embrace Spark fully.

Based on YARN and Spark, we began to re-architect the big data platform that the data center relies on. The new data platform had to support:

1. Quasi-real-time data collection and ETL;

2. Support streaming data processing;

3. More efficient off-line computing capability;

4. High-speed multidimensional analysis capability;

5. More efficient real-time analysis capability;

6. Efficient machine learning ability;

7. Unified data Access interface;

8. A unified view of the data;

9. Flexible task scheduling.

The new architecture takes full advantage of YARN and Spark and incorporates some of the company's existing technology, as shown in Figure 3.

In the new architecture, Kafka is introduced as the channel for log aggregation. The logs collected from mobile devices by the various business systems are written to Kafka in real time, which makes subsequent consumption of the data easy.
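
As a rough sketch of the producing side (broker addresses, topic name, and the log payload are all assumptions; the article does not describe the actual collectors), writing a device log into Kafka looks like this:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object LogProducerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // key = device id, value = the raw log as a JSON string
        producer.send(new ProducerRecord[String, String]("device-logs", "device-42", """{"event":"launch"}"""))
        producer.close()
      }
    }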

With Spark Streaming, the data in Kafka can be consumed easily. Within the overall architecture, Spark Streaming is mainly responsible for the following (a simplified sketch of this flow follows the list).

1. Saving the original logs: the raw logs from Kafka are stored in HDFS in JSON format.

2. Data cleansing and conversion: after cleaning and standardization, the data is converted to Parquet format and stored in HDFS, ready for the various downstream computing tasks.

3. Well-defined streaming computations, such as label processing based on frequency rules, whose results are written directly to MongoDB.
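
A simplified sketch of that ingestion flow (topic name, ZooKeeper quorum, consumer group, batch interval, and paths are assumptions; the Parquet conversion and label rules are left out):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object LogIngestSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("log-ingest-sketch"), Seconds(60))

        // topic -> number of receiver threads (illustrative)
        val lines = KafkaUtils
          .createStream(ssc, "zk1:2181,zk2:2181", "log-ingest-group", Map("device-logs" -> 2))
          .map(_._2)                                  // keep only the JSON payload

        // 1. archive the raw JSON, one HDFS directory per batch
        lines.saveAsTextFiles("hdfs:///raw-logs/device-logs")

        // 2./3. cleansing, Parquet conversion and frequency-rule labels would be
        //       further transformations on `lines`; omitted in this sketch

        ssc.start()
        ssc.awaitTermination()
      }
    }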

The ranking computation was re-implemented on Spark, taking advantage of Spark's performance and of the efficient data access that Parquet columnar storage provides. For the same computation, with the data volume grown to three times the original, the time cost is only one sixth of what it used to be.

Beyond the performance gains from Spark and Parquet columnar storage, ad-hoc multidimensional data analysis, which had previously been unable to meet business needs, finally became practical. A day-scale multidimensional ad-hoc analysis that used to be run on Hive now completes in only 2 minutes on the new architecture; week-scale results can be computed in only 10 minutes; and the month-scale multidimensional analysis that could never finish on Hive now completes within two hours. In addition, the steady improvement of Spark SQL has also lowered the development effort.
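
As a hedged sketch of this kind of ad-hoc query (table and column names are invented; the real schema is not described in the article), Spark SQL over the Parquet files might look like:

    import org.apache.spark.sql.SQLContext

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)
    val events = sqlContext.parquetFile("hdfs:///warehouse/events/date=2015-06-01")
    events.registerTempTable("events")

    val byMarketAndOs = sqlContext.sql(
      "SELECT market, os_version, COUNT(DISTINCT device_id) AS devices " +
      "FROM events GROUP BY market, os_version")
    byMarketAndOs.collect().foreach(println)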

The self-developed bitmap engine used for multidimensional analysis has also been migrated to YARN, using the resource management capabilities YARN provides. For dimensions that are determined in advance, bitmap indexes can be pre-built; when a multidimensional analysis only involves dimensions that already have bitmap indexes, the bitmap engine answers it through bitmap computation, thus providing real-time multidimensional analysis capability.
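
A toy sketch of the idea behind such a bitmap engine (the real engine is proprietary and not described further in the article): each value of a pre-chosen dimension gets a bitmap over row ids, a multidimensional filter becomes a bitwise AND, and counting is just counting set bits.

    import scala.collection.mutable

    // one bitmap per dimension value, e.g. "market=googleplay", "os=4.4"
    val index = mutable.Map.empty[String, mutable.BitSet]

    def addRecord(rowId: Int, dimensionValues: Seq[String]): Unit =
      dimensionValues.foreach { v =>
        index.getOrElseUpdate(v, mutable.BitSet.empty) += rowId
      }

    // rows matching ALL requested dimension values
    def count(dimensionValues: Seq[String]): Int = {
      val bitmaps = dimensionValues.map(index.getOrElse(_, mutable.BitSet.empty))
      if (bitmaps.isEmpty) 0 else bitmaps.reduce(_ & _).size
    }

    addRecord(0, Seq("market=googleplay", "os=4.4"))
    addRecord(1, Seq("market=xiaomi", "os=4.4"))
    println(count(Seq("os=4.4")))                       // 2
    println(count(Seq("market=googleplay", "os=4.4")))  // 1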

In the new architecture, to manage the data more conveniently, we introduced a metadata management system based on HCatalog. Data definition, storage, and access all go through this system, which provides a unified view of the data and makes data assets easier to manage.

YARN only provides resource scheduling; in a big data platform, a distributed task scheduling system is also indispensable. In the new architecture, we developed a DAG-aware distributed task scheduler which, combined with the resource scheduling that YARN provides, supports scheduled tasks, ad-hoc tasks, and pipelines of dependent tasks (a toy sketch of the DAG ordering idea follows).
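
The in-house scheduler itself is not described in detail; the sketch below only illustrates the core DAG constraint it has to enforce, namely that a task runs only after all of its upstream dependencies have finished (task names are invented):

    // deps: task -> the tasks it depends on
    def topologicalOrder(deps: Map[String, Set[String]]): List[String] = {
      var remaining = deps
      var order = List.empty[String]
      while (remaining.nonEmpty) {
        val runnable = remaining.collect { case (t, d) if d.forall(order.contains) => t }.toList
        require(runnable.nonEmpty, s"cycle detected among: ${remaining.keys.mkString(", ")}")
        order = order ++ runnable
        remaining = remaining -- runnable
      }
      order
    }

    // e.g. ETL depends on ingestion; ranking and labels depend on ETL
    println(topologicalOrder(Map(
      "ingest"  -> Set[String](),
      "etl"     -> Set("ingest"),
      "ranking" -> Set("etl"),
      "labels"  -> Set("etl")
    )))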

On top of the new YARN- and Spark-centered architecture, a self-service big data platform was built for the data business units, allowing them to easily use the platform for multidimensional analysis, data extraction, and custom label processing. This self-service system has improved the usability of the data and greatly improved the efficiency with which it is used.

Pitfalls encountered while using Spark

Adopting any new technology is a journey from unfamiliarity to familiarity: from the initial excitement the new technology brings, to the helplessness and frustration when difficulties arise, and then to the joy once the problems are solved. Spark, the rising star of big data, is no exception. Here are some of the pitfalls we have run into.

"Pit One: When you run a large data set, you will encounter Org.apache.spark.SparkException:Error communicating with Mapoutputtracker"

This error is very cryptic. From the error log alone it looks like a communication problem within the Spark cluster, but if you look at the physical machines it is running on, you will find that disk I/O is very high. Further analysis shows the cause: when processing large datasets, Spark generates too many temporary files during the shuffle, which puts too heavy a disk I/O load on the operating system. Once the cause is found, the fix is easy: set spark.shuffle.consolidateFiles to true. This parameter is false by default; for the Linux ext4 file system it is recommended to set it to true, and the Spark documentation likewise suggests that setting it to true on ext4 improves performance.
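
In code, the fix looks roughly as follows (SparkConf API shown here; in the Spark 0.8.x era the same property could also be set as a Java system property before creating the SparkContext):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")
      .set("spark.shuffle.consolidateFiles", "true") // merge shuffle output files per core
    val sc = new SparkContext(conf)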

"Pit II: Run times fetch failure wrong"

When running Spark programs on large datasets, you will in many cases encounter fetch failure errors. Since Spark itself is designed for fault tolerance, most fetch failures are recovered by retries, so the Spark job as a whole still runs to completion, but the execution time grows significantly because of the retries. The root causes of fetch failures vary; the error itself only says that a task could not read shuffle data from a remote node, so to find the specific cause you have to dig into the Spark run logs, for example with:
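
    # when running on YARN, pull together the logs of the finished application
    yarn logs -applicationId <applicationId>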

Reading through Spark's run logs usually reveals the real cause of the fetch failure, and most of these problems can be solved by proper parameter configuration and by optimizing the program. Chen Chao's talk at the 2014 Spark Summit gives very good advice on how to tune Spark performance.

Of course, we have hit other problems with Spark as well, but since Spark is open source, most of them can be solved by reading the source code and with the help of the open source community.

Next steps

Spark made great strides in 2014, and the big data ecosystem around it kept growing. Spark 1.3 introduced the new DataFrame API, which will make working with data in Spark even friendlier. Tachyon, a distributed cache system that also came out of AMPLab, has been attracting attention thanks to its good integration with Spark. Given that in our business scenarios much of the underlying data needs to be reused by several different Spark jobs, our next step is to introduce Tachyon into the architecture as a cache layer. In addition, as SSDs become more and more affordable, we plan to add SSD storage to every machine in the cluster and direct Spark's shuffle output to SSD, using its fast random read/write capability to further improve the efficiency of big data processing.
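
Configuration-wise, that plan roughly amounts to pointing Spark's local scratch directories at the SSD mount points (the paths below are examples; note that on YARN the NodeManager's local directories, rather than spark.local.dir, determine where shuffle files land):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("ssd-shuffle-sketch")
      // comma-separated scratch dirs used for shuffle output and spills
      .set("spark.local.dir", "/ssd1/spark-local,/ssd2/spark-local")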

In machine learning, the H2O engine also integrates well with Spark, giving rise to Sparkling Water. We believe that with Sparkling Water, even as a startup we can harness the power of deep learning to further explore the value of our data.

Conclusion

In 2004, Google's MapReduce paper ushered in the era of big data processing, and over the following decade Hadoop MapReduce became synonymous with big data processing. Matei Zaharia's 2012 paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", heralded a new generation of big data processing technology. With the development of new hardware, the broad demand for low-latency big data processing, and the growing adoption of data mining in the big data field, Spark, as a brand-new big data ecosystem, has gradually been replacing traditional MapReduce and becoming the hottest of the next generation of big data processing technologies. Our architectural evolution from MapReduce to Spark over the past two years largely mirrors the technological evolution of many practitioners in the big data sector. We believe that as the Spark ecosystem matures, more and more enterprises will migrate their data platforms to Spark; and as more big data engineers become familiar with Spark, the domestic Spark community will become increasingly active. As an open-source platform, Spark will attract more and more Chinese contributors to Spark-related projects, and Spark itself will become ever more mature and powerful.

Source: http://www.csdn.net/article/2015-06-08/2824889
