Sun Yuanhao: High-speed in-memory analysis and mining tools based on the Spark engine


On April 19, 2014, Spark Summit China 2014 will be held in Beijing. For the first time, Apache Spark community members and business users from home and abroad will gather in Beijing. Spark contributors and front-line developers from AMPLab, Databricks, Intel, Taobao, NetEase, and other organizations will share their Spark project experience and best practices from production environments.

The following is the original text of the interview:

- Why did you begin working on Spark technology?

We built a SQL engine on Hadoop in 2012, but the project did not continue, because we found that the engine with the best fault tolerance and scalability for parallel SQL execution was still the M/R engine, not the Dremel-style or MPP engines. So I started looking at how to rebuild or re-implement M/R, and began studying Spark. Spark's architectural design is very elegant: the RDD abstraction and its operation primitives are much like the parallel architectures we designed for multicore and GPU in earlier years, such as CUDA. I believe Spark is our ideal M/R computation engine, so we devoted all our energy to Spark development.
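The RDD abstraction mentioned above combines an immutable data set with lazy transformation primitives (such as `map` and `filter`) and actions (such as `collect`) that trigger actual evaluation. The toy `MiniRDD` class below is a hypothetical single-machine sketch of that idea in plain Python, not Spark's actual implementation:

```python
# A toy, single-machine illustration of the RDD idea: transformations
# only record lineage; an action replays the lineage over the data.
class MiniRDD:
    def __init__(self, data, transforms=None):
        self.data = list(data)
        self.transforms = transforms or []  # recorded lineage, not yet run

    def map(self, f):
        # Lazy: record the transformation, compute nothing yet.
        return MiniRDD(self.data, self.transforms + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self.data, self.transforms + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the base data.
        items = self.data
        for kind, f in self.transforms:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

rdd = MiniRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # → [0, 4, 16, 36, 64]
```

Because the lineage is recorded rather than executed eagerly, a real engine can optimize the whole chain as a DAG and recompute lost partitions from lineage for fault tolerance, which is exactly what makes the design attractive as an M/R replacement.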

- What problems is Spark uniquely well suited to solve?

We have now embedded Spark as an M/R execution engine in our product, and it has been successful in two broad categories of applications. One is interactive data statistics and analysis via PL/SQL, combined with visualization tools, to give users high-speed big-data exploration. Such applications traditionally relied on a data warehouse, but because Spark offers faster performance along with big-data processing capability, users get much quicker feedback. The other category is data mining: because Spark makes full use of memory for caching and uses a DAG to eliminate unnecessary steps, it is well suited to iterative computation, and a considerable portion of machine-learning algorithms converge through repeated iterations, so they are a good fit for Spark. We have parallelized some commonly used algorithms with Spark so that they can be called easily from the R language, reducing users' learning cost for data mining.
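The iterative, convergence-based algorithms described above all share one access pattern: many passes over the same data set, which is why in-memory caching pays off so dramatically. As a minimal sketch (plain Python, not Spark; the data values are invented for illustration), gradient descent for a one-parameter least-squares fit shows the pattern:

```python
# Iterative convergence over a cached data set: gradient descent for a
# one-parameter least-squares fit y ~ w * x. Every iteration re-reads
# the same data, so keeping it in memory (as Spark's cache does on a
# cluster) avoids repeated disk I/O.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x, y) pairs

w, lr = 0.0, 0.01
for step in range(200):  # many full passes over the same data
    # Mean gradient of the squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 2))  # → 2.03 (the least-squares slope for this data)
```

With M/R, each of those 200 passes would typically be a separate job reading its input from disk; with an in-memory cache the data is loaded once and every subsequent pass is a memory scan.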

- What is currently the biggest difficulty in enterprise adoption of Spark?

I do not think there is any major technical difficulty at present. We have deployed our own version of Spark in some of our customers' core business systems, where it runs 24/7 without interruption and has proven stable. We have also successfully applied Spark to data warehousing, achieving nearly full visualization without programming. If anything, the biggest difficulty we face now lies in customer perception. Many customers have deployed Hadoop over the past two years, and the lesson learned is that Hadoop is good at handling data above 100 TB but inefficient with smaller data sets; combined with the operations and maintenance problems caused by the talent shortage, this has given some users misconceptions about Hadoop and pushed them toward hybrid architectures. As Spark technology advances, the combination of Hadoop plus Spark has in fact dramatically improved processing efficiency and can now handle large, medium, and small data-processing problems alike, but changing business users' perceptions will require more success stories and more technology promotion.

- In your view, how is Spark developing at present?

Spark's current direction is to combine SQL, machine learning, graph computing, streaming computing, and more within a single computing framework, with Spark SQL as one example of that approach. Projects around Spark, such as Tachyon, SparkR, and BlinkDB, are also developing rapidly; Tachyon has become a default component in the standard RHEL yum repository. Applications are becoming more and more widespread at home and abroad. Some large foreign Internet companies have already deployed Spark: Yahoo, an early major contributor to Hadoop, is now deploying Spark across multiple projects. In China, we have already deployed Spark in traditional industries such as telecom operators and e-commerce, and we expect more successful cases this year.

- Please tell us about the topics you will share at this conference.

At this conference I will introduce two typical Spark applications: one is how to take full advantage of Spark for interactive SQL data analysis; the other is how to combine the R language with Spark for distributed data mining.

- Which listeners will benefit most from these topics?

The following audiences may be interested in these topics: end users who wish to analyze and monetize their enterprise's big data; users or developers who have used Hadoop but experienced poor performance; users whose data volume is growing rapidly from TB toward PB scale; and users whose data volume is below 10 TB but who want to try the new technology.

Original link: http://www.csdn.net/article/2014-04-08/2819193-Hadoop-Spark
