Design ideas for Apache Spark


As you know, Apache Spark is currently the hottest open source Big Data project; even Pivotal, EMC's data-focused spin-off, has begun shifting development effort away from its decade-old Greenplum technology toward Spark, and across the industry Spark's popularity is matched perhaps only by OpenStack's in the IaaS world. Since this is a technical article, let's go straight into its core mechanisms.



What is in-memory computing?
As with cloud computing and Big Data, neither Baidu Baike nor Wikipedia gives a very precise definition of in-memory computing, but a few key points recur, which I would summarize as follows. First, the data lives in memory: at a minimum, the data involved in the current query is kept in RAM. Second, multi-threaded and multi-machine parallelism: the hardware threads of modern x86 Xeon CPUs are exploited as much as possible to speed up the whole query. Third, support for multiple types of workloads: beyond common, basic SQL queries, such systems usually also support data mining, and some go further and support a full stack of common programming models, such as SQL queries, stream computing, and data mining.
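The first two points above can be sketched in a few lines. This is a toy, single-machine illustration (not Spark code): an in-memory "table" is split into partitions and each worker scans one partition, then the partial results are combined. Note that in CPython real CPU speedups would require processes or native code because of the GIL; the point here is only the partition-and-aggregate pattern.

```python
from concurrent.futures import ThreadPoolExecutor

# The whole "table" lives in RAM; no disk access during the query.
data = list(range(1_000_000))

# Split the in-memory data into equal partitions, one per worker.
n_workers = 4
chunk = len(data) // n_workers
partitions = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

def scan(part):
    # Per-partition work: a filter plus an aggregate, like one thread
    # of a larger query plan.
    return sum(x for x in part if x % 3 == 0)

# Run the partition scans in parallel and combine the partial sums.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(scan, partitions))

print(total)  # 166666833333
```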
Design ideas for Apache Spark

Figure 1. Spark's core mechanism diagram
Spark's core mechanism has two main layers. The first is the RDD (Resilient Distributed Dataset), Spark's most basic abstraction: an abstraction over distributed memory that lets you manipulate a distributed dataset as if it were a local collection. An RDD represents a collection of data that is partitioned, immutable, and able to be operated on in parallel, and it is typically cached in memory. The result of each operation on an RDD can be kept in memory, and the next operation can then read its input directly from memory, eliminating the large amount of disk I/O caused by shuffle operations in the MapReduce framework. For iterative workloads such as common machine learning algorithms, and for interactive data mining, this yields a fairly large efficiency gain.
The second layer is the operators that execute on RDDs. Spark supports two kinds of operators: transformations and actions. Transformations include map, filter, groupBy, and join; actions include count, collect, and save.
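The transformation/action split, and the in-memory caching described above, can be illustrated with a toy single-machine sketch. This is not Spark's implementation; the class and method names are mine, chosen to mirror the RDD API: transformations are merely recorded, and nothing is computed until an action forces evaluation, while cache() pins the materialized result in memory so later actions skip recomputation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, data, ops=None):
        self._data = data         # the source collection
        self._ops = ops or []     # recorded (not yet executed) transformations
        self._cache = None        # materialized result, if cache() was called

    # --- transformations: return a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    # --- evaluation machinery ---
    def _evaluate(self):
        if self._cache is not None:
            return self._cache    # served from memory, no recomputation
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def cache(self):
        self._cache = self._evaluate()  # keep the result in memory
        return self

    # --- actions: force the recorded pipeline to run ---
    def collect(self):
        return self._evaluate()

    def count(self):
        return len(self._evaluate())


# Building the pipeline runs nothing; collect()/count() trigger evaluation.
squares = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())  # [0, 4, 16, 36, 64]
print(squares.count())    # 5
```

The design choice this mirrors is that laziness lets Spark see the whole chain of transformations before running anything, so it can pipeline them and keep intermediate results in memory rather than spilling each step to disk.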
The common format for Spark's stored data is key-value, i.e. Hadoop's standard SequenceFile, though it is also said to support columnar formats such as Parquet. The advantage of the key-value format is flexibility: it can carry anything from data mining algorithms and detail-level data queries up to complex SQL processing. The disadvantage is equally obvious: wasted storage space. Compared with a columnar format like Parquet, the same data in key-value format is generally about twice the size of the original data, while in columnar form it is generally 1/3 to 1/4 of the original. At the efficiency level, because Spark is written in a high-level JVM-based language (Scala), a certain amount of overhead is noticeable: a standard Java program executes nearly 60% slower than C/C++ compiled at -O0.
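The row-versus-column size gap is easy to demonstrate. The toy below (my own illustration, using JSON rather than SequenceFile or Parquet) serializes the same records once as row-by-row key-value pairs, where every record repeats its field names, and once as columns, where each field name is stored only once. The exact ratios quoted above depend on the data and the real formats; this merely shows why key-value storage inflates data.

```python
import json

# 1000 identical-schema records, as a query engine might store them.
records = [{"user_id": i, "country": "CN", "clicks": i % 7}
           for i in range(1000)]

# Key-value (row) layout: field names are repeated in every record.
row_bytes = json.dumps(records).encode()

# Columnar layout: each field name appears once; values grouped by column.
columns = {k: [r[k] for r in records] for k in records[0]}
col_bytes = json.dumps(columns).encode()

print(len(row_bytes), len(col_bytes))  # the columnar encoding is smaller
```

Real columnar formats like Parquet go much further, applying per-column encodings (dictionary, run-length) that exploit the similarity of adjacent values in a column, which is where the 1/3 to 1/4 figures come from.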
In terms of technological innovation, I personally feel Spark is far from revolutionary, as it is in fact a fairly typical in-memory data grid. From IBM WebSphere eXtreme Scale of 7-8 years ago to Pivotal's more recent GemFire (used to power China's 12306 rail ticketing site), such systems share a similar architecture: a number of machines are combined into one large memory grid; data is stored in something close to key-value form; the grid relies on a set of mechanisms to keep the data in memory durably and stably, and to keep it updated and recoverable; and a set of common operators runs on the grid to perform flexible queries, which users can invoke directly from their own programs.
