Design ideas for Apache Spark


As you know, Apache Spark is currently the hottest open source Big Data project; even Pivotal, EMC's data-focused spin-off, has begun shifting development effort away from its decade-old Greenplum technology toward Spark, and across the industry Spark's popularity is matched perhaps only by OpenStack's in the IaaS world. Since this is a technical article, let's go straight into its core mechanisms.



What is in-memory computing?
As with cloud computing and Big Data, neither Baidu Baike nor Wikipedia gives a very precise definition of in-memory computing, but a few key points recur, which I would summarize as follows. First, the data lives in memory: at a minimum, the data involved in the current query is kept in RAM. Second, multi-threaded and multi-machine parallelism: the hardware threads of modern x86 Xeon CPUs are exploited as much as possible to speed up the whole query. Third, support for multiple types of workloads: beyond common, basic SQL queries, such systems usually also support data mining, and some go further and support a full stack of common programming models, such as SQL queries, stream computing, and data mining.
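The first two points above can be sketched in a few lines. This is a toy, single-machine illustration (not Spark code): an in-memory "table" is split into partitions and each worker scans one partition, then the partial results are combined. Note that in CPython real CPU speedups would require processes or native code because of the GIL; the point here is only the partition-and-aggregate pattern.

```python
from concurrent.futures import ThreadPoolExecutor

# The whole "table" lives in RAM; no disk access during the query.
data = list(range(1_000_000))

# Split the in-memory data into equal partitions, one per worker.
n_workers = 4
chunk = len(data) // n_workers
partitions = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

def scan(part):
    # Per-partition work: a filter plus an aggregate, like one thread
    # of a larger query plan.
    return sum(x for x in part if x % 3 == 0)

# Run the partition scans in parallel and combine the partial sums.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(scan, partitions))

print(total)  # 166666833333
```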
Design ideas for Apache Spark

Figure 1. Spark's core mechanism diagram
Spark's core mechanism has two main layers. The first is the RDD (Resilient Distributed Dataset), Spark's most basic abstraction: an abstraction over distributed memory that lets you manipulate a distributed dataset as if it were a local collection. An RDD represents a collection of data that is partitioned, immutable, and able to be operated on in parallel, and it is typically cached in memory. The result of each operation on an RDD can be kept in memory, and the next operation can then read its input directly from memory, eliminating the large amount of disk I/O caused by shuffle operations in the MapReduce framework. For iterative workloads such as common machine learning algorithms, and for interactive data mining, this yields a fairly large efficiency gain.
The second layer is the operators that execute on RDDs. Spark supports two kinds of operators: transformations and actions. Transformations include map, filter, groupBy, and join; actions include count, collect, and save.
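The transformation/action split, and the in-memory caching described above, can be illustrated with a toy single-machine sketch. This is not Spark's implementation; the class and method names are mine, chosen to mirror the RDD API: transformations are merely recorded, and nothing is computed until an action forces evaluation, while cache() pins the materialized result in memory so later actions skip recomputation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, data, ops=None):
        self._data = data         # the source collection
        self._ops = ops or []     # recorded (not yet executed) transformations
        self._cache = None        # materialized result, if cache() was called

    # --- transformations: return a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    # --- evaluation machinery ---
    def _evaluate(self):
        if self._cache is not None:
            return self._cache    # served from memory, no recomputation
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def cache(self):
        self._cache = self._evaluate()  # keep the result in memory
        return self

    # --- actions: force the recorded pipeline to run ---
    def collect(self):
        return self._evaluate()

    def count(self):
        return len(self._evaluate())


# Building the pipeline runs nothing; collect()/count() trigger evaluation.
squares = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())  # [0, 4, 16, 36, 64]
print(squares.count())    # 5
```

The design choice this mirrors is that laziness lets Spark see the whole chain of transformations before running anything, so it can pipeline them and keep intermediate results in memory rather than spilling each step to disk.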
The common format for Spark's stored data is key-value, i.e. Hadoop's standard SequenceFile, though it is also said to support columnar formats such as Parquet. The advantage of the key-value format is flexibility: it can carry anything from data mining algorithms and detail-level data queries up to complex SQL processing. The disadvantage is equally obvious: wasted storage space. Compared with a columnar format like Parquet, the same data in key-value format is generally about twice the size of the original data, while in columnar form it is generally 1/3 to 1/4 of the original. At the efficiency level, because Spark is written in a high-level JVM-based language (Scala), a certain amount of overhead is noticeable: a standard Java program executes nearly 60% slower than C/C++ compiled at -O0.
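The row-versus-column size gap is easy to demonstrate. The toy below (my own illustration, using JSON rather than SequenceFile or Parquet) serializes the same records once as row-by-row key-value pairs, where every record repeats its field names, and once as columns, where each field name is stored only once. The exact ratios quoted above depend on the data and the real formats; this merely shows why key-value storage inflates data.

```python
import json

# 1000 identical-schema records, as a query engine might store them.
records = [{"user_id": i, "country": "CN", "clicks": i % 7}
           for i in range(1000)]

# Key-value (row) layout: field names are repeated in every record.
row_bytes = json.dumps(records).encode()

# Columnar layout: each field name appears once; values grouped by column.
columns = {k: [r[k] for r in records] for k in records[0]}
col_bytes = json.dumps(columns).encode()

print(len(row_bytes), len(col_bytes))  # the columnar encoding is smaller
```

Real columnar formats like Parquet go much further, applying per-column encodings (dictionary, run-length) that exploit the similarity of adjacent values in a column, which is where the 1/3 to 1/4 figures come from.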
In terms of technological innovation, I personally feel Spark is far from revolutionary, as it is in fact a fairly typical in-memory data grid. From IBM WebSphere eXtreme Scale of 7-8 years ago to Pivotal's more recent GemFire (used to power China's 12306 rail ticketing site), such systems share a similar architecture: a number of machines are combined into one large memory grid; data is stored in something close to key-value form; the grid relies on a set of mechanisms to keep the data in memory durably and stably, and to keep it updated and recoverable; and a set of common operators runs on the grid to perform flexible queries, which users can invoke directly from their own programs.
