Is Apache Spark the Next Big Thing in the Big Data Field?

Source: Internet
Author: User
Keywords: big data

The author observed that Apache Spark has recently been at the center of some unusual events: Databricks will provide $14M USD to support Spark, and Cloudera has decided to support Spark. Spark is being treated as a big deal in the big data field.

Good First Impressions

The author has been working with Spark's Scala API (Spark is written in Scala) for some time and, to tell the truth, was very impressed at first because Spark looked so small and clean. The basic abstraction is the Resilient Distributed Dataset (RDD), an essentially immutable distributed collection that can be backed by files in Hadoop's HDFS or defined from local files, and that offers Scala-style functional operations such as map and foreach.
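A minimal sketch of what that looks like in Scala; the local master URL and the input file name are assumptions for illustration, not from the article:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup and input path, purely for illustration.
val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.textFile("data.txt")   // RDD[String] from a local file or an HDFS path
val lengths = lines.map(_.length)     // transformation: lazily describes a new RDD
lengths.foreach(println)              // action: actually runs the computation
```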

The first reaction is, "Wait, is this just a basic distributed collection?" Hadoop is so much more than that: a distributed file system, MapReduce in particular, support for a variety of data formats and data sources, unit testing, cluster variants, and so on.

Of course, Spark also supports more complex operations such as joins, group-by, or reduce-by operations, so you can model fairly complex data flows.

As time went on, it became clear that much of Spark's perceived simplicity comes from comparing it with Hadoop's Java API. In Hadoop, even the simplest use case requires a lot of code. Conceptually, though, Hadoop is simple: it offers only two basic operations, a parallel map and a reduce operation. If you express them in the same way, over similar distributed collections, you end up with an equally small interface (projects such as Scalding actually build exactly this, and the code looks very similar to Spark's).

To convince himself, the author kept studying and found that Spark actually provides a non-trivial set of operations. RDDs are the basic building blocks of Spark and behave like distributed immutable collections. Operations such as map and foreach are easy to parallelize, you can join two RDDs on a common key, and you can aggregate values per key with a user-defined reduce function.
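A sketch of those per-key operations in Scala; the sample key-value data below are invented purely for illustration:

```scala
// Hypothetical (user, amount) and (user, country) pair RDDs, invented for illustration.
val purchases = sc.parallelize(Seq(("alice", 3), ("bob", 1), ("alice", 2)))
val countries = sc.parallelize(Seq(("alice", "DE"), ("bob", "US")))

// Aggregate per key with a user-defined reduce function.
val totals = purchases.reduceByKey(_ + _)   // ("alice", 5), ("bob", 1)

// Join two RDDs on their common key.
val joined = totals.join(countries)         // ("alice", (5, "DE")), ("bob", (1, "US"))
joined.collect().foreach(println)
```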

In the word count example, you map a body of text to its individual words, then reduce by word, summing up the counts. RDDs can be read from disk and then kept in memory, which improves performance; this makes Spark much faster than Hadoop, which is mostly disk-based.
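That word count could look roughly like this in Spark's Scala API (the input path is an assumption):

```scala
// Split lines into words, pair each word with 1, and sum the counts per word.
val counts = sc.textFile("hdfs:///input/text.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```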

Spark's approach to fault tolerance is also interesting. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that produced a dataset (Banq's note: similar to event sourcing, which remembers the series of events that led to the current state). So when a node fails, Spark rebuilds the lost data from that recorded lineage. The Spark authors argue this is not actually that bad, because the other nodes help with the rebuild.

So, in essence, Spark has a smaller interface than Hadoop's basic primitives (though it may still grow just as bloated in the future), but there are many projects on top of Hadoop (such as Twitter's Scalding) that achieve a similar level of expressiveness. The other major difference is that Spark keeps data in memory by default, which naturally yields a performance improvement and even makes it feasible to run iterative algorithms. Spark has no built-in support for iteration, however; the claim is simply that it is fast enough that you can run the iterations yourself if you want to.
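A minimal sketch of what such a hand-rolled iteration over an in-memory RDD might look like; the input file, the update step, and the number of passes are all assumptions for illustration:

```scala
// Assumed numeric input, cached so every iteration reads from memory instead of disk.
val points = sc.textFile("points.txt").map(_.toDouble).cache()

var estimate = 0.0
for (_ <- 1 to 10) {
  // Each pass rescans the cached RDD; no built-in iteration support is needed.
  val correction = points.map(p => p - estimate).sum() / points.count()
  estimate += 0.5 * correction
}
println(estimate)
```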

Spark also comes with a streaming data processing model, and there is a paper outlining the design, which is quite nice. Spark's approach differs from Twitter's Storm framework. Storm is basically a pipeline into which you push independent events and get the results computed in a distributed fashion. In Spark, by contrast, events are collected and then processed in batches at short intervals (say, every 5 seconds). The collected data become an RDD, which is then processed with the usual set of Spark operations.
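A sketch of that micro-batch model using Spark Streaming's Scala API; the socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // events are grouped into 5-second batches

// Each 5-second batch becomes an RDD and is processed with the usual operations.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```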

This pattern is claimed to be more robust against slow nodes and failures, and a 5-second interval is usually fast enough for most applications. The author is not entirely sure about this, since distributed computing is always very complex; what is certainly true is that this approach unifies the non-real-time and the real-time streaming parts of a system nicely.

Because RDDs are immutable, if you need to make a small change to a few data items, you have to make a copy of the entire dataset. The copy can be made in parallel, but it still has a cost; a copy-on-write implementation might be more efficient here, but it is not yet implemented.
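To make that concrete, here is a small Scala sketch; the record type and the update predicate are invented for illustration:

```scala
// Because RDDs are immutable, "updating" one record means rewriting the whole dataset.
case class Record(id: Long, value: Double)

val records = sc.parallelize(Seq(Record(1, 1.0), Record(2, 2.0), Record(3, 3.0)))

// Only id == 2 changes, yet every element is copied into a new RDD.
val updated = records.map(r => if (r.id == 2) r.copy(value = r.value * 10) else r)
updated.collect().foreach(println)
```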

Original link: http://www.jdon.com/46098
