Compared with Hadoop: An Analysis of Why Spark Is Popular


The author has noticed some unusual activity around Apache Spark recently: Databricks raised $14M USD to back Spark, Cloudera decided to support Spark, and Spark is now considered a big deal in the big data field.

Good First Impressions

The author has been working with Spark's Scala API (Spark is written in Scala) for some time and, to tell the truth, was very impressed at first, because Spark looked so small and neat. The basic abstraction is the Resilient Distributed Dataset (RDD), an essentially immutable distributed collection that can be backed by HDFS in Hadoop or defined from local files, and that supports Scala-style functional operations such as map and foreach.
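Below is a minimal sketch of what this looks like with Spark's Scala API; the application name, master URL, and input path are placeholders rather than details from the original article.

```scala
// A minimal sketch of the RDD abstraction described above (Spark's Scala API).
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD can be defined from a local file or from HDFS.
    val lines = sc.textFile("data/example.txt")

    // Scala-style functional operations run in parallel across partitions.
    val lengths = lines.map(_.length)
    lengths.take(10).foreach(println)

    sc.stop()
  }
}
```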

The first reaction was, "Wait, isn't this just a basic distributed collection?" Hadoop is much more than that: a distributed file system and, above all, MapReduce, with support for a wide variety of data formats, data sources, unit testing, cluster variants, and so on.

Of course, Spark also supports more complex operations such as join, group-by, and reduce-by-key, which can be used to model complex data flows.
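As a hedged illustration of the group-by style of operation (the data and names here are invented, not taken from the article):

```scala
// A small groupByKey example: group all cities under their country key.
import org.apache.spark.{SparkConf, SparkContext}

object GroupByExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-by").setMaster("local[*]"))

    // (country, city) pairs; group all cities by country.
    val cities = sc.parallelize(Seq(("DE", "Berlin"), ("US", "NYC"), ("DE", "Munich")))
    val byCountry = cities.groupByKey()   // RDD[(String, Iterable[String])]

    byCountry.collect().foreach { case (country, cs) =>
      println(s"$country -> ${cs.mkString(", ")}")
    }

    sc.stop()
  }
}
```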

As time went on, it became clear that much of Spark's apparent simplicity comes from comparing it with Hadoop's Java API, where even the simplest use case requires a lot of code. But conceptually Hadoop is simple too: it offers only two basic operations, a parallel map and a reduce operation. If you express them in the same way over similar distributed collections, you also end up with a small interface (projects such as Scalding actually build something like this, and the code looks very similar to Spark's).

To convince himself, the author kept digging and found that Spark actually provides a non-trivial set of operations. RDDs, the basic building blocks of Spark, are essentially immutable distributed collections. Operations such as map and foreach are easy to parallelize, and two RDDs can be joined on a common key. You can also use a user-defined function to implement key-based aggregation with reduce operations.
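The following sketch shows a join on a common key and a key-based aggregation with a user-defined reduce function; the datasets are invented for illustration.

```scala
// Joining two RDDs on a common key and aggregating values per key.
import org.apache.spark.{SparkConf, SparkContext}

object PairOperations {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-ops").setMaster("local[*]"))

    val clicks    = sc.parallelize(Seq(("user1", 3), ("user2", 5), ("user1", 2)))
    val countries = sc.parallelize(Seq(("user1", "DE"), ("user2", "US")))

    // Key-based aggregation with a user-defined reduce function.
    val clicksPerUser = clicks.reduceByKey(_ + _)

    // Join the two RDDs on their common key.
    val joined = clicksPerUser.join(countries)   // (user, (totalClicks, country))
    joined.collect().foreach(println)

    sc.stop()
  }
}
```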

In the classic word-count example, you map a chunk of text to its words, then reduce by word, finally summing the counts. An RDD can be read from disk once and then kept in memory, which improves performance considerably; this is where Spark is much faster than Hadoop, which is largely disk-based.
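A minimal word-count sketch along these lines (the input path is a placeholder; cache() keeps the RDD in memory after the first read):

```scala
// Word count: map text to words, reduce by word, sum the counts.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    val lines = sc.textFile("data/input.txt").cache()   // read once, keep in memory

    val counts = lines
      .flatMap(_.split("\\s+"))        // map text to individual words
      .map(word => (word, 1))          // one count per occurrence
      .reduceByKey(_ + _)              // sum counts per word

    counts.collect().foreach { case (word, n) => println(s"$word: $n") }

    sc.stop()
  }
}
```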

Spark's approach to fault tolerance is also interesting. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that produced a dataset (Banq note: similar to event sourcing, which remembers the series of events that led to a state). So when a node fails, Spark rebuilds the lost partitions from that recorded lineage. The authors argue this is actually not that bad, because other nodes help with the rebuilding.

So, in essence, Spark has a smaller interface than Hadoop's basic primitives (though it may still grow bloated in the future), but there are many projects on top of Hadoop (such as Twitter's Scalding) that achieve a similar level of expressiveness. The other major difference is that Spark keeps data in memory by default, which naturally improves performance and even makes running iterative algorithms feasible. Spark has no built-in support for iteration, however; the claim is simply that it is fast enough that you can run iterations yourself if you want to.
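A short sketch of the lineage idea, assuming a local Spark context; toDebugString prints the chain of transformations Spark would replay to rebuild lost partitions:

```scala
// Lineage instead of checkpoints: lost partitions are recomputed from the
// recorded chain of transformations. Data here is illustrative.
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)
    val derived = numbers.map(_ * 2).filter(_ % 3 == 0)

    // Keep the result in memory; if a node is lost, its partitions are
    // recomputed from the lineage rather than restored from a checkpoint.
    derived.cache()

    println(derived.toDebugString)   // prints the lineage of transformations
    println(derived.count())

    sc.stop()
  }
}
```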

Spark also comes with a streaming data processing model, and there is a paper outlining the design, which is quite nice. This is where Spark differs from Twitter's Storm framework: Storm is basically a pipeline into which you push independent events and get results back in a distributed way. In Spark, by contrast, events are collected and processed in batches at short intervals (say, every 5 seconds). The collected data becomes an RDD and is then processed with the usual set of Spark operations.
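A hedged sketch of this micro-batch model using Spark Streaming's DStream API with a 5-second batch interval; the socket source on localhost:9999 is an assumption for illustration:

```scala
// Micro-batch streaming: events are collected into 5-second batches, each of
// which becomes an RDD processed with ordinary Spark operations.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

    // Placeholder source: each 5-second batch of lines becomes an RDD.
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)       // ordinary RDD-style operations per batch

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```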

This pattern is more robust against slow nodes and failures, and a 5-second interval is usually fast enough for most applications. I'm not entirely convinced, because distributed computing is always very complex, but it is certainly nice that this approach unifies the non-real-time and real-time parts of stream processing.

Because RDDs are immutable, if you need to make a small change to a few data items you have to produce a copy of the entire dataset. This can be done in parallel, but it naturally has a cost; a copy-on-write implementation might be more efficient here, but it is not implemented yet.
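A small sketch of that point (the record type and the updated key are made up): "changing" a single item still means mapping over the whole RDD to produce a new one.

```scala
// RDDs are immutable, so a small update produces a whole new dataset.
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableUpdate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("immutable-update").setMaster("local[*]"))

    val scores = sc.parallelize(Seq(("alice", 10), ("bob", 20), ("carol", 30)))

    // To update a single entry we still transform the whole RDD into a new one.
    val updated = scores.map {
      case ("bob", _) => ("bob", 25)   // the one "changed" item
      case other      => other
    }

    updated.collect().foreach(println)
    sc.stop()
  }
}
```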
