The author observed that Apache Spark has recently been making some notable news: Databricks is backing Spark with $14M USD, Cloudera has decided to support Spark, and Spark is now considered a big deal in the big data field.
Good First Impressions
The author has been working with Spark's Scala API (Spark is written in Scala) for some time and, to tell the truth, was very impressed at first because Spark looked so small and neat. The basic abstraction is the Resilient Distributed Dataset (RDD), an essentially immutable distributed collection that can be backed by HDFS in Hadoop or defined from local files, and that provides Scala-style functional operations such as map and foreach.
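As a rough illustration of what that API looks like, here is a minimal sketch in Scala; the master setting, application name, and input path are placeholders, not values from the article.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicits needed on older Spark versions

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local SparkContext; "local[2]" and the app name are placeholder settings.
    val sc = new SparkContext("local[2]", "rdd-basics")

    // An RDD can be defined from a local file or an HDFS path (placeholder path).
    val lines = sc.textFile("data/input.txt")

    // Scala-style functional operations on the distributed collection.
    val lengths = lines.map(_.length)        // transform each element
    lengths.foreach(len => println(len))     // run an action on each element

    sc.stop()
  }
}
```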
The first reaction is, "Wait, is this just a basic distributed collection?" Hadoop is so much more than that: a distributed file system and, above all, MapReduce, with support for a variety of data formats, data sources, unit testing, cluster variants, and so on.
Of course, Spark also supports more complex operations such as join, groupBy, and reduceByKey, with which complex data flows can be modeled.
As time went on, it became clear that much of Spark's apparent simplicity comes from comparing it with Hadoop's Java API, where even the simplest case requires a lot of code. Conceptually, though, Hadoop is simple: it offers only two basic operations, a parallel map and a reduce. Expressed in the same way over a similar distributed collection, it would also yield a small interface (projects such as Scalding actually build something like this, and the code ends up looking very similar to Spark's).
To be fair, the author kept digging and found that Spark actually provides a nontrivial set of operations. The RDD, an immutable distributed collection, is Spark's basic building block. Operations such as map and foreach are easy to run in parallel, two RDDs can be joined on a common key, and aggregations can be implemented with a user-defined reduce function applied per key.
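A short sketch of those key-based operations, with made-up example data (the dataset contents and app settings are only for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicits needed on older Spark versions

object PairOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "pair-ops")   // placeholder settings

    // Two small key-value RDDs (illustrative data only).
    val purchases = sc.parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 2)))
    val cities    = sc.parallelize(Seq(("alice", "Berlin"), ("bob", "Paris")))

    // Aggregate per key with a user-defined reduce function.
    val totals = purchases.reduceByKey(_ + _)     // ("alice", 5), ("bob", 5)

    // Join two RDDs on their common key.
    val joined = totals.join(cities)              // ("alice", (5, "Berlin")), ...

    joined.foreach(println)
    sc.stop()
  }
}
```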
In the word-count example, you map a body of text into its individual words, then reduce by word, summing up the counts. RDDs can be read from disk and then kept in memory, which improves performance considerably; this is a big difference from Hadoop, which is mostly disk-based.
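A minimal word-count sketch along those lines, with a placeholder input path; the cache() call illustrates keeping a result in memory for reuse:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicits needed on older Spark versions

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "word-count")   // placeholder settings

    val text = sc.textFile("data/input.txt")              // placeholder path

    val counts = text
      .flatMap(_.split("\\s+"))       // split lines into words
      .map(word => (word, 1))         // emit (word, 1) pairs
      .reduceByKey(_ + _)             // sum counts per word

    counts.cache()                    // keep the result in memory for reuse
    counts.foreach(println)

    sc.stop()
  }
}
```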
Spark's approach to fault tolerance is also interesting. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that produced a dataset (Banq Note: similar to event sourcing, which remembers the series of events that led to the current state). So when a node fails, Spark rebuilds the lost data from that recorded lineage, and the claim is that this is not too bad in practice because other nodes help with the rebuild. So, in essence, Spark has a smaller interface than Hadoop's basic primitives (though it may still grow bloated in the future), but there are projects on top of Hadoop (such as Twitter's Scalding) that achieve a similar level of expressiveness. The other major difference is that Spark keeps data in memory by default, which naturally brings performance improvements and even makes iterative algorithms feasible. Spark has no built-in support for iteration, however; the claim is simply that it is fast enough that you can run iterations yourself if you want to.
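A small sketch of the lineage idea: none of the intermediate results below are persisted, each RDD only records how it was derived from its parent, and toDebugString prints that recorded chain (the data and settings are placeholders):

```scala
import org.apache.spark.SparkContext

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "lineage-demo")   // placeholder settings

    // No intermediate result is materialized; each RDD remembers only the
    // operation that produced it, so a lost partition can be recomputed.
    val numbers  = sc.parallelize(1 to 1000000)
    val squares  = numbers.map(n => n.toLong * n)
    val filtered = squares.filter(_ % 2 == 0)

    // Prints the recorded chain of operations (the lineage) for this RDD.
    println(filtered.toDebugString)

    sc.stop()
  }
}
```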
Spark also comes with a streaming data processing model, and there is a paper outlining the design that looks quite good. Spark's approach differs from Twitter's Storm framework: Storm is basically a pipeline into which you push individual events, which are then processed in a distributed fashion. In Spark, by contrast, events are collected and processed in batches at short intervals (say, every 5 seconds). The data collected in each interval become an RDD, which is then processed with Spark's usual set of operations.
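A rough sketch of that batched streaming model with Spark Streaming, assuming a text stream arriving on a socket; the host, port, and batch interval are placeholder choices, and the per-batch logic is just the word count from above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // implicits needed on older versions

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")

    // Events are collected into batches every 5 seconds; each batch becomes an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A text stream read from a socket (host and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same word-count logic as in the batch case, applied to each batch.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // print a few results for each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```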
This approach is more robust against slow nodes and failures, and a 5-second interval is usually fast enough for most applications. I'm not entirely sure about this, because distributed computing is always very complex, but it is certainly true that this approach unifies the non-real-time (batch) and real-time streaming parts quite nicely.
Because RDDs are immutable, if you need to make a small change to a few data items, you have to make a copy of the entire dataset. This can be done in parallel, but it naturally has a cost; a copy-on-write implementation might be more efficient here, but it is not implemented yet.
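To make that concrete, here is a small sketch (with made-up data and placeholder settings) of what "updating" a single record looks like: the whole collection is passed through a map that produces a new RDD, even though only one element actually changes.

```scala
import org.apache.spark.SparkContext

object ImmutableUpdate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "immutable-update")   // placeholder settings

    val users = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))

    // "Changing" one record means deriving a whole new RDD: every element
    // flows through the map, even though only one item is actually modified.
    val updated = users.map {
      case (2, _) => (2, "robert")   // the one item we actually change
      case other  => other           // everything else is carried over as-is
    }

    updated.foreach(println)
    sc.stop()
  }
}
```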