Spark has been gaining a lot of popularity lately. Its distributed computing approach, based on MapReduce, makes Spark similar to Hadoop, yet it is more versatile, iterates more efficiently, and has stronger fault tolerance. Spark looks set to become a very successful parallel computing framework.
The author of "Editor's note", Mikio Braun, is a postdoctoral student at Berlin University of Technology, and he describes his own process of spark understanding, and analyzes the principles and applications of spark. As a common parallel processing framework, spark has some advantages like Hadoop, and Spark uses better memory management, which is more efficient than Hadoop in iterative computing, and Spark provides a wider range of data set operation types that greatly facilitates user development. Checkpoint's application makes spark have a strong fault-tolerant capability, and many superior performance and wider application than Hadoop make Spark's further development worth looking forward to.
The following is the translation:
Apache Spark has quite a reputation these days. Databricks, the company founded to support the Spark project, raised $14 million from Andreessen Horowitz, Cloudera has decided to fully support Spark, and many other companies have joined in enthusiastically. So I figured it was about time I found out what all the fuss is about.
I spent some time studying the Scala API (Spark is written in Scala), and frankly, at first I was disappointed because Spark looked so unimpressive. The basic abstraction is the Resilient Distributed Dataset (RDD), essentially a distributed immutable collection, which can be defined from local files or files stored on Hadoop via HDFS, and which provides the usual Scala-style collection operations such as map and foreach.
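To make this concrete, here is a minimal sketch of what that abstraction looks like through the Scala API; the file path, application name, and master URL are placeholders I made up, not anything from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local setup; app name and master URL are placeholders.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // An RDD can be defined from a local file or a file stored on HDFS.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // The usual Scala-style collection operations are available on it.
    val lengths = lines.map(_.length)     // transform every element
    lengths.take(5).foreach(println)      // bring a few results back to the driver

    sc.stop()
  }
}
```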
My first reaction was, "Wait, is this really just basic distributed collections?" Hadoop seems much richer by comparison: a distributed file system, MapReduce of course, support for all kinds of data formats, data sources, unit tests, cluster variants, and so on.
Others were quick to point out that there is more to it: Spark also provides more complex operations such as joins, group-by, and reduce operations, so that you can model fairly complex data flows (though without iteration).
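As a rough illustration of those operations (the data here is invented, and `sc` is the SparkContext from the sketch above):

```scala
// (userId, item) and (userId, name) pairs; the data is made up for illustration.
val orders = sc.parallelize(Seq((1, "book"), (2, "pen"), (1, "lamp")))
val users  = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// join: combine entries of two RDDs that share a key.
val joined = orders.join(users)              // RDD[(Int, (String, String))]

// group-by and reduce: aggregate entries per key.
val itemsPerUser  = orders.groupByKey()      // all items for each userId
val ordersPerUser = orders.mapValues(_ => 1).reduceByKey(_ + _)

ordersPerUser.collect().foreach(println)
```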
Over time I realized that what Spark presents as simplicity actually says more about Hadoop's Java API than about Spark itself. Even simple examples in Hadoop tend to come with a lot of boilerplate code. Conceptually, though, Hadoop is very simple, offering only two basic operations: a parallel map and a reduce. Expressed the same way, as operations on a similar kind of distributed collection, it would in fact have an even smaller interface (projects such as Scalding do exactly this, and their code ends up looking quite similar to Spark's).
Spark actually does provide a substantial set of operations, as I became convinced after studying the paper that describes its overall architecture in more depth. RDDs are the basic building block of Spark and really do look like distributed immutable collections. They define operations such as map and foreach, which are easy to parallelize, as well as a join operation, which takes two RDDs and combines entries that share a key, and aggregation operations that combine entries per key using a user-specified function. In the word-count example, the text is mapped to all of its words in one pass, and the words are then reduced by key to produce the counts. RDDs can be read from disk and then kept in memory for speed, and they can be cached so they do not have to be re-read every time. That alone makes Spark much faster than Hadoop, which is largely disk-bound.
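The word-count flow described above could look roughly like this in the Scala API (a sketch assuming a plain text input; the path is a placeholder):

```scala
val text = sc.textFile("hdfs:///data/corpus.txt")

val counts = text
  .flatMap(_.split("\\s+"))     // map each line to its words
  .map(word => (word, 1))       // one count per occurrence
  .reduceByKey(_ + _)           // reduce by key to get per-word totals

// Keeping the RDD in memory avoids re-reading it from disk on repeated use.
counts.cache()
counts.take(10).foreach(println)
```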
The fault-tolerance mechanism is one of Spark's highlights. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that produced a given dataset. So when a node fails, Spark reconstructs the dataset from that stored information. The authors claim this works out fine because the other nodes can help with the reconstruction.
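You can actually inspect that recorded sequence of operations: the RDD API exposes it through `toDebugString`, which prints the chain of transformations leading back to the source (continuing from the word-count sketch above).

```scala
// Prints the lineage of the counts RDD: the chain of RDDs and transformations
// that Spark would replay to rebuild any lost partitions.
println(counts.toDebugString)
```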
So in essence, Spark has a smaller interface than raw Hadoop (and it may yet grow bloated in the future), but there are many Hadoop-based projects (such as Twitter's Scalding) that achieve a similar level of expressiveness. The other major difference is that Spark keeps data in memory by default, which naturally brings a big performance boost and even makes it feasible to run iterative algorithms. Spark has no built-in support for iteration, but as the authors claim: if you want iterations, they can be as fast as you want them to be.
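A hand-rolled iteration in the driver program gives a feel for why the in-memory default matters here; the update rule below is a stand-in rather than a real algorithm, and the input path is a placeholder.

```scala
// Each line is assumed to hold two comma-separated numbers.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                                 // stays in memory across iterations

var weight = 0.0
for (_ <- 1 to 10) {
  // Every pass is an ordinary map/reduce over the cached, in-memory RDD,
  // so after the first pass no per-iteration disk I/O is paid.
  val gradient = points.map(p => p(0) * (p(1) - weight)).reduce(_ + _)
  weight += 0.01 * gradient
}
println(s"final weight: $weight")
```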
Spark Streaming: the return of micro-batching
Spark also comes with a streaming data processing model, which naturally caught my interest. There is a paper with a nice design summary of it as well. Compared with Twitter's Storm framework, Spark takes an interesting and rather unique approach. Storm is basically a pipeline into which you push individual events, which then get processed in a distributed fashion. Spark instead follows a model in which events are collected and then processed in batches at short intervals (say, every 5 seconds). The collected data become an RDD of their own, which is then processed with the usual set of Spark operations.
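A minimal Spark Streaming sketch of that model might look as follows; the socket source, host, and port are placeholders standing in for whatever the real event stream is.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    // Events are collected and processed in 5-second batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in event source: lines of text arriving on a socket.
    val events = ssc.socketTextStream("localhost", 9999)

    // Each batch becomes an RDD, processed with the usual Spark operations.
    val counts = events.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```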
The authors claim that this model is more robust against slow nodes and failures, and that the 5-second interval is usually fast enough for most applications. I am not so sure about this, since distributed computing is always complicated, and I do not believe you can easily say that one approach is generally better than another. This approach does, however, nicely unify the streaming and non-streaming parts.
Conclusion
Spark looks very promising to me, and given the support and attention it is getting, I firmly believe it will mature and come to play a bigger role in this field. It is not suited to every scenario, of course. As the authors themselves admit, because RDDs are immutable, operations that change only a few entries of a data set are not a good fit: in principle you have to copy the whole data set even if you just want to change a single entry. This parallelizes nicely, but it is of course costly. A copy-on-write scheme might be more effective here, but it is not implemented yet.
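To see why, consider this small sketch (with invented data): because RDDs are immutable, "changing" one entry means deriving an entirely new RDD from the old one.

```scala
val records = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

// To update the value for key 2, every partition is mapped into a new RDD,
// even though only a single entry actually differs.
val updated = records.map {
  case (2, _) => (2, "B")
  case other  => other
}
```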
Stratosphere, a research project at TU Berlin, has similar goals but takes the approach further, developing more complex operations such as iteration and using the stored sequence of operations not only for fault tolerance but also for global scheduling optimization and parallelization.