Spark: "Flash" of large data

Source: Internet
Author: User
Tags: hadoop, mapreduce

Spark has formally applied to join the Apache Incubator, growing from a laboratory "spark" into a rising star among big data technology platforms. This article describes Spark's design philosophy. Spark, true to its name, is an uncommon "flash" of big data, and its characteristics can be summarized as light, fast, flexible, and clever.

Light: Spark 0.6's core code is about 20,000 lines, versus roughly 90,000 lines for Hadoop 1.0 and 220,000 lines for Hadoop 2.0. On the one hand this is thanks to the conciseness and expressiveness of the Scala language; on the other, Spark makes good use of the infrastructure of Hadoop and Mesos (Berkeley's other project to enter the incubator, which focuses on dynamic resource management for clusters). Although very light, it does not compromise on fault-tolerant design. Matei, Spark's creator, says: "Do not treat failures as exceptions." The implication is that fault tolerance is part of the infrastructure.

Fast: Spark can achieve sub-second latency on small datasets, which is unthinkable for Hadoop MapReduce (because of its "heartbeat" interval mechanism, even starting a task incurs a delay of several seconds). For large datasets, Spark is about 10 times faster than implementations based on MapReduce, Hive, and Pregel for typical workloads such as iterative machine learning, ad hoc query, and graph computing. Memory computing, data locality, transmission optimization, and scheduling optimization deserve most of the credit here, and they are also closely tied to the lightweight philosophy held from the very start of the design.

Flexible: Spark offers flexibility at different levels. At the implementation level, it makes good use of Scala's trait dynamic mixin strategy (for example, a replaceable cluster scheduler and serialization library); at the primitive level, it allows the extension of new data operators, new data sources (for example, DynamoDB in addition to HDFS), and new language bindings (Java and Python); at the paradigm level, Spark supports a variety of paradigms such as memory computing, multi-iteration batch processing, ad hoc query, stream processing, and graph computing.
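To make the trait-mixin idea concrete, here is a minimal, self-contained Scala sketch; the Scheduler and LoggingScheduler names are purely illustrative and are not Spark's actual scheduler or serializer interfaces.

```scala
trait Scheduler {
  def submit(task: String): Unit
}

// A concrete base implementation (hypothetical).
class LocalScheduler extends Scheduler {
  def submit(task: String): Unit = println(s"running $task locally")
}

// A stackable modification: behavior mixed in around the base implementation.
trait LoggingScheduler extends Scheduler {
  abstract override def submit(task: String): Unit = {
    println(s"submitting: $task")
    super.submit(task)
  }
}

object MixinSketch extends App {
  // The concrete behavior is chosen at instantiation time by stacking traits.
  val scheduler = new LocalScheduler with LoggingScheduler
  scheduler.submit("stage-0")
}
```

The point is that a component can be swapped or decorated at the point of construction, without touching the rest of the code.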

Clever: Spark is clever in how it rides existing momentum and borrows strength. It borrows Hadoop's momentum by integrating seamlessly with Hadoop, and Shark (the data warehouse implementation on Spark) borrows Hive's momentum; for graph computing it borrows the APIs of Pregel and PowerGraph, as well as PowerGraph's vertex-splitting idea. Above all it rides the momentum of Scala (widely hailed as the future of Java): Spark programming's look and feel is authentic Scala, whether in syntax or in API. In its implementation it also borrows strength cleverly: to support interactive programming, Spark only needed to make changes to Scala's shell (by contrast, Microsoft's support for a JavaScript console for interactive MapReduce programming not only had to bridge the conceptual gap between Java and JavaScript, but also required considerable effort in the implementation).

Having said so much in its favor, it must still be pointed out that Spark is not perfect. It has innate limitations: it does not support fine-grained, asynchronous data processing. It also has acquired shortcomings: even with excellent genes, it is after all just getting started, and there is still much room for improvement in performance, stability, and the extensibility of its paradigms.

Computational Paradigms and Abstractions

Spark is, first of all, a coarse-grained data-parallel computing paradigm.

The difference between data parallelism and task parallelism is reflected in the following two aspects.

The subject of computation is a collection of data rather than individual data items. The size of the collection depends on the implementation: for SIMD (single instruction, multiple data) vector instructions it is typically 4 to 64 elements; for GPU SIMT (single instruction, multiple threads) it is typically 32; SPMD (single program, multiple data) can be wider still. Spark deals with big data and therefore uses collections of very coarse granularity, called Resilient Distributed Datasets (RDDs).

All data in the collection passes through the same sequence of operators. This gives good programmability, makes it easy to obtain a high degree of parallelism (one that depends on the scale of the data rather than on the parallelism of the program logic), and maps easily onto underlying parallel or distributed hardware. Traditional array/vector programming languages, SSE/AVX intrinsics, CUDA/OpenCL, and Ct (C/C++ for throughput) all belong to this category. The difference is that Spark's field of view is the whole cluster, not a single node or parallel processor.
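To make the operator-sequence style concrete, here is a minimal, self-contained Scala sketch using ordinary local collections (the sample values are made up); Spark applies the same style to RDDs spanning a whole cluster.

```scala
object DataParallelSketch {
  def main(args: Array[String]): Unit = {
    // A hypothetical dataset; with Spark the same pipeline would run over an RDD.
    val readings = Vector(3.2, 7.8, 1.5, 9.1, 4.4)

    // Every element flows through the same map -> filter -> aggregate sequence,
    // so the available parallelism grows with the data, not with the program logic.
    val total = readings
      .map(r => r * 1.8 + 32)   // apply the same transformation to each element
      .filter(_ > 40.0)         // keep only the elements of interest
      .sum                      // aggregate the survivors

    println(s"total = $total")
  }
}
```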

The data-parallel paradigm determines that Spark cannot perfectly support fine-grained, asynchronous update operations. Graph computing has such operations, so here Spark is not as good as GraphLab (a large-scale graph computing framework); nor is it as good as RAMCloud (Stanford's in-memory storage and computing research project) or Percolator (Google's incremental computing system) for applications that require fine-grained log updates and data checkpointing. This, in turn, lets Spark hold firmly to the application areas it specializes in, whereas Dryad (Microsoft's early big data platform), which tried to cover everything, was not successful.

Spark's RDDs adopt the programming style of Scala's collection types, and they also adopt functional semantics: first, closures; second, RDDs are immutable. Logically, each RDD operator generates a new RDD with no side effects, so operators are deterministic; and since all operators are idempotent, when an error occurs the operator sequence only needs to be replayed.
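A minimal sketch of this style against the Spark RDD API (run in local mode; the dataset and threshold below are made up for illustration). Each operator returns a new RDD, the input RDD is never modified, and the closures handed to the operators are what get shipped to the workers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("rdd-sketch"))

    val threshold = 2                         // captured by the closure below
    val numbers   = sc.parallelize(1 to 10)   // an immutable, coarse-grained RDD

    val doubled  = numbers.map(_ * 2)             // a new RDD; `numbers` is untouched
    val filtered = doubled.filter(_ > threshold)  // another new RDD, side-effect free

    // Because the operators are deterministic, a lost partition of `filtered`
    // can simply be recomputed by replaying map and filter on its input split.
    println(filtered.collect().mkString(", "))

    sc.stop()
  }
}
```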

Spark's computational abstraction is data flow, but a data flow with a working set. Stream processing is a data flow model, and so is MapReduce; the difference is that MapReduce needs to maintain the working set across multiple iterations. The working-set abstraction is very common, as in multi-iteration machine learning, interactive data mining, and graph computing. To guarantee fault tolerance, MapReduce uses stable storage (such as HDFS) to hold the working set, at the cost of being slow. HaLoop uses a loop-aware scheduler to ensure that the reduce output of the previous iteration and the map input dataset of the current iteration land on the same physical machine, which reduces network overhead but cannot avoid the disk I/O bottleneck.

Spark's breakthrough is to use memory to hold the working set while still guaranteeing fault tolerance. Memory access is orders of magnitude faster than disk, which can greatly improve performance. The key is fault tolerance, traditionally achieved in two ways: logging and checkpointing. Since checkpointing carries the overhead of data redundancy and network communication, Spark chooses to log data updates. Fine-grained log updates are not cheap, and, as noted earlier, Spark is not good at them; instead, Spark logs coarse-grained RDD updates, so the overhead is negligible. Given Spark's functional semantics and idempotent operators, achieving fault tolerance by replaying the logged updates has no side effects.
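Below is a hedged sketch of "a data flow with a working set" using the Spark RDD API: the dataset reused across iterations is pinned in memory with cache(), while the gradient-descent-style loop, the sample points, and the learning rate are purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WorkingSetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("working-set"))

    // Hypothetical (x, y) points: the working set reused in every iteration.
    // cache() keeps it in memory instead of re-reading it from stable storage.
    val points = sc.parallelize(Seq((1.0, 1.0), (2.0, 3.0), (3.0, 5.0))).cache()
    val n      = points.count()

    var w = 0.0                               // model parameter updated each pass
    for (_ <- 1 to 10) {
      // Each iteration re-scans the cached working set; if a partition is lost,
      // it is rebuilt by replaying the coarse-grained lineage (parallelize -> map).
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= 0.1 * gradient
    }

    println(s"w = $w")
    sc.stop()
  }
}
```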
