Introduction to Spark Principles

Source: Internet
Author: User
Tags: hadoop, mapreduce



1. Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. Machines running Spark should therefore have as much memory as possible, e.g. 96 GB or more.
2. All operations in Spark are based on RDDs, and operations fall into two broad categories: transformations and actions.
3. Spark provides an interactive interface, similar in use to a shell.
4. Spark can optimize iterative workloads because intermediate data is kept in memory.
5. Spark is implemented in Scala. It can be used interactively from Scala or Python, and applications can be written in Scala, Python, or Java.
6. Spark can run over HDFS via Mesos, but Hadoop 2.x provides YARN, which makes it easier for Spark to run against HDFS; YARN also provides memory and CPU cluster-management functions.
7. Spark offers many types of data set operations, unlike Hadoop, which provides only map and reduce. Transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and many more; actions include count, collect, reduce, lookup, save, and others (see the sketch after this list). These varied operations are convenient for upper-layer applications. The communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results, control their partitioning, and so on. The programming model is therefore more flexible than Hadoop's.
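To make the transformation/action split concrete, here is a minimal word-count-style sketch in Scala; the app name, master URL, and sample data are all illustrative. Transformations only build up a lineage graph, and nothing executes until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsVsActions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("demo").setMaster("local[*]"))

    val lines = sc.parallelize(
      Seq("spark is fast", "hadoop is batch", "spark is flexible"))

    // Transformations: lazy, each call just returns a new RDD.
    val counts = lines
      .flatMap(_.split(" "))     // split lines into words
      .map(word => (word, 1))    // pair each word with a count of 1
      .reduceByKey(_ + _)        // sum the counts per word

    // Actions: these force the pipeline above to actually execute.
    println(counts.count())            // number of distinct words
    counts.collect().foreach(println)  // bring the results to the driver

    sc.stop()
  }
}
```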


==========================================================

The following is excerpted from: http://www.itpub.net/thread-1864721-3-1.html

1. What are the similarities and differences between Spark vs. Hadoop?

Hadoop: distributed batch computing, with an emphasis on batch processing; often used for data mining and analysis.

Spark: an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. Spark is a cluster computing environment similar to Hadoop, but there are differences between the two, and those differences give Spark a performance advantage on certain workloads. In other words, in addition to providing interactive queries, Spark enables in-memory distributed data sets that optimize iterative workloads.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which lets you manipulate distributed data sets as easily as local collection objects, as the sketch below illustrates.
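As a rough illustration of that point, the same combinators work on a local Scala collection and on an RDD; the helper function below is hypothetical:

```scala
import org.apache.spark.SparkContext

// Compare a local collection pipeline with the equivalent RDD pipeline.
// The logic is identical; only the data substrate (local list vs.
// distributed data set) differs.
def compare(sc: SparkContext): Unit = {
  val local       = List(1, 2, 3, 4, 5)
  val distributed = sc.parallelize(local)

  val localSum = local.map(_ * 2).filter(_ > 4).sum
  val rddSum   = distributed.map(_ * 2).filter(_ > 4).sum()

  println(localSum == rddSum.toInt) // true: same logic on both
}
```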

Although Spark was created to support iterative jobs on distributed data sets, it is actually complementary to Hadoop and can run alongside it on the Hadoop file system, a combination supported by the third-party cluster framework Mesos. Developed at the University of California, Berkeley's AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.

While there are similarities between Spark and Hadoop, Spark constitutes a new cluster computing framework with useful differences. First, Spark is designed for a specific type of cluster-computing workload: workloads that reuse a working data set across parallel operations, such as machine learning algorithms. To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which data sets are cached in memory to shorten access latency.
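A minimal sketch of that idea, assuming a placeholder HDFS path: cache() pins the parsed working set in cluster memory, so the second and later parallel operations read it from memory rather than re-reading the file.

```scala
import org.apache.spark.SparkContext

def reuseWorkingSet(sc: SparkContext): Unit = {
  val points = sc.textFile("hdfs:///data/points.txt") // placeholder path
    .map(_.split(",").map(_.toDouble))
    .cache() // keep the parsed working set in cluster memory

  // Both jobs below reuse the in-memory copy instead of re-reading HDFS.
  println(points.count())
  println(points.filter(_.head > 0.5).count())
}
```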

For big data processing, Hadoop is the familiar choice: building on Google's MapReduce, it gives developers the map and reduce primitives, which make parallel batch programs simple and elegant. Spark, by contrast, offers many more types of data set operations: transformations such as map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy, plus actions such as count, collect, reduce, lookup, and save. These varied operations are convenient for upper-layer applications. The communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results, control their partitioning, and so on. The programming model is thus more flexible than Hadoop's.



2. Is Spark more advantageous in terms of fault tolerance than other tools?

Spark's paper, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," does not give a strong sense of how well its fault tolerance works. It mentions two ways to make distributed data set computation recoverable: checkpointing the data, or logging the updates. Spark chose the latter, which saves storage space. However, because the processing model is a DAG-like sequence of operations, a node error in the graph may, through the dependency complexity of the lineage chains, force recomputation across all compute nodes, so the cost is not low. The authors then say that whether to save the data, log the updates, or checkpoint is left to the user to decide, which amounts to saying nothing and kicking the ball to the user. My view is that, depending on the type of business, one should weigh the cost of storage I/O and disk space against the price of recomputation and choose the cheaper strategy. Rather than persisting intermediate results or establishing checkpoints, Spark remembers the sequence of operations that produced each data set; when a node fails, it reconstructs the lost data sets from that stored information, and the authors consider this acceptable because the other nodes help with the rebuild.
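A small sketch of that trade-off, with a placeholder checkpoint directory: by default Spark only records the lineage (toDebugString shows it), and a user who judges the recomputation cost too high can cut the chain with an explicit checkpoint.

```scala
import org.apache.spark.SparkContext

def lineageVsCheckpoint(sc: SparkContext): Unit = {
  sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder directory

  var rdd = sc.parallelize(1 to 1000)
  for (_ <- 1 to 50) rdd = rdd.map(_ + 1) // a long lineage chain builds up

  println(rdd.toDebugString) // the recorded "log of updates" (lineage)

  rdd.checkpoint() // the user's call: materialize the data, cut the lineage
  rdd.count()      // the action triggers both computation and the checkpoint
}
```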


3. What are the characteristics of Spark's data processing capability and efficiency?

Spark provides high performance and large-scale data processing capability, so users quickly get a responsive experience. Another class of application is data mining: because Spark makes full use of memory for caching and uses DAGs to eliminate unnecessary steps, it is well suited to iterative computation, and since a considerable number of machine learning algorithms are iterative convergence algorithms, they are a natural fit for Spark. We have parallelized some common algorithms with Spark so they can be called easily from the R language, reducing the learning cost of data mining for users.
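A hedged sketch of the iterative pattern (the update rule below is a toy stand-in for a real machine learning algorithm): the training data is cached once, and each pass of the loop rescans it from memory rather than from disk.

```scala
import org.apache.spark.SparkContext

def iterativeFit(sc: SparkContext): Unit = {
  val data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

  var estimate = 0.0
  for (_ <- 1 to 10) {
    val current = estimate // capture a stable value for the closure
    // Each iteration scans the cached data set; only `estimate` changes.
    val gradient = data.map(x => x - current).mean()
    estimate += 0.5 * gradient
  }
  println(s"estimate after 10 iterations: $estimate")
}
```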

Spark also has a streaming data processing model, and compared with Twitter's Storm framework, it takes an interesting and unique approach. Storm is essentially a pipeline into which independent transactions are placed and processed in a distributed fashion. Spark instead collects events and processes them in batches over short time intervals (say, every 5 seconds). The data collected in each interval becomes an RDD of its own and is then processed with the ordinary set of Spark operators. The authors claim this model is more robust under slow nodes and failures, and that a 5-second interval is fast enough for most applications. It also unifies the streaming and non-streaming parts nicely.
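A minimal Spark Streaming sketch of that micro-batch model, using the 5-second interval from the text; the socket host and port are placeholders. Each batch of received lines becomes an RDD and is processed with the same operators a batch job would use.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // Every 5 seconds, the lines received become one RDD in this DStream.
    val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // the same operators as batch code, applied per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```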



As big data technologies and the industry mature, a single organization often needs several types of big data analysis jobs: the batch computing that traditional Hadoop MapReduce excels at, the iterative computing typified by machine learning algorithms, stream computing, the graph computation used in social networks, SQL relational queries, interactive ad hoc queries, and more. Before Spark appeared, completing all of these analysis tasks within one organization meant operating multiple independent systems, which on the one hand introduced operations and maintenance complexity, and on the other made frequent, expensive data transfers between the systems unavoidable.

Spark is a cluster computing platform originating from the University of California, Berkeley's AMPLab. Based on in-memory computing, with performance exceeding Hadoop's, it is a rare all-rounder: starting from multi-iteration batch processing, it also covers data warehousing, stream processing, and graph computing. Spark is now a top-level Apache Foundation open-source project with huge community support (more active developers than Hadoop MapReduce), and the technology is maturing.



1. Because Spark enables in-memory distributed data sets, it makes full use of distributed-memory techniques, keeping its computing efficiency at least on par with Hadoop. It is written in the Scala language, and with the release of Hadoop 2.0, Spark can also run directly on YARN.
2. Fault tolerance: Spark introduces the resilient distributed dataset (RDD), a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a data set is lost, it can be rebuilt. Rebuilding a lost portion relies on a fault-tolerance mechanism that maintains "lineage" (information that allows the lost part to be reconstructed from the process by which the data was derived).
3. Obviously, the efficiency of in-memory computation is much higher than that of Hadoop, with its large number of disk I/O operations (a persist sketch follows this list).
4. It is like a compact booklet: you can master as much content as possible in the shortest time without it feeling too tiring.
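As a small illustration of point 3, persist() with an explicit storage level controls the memory-versus-disk trade-off; the data here is a toy stand-in.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def persistDemo(sc: SparkContext): Unit = {
  val derived = sc.parallelize(1 to 100).map(_ * 2)

  // Memory first; partitions that do not fit spill to disk instead of
  // being recomputed from scratch on every use.
  derived.persist(StorageLevel.MEMORY_AND_DISK)

  println(derived.count()) // first action materializes and caches the data
  println(derived.sum())   // later actions read from the persisted copy
}
```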



Our company now handles data primarily on Hadoop, but we have also built a 10-machine Spark cluster.
Hadoop can run on relatively inexpensive PCs, but Spark should use machines with a higher memory configuration; we use 64 GB of memory.
Material online says to use 96 GB or more if possible, but we don't have machines that good to test with.
Compared with Hadoop, we think the Spark setup we used has several advantages:
First, Spark is based on in-memory computing, and the speed advantage is obvious: our 10-machine Spark cluster can be as fast as our 50-machine Hadoop cluster, although the Hadoop cluster's machines have only 8 GB or 16 GB of memory each.
Second, Spark is based on Scala while Hadoop is based on Java, and Spark is better suited to data mining, since Scala lends itself to machine-learning and mining work.
Third, Hadoop's programming model for processing data is rigid, offering only map and reduce, while Spark's programming model is more flexible.
It is said that Spark's algorithms are more powerful than Hadoop's; we don't know how to judge that, but it really is much faster at processing data.



1. What are the similarities and differences between Spark vs. Hadoop?
Spark is a distributed computing framework based on the MapReduce model and retains the advantages of Hadoop MapReduce; but unlike MapReduce, intermediate job output and results can be kept in memory, eliminating the need to read and write HDFS. Spark is therefore a good fit for MapReduce-style algorithms that require iteration, such as data mining and machine learning.
2. Is Spark more advantageous in terms of fault tolerance than other tools?
Existing data-flow systems handle two kinds of applications inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping the data in memory can greatly improve performance. To implement fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: an RDD is read-only and can only be created through bulk operations on other RDDs. Nonetheless, RDDs are still expressive enough to cover many classes of computation, including MapReduce and specialized iterative programming models (such as Pregel). Spark's RDD implementation is 20x faster than Hadoop for iterative computation and can interactively query a 1 TB data set with a latency of 5-7 seconds.
3. What are the characteristics of Spark's data processing capability and efficiency?
Compared with Hadoop, the results are as follows:
(1) For iterative machine learning applications, Spark is 20x faster than Hadoop. This speedup comes from storing the data in memory as cached Java objects, which avoids deserialization.
(2) User-written applications perform well; for example, Spark ran an analytics report 40x faster than Hadoop.
(3) If a node fails, Spark can recover quickly by rebuilding only the lost RDD partitions.
(4) Spark can interactively query a 1 TB data set with latencies in the 5-7 second range.



1. What are the similarities and differences between Spark vs. Hadoop?
As a general-purpose parallel processing framework, Spark has advantages similar to Hadoop's, but its better memory management makes it more efficient than Hadoop for iterative computation. Spark also provides a wider range of data set operation types, which greatly facilitates development, and its use of checkpoints gives it strong fault tolerance. With performance superior to Hadoop's in many respects and a wider range of applications, Spark's further development is worth looking forward to.

2. Is Spark more advantageous in terms of fault tolerance than other tools?
In distributed data set computation, fault tolerance is achieved through checkpointing, and there are two ways to checkpoint: saving the data itself, or logging the updates. Users can control which method is used.

3. What are the characteristics of Spark's data processing capability and efficiency?
Because Spark processes data in memory, it is very fast. Spark Streaming greatly improves the power and stability of Spark's stream processing, enabling users to use the same code for both big-data stream processing and batch processing.
