What is Spark?

Source: Internet
Author: User
Tags: hadoop, mapreduce

What is Spark

Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. Spark is very small: it was developed by a team led by Matei Zaharia at the AMP Lab at the University of California, Berkeley. It is written in Scala, and the core of the project consists of only about 63 Scala files, short and concise.
Spark is an open-source cluster computing environment similar to Hadoop, but there are differences between the two that give Spark an advantage for some workloads. In particular, Spark provides in-memory distributed datasets, so in addition to supporting interactive queries it can also optimize iterative workloads.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is actually complementary to Hadoop and can run in parallel on top of the Hadoop file system, a combination supported by the third-party cluster framework Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.
Spark Cluster Computing Architecture
While there are similarities between Spark and Hadoop, Spark provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of workload in cluster computing: workloads that reuse a working data set across parallel operations, such as machine learning algorithms. To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which data sets are cached in memory to shorten access latency.
Spark also introduces an abstraction called the Resilient Distributed Dataset (RDD). An RDD is a collection of read-only objects distributed across a set of nodes. These collections are resilient: if part of the data set is lost, it can be rebuilt. Rebuilding a lost partition relies on a fault-tolerance mechanism that maintains "lineage", that is, the information needed to recompute the lost part from the process that derived the data. An RDD is represented as a Scala object, and it can be created from a file, by parallelizing a collection (spreading slices across nodes), by transforming another RDD, or by changing the persistence of an existing RDD, for example requesting that it be cached in memory. The sketch below illustrates these creation paths.
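As a concrete illustration, here is a minimal Scala sketch of the RDD creation paths just described. It assumes a local Spark installation; the application name, input path, and slice count are placeholders rather than anything prescribed by Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local driver; app name and master URL are illustrative placeholders.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Create an RDD from a file (placeholder path).
    val lines = sc.textFile("data/input.txt")

    // 2. Create an RDD by parallelizing a local collection into 4 slices.
    val numbers = sc.parallelize(1 to 1000, 4)

    // 3. Derive a new RDD from an existing one via a transformation.
    val squares = numbers.map(n => n * n)

    // 4. Change persistence: ask Spark to cache the RDD in memory.
    squares.cache()

    println(squares.count()) // first action computes and caches the partitions
    println(squares.sum())   // second action reuses the in-memory cache

    sc.stop()
  }
}
```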
Applications in Spark are called drivers, and these drivers carry out operations performed on a single node or in parallel across a set of nodes. Like Hadoop, Spark supports single-node and multi-node clusters. For multi-node operation, Spark relies on the Mesos cluster manager. Mesos provides an effective platform for resource sharing and isolation among distributed applications; this setup allows Spark to coexist with Hadoop in a shared pool of nodes.

Spark is an open-source, general-purpose parallel computing framework in the Hadoop MapReduce family, developed at the UC Berkeley AMP Lab. Spark's distributed computing is based on the map/reduce model and retains the benefits of Hadoop MapReduce. Unlike MapReduce, however, intermediate job output and results can be kept in memory, eliminating the need to read and write HDFS between stages, so Spark is better suited to iterative algorithms such as those used in data mining and machine learning.

Spark vs. Hadoop

Spark keeps intermediate data in memory, which makes iterative operations more efficient.

Spark is better suited than Hadoop to machine learning (ML) and data mining (DM) workloads, because these workloads are iterative and Spark's RDD abstraction makes iteration cheap.

Spark is more general-purpose than Hadoop

Spark offers many types of data set operations, unlike Hadoop, which provides only the two operations map and reduce. Spark's transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and many more. It also provides various actions such as count, collect, reduce, lookup, save, and others; a sketch of both kinds of operation follows.
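A minimal sketch of chaining transformations and actions, assuming a SparkContext bound to `sc` and using placeholder file paths:

```scala
// Transformations are lazy: nothing executes until an action is invoked.
val logs = sc.textFile("logs.txt")                      // placeholder input path

val errorCounts = logs
  .filter(line => line.contains("ERROR"))               // keep only error lines
  .flatMap(line => line.split("\\s+"))                  // split lines into words
  .map(word => (word, 1))                               // pair each word with a count of 1
  .reduceByKey(_ + _)                                    // sum the counts per word

// Actions trigger the actual computation.
val distinctWords = errorCounts.count()                  // number of distinct words
val results       = errorCounts.collect()                // bring the results to the driver
errorCounts.saveAsTextFile("error-word-counts")          // placeholder output path
```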

These varied data set operations make it convenient for users to develop upper-layer applications. Communication between processing nodes is no longer limited to the single data-shuffle pattern of Hadoop. Users can name and materialize intermediate results and control how they are stored and partitioned, as sketched below. In short, the programming model is more flexible than Hadoop's.
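For example, under the same assumptions as the previous sketch, the intermediate result can be explicitly named, repartitioned, and persisted at a storage level of the user's choice:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Control how the intermediate result is partitioned across the cluster.
val partitioned = errorCounts.partitionBy(new HashPartitioner(8))

// Name the RDD so it is easy to identify (for example in the web UI).
partitioned.setName("error-word-counts-partitioned")

// Control how it is materialized: in memory, on disk, or both.
partitioned.persist(StorageLevel.MEMORY_AND_DISK)
```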

However, because of the characteristics of RDDs, Spark is not suitable for applications that make asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer; the model simply does not fit applications built around incremental modification.

Fault tolerance

In distributed data set computing, fault tolerance is achieved through checkpointing, and there are two approaches: checkpointing the data itself, or logging the updates that produced it (lineage). Users can control which method is used; both are sketched below.
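A brief Scala sketch of the two approaches, assuming a SparkContext `sc`; the HDFS paths are placeholders:

```scala
// (a) Logging the updates: by default Spark records the lineage of
// transformations and replays it to rebuild any lost partitions.
val base    = sc.textFile("hdfs:///data/events")        // placeholder path
val derived = base.map(_.toUpperCase).filter(_.nonEmpty)

// (b) Checkpointing the data: materialize the RDD to reliable storage so
// recovery does not have to replay the whole lineage.
sc.setCheckpointDir("hdfs:///checkpoints")               // placeholder directory
derived.checkpoint()
derived.count()  // an action forces the checkpoint to actually be written
```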

Usability

Spark improves usability by providing rich Scala, Java, and Python APIs, as well as an interactive shell.
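For instance, in the interactive shell (started with ./bin/spark-shell) a SparkContext is already bound to the name `sc`, so the API can be explored line by line; the numbers below are just an illustration:

```scala
// Entered at the spark-shell prompt.
val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0)
evens.count()   // returns 50
```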

The combination of Spark and Hadoop

Spark can read and write data directly on HDFS and also supports Spark on YARN. Spark can run in the same cluster as MapReduce, sharing storage and compute resources, and the Shark data warehouse implementation borrows from Hive and is almost completely compatible with it. A sketch of the HDFS integration follows.
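A minimal sketch of that integration, with a placeholder NameNode host and paths; when submitted with spark-submit --master yarn, the same code runs inside a YARN-managed Hadoop cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("hdfs-roundtrip")  // master is supplied by spark-submit
val sc   = new SparkContext(conf)

// Read from HDFS, transform, and write the result back to HDFS.
val raw     = sc.textFile("hdfs://namenode:8020/warehouse/raw/")
val cleaned = raw.map(_.trim).filter(_.nonEmpty)
cleaned.saveAsTextFile("hdfs://namenode:8020/warehouse/cleaned/")

sc.stop()
```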

Spark's application scenarios

Spark is a memory-based iterative computing framework, suited to applications that operate on a particular data set many times. The more often the data set is reused and the larger it is, the greater the benefit; with small data sets or light computation the benefit is relatively small. (This is an important factor to weigh when deciding whether to use Spark in a large data architecture.) The sketch below shows the reuse pattern that benefits most.
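A sketch of that pattern, assuming a SparkContext `sc`; the input path, parsing logic, and iteration count are placeholders:

```scala
// Read and parse the data set once, then keep it cached in memory.
val points = sc.textFile("hdfs:///data/points")
  .map(_.split(",").map(_.toDouble))
  .cache()

// Each iteration rescans the cached data instead of re-reading HDFS,
// so the more iterations there are, the bigger the payoff.
var threshold = 0.0
for (i <- 1 to 10) {
  val above = points.filter(p => p.sum > threshold).count()
  println(s"iteration $i: $above points above threshold $threshold")
  threshold += 1.0
}
```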


In general, Spark is applicable to a broad range of common workloads.

Operating modes

Local mode

Standalone mode

Mesos mode

YARN mode
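In code, the mode listed above is chosen through the master URL passed to the configuration (or to spark-submit); the host names and ports below are placeholders, and the exact YARN value depends on the Spark version:

```scala
import org.apache.spark.SparkConf

val local      = new SparkConf().setMaster("local[*]")                 // local mode, all cores
val standalone = new SparkConf().setMaster("spark://master-host:7077") // standalone cluster
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Mesos cluster
val yarn       = new SparkConf().setMaster("yarn")                     // YARN; older releases use yarn-client/yarn-cluster
```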

Spark Ecosystem

Shark (Hive on Spark): Shark provides essentially the same HiveQL command interface as Hive on top of the Spark framework. For maximum compatibility with Hive, Shark uses the Hive API for query parsing and logical plan generation, while the final physical plan execution phase uses Spark instead of Hadoop MapReduce. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory, enabling data reuse and speeding up retrieval of particular data sets. Shark also uses UDFs (user-defined functions) to implement specific data analysis and learning algorithms, combining SQL querying with analytic computation to maximize RDD reuse.

Spark Streaming: a framework for processing stream data built on Spark. The basic idea is to split the stream into small time slices (a few seconds) and process each small batch of data in a way similar to batch processing. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine (on the order of 100 ms) can be used for near-real-time computation, and on the other hand the RDD data set makes efficient fault-tolerant processing easier than in record-at-a-time frameworks such as Storm. In addition, the micro-batch approach makes it compatible with both batch and real-time processing logic and algorithms, which helps applications that need joint analysis of historical and real-time data. A minimal sketch follows.
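In this sketch, 2-second micro-batches of text from a socket are word-counted with the same RDD-style operations; the host, port, and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))        // 2-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)     // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // print a sample of each batch's counts

ssc.start()             // start receiving and processing micro-batches
ssc.awaitTermination()
```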

Bagel (Pregel on Spark): a graph computation library built on Spark; a small but very useful project. Bagel ships with an example that implements Google's PageRank algorithm.
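For flavor, here is a minimal PageRank sketch written with plain RDD operations rather than the Bagel API, assuming a SparkContext `sc` and a hard-coded toy edge list:

```scala
// Adjacency list: page -> pages it links to (toy data).
val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")),
  ("b", Seq("c")),
  ("c", Seq("a"))
)).cache()

// Start every page with rank 1.0, then iterate.
var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  val contributions = links.join(ranks).values.flatMap {
    case (neighbors, rank) => neighbors.map(n => (n, rank / neighbors.size))
  }
  ranks = contributions.reduceByKey(_ + _).mapValues(c => 0.15 + 0.85 * c)
}

ranks.collect().foreach { case (page, rank) => println(s"$page: $rank") }
```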

What exactly are Hadoop, HBase, Storm, and Spark? Hadoop = HDFS + Hive + Pig + ...

HDFS: the storage system
MapReduce: the computing system
Hive: MapReduce for SQL developers (via HiveQL); a Hadoop-based data warehouse framework
Pig: a Hadoop-based development language
HBase: a NoSQL database
Flume: a framework for collecting and processing data for Hadoop
Oozie: a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive
Ambari: a web-based tool set for deploying, managing, and monitoring Hadoop clusters
Avro: a data serialization system that allows schemas to be encoded for Hadoop files
Mahout: a data mining library that contains some of the most popular data mining algorithms, implemented with the MapReduce model
Sqoop: a connectivity tool for moving data between Hadoop and non-Hadoop data stores (such as relational databases and data warehouses)
HCatalog: a centralized metadata management and sharing service for Apache Hadoop; it provides a unified view of all data in a Hadoop cluster and lets different tools, including Pig and Hive, process any data element without having to know where it is physically stored in the cluster

Bigtop: an effort to create a more formal process and framework for packaging and interoperability testing of Hadoop's sub-projects and related components, in order to improve the Hadoop platform as a whole.

Apache Storm: a distributed real-time computation system; Storm is a task-parallel continuous computation engine. Storm itself does not typically run on a Hadoop cluster; it uses Apache ZooKeeper together with its own master/worker processes to coordinate topologies, hosts, and worker state and to guarantee message semantics. Even so, Storm can still consume data from HDFS files and write data to HDFS.

Apache Spark: a fast, general-purpose engine for large-scale data processing; Spark is a parallel, general-purpose batch processing engine. Workflows are defined in a style similar to, and reminiscent of, MapReduce, but are far more expressive than traditional Hadoop MapReduce. Apache Spark also has a streaming API, which allows continuous processing through short-interval micro-batches. Apache Spark itself does not require Hadoop to operate; however, its data-parallel model needs a shared file system to hold stable data, which can be S3, NFS, or, more typically, HDFS. Hadoop YARN is not required to execute a Spark application, since Spark has its own standalone master/worker processes, but it is common to run Spark applications inside YARN containers. In addition, Spark can also run on Mesos clusters.


Reference article: http://www.d1net.com/bigdata/news/316561.html


Copyright notice: Reprinting is welcome; please include the original address when you reprint. Thank you.
