What is Spark?

Spark is an open-source cluster computing system based on in-memory computing, with the goal of making data analysis faster. Spark is small and lean: it was developed by a small team led by Matei Zaharia at UC Berkeley's AMP Lab, it is written in Scala, and the project's core code consists of only 63 Scala files, all short and concise.
Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in some useful ways that make Spark superior for certain workloads: Spark enables in-memory distributed datasets, which optimize iterative workloads in addition to supporting interactive queries.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which lets Scala manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it actually complements Hadoop and can run in parallel on the Hadoop file system; this is supported through a third-party cluster framework named Mesos. Spark was developed by UC Berkeley's AMP Lab (Algorithms, Machines, and People Lab) to build large-scale, low-latency data analysis applications.
Spark Cluster Computing Architecture
Although Spark is similar to Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of cluster-computing workload: workloads that reuse a working dataset across parallel operations (such as machine learning algorithms). To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory across the cluster to shorten access latency.
Spark also introduces an abstraction called the resilient distributed dataset (RDD). An RDD is a read-only collection of objects distributed across a group of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt. Rebuilding a lost portion relies on the fault-tolerance mechanism, which maintains the dataset's "lineage" (that is, the information needed to rebuild that portion from the process by which the data was derived). An RDD is represented as a Scala object and can be created in several ways: from a file; as a parallelized slice of a collection (distributed across nodes); as a transformation of another RDD; or by changing the persistence of an existing RDD, for example by requesting that it be cached in memory.
Applications in Spark are called drivers, and they perform operations either on a single node or in parallel across a group of nodes. Like Hadoop, Spark supports single-node and multi-node clusters. For multi-node operation, Spark relies on the Mesos cluster manager, which provides an effective platform for resource sharing and isolation among distributed applications; this setup allows Spark and Hadoop to coexist in a shared pool of nodes.
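
As an illustration only (not from the original article), the following minimal Scala sketch shows the four ways of obtaining an RDD described above. It assumes a SparkContext named sc is already available, as in the spark-shell, and the HDFS path is a placeholder:

    // 1. Create an RDD from a file (placeholder path)
    val fromFile = sc.textFile("hdfs:///data/input.txt")

    // 2. Parallelize a local collection into slices distributed across nodes
    val fromCollection = sc.parallelize(1 to 1000, 4)

    // 3. Transform an existing RDD into a new one
    val transformed = fromFile.filter(line => line.nonEmpty)

    // 4. Change the persistence of an existing RDD, e.g. ask for it to be cached in memory
    val cached = transformed.cache()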

Spark is a general-purpose parallel computing framework, open-sourced by the UC Berkeley AMP Lab, in the style of Hadoop MapReduce. Spark implements distributed computing based on the map-reduce model and retains the advantages of Hadoop MapReduce; unlike MapReduce, however, intermediate job output and results can be kept in memory, so they no longer need to be read from and written to HDFS. Spark is therefore better suited to map-reduce algorithms that require iteration, such as data mining and machine learning.

Comparison between Spark and Hadoop

Spark's intermediate data is kept in memory, which makes iterative operations more efficient.

Spark is better suited to machine learning and data mining workloads that involve many iterative operations. Spark also provides the abstract concept of the RDD.

Spark is more general than Hadoop

Spark provides many types of dataset operations, unlike Hadoop, which offers only Map and Reduce. Examples include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy; Spark calls these operations Transformations. It also provides multiple Actions, such as count, collect, reduce, lookup, and save.
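
As a hedged illustration (again assuming an existing SparkContext sc and placeholder paths), the word-count sketch below shows how transformations such as flatMap, map, and reduceByKey lazily describe a computation, while actions such as count, take, and saveAsTextFile trigger its execution:

    val lines = sc.textFile("hdfs:///data/docs.txt")           // placeholder input path

    // Transformations: build up the computation lazily
    val counts = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Actions: trigger execution and return or store results
    val distinctWords = counts.count()                          // number of distinct words
    val topTen = counts.sortBy(-_._2).take(10)                  // ten most frequent words
    counts.saveAsTextFile("hdfs:///data/word-counts")           // placeholder output path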

These diverse dataset operation types make it easier to develop upper-layer applications. Communication between processing nodes is no longer limited to the single data-shuffle pattern of Hadoop: users can name, materialize, and control the storage and partitioning of intermediate results. The programming model is therefore more flexible than Hadoop's.

However, because of the nature of RDDs, Spark is not well suited to applications that perform asynchronous, fine-grained updates of state, such as the storage layer of a web service or an incremental web crawler and indexer; that is, it does not fit the incremental-modification application model.

Fault Tolerance

In distributed dataset computing, fault tolerance is implemented through checkpointing, which can be done in two ways: checkpointing the data itself, or logging the updates. Users can control which method is used to achieve fault tolerance.
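
As a rough sketch of the two approaches (the dataset and checkpoint directory are hypothetical): by default a lost partition is recomputed from the logged lineage of transformations, while calling checkpoint materializes the data itself to reliable storage so the lineage can be truncated:

    // Logging the updates (the default): if a partition of `counts` is lost,
    // Spark recomputes it from `lines` by replaying the recorded transformations.
    val lines  = sc.textFile("hdfs:///data/docs.txt")
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Checkpointing the data: write the dataset itself to reliable storage,
    // so recovery does not need to replay the whole lineage.
    sc.setCheckpointDir("hdfs:///checkpoints")   // hypothetical directory
    counts.checkpoint()
    counts.count()                               // an action forces the checkpoint to be written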

Usability

Spark improves usability by providing rich Scala, Java, and Python APIs as well as an interactive shell.
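
For example, a short hypothetical session in the interactive Scala shell (spark-shell), where the SparkContext sc is created automatically:

    scala> val numbers = sc.parallelize(1 to 100)      // distribute a local collection
    scala> numbers.filter(_ % 2 == 0).count()          // count the even values
    res0: Long = 50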

Integration of Spark and Hadoop

Spark can read and write HDFS data directly, and it also supports Spark on YARN. Spark and MapReduce can run in the same cluster and share storage and compute resources. The data warehouse Shark is built with Hive on Spark and is almost completely compatible with Hive.

Applicable scenarios of Spark

Spark is a memory-based iterative computing framework suited to applications that perform repeated operations on a particular dataset. The more data that must be read repeatedly, the greater the benefit; where the data volume is small but the computation is intensive, the benefit is relatively small (an important factor to weigh when considering Spark in a large data architecture).
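
A minimal sketch of that pattern (the input path, data layout, and learning rate are hypothetical): the working dataset is loaded and cached once, and each of the following iterations reads it from memory rather than from HDFS:

    // Load the working set once and keep it in memory (lines like "x,y")
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => line.split(",").map(_.toDouble))
      .cache()
    val n = points.count()

    // Each pass of this simple one-dimensional gradient descent
    // rereads `points` from memory instead of from HDFS.
    var w = 0.0
    for (i <- 1 to 10) {
      val gradient = points.map(p => (p(0) * w - p(1)) * p(0)).reduce(_ + _) / n
      w -= 0.1 * gradient
    }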

In general, Spark is broadly applicable and fairly general-purpose.

Running Mode

Local Mode

Standalone Mode

Mesos Mode

YARN Mode

Spark Ecosystem

Shark (Hive on Spark): Shark provides essentially the same HiveQL command interface as Hive, built on top of the Spark framework. To maximize compatibility with Hive, Shark uses the Hive APIs for query parsing and logical plan generation, and in the final physical-plan execution stage it substitutes Spark for Hadoop MapReduce. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory for data reuse, which speeds up retrieval of particular datasets. Shark also supports user-defined functions (UDFs) for specific data analysis and learning algorithms, so SQL data querying and analytical processing can be combined to maximize the reuse of RDDs.

Spark Streaming: a framework for processing stream data on Spark. The basic idea is to divide the stream into small time segments (a few seconds each) and process each small segment of data in a batch-like fashion. Spark Streaming builds on Spark: on the one hand, Spark's low-latency execution engine (100 ms and up) can be used for real-time computation; on the other hand, compared with record-at-a-time processing frameworks such as Storm, the RDD-based model makes efficient fault-tolerant processing easier to implement. The small-batch approach also keeps it compatible with the logic and algorithms used for batch processing, which helps in application scenarios that require joint analysis of historical and real-time data.
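
A hedged Scala sketch of the micro-batch idea, assuming a text source on localhost port 9999 (hypothetical): the stream is cut into one-second batches, and each batch is processed with the same word-count logic used for ordinary RDDs:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))         // one-second micro-batches

    // Each batch of lines becomes an RDD and is processed like ordinary batch data
    val lines  = ssc.socketTextStream("localhost", 9999)      // hypothetical source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming job stops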

Bagel: Pregel on Spark, which can be used for graph computing. This is a very useful small project; Bagel ships with an example that implements Google's PageRank algorithm.

What are Hadoop, HBase, Storm, and Spark? Hadoop = HDFS + Hive + Pig +...

HDFS: Storage System
MapReduce: Computing System
Hive: A Hadoop-based data warehouse framework that exposes MapReduce to SQL developers through HiveQL
Pig: A Hadoop-based data-flow language and execution framework
HBase: A NoSQL database
Flume: A framework for collecting data and loading it into Hadoop
Oozie: A workflow processing system that allows users to define a series of jobs in multiple languages (such as MapReduce, Pig, and Hive)
Ambari: A web-based tool set for deploying, managing, and monitoring Hadoop clusters
Avro: A data serialization system that allows the schema of a Hadoop file to be encoded
Mahout: A data mining library that contains some of the most popular data mining algorithms, implemented using the MapReduce model
Sqoop: A connectivity tool for moving data from non-Hadoop data stores (such as relational databases and data warehouses) into Hadoop
HCatalog: A centralized metadata management and sharing service for Apache Hadoop. It provides a unified view of all data in a Hadoop cluster and lets different tools, including Pig and Hive, process any data element without needing to know where in the cluster the data is physically stored.

BigTop: A project for the packaging and interoperability testing of Hadoop sub-projects and related components, with the aim of improving the Hadoop platform as a whole.

Apache Storm: A distributed real-time computing system. Storm is a task-parallel, continuous computation engine. Storm itself does not typically run on a Hadoop cluster; it uses Apache ZooKeeper together with its own master/worker processes to coordinate topologies, host and worker status, and message-delivery guarantees. That said, Storm can consume data from HDFS files and write data to HDFS.

Apache Spark: A fast, general-purpose engine for large-scale data processing. Spark is a data-parallel, general-purpose data processing engine. Workflows are defined in a style reminiscent of MapReduce, but Spark is considerably more capable than traditional Hadoop MapReduce. Apache Spark has its own streaming API, which allows continuous processing through short-interval micro-batches. Apache Spark does not require Hadoop to operate, but its data-parallel model does need stable data on a shared file system; stable sources can range from S3 and NFS to, more typically, HDFS. Hadoop YARN is not required to run Spark applications, since Spark has its own independent master/worker processes, but running Spark applications in YARN containers is common. In addition, Spark can run on a Mesos cluster.


The above references: http://www.d1net.com/bigdata/news/316561.html


Copyright Disclaimer: you are welcome to reprint it. I hope you can add the original article address while reprinting it. Thank you for your cooperation.
