Introduction to Spark

First. Introduction to the official website
1. What is Spark
Official website address: http://spark.apache.org/

Apache Spark™ is a unified analytics engine for large-scale data processing. Spark is also widely used for AI and machine-learning workloads.

Spark is a fast, general-purpose cluster computing platform: an in-memory parallel computing framework developed by the AMP Lab at the University of California, Berkeley, used to build large-scale, low-latency data analysis applications. It extends the widely used MapReduce computing model to efficiently support more computation patterns, including interactive queries and stream processing. One of Spark's main features is its ability to perform computations in memory, spilling to disk only when complex operations require it; even then, Spark is generally more efficient than MapReduce.

2. Why learn Spark
Intermediate result output: MapReduce-based computing engines usually write intermediate results to disk for storage and fault tolerance. In a task pipeline, when queries are translated into MapReduce jobs, multiple stages are often generated, and these chained stages rely on the underlying file system (such as HDFS) to store the output of each stage.

Spark is an alternative to MapReduce. It is compatible with HDFS and Hive and can be integrated into the Hadoop ecosystem, making up for the deficiencies of MapReduce.

Second, the four characteristics of Spark
1. High efficiency
Running speed can be up to 100 times faster than Hadoop MapReduce for certain workloads.
Apache Spark achieves high performance for both batch and streaming data by using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.


2. Ease of use
Spark offers APIs in Java, Python, and Scala and supports more than 80 high-level operators, allowing users to quickly build different applications. Spark also provides interactive Python and Scala shells, which make it very convenient to try out solutions against a Spark cluster.


3. Versatility
Spark provides a unified solution. Spark can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computing (GraphX), and these different types of processing can be combined seamlessly in the same application. Spark's unified solution is very attractive: every company wants a single platform for the problems it encounters, since that reduces development and maintenance labor costs as well as platform deployment costs.


4. Compatibility
Spark integrates easily with other open-source products. For example, Spark can use Hadoop's YARN or Apache Mesos as its resource manager and scheduler, and it can process all the data sources Hadoop supports, including HDFS, HBase, and Cassandra. This is especially important for users who have already deployed Hadoop clusters, because they can tap Spark's processing power without any data migration. Spark also does not have to depend on a third-party resource manager: it ships with Standalone mode as its built-in resource management and scheduling framework, which further lowers the barrier to adoption and makes Spark easy for everyone to deploy and use. In addition, Spark provides tools for deploying Standalone Spark clusters on EC2.

Mesos: Spark can run on Mesos (a resource scheduling framework similar in role to YARN).

Standalone: Spark allocates resources itself through its built-in master and worker processes.

YARN: Spark can run on YARN.

Kubernetes: Spark can be scheduled by Kubernetes.
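As an illustration of these deployment modes, a `spark-submit` invocation differs mainly in its `--master` URL. The host names, ports, and image name below are placeholders, not real endpoints:

```shell
# Standalone cluster
spark-submit --master spark://master-host:7077 app.py

# YARN, in cluster deploy mode
spark-submit --master yarn --deploy-mode cluster app.py

# Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Kubernetes (container image name is a placeholder)
spark-submit --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark-image app.py

# Local mode for development, using all cores on one machine
spark-submit --master "local[*]" app.py
```

The application code itself does not change across these modes; only the submission configuration does.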


Third. Composition of Spark
Spark composition (BDAS): BDAS is short for the Berkeley Data Analytics Stack, a platform built at Berkeley that integrates algorithms, machines, and people at large scale to make sense of big data. It is also a technical solution for processing big data in cloud-computing and communication scenarios.

Its main components are:

SparkCore: Abstracts distributed data as Resilient Distributed Datasets (RDDs), implements application task scheduling, RPC, serialization, and compression, and provides the APIs on which the upper-level components run.

SparkSQL: Spark SQL is Spark's package for working with structured data. It allows users to query data with SQL statements, and it supports many data sources, including Hive tables, Parquet files, and JSON.

SparkStreaming: The component Spark provides for stream processing of real-time data.

MLlib: Provides an implementation library of commonly used machine learning algorithms.

GraphX: Provides a distributed graph computing framework, which can efficiently perform graph computing.

BlinkDB: Approximate query engine for interactive SQL on massive data.

Tachyon: A memory-centric distributed file system with high fault tolerance (later renamed Alluxio).