Spark Background Introduction
1. What is Spark
On the Apache website there is a very simple sentence, "Spark is a fast and general engine": Spark is a unified computing engine, and the emphasis is on fast. What does it do, specifically? Large-scale processing, that is, the processing of big data.
"Spark is a fast and general engine for large-scale processing" This is very simple, but it highlights some of the features of Spark: The first feature is that spark is a parallel, COMPUTE-intensive compute engine for memory.
Why memory? Spark follows the MapReduce model, but its intermediate data is kept in memory rather than written to HDFS, so it computes in memory, and that is what makes Spark fast. It can also be deployed on a cluster, so work is distributed to the nodes and computed in parallel. In addition, Spark offers many libraries for machine learning and data mining, which users can apply to iterative computation over their data, so it is also a compute-intensive tool.
2. Spark's development history
Now that we know what Spark is, let's take a look at how it has evolved.
Spark was created in 2009 as a research project, became an Apache incubator project in 2013, and became a top-level Apache project in 2014. Spark 2.0 has not yet been formally released; at the moment only a preview version is available.
3. The latest features of Spark 2.0
Spark 2.0 has only just come out, so today we will mainly cover two topics. One is its new features, that is, the latest capabilities it brings; the other is the community, because Spark is an open-source project and the community is essential to its development.
On the feature side, there are two especially important parts in Spark 2.0, one of which is the structured API.
Spark 2.0 unifies DataFrame and Dataset and introduces a new SparkSession. SparkSession provides a single entry point that unifies SQLContext and HiveContext; this is transparent to the user, who no longer needs to decide which context to use or how to create it and can simply work with the SparkSession. The other part is Structured Streaming. In Spark 2.0, streaming and batch are unified, so this too is transparent to the user, who no longer has to distinguish between stream processing and batch processing of data.
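As a rough illustration of the unified entry point, here is a minimal Scala sketch assuming a local deployment; the application name, master setting, and the people.json input file are placeholders, not taken from the original slides:

```scala
import org.apache.spark.sql.SparkSession

// Build the single entry point; no separate SQLContext or HiveContext is needed.
val spark = SparkSession.builder()
  .appName("Spark2Demo")   // placeholder application name
  .master("local[2]")      // placeholder master for local testing
  .getOrCreate()

// DataFrame and SQL work both go through the same session.
val people = spark.read.json("people.json")   // hypothetical input file
people.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()
```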
The features that follow, such as those in MLlib, should be very attractive to data scientists. MLlib can now persist a trained model and import it again later when it is needed. As for R, the original SparkR supported only single-node computation, not distributed computation, so distributed R support is a very powerful feature of Spark 2.0. In addition, Spark 2.0's support for SQL 2003 means that Spark can run essentially all standard SQL statements when it processes structured data.
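A minimal sketch of the model save/load feature, assuming an existing SparkSession named spark; the toy training set and the /tmp/lr-model path are made up for illustration:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors

// A toy training DataFrame with "label" and "features" columns.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

// Train a model, persist it, and load it back later when it is needed.
val model = new LogisticRegression().setMaxIter(10).fit(training)
model.write.overwrite().save("/tmp/lr-model")
val restored = LogisticRegressionModel.load("/tmp/lr-model")
```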
4. Why use Spark
With the traditional approach, MapReduce requires a large amount of disk I/O. As the comparison chart shows, MapReduce keeps a large amount of intermediate data in HDFS, while Spark, because it works in memory, avoids most of that disk I/O and is therefore much faster.
In terms of performance, Spark can run common tasks 20 to 100 times faster, so the first point is that Spark is fast. The second is that it is more efficient to program. Anyone who has developed in Scala will have felt this: Spark's syntax is very expressive, and code that might take ten lines elsewhere can often be written in one line of Scala. Spark also supports the major programming languages: Java, Python, Scala, and R.
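As a hedged illustration of that conciseness, a word count that needs dozens of lines in classic MapReduce fits into one chained Scala expression; this assumes an existing SparkContext named sc (for example from spark-shell) and a placeholder input file:

```scala
// Count word occurrences in a text file with one chained expression.
val counts = sc.textFile("input.txt")   // placeholder path
  .flatMap(_.split(" "))                // split lines into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.take(10).foreach(println)
```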
In addition, Spark 2.0 can take advantage of existing assets. Hadoop's ecosystem is very attractive, and Spark integrates well with it. We also mentioned the community: its contributors continually improve Spark, making it grow better and faster.
These features have made Spark increasingly popular, and more and more data scientists, including academics, are willing to use it. Spark makes big data computation simpler, more efficient, and smarter.
5. IBM support for Spark
Within IBM there is a growing focus on Spark, centered on community nurturing, products, and Spark core. On the community side, Big Data University offers rich online courses, from data-scientist development to the most basic languages, as well as Spark and the fundamentals of the Hadoop ecosystem; it has trained more than one million data scientists. IBM also sponsors AMPLab, where the Spark open-source project originated.
The second area is the contribution to Spark core: IBM has established the Spark Technology Center, where more than 300 engineers work on Spark core development. IBM has also open-sourced its machine learning library and become a partner of Databricks.
On the product side, some Spark products in the CDL are integrated into IBM's own AOP environment (note: AOP is also an open-source package), and Spark is integrated into BigInsights. IBM has committed more than 3,500 employees to Spark-related work.
Spark Basics
1. Spark Core Components
Among Spark's built-in components, the most basic is Spark Core, which is the foundation of the whole application architecture. Spark SQL, Spark Streaming, MLlib, and GraphX are the application-level sub-frameworks provided on top of it.
Spark SQL handles structured data, Spark Streaming handles real-time streaming data, MLlib is the machine learning library, and GraphX handles parallel graph computation.
Whichever application-level sub-framework is used, it is built on top of RDDs. In fact, users can develop their own sub-frameworks for different domains based on RDDs and run them with the Spark built-in components.
2. The architecture of a Spark application
In each Spark application there is only one driver program and a number of executors. Looking at the worker nodes on the right, we can think of each worker node as a physical machine. Every application starts from the driver: the driver program initializes a SparkContext as the application's entry point, and each Spark application has exactly one SparkContext. The SparkContext then sets up job scheduling and task scheduling and, through the cluster manager, assigns tasks to the executors on the worker nodes, which execute them. A Spark application has multiple executors, and each executor can run multiple tasks; this is the framework for Spark's parallel computation.
Besides running tasks, executors can also keep data in cache or write it to HDFS.
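A minimal sketch of the driver side of such an application, assuming the master is supplied externally (for example by spark-submit); the application name and the computation itself are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver creates the single SparkContext that serves as the entry point.
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)

    // The work below is broken into tasks that run on the executors.
    val data = sc.parallelize(1 to 1000)
    println(data.map(_ * 2).sum())

    sc.stop()
  }
}
```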
3. Spark run modes
The four Spark run modes we usually see are: local, standalone, YARN, and Mesos. Cloud is an external runtime environment for Spark.
Local means running on the local machine, where the user can execute a Spark program directly; local[N] specifies how many threads to use. Standalone is Spark's own cluster mode, which requires the user to deploy Spark to the relevant nodes. YARN and Mesos are resource managers from the Hadoop ecosystem; when they are used, YARN or Mesos handles resource management while Spark handles job and task scheduling, as sketched below.
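A rough Scala sketch of how the run mode is selected through the master URL; the host names and ports are placeholders:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("RunModeDemo")
// conf.setMaster("local[4]")                  // local mode with 4 threads
// conf.setMaster("spark://master-host:7077")  // standalone cluster deployed by the user
// conf.setMaster("yarn")                      // resources managed by YARN
// conf.setMaster("mesos://mesos-host:5050")   // resources managed by Mesos
```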
Whichever run mode is used, it is further divided into two deploy modes: client mode and cluster mode. How do you tell them apart? Look at where the driver program sits in the architecture diagram: if the driver runs inside the cluster, that is cluster mode; if it runs outside the cluster, that is client mode.
4. Resilient Distributed Datasets (RDDs)
An RDD has several characteristics: one is that it is immutable, another is that it is partitioned. In Java or C++, the usual collections and arrays can be modified in place, but an RDD cannot be changed; it can only produce new RDDs. This reflects the fact that Scala is a functional programming language: functional programming does not encourage modifying existing data in place, but rather producing new data from existing data, mainly through transformations, that is, mappings.
Although an RDD cannot be changed, it can be distributed across different partitions, which lets the user operate on an abstract distributed dataset the way they would operate on a local collection. The RDD itself is an abstract concept, not a concrete object; how it is assigned to the nodes is transparent to the user, who only needs to treat the RDD like a local dataset and does not have to care how it is split across partitions.
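A small sketch of both properties, assuming an existing SparkContext named sc: the RDD is split across partitions, and map produces a new RDD instead of modifying the old one.

```scala
val nums = sc.parallelize(1 to 5, 2)      // an RDD split across 2 partitions
val doubled = nums.map(_ * 2)             // a new RDD; `nums` itself is unchanged

println(nums.collect().mkString(","))     // 1,2,3,4,5
println(doubled.collect().mkString(","))  // 2,4,6,8,10
```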
Operations on an RDD come in two kinds: transformations and actions. A transformation converts one RDD into a new RDD, but it has the characteristic of deferred (lazy) execution. An action either writes data out or returns information to the application. The transformations are only triggered when an action is executed; that is what deferred execution means.
Take a look at the code on the right, which is written in Scala. The first line creates a Spark context and reads a file. Three operations are then applied to the file: a map, a filter, and a save. The first two are transformations; map performs a mapping, filter filters, and save writes the result. When execution reaches the map and filter steps, nothing actually runs; only when the save action starts does Spark go back and execute the first two steps.
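Since the original code sits in a figure, here is a hedged reconstruction; the file paths, the mapping, and the filter condition are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LazyDemo"))
val lines  = sc.textFile("hdfs:///tmp/input.txt")   // create a context and read a file
val mapped = lines.map(_.toLowerCase)               // transform 1: nothing executes yet
val kept   = mapped.filter(_.contains("spark"))     // transform 2: still nothing executes
kept.saveAsTextFile("hdfs:///tmp/output")           // the save action triggers both transforms
```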
5. Execution of Spark Program
Now that we understand RDDs and how Spark works, let's look at how a Spark program is executed in general.
Take the same three lines of code as before: the first two steps are transformations and the last step is an action. The series of RDD transformations forms a DAG. The SparkContext initializes a DAG scheduler and a task scheduler: the DAG scheduler turns the chain of RDD transformations into stages, each stage is divided into a task set, and the task scheduler, through the cluster manager, distributes the task sets to the different executors for execution.
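One way to peek at this process is toDebugString, which prints the lineage the DAG scheduler works from; this sketch assumes an existing SparkContext named sc and a placeholder input file:

```scala
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // the shuffle here marks a stage boundary

// Prints the RDD lineage; the indentation shows how it splits into stages.
println(counts.toDebugString)
```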
6. Spark DataFrame
Many people will ask: if we already have RDDs, why do we also need DataFrames? The DataFrame API was released in 2015 with Spark 1.3; it organizes a distributed dataset into named columns.
The original Spark was aimed mainly at big data, which is mostly unstructured. Unstructured data requires users to build the mappings themselves, whereas DataFrame provides ready-made ways to manipulate data on a big data platform as if operating on relational tables. This lets many data scientists work with big data platforms in the way they originally used relational databases.
DataFrame supports many data sources, such as JSON, Hive, JDBC, and so on.
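Here are hedged examples of reading two of these sources into DataFrames; the file path, JDBC URL, table name, and credentials are placeholders, and Hive tables additionally require a session built with Hive support:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SourcesDemo").getOrCreate()

// JSON file on the local file system or HDFS.
val jsonDF = spark.read.json("data.json")

// Table read over JDBC; the driver jar must be on the classpath.
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.people")
  .option("user", "dbuser")
  .option("password", "secret")
  .load()
```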
There is another reason for DataFrame's existence. In the chart above, the blue bars show the performance of operating on the same data with RDDs in different languages: RDD performance in Python is poor, while Scala performs better. The green bars show that when the same program is written with DataFrames, performance is the same across languages. In other words, RDD performance varies by language, but DataFrame performance is consistent across languages and is generally higher than RDD performance.
Here is a simple DataFrame example.
On the right is another piece of Scala code. It uses a SQLContext, which supports JSON files directly, so it calls jsonFile to read the JSON file. The DataFrame is then used directly: df.groupBy("age").count().show() displays the result as a table. The operation is very simple and the user does not need to write any map step, whereas with an RDD the user would have to process each record one by one.
7. Spark programming language
In terms of programming languages, Spark currently supports the following four: Java, Scala, Python, and R.
8. How to use Spark
As for usage, if a Spark cluster is available locally, there are two ways to work with it. One is spark-shell, the interactive command line. Interactive operation is very simple: you enter one line at a time and it responds interactively, showing you what each line produces; you can also paste in a block of code and debug while running. In general, the interactive command line is used in local mode.
The second is spark-submit, which is used more often for developed engineering projects. spark-submit has several required parameters: the master (the run mode) must be given, along with a few others such as the main class and the location of the jar package. You can see how many arguments spark-submit takes and what each one means with spark-submit --help.
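A hedged example of such an invocation; the master URL, deploy mode, main class, jar path, and arguments are all placeholders:

```
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  /path/to/my-app.jar arg1 arg2
```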
Spark can also be used through web-based notebooks; the IBM Workbench provides both Jupyter and Zeppelin.