What is Spark?
The Apache website describes Spark with a very simple phrase: 'Spark is a fast and general engine.' In other words, Spark is a unified computing engine, and the description highlights speed. What is it for, specifically? Large-scale processing, that is, big data processing.
'Spark is a fast and general engine for large-scale processing' is a short sentence, but it highlights several of Spark's characteristics: Spark is a parallel, in-memory, compute-intensive computing engine.
Why in-memory? Spark is based on the MapReduce model, but its intermediate data is kept in memory rather than on HDFS, so it is a memory-based computation, which is what makes Spark fast. Spark can also be deployed as a cluster, so work can be distributed to the nodes and computed in parallel. On top of Spark there are many packages for machine learning and data mining, and users can apply these packages to iterate over their data, which makes Spark a compute-intensive computing tool.
Why use Spark?
With the traditional approach, MapReduce requires a large amount of disk I/O, because MapReduce stores large amounts of intermediate data on HDFS. Spark keeps data in memory, so it does not need that heavy disk I/O, which increases processing speed.
In terms of performance, Spark can run common tasks 20-100 times faster. So the first point is that Spark is fast; the second is that it is more efficient to develop with. Anyone who has developed programs in Scala will recognize this: Spark's syntax is very expressive, and matching code that might take 10 lines elsewhere can often be written in one line of Scala. Spark also supports the major programming languages: Java, Python, R, and so on.
In addition, Spark integrates well with the Hadoop ecosystem.
Spark Basics
1. Spark Core Components
A. Spark SQL - processing of structured data
B. Spark Streaming - building scalable, fault-tolerant streaming applications
C. MLlib - Spark's machine learning library
D. GraphX - parallel graph computation
Among the Spark built-in components, the most basic is Spark Core, which is the foundation of the whole application architecture. Spark SQL, Spark Streaming, Spark MLlib, and GraphX are application components provided on top of it.
Whichever of these application sub-frameworks you use, it is ultimately a framework built on the RDD. Users can in fact develop their own sub-frameworks for different domains on top of the RDD and have Spark Core execute them.
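As a rough sketch (the application name, master URL, and batch interval are illustrative, assuming a Spark 1.x-style API with the spark-sql and spark-streaming modules on the classpath), the higher-level components are created on top of the same Spark Core entry point:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Spark Core: the common foundation every component builds on
val conf = new SparkConf().setAppName("components-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Spark SQL and Spark Streaming are created from the same SparkContext
val sqlContext = new SQLContext(sc)            // structured data
val ssc = new StreamingContext(sc, Seconds(1)) // micro-batches every second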
2. The Architecture of a Spark Application
Each Spark application has exactly one driver program and several executors. On the right of the diagram you can see the worker nodes; you can think of a worker node as a physical machine. Every application starts from the driver: the driver program initializes a SparkContext, which serves as the application's entry point, and each Spark application has only one SparkContext. The SparkContext then sets up job scheduling and task scheduling and, through the cluster manager, assigns tasks to the nodes, where the executors on the worker nodes run them. A Spark application has multiple executors, and each executor can run multiple tasks; this is the framework for Spark's parallel computing.
In addition to running tasks, executors can also cache data in memory or persist it to HDFS.
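A minimal sketch of the driver side of this picture (the application name and the toy computation are made up for illustration; the master URL is assumed to be supplied at submit time):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver program initializes the one and only SparkContext
    val conf = new SparkConf().setAppName("my-spark-app")
    val sc = new SparkContext(conf)

    // The work described here is split into tasks and shipped to the executors
    val count = sc.parallelize(1 to 1000000).map(_ * 2).count()
    println(s"count = $count")

    sc.stop()
  }
}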
3. Spark Run Modes
What we see here are the first four Spark run modes: Local, Standalone, YARN, and Mesos. Cloud is an external runtime environment for Spark.
Local means running on a single machine: the user executes the Spark program locally, and local[n] specifies how many threads to use. Standalone is the cluster mode that Spark ships with, which requires the user to deploy Spark to the relevant nodes themselves. YARN and Mesos are resource managers (YARN is part of the Hadoop ecosystem); when YARN or Mesos is used, they handle resource management while Spark handles scheduling its own tasks.
Whichever run mode you choose, it is further subdivided into two deployment modes: client mode and cluster mode. How do you tell them apart? Look at where the driver program sits in the architecture diagram: if the driver program runs inside the cluster, it is cluster mode; if it runs outside the cluster, it is client mode.
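The choice of run mode shows up in code (or, more commonly, in the spark-submit arguments) as the master URL. A minimal sketch, with placeholder host names and ports:

import org.apache.spark.{SparkConf, SparkContext}

// Pick ONE master URL depending on the run mode:
val conf = new SparkConf()
  .setAppName("run-mode-sketch")
  .setMaster("local[4]")                    // Local mode, using 4 threads
  // .setMaster("spark://master-host:7077") // Standalone cluster
  // .setMaster("mesos://mesos-host:5050")  // Mesos cluster
  // .setMaster("yarn")                     // YARN ("yarn-client" / "yarn-cluster" in older versions)

val sc = new SparkContext(conf)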
4. Resilient Distributed Dataset (RDD)
An RDD has several characteristics: it is immutable, and it is partitioned. In Java or C++, the collections and arrays we normally use can be modified, but an RDD cannot be changed; operations on it can only produce new RDDs. This reflects the fact that Scala is a functional programming language: functional programming discourages modifying existing data in place and instead derives new data from the data you already have, mainly through transformations, that is, mappings.
Although an RDD cannot be changed, it can be distributed across different partitions, which lets the user work with an abstract distributed dataset in the same way as a local collection. The RDD itself is an abstract concept; it is not one concrete object but is spread across the nodes, and this is transparent to the user. Users simply operate on the RDD as they would on a local dataset and do not need to care how it is divided among the partitions.
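A small sketch of these two properties, assuming an existing SparkContext named sc (the numbers and the partition count are arbitrary): transformations return new RDDs instead of modifying the original, and the data sits in partitions behind the same collection-like API:

val original = sc.parallelize(1 to 10, 4) // explicitly split the data into 4 partitions
val doubled = original.map(_ * 2)         // produces a NEW RDD; `original` is unchanged

println(original.partitions.length)       // 4
println(doubled.collect().mkString(","))  // 2,4,6,...,20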
There are two main kinds of operations on an RDD: transformations and actions. A transformation converts one RDD into a new RDD, and its distinctive feature is deferred execution. The second kind is the action, which either writes data out or returns some result to the application. Transformations are only triggered when an action is executed; that is what deferred execution means.
Take a look at the code on the right. It is Scala code: the first line creates a SparkContext; the second line reads a file, and then three operations are applied to it: map, filter, and saveAsTextFile. The first two, map and filter, are transformations: map applies a mapping and filter filters the data. saveAsTextFile is the action that finally writes the data out. Before the saveAsTextFile write happens, the preceding map and filter are not executed; only when saveAsTextFile starts do the map and filter transformations actually run.
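A sketch of the three-step code being described (the input and output paths and the exact map and filter logic are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-sketch"))
val lines = sc.textFile("/path/to/input.txt")

val result = lines
  .map(_.toUpperCase)          // transformation: nothing runs yet
  .filter(_.contains("SPARK")) // transformation: nothing runs yet

result.saveAsTextFile("/path/to/output") // action: now map and filter actually execute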
5. Running a Spark Program
Having seen how the RDD and Spark work, let's take a holistic look at how a Spark program executes.
Take the same three lines of code as before: the first two steps are transformations and the last step is an action. The transformations build up a series of RDDs. The SparkContext initializes a DAG scheduler and a task scheduler: the DAG scheduler converts the series of RDDs into different stages, the task scheduler splits each stage into separate tasks, and the cluster manager dispatches these task sets to different executors for execution.
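One way to peek at this, reusing the result RDD from the sketch above, is the RDD's debug string, which prints the lineage that the DAG scheduler later cuts into stages:

// Prints the chain of RDDs (the lineage); the DAG scheduler breaks this
// lineage into stages at shuffle boundaries once an action is triggered
println(result.toDebugString)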
6. Spark DataFrame
Many people ask: we already have the RDD, so why do we also need the DataFrame? The DataFrame API was released in 2015 and has been available since Spark 1.3; a DataFrame is a distributed dataset organized into named columns.
Spark was originally aimed at big data, which is mostly unstructured data. Unstructured data requires the user to organize the structure and mappings themselves, whereas the DataFrame provides ready-made ways to work with data on a big data platform as if manipulating relational tables. This lets many data scientists reuse their existing relational database knowledge and habits on a big data platform.
DataFrames support a number of data sources, such as JSON, Hive, and JDBC.
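As a hedged sketch of those sources, assuming an existing SQLContext named sqlContext (paths, table names, and connection settings are placeholders; this uses the DataFrameReader API available from Spark 1.4 onwards, while the example further below uses the older Spark 1.3 jsonFile style):

val fromJson = sqlContext.read.json("/path/to/your/jsonfile")

val fromHive = sqlContext.table("some_hive_table") // requires Hive support to be enabled

val fromJdbc = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "people")
  .load()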
There is another reason the DataFrame exists. Consider the performance chart: the blue bars show the performance of the same computation written against the RDD in different languages. As you can see, the RDD performs poorly in Python, while Scala performs much better. The green bars represent the DataFrame: when the program is written with DataFrames, performance is the same across languages. In other words, RDD performance varies with the language used, but DataFrame performance is not affected by the language, and it is also higher than the RDD's overall performance.
A simple DataFrame example is shown below.
val df = sqlContext.jsonFile("/path/to/your/jsonfile")
df.registerTempTable("people")
sqlContext.sql("SELECT age, count(*) FROM people GROUP BY age").show()
It first reads the JSON file to obtain a DataFrame, then registers the DataFrame as a virtual temporary table, and finally runs the SQL query and displays the result.