Because Spark is implemented in Scala, Spark natively supports the Scala API; in addition, Java and Python APIs are provided.
Take the Python API of Spark 1.3 as an example. pyspark is the top-level package of the Python API, and it contains several important submodules, described below.
1) pyspark.SparkContext
It abstracts a connection to the Spark cluster and is used to create RDD objects; it is the main entry point of the API.
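For instance, a minimal sketch of creating a SparkContext and an RDD might look like the following (the local master URL and the app name "ApiDemo" are illustrative assumptions):

```python
# Minimal sketch: create a SparkContext and use it to build an RDD
from pyspark import SparkContext

sc = SparkContext(master="local[2]", appName="ApiDemo")  # connect to a (local) cluster
rdd = sc.parallelize([1, 2, 3, 4])                       # create an RDD from a Python list
print(rdd.count())                                       # 4
sc.stop()
```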
2) pyspark.SparkConf
It allows you to build the Spark application's configuration dynamically in the submitted application code and pass it to the pyspark.SparkContext constructor as the conf parameter.
If conf is not created dynamically, the pyspark.SparkContext instance reads the default global configuration from conf/spark-defaults.conf.
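A minimal sketch of building a SparkConf dynamically and passing it to the SparkContext constructor; the master URL, app name and memory setting are illustrative assumptions:

```python
# Minimal sketch: dynamically built configuration passed as the conf parameter
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("ConfDemo")
        .set("spark.executor.memory", "1g"))  # override a setting in code
sc = SparkContext(conf=conf)                  # without conf=, defaults come from conf/spark-defaults.conf
```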
3) pyspark.RDD
RDDs can be kept in memory between queries without requiring replication. Instead of replicating data, they rebuild lost partitions on failure using lineage: each RDD remembers how it was built from other datasets (by transformations such as map, join or groupBy), so it can reconstruct itself.
The RDD is the core abstraction of Spark programming; it represents a resilient distributed dataset. Spark supports two types of RDD operations, transformations and actions; the full function lists can be found in the "Transformations" and "Actions" sections of the official documentation.
According to the "RDD Operations" section of the Spark Programming Guide, an operation that creates a new dataset from an existing one is called a transformation, while an operation that computes over the dataset and returns a result to the driver program is called an action.
For example, map applies the function passed in as an argument to an existing RDD, and its result is a new RDD, so it is a transformation. reduce, on the other hand, aggregates an existing RDD with the function passed in, and its result is no longer an RDD but a concrete value (for reduce, a single number; other actions may return a list or another data structure), so reduce is an action.
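A minimal sketch of the two kinds of operations, reusing a SparkContext named sc as in the earlier snippets:

```python
# Transformation vs. action, assuming an existing SparkContext `sc`
rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)          # transformation: returns a new RDD, nothing runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the computation, returns 30 to the driver
print(total)
```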
It is worth emphasizing that Spark uses a lazy evaluation strategy for all transformations: Spark does not evaluate each transformation immediately to produce a new RDD. Instead, it records the series of transformations applied to an RDD, and only when an action is finally encountered does it compute all of the previously recorded transformations.
This lazy evaluation design lets Spark execute more efficiently, because the scheduler can merge or otherwise optimize the transformations along the path from the initial RDD to the final action. Only the result of the final action is returned to the driver program, which saves the overhead of transferring the intermediate results of transformations between the cluster's worker nodes and the driver program.
By default, every time an action is invoked, each transformation the initial RDD passes through is recomputed. When multiple actions share a series of identical transformations, this recomputation is inefficient, so when writing a Spark computation script it is best to call persist or cache on transformation results that are shared by multiple actions, as in the sketch below. This can save a lot of computation time.
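A minimal sketch of caching a transformation result that is shared by two actions, again assuming a SparkContext named sc; the squaring step simply stands in for an expensive transformation:

```python
# Cache a shared transformation result so two actions do not recompute it
data = sc.parallelize(range(1000)).map(lambda x: x * x)  # stands in for an expensive transformation
data.cache()                            # keep the RDD in memory after the first action computes it
print(data.count())                     # first action: computes and caches `data`
print(data.reduce(lambda a, b: a + b))  # second action: reuses the cached RDD instead of recomputing
```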
4) pyspark.Broadcast
A variable distributed with broadcast is visible to the executor process on every node the application uses; after it is broadcast, the variable remains in each worker node's executor process until the task ends. This avoids the overhead of repeatedly transferring the data between the driver and the executor processes on the worker nodes.
In particular, for applications that only need read-only shared variables (for example, a dictionary that every compute node must be able to access), broadcasting is an effective way to share the variable.
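For example, a minimal sketch of sharing a read-only dictionary through a broadcast variable, assuming a SparkContext named sc:

```python
# Share a read-only lookup table with every executor via a broadcast variable
lookup = {"a": 1, "b": 2, "c": 3}
b_lookup = sc.broadcast(lookup)                       # shipped to each executor once
rdd = sc.parallelize(["a", "b", "c", "a"])
mapped = rdd.map(lambda k: b_lookup.value.get(k, 0))  # executors read the broadcast value locally
print(mapped.collect())                               # [1, 2, 3, 1]
```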
5) pyspark.Accumulator
It is the second way Spark supports shared variables (the first being the broadcast variable described above). Processes on the worker nodes can update the variable with the add() operation, and the updated value is propagated back to the driver program.
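A minimal sketch of counting matching elements with an accumulator, assuming a SparkContext named sc:

```python
# Count even numbers with an accumulator; worker updates are sent back to the driver
acc = sc.accumulator(0)

def count_even(x):
    if x % 2 == 0:
        acc.add(1)   # updates made on the workers are propagated back to the driver

sc.parallelize(range(10)).foreach(count_even)  # foreach is an action, so the updates actually run
print(acc.value)                               # 5
```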
6) pyspark.SparkFiles
When a file has been distributed to the cluster with SparkContext.addFile() as part of submitting a task, the methods of the SparkFiles class can be used to resolve the file's path and access it.
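A minimal sketch, assuming a SparkContext named sc and a hypothetical local file lookup.txt that is shipped with addFile():

```python
# Resolve a file distributed with SparkContext.addFile() via SparkFiles
from pyspark import SparkFiles

sc.addFile("lookup.txt")                 # hypothetical file; copied to every node of the cluster

def read_first_line(_):
    path = SparkFiles.get("lookup.txt")  # resolve the file's local path on the executor
    with open(path) as f:
        return f.readline().strip()

print(sc.parallelize([0]).map(read_first_line).collect())
```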
7) pyspark.StorageLevel
It specifies the storage level of an RDD, such as memory only, disk only, or memory first with spill to disk. The specific flags are described in the documentation referenced below.
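A minimal sketch of persisting an RDD at an explicit storage level, assuming a SparkContext named sc:

```python
# Persist an RDD with an explicit storage level
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill partitions to disk if needed
print(rdd.count())
```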
"References"
1. Spark Programming Guide-rdd Operations
2. Pyspark Package
3. Spark Programming Guide:rdd Transformations
4. Spark Programming Guide:rdd Actions
5. Pyspark Package:pyspark. Storagelevel