Introduction: Spark was developed by AMPLab. It is essentially a fast, memory-based iterative framework, and iteration is the defining trait of machine learning, which makes Spark a natural fit for it. Thanks to its strength in data science, Python has fans all over the world; when it meets a powerful distributed in-memory computing framework like Spark, the two strengths come together and naturally strike an even more powerful spark. This article is therefore mainly about PySpark.
This article is selected from the book "Full Stack Data Door".
The Full-Stack Framework
Spark was developed by AMPLab. It is essentially a fast, memory-based iterative framework, and iteration is the defining trait of machine learning, which makes Spark a natural fit for it. The framework itself is written in Scala and natively provides four kinds of APIs; the latest version supports Scala, Java, Python, and R. Python is not Spark's "favorite child", so its support is slightly weaker, but basically all the commonly used interfaces are available. Thanks to its strength in data science, Python has fans all over the world; when it meets a powerful distributed in-memory computing framework like Spark, the two strengths come together and naturally strike an even more powerful spark. PySpark is therefore the protagonist of this section. Spark's main components are listed below.
- Spark Core: RDDs and their operators.
- Spark SQL: DataFrame and SQL.
- Spark ML (MLlib): the machine learning framework.
- Spark Streaming: the real-time computing framework.
- Spark GraphX: the graph computing framework.
- PySpark (SparkR): the Python and R frameworks on top of Spark.
From offline computing with RDDs to real-time computing with Streaming, from DataFrame and SQL support to the MLlib machine learning framework, from GraphX graph computing to support for R, the favorite of statisticians, you can see that Spark is building its own full-stack data ecosystem. Judging from the feedback of academia and industry, Spark has succeeded in doing so.
Environment Setup
Whether it is a mule or a horse, you only know once you take it out for a walk. Trying out Spark is a very simple matter; a single machine is enough for testing and development.
Visit http://spark.apache.org/downloads.html and download a pre-built version; unzip it and it is ready to use. Choose the latest stable release and note the versions marked "pre-built". For example, the latest version at the time of writing is 1.6.1, so you would usually download the file spark-1.6.1-bin-hadoop2.6.tgz. A file name containing "-bin-" indicates a pre-built version: there is no need to install a Scala environment or to compile anything; just extract it to a directory.
After adding Spark's environment variables to .bashrc, remember to source the file so that they take effect:
Then run the command pyspark or spark-shell. If you see Spark's handsome text logo and the corresponding command-line prompt >>>, you have entered the interactive interface, which means the configuration succeeded.
Both pyspark and spark-shell support interactive testing, so you can start testing right away. Compared with Hadoop, it takes basically zero configuration before you can start.
Spark-shell Test:
Pyspark Test:
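A minimal smoke test in the pyspark shell might look like the sketch below; the shell already provides the sc variable, and the numbers in the comments are purely illustrative.

```python
# Sketch of an interactive smoke test inside the pyspark shell.
# `sc` (the SparkContext) is created automatically by the shell.
sc.version                           # e.g. u'1.6.1'
rdd = sc.parallelize(range(1, 11))   # build a small RDD from a Python range
rdd.count()                          # 10
rdd.sum()                            # 55
```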
Distributed deployment
The tests above succeeded, which shows that Spark's development and testing environment is configured. But what about "distributed"? I pulled this package down just to try Spark's distributed environment, and this is all you show me?
In fact, the local mode above runs with the master set to local[*], where * means all CPU cores are used; you can also write local[4], meaning that only 4 cores are used.
Code written for single-machine local mode can run in a distributed environment with only a few changes. Spark's distributed deployment is supported in several ways, as listed below.
- Standalone: Spark's own cluster manager (easy for testing, and promoted by the Spark framework itself).
- Mesos: a new resource management framework.
- YARN: Hadoop's new resource and compute management framework, which can be understood as the operating system of Hadoop and can support many different computing frameworks.
- EC2: deployment on Amazon's machine environment.
In terms of ease of use, standalone is the simplest distributed mode: just copy the unzipped package to every machine, configure the master file and the slaves file to indicate which machine is the master and which machines are the slaves, and then start the cluster on the master machine with the bundled scripts.
The advantage of going distributed is more CPUs and larger memory. Looking at Spark from the CPU perspective, there are the following ways to run it.
- Local single CPU: "local", data files on the local machine.
- Local multi-CPU: "local[4]", data files on the local machine.
- Standalone cluster multi-CPU: "spark://master-ip:7077", every machine needs access to the data files.
- YARN cluster multi-CPU: submitted with "yarn-client", every machine needs access to the data files.
How the interactive environment is deployed is also related to the modes above. Using spark-shell or pyspark directly gives you local mode; if you want single-machine multi-core or cluster mode, you need to specify the --master parameter, as shown below.
If you use pyspark and are used to IPython's interactive style, you can also set environment variables to start an IPython session, or use the notebook provided by IPython:
The IPython style is as follows:
Sample Analysis
Environment deployment is the biggest headache for beginners. Now that the environment has been set up, we get to the point. Because Scala is considerably more complex than Python, we start by writing programs with PySpark.
Spark has two of the most basic concepts: sc and RDD. sc is short for SparkContext; as the name implies, it is the Spark context. sc connects to the cluster and holds the corresponding configuration parameters; all subsequent operations run inside this context, and it is the foundation of everything in Spark. When the interactive interface starts, notice the hint:
SparkContext available as sc, HiveContext available as sqlContext.
This means that the variable sc represents the SparkContext context and can be used directly; it is initialized when the interactive session starts.
In a non-interactive environment, you need to initialize it in your own code:
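A minimal sketch of such an initialization, assuming a local run; the application name and master URL are placeholders you would adapt:

```python
# Minimal sketch: create the SparkContext yourself in a standalone script.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("local[*]")  # placeholder values
sc = SparkContext(conf=conf)
```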
RDD is short for Resilient Distributed Dataset and is the most important data-handling object in Spark. There are a number of ways to create an RDD; the most important one is to read a file:
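For example, reading a local text file might look like the sketch below, assuming a file named joy.txt exists in the working directory:

```python
# Sketch: build an RDD from a text file; joy.txt is assumed to exist locally.
lines = sc.textFile("joy.txt")
```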
After joy.txt is read, the result is an RDD whose contents are the text of the file as strings, one element per line.
The WordCount example is as follows:
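A sketch of such a WordCount, written the way the explanation below describes (chained with parentheses; the variable names are illustrative):

```python
# WordCount sketch: split lines into words, pair each word with 1,
# then add up the counts per word.
wc = (lines.flatMap(lambda line: line.split(" "))
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b))

print(wc.collect())
```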
In the code above, I personally prefer to use a closing parenthesis to wrap lines rather than adding a continuation character at the end of each line. Anonymous lambda functions are used extensively. Each step works as follows.
- flatMap: for each line of the lines data, perform a map operation, splitting the line on spaces into a list of words; then perform the flatten operation, expanding the per-line lists into one big list. The data structure at this point is: ['one', 'two', 'three', ...].
- map: turn each element of the list into a key-value pair whose value is 1. The data structure at this point is: [('one', 1), ('two', 1), ('three', 1), ...], where keys such as 'one', 'two', 'three' may appear more than once.
- reduceByKey: add up the values of the elements above that share the same key. The data structure at this point is: [('one', 3), ('two', 8), ('three', 1), ...], where keys such as 'one', 'two', 'three' no longer repeat.
Finally, the wc.collect() function is called, which tells Spark to fetch all the data in wc; the parsed result is a list of tuples.
Compared with a version implemented by hand in Python, the Spark implementation is not only simple but elegant.
Two types of operators
Spark is built on the sc context, its basic data set is the RDD, and everything else is operations on RDDs.
Operations on RDDs come in two kinds, transform and action, also known as the two basic kinds of RDD operators.
Transform means transformation: it converts an RDD in some way into another RDD, for example turning a list into another list with a map.
Of course, one of the important reasons why Spark stands out from Hadoop's map-reduce model is its powerful operators. Instead of forcing everything into the map and reduce mold, Spark provides much more powerful transformation capabilities, which makes its code simple and elegant.
Some common transforms are listed below; a short sketch follows the list.
- map(): map, similar to Python's map function.
- filter(): filter, similar to Python's filter function.
- reduceByKey(): merge values by key.
- groupByKey(): group values by key.
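A small sketch of these transforms on a toy key-value RDD; the data and the expected outputs in the comments are illustrative, and the order of collect results may vary:

```python
# Toy data to exercise the four transforms above.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

mapped   = pairs.map(lambda kv: (kv[0], kv[1] * 10))   # transform each element
filtered = pairs.filter(lambda kv: kv[1] % 2 == 0)     # keep even values only
summed   = pairs.reduceByKey(lambda a, b: a + b)       # merge values by key
grouped  = pairs.groupByKey()                          # gather values per key

print(summed.collect())                                # e.g. [('a', 4), ('b', 6)]
print([(k, list(v)) for k, v in grouped.collect()])    # e.g. [('a', [1, 3]), ('b', [2, 4])]
```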
A very important feature of RDDs is laziness. Executing a transform on an RDD does not run anything immediately; only when an action is encountered does Spark build the DAG of the job and run it, and this DAG is one reason why Spark is fast. Some common actions are listed below, followed by a short sketch of the lazy behaviour.
- first(): returns the first element of the RDD.
- take(n): returns the first n elements of the RDD.
- collect(): returns all elements of the RDD.
- sum(): sums the elements.
- count(): counts the number of elements.
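A short sketch of the lazy behaviour: the transform below only defines work, and each action then triggers a job (the values in the comments are illustrative):

```python
nums    = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)   # transform only: nothing has executed yet

squares.first()     # action -> 1
squares.take(3)     # action -> [1, 4, 9]
squares.collect()   # action -> [1, 4, 9, 16, 25]
squares.sum()       # action -> 55
squares.count()     # action -> 5
```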
Going back to the WordCount example: the program only executes all the preceding transforms on the RDD when it reaches wc.collect(), which fetches all the data; and by building a DAG of the dependencies, it also guarantees runtime efficiency.
Map and reduce
The initial data is a list; each element of the list is a tuple, and each tuple contains three fields representing id, name, and age. RDDs handle this kind of basic yet composite data structure well, and you can use pprint to print the results, which makes the data structure easier to understand. The code is as follows:
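A sketch of the whole example walked through below; the three records are made-up placeholders, and it assumes sc already exists:

```python
# Placeholder records: (id, name, age).
from pprint import pprint

data = [(1001, "Tom", 18), (1002, "Jerry", 20), (1003, "Spike", 25)]

base = sc.parallelize(data)     # serialize the Python list into an RDD
print(type(base))               # an RDD

aged = base.map(lambda v: (v[0], v[1], v[2] + 3))   # add 3 to the age field only
print(type(aged))               # a PipelinedRDD, which inherits from RDD
pprint(aged.collect())          # collect() is an action: it triggers the job

total_age = aged.map(lambda v: v[2]).reduce(lambda x, y: x + y)  # sum of ages
print(total_age)                # 21 + 23 + 28 = 72
```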
The parallelize operator serializes a Python data structure into an RDD. It accepts a list argument and also supports splitting the data into several partitions at serialization time. Partitions are the smallest unit of granularity at Spark runtime, and multiple partitions are distributed across the cluster and processed in parallel.
Printing the data type with Python's type function shows that base is an RDD. On top of this RDD, a map operator is used to increase the age by 3 years while leaving the other values unchanged. map is a higher-order function: it takes a function as an argument, applies that function to each element, and returns the new elements produced by the function. An anonymous lambda is used here; it takes one parameter v, adds 3 to the age field v[2], and returns the other fields unchanged. From the result you can see that a PipelinedRDD is returned; it inherits from RDD and can simply be understood as a new RDD structure.
To print the contents of the RDD, an action operator must be used to trigger a job; collect is used here to fetch all of its data.
Next, map is used to extract the age field v[2] from the data, and then a reduce operator computes the sum of all ages. The argument of reduce is again a function, and that function must accept two parameters; it iterates over the elements of the RDD and aggregates them. The effect is the same as Python's reduce: only a single value is returned in the end. Here the sum of the ages is computed with x + y, so a number is returned, as shown in the execution result.
AMPLab's Ambition
Besides the famous Spark, AMPLab also wants to build a complete memory-based data analysis ecosystem; see https://amplab.cs.berkeley.edu/software/ for reference.
Their goal is BDAS (Berkeley Data Analytics Stack), a full-stack, memory-based big data analytics stack. The Mesos described earlier is a cluster resource manager. There is also Tachyon, a memory-based distributed file system similar to Hadoop's HDFS, while Spark Streaming is similar to Storm for real-time computing.
A powerful full-stack Spark props up half the sky of big data.