A Strong Alliance: The Python Language Combined with the Spark Framework

Source: Internet
Author: User
Tags: pyspark

Introduction: Spark was developed by the AMPLab at UC Berkeley. It is essentially a high-speed, memory-based iterative computing framework, and iteration is the most important characteristic of machine learning, so Spark is well suited to machine learning.

Thanks to its strength in data science, the Python language has fans all over the world; now it meets Spark, a powerful distributed in-memory computing framework. When two strong players from different fields come together, they naturally strike an even more powerful spark (which is exactly what the name Spark means). This article therefore focuses mainly on PySpark.


This article is excerpted from the book "The Full-Stack Data Door".

Full-Stack Framework

Developed by the AMPLab, Spark is essentially a high-speed, memory-based iterative framework, and iteration is the most important characteristic of machine learning, so Spark is a good fit for it.
The framework itself is written in Scala and natively provides four APIs: Scala, Java, Python, and, in recent versions, R. Python is not Spark's "own child", so there are some gaps in support, but the commonly used interfaces are all covered.

Thanks to its strength in data science, the Python language has fans all over the world, and now it meets Spark, the powerful distributed in-memory computing framework. When two strong players from different fields come together, they naturally strike an even more powerful spark, so PySpark is the protagonist of this section.
Among the Hadoop distributions, both CDH5 and HDP2 already integrate Spark, although the integrated version lags slightly behind the official release.

The latest HDP 2.4 already integrates Spark 1.6.1 (the latest official release is 2.0), which shows that Hortonworks updates quickly and keeps pace with upstream.
That Spark could emerge alongside Hadoop's MapReduce computing framework and gradually build up its own full-stack ecosystem says a lot. So what does the full Spark technology stack look like? Spark currently offers the following six major components.

    1. Spark Core: the RDD and its operators.
    2. Spark SQL: DataFrame and SQL.
    3. Spark ML (MLlib): machine learning framework.
    4. Spark Streaming: real-time computing framework.
    5. Spark GraphX: graph computing framework.
    6. PySpark (SparkR): the Python and R frameworks on top of Spark.

From offline computation with RDDs to streaming real-time computation, from DataFrame and SQL support to the MLlib machine learning framework, from GraphX graph processing to support for statisticians' beloved R, you can see Spark building its own full-stack data ecosystem. Judging by current feedback from academia and industry, Spark has pulled it off.

Environment Setup

Whether it is a mule or a horse, you only know once you take it out for a walk.

Trying out Spark is very easy; a single machine is enough for testing and development.
Visit http://spark.apache.org/downloads.html, download a pre-built version, and unpack it; it is ready to use.

Choose the latest stable version, and be sure to pick one labeled "Pre-built". For example, the current version is 1.6.1, and the file usually downloaded is spark-1.6.1-bin-hadoop2.6.tgz. A file name containing "-bin-" indicates a pre-compiled build, so no extra Scala environment and no compilation are needed; simply unpack it into a directory.


Suppose you unpack it into the directory /opt/spark. Then add the path to the .bashrc file in your $HOME directory:
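A minimal sketch of that addition (assuming the /opt/spark path above; the SPARK_HOME variable name is a common convention, not a requirement):

    # appended to ~/.bashrc
    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH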


Remember to source the .bashrc file so that the environment variable takes effect:
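For example, assuming the file sits in your home directory:

    source ~/.bashrc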

Then run the command pyspark or spark-shell. If you see Spark's handsome text logo and the corresponding command-line prompt (>>> for pyspark), you have successfully entered the interactive interface and the configuration is working.


Both pyspark and spark-shell support interactive testing, so you can start experimenting right away. Compared with Hadoop, you can basically begin testing with zero configuration. A spark-shell test and a pyspark test work the same way.

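For instance, a tiny sanity check typed at the pyspark prompt (the numbers are purely illustrative):

    >>> sc.parallelize(range(10)).sum()
    45
    >>> sc.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()
    [2, 4, 6]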

Distributed Deployment

The environment above tested successfully, which proves that Spark's development and test environment is configured. But what about distributed mode? "I pulled all of this down because I wanted to try Spark's distributed environment, and this is what you show me?"
What was described above is a single-machine deployment, usable for development and testing, and it is just one of the deployment modes Spark supports. This is the local mode, whose advantage is that you can write and run programs on a single laptop. Although it runs on one machine, it has a very useful feature: it can use multiple workers. For example, on an 8-core machine, just specify --master local[*] when running the code and the program will execute with 8-way parallelism.

The asterisk means "use all CPU cores"; you can also write local[4], which means use only 4 cores.
Code written in single-machine local mode needs only minor changes to run in a distributed environment.

Spark supports several distributed deployment modes, listed below.

    • Standalone: Spark's built-in cluster mode (convenient for testing and for promoting Spark's own framework).
    • Mesos: a newer resource management framework.
    • YARN: Hadoop's resource and computation management framework, which can be understood as Hadoop's operating system and can support a variety of different computing frameworks.
    • EC2: deployment on Amazon's machine environment.


In terms of difficulty, standalone is the simplest: copy the unpacked package to every machine, configure the master and slaves files to indicate which machine is the master and which machines are slaves, and then start the whole cluster from the master machine with the bundled scripts.
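A rough sketch of that sequence on the master machine (the host names are made up; conf/slaves and sbin/start-all.sh are the file and script shipped with standalone mode):

    # the same unpacked Spark directory has already been copied to every node
    echo "slave1" >  conf/slaves     # one worker host name per line
    echo "slave2" >> conf/slaves
    sbin/start-all.sh                # starts the master and all listed workers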
In terms of usage, YARN is probably used the most, because Spark is usually taken straight from a Hadoop distribution; both CDH and HDP already integrate Spark with YARN, so no special attention is needed.


The advantage of distributed mode is more CPUs and larger memory. From the CPU point of view, Spark can be run in the following ways.

    • Local, single CPU: "local"; the data files are on the local machine.
    • Local, multiple CPUs: "local[4]"; the data files are on the local machine.
    • Standalone cluster, multiple CPUs: "spark://master-ip:7077"; every machine must be able to access the data files.
    • YARN cluster, multiple CPUs: submit with "yarn-client"; every machine must be able to access the data files.
How the interactive environment is started is tied to the deployment modes above: running spark-shell or pyspark directly starts local mode; if you need single-machine multi-core or cluster mode, you have to specify the --master parameter, for example as shown below.
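A few illustrative invocations (the master address is hypothetical; yarn-client is the Spark 1.x spelling):

    pyspark --master local[4]                  # single machine, 4 cores
    pyspark --master spark://master-ip:7077    # standalone cluster
    pyspark --master yarn-client               # YARN cluster
    spark-shell --master local[*]              # same idea for the Scala shell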


If you use pyspark and are used to IPython's interactive style, you can also set environment variables to start an interactive IPython session, or use the notebook that IPython provides:
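Something along these lines; the exact variables depend on the Spark version, so treat them as an assumption to check against your build:

    PYSPARK_DRIVER_PYTHON=ipython pyspark                                        # pyspark inside an IPython shell
    PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  # pyspark inside an IPython notebook
    # older 1.x builds also accepted the shortcuts IPYTHON=1 and IPYTHON_OPTS="notebook"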


The IPython style looks like the following:

[Figure 7: pyspark running inside IPython]

Example Analysis

Environment setup is the biggest headache for beginners. Now that the environment is deployed, we get to the point. Since Scala is considerably more complex than Python, we start by writing code with PySpark.
Spark has two most basic concepts: sc and the RDD.

sc is short for SparkContext; as the name implies, it is the Spark context. sc connects to the cluster and carries the corresponding configuration parameters, and all subsequent operations are carried out in this context; it is the foundation of everything in Spark. When you start the interactive interface, note the hint:

SparkContext available as sc, HiveContext available as sqlContext.


It means that the variable sc represents the SparkContext, can be used directly, and is initialized when the interactive shell starts.
In a non-interactive environment, you need to initialize it in your own code:
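A minimal sketch of such an initialization (the application name and master are arbitrary choices):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("myApp").setMaster("local[*]")
    sc = SparkContext(conf=conf)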

RDD is short for Resilient Distributed Dataset and is the most basic data-processing object in Spark. There are many ways to create an RDD; the most basic is to generate one by reading a file:
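For example, reading the joy.txt file discussed next (the path is relative to the working directory):

    lines = sc.textFile("joy.txt")   # an RDD whose elements are the lines of the file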


After joy.txt is read, the result is an RDD whose elements are strings, together holding the entire contents of the file.
Remember the WordCount code written earlier in Python, the one that ran on the MapReduce computing framework through Hadoop's streaming interface? That code was not that easy to understand; now the simpler version arrives.
The code for the WordCount example looks like the following:

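A minimal WordCount sketch consistent with the step-by-step explanation below (joy.txt is the file read above):

    wc = (lines.flatMap(lambda line: line.split(" "))   # cut each line into words
               .map(lambda word: (word, 1))             # one (word, 1) pair per word
               .reduceByKey(lambda a, b: a + b))        # add up the 1s for each word
    print(wc.collect())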
In the code above, I personally prefer to use parentheses to wrap a statement across several lines rather than adding a continuation character at the end of each line.
PySpark makes heavy use of the anonymous function lambda, because it is usually very convenient. The core of the code is interpreted as follows.

    1. flatMap: a map operation is applied to every line in lines, cutting each line into a list of words by spaces; a flatten operation then expands the per-line lists into one big list. The data structure at this point is: ['one', 'two', 'one', 'three', ...].
    2. map: every element of the list is turned into a key-value pair whose value is 1. The data structure at this point is: [('one', 1), ('two', 1), ('three', 1), ...], where a key such as 'one', 'two', or 'three' may appear repeatedly.
    3. reduceByKey: the values of the elements that share the same key are added together, giving a data structure like: [('one', 3), ('two', 8), ('three', 1), ...], where each key 'one', 'two', 'three' now appears only once.

Finally, the wc.collect() function is called, which tells Spark to pull all the data out of wc; the parsed result is a list of tuples.
Compared with the version implemented by hand in Python, Spark's implementation is simple and elegant.

Two Types of Operators

Spark builds on the sc context, its basic data set is the RDD, and everything else is operations on RDDs.
Operations on an RDD come in two kinds, transform and action, also known as the RDD's two basic kinds of operators.


Transform means transformation: the RDD is converted in some way, yielding another RDD, just as a map over the data in a list produces a different list.
Of course, an important reason Spark could stand out against Hadoop's MapReduce model is its powerful operators.
Spark does not force you to confine yourself to the map and reduce model; instead it provides much richer transformation capabilities, which makes its code simple and elegant.
Some frequently used transforms are listed below (a small sketch follows the list).

    • map(): map, similar to Python's map function.
    • filter(): filter, similar to Python's filter function.
    • reduceByKey(): merge values by key.
    • groupByKey(): group values by key.
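A quick illustration of these four with made-up data (nothing actually runs yet, for the reason explained next):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)              # -> 1, 4, 9, 16, 25
    odds = nums.filter(lambda x: x % 2 == 1)         # -> 1, 3, 5

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    sums = pairs.reduceByKey(lambda x, y: x + y)     # -> ("a", 4), ("b", 2)
    groups = pairs.groupByKey()                      # -> ("a", [1, 3]), ("b", [2])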

An important feature of the RDD is lazy evaluation. After a transform is applied to an RDD, nothing executes immediately; only when an action is encountered does Spark trace back, layer by layer, and build a DAG of the execution. That DAG is also one reason Spark is fast. Some frequently used actions are listed below (with a short sketch after the list).

    • first(): returns the first element of the RDD.
    • take(n): returns the first n elements of the RDD.
    • collect(): returns all elements of the RDD.
    • sum(): sums the elements.
    • count(): counts the elements.
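A quick look at these actions on a throwaway RDD (results shown as comments):

    nums = sc.parallelize([5, 3, 1, 4, 2])
    nums.first()     # 5
    nums.take(3)     # [5, 3, 1]
    nums.collect()   # [5, 3, 1, 4, 2]
    nums.sum()       # 15
    nums.count()     # 5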

Returning to the earlier WordCount example, the program only runs all the preceding transforms on the RDD when it reaches wc.collect(), which needs all the data; building a DAG of the dependencies beforehand is also what guarantees efficient execution.

Map and Reduce

The initial data is a list; each element of the list is a tuple of three items, representing the id, name, and age fields respectively.

The RDD operates on this basic yet compound data structure, which also lets us print results with pprint so the data structures are easier to understand. The code looks like the following:
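A sketch of that setup (the id, name, and age values are made up):

    from pprint import pprint

    data = [(1, "Alice", 18), (2, "Bob", 22), (3, "Cathy", 35)]
    base = sc.parallelize(data, 2)    # an RDD split into 2 partitions
    pprint(base.collect())            # [(1, 'Alice', 18), (2, 'Bob', 22), (3, 'Cathy', 35)]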


The parallelize operator serializes a Python data structure into an RDD. It takes a list as its argument and also supports splitting the data into several partitions at serialization time. A partition is the smallest granularity at which Spark executes, and multiple partitions are distributed across the cluster and processed in parallel.


Printing the data type with Python's type function shows that base is an RDD. On top of this RDD, a map operator is applied to add 3 to the age while keeping the other values unchanged.

map is a higher-order function: it accepts a function as a parameter, applies that function to every element, and returns the new elements produced by the applied function. The anonymous function lambda used here takes a single parameter v, adds 3 to the age field v[2], and returns the other fields as they are. Judging from the result, it returns a PipelinedRDD, which inherits from RDD and can simply be understood as a new RDD structure.
To print the contents of the RDD, an action operator must be used to trigger a job; collect is used here to fetch all of its data.
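Continuing with the hypothetical base RDD above:

    print(type(base))                                    # <class 'pyspark.rdd.RDD'>
    aged = base.map(lambda v: (v[0], v[1], v[2] + 3))    # add 3 to the age field only
    print(type(aged))                                    # a PipelinedRDD, a subclass of RDD
    print(aged.collect())                                # [(1, 'Alice', 21), (2, 'Bob', 25), (3, 'Cathy', 38)]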

Next, map is used to extract the age field v[2] from the data, and then a reduce operator computes the sum of all the ages.

The argument to reduce is again a function, one that must accept two parameters; reduce iterates over the elements of the RDD with it to aggregate a result. The effect is the same as Python's reduce: only a single value is returned in the end. Here x + y computes the sum of the ages, so a single number is returned, as in the run below.
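And a sketch of that step, still with the made-up ages above:

    ages = base.map(lambda v: v[2])            # just the age field: 18, 22, 35
    total = ages.reduce(lambda x, y: x + y)    # pairwise addition over all elements
    print(total)                               # 75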


AMPLab's Ambition

Beyond Spark, its most famous project, AMPLab wants to build a complete memory-based data-analysis ecosystem; take a look at the introduction at https://amplab.cs.berkeley.edu/software/.


Their goal is BDAS (the Berkeley Data Analytics Stack), a full-stack, memory-based big-data analytics framework. The Mesos described earlier is its cluster resource manager, and there is also Tachyon, a memory-based distributed file system similar to Hadoop's HDFS, while Spark Streaming plays a role similar to Storm for real-time computation.
A powerful, full-stack Spark props up half the sky of big data.
