Spark Learning Notes from Scratch (1): Python Version

Since I am only beginning to learn Scala and am more familiar with Python, it seems worthwhile to document my learning process here. These notes are based mainly on Spark's official quick-start guide, available at:

http://spark.apache.org/docs/latest/quick-start.html

This article mainly translates the contents of that document, but also adds some problems I ran into in practice along with their solutions, plus some supplementary knowledge, so we can learn together.

Environment: Ubuntu 16.04 LTS, Spark 2.0.1, Hadoop 2.7.3, Python 3.5.2

Interactive analysis with the Spark shell

1. Basics

First, start Spark's interactive Python shell (PySpark):

$ cd /usr/local/spark
$ ./bin/pyspark

One of the most important concepts in Spark is the RDD (Resilient Distributed Dataset). RDDs can be created from Hadoop InputFormats or by transforming other RDDs.

Here, as a getting-started example, we use the README.md file that comes with the Spark installation (located at /usr/local/spark/README.md) to learn how to create a new RDD.

To create a new RDD:

>>> textFile = sc.textFile("README.md")

RDDs support two types of operations: actions and transformations.

Actions: Return a value after running a calculation on a dataset

Transformations: create a new dataset from an existing one

An RDD can go through a sequence of actions, which return values, and transformations, which return pointers to new RDDs. Let's look at a few simple RDD actions:

>>> textFile.count()  # counts the number of items in the RDD; here, the total number of lines in README.md
99
>>> textFile.first()  # the first item in the RDD; here, the first line of README.md
u'# Apache Spark'

Note: if you started PySpark from /usr/local/spark as above and then read the README.md file with a relative path, executing the count() statement produces the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/spark/README.md

This is because, when a relative path is used, the system by default looks for README.md under hdfs://localhost:9000/user/spark/, and the file is not in that directory, so sc.textFile() must be given the full file URI. The code is modified to:

>>> textFile = sc.textFile("file:///usr/local/spark/README.md")
>>> textFile.count()
99

Now let's try a transformation. For example, use the filter transformation to return a new RDD containing only the lines that contain the string "Spark".

>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

We can also chain transformations and actions together:

>>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain the string "Spark"?
19

2. More RDD Operations

Many complex computations can be accomplished with RDD actions and transformations. For example, suppose we want to find the line that contains the most words:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
22

In this statement, map executes len(line.split()) on every line, returning the number of words that line contains; that is, each line is mapped to an integer value, producing a new RDD. reduce is then called on that RDD to find the maximum value. The arguments passed to map and reduce here are Python anonymous functions (lambdas), but we can also pass ordinary top-level Python functions. For example, we can first define a function that returns the larger of two values, which makes the code easier to understand:

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
22

Hadoop popularized the MapReduce data flow pattern. In Spark, a MapReduce flow is easy to implement:

>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

The statement above uses three transformations, flatMap, map, and reduceByKey, to compute the number of occurrences of each word in README.md and return a new RDD whose items have the form (string, int), i.e., each word and its number of occurrences. Specifically:

flatMap(func): similar to map, but each input item can be mapped to 0 or more output items, which means that the return value of func should be a sequence rather than a single item. In the statement above, the anonymous function returns every word contained in a line.

reduceByKey(func): operates on a dataset of key-value pairs (K, V) and returns a new dataset of (K, V) pairs in which the values for each key are aggregated using func; this is similar in spirit to a dictionary in Python. In the statement above, each word is a key, and the anonymous function passed to reduceByKey adds the counts together, so every time a word appears its count is incremented by 1 (see the pure-Python sketch below).
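To make the dictionary analogy concrete, here is a minimal pure-Python sketch (plain Python, not Spark) of the same word count using an ordinary dict; the file path is simply the README.md location assumed earlier in this note:

# Pure-Python equivalent of the flatMap / map / reduceByKey word count above
# (illustration only; the path is the assumed location of README.md)
word_counts = {}
with open("/usr/local/spark/README.md") as f:
    for line in f:                        # like flatMap: every line yields its words
        for word in line.split():
            # like map + reduceByKey: each word is a key, add 1 per occurrence
            word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts.get("Spark", 0))        # number of times the word "Spark" appears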

To collect the results computed by the statement above, we can use the collect action:

>>> wordCounts.collect()
[(u'when', 1), (u'R,', 1), (u'including', 3), (u'computation', 1), ...]

3. Caching

Spark also supports pulling datasets into a cluster-wide in-memory cache. This is very useful for data that is accessed repeatedly, for example when we need to query a small dataset many times, or when we need to run an iterative algorithm such as PageRank. Below, using the linesWithSpark dataset obtained earlier, we demonstrate caching:

>>> linesWithSpark.cache()
PythonRDD[...] at RDD at PythonRDD.scala:48
>>> linesWithSpark.count()
19
>>> linesWithSpark.count()
19

Using Spark to cache a 100-line file may not seem to make much sense. The interesting point, though, is that the same operations can be used on very large datasets, even ones spread across thousands of nodes.

4. Self-Contained Applications

Suppose we want to use the Spark API to write a self-contained application. We can do this in Scala, Java, or Python.

Below, we briefly describe how to use the Python API (PySpark) to write an application named SimpleApp.py.
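The application code itself is not listed here in the original; the following is a minimal sketch of what SimpleApp.py might look like, consistent with the output shown further down (it counts the lines of README.md containing the letter 'a' and the lines containing the letter 'b'). The README.md path is an assumption based on the environment above:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "file:///usr/local/spark/README.md"   # assumed path; adjust to your installation
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()   # lines containing 'a'
numBs = logData.filter(lambda s: 'b' in s).count()   # lines containing 'b'

print("Lines with a: %i, Lines with b: %i" % (numAs, numBs))

sc.stop()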

In the directory where Spark is located, enter:

./bin/spark-submit --master local[4] SimpleApp.py

The output is:

Lines with a: 61, Lines with b: 27

In addition, Spark comes with many examples, which can be run from the Spark directory by entering the following commands:

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R
