Since I am only just starting to learn Scala and am more familiar with Python, I want to document my learning process here. The material comes mainly from the official Spark quick start guide, at the following address:
http://spark.apache.org/docs/latest/quick-start.html
This article mainly translates the contents of that document, but I also add some problems I ran into in practice along with their solutions, plus some supplementary knowledge, so we can learn together.
Environment: Ubuntu 16.04 LTS, Spark 2.0.1, Hadoop 2.7.3, Python 3.5.2
Interactive analysis with the Spark shell
1. Basics
First, open PySpark, the API through which Spark interacts with Python:
$ cd /usr/local/spark
$ ./bin/pyspark
One of the most important concepts in Spark is the RDD (Resilient Distributed Dataset). An RDD can be created from Hadoop InputFormats or by transforming other RDDs.
Here, as a getting-started example, we use the README.md file that ships with the Spark installation (located at /usr/local/spark/README.md) to learn how to create a new RDD.
To create a new RDD:
>>> textfile = Sc.textfile ("readme.md")
An RDD supports two types of operations, actions and transformations:
Actions: run a computation on the dataset and return a value
Transformations: create a new dataset from an existing one
In other words, actions return values, while transformations return pointers to new RDDs. Let us start with a few simple actions:
>>> textFile.count()  # returns the number of items in the RDD, here the number of lines in README.md
99
>>> textFile.first()  # the first item in the RDD, here the first line of README.md
u'# Apache Spark'
Note: if you started PySpark from /usr/local/spark as above and then read README.md with the relative path, executing the count statement produces the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/spark/README.md
This is because, with a relative path, the system by default looks for README.md under hdfs://localhost:9000/user/spark/, where the file does not exist, so sc.textFile() must be given an absolute path. The code is modified to:
>>> textfile = Sc.textfile ("file:///usr/local/spark/readme.md")99
Now let us try a transformation. For example, the filter transformation returns a new RDD containing only the items that include the string "Spark":
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
We can also chain transformations and actions together:
>>> textFile.filter(lambda line: "Spark" in line).count()  # how many lines contain "Spark"?
19
2. More RDD operations
Many complex computations can be accomplished with RDD actions and transformations. For example, suppose we want to find the line with the most words:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
22
In this statement, map applies len(line.split()) to every line, i.e. it maps each line to the number of words it contains (an integer) and creates a new RDD; reduce is then called on that RDD to find the largest value. The arguments passed to map and reduce here are Python anonymous functions (lambdas), but we can also pass any top-level Python function. For example, we can first define a function that returns the larger of two numbers, which makes the code easier to understand:
>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
22
Hadoop made the MapReduce data flow popular. In Spark, a MapReduce flow is easy to implement:
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
The statement above combines three transformations, flatMap, map and reduceByKey, to count how many times each word appears in README.md. It returns a new RDD whose items have the form (string, int), i.e. a word and its number of occurrences. Specifically (see the small sketch after this list):
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items, which means func should return a sequence rather than a single item. In the statement above, the anonymous function returns every word contained in a line.
reduceByKey(func): operates on a dataset of key-value pairs (K, V) and returns a new dataset of (K, V) pairs in which the value for each key is the result of aggregating with func, similar in spirit to a Python dictionary. In the statement above, each word is a key, and the anonymous function in reduceByKey adds 1 to a word's count every time the word appears again.
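To make the difference between map and flatMap concrete, here is a minimal sketch of my own on a tiny hand-made RDD (sc.parallelize builds an RDD from a local Python list; the two example lines are invented for illustration):
>>> lines = sc.parallelize(["a b", "c"])
>>> lines.map(lambda line: line.split()).collect()      # one output item per input item
[['a', 'b'], ['c']]
>>> lines.flatMap(lambda line: line.split()).collect()  # the results are flattened into single words
['a', 'b', 'c']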
To collect the computed results of the statement above, use the collect action:
>>> wordCounts.collect()
[(u'when', 1), (u'R,', 1), (u'including', 3), (u'computation', 1), ...]
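Since collect() brings the entire result back to the driver, on a large dataset it is often better to fetch only a few items. A small sketch of my own (not part of the quick start), using the take and takeOrdered actions:
>>> wordCounts.take(3)                                    # any 3 (word, count) pairs
>>> wordCounts.takeOrdered(3, key=lambda pair: -pair[1])  # the 3 most frequent words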
3. Caching
Spark also supports pulling datasets into a cluster-wide in-memory cache. This is useful for data that is accessed repeatedly, for example when we repeatedly query a small "hot" dataset or run an iterative algorithm such as PageRank. Below, using the linesWithSpark dataset obtained earlier, we demonstrate caching:
>>> linesWithSpark.cache()
PythonRDD[...] at RDD at PythonRDD.scala:48
>>> linesWithSpark.count()
19
>>> linesWithSpark.count()
19
Using Spark to cache a 100-line file may not seem very meaningful. The interesting part is that the same functions can be used on very large datasets, even when the data is spread across hundreds or thousands of nodes.
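As a side note of my own (not covered in the quick start), a cached RDD can be inspected and released again when it is no longer needed:
>>> linesWithSpark.is_cached    # True after cache() has been called
True
>>> linesWithSpark.unpersist()  # drop the data from the cache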
4. Self-contained applications
Suppose we want to use the Spark API to write a self-contained application. We can do this with Scala, Java or Python.
Below we briefly show how to use the Python API (PySpark) to write an application named SimpleApp.py.
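The translation above skips the source of SimpleApp.py, so here is a sketch consistent with the official quick start and with the output shown below: the application counts the lines in README.md that contain the letter 'a' and the letter 'b'. The file path follows the absolute-path convention used earlier and may need adjusting to your own installation.
"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "file:///usr/local/spark/README.md"  # README.md on the local file system
sc = SparkContext("local", "Simple App")       # create our own SparkContext (there is no shell here)
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()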
Then, in the directory where Spark is installed, run the application with spark-submit:
./bin/spark-submit --master local[4] SimpleApp.py
The output is:
Lines with a: 61, lines with b: 27
In addition, Spark comes with many example programs, which can be run from the Spark directory with the following commands:
# For Scala and Java, use run-example:
./bin/run-example SparkPi
# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R