1.1 Spark Interactive Analysis
Start Hadoop's HDFS and YARN before running the Spark shell. The Spark shell is a powerful tool for interactive data analysis and is available in two languages: Scala and Python. The following shows how to use Python to analyze a data file.
Go to the Spark installation home directory and enter the following command; the Python shell (PySpark) will start.

./bin/pyspark
Spark's main abstraction is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Here, we upload the README.md file from the Spark home directory to the HDFS directory /user/app (app is my Linux user name). The specific commands are as follows:
hdfs dfs -mkdir -p /user/app
hdfs dfs -put README.md /user/app
Create an RDD from this file in Python as follows:
>>> textFile = sc.textFile("README.md")
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Below are a couple of RDD actions:
>>> textFile.count()  # Number of items in this RDD
126
>>> textFile.first()  # First item in this RDD
u'# Apache Spark'
Now let's use a transformation. We will use filter to return a new RDD containing a subset of the items in the file.
>>> linesWithSpark = textFile.filter(lambda line: "spark" in line)
We can chain transformations and actions together:
>>> textFile.filter(lambda line: "spark" in line).count()  # How many lines contain "spark"?
15
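To see what this filter-and-count pipeline computes, here is a plain-Python sketch of the same logic on a small in-memory list; the sample lines are made up for illustration and are not the real README.md contents:

```python
# Plain-Python sketch of filter(...).count(); the sample lines are
# hypothetical, not the real README.md.
lines = [
    "# Apache Spark",
    "Spark is a fast engine",
    "run spark jobs on a cluster",
    "unrelated line",
]

# Equivalent of textFile.filter(lambda line: "spark" in line).
# Note the match is case-sensitive: "Spark" does not match "spark".
lines_with_spark = [line for line in lines if "spark" in line]

# Equivalent of .count()
print(len(lines_with_spark))  # -> 1
```

Unlike this eager list comprehension, Spark evaluates the filter lazily and distributes it across the cluster; the work only happens when an action such as count() is called.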
RDD actions and transformations can be combined for more complex computations. Consider the example below.
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15
The map call converts each line to an integer value (its word count), creating a new RDD; reduce is then called on that RDD to find the largest word count. The arguments to map and reduce here are Python anonymous functions (lambdas), but you can also pass in any top-level Python function you want. For example, we can define a max function:
>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
15
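The same map/reduce pipeline can be sketched in plain Python with functools.reduce, which makes the two stages easy to see; the sample lines below are hypothetical stand-ins for README.md:

```python
from functools import reduce

# Hypothetical sample lines standing in for README.md.
lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system",
    "for big data",
]

def max_words(a, b):
    # Plays the same role as the max(a, b) function defined above.
    return a if a > b else b

# map stage: each line -> its word count.
word_counts = [len(line.split()) for line in lines]

# reduce stage: keep the largest word count.
largest = reduce(max_words, word_counts)
print(largest)  # -> 9
```

In Spark, reduce requires an associative and commutative function like this one, because partial results are combined across partitions in no fixed order.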