Spark Quick Start: Interactive Analysis


1.1 Spark Interactive Analysis

Start Hadoop's HDFS and YARN before running the Spark scripts. The Spark shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. It is available in two languages: Scala and Python. The following shows how to use Python to analyze data files.

Go to the Spark installation home directory and enter the following command to start the Python shell:


 
./bin/pyspark

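When the shell starts, a SparkContext is created for you and exposed as the variable sc; the examples below rely on it. You can confirm it at the prompt (the exact representation varies by Spark version):

>>> sc
<pyspark.context.SparkContext object at 0x7f...>
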

Spark's main abstraction is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Here, we upload the README.md file from the Spark home directory to the Hadoop file system directory /user/app (app is my Linux user name). The specific commands are as follows:

 
hdfs dfs -mkdir -p /user/app
hdfs dfs -put README.md /user/app


Create an RDD from the text file using Python, as follows:

 
>>> textFile = sc.textFile("README.md")


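Note that the relative path passed to sc.textFile is resolved against the default file system; with HDFS as the default and /user/app as the home directory, it reads the file we uploaded above. The path can also be written out in full, where the namenode address below is an assumption for illustration:

>>> textFile = sc.textFile("hdfs://localhost:9000/user/app/README.md")
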
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Below are a few actions on an RDD.

 
>>> textFile.count()   # Number of items in this RDD
126
>>> textFile.first()   # First item in this RDD
u'# Apache Spark'


Now let's use a transformation. We will use the filter transformation to return a new RDD containing a subset of the items in the file.

 
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

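The returned linesWithSpark is itself an RDD, so the same actions apply to it. For example, counting its items gives the same result as the chained form shown next:

>>> linesWithSpark.count()
15
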

We can also chain transformations and actions together:

 
>>> textFile.filter(lambda line: "Spark" in line).count()   # How many lines contain "Spark"?
15


RDD actions and transformations can be used for more complex computations. Let's look at the example below.

 
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15


The first statement maps each line to an integer value (its word count), creating a new RDD. reduce is then called on that RDD to find the largest line word count. The arguments to map and reduce are Python anonymous functions (lambdas), but you can also pass in any top-level Python function you want. The following defines a max function in Python and passes it in.

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
15

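As a further sketch of a more complex flow, the standard RDD operations flatMap, map, and reduceByKey can be combined into a word count over the same file. The actual (word, count) pairs depend on the file contents, so only a few are inspected here:

>>> wordCounts = textFile.flatMap(lambda line: line.split()) \
...                      .map(lambda word: (word, 1)) \
...                      .reduceByKey(lambda a, b: a + b)
>>> wordCounts.take(3)   # Inspect a few (word, count) pairs; values vary
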
