Spark Quick Start: Interactive Analysis


1.1 Spark Interactive Analysis

Start Hadoop's HDFS and YARN before running the Spark scripts. The Spark shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. It is available in two languages: Scala and Python. The following shows how to use Python to analyze data files.

Go to the Spark installation home directory and enter the following command to start the Python shell:


 
./bin/pyspark

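When the shell starts, a SparkContext is created for you and exposed as the variable sc; the examples below rely on it. You can confirm it at the prompt (the exact representation varies by Spark version):

>>> sc
<pyspark.context.SparkContext object at 0x7f...>
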

Spark's main abstraction is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Here, we upload the README.md file from the Spark home directory to the Hadoop file system directory /user/app (app is my Linux user name). The specific commands are as follows:

 
hdfs dfs -mkdir -p /user/app
hdfs dfs -put README.md /user/app


Create an RDD from the text file using Python, as follows:

 
>>> textFile = sc.textFile("README.md")


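Note that the relative path passed to sc.textFile is resolved against the default file system; with HDFS as the default and /user/app as the home directory, it reads the file we uploaded above. The path can also be written out in full, where the namenode address below is an assumption for illustration:

>>> textFile = sc.textFile("hdfs://localhost:9000/user/app/README.md")
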
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Below are a few actions on an RDD.

 
>>> textFile.count()   # Number of items in this RDD
126
>>> textFile.first()   # First item in this RDD
u'# Apache Spark'


Now let's use a transformation. We will use the filter transformation to return a new RDD containing a subset of the items in the file.

 
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

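The returned linesWithSpark is itself an RDD, so the same actions apply to it. For example, counting its items gives the same result as the chained form shown next:

>>> linesWithSpark.count()
15
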

We can also chain transformations and actions together:

 
>>> textFile.filter(lambda line: "Spark" in line).count()   # How many lines contain "Spark"?
15


RDD actions and transformations can be used for more complex computations. Let's look at the example below.

 
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15


The first statement maps each line to an integer value (its word count), creating a new RDD. reduce is then called on that RDD to find the largest line word count. The arguments to map and reduce are Python anonymous functions (lambdas), but you can also pass in any top-level Python function you want. The following defines a max function in Python and passes it in.

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
15

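As a further sketch of a more complex flow, the standard RDD operations flatMap, map, and reduceByKey can be combined into a word count over the same file. The actual (word, count) pairs depend on the file contents, so only a few are inspected here:

>>> wordCounts = textFile.flatMap(lambda line: line.split()) \
...                      .map(lambda word: (word, 1)) \
...                      .reduceByKey(lambda a, b: a + b)
>>> wordCounts.take(3)   # Inspect a few (word, count) pairs; values vary
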
