With the Hadoop, ZooKeeper, HBase and Spark cluster environment already set up, the tools are in place; as the saying goes, to do a good job one must first sharpen one's tools. Now that the tools are ready, it is time to put them to work, starting with spark-shell to lift the veil on Spark.
spark-shell is Spark's command-line interface, where we can type commands interactively, much like the Windows CMD prompt. Go into the Spark installation directory and run the following command to start spark-shell:
bin/spark-shell --master spark://hxf:7077 --executor-memory 1024m --driver-memory 1024m --total-executor-cores 4
--executor-memory is the memory given to each executor on the worker (slave) nodes, --driver-memory is the memory for the driver on the master side, and --total-executor-cores is the total number of CPU cores the application may use across all executors.
The terminal output shows that spark-shell has already initialized two variables for us: sc and spark. sc is the SparkContext and spark is the SparkSession. Even if you have never used them, names like "context" and "session" should look familiar; do not dismiss them as unimportant, because everything Spark does goes through these two variables. For now, just get acquainted with them; we will come back to them later.
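You can confirm this yourself by typing the two names at the prompt; the shell prints their types. The output below is illustrative (the part after @ will differ on your machine), and the configuration lookups simply read back the flags we passed when launching the shell:
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...
scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...
scala> sc.getConf.get("spark.executor.memory")   // set by --executor-memory
res2: String = 1024m
scala> sc.getConf.get("spark.cores.max")         // set by --total-executor-cores
res3: String = 4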
The Spark management web page will also show the running spark-shell application.
OK, now let's type in our first example: counting how many times each word appears in the README.md file under the Spark directory.
First, here is the complete code, so that you can get an overall picture:
val textFile = sc.textFile("file:/data/install/spark-2.0.0-bin-hadoop2.7/README.md")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
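Once this runs, wordCounts holds one (word, count) pair per distinct word. As a small optional extra, you can sort the result to see the most frequent words first; sortBy and take are standard RDD methods, and the cutoff of 10 is arbitrary:
// Sort by the count (second element of each pair) in descending order and keep the top 10.
wordCounts.sortBy(pair => pair._2, ascending = false).take(10)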
The code is short, but it may not be easy to understand on a first reading, so here is an explanation.
1. How Spark reads raw data
First, read README.md:
val textFile = sc.textFile("README.md")
This line reads the raw data into Spark's own data structure, the RDD. In general, Spark reads raw data in one of two ways:
1. For testing: call SparkContext's parallelize method
val rdd = sc.parallelize(1 to 10)
This creates an RDD containing the numbers 1 to 10. It is handy for testing a program, but is not something you would use in real development; a quick way to play with it is shown in the sketch after this list.
2. For real work: any data source that Hadoop can read, Spark can read as well. The method we use most often is SparkContext's textFile, for example to read a file on HDFS:
val rdd = sc.textFile("hdfs://hxf:9000/test/test.log")
2. Spark's basic data type: the RDD
The result returned by textFile is an RDD, Spark's basic data type.
RDD stands for Resilient Distributed Dataset, that is, an elastic distributed data set. The name alone is not very intuitive, but we can take it literally: an RDD is distributed, and it is a collection of data. Suppose a distributed file system holds several files, each with many lines; the RDD is then the collection of all the lines of all those files, not any single line. So every operation we perform on an RDD applies to the whole collection. Moreover, Spark keeps the RDD in memory instead of spilling to disk the way MapReduce does, which is why Spark can run much faster than MapReduce.
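One place where this in-memory idea shows up explicitly in the API is cache(): if an RDD will be reused by several actions, you can ask Spark to keep it in memory after it is first computed. A minimal sketch, not part of the word-count example:
// Mark the RDD to be kept in memory once it has been computed.
val cached = textFile.cache()
cached.count()   // first action: reads the file, computes the RDD, and fills the cache
cached.count()   // later actions reuse the in-memory copy instead of re-reading the file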
Now let's continue with the rest of the code:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
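The whole pipeline is chained into a single expression above, which can be hard to read at first, so here is the same computation split into intermediate steps. The variable names and the sample words in the comments are just for illustration:
// flatMap: split each line into words and flatten everything into one RDD of words.
val words = textFile.flatMap(line => line.split(" "))   // e.g. "Apache", "Spark", "is", ...
// map: pair every word with the number 1.
val pairs = words.map(word => (word, 1))                // e.g. ("Spark", 1), ("Spark", 1), ...
// reduceByKey: for each distinct word, add up all of its 1s into a single count.
val counts = pairs.reduceByKey((a, b) => a + b)         // one (word, total) pair per distinct word
// The three calls above are lazy transformations: nothing has actually run yet.
// collect() is an action: it triggers the job and returns the result to the driver.
counts.collect()                                        // Array[(String, Int)]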
The final result shows how many times each word appears. In this code, flatMap, map and reduceByKey are RDD transformation operations, while collect is an RDD action operation. If the difference is not clear yet, it does not matter; a later article will explain it in detail. That is all for this section; stay tuned for the next installment.