Learning Spark: Using Spark-shell to Run Word Count

With the Hadoop, ZooKeeper, HBase, and Spark cluster environment already set up, and as the old saying goes, a workman who wants to do his job well must first sharpen his tools, the tools are now ready; the next step is to put them to work. Let's start with Spark-shell and lift the veil on this powerful tool.

Spark-shell is Spark's command-line interface; you can type commands into it directly, much like the Windows CMD prompt. Go to the Spark installation directory and run the following command to start Spark-shell:

bin/spark-shell --master spark://hxf:7077 --executor-memory 1024m --driver-memory 1024m --total-executor-cores 4

--executor-memory is the memory given to each executor on the worker (slave) nodes, --driver-memory is the memory for the driver program, which runs on the machine where the shell is launched (here, the master), and --total-executor-cores is the total number of CPU cores used across all executors.
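
For comparison, here is a minimal sketch, assuming Spark 2.0 or later, of how a standalone application could supply the same settings when building its SparkSession. Driver memory is the exception: it must be fixed before the driver JVM starts, so it stays on the command line. In standalone mode, spark.cores.max plays the role of --total-executor-cores.

import org.apache.spark.sql.SparkSession

// Sketch: programmatic equivalent of the shell flags above.
val spark = SparkSession.builder()
  .master("spark://hxf:7077")                  // same standalone master
  .config("spark.executor.memory", "1024m")    // --executor-memory
  .config("spark.cores.max", "4")              // --total-executor-cores
  .getOrCreate()
val sc = spark.sparkContext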

As the terminal output shows, Spark-shell has already initialized two variables for us: sc and spark. sc is the SparkContext and spark is the SparkSession. Even if you have never used them yourself, you can guess from names like "context" and "session" that they matter: every Spark job runs through these two variables. For now it is enough to know they exist; later articles will cover them in detail.
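
A couple of quick checks you can type at the scala> prompt to confirm both variables are live (the exact version string depends on your install):

sc.version                 // the Spark version of this shell
spark.version              // same version, reported via the SparkSession
spark.sparkContext eq sc   // the session wraps the same SparkContext: true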

The Spark management page will likewise show the newly started application.

OK, now let's type in our first example: counting the number of occurrences of each word in the README.md file under the Spark directory.

First, here is the complete code, so you can get the whole picture:

val textFile = sc.textFile("file:/data/install/spark-2.0.0-bin-hadoop2.7/README.md")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

The code is short, but it may be hard to follow on first reading, so let's go through it step by step.

1. How Spark reads raw data

First, read README.md:

val textFile = sc.textFile("README.md")

This line reads the raw data into Spark's own data structure, the RDD. In general there are two ways to read raw data:

1. Test usage: call SparkContext's parallelize method

val rdd = sc.parallelize(1 to 10)

This creates an RDD holding the numbers 1 to 10. It is handy for testing a program and is rarely used in real development.
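
To make the test usage concrete, here is a small sketch of what you might do with such an RDD at the shell prompt:

val rdd = sc.parallelize(1 to 10)
rdd.map(_ * 2).collect()   // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
rdd.reduce(_ + _)          // 55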

2. Production usage: any data source Hadoop can read, Spark can read as well. The method we use most often is SparkContext's textFile, for example to read a file on HDFS:

val rdd = sc.textFile("hdfs://hxf:9000/test/test.log")
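
textFile is quite flexible: besides a single HDFS file, it also accepts local paths, directories, and wildcards. A short sketch, with hypothetical paths:

val localFile = sc.textFile("file:/tmp/test.log")          // a local file (hypothetical path)
val allLogs   = sc.textFile("hdfs://hxf:9000/test/*.log")  // every .log file in the directory
allLogs.count()                                            // number of lines across all files
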
2. Spark's underlying data type: the RDD

The result returned by textFile is an RDD, which is Spark's fundamental data type.

RDD stands for Resilient Distributed Dataset. The name is not very intuitive, but we can read it literally: an RDD is distributed, and it is a collection of data. Suppose a distributed system stores many files, each with many lines; then one RDD represents the collection of all the lines of all those files, not any single line. Everything we do with an RDD therefore applies to the whole collection. Also, Spark keeps the RDD in memory rather than spilling intermediate results to disk the way MapReduce does, which is why Spark can run faster than MapReduce.
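
Because Spark works in memory, you can also tell it explicitly to keep an RDD cached for reuse. A minimal sketch against the same README.md:

val lines = sc.textFile("file:/data/install/spark-2.0.0-bin-hadoop2.7/README.md")
lines.cache()    // mark the RDD to be kept in memory once computed
lines.count()    // first action: reads the file and fills the cache
lines.count()    // second action: answered from memory, no re-read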

Now let's continue with the code:

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

The final result shows the number of occurrences of each word. In this code, flatMap, map, and reduceByKey are RDD transformations, while collect is an RDD action. It does not matter if that distinction is not clear yet; a later article will explain it in detail, and a quick preview follows below. That is all for this section, stay tuned for the next installment.
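
As that preview, here is a sketch of the same pipeline split into named steps, using only the operations already shown above:

// Transformations only describe the computation; nothing runs yet (they are lazy):
val words  = textFile.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
// Actions trigger the actual computation:
counts.take(10)                                  // first 10 (word, count) pairs
counts.sortBy(_._2, ascending = false).take(10)  // the 10 most frequent words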
