"Original Hadoop&spark Hands-on 5" Spark Basics Starter, cluster build and Spark Shell

Source: Internet
Author: User
Tags: hadoop fs

Introduction to Spark basics, cluster setup, and the Spark Shell

Spark concepts are covered mainly with slides (PPT); the hands-on exercises here are meant to reinforce those concepts through practice.

Spark Installation and Deployment

With the theory covered, let's move on to the hands-on exercises:

Exercise 1: Complete a word count using the Spark Shell (local mode)

Run spark-shell to start the Spark Shell in local mode. The shell creates a SparkContext automatically and exposes it as sc.

First step: Load the data from a local file

scala> val rdd1 = sc.textFile("file:///tmp/wordcount.txt")
rdd1: org.apache.spark.rdd.RDD[String] = file:///tmp/wordcount.txt MapPartitionsRDD[3] at textFile at <console>:24

scala> rdd1.count
res1: Long = 3

Second step: Split each line into words with flatMap(_.split(" "))

scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:26

scala> rdd2.count
res2: Long = 8

scala> rdd2.take
take   takeAsync   takeOrdered   takeSample

scala> rdd2.take(8)
res3: Array[String] = Array(hello, world, spark, world, hello, spark, hadoop, great)

Third step: Use map to convert each word into a (key, value) pair

scala> val kvRdd1 = rdd2.map(x => (x, 1))
kvRdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:28

scala> kvRdd1.count
res4: Long = 8

scala> kvRdd1.take(8)
res5: Array[(String, Int)] = Array((hello,1), (world,1), (spark,1), (world,1), (hello,1), (spark,1), (hadoop,1), (great,1))

Fourth step: Aggregate the (key, value) pairs with reduceByKey

scala> val resultRdd1 = kvRdd1.reduceByKey(_ + _)
resultRdd1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:30

scala> resultRdd1.count
res6: Long = 5

scala> resultRdd1.take(5)
res7: Array[(String, Int)] = Array((hello,2), (world,2), (spark,2), (hadoop,1), (great,1))

Fifth step: Save the results to a local file

Scala> resultrdd1.saveastextfile ("FILE:///TMP/OUTPUT1")
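
The five steps above can also be chained into a single expression. The following is a minimal sketch of the same word count written as one pipeline (not part of the original session, suitable for the shell's :paste mode or a script); the output path file:///tmp/output2 is only an illustrative choice so it does not collide with the output1 directory already written above:

// The same word count as a single chained expression.
sc.textFile("file:///tmp/wordcount.txt")   // step 1: load the lines
  .flatMap(_.split(" "))                   // step 2: split each line into words
  .map(word => (word, 1))                  // step 3: pair each word with a count of 1
  .reduceByKey(_ + _)                      // step 4: sum the counts per word
  .saveAsTextFile("file:///tmp/output2")   // step 5: write the result (illustrative path)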

Exercise 2: Complete a word count using the Spark Shell (YARN client mode)

Run spark-shell --master yarn-client to start the Spark Shell in YARN client mode. (In Spark 2.x this form is deprecated; the equivalent is spark-shell --master yarn --deploy-mode client.)

First step: Load the data from HDFS

scala> val rdd1 = Sc.textfile ("Hdfs:///input/wordcount.txt")
Rdd1:org.apache.spark.rdd.rdd[string] = Hdfs:///input/wordcount.txt mappartitionsrdd[1] at TextFile at <console >:24

Scala> Rdd1.count
Res0:long = 260

Scala> Rdd1.take (100)
Res1:array[string] = Array (HDFs users Guide, "", HDFs Users Guide, Purpose, overview, prerequisites, Web Interface, Shell Commands, Dfsadmin Command, secondary NameNode, Checkpoint node, Backup node, Import Checkpoint, Balancer, Rack Awareness , SafeMode, fsck, FETCHDT, Recovery Mode, Upgrade and Rollback, DataNode hot Swap Drive, File Permissions and Security, Sc Alability, related documentation, Purpose, "", this document was a starting point for users working with Hadoop distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file System. While HDFs was designed to "just work" in many environments, a working knowledge of HDFS helps greatly with configuration I Mprovements and diagnostics on a specific cluster., "", Overview, "",...

Second step: Split each line into words with flatMap(_.split(" "))

scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> rdd2.count
res2: Long = 3687

scala> rdd2.take(100)
res3: Array[String] = Array(HDFS, Users, Guide, "", HDFS, Users, Guide, Purpose, Overview, Prerequisites, Web, Interface, Shell, Commands, DFSAdmin, Command, Secondary, NameNode, Checkpoint, node, Backup, node, Import, Checkpoint, Balancer, Rack, Awareness, Safemode, fsck, fetchdt, Recovery, Mode, Upgrade, and, Rollback, DataNode, Hot, Swap, Drive, File, Permissions, and, Security, Scalability, Related, Documentation, Purpose, "", This, document, is, a, starting, point, for, users, working, with, Hadoop, Distributed, File, System, (HDFS), either, as, a, part, of, a, Hadoop, cluster, or, as, a, stand-alone, general, purpose, distributed, file, system., While, HDFS, is, designed, to, "just, work", in, many, environments,, a, working, knowledge, of, HDFS, helps, greatly, with, configuratio ...

Third step: Use map to convert each word into a (key, value) pair

scala> val kvRdd1 = rdd2.map(x => (x, 1))
kvRdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> kvRdd1.count
res4: Long = 3687

scala> kvRdd1.take(100)
res5: Array[(String, Int)] = Array((HDFS,1), (Users,1), (Guide,1), ("",1), (HDFS,1), (Users,1), (Guide,1), (Purpose,1), (Overview,1), (Prerequisites,1), (Web,1), (Interface,1), (Shell,1), (Commands,1), (DFSAdmin,1), (Command,1), (Secondary,1), (NameNode,1), (Checkpoint,1), (node,1), (Backup,1), (node,1), (Import,1), (Checkpoint,1), (Balancer,1), (Rack,1), (Awareness,1), (Safemode,1), (fsck,1), (fetchdt,1), (Recovery,1), (Mode,1), (Upgrade,1), (and,1), (Rollback,1), (DataNode,1), (Hot,1), (Swap,1), (Drive,1), (File,1), (Permissions,1), (and,1), (Security,1), (Scalability,1), (Related,1), (Documentation,1), (Purpose,1), ("",1), (This,1), (document,1), (is,1), (a,1), (starting,1), (point,1), (for,1), (users,1), (working,1), (with,1), (Hadoop,1), (Distributed,1), (File,1), (System,1), (HDF ...

Fourth step: Aggregate the (key, value) pairs with reduceByKey

scala> var resultRdd1 = kvRdd1.reduce
reduce   reduceByKey   reduceByKeyLocally

scala> var resultRdd1 = kvRdd1.reduceByKey
reduceByKey   reduceByKeyLocally

scala> var resultRdd1 = kvRdd1.reduceByKey(_ + _)
resultRdd1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30
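
Before continuing with resultRdd1, note that the tab completion above also lists reduceByKeyLocally. As a brief sketch (not from the original session): unlike reduceByKey, which returns another RDD, reduceByKeyLocally merges the values and returns an ordinary Scala Map on the driver:

// reduceByKeyLocally brings the aggregated counts back to the driver as a Map[String, Int].
val localCounts: scala.collection.Map[String, Int] = kvRdd1.reduceByKeyLocally(_ + _)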

scala> resultRdd1.count
res6: Long = 1084

scala> resultRdd1.take(100)
res7: Array[(String, Int)] = Array((because,1), (-reconfig,2), (guide,4), (under-replicated,1), (blocks,5), (maintained,1), (responsibility,1), (filled,1), (order,5), ([key-value,1), (prematurely,1), (cluster:,1), (type,1), (behind,1), (however,,1), (competing,1), (been,2), (begins,1), (up-to-date,3), (permissions,3), (browse,1), (list:,1), (improved,1), (balancer,2), (fine.,1), (over,1), (dfs.hosts,,2), (any,7), (connect,1), (select,2), (version,7), (disks.,1), (file,33), (documentation,,1), (file.,7), (performs,2), (million,2), (ram,1), (are,27), ((data,1), (supported.,1), (consists,1), (existed,1), (brief,2), (overwrites,1), (safely,1), (guide:,1), (safemode,6), (only,1), (currently,1), (first-time,1), (dfs.namenode.name.dir,1), (thus,2), (salient,1), (query,1), (page).,1), (status,5 ...
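
Because take(100) returns pairs in whatever order the shuffle produced, the most frequent words are not obvious from the listing above. As a small sketch (not from the original session), the result RDD can be sorted by count before taking the first few elements:

// Sort by the count (the second tuple element) in descending order and show the 10 most frequent words.
resultRdd1.sortBy(_._2, ascending = false).take(10)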

Fifth step: Save the results to HDFS

Scala> resultrdd1.saveastextfile ("Hdfs:///output/wordcount1")

localhost:tmp jonsonli$ hadoop fs -ls /output/wordcount1
17/05/13 17:49:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 jonsonli supergroup       0 2017-05-13 17:47 /output/wordcount1/_SUCCESS
-rw-r--r--   1 jonsonli supergroup    6562 2017-05-13 17:47 /output/wordcount1/part-00000
-rw-r--r--   1 jonsonli supergroup    6946 2017-05-13 17:47 /output/wordcount1/part-00001
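
The saved result can also be spot-checked from the Spark Shell itself. A minimal sketch (not part of the original session), assuming the same output path:

// Read the saved part files back; each line has the form (word,count), e.g. (file,33).
sc.textFile("hdfs:///output/wordcount1").take(10).foreach(println)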

"Original Hadoop&spark Hands-on 5" Spark Basics Starter, cluster build and Spark Shell

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.