Introduction to Spark Basics, Cluster Setup and the Spark Shell
This session is driven mainly by the Spark slide deck, combined with hands-on practice to reinforce understanding of the concepts.
Spark Installation and Deployment
With the theory covered, we move on to the hands-on experiments:
Exercise 1: Complete a word count using the Spark Shell (local mode)
Run spark-shell to start the Spark Shell in local mode.
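If the shell in your environment defaults to a different master, local mode can also be requested explicitly at launch (a minimal sketch; local[*] simply uses all available local cores):

spark-shell --master local[*]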
Step 1: Load the data from a local file
scala> val rdd1 = sc.textFile("file:///tmp/wordcount.txt")
rdd1: org.apache.spark.rdd.RDD[String] = file:///tmp/wordcount.txt MapPartitionsRDD[3] at textFile at <console>:24
scala> rdd1.count
res1: Long = 3
Step 2: Use flatMap(_.split(" ")) to split each line into words
scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:26
scala> rdd2.count
res2: Long = 8
scala> rdd2.take
take   takeAsync   takeOrdered   takeSample
scala> rdd2.take(8)
res3: Array[String] = Array(hello, world, spark, world, hello, spark, hadoop, great)
Step 3: Use map to convert each word into a (key, value) pair
scala> val kvrdd1 = rdd2.map(x => (x, 1))
kvrdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:28
scala> kvrdd1.count
res4: Long = 8
scala> kvrdd1.take(8)
res5: Array[(String, Int)] = Array((hello,1), (world,1), (spark,1), (world,1), (hello,1), (spark,1), (hadoop,1), (great,1))
Step 4: Apply reduceByKey to the (key, value) pairs
scala> val resultRdd1 = kvrdd1.reduceByKey(_ + _)
resultRdd1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:30
scala> resultRdd1.count
res6: Long = 5
scala> resultRdd1.take(5)
res7: Array[(String, Int)] = Array((hello,2), (world,2), (spark,2), (hadoop,1), (great,1))
Step 5: Save the results to a local file
scala> resultRdd1.saveAsTextFile("file:///tmp/output1")
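As a recap, the five steps of Exercise 1 can also be written as one chained transformation. This is only a sketch assuming the same input file as above; collect() is used instead of saving so the result prints directly, and in the interactive shell the multi-line form is easiest to enter via :paste.

sc.textFile("file:///tmp/wordcount.txt")   // load the local file
  .flatMap(_.split(" "))                   // split each line into words
  .map(word => (word, 1))                  // pair every word with a count of 1
  .reduceByKey(_ + _)                      // sum the counts per word
  .collect()                               // the result is small, so it is safe to return to the driver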
Exercise 2: Complete a word count using the Spark Shell (YARN client mode)
Run spark-shell --master yarn-client to start the Spark Shell in YARN client mode.
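Note that recent Spark releases (2.x and later) deprecate the yarn-client master string; assuming the same cluster configuration, the equivalent launch command is:

spark-shell --master yarn --deploy-mode client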
Step 1: Load the data from HDFS
scala> val rdd1 = sc.textFile("hdfs:///input/wordcount.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs:///input/wordcount.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> rdd1.count
res0: Long = 260
scala> rdd1.take(100)
res1: Array[String] = Array(HDFS Users Guide, "", HDFS Users Guide, Purpose, Overview, Prerequisites, Web Interface, Shell Commands, DFSAdmin Command, Secondary NameNode, Checkpoint Node, Backup Node, Import Checkpoint, Balancer, Rack Awareness, Safemode, fsck, fetchdt, Recovery Mode, Upgrade and Rollback, DataNode Hot Swap Drive, File Permissions and Security, Scalability, Related Documentation, Purpose, "", This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system. While HDFS is designed to "just work" in many environments, a working knowledge of HDFS helps greatly with configuration improvements and diagnostics on a specific cluster., "", Overview, "", ...
Step 2: Use flatMap(_.split(" ")) to split each line into words
scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> rdd2.count
res2: Long = 3687
scala> rdd2.take(100)
res3: Array[String] = Array(HDFS, Users, Guide, "", HDFS, Users, Guide, Purpose, Overview, Prerequisites, Web, Interface, Shell, Commands, DFSAdmin, Command, Secondary, NameNode, Checkpoint, Node, Backup, Node, Import, Checkpoint, Balancer, Rack, Awareness, Safemode, fsck, fetchdt, Recovery, Mode, Upgrade, and, Rollback, DataNode, Hot, Swap, Drive, File, Permissions, and, Security, Scalability, Related, Documentation, Purpose, "", This, document, is, a, starting, point, for, users, working, with, Hadoop, Distributed, File, System, (HDFS), either, as, a, part, of, a, Hadoop, cluster, or, as, a, stand-alone, general, purpose, distributed, file, system., While, HDFS, is, designed, to, "just, work", in, many, environments,, a, working, knowledge, of, HDFS, helps, greatly, with, configuratio...
Step 3: Use map to convert each word into a (key, value) pair
scala> val kvrdd1 = rdd2.map(x => (x, 1))
kvrdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28
scala> kvrdd1.count
res4: Long = 3687
scala> kvrdd1.take(100)
res5: Array[(String, Int)] = Array((hdfs,1), (users,1), (guide,1), ("",1), (hdfs,1), (users,1), (guide,1), (purpose,1), (overview,1), (prerequisites,1), (web,1), (interface,1), (shell,1), (commands,1), (dfsadmin,1), (command,1), (secondary,1), (namenode,1), (checkpoint,1), (node,1), (backup,1), (node,1), (import,1), (checkpoint,1), (balancer,1), (rack,1), (awareness,1), (safemode,1), (fsck,1), (fetchdt,1), (recovery,1), (mode,1), (upgrade,1), (and,1), (rollback,1), (datanode,1), (hot,1), (swap,1), (drive,1), (file,1), (permissions,1), (and,1), (security,1), (scalability,1), (related,1), (documentation,1), (purpose,1), ("",1), (this,1), (document,1), (is,1), (a,1), (starting,1), (point,1), (for,1), (users,1), (working,1), (with,1), (hadoop,1), (distributed,1), (file,1), (system,1), (HDF...
Step 4: Apply reduceByKey to the (key, value) pairs
scala> var resultRdd1 = kvrdd1.reduce
reduce   reduceByKey   reduceByKeyLocally
scala> var resultRdd1 = kvrdd1.reduceByKey
reduceByKey   reduceByKeyLocally
scala> var resultRdd1 = kvrdd1.reduceByKey(_ + _)
resultRdd1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30
scala> resultRdd1.count
res6: Long = 1084
scala> resultRdd1.take(100)
res7: Array[(String, Int)] = Array((because,1), (-reconfig,2), (guide,4), (under-replicated,1), (blocks,5), (maintained,1), (responsibility,1), (filled,1), (order,5), ([key-value,1), (prematurely,1), (cluster:,1), (type,1), (behind,1), (however,,1), (competing,1), (been,2), (begins,1), (up-to-date,3), (permissions,3), (browse,1), (list:,1), (improved,1), (balancer,2), (fine.,1), (over,1), (dfs.hosts,,2), (any,7), (connect,1), (select,2), (version,7), (disks.,1), (file,33), (documentation,,1), (file.,7), (performs,2), (million,2), (ram,1), (are,27), ((data,1), (supported.,1), (consists,1), (existed,1), (brief,2), (overwrites,1), (safely,1), (guide:,1), (safemode,6), (only,1), (currently,1), (first-time,1), (dfs.namenode.name.dir,1), (thus,2), (salient,1), (query,1), (page).,1), (status,5...
Step 5: Save the results to an HDFS file
scala> resultRdd1.saveAsTextFile("hdfs:///output/wordcount1")
localhost:tmp jonsonli$ hadoop fs -ls /output/wordcount1
17/05/13 17:49:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 jonsonli supergroup       0 2017-05-13 17:47 /output/wordcount1/_SUCCESS
-rw-r--r--   1 jonsonli supergroup    6562 2017-05-13 17:47 /output/wordcount1/part-00000
-rw-r--r--   1 jonsonli supergroup    6946 2017-05-13 17:47 /output/wordcount1/part-00001
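As an optional check, the saved output can be read back and the most frequent words inspected directly in the shell. This is only a sketch; it refers to the same resultRdd1 and output path used above.

scala> sc.textFile("hdfs:///output/wordcount1").count          // number of saved lines, matching the 1084 keys above
scala> resultRdd1.sortBy(_._2, ascending = false).take(10)     // the ten most frequent words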
"Original Hadoop&spark Hands-on 5" Spark Basics Starter, cluster build and Spark Shell