The Scala code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

/** Counts word occurrences */
object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
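Before packaging the jar, the same pipeline can be sanity-checked without a cluster by running Spark in local mode. The sketch below is illustrative only: the object name WordCountLocal and the sample data are made up for the example, and it assumes the Spark jars are already on the project's build path.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local-mode test: "local[2]" runs the driver and two worker
// threads in a single JVM, so no cluster or HDFS is needed.
object WordCountLocal {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Sample data made up for this example
    val lines = sc.parallelize(Seq("Zhang San Lao Wang", "Lao Wang Zhang San"))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}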
Note: the build path needs not only spark-assembly-1.5.0-cdh5.5.4-hadoop2.6.0-cdh5.5.4.jar from Spark's lib folder, but also jars from Hadoop's share/hadoop directory. I am not sure exactly which of those are required, so I simply added all of them.
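If you would rather not hunt through Spark's lib folder and Hadoop's share/hadoop directory by hand, a build tool can resolve the dependencies for you. The build.sbt below is only a sketch and not the setup used in this post; it assumes the plain Apache 1.5.0 artifact is close enough to the CDH build, and for the exact 1.5.0-cdh5.5.4 jars you would also need Cloudera's Maven repository.

// build.sbt (sketch, not the setup used in this post)
name := "wordcount"
version := "1.0"
scalaVersion := "2.10.4"

// "provided" keeps Spark classes out of the packaged jar; the cluster supplies them at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"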
Use Eclipse to export it as a jar package.
Note: in Scala, the object name does not have to match the file name, unlike in Java. For example, my object is named WordCount, but the file is Wc.scala.
Upload the jar to the server.
View the contents of the test file on the server:
-bash-4.1$ hadoop fs -cat /user/hdfs/test.txt
Zhang Sanhang Steven Cheung Li Sanli Old King Lao Wang
Run the spark-submit command to submit the jar package:
-bash-4.1$ spark-submit --class "WordCount" wc.jar /user/hdfs/test.txt
16/08/22 15:54:17 INFO SparkContext: Running Spark version 1.5.0-cdh5.5.4
16/08/22 15:54:18 INFO SecurityManager: Changing view acls to: hdfs
16/08/22 15:54:18 INFO SecurityManager: Changing modify acls to: hdfs
16/08/22 15:54:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
16/08/22 15:54:19 INFO Utils: Successfully started service 'sparkDriver' on port 55886.
16/08/22 15:54:20 INFO MemoryStore: MemoryStore started with capacity 534.5 MB
16/08/22 15:54:20 INFO Utils: Successfully started service 'HTTP file server' on port 59636.
16/08/22 15:54:41 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/08/22 15:54:41 INFO SparkUI: Started SparkUI at http://192.168.56.201:4040
16/08/22 15:54:41 INFO SparkContext: Added JAR file:/var/lib/hadoop-hdfs/wc.jar at http://192.168.56.201:59636/jars/wc.jar with timestamp 1471852481181
16/08/22 15:54:41 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/08/22 15:54:41 INFO RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.201:8032
16/08/22 15:54:41 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/08/22 15:54:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1536 MB per container)
16/08/22 15:54:41 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/08/22 15:54:43 INFO Client: Submitting application 5 to ResourceManager
16/08/22 15:54:43 INFO YarnClientImpl: Submitted application application_1471848612199_0005
16/08/22 15:54:44 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
...
16/08/22 15:54:50 INFO Client: Application report for application_1471848612199_0005 (state: RUNNING)
16/08/22 15:54:50 INFO YarnClientSchedulerBackend: Application application_1471848612199_0005 has started running.
16/08/22 15:54:51 INFO EventLoggingListener: Logging events to hdfs://hadoop01:8020/user/spark/applicationHistory/application_1471848612199_0005
16/08/22 15:54:52 INFO SparkContext: Created broadcast 0 from textFile at WC.scala:17
16/08/22 15:54:52 INFO FileInputFormat: Total input paths to process : 1
16/08/22 15:54:52 INFO SparkContext: Starting job: collect at WC.scala:19
16/08/22 15:54:52 INFO DAGScheduler: Got job 0 (collect at WC.scala:19) with 2 output partitions
16/08/22 15:54:52 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WC.scala:19)
...
16/08/22 15:55:03 INFO DAGScheduler: ShuffleMapStage 0 (map at WC.scala:19) finished in 10.621 s
16/08/22 15:55:03 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WC.scala:19), which is now runnable
16/08/22 15:55:03 INFO DAGScheduler: ResultStage 1 (collect at WC.scala:19) finished in 0.193 s
16/08/22 15:55:03 INFO DAGScheduler: Job 0 finished: collect at WC.scala:19, took 11.041942 s
(Steven Cheung,1)
(Lao Wang,2)
(Zhang San,2)
(Zhang Si,1)
(Wang Er,1)
(John Doe,3)
(Lie triple,2)
16/08/22 15:55:03 INFO SparkUI: Stopped Spark web UI at http://192.168.56.201:4040
16/08/22 15:55:03 INFO SparkContext: Successfully stopped SparkContext
16/08/22 15:55:03 INFO ShutdownHookManager: Shutdown hook called
16/08/22 15:55:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-bbf694e7-32e2-40b6-88a3-4d97a1d1aab9
The job executed successfully.
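One follow-up worth noting: collect() pulls the entire result into the driver, which is fine for a tiny test file but does not scale. Below is a minimal sketch of a variant that writes the counts back to HDFS instead; it reuses the line RDD from the code above, and the output path is made up for the example (it must not already exist).

val counts = line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
// saveAsTextFile writes one part file per partition; the path below is hypothetical
counts.saveAsTextFile("hdfs://hadoop01:8020/user/hdfs/wordcount_output")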
Spark from Getting Started to Giving Up: Running a Jar Package on a Distributed Cluster