0 The Spark development environment is set up according to the following blogs:
http://blog.csdn.net/w13770269691/article/details/15505507
http://blog.csdn.net/qianlong4526888/article/details/21441131
1 Create a Scala development environment in Eclipse (Juno at least)
Just install the Scala IDE plugin: Help -> Install New Software -> Add URL: http://download.scala-ide.org/sdk/e38/scala29/stable/site
Refer to: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/
2 Write WordCount in Eclipse with Scala
Create a Scala project and a WordCount class as follows:
package com.qiurc.test

import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("Usage: com.qiurc.test.WordCount <master> <input> <output>")
      return
    }
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_QIUTEST_JAR")))
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(_.split(" "))
      .map(word => (word, 1)).reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
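Before wiring the job to a cluster, it can be worth a quick smoke test with a local master, so no jar export, run script, or HDFS is needed yet. A minimal sketch in a separate file (the class name and local input path a.txt are my own, not part of the original setup):

package com.qiurc.test

import org.apache.spark._
import SparkContext._

object WordCountLocal {
  def main(args: Array[String]) {
    // "local[2]" runs Spark inside this JVM with two worker threads
    val sc = new SparkContext("local[2]", "WordCountLocal")
    val counts = sc.textFile("a.txt")   // hypothetical local file
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Print instead of saving, since this is only a sanity check
    counts.collect().foreach(println)
    sc.stop()
  }
}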
3 Export it as a jar
Right-click the project and export it as spark_qiutest.jar.
Then put it into some directory, such as SPARK_HOME/qiutest.
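If you prefer the command line to Eclipse's export wizard, the same jar can be produced with sbt. A minimal build.sbt sketch, assuming the Spark 0.8.0-incubating artifacts for Scala 2.9.3 from Maven Central (the project name and version here are made up):

name := "spark-qiutest"

version := "0.1"

scalaVersion := "2.9.3"

// "provided" keeps spark-core out of the jar; the run script puts Spark on the classpath instead
libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating" % "provided"

Then sbt package writes the jar under target/scala-2.9.3/; copy or rename it to spark_qiutest.jar in SPARK_HOME/qiutest.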
4 Write a run script to run this jar
Copy run-example (in SPARK_HOME) and change it:
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cp run-example run-qiu-test
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ vim run-qiu-test
__________________
SCALA_VERSION=2.9.3

# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`; pwd)"

# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e $FWDIR/conf/spark-env.sh ] ; then
  . $FWDIR/conf/spark-env.sh
fi

if [ -z "$1" ]; then
  echo "Usage: run-example <example-class> [<args>]" >&2
  exit 1
fi

# Figure out which JAR file our test class was packaged into. This includes a bit of a hack
# to avoid the -sources and -doc packages that are built by publish-local.
QIUTEST_DIR="$FWDIR"/qiutest
SPARK_QIUTEST_JAR=""
if [ -e "$QIUTEST_DIR"/spark_qiutest.jar ]; then
  export SPARK_QIUTEST_JAR=`ls "$QIUTEST_DIR"/spark_qiutest.jar`
fi
if [[ -z $SPARK_QIUTEST_JAR ]]; then
  echo "Failed to find Spark qiutest jar assembly in $FWDIR/qiutest" >&2
  echo "You need to build the Spark test jar assembly before running this program" >&2
  exit 1
fi

# Since the test JAR ideally shouldn't include spark-core (that dependency should be
# "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
CLASSPATH="$SPARK_QIUTEST_JAR:$CLASSPATH"

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
  echo -n "Spark Command: "
  echo "$RUNNER" -cp "$CLASSPATH" "$@"
  echo "========================================"
  echo
fi

exec "$RUNNER" -cp "$CLASSPATH" "$@"
5 Run it on Spark with Hadoop HDFS
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ls
assembly  LICENSE  pyspark.cmd  spark-class
a.txt     logs     python       spark-class2.cmd
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat a.txt
a
b
c
c
d
d
e
e
(Note: put a.txt into HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -put a.txt ./
(Note: check a.txt in HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 6 items
-rw-r--r--   2 hadoop supergroup   4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r--   2 hadoop supergroup        2014-04-14 15:58 /user/hadoop/a.txt
-rw-r--r--   2 hadoop supergroup      0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r--   2 hadoop supergroup      0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x   - hadoop supergroup      0 2014-04-14 15:57 /user/hadoop/qiurc
drwxr-xr-x   - hadoop supergroup      0 2013-07-06 19:48 /user/hadoop/temp
(Note: create a dir named "qiurc" to store the output of WordCount in HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -mkdir /user/hadoop/qiurc
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 5 items
-rw-r--r--   2 hadoop supergroup   4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r--   2 hadoop supergroup      0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r--   2 hadoop supergroup      0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x   - hadoop supergroup      0 2014-04-14 15:32 /user/hadoop/qiurc
drwxr-xr-x   - hadoop supergroup      0 2013-07-06 19:48 /user/hadoop/temp
Now run our WordCount program, specifying the input and output locations. In my test, writing to HDFS only worked with absolute hdfs:// paths.
(Note: the prefix "hdfs://debian-master:9000/user/hadoop/" can't be forgotten)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ./run-qiu-test com.qiurc.test.WordCount spark://debian-master:7077 hdfs://debian-master:9000/user/hadoop/a.txt hdfs://debian-master:9000/user/hadoop/qiurc
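Why the full prefix matters: textFile and saveAsTextFile hand their paths to Hadoop's FileSystem layer, and a path without a scheme is resolved against whatever default filesystem the process sees, which may not be your HDFS. A hedged illustration, reusing the paths above:

// A bare path depends on the default filesystem configuration
val maybeLocal = sc.textFile("a.txt")
// A fully qualified URI is unambiguous
val onHdfs = sc.textFile("hdfs://debian-master:9000/user/hadoop/a.txt")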
(Note: hadoop fs -get works here, too)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -copyToLocal /user/hadoop/qiurc/ localfile
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ls localfile/
part-00000  part-00001  part-00002  _SUCCESS
(Note: let me show the result)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00000
(,1)
(c,2)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00001
(d,2)
(a,1)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00002
(e,3)
(b,1)
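By the way, the (,1) pair in part-00000 is an empty token: String.split(" ") keeps interior empty strings, so a doubled or leading space in the input yields "" as a "word". A tiny illustration (the input string is made up):

"a  b".split(" ")   // Array("a", "", "b") -- the doubled space produces the empty token counted as (,1)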
Finished! ^_^