Spark WordCount reading and writing HDFS files (read input from Hadoop HDFS and write output back to HDFS)

Source: Internet
Author: User
Tags: pack, hadoop, fs
0. Set up a Spark development environment by following these blogs:
http://blog.csdn.net/w13770269691/article/details/15505507
http://blog.csdn.net/qianlong4526888/article/details/21441131

1. Create a Scala development environment in Eclipse (at least the Juno release)
Just install the Scala IDE plugin: Help -> Install New Software -> Add URL: http://download.scala-ide.org/sdk/e38/scala29/stable/site
Refer to: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/

2. Write WordCount in Eclipse with Scala
Create a Scala project and a WordCount object as follows:
package com.qiurc.test
import org.apache.spark._
import SparkContext._

object WordCount {
    def main(args: Array[String]) {
      if (args.length != 3) {
        println("Usage: com.qiurc.test.WordCount <master> <input> <output>")
        return
      }
      val sc = new SparkContext(args(0), "WordCount",
          System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_QIUTEST_JAR")))
      val textFile = sc.textFile(args(1))
      val result = textFile.flatMap(_.split(" "))
              .map(word => (word, 1)).reduceByKey(_ + _)
      result.saveAsTextFile(args(2))
    }
}
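The transformation chain above (`flatMap` -> `map` -> `reduceByKey`) can be sketched on plain Scala collections, without a Spark cluster. This is an illustrative sketch only: `reduceByKey` is emulated here with `groupBy` plus a per-key sum, and the object and method names are our own, not part of the tutorial's code.

```scala
// Sketch of the same word-count logic on plain Scala collections (no Spark
// needed). reduceByKey is approximated by groupBy + summing the per-key counts.
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))          // split each line into words
         .map(word => (word, 1))         // pair each word with a count of 1
         .groupBy(_._1)                  // group pairs by word
         .mapValues(_.map(_._2).sum)     // sum the counts for each word
         .toMap

  def main(args: Array[String]) {
    println(count(Seq("a b", "c c", "d d", "e e e")))
  }
}
```

On a real RDD, `reduceByKey(_ + _)` performs the same per-key aggregation, but distributed across partitions.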




3. Export as a jar
Right-click the project and export it as spark_qiutest.jar.
Then put it into some directory, such as SPARK_HOME/qiutest.

4. Create a run script for this jar
Copy run-example (in SPARK_HOME) and modify it:
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cp run-example run-qiu-test
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ vim run-qiu-test
__________________
SCALA_VERSION=2.9.3

# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`; pwd)"

# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e $FWDIR/conf/spark-env.sh ]; then
  . $FWDIR/conf/spark-env.sh
fi

if [ -z "$1" ]; then
  echo "Usage: run-example <example-class> [<args>]" >&2
  exit 1
fi

# Figure out which JAR file our classes were packaged into. This includes a bit
# of a hack to avoid the -sources and -doc packages that are built by publish-local.
QIUTEST_DIR="$FWDIR"/qiutest
SPARK_QIUTEST_JAR=""
if [ -e "$QIUTEST_DIR"/spark_qiutest.jar ]; then
  export SPARK_QIUTEST_JAR=`ls "$QIUTEST_DIR"/spark_qiutest.jar`
fi

if [[ -z $SPARK_QIUTEST_JAR ]]; then
  echo "Failed to find Spark qiutest jar assembly in $FWDIR/qiutest" >&2
  echo "You need to build the Spark test jar assembly before running this program" >&2
  exit 1
fi

# Since the jar ideally shouldn't include Spark-core (that dependency should be
# "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
CLASSPATH="$SPARK_QIUTEST_JAR:$CLASSPATH"

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
  echo -n "Spark Command: "
  echo "$RUNNER" -cp "$CLASSPATH" "$@"
  echo "========================================"
  echo
fi

exec "$RUNNER" -cp "$CLASSPATH" "$@"

5. Run it in Spark with Hadoop HDFS

hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ls
assembly  LICENSE  pyspark.cmd  spark-class
a.txt     logs     python       spark-class2.cmd

hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat a.txt
a
b
c
c
d
d
e
e

(Note: put a.txt into HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -put a.txt ./

(Note: check a.txt in HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 6 items
-rw-r--r--   2 hadoop supergroup       4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r--   2 hadoop supergroup            2014-04-14 15:58 /user/hadoop/a.txt
-rw-r--r--   2 hadoop supergroup          0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r--   2 hadoop supergroup          0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x   - hadoop supergroup          0 2014-04-14 15:57 /user/hadoop/qiurc
drwxr-xr-x   - hadoop supergroup          0 2013-07-06 19:48 /user/hadoop/temp

(Note: create a dir named "qiurc" to store the output of WordCount in HDFS)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -mkdir /user/hadoop/qiurc

hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 5 items
-rw-r--r--   2 hadoop supergroup       4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r--   2 hadoop supergroup          0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r--   2 hadoop supergroup          0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x   - hadoop supergroup          0 2014-04-14 15:32 /user/hadoop/qiurc
drwxr-xr-x   - hadoop supergroup          0 2013-07-06 19:48 /user/hadoop/temp

Now run our WordCount program, specifying the input and output locations. In this test, only absolute hdfs:// paths were used for reading from and writing to HDFS.
(Note: the prefix "hdfs://debian-master:9000/user/hadoop/" must not be forgotten)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ./run-qiu-test com.qiurc.test.WordCount spark://debian-master:7077 hdfs://debian-master:9000/user/hadoop/a.txt hdfs://debian-master:9000/user/hadoop/qiurc
(Note: fetching the output with -copyToLocal, or the -get command, works too)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -copyToLocal /user/hadoop/qiurc/ localfile


hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ls localfile/
part-00000  part-00001  part-00002  _SUCCESS

(Note: let's look at the result)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00000
(, 1)
(c,2)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00001
(d,2)
(a,1)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localfile/part-00002
(e,3)
(b,1)
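Each line in the part files is the default `toString` of a key-value pair, e.g. `(c,2)`. As a small illustration, lines in that format could be parsed back into pairs in plain Scala; this helper is hypothetical and not part of the tutorial's code, and it assumes the word itself contains no trailing comma.

```scala
// Parse lines like "(c,2)" from the saved part files back into (word, count)
// pairs. Illustrative only; splits on the last comma in the line.
object ResultParser {
  def parse(line: String): (String, Int) = {
    val body = line.stripPrefix("(").stripSuffix(")")
    val idx = body.lastIndexOf(',')
    (body.substring(0, idx), body.substring(idx + 1).toInt)
  }

  def main(args: Array[String]) {
    println(parse("(c,2)"))   // prints (c,2)
  }
}
```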



Finish it! ^_^
