Although Scala is the recommended language for Spark, it is still worth trying this in Java.
package org.admln.java7OperateSpark;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class OperateSpark {
    // Word segmentation separator
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        // Initialize the SparkConf
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("spark://hadoop:7077");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // The second (optional) parameter of textFile is the minimum number of partitions for the file
        JavaRDD<String> lines = ctx.textFile("hdfs://hadoop:8020/in/spark/javaoperatespark/wordcount.txt");
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(SPACE.split(s));
            }
        });

        // Turn each word into a (word, 1) key-value pair
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String t) {
                return new Tuple2<String, Integer>(t, 1);
            }
        });

        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) {
                return v1 + v2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }
        counts.saveAsTextFile("hdfs://hadoop:8020/out/spark/javaoperatespark2/");
        ctx.stop();
    }
}
An error occurred when running it.
In Eclipse:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:136)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:33])
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:695)
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:540)
    at org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:184)
    at org.admln.java7OperateSpark.OperateSpark.main(OperateSpark.java:27)
In the shell:
Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$AddBlockRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    ...
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:30S)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
You can see that this is a protobuf version conflict between Spark and Hadoop: the protobuf version that Spark 1.2.0 builds against by default does not match Hadoop 2.2.0, which uses protobuf 2.5.0.
So I modified Spark's pom.xml, recompiled, and rebuilt the deployment package (this takes about an hour).
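For reference, here is a minimal sketch of the kind of pom.xml change involved. I am writing it from memory, so treat the property name as an assumption and verify it against your own copy of the Spark 1.2.0 pom before rebuilding:

<!-- spark/pom.xml (sketch, assumed property name): align protobuf with Hadoop 2.2.0 -->
<properties>
    <protobuf.version>2.5.0</protobuf.version>
</properties>

If memory serves, building with the hadoop-2.2 Maven profile (-Phadoop-2.2 -Dhadoop.version=2.2.0) also pins a matching protobuf, which may save the manual edit.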
After that, running from the shell succeeded, but Eclipse still reported the error.
This is because I pull in the Spark package through Maven, and there is a Guava version conflict with the version it brings in by default.
Add a separate Guava dependency:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
</dependency>
Then Eclipse submits the job without errors, but the task never actually executes; it reports that resources are insufficient:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Then I raised the number of cores to 2 and the memory to 1500 MB, but it still reports:
INFO SparkDeploySchedulerBackend: Granted executor ID app-20150111003236-0000/3 on hostPort hadoop:34766 with 2 cores, 512.0 MB RAM
In other words, the number of cores changed, but the executor memory did not, and I do not know why. Also, the same program submitted from the shell executes normally, while submitting it externally from Eclipse reports insufficient memory, and that is executor memory, not driver memory.
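For completeness, here is a minimal sketch of how I pass those resource settings from the Java side. The configuration keys (spark.cores.max, spark.executor.memory) are standard Spark properties; the class name and the values are just what I was experimenting with, not recommendations:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ResourceConfigSketch {
    public static void main(String[] args) {
        // Ask the standalone master for 2 cores in total and 1500 MB per executor.
        // Whether the memory request is actually honored is exactly what the log above disputes.
        SparkConf sparkConf = new SparkConf()
                .setAppName("JavaWordCount")
                .setMaster("spark://hadoop:7077")
                .set("spark.cores.max", "2")
                .set("spark.executor.memory", "1500m");

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        // ... run the word count exactly as in the program above ...
        ctx.stop();
    }
}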
I guess there are two possible causes.
1. A Spark bug: the SPARK_DRIVER_MEMORY variable defaults to 512 MB, but modifying it externally does not take effect (see the sketch after this list);
2. The resources of the CentOS virtual machine and my local Windows machine are getting mixed up, because I saw this error:
ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
and my local machine has 4 cores while the virtual machine has only 2.
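Regarding the first guess: my understanding (an assumption on my part, not something I verified in the Spark source) is that in client mode the driver JVM is already running by the time the SparkConf is built, so a driver-memory value set in code cannot enlarge the heap of the current process; it has to be set before launch, for example through SPARK_DRIVER_MEMORY or spark-submit's --driver-memory. A small sketch that makes the point:

import org.apache.spark.SparkConf;

public class DriverMemorySketch {
    public static void main(String[] args) {
        // This code already runs inside the driver JVM, whose maximum heap was fixed
        // when the JVM was launched, so setting the property here comes too late.
        SparkConf conf = new SparkConf()
                .setAppName("JavaWordCount")
                .setMaster("spark://hadoop:7077")
                .set("spark.driver.memory", "1g"); // stored in the conf, but the heap stays as launched

        // The heap actually available to this driver process:
        System.out.println("Driver max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}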
I do not know why there are no examples of submitting from Eclipse to be found online; either it simply is not supported (and the client's resources get confused), or nobody has worked it out.