I originally wanted to use Eclipse, but after looking around online I found that everyone was talking about how good IntelliJ is. Encouraged by that, I decided to tinker with IntelliJ on this beat-up machine of mine.
Remote debugging a Spark program means connecting the local IDE to the Spark cluster and letting the program run there while the debugger watches it in real time; once it is configured, it works almost the same as local debugging.
I have already written up the installation and deployment of the Spark cluster: http://blog.csdn.net/u013468917/article/details/50979184, which deployed Spark 1.0.2 on a Hadoop 2.2.0 platform. Later, after getting a little more familiar with Spark, I wanted to upgrade the cluster, so I upgraded it in place to the latest 1.6.1. The installation process is exactly the same as for the original 1.0.2: after unpacking, copy the configuration files from the conf folder of the original installation straight over. I used the package precompiled for Hadoop 2.3.0 and had no problems.
So this demo was done on the Spark 1.6.1 cluster.
The process is broadly divided into the following steps:
1. Open IntelliJ IDEA, choose File -> New -> Project, and select Scala.
2. Name the project TopK, then choose the Java and Scala SDKs.
3. Import the Spark dependency package, which is named spark-assembly-xxxxxxx.jar in the lib directory of the downloaded package.
Click File -> Project Structure -> Libraries, click the plus sign, choose Java, and then select the path to the dependency jar.
4. Right-click the src folder, choose New -> Scala Class, fill in the class name, and select Object.
5. Fill in the file with the following contents:
import org.apache.spark._
import org.apache.spark.SparkContext._

object TopK {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val textRdd = sc.textFile(args(0), 3)
    // Split each line into words and count word frequencies
    val count = textRdd.flatMap(line => line.split("[^a-zA-Z]+").map(word => (word, 1))).reduceByKey(_ + _)
    // Compute a top-10 list per partition, collect them, then merge on the driver
    val topk = count.mapPartitions(getTopK).collect()
    val iter = topk.iterator
    val outIter = getTopK(iter)
    println("TopK value:")
    while (outIter.hasNext) {
      val tmp = outIter.next()
      println("\nword: " + tmp._1 + " frequency: " + tmp._2)
    }
    sc.stop()
  }

  // Keep a descending array of the 10 most frequent words seen in the iterator
  def getTopK(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val a = new Array[(String, Int)](10)
    while (iter.hasNext) {
      val tmp = iter.next()
      var flag = true
      for (i <- 0 until a.length if flag) {
        if (a(i) != null && tmp._2 > a(i)._2) {
          for (j <- ((i + 1) until a.length).reverse) {
            a(j) = a(j - 1)
          }
          a(i) = tmp
          flag = false
        } else if (a(i) == null) {
          a(i) = tmp
          flag = false
        }
      }
    }
    // Skip unused slots so callers never see null entries
    a.iterator.filter(_ != null)
  }
}
This is a TopK program that finds the 10 most frequent words in a text.
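For comparison, the same top-10 computation can be written much more compactly with the RDD API's built-in ordering helpers. The sketch below is only illustrative and is not from the original program; the object name TopKCompact, the empty-word filter, and the output formatting are my own assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object TopKCompact {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TopKCompact"))
    // Count word frequencies, then let Spark pick the 10 entries with the largest counts
    val top10 = sc.textFile(args(0), 3)
      .flatMap(_.split("[^a-zA-Z]+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .takeOrdered(10)(Ordering.by { case (_, count) => -count })
    top10.foreach { case (word, count) => println(s"word: $word  frequency: $count") }
    sc.stop()
  }
}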
6. Export JAR Package
It may just be that I am not familiar with it, but I personally find exporting a jar from IntelliJ quite a bit more tedious than from Eclipse.
Select File -> Project Structure -> Artifacts, then click the plus sign and choose JAR -> From Modules with dependencies.
Then set the Main Class to TopK, select the "copy to the output xxxxxxx" option, and click OK.
Next select Build -> Build Artifacts, then choose Build.
After the build completes, you can find TopK.jar in the out folder. Then upload TopK.jar to the cluster's master node.
Up to this point everything is the same as ordinary application development; the next step is the key part.
7. Cluster configuration
Modify the spark-class script, which is in the bin directory under the Spark installation directory.
Near the end of the script, change this line:
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
to:
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main $JAVA_OPTS "$@")
This makes spark-class take the JAVA_OPTS variable into account before launching the task, so we can pass JVM options to the application.
After the modification is complete, execute the following command at the command line:
export JAVA_OPTS="$JAVA_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005"
This sets a temporary JVM options variable for the current shell session.
8. Start remote debugging
First, run the TopK.jar that you just uploaded:
/cloud/spark-1.6.1-bin-hadoop2.3/bin/spark-submit --class TopK --master yarn TopK.jar /spark/jane1.txt
At this point you should see output like "Listening for transport dt_socket at address: 5005", indicating that Spark's JVM is listening on port 5005. The port can be set to anything that does not conflict with another service; 5005 just happens to be the port IntelliJ listens on by default.
Then go back to IDEA and choose Run -> Edit Configurations. Click the plus sign in the upper left corner, choose Remote, give it a name such as test_remote_debug, and set the host; my cluster's master address is 192.168.1.131.
Click OK
Set a breakpoint in the TopK program from earlier.
Then press F9 and select test_remote_debug.
If all goes well, the console will show:
Connected to the target VM, address: '192.168.1.131:5005', transport: 'socket'
This indicates a successful connection. From here on, debugging works the same as local debugging.
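One caveat worth noting (my own note, assuming spark-submit runs in the default yarn-client deploy mode): the JDWP agent configured through JAVA_OPTS is attached to the JVM that spark-class launches, i.e. the driver, so breakpoints in driver-side code are the ones that will be hit; the body of getTopK invoked through mapPartitions runs on the executors and is not covered by this agent. Some driver-side breakpoint candidates from the TopK program:

// Driver-side lines in TopK.main that make good breakpoints
// (assumption: default client deploy mode, so the JDWP agent is on the driver JVM)
val topk = count.mapPartitions(getTopK).collect()   // after collect(): inspect each partition's top-10 list
val iter = topk.iterator
val outIter = getTopK(iter)                         // step through the final merge, which runs on the driver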
Finally, a quick walkthrough of the JAVA_OPTS options:
-Xdebug enables debugging support.
-Xrunjdwp loads the JDWP implementation, with several sub-options:
transport=dt_socket: the transport used between the JPDA front end and back end; dt_socket means socket transport.
address=5005: the JVM listens for debugger connections on port 5005; set it to any port that does not conflict.
server=y: the launched JVM acts as the debug server and listens for the debugger to attach; with server=n it acts as a client and connects to the debugger instead.
suspend=y: the launched JVM pauses and waits until the debugger attaches before it starts executing; with suspend=n it does not wait.