Spark App Remote Debugging

I originally wanted to use Eclipse, but after asking around online I found that everyone was praising IntelliJ IDEA. Encouraged by that, I decided to set up IntelliJ on this beat-up machine of mine.

Remote debugging a Spark program means connecting the local IDE to the Spark cluster: the program runs on the cluster while the debugger inspects its execution in real time. Once configured, it feels almost the same as local debugging.

I have already written about installing and deploying the Spark cluster: http://blog.csdn.net/u013468917/article/details/50979184. That post deployed Spark 1.0.2 on a Hadoop 2.2.0 platform. Later, after getting a little more familiar with Spark, I wanted to upgrade the cluster, so I simply upgraded it in place to the latest 1.6.1. The installation process is exactly the same as for 1.0.2: after unpacking, I copied the configuration files from the conf folder of the original installation straight over. I used the package prebuilt for Hadoop 2.3, and there were no problems.

So this demo is done on a Spark 1.6.1 cluster.

The process is broadly divided into the following steps:

1. Open IntelliJ IDEA, select File -> New -> Project, and choose Scala.


2. Name the project TopK, then choose the Java and Scala SDKs.

3. Import the Spark dependency package, named spark-assembly-xxxxxxx.jar, from the lib directory of the Spark distribution.

Click File -> Project Structure -> Libraries, click the plus sign, and select Java.


Then select the path of the dependency jar.
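
As an alternative to importing the assembly jar by hand, the dependency can be declared through a build tool. The following is only a sketch, assuming an sbt build with the Scala 2.10 / Spark 1.6.1 combination used in this post (sbt is not part of the original workflow):

// build.sbt - hypothetical sbt equivalent of the manual jar import in step 3
name := "TopK"

version := "1.0"

scalaVersion := "2.10.6"  // Spark 1.6.1 prebuilt packages target Scala 2.10

// "provided": the cluster supplies Spark at runtime, so it is not bundled
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

With this in place, running sbt package produces a jar comparable to the one exported from the IDE in step 6 below.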

4. Right-click the src folder, choose New -> Scala Class, fill in the class name, and select Object.


5. Fill in the following content in the file:

import org.apache.spark._
import org.apache.spark.SparkContext._

object TopK {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val textRdd = sc.textFile(args(0), 3)
    // Split on runs of non-letters, then count each word
    val count = textRdd.flatMap(line => line.split("[^a-zA-Z]+").map(word => (word, 1)))
                       .reduceByKey(_ + _)
    // Take the top 10 of each partition, then merge the candidates on the driver
    val topk = count.mapPartitions(getTopK).collect()
    val iter = topk.iterator
    val outIter = getTopK(iter)
    println("TopK value:")
    while (outIter.hasNext) {
      val tmp = outIter.next()
      println("\nword: " + tmp._1 + "  frequency: " + tmp._2)
    }
    sc.stop()
  }

  // Keep the 10 highest-count (word, count) pairs from the iterator by
  // insertion-sorting each pair into a fixed-size array in descending order.
  def getTopK(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val a = new Array[(String, Int)](10)
    while (iter.hasNext) {
      val tmp = iter.next()
      var flag = true
      for (i <- 0 until a.length if flag) {
        if (a(i) != null && tmp._2 > a(i)._2) {
          // Shift lower-ranked entries down to make room at position i
          for (j <- ((i + 1) until a.length).reverse) {
            a(j) = a(j - 1)
          }
          a(i) = tmp
          flag = false
        } else if (a(i) == null) {
          a(i) = tmp
          flag = false
        }
      }
    }
    // Drop unused slots so callers never see nulls
    a.iterator.filter(_ != null)
  }
}
This is a TopK program that finds the 10 most frequent words in a text file.
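
To see what the getTopK helper does, here is a minimal local check (hypothetical, not in the original post) that feeds it a hand-built iterator; no cluster is required:

// Hypothetical sanity check for the getTopK helper shown above
val counts = Iterator(("the", 99), ("and", 42), ("spark", 7), ("debug", 3))
val top = TopK.getTopK(counts).toList
// getTopK keeps at most 10 entries, in descending order of count:
// List((the,99), (and,42), (spark,7), (debug,3))
println(top)

Each incoming pair is insertion-sorted into a fixed array of length 10, so every partition contributes at most its 10 most frequent words, and the final getTopK pass on the driver merges those candidates.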

6. Export the JAR package

Maybe it is just unfamiliarity on my part, but I personally find building a jar in IntelliJ quite a bit more tedious than in Eclipse.

Select File -> Project Structure -> Artifacts, then click the plus sign and choose JAR -> From Modules with Dependencies.


Then set the Main Class to TopK, select the "copy to the output xxxxxxx" option, and click OK.


Next, select Build -> Build Artifacts, then choose Build.


After the build completes, you can see TopK.jar in the out folder. Then upload TopK.jar to the cluster's master node.

Up to this point everything is the same as ordinary application development; the next step is the key part.

7. Cluster configuration

Modify the spark-class script, which is in the bin directory under the Spark installation directory.

Near the end of the script, change this line:

Done < < ("$RUNNER"-CP "$LAUNCH _classpath" Org.apache.spark.launcher.Main "[email protected]")
to:

Done < < ("$RUNNER"-CP "$LAUNCH _classpath" Org.apache.spark.launcher.Main $JAVA _opts "[email protected]")
This makes spark-class take the JAVA_OPTS variable into account before launching the task, so we can pass JVM options to the application.

After the modification is complete, execute the following command at the command line:

export JAVA_OPTS="$JAVA_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005"
This sets the JVM options as a temporary variable for the current shell session.

8. Start remote debugging

First, run the TopK.jar that you just uploaded:

/cloud/spark-1.6.1-bin-hadoop2.3/bin/spark-submit --class TopK --master yarn TopK.jar /spark/jane1.txt
At this point you should see a message like:

Listening for transport dt_socket at address: 5005

This indicates that the JVM is listening on port 5005. The port can be set to anything that does not conflict with another service; 5005 just happens to be the port IntelliJ uses by default.

Then go back to IDEA and choose Run -> Edit Configurations. Click the plus sign in the upper left corner, choose Remote, give the configuration a name such as test_remote_debug, and set the host; my cluster's master address is 192.168.1.131.

Click OK


Set a breakpoint in the TopK program from earlier.


Then press F9 and select test_remote_debug.

If nothing goes wrong, the console will show:

Connected to the target VM, address: '192.168.1.131:5005', transport: 'socket'

This indicates a successful connection. From here on, it is the same as local debugging.


Finally, let me go over the JAVA_OPTS options.

-Xdebug enables debugging support in the JVM.
-Xrunjdwp loads the JDWP (Java Debug Wire Protocol) implementation, with several sub-options:
transport=dt_socket - the transport used between the JPDA front end (the debugger) and the back end (the debugged JVM); dt_socket means socket transport.
address=5005 - the JVM listens for debug connections on port 5005; set it to any non-conflicting port.
server=y - y means the launched JVM acts as the debug server and waits for the debugger to attach; n means it acts as a client and connects to a debugger.
suspend=y - y means the launched JVM pauses until the debugger attaches before it starts executing; with suspend=n the JVM does not wait.
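
If you want to confirm that the JDWP options actually reached the JVM, a small helper (hypothetical, not part of the original post) can print the arguments the process was started with:

// Hypothetical check: print the JVM arguments of the running process,
// to verify that -Xdebug/-Xrunjdwp made it through spark-class.
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object DebugFlagCheck {
  def main(args: Array[String]): Unit = {
    val jvmArgs = ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
    jvmArgs.filter(a => a.contains("jdwp") || a.contains("Xdebug")).foreach(println)
  }
}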
