First, environment
Windows x64, Java 1.8
Scala 2.10.6, Spark 1.6.0, Hadoop 2.7.5
IntelliJ IDEA 2017.2
Nmap toolkit (its ncat command corresponds to the nc command on Linux)
Second, local setup
2.1 Setting environment variables: System Properties -> Environment Variables -> New, with a name of the form XXX_HOME and the root directory of the corresponding installation package as the value; then add %XXX_HOME%\bin to the Path variable.
1. Hadoop needs its environment variable set;
2. Scala: it is best to download and install the matching version, then set its environment variable;
3. Spark: simply decompress the package.
Reference: see the environment-setup links in section Three.
2.2 Build and test. The SBT tool makes the build very convenient; use SBT to create the Scala project. The generated project structure is as follows, where TestMain.scala is:
/**
 * notes: to test Scala, Spark and Hadoop
 * date: 2017.12.20
 * author: gendlee
 */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import com.test.SparkStreaming

object Test {
  // silence Spark's verbose INFO logging
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    SparkStreaming.printWebsites()

    // initiate Spark (the streaming call above blocks, so this part only
    // runs if printWebsites() is commented out)
    val conf = new SparkConf().setMaster("local[2]").setAppName("Test")
    val sc = new SparkContext(conf)
    // read a file from the local disk
    val rdd = sc.textFile("F:\\code\\scala2.10.6_spark1.6_hadoop2.8\\test.log")
  }
}
SparkStreaming.scala is:
/**
 * notes: to test Spark Streaming
 * date: 2017.12.21
 * author: gendlee
 */
package com.test

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming {
  def printWebsites(): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PrintWebsites")
    val ssc = new StreamingContext(conf, Seconds(1))
    val output = "F:\\code\\scala2.10.6_spark1.6_hadoop2.8\\out\\gettedwebsites"
    // receive text lines from localhost:7777
    val lines = ssc.socketTextStream("localhost", 7777)
    // keep only lines that contain "http"
    val websiteLines = lines.filter(_.contains("http"))
    websiteLines.print()
    // websiteLines.repartition(1).saveAsTextFiles(output)
    ssc.start()
    ssc.awaitTermination()
  }
}
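For the SBT build mentioned in 2.2, the project needs the Spark dependencies on its classpath. Below is a minimal build.sbt sketch; the versions come from the environment list above, while the project name is my own placeholder:

```scala
// Minimal build.sbt sketch (project name assumed; versions from the post's environment)
name := "spark-streaming-test"

version := "1.0"

scalaVersion := "2.10.6"

// %% appends the Scala binary version (_2.10) to the artifact name
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-streaming" % "1.6.0"
)
```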
The goal is to extract the lines containing a URL (i.e., containing "http") from the input. Pitfall:
val conf = new SparkConf().setMaster("local[2]").setAppName("PrintWebsites")
Here the setMaster parameter must be local[2]: two threads are needed, one to run the receiver and one to process the data. With the default local (a single thread), no data will be received.
After compiling, run it and you will see output like this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/12/22 16:39:14 INFO Slf4jLogger: Slf4jLogger started
17/12/22 16:39:14 INFO Remoting: Starting remoting
17/12/22 16:39:14 INFO Remoting: Remoting started; listening on addresses: [akka.tcp://sparkDriverActorSystem@169.254.78.142:64905]
17/12/22 16:39:15 ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Socket data stream had no more data
-------------------------------------------
Time: 1513931956000 ms
-------------------------------------------
-------------------------------------------
Time: 1513931957000 ms
-------------------------------------------
An error appears. No hurry: it only means that port 7777 is not delivering any data. Pause the program; we need to send data to port 7777.
With the socketTextStream() function, we can receive data from a specific port on the specified host. Let's look at how to send data on port 7777.
Open PowerShell or cmd on Windows and enter:
ncat -lk -p 7777
Then run the program in IDEA and type lines into the open cmd window; whenever an input line contains "http", the line is printed in IDEA's run window.
Filtered output on the IDEA side:
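If ncat is not available, the same test can be driven from a tiny Scala sender. The LineSender object below is my own minimal sketch, not part of the post: it listens on port 7777 and pushes one line to the first client that connects, which is enough for socketTextStream() to receive something.

```scala
import java.io.PrintWriter
import java.net.ServerSocket

// Hypothetical stand-in for "ncat -lk -p 7777": accept one client
// (the Spark socket receiver) and send it a single newline-terminated line.
object LineSender {
  def serveOnce(port: Int, line: String): Unit = {
    val server = new ServerSocket(port)
    try {
      val socket = server.accept()                       // block until a client connects
      val out = new PrintWriter(socket.getOutputStream, true)
      out.println(line)                                  // autoflush sends the line immediately
      socket.close()
    } finally server.close()
  }

  def main(args: Array[String]): Unit =
    serveOnce(7777, "visit http://example.com for details")
}
```

Unlike ncat -lk, this sketch serves a single client once and exits; it is only meant to confirm that the receiver gets data.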
There is a remaining problem: the filter does not handle cases like https specially, and it also matches lines where "http" is merely part of another word; a follow-up will look at how to filter more precisely.
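One way to tighten the filter is to match whole URLs with a regex instead of a bare contains("http"). The UrlFilter object and its pattern below are my own illustration, not the post's code: requiring the :// separator means https:// links are still captured, while words that merely contain "http" are skipped.

```scala
// Illustrative sketch: extract whole URLs with a regex instead of
// filtering on a bare substring match.
object UrlFilter {
  // matches http:// or https:// followed by non-whitespace, non-quote characters
  private val UrlPattern = """https?://[^\s"']+""".r

  // return every URL-looking token found in one input line
  def extractUrls(line: String): List[String] =
    UrlPattern.findAllIn(line).toList

  def main(args: Array[String]): Unit = {
    println(extractUrls("see http://example.com and https://foo.bar/x"))
    println(extractUrls("nohttphere"))   // "http" inside a word does not match
  }
}
```

In the streaming job this could replace the filter step, e.g. lines.flatMap(UrlFilter.extractUrls) instead of lines.filter(_.contains("http")).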
This completes the requirements of the exercise.
Third, references:
http://blog.csdn.net/gendlee1991/article/details/78066548
https://www.cnblogs.com/FG123/p/5324743.html