Step 1: Software required by the Spark cluster
We build a Spark cluster on the basis of the Hadoop cluster built from scratch in Articles 1 and 2. We will use Spark 1.0.0, released on May 30, 2014 and the latest version of Spark at the time of writing, to build the Spark cluster on top of that Hadoop installation.
Access the cluster's web UI through "http://master:8080", as shown below:
The page shows that we have three worker nodes, along with the details of each node.
At this point, go to the Spark bin directory and start the "spark-shell" console:
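A quick sketch of the commands (the SPARK_HOME variable is an assumption; substitute the path of your own Spark installation):

cd $SPARK_HOME/bin    # assuming SPARK_HOME points at the Spark installation directory
./spark-shell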
Now we are in the Spark shell. Based on the output prompt, we can view the Spark UI in a browser through "http://master:4040", as shown below:
Install Spark
Spark must be installed on the master, slave1, and slave2 machines.
First, install spark on the master. The specific steps are as follows:
Step 1: Decompress spark on the master:
Decompress the package directly to the current directory:
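For example, assuming the downloaded package is named spark-1.0.0-bin-hadoop2.tgz (substitute the file you actually downloaded):

# unpack the Spark archive into the current directory
tar -zxvf spark-1.0.0-bin-hadoop2.tgz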
At this point, create the Spark
From the configuration above, we can see that the master node serves both as the master and as a data-processing node. This is because HDFS keeps three replicas of the data and we only have a limited number of machines. Copy the masters and slaves files configured on the master to the conf folder under the Hadoop installation directory on slave1 and slave2 respectively. Then go to the slave1 or slave2 node and check the content of the masters and slaves files: they have been copied over completely and correctly.
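As a sketch, assuming Hadoop lives under /usr/local/hadoop on every node and the copy is done as root (adjust the path and user to your own setup):

# push the configured masters and slaves files to the other nodes
scp /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves root@slave1:/usr/local/hadoop/conf/
scp /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves root@slave2:/usr/local/hadoop/conf/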
The figure below gives a view of the environment we are going to build, which will be used throughout this book:
Build Ubuntu in Oracle VirtualBox
Setting up a VirtualBox environment that runs Ubuntu 14.04 is the safest way to build a development environment, since it avoids conflicts with existing libraries, and you can use similar commands to replicate the environment in the cloud. To build the Anaconda and Spark environment, we're going to create an Ubuntu virtual machine.
Select "yes" to enable automatic installation of scala plug-in idea.
In our case it took about 2 minutes to download and install the SDK; of course, the download time varies depending on your network speed.
Save and run the source command to make the configuration file take effect.
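For example (a sketch that assumes the environment variables were added to ~/.bashrc; use /etc/profile or whichever file you actually edited):

# reload the shell configuration so the new environment variables take effect
source ~/.bashrc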
Step 3: Run IDEA, then install and configure the IDEA Scala development plug-in:
The official document states:
Go to the IDEA bin directory:
Run "idea.sh" and the following page appears:
Select "Configure" to go to the IDEA configuration page:
Select "Plugins" to go to the plug-in installation page:
Click the "Install JetBrains plugin" option in the lower left corner to open the online plug-in installation page.
The above content is the minimal configuration of mapred-site.xml; the full set of mapred-site.xml configuration options can be found at:
http://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Step 7: Modify the configuration file yarn-site.xml, as shown below:
Modify the content of yarn-site.xml:
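The XML listing itself is not included in this excerpt; the sketch below shows the kind of minimal content such a yarn-site.xml typically carries for Hadoop 2.2.0 (the hostname "master" is an assumption based on this cluster's node names):

<?xml version="1.0"?>
<configuration>
  <!-- shuffle service required by MapReduce on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- host running the ResourceManager (assumed to be "master" in this cluster) -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>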
The above content is the minimal configuration of yarn-site.xml; the full set of yarn-site.xml configuration options can be found at:
http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
according to the program. Use the following commands to view the result of the run, which is consistent with the results from the previous section:

hadoop fs -ls /class3/output2
hadoop fs -cat /class3/output2/part-00000 | less

2.3 Example 2: Package run

The previous example was run directly from within IDEA; this time the IDEA packager will be used to build a package and execute it.

2.3.1 Writing code

Add the Join object file in the class3 package, with the following code:

package class3
import org.apache.
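The listing breaks off after the first import. The sketch below only illustrates the general shape of such a program (the package and object names follow the text above; the input paths, port, and join logic are made-up placeholders, not the original code):

package class3

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical join example: combines two key-value datasets by key.
object Join {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Join")
    val sc = new SparkContext(conf)

    // Placeholder inputs: lines of "key,value" pairs (paths are illustrative only)
    val left = sc.textFile("hdfs://master:9000/class3/input1")
      .map(_.split(",")).map(a => (a(0), a(1)))
    val right = sc.textFile("hdfs://master:9000/class3/input2")
      .map(_.split(",")).map(a => (a(0), a(1)))

    // Inner join on the key, then save the result back to HDFS
    left.join(right).saveAsTextFile("hdfs://master:9000/class3/output2")

    sc.stop()
  }
}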
"Note" This series of articles and the use of the installation package/test data can be in the "big gift--spark Getting Started Combat series" Get 1, compile sparkSpark can be compiled in SBT and maven two ways, and then the deployment package is generated through the make-distribution.sh script. SBT compilation requires the installation of Git tools, and MAVEN installation requires MAVEN tools, both of which need to be carried out under the network,
"Note" This series of articles and the use of the installation package/test data can be in the "big gift--spark Getting Started Combat series" Get 1, compile sparkSpark can be compiled in SBT and maven two ways, and then the deployment package is generated through the make-distribution.sh script. SBT compilation requires the installation of Git tools, and MAVEN installation requires MAVEN tools, both of which need to be carried out under the network,
5. The apply method and singleton objects in Scala

Create a new class. As an additional point, the methods placed in an object are effectively static methods, as follows: Next, look at the use of the apply method: With the code above, whenever we write "val a = ApplyTest()", the apply method is invoked and its return value, an instance of ApplyTest, is returned. A class can also define an apply method, which is used as shown below: Because the methods
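The original listings are not reproduced in this excerpt; the small self-contained sketch below illustrates the pattern being described (the field and demo names are made up for illustration):

// Companion object: its methods behave like static methods in Java.
class ApplyTest private (val name: String) {
  // apply on an instance: the object can be "called" like a function
  def apply(greeting: String): String = s"$greeting, $name"
}

object ApplyTest {
  // apply on the object: ApplyTest(...) acts as a factory for instances
  def apply(): ApplyTest = new ApplyTest("default")
  def apply(name: String): ApplyTest = new ApplyTest(name)
}

object ApplyDemo {
  def main(args: Array[String]): Unit = {
    val a = ApplyTest()          // calls ApplyTest.apply(), returns an instance
    println(a.name)              // prints "default"
    val b = ApplyTest("Spark")   // calls ApplyTest.apply("Spark")
    println(b("Hello"))          // calls the instance's apply method: "Hello, Spark"
  }
}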
piece of the data stream in the DStream.

2.2.2.2 Advanced Sources

This type of source requires an interface to an external non-Spark library, some of which have complex dependencies (such as Kafka and Flume). Therefore, creating DStreams from these sources requires the dependencies to be declared explicitly. For example, if you want to create a DStream from Twitter tweets, you must follow these steps: 1) Add the spark-streaming-twitter artifact to the project's dependencies, as sketched below.
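A sketch of what this looks like with the Spark Streaming Twitter integration of the 1.x line (the dependency version and the batch interval are assumptions; Twitter credentials are supplied through twitter4j system properties):

// build.sbt dependency (version is an assumption):
// libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batches (assumed)

    // Create a DStream of tweets; authentication comes from twitter4j properties
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()                       // print a few tweet texts per batch

    ssc.start()
    ssc.awaitTermination()
  }
}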
1. Introduction
The spark-submit script in Spark's bin directory is used to launch applications on a cluster. Through a unified interface it can use all of Spark's supported cluster managers, so you do not have to configure your application specially for each cluster manager.
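A typical invocation looks like the sketch below (the class name, jar path, master URL, and resource settings are placeholders, not values from this article):

# submit an example application to a standalone cluster (all values are illustrative)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  --executor-memory 1g \
  lib/spark-examples-1.0.0-hadoop2.2.0.jar \
  100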
data processing, which is scalable, has high throughput, and has a fault-tolerance mechanism. The data source can be Kafka, Flume, Twitter, ZeroMQ, Kinesis, or a TCP socket. Its operation is based on the discretized stream (DStream), which can be seen as a sequence of ordered RDDs, so real-time data processing can be carried out through operations such as map, reduce, join, and window. Another very important point is that Spark Streaming can be used in
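To make the "DStream as a sequence of RDDs" idea concrete, here is a minimal sketch of a streaming word count over a TCP source (the host, port, and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCountSketch")
    // Each 5-second batch of the DStream is one RDD (interval is illustrative)
    val ssc = new StreamingContext(conf, Seconds(5))

    // TCP text source; host/port are placeholders (e.g. fed by `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic map/reduce over each batch RDD
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}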
The content of this lecture:
A. Spark Streaming job architecture and operating mechanism
B. Spark Streaming job fault-tolerance architecture and operating mechanism
Note: This lecture is based on Spark 1.6.1 (the latest version of Spark as of May 2016).
Previous section review:
The last lesson
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
(1) Create the SparkSubmitArguments object and parse the command-line arguments to initialize its members; (2) only the submit process is analyzed here.
4.2 SparkSubmitArguments
This class encapsulates Spark's parameters.
// Set parameters from command line arguments
parseOpts(args.toList)
// Populate `sparkProperties` map from properties file
mergeDefaultSparkProperties()
The conversion of RDDs and the generation of DAGs

Spark generates the dependencies between RDDs based on the transformations and actions of the RDDs in the user-submitted computation logic, and this compute chain forms a logical DAG. Next, take "Word Count" as an example to describe in detail how this DAG is built. The Spark Scala version of the Word Count program is as follows:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap
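The listing breaks off here. For reference, the canonical word count that such a walkthrough describes usually continues along the following lines (variable names follow the fragment above; the original code may differ slightly):

// split lines into words, pair each word with 1, sum the counts per word, write back to HDFS
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")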