Running Spark Without Installing Hadoop



Spark can be installed in several modes. One of them is local mode, which only requires unpacking the Spark distribution on a single node and does not depend on a Hadoop environment.


Run Spark-shell

Running spark-shell in local mode is very simple: just run the following command, assuming the current directory is $SPARK_HOME:

$ MASTER=local bin/spark-shell

MASTER=local indicates that we are running in local (single-machine) mode. If all goes well, you will see a message like the following:

Created spark context..
Spark context available as sc.

This indicates that spark-shell has a built-in SparkContext variable named sc, which we can use directly in subsequent operations.

By setting the master parameter, spark-shell can support more modes; see http://spark.apache.org/docs/latest/submitting-applications.html#master-urls.

Let's run the simplest example in spark-shell: counting the number of lines in README.md that contain "Spark". Enter the following code in spark-shell:

scala> sc.textFile("README.md").filter(_.contains("Spark")).count()
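
For comparison, here is the same count written for the pyspark shell. This is a minimal sketch that assumes you run it inside bin/pyspark (where sc is already defined) and that README.md is in the current directory:

# Run inside bin/pyspark, where the SparkContext is predefined as sc.
# Count the lines of README.md that contain the word "Spark".
count = sc.textFile("README.md").filter(lambda line: "Spark" in line).count()
print(count)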


If you find the log output too verbose, you can create conf/log4j.properties from the template file:

$ mv conf/log4j.properties.template conf/log4j.properties

Then change the log output level to WARN:

log4j.rootCategory=WARN, console

If the log4j log level is left at INFO, you will see a line such as "INFO SparkUI: Started SparkUI at http://10.9.4.165:4040", which means that Spark has started a web server; you can open http://10.9.4.165:4040 in a browser to view information such as the status of Spark's running tasks.

PySpark

The output of running bin/pyspark is:

$ bin/pyspark
Python 2.7.6 (default, Sep 9, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including DataNucleus jars on classpath
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
15/03/30 15:19:07 WARN Utils: Your hostname, june-mac resolves to a loopback address: 127.0.0.1; using 10.9.4.165 instead (on interface utun0)
15/03/30 15:19:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/03/30 15:19:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Python version 2.7.6 (default, Sep 9 15:04:36)
SparkContext available as sc, HiveContext available as sqlCtx.
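
The last line of the log shows that sqlCtx (a HiveContext) is also predefined. As a quick illustration of using it, here is a minimal sketch meant to be typed into the pyspark shell; the sample rows and the table name "tools" are invented for this example:

# Run inside bin/pyspark, where sc and sqlCtx are predefined.
rows = sc.parallelize([(1, "spark"), (2, "hadoop")])        # made-up sample data
df = sqlCtx.createDataFrame(rows, ["id", "name"])           # build a DataFrame
df.registerTempTable("tools")                               # hypothetical table name
sqlCtx.sql("SELECT name FROM tools WHERE id = 1").show()    # query it with SQL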

You can also use IPython to run Spark:

IPYTHON=1 ./bin/pyspark

If you want to use the IPython Notebook, run:

IPYTHON_OPTS="notebook" ./bin/pyspark

As you can see from the log, both bin/pyspark and bin/spark-shell provide two built-in variables: sc and sqlCtx.

SparkContext available as sc, HiveContext available as sqlCtx.

sc is the SparkContext, through which Spark operations are performed, while sqlCtx is a HiveContext that can be used to run SQL queries.

Spark-submit

Since Spark 1.0, a unified script, spark-submit, has been provided for submitting applications.

For Python programs, we can use spark-submit directly:

$ mkdir -p /usr/lib/spark/examples/python
$ tar zxvf /usr/lib/spark/lib/python.tar.gz -C /usr/lib/spark/examples/python

$ ./bin/spark-submit examples/python/pi.py 10
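
To show what such a Python program looks like, here is a minimal, self-contained sketch; the file name line_count.py is an assumption made for illustration and is not part of the Spark distribution:

# line_count.py -- a hypothetical standalone PySpark script.
# Submit it with: ./bin/spark-submit line_count.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("LineCount")
    sc = SparkContext(conf=conf)
    # Count lines in README.md that mention Spark, then shut down.
    count = sc.textFile("README.md").filter(lambda line: "Spark" in line).count()
    print("Lines with Spark: %d" % count)
    sc.stop()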

For Java programs, we need to compile the code, package it into a jar, and then run it:

$ spark-submit --class "SimpleApp" --master local[4] simple-project-1.0.jar



Spark Run mode

Spark's run modes are varied and flexible. When deployed on a single machine, Spark can run either in local mode or in pseudo-distributed mode; when deployed on a distributed cluster, there are many run modes to choose from, depending on the actual situation of the cluster. The underlying resource scheduling can either rely on an external resource-scheduling framework or use Spark's built-in Standalone mode. As for external resource-scheduling frameworks, the current implementations include the relatively stable Mesos mode and the Hadoop YARN mode, which is still under active development.

In practice, the run mode of a Spark application is determined by the master value (an environment variable or URL) passed to SparkContext, and each mode may need to work together with its auxiliary programs. The currently supported master values are specific strings or URLs, for example (a short sketch follows this list):

local[N]: local mode, using N threads.

local-cluster[worker,core,memory]: pseudo-distributed mode; you can configure the number of virtual worker nodes to start, as well as the number of CPU cores and amount of memory managed by each worker node.

spark://hostname:port: Standalone mode; Spark must be deployed on the relevant nodes, and the URL is the address and port of the Spark master.

mesos://hostname:port: Mesos mode; Spark and Mesos must be deployed on the relevant nodes, and the URL is the Mesos host address and port.

yarn-standalone / yarn-cluster: YARN mode one; both the driver program logic and the tasks run in the YARN cluster.

yarn-client: YARN mode two; the driver program logic runs locally, while the tasks run in the YARN cluster.
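
To make the connection between these master values and an application concrete, here is a minimal PySpark sketch that sets the master explicitly when creating the SparkContext. The values local[2] and the application name are arbitrary choices for illustration; in practice the master is usually supplied via spark-submit --master or the MASTER environment variable rather than hard-coded:

from pyspark import SparkConf, SparkContext

# Local mode with 2 threads; replace with e.g. "spark://hostname:port"
# or "mesos://hostname:port" to target a cluster.
conf = SparkConf().setAppName("MasterUrlDemo").setMaster("local[2]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())
sc.stop()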


Run Spark

There are two ways to run Spark from the command line: bin/pyspark and bin/spark-shell.

For example, start bin/spark-shell in local mode:

$ ./bin/spark-shell --master local


