Installing Spark under Windows


A minimalist development environment built under Windows
"Spark development environment" here does not mean contributing code to the Apache Spark open-source project; it means developing big-data projects based on Spark.

Spark offers two interactive shells: pyspark (based on Python) and spark-shell (based on Scala). The two environments are in fact independent of each other, not interdependent, so if you only use the pyspark interactive environment and never spark-shell, you do not even need to install Scala.

====================
PySpark runtime environment configuration:
====================
(This section largely follows a full translation of https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python)

1. Install Python 2.7. On Windows, the official recommendation is the Anaconda distribution of Python 2.7, which already bundles many scientific-computing packages: https://store.continuum.io/cshop/anaconda/. Add python to the PATH environment variable.
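
As a quick sanity check (a minimal sketch, assuming only that the Anaconda install finished), you can confirm from any console which interpreter ends up on PATH:

import sys
print(sys.version)       # should report 2.7.x (Anaconda)
print(sys.executable)    # should point into the Anaconda install directory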

2. Install the JDK. The pyspark runtime does not actually depend on Scala, but it does require the JDK. Install JDK 1.7, add java to the PATH environment variable, and set the JAVA_HOME environment variable.
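
To verify the JDK setup (a hedged check, assuming only that java is on PATH and JAVA_HOME is set as described above):

import os, subprocess
print(os.environ.get('JAVA_HOME'))      # should print your JDK 1.7 directory
subprocess.call(['java', '-version'])   # prints the java version, proving java is on PATH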

3. Download the spark-1.1.0-bin-hadoop2.4.tgz precompiled package from the Apache Spark website and unzip it.
Choosing a precompiled package avoids the hassle of compiling from source. A layout check is sketched below.
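
To confirm the unzipped layout looks right (a sketch; the spark_home value below is the path used later in this post, so adjust it to wherever you actually unzipped Spark):

import os
spark_home = r'c:\program1\spark-1.1.1-bin-hadoop2.4'   # adjust to your unzip location
for p in (r'bin\pyspark.cmd', r'python\pyspark'):
    print('%s -> %s' % (p, os.path.exists(os.path.join(spark_home, p))))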

4. Fix the spark-class2.cmd script

When starting the Spark shell, I hit an error:
Failed to initialize compiler: object scala.runtime in compiler mirror not found.

Following a tip from the course instructor, I modified the spark-class2.cmd file: at line 91, where the JAVA_OPTS variable is set, append the extra option -Dscala.usejavacp=true. That resolves the problem.

In addition, when using Spark SQL to run the example query over people.txt, I ran into a StackOverflowError.
According to http://f.dataguru.cn/thread-351552-1-1.html, the JVM's thread stack size needs to be increased: on the same line 91, add the -Xss10m option.
Line 91 finally reads:
set JAVA_OPTS=-XX:MaxPermSize=128m %OUR_JAVA_OPTS% -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% -Dscala.usejavacp=true -Xss10m

5. Set the Hadoop environment variable HADOOP_HOME

Starting the pyspark shell and running a simple parallelize + collect fails with the error: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Clearly Spark needs to know the full path of winutils.exe, so the first task is to install winutils.exe and the second is to tell Spark where it is installed. According to what Google turns up, winutils.exe is a Hadoop binary for Windows.
My understanding is that the machine running the Spark driver program does not have to have Hadoop installed, but it still needs the Hadoop runtime environment configured, which includes the HADOOP_HOME environment variable as well as the winutils.exe program.
A precompiled package: https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip;
an alternative version can be downloaded from http://yunpan.cn/csHEXGEaqVrLT (access password: 8199); see the author's blog.
After downloading and extracting, run winutils.exe on the command line to check that it is compatible with your version of Windows. If it is, set HADOOP_HOME to C:\Program\hadoop-common-2.2.0-bin-master.
Reference article: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
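
A quick way to check the result (a minimal sketch, assuming only that HADOOP_HOME has been set as described above):

import os
hadoop_home = os.environ.get('HADOOP_HOME', '')
winutils = os.path.join(hadoop_home, 'bin', 'winutils.exe')
print('%s -> %s' % (winutils, os.path.exists(winutils)))   # True means Spark can find winutils.exe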

6. Set the Spark environment variable SPARK_HOME, in the same way.
This step is not needed for the interactive environment, but it is needed when programming against Spark from Scala/Python.

7. Run pyspark to verify that everything works.
In the shell, enter a small job such as sc.parallelize(range(1000)).count() and check that the correct value comes back.
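
For example (a minimal smoke test; sc is the SparkContext the pyspark shell creates for you, and the numbers are just easy values to eyeball):

data = sc.parallelize(range(1000))                 # distribute a local list as an RDD
print(data.count())                                # expect: 1000
print(data.filter(lambda x: x % 2 == 0).count())   # expect: 500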



Setting up the Scala version of the environment:
Install scala-2.11.4.msi and put Scala's bin directory on the OS PATH environment variable; the remaining steps are the same as above.

Monitor Spark's jobs
http://localhost:4040/stages/


bin/pyspark is an interactive shell, but it has no code completion. I prefer DreamPie: save the following code as c:/pyspark_shell.py, then execute execfile(r'c:\pyspark_shell.py') in DreamPie to get an environment with code completion.


# -*- coding: utf-8 -*-
# file: c:\pyspark_shell.py
# Usage: in DreamPie, enter execfile(r'c:\pyspark_shell.py') to get the pyspark interactive shell.
__author__ = "Hari Sekhon"
__version__ = "0.1"

# https://github.com/harisekhon/toolbox/blob/master/.ipython-notebook-pyspark.00-pyspark-setup.py
import glob
import os
import sys

spark_home = r'c:\program1\spark-1.1.1-bin-hadoop2.4'
hadoop_home = r'C:\program1\hadoop-common-2.2.0-bin-master'
python_bin = r'c:\pythonenv\Python27'

# This step adds pyspark and py4j to PYTHONPATH. I tested adding these paths to the
# Windows PYTHONPATH environment variable directly, and it did not work.
sys.path.insert(0, os.path.join(spark_home, 'python'))         # add pyspark
sys.path.insert(0, os.path.join(spark_home, r'python\build'))  # add py4j

# This step sets SPARK_HOME and HADOOP_HOME.
# In my testing it does not work even if you set the environment variables in Windows,
# so SPARK_HOME and HADOOP_HOME are set here in the program.
os.environ['SPARK_HOME'] = spark_home
os.environ['HADOOP_HOME'] = hadoop_home

# On the worker machines, Python needs to be on the operating system's PATH environment
# variable, and I set PATH manually. But when the code reads the PATH environment
# variable, the Python path is not found, so as a last resort this script appends
# Python to PATH itself.
os.environ['PATH'] = os.environ['PATH'] + ';' + python_bin

# Execute python\pyspark\shell.py under SPARK_HOME to start the shell.
execfile(os.path.join(spark_home, r'python\pyspark\shell.py'))
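
A usage example (a hypothetical DreamPie session; the sum is just an easy value to eyeball):

execfile(r'c:\pyspark_shell.py')          # boots the pyspark shell and defines sc
print(sc.parallelize(range(100)).sum())   # expect: 4950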
