Installing Spark under Windows


A minimalist development environment built under Windows
"Spark development environment" here does not mean contributing code to the Apache Spark open-source project; it means developing big-data projects based on Spark.

Spark offers two interactive shells: pyspark (based on Python) and spark-shell (based on Scala). The two environments are in fact independent of each other, not interdependent, so if you only use the pyspark interactive environment and never spark-shell, you do not even need to install Scala.

====================
PySpark runtime environment configuration:
====================
(This section largely follows a full translation of https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python)

1. Install Python 2.7. On Windows, the official recommendation is the Anaconda distribution of Python 2.7, which already bundles many scientific-computing packages: https://store.continuum.io/cshop/anaconda/. Add python to the PATH environment variable.
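
As a quick sanity check (a minimal sketch, assuming only that the Anaconda install finished), you can confirm from any console which interpreter ends up on PATH:

import sys
print(sys.version)       # should report 2.7.x (Anaconda)
print(sys.executable)    # should point into the Anaconda install directory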

2. Install the JDK. The pyspark runtime does not actually depend on Scala, but it does require the JDK. Install JDK 1.7, add java to the PATH environment variable, and set the JAVA_HOME environment variable.
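
To verify the JDK setup (a hedged check, assuming only that java is on PATH and JAVA_HOME is set as described above):

import os, subprocess
print(os.environ.get('JAVA_HOME'))      # should print your JDK 1.7 directory
subprocess.call(['java', '-version'])   # prints the java version, proving java is on PATH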

3. Download the spark-1.1.0-bin-hadoop2.4.tgz precompiled package from the Apache Spark website and unzip it.
Choosing a precompiled package avoids the hassle of compiling from source. A layout check is sketched below.
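
To confirm the unzipped layout looks right (a sketch; the spark_home value below is the path used later in this post, so adjust it to wherever you actually unzipped Spark):

import os
spark_home = r'c:\program1\spark-1.1.1-bin-hadoop2.4'   # adjust to your unzip location
for p in (r'bin\pyspark.cmd', r'python\pyspark'):
    print('%s -> %s' % (p, os.path.exists(os.path.join(spark_home, p))))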

4. Fix the spark-class2.cmd script

When starting the Spark shell, I hit an error:
Failed to initialize compiler: object scala.runtime in compiler mirror not found.

Following a tip from the course instructor, I modified the spark-class2.cmd file: at line 91, where the JAVA_OPTS variable is set, append the extra option -Dscala.usejavacp=true. That resolves the problem.

In addition, when using Spark SQL to run the example query over people.txt, I ran into a StackOverflowError.
According to http://f.dataguru.cn/thread-351552-1-1.html, the JVM's thread stack size needs to be increased: on the same line 91, add the -Xss10m option.
Line 91 finally reads:
set JAVA_OPTS=-XX:MaxPermSize=128m %OUR_JAVA_OPTS% -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% -Dscala.usejavacp=true -Xss10m

5. Set the Hadoop environment variable HADOOP_HOME

Starting the pyspark shell and running a simple parallelize + collect fails with the error: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Clearly Spark needs to know the full path of winutils.exe, so the first task is to install winutils.exe and the second is to tell Spark where it is installed. According to what Google turns up, winutils.exe is a Hadoop binary for Windows.
My understanding is that the machine running the Spark driver program does not have to have Hadoop installed, but it still needs the Hadoop runtime environment configured, which includes the HADOOP_HOME environment variable as well as the winutils.exe program.
A precompiled package: https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip;
an alternative version can be downloaded from http://yunpan.cn/csHEXGEaqVrLT (access password: 8199); see the author's blog.
After downloading and extracting, run winutils.exe on the command line to check that it is compatible with your version of Windows. If it is, set HADOOP_HOME to C:\Program\hadoop-common-2.2.0-bin-master.
Reference article: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
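
A quick way to check the result (a minimal sketch, assuming only that HADOOP_HOME has been set as described above):

import os
hadoop_home = os.environ.get('HADOOP_HOME', '')
winutils = os.path.join(hadoop_home, 'bin', 'winutils.exe')
print('%s -> %s' % (winutils, os.path.exists(winutils)))   # True means Spark can find winutils.exe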

6. Set the Spark environment variable SPARK_HOME, in the same way.
This step is not needed for the interactive environment, but it is needed when programming against Spark from Scala/Python.

7. Run pyspark to verify that everything works.
In the shell, enter a small job such as sc.parallelize(range(1000)).count() and check that the correct value comes back.
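
For example (a minimal smoke test; sc is the SparkContext the pyspark shell creates for you, and the numbers are just easy values to eyeball):

data = sc.parallelize(range(1000))                 # distribute a local list as an RDD
print(data.count())                                # expect: 1000
print(data.filter(lambda x: x % 2 == 0).count())   # expect: 500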



Setting up the Scala version of the environment:
Install scala-2.11.4.msi and put Scala's bin directory on the OS PATH environment variable; the remaining steps are the same as above.

Monitor Spark's jobs
http://localhost:4040/stages/


bin/pyspark is an interactive shell, but it has no code completion. I prefer DreamPie: save the following code as c:/pyspark_shell.py, then execute execfile(r'c:\pyspark_shell.py') in DreamPie to get an environment with code completion.


# -*- coding: utf-8 -*-
# file: c:\pyspark_shell.py
# Usage: in DreamPie, enter execfile(r'c:\pyspark_shell.py') to get the pyspark interactive shell.
__author__ = "Hari Sekhon"
__version__ = "0.1"

# https://github.com/harisekhon/toolbox/blob/master/.ipython-notebook-pyspark.00-pyspark-setup.py
import glob
import os
import sys

spark_home = r'c:\program1\spark-1.1.1-bin-hadoop2.4'
hadoop_home = r'C:\program1\hadoop-common-2.2.0-bin-master'
python_bin = r'c:\pythonenv\Python27'

# This step adds pyspark and py4j to PYTHONPATH. I tested adding these paths to the
# Windows PYTHONPATH environment variable directly, and it did not work.
sys.path.insert(0, os.path.join(spark_home, 'python'))         # add pyspark
sys.path.insert(0, os.path.join(spark_home, r'python\build'))  # add py4j

# This step sets SPARK_HOME and HADOOP_HOME.
# In my testing it does not work even if you set the environment variables in Windows,
# so SPARK_HOME and HADOOP_HOME are set here in the program.
os.environ['SPARK_HOME'] = spark_home
os.environ['HADOOP_HOME'] = hadoop_home

# On the worker machines, Python needs to be on the operating system's PATH environment
# variable, and I set PATH manually. But when the code reads the PATH environment
# variable, the Python path is not found, so as a last resort this script appends
# Python to PATH itself.
os.environ['PATH'] = os.environ['PATH'] + ';' + python_bin

# Execute python\pyspark\shell.py under SPARK_HOME to start the shell.
execfile(os.path.join(spark_home, r'python\pyspark\shell.py'))
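
A usage example (a hypothetical DreamPie session; the sum is just an easy value to eyeball):

execfile(r'c:\pyspark_shell.py')          # boots the pyspark shell and defines sc
print(sc.parallelize(range(100)).sum())   # expect: 4950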
