Developing Spark with PyCharm on Windows

Tags: pyspark

1. Deploy a local Spark environment
1.1 Install the JDK. Download and install JDK 1.7, and configure the environment variables.
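
A quick way to confirm the JDK is installed and visible is to shell out to it from Python (a minimal sketch; note that java -version prints its output to stderr):

    import os
    import subprocess

    # JAVA_HOME should point at the JDK installation directory
    print(os.environ.get("JAVA_HOME"))

    # Returns 0 if the java binary is found on the PATH
    print(subprocess.call(["java", "-version"]))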
1.2 Configure the Spark environment variables. Go to http://spark.apache.org/downloads.html and download the build that matches your Hadoop version. I downloaded spark-1.6.0-bin-hadoop2.6.tgz: Spark version 1.6, built for Hadoop 2.6.

Unzip the downloaded file; assume the extracted directory is D:\spark-1.6.0-bin-hadoop2.6. Add D:\spark-1.6.0-bin-hadoop2.6\bin to the system PATH variable, and create a new SPARK_HOME variable with the value D:\spark-1.6.0-bin-hadoop2.6.
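
To check that the variables took effect, open a new command prompt and run a quick check from Python (a minimal sketch, assuming the directory layout described above):

    import os

    # SPARK_HOME should be the Spark unpack directory
    spark_home = os.environ.get("SPARK_HOME")
    print(spark_home)

    # The bin directory added to PATH contains the Windows launcher scripts
    print(os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd")))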


1.3 Install the Hadoop-related packages

Spark is built on Hadoop and calls into the Hadoop libraries at runtime. If the Hadoop runtime environment is not configured, you will see error messages at startup, although they do not prevent Spark from running.

Download a compiled Hadoop 2.6 package from https://www.barik.net/archive/2015/01/19/172716/ (I downloaded hadoop-2.6.0.tar.gz). Unzip it, add D:\hadoop-2.6.0\bin to the system PATH variable, and create a new HADOOP_HOME variable with the value D:\hadoop-2.6.0. Then download the winutils component from GitHub at https://github.com/srccodes/hadoop-common-2.2.0-bin; if it does not match your Hadoop version (here, 2.6), there is also a CSDN download at http://download.csdn.net/detail/luoyepiaoxin/8860033.

My approach was to copy all the files in that CSDN package into the %HADOOP_HOME%\bin directory.
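
Before moving on, it is worth confirming that winutils.exe actually ended up in the right place, since that is the binary Spark shells out to on Windows (a minimal sketch, assuming HADOOP_HOME is set as described above):

    import os

    hadoop_home = os.environ.get("HADOOP_HOME")

    # If this file is missing, Spark startup prints the well-known
    # "Could not locate executable ...\bin\winutils.exe" warning
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print(os.path.exists(winutils))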


2. The Python environment

Spark offers two interactive shells: pyspark (based on Python) and spark-shell (based on Scala). The two environments are independent of each other, so if you only use the pyspark interactive environment and never spark-shell, you do not even need to install Scala.

2.1 Download and install Anaconda

Anaconda is a distribution that bundles the Python interpreter with most of the common Python libraries, so installing Anaconda means you do not have to install Python, pandas, or numpy separately. Download it from https://www.continuum.io/downloads, then add python to the PATH environment variable.
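
To confirm that the Anaconda interpreter is the one being picked up, a minimal sketch:

    import sys

    # Should point into the Anaconda installation directory
    print(sys.executable)

    # numpy and pandas ship with Anaconda, so both imports should succeed
    import numpy
    import pandas
    print(numpy.__version__)
    print(pandas.__version__)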

3. Start pyspark to verify the installation

Start pyspark from the Windows command line by typing pyspark (the launcher is on the PATH via %SPARK_HOME%\bin). After the startup messages, you should reach a Python >>> prompt with a SparkContext already available as sc.
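
Once the >>> prompt appears, a one-liner is enough to verify the shell end to end (a minimal smoke test, not from the original post; sc is the SparkContext the pyspark shell creates automatically):

    # Should print 4950, the sum of 0..99
    print(sc.parallelize(range(100)).sum())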

 


4. Configure the development environment in PyCharm

4.1 Configure PyCharm
For more detail, see https://stackoverflow.com/questions/34685905/how-to-link-pycharm-with-pyspark

Open PyCharm and create a project, then select "Run" > "Edit Configurations".

Select "Environment variables"Add the Spark_home directory to the Pythonpath directory.

    • spark_home:spark installation directory

    • pythonpath:spark The Python directory under the installation directory
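
One detail the run configuration often needs in addition: pyspark depends on the py4j library that ships inside Spark, so PYTHONPATH usually wants two entries rather than one. A minimal sketch of the equivalent setup in code (the py4j-0.9-src.zip name matches the Spark 1.6 layout and is an assumption; check your own python\lib directory):

    import glob
    import os
    import sys

    spark_home = os.environ["SPARK_HOME"]

    # %SPARK_HOME%\python holds the pyspark package itself
    sys.path.append(os.path.join(spark_home, "python"))

    # py4j, the Python-to-JVM bridge pyspark uses, ships as a zip
    # under python\lib (e.g. py4j-0.9-src.zip in Spark 1.6)
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))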



4.2 Test procedure

To verify that the environment is set up correctly, run the following test code:

import os
import sys

# Path for Spark source folder (raw strings keep the Windows backslashes intact)
os.environ['SPARK_HOME'] = r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6"

# Append pyspark to the Python path
sys.path.append(r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6\python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
If the program prints "Successfully imported Spark Modules", the environment is working. (The original post included a screenshot with a yellow box highlighting the specific Spark and Python environments.)

The test program code comes from this GitHub gist: https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8
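
Once the imports succeed, you can go one step further and actually run a small job from PyCharm. A minimal sketch (the local[*] master and the toy data are additions for illustration, not part of the original test program):

    from pyspark import SparkConf, SparkContext

    # Run Spark in-process on all local cores; no cluster required
    conf = SparkConf().setMaster("local[*]").setAppName("PyCharmSmokeTest")
    sc = SparkContext(conf=conf)

    # Square the numbers 1..4 and add them up; should print 30
    print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).sum())

    sc.stop()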


