Install PySpark on Windows


0. Install Python. This guide uses Python 2.7.13.

1. Install the JDK

Be sure to install JDK 1.7 or later. With an older version, the following error is reported:

java.lang.NoClassDefFoundError

The installer makes the java command available without manually editing the Path, so after installation run "java -version" to verify that it succeeded.

After that, add a JAVA_HOME environment variable pointing at the JDK install directory. Hadoop's \libexec\hadoop-config.sh expects JAVA_HOME to be defined; if it is missing, running sc = SparkContext(appName="PythonPi") later fails with a VM initialization error because memory cannot be allocated.
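As a quick sanity check, a minimal Python sketch like the following confirms that JAVA_HOME is set and actually points at a JDK (the JDK path in the comment is only an example; yours will differ):

import os

# JAVA_HOME should point at the JDK install directory,
# e.g. C:\Program Files\Java\jdk1.8.0_101 (example path only)
java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("JAVA_HOME is not set")
elif not os.path.exists(os.path.join(java_home, "bin", "java.exe")):
    print("JAVA_HOME is set but has no bin\\java.exe: %s" % java_home)
else:
    print("JAVA_HOME looks good: %s" % java_home)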

 

2. Download and install Spark

Download from: http://spark.apache.org/downloads.html

Note that Spark and Hadoop versions must match exactly, and the required Hadoop version is reflected in the file name: spark-1.6.3-bin-hadoop2.6.tgz requires Hadoop 2.6.

After the download, decompress the archive to a folder; this example decompresses it to C:\spark\spark-1.6.3-bin-hadoop2.6.

Add Environment Variables

1. Add "C: \ spark \ spark-1.6.3-bin-hadoop2.6 \ bin" to the system variable Path, where there are some cmd files

2. Create a New System variable SPARK_HOME and add the path C: \ spark \ spark-1.6.3-bin-hadoop2.6

3. Run pyspark to check whether the installation is successful. Although there are errors, you can open the environment and install the following to solve these errors.
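A similar sketch (assuming the example paths above) verifies SPARK_HOME and that Spark's bin directory made it onto the Path:

import os

spark_home = os.environ.get("SPARK_HOME")  # e.g. C:\spark\spark-1.6.3-bin-hadoop2.6
print("SPARK_HOME: %s" % spark_home)
if spark_home:
    bin_dir = os.path.join(spark_home, "bin")
    print("pyspark launcher present: %s" % os.path.exists(os.path.join(bin_dir, "pyspark.cmd")))
    print("bin directory on Path: %s" % (bin_dir.lower() in os.environ.get("PATH", "").lower()))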

 

3. Download and install Hadoop

Download from: https://archive.apache.org/dist/hadoop/common/

Per the version match above, install Hadoop 2.6 and decompress the package to a folder, for example C:\spark\hadoop-2.6.5.

Add Environment Variables

1. Add "C: \ spark \ hadoop-2.6.5 \ bin" to the system variable Path, where there are some cmd files

2. Create a New System variable HADOOP_HOME and add the path C: \ spark \ hadoop-2.6.5

3. Run the pyspark check to check whether the error message is returned.

 

4. An error still occurs after the installation above

The reason is that Hadoop's bin directory does not contain the winutils.exe file. The solution (with a verification sketch after this list):

- Go to https://github.com/steveloughran/winutils, select the Hadoop version number you installed, open its bin directory, and find winutils.exe. To download it, click the winutils.exe file, then click Download.
- Put the downloaded winutils.exe into the bin directory of Hadoop.
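To confirm the fix, a quick check (assuming HADOOP_HOME was set as in step 3) verifies that winutils.exe is where Spark will look for it:

import os

hadoop_home = os.environ.get("HADOOP_HOME")  # e.g. C:\spark\hadoop-2.6.5
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present: %s" % os.path.exists(winutils))
else:
    print("HADOOP_HOME is not set")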

 

5. Run pyspark; it now starts normally

Spark starts up and prints some log output, most of which can be ignored. Watch for the following two lines:

Spark context available as sc.
SQL context available as sqlContext.

Only when both lines are displayed has Spark started successfully.
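Once the shell is up, sc and sqlContext are ready to use immediately. For example, a trivial job typed at the pyspark prompt confirms the context is working:

# At the pyspark prompt; sc is the SparkContext the shell created
data = sc.parallelize(range(100))
print(data.sum())  # should print 4950 (= 0 + 1 + ... + 99)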

6. To use pyspark from an IDE or from ordinary Python scripts, copy the pyspark folder in C:\spark\spark-1.6.3-bin-hadoop2.6\python into your Python installation's \Lib\site-packages directory, as sketched below.
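The copy can be done by hand in Explorer, or with a short script like this sketch (both paths are examples; adjust them to your own Spark and Python installations):

import shutil

# Example paths only -- adjust to your own installation
src = r"C:\spark\spark-1.6.3-bin-hadoop2.6\python\pyspark"
dst = r"C:\Python27\Lib\site-packages\pyspark"
shutil.copytree(src, dst)  # raises an error if dst already exists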

7. Run pip install py4j to install the py4j package. A quick import check follows below.
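After step 6 and step 7, both packages should import cleanly from an ordinary Python prompt:

# Quick import check from a plain Python session (not the pyspark shell)
import py4j
import pyspark
print(py4j.__file__)
print(pyspark.__file__)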

8. Run an example to check whether everything works. The example takes an optional command-line parameter.

import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    print("partitions is %d" % partitions)
    n = 100000 * partitions

    def f(_):
        # Sample a point uniformly in the square [-1, 1] x [-1, 1]
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
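For example, if the script is saved as pi.py, it can be submitted with an optional partition count, e.g. spark-submit pi.py 10 from a command prompt (spark-submit.cmd lives in Spark's bin directory, which was added to the Path in step 2).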

 
