Install PySpark on Windows


0. Install Python. This guide uses Python 2.7.13.

1. Install the JDK

Be sure to install JDK 1.7 or later. With an older version, the following error is reported:

java.lang.NoClassDefFoundError

The installer makes the java command available without manually editing the Path, so after installation run "java -version" to verify that it succeeded.

After that, add a JAVA_HOME environment variable pointing at the JDK install directory. Hadoop's \libexec\hadoop-config.sh expects JAVA_HOME to be defined; if it is missing, running sc = SparkContext(appName="PythonPi") later fails with a VM initialization error because memory cannot be allocated.
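As a quick sanity check, a minimal Python sketch like the following confirms that JAVA_HOME is set and actually points at a JDK (the JDK path in the comment is only an example; yours will differ):

import os

# JAVA_HOME should point at the JDK install directory,
# e.g. C:\Program Files\Java\jdk1.8.0_101 (example path only)
java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("JAVA_HOME is not set")
elif not os.path.exists(os.path.join(java_home, "bin", "java.exe")):
    print("JAVA_HOME is set but has no bin\\java.exe: %s" % java_home)
else:
    print("JAVA_HOME looks good: %s" % java_home)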

 

2. Download and install Spark

Download from: http://spark.apache.org/downloads.html

Note that Spark and Hadoop versions must match exactly, and the required Hadoop version is reflected in the file name: spark-1.6.3-bin-hadoop2.6.tgz requires Hadoop 2.6.

After the download, decompress the archive to a folder; this example decompresses it to C:\spark\spark-1.6.3-bin-hadoop2.6.

Add Environment Variables

1. Add "C: \ spark \ spark-1.6.3-bin-hadoop2.6 \ bin" to the system variable Path, where there are some cmd files

2. Create a New System variable SPARK_HOME and add the path C: \ spark \ spark-1.6.3-bin-hadoop2.6

3. Run pyspark to check whether the installation is successful. Although there are errors, you can open the environment and install the following to solve these errors.
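A similar sketch (assuming the example paths above) verifies SPARK_HOME and that Spark's bin directory made it onto the Path:

import os

spark_home = os.environ.get("SPARK_HOME")  # e.g. C:\spark\spark-1.6.3-bin-hadoop2.6
print("SPARK_HOME: %s" % spark_home)
if spark_home:
    bin_dir = os.path.join(spark_home, "bin")
    print("pyspark launcher present: %s" % os.path.exists(os.path.join(bin_dir, "pyspark.cmd")))
    print("bin directory on Path: %s" % (bin_dir.lower() in os.environ.get("PATH", "").lower()))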

 

3. Download and install Hadoop

Download from: https://archive.apache.org/dist/hadoop/common/

Per the version match above, install Hadoop 2.6 and decompress the package to a folder, for example C:\spark\hadoop-2.6.5.

Add Environment Variables

1. Add "C: \ spark \ hadoop-2.6.5 \ bin" to the system variable Path, where there are some cmd files

2. Create a New System variable HADOOP_HOME and add the path C: \ spark \ hadoop-2.6.5

3. Run the pyspark check to check whether the error message is returned.

 

4. An error still occurs after the installation above

The reason is that Hadoop's bin directory does not contain the winutils.exe file. The solution (with a verification sketch after this list):

- Go to https://github.com/steveloughran/winutils, select the Hadoop version number you installed, open its bin directory, and find winutils.exe. To download it, click the winutils.exe file, then click Download.
- Put the downloaded winutils.exe into the bin directory of Hadoop.
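To confirm the fix, a quick check (assuming HADOOP_HOME was set as in step 3) verifies that winutils.exe is where Spark will look for it:

import os

hadoop_home = os.environ.get("HADOOP_HOME")  # e.g. C:\spark\hadoop-2.6.5
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present: %s" % os.path.exists(winutils))
else:
    print("HADOOP_HOME is not set")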

 

5. Run pyspark; it now starts normally

Spark starts up and prints some log output, most of which can be ignored. Watch for the following two lines:

Spark context available as sc.
SQL context available as sqlContext.

Only when both lines are displayed has Spark started successfully.
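Once the shell is up, sc and sqlContext are ready to use immediately. For example, a trivial job typed at the pyspark prompt confirms the context is working:

# At the pyspark prompt; sc is the SparkContext the shell created
data = sc.parallelize(range(100))
print(data.sum())  # should print 4950 (= 0 + 1 + ... + 99)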

6. To use pyspark from an IDE or from ordinary Python scripts, copy the pyspark folder in C:\spark\spark-1.6.3-bin-hadoop2.6\python into your Python installation's \Lib\site-packages directory, as sketched below.
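The copy can be done by hand in Explorer, or with a short script like this sketch (both paths are examples; adjust them to your own Spark and Python installations):

import shutil

# Example paths only -- adjust to your own installation
src = r"C:\spark\spark-1.6.3-bin-hadoop2.6\python\pyspark"
dst = r"C:\Python27\Lib\site-packages\pyspark"
shutil.copytree(src, dst)  # raises an error if dst already exists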

7. Run pip install py4j to install the py4j package. A quick import check follows below.
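After step 6 and step 7, both packages should import cleanly from an ordinary Python prompt:

# Quick import check from a plain Python session (not the pyspark shell)
import py4j
import pyspark
print(py4j.__file__)
print(pyspark.__file__)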

8. Run an example to check whether everything works. The example takes an optional command-line parameter.

import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    print("partitions is %d" % partitions)
    n = 100000 * partitions

    def f(_):
        # Sample a point uniformly in the square [-1, 1] x [-1, 1]
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
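For example, if the script is saved as pi.py, it can be submitted with an optional partition count, e.g. spark-submit pi.py 10 from a command prompt (spark-submit.cmd lives in Spark's bin directory, which was added to the Path in step 2).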

 
