Install pyspark on Windows
0. Install Python. I use Python 2.7.13.
1. Install the JDK
Be sure to install version 1.7 or later. If you install a lower version, the following error will be reported.
java.lang.NoClassDefFoundError
After installation, you do not need to set environment variables manually for the java command itself; run "java -version" to test whether the installation succeeded.
After the installation succeeds, also add a JAVA_HOME environment variable, because Hadoop's \libexec\hadoop-config.sh refers to JAVA_HOME. If that variable cannot be found, running sc = SparkContext(appName="PythonPi") later fails with a VM initialization error because memory cannot be allocated.
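If you want to verify this from Python before creating the SparkContext, a minimal sanity check might look like the sketch below (the JDK path in the comment is only an example; use your actual install directory):

import os

# Minimal sanity check before creating a SparkContext.
# JAVA_HOME should point at the JDK install directory,
# e.g. C:\Program Files\Java\jdk1.7.0_80 (example path only).
java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("JAVA_HOME is not set - SparkContext creation will likely fail")
else:
    print("JAVA_HOME is %s" % java_home)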
2. Download and install Spark
Download address: http://spark.apache.org/downloads.html
Note that the Spark and Hadoop versions must match strictly, and the match is reflected in the file name; you must install the corresponding Hadoop version. For example, spark-1.6.3-bin-hadoop2.6.tgz requires Hadoop 2.6.
After the download, decompress it to a folder; this example decompresses it to C:\spark\spark-1.6.3-bin-hadoop2.6.
Add Environment Variables
1. Add "C: \ spark \ spark-1.6.3-bin-hadoop2.6 \ bin" to the system variable Path, where there are some cmd files
2. Create a new system variable SPARK_HOME with the value C:\spark\spark-1.6.3-bin-hadoop2.6
3. Run pyspark to check whether the installation succeeded. At this point there will still be errors, but the shell does open; installing the remaining components below resolves those errors.
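To confirm the two variables are actually visible (remember to open a new console after editing environment variables, or they will not be picked up), a quick sanity check from Python might look like this:

import os

# Check that SPARK_HOME is set and that the Spark bin directory is on Path.
print("SPARK_HOME is %s" % os.environ.get("SPARK_HOME"))
on_path = any(os.path.exists(os.path.join(p, "pyspark.cmd"))
              for p in os.environ.get("PATH", "").split(os.pathsep))
print("pyspark.cmd found on Path: %s" % on_path)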
3. Download and install Hadoop
Download address: https://archive.apache.org/dist/hadoop/common/
Following the note above, install Hadoop 2.6; decompress the package to a folder of your choice, for example C:\spark\hadoop-2.6.5.
Add Environment Variables
1. Add "C: \ spark \ hadoop-2.6.5 \ bin" to the system variable Path, where there are some cmd files
2. Create a new system variable HADOOP_HOME with the value C:\spark\hadoop-2.6.5
3. Run pyspark again to check whether the error messages are gone.
4. An error still occurs after the above installation is complete.
The reason is that the bin directory of Hadoop does not contain the winutils.exe file. The solution is:
- Go to https://github.com/steveloughran/winutils, choose the Hadoop version number you installed, open its bin directory, and find winutils.exe. To download it, click the winutils.exe file and then click Download.
- Put the downloaded winutils.exe into the bin directory of your Hadoop installation (a quick check is sketched below).
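A small Python sketch to confirm the file ended up in the right place; it assumes HADOOP_HOME points at the C:\spark\hadoop-2.6.5 path used in this article:

import os

# Check that winutils.exe is present under HADOOP_HOME\bin.
hadoop_home = os.environ.get("HADOOP_HOME", r"C:\spark\hadoop-2.6.5")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
print("%s exists: %s" % (winutils, os.path.exists(winutils)))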
5. Enter pyspark; it should now run normally.
Spark outputs some log information as it starts, most of which can be ignored. Pay attention to the following two lines:
Spark context available as sc.
SQL context available as sqlContext.
Only when these two lines are displayed has Spark started successfully.
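As a quick smoke test inside the pyspark shell (sc already exists there, so do not create another SparkContext), something like the following should work:

# Run inside the pyspark shell, where sc already exists.
rdd = sc.parallelize(range(100))
print(rdd.count())  # expected: 100
print(rdd.sum())    # expected: 4950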
6. To use pyspark from a development environment, you also need to copy the pyspark folder from C:\spark\spark-1.6.3-bin-hadoop2.6\python into the corresponding \Lib\site-packages directory of your Python installation.
7. Run pip install py4j to install the py4j package.
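To confirm step 6 and step 7 worked, the following quick check in a plain Python interpreter (not the pyspark shell) should import both packages without errors:

# Run in a normal Python interpreter to confirm the copied pyspark package and py4j are importable.
import py4j
from pyspark import SparkContext

print("py4j imported from %s" % py4j.__file__)
print("pyspark import OK: %s" % SparkContext.__name__)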
8. Run an example to check whether everything works. When running this example, you need to pass a command-line parameter (the number of partitions).
import sys
from random import random
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    # Number of partitions comes from the first command-line argument (default 2).
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    print("partitions is %d" % partitions)
    n = 100000 * partitions

    def f(_):
        # Sample a random point in the square [-1, 1] x [-1, 1];
        # count it if it falls inside the unit circle.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
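For example, assuming the code above is saved as pi.py, it can be run with the number of partitions as a command-line argument, e.g. python pi.py 4 from a command prompt, or from your IDE's run configuration with "4" as the script argument.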
What are the "Spark context" and "SQL context" mentioned in these two lines? For now, just remember that Spark has truly started successfully only when you see both of them.
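As a small illustration of the second one, sqlContext can be used directly in the same shell; the column names below are arbitrary:

# Run inside the pyspark shell, where sqlContext already exists.
df = sqlContext.createDataFrame([(1, "spark"), (2, "hadoop")], ["id", "name"])
df.show()
df.filter(df.id > 1).show()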