Spark 2.x In-Depth Series (5): Configuring a Python Development Environment for Spark

Source: Internet
Author: User
Tags: pyspark

Before learning any Spark technology, be sure to first understand Spark correctly; for reference, see: Understanding Spark Correctly.


Here is how to configure a Spark development environment with Python on macOS.


I. Install Python

Spark 2.2.0 requires Python 2.6+ or Python 3.4+.


You can refer to:

http://jingyan.baidu.com/article/7908e85c78c743af491ad261.html
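Once Python is installed, a quick way to confirm the interpreter meets Spark's requirement is to check sys.version_info. This is a minimal sketch, not part of the original steps; any Python 2.6+/3.4+ interpreter will do:

import sys

# Spark 2.2.0 needs Python 2.6+ or Python 3.4+; fail fast if the interpreter is too old.
if sys.version_info[0] == 2:
    assert sys.version_info >= (2, 6), "Python 2.6+ required"
else:
    assert sys.version_info >= (3, 4), "Python 3.4+ required"
print("Python %d.%d.%d is OK" % sys.version_info[:3])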


II. Download the Spark binary package and configure environment variables


1. From the official site http://spark.apache.org/downloads.html, download the spark-2.2.0-bin-hadoop2.6.tgz package to a local disk, then unzip it.



2. Set environment variables:

cd ~
vi .bash_profile

Add the following lines (adjust SPARK_HOME to wherever you unpacked the package):

export SPARK_HOME=/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6
export PATH=$PATH:$SCALA_HOME/bin:$M2_HOME/bin:$JAVA_HOME/bin:$SPARK_HOME/bin

Then reload the profile:

source .bash_profile
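To confirm the variable is visible to newly started processes, a quick check from a Python shell (a sanity check added here, not part of the original steps):

import os

# Should print the SPARK_HOME path configured above; None means the profile was not sourced.
print(os.environ.get("SPARK_HOME"))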


3. Run chmod 744 ./* on the files under the bin directory of SPARK_HOME; otherwise Spark will report an insufficient-permissions error:

cd $SPARK_HOME/bin
chmod 744 ./*

Windows machines should skip this step.
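If you prefer to do this from Python instead of the shell, a rough equivalent of chmod 744 on every file in the bin directory looks like this (a sketch, assuming SPARK_HOME is already exported):

import os
import stat

bin_dir = os.path.join(os.environ["SPARK_HOME"], "bin")
for name in os.listdir(bin_dir):
    # 744 = owner rwx, group r, others r -- enough for spark-submit and friends to execute.
    os.chmod(os.path.join(bin_dir, name), stat.S_IRWXU | stat.S_IRGRP | stat.S_IROTH)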



III. Install PyCharm

1. Download it from the official site https://www.jetbrains.com/pycharm/download/ and run the installer; the installation is a straightforward click-through.



IV. Write wordcount.py and run it successfully


1. Create a project

File -> New Project


2. Configure PYTHONPATH in PyCharm

Go to Run -> Edit Configurations and configure as shown below.

[Three screenshots of the Run -> Edit Configurations dialog omitted.]

Click the "+" button above, then fill in:

PYTHONPATH=/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/:/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip

That is, add the Python-related dependencies shipped inside the Spark installation package to PYTHONPATH.
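As an alternative to the PyCharm setting, the same thing can be done programmatically at the top of a script. This is a minimal sketch, assuming SPARK_HOME is set as in section II; the glob avoids hard-coding the py4j version:

import glob
import os
import sys

# Locate the unpacked Spark distribution (falls back to the path used in this article).
spark_home = os.environ.get(
    "SPARK_HOME",
    "/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6")

# Make the pyspark package and the bundled py4j zip importable.
sys.path.insert(0, os.path.join(spark_home, "python"))
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, zip_path)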

3. Add py4j-some-version.zip and pyspark.zip to the project

In order to be able to browse the source code, we also need to attach these archives to the project, as follows:

[Two screenshots of the Project Structure / Add Content Root dialog omitted.]

Click "+ Add Content Root" and add the two zip packages under /Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/lib.
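If both the PYTHONPATH entry and the content roots are set up correctly, the following imports should resolve without errors (a quick sanity check added here, not part of the original steps):

# Both modules come from the Spark distribution added above.
import py4j
import pyspark

print(pyspark.__file__)  # should point inside the Spark installation directory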


4. Write the Spark word count and run it successfully

Create a Python file wordcount.py with the following contents:

from pyspark import SparkContext, SparkConf
import os
import shutil

if __name__ == "__main__":
    conf = SparkConf().setAppName("AppName").setMaster("local")
    sc = SparkContext(conf=conf)

    # Read the input file, split each line into words, and count each word.
    sourceDataRDD = sc.textFile("file:///Users/tangweiqun/test.txt")
    wordsRDD = sourceDataRDD.flatMap(lambda line: line.split())
    keyValueWordsRDD = wordsRDD.map(lambda s: (s, 1))
    wordCountRDD = keyValueWordsRDD.reduceByKey(lambda a, b: a + b)

    # Remove the output directory if it exists; saveAsTextFile fails on an existing path.
    outputPath = "/Users/tangweiqun/wordcount"
    if os.path.exists(outputPath):
        shutil.rmtree(outputPath)
    wordCountRDD.saveAsTextFile("file://" + outputPath)

    print(wordCountRDD.collect())

Right-click the file and choose Run; it should run successfully.
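For illustration, if test.txt contained the two hypothetical lines

hello spark
hello world

the printed collect() result would look like the list below (element order is not guaranteed with reduceByKey):

[('hello', 2), ('spark', 1), ('world', 1)]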



For a detailed and systematic treatment of the Spark core RDD APIs, see: Spark core RDD API Rationale.

