Prerequisites:
1. Spark is already installed. I am using Spark 2.2.0.
2. A Python environment is already available. I am using Python 3.6.
First, install py4j.
Using pip, run the following command:
pip install py4j
Using conda, run the following command:
conda install py4j
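After installing py4j, it can be worth checking that the interpreter you plan to select in PyCharm can actually find it. The snippet below is a minimal sketch of such a check; the helper name is_installed is my own and is not part of py4j.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the given module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run this with the same interpreter that the PyCharm project will use.
    print("py4j available:", is_installed("py4j"))
```

If this prints False, the package was probably installed into a different Python environment than the one being run.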
Second, create a project in PyCharm.
Select the Python environment during project creation. Once the project opens, go to Run > Edit Configurations > Environment variables.
Add PYTHONPATH and SPARK_HOME, where PYTHONPATH is the python directory inside the Spark installation path and SPARK_HOME is the Spark installation directory.
Then click OK, return to the first dialog, and click Apply and then OK.
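The relationship between the two variables can be sketched in a few lines of Python. This is only an illustration: the function name spark_env and the path /opt/spark-2.2.0 are hypothetical, so substitute your own Spark installation directory.

```python
import os


def spark_env(spark_home):
    """Build the two environment variables from a Spark installation directory."""
    return {
        "SPARK_HOME": spark_home,
        # PYTHONPATH points at the python directory inside the Spark installation.
        "PYTHONPATH": os.path.join(spark_home, "python"),
    }


if __name__ == "__main__":
    # Hypothetical install location; replace it with your own path.
    for name, value in spark_env("/opt/spark-2.2.0").items():
        print(f"{name}={value}")
```

The printed values are what would go into the Environment variables field of the run configuration.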
Third, go to Preferences > Project Structure > Add Content Root.
Add py4j-0.10.4-src.zip and pyspark.zip, both found in the lib folder of the python directory inside the Spark installation path. Then click Apply and OK.
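For a script run outside PyCharm, a similar effect can be approximated by putting the same archives on sys.path at startup. The following is a rough sketch under that assumption, not a replacement for the PyCharm setting; the helper names spark_lib_zips and add_to_sys_path are my own.

```python
import glob
import os
import sys


def spark_lib_zips(spark_home):
    """Return the .zip archives under <spark_home>/python/lib, sorted by name."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_to_sys_path(paths):
    """Prepend each path to sys.path unless it is already present."""
    for path in paths:
        if path not in sys.path:
            sys.path.insert(0, path)


if __name__ == "__main__":
    # Assumes SPARK_HOME is set as described in the previous step.
    spark_home = os.environ.get("SPARK_HOME", "")
    zips = spark_lib_zips(spark_home) if spark_home else []
    add_to_sys_path(zips)
    print("Added to sys.path:", zips)
```

Using a glob instead of hard-coded file names means the sketch does not depend on the exact py4j version that ships with a given Spark release.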
Fourth, write a PySpark WordCount program to test the setup. I am using a PySpark Streaming program.
The code is as follows:
wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
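To see what the split, pair, and reduce steps compute for a single batch, the same logic can be sketched in plain Python without Spark. This is only an illustration of the per-batch result, not how Spark executes it; the function name count_words is my own.

```python
from collections import Counter


def count_words(batch_lines):
    """Count words in one batch of lines, splitting each line on single spaces."""
    counts = Counter()
    for line in batch_lines:
        # Equivalent to flatMap(split) followed by map to (word, 1) and reduceByKey(+).
        counts.update(line.split(" "))
    return dict(counts)


if __name__ == "__main__":
    print(count_words(["a b a d d d d"]))
```

For the line "a b a d d d d" this gives a count of 2 for a, 1 for b, and 4 for d, which matches what the streaming program prints for that batch.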
First, run the following command in a terminal:
nc -lk 9999
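The streaming program needs something listening on port 9999 to send it text. As an alternative to a command-line tool, a small Python sender can play that role. The sketch below is an assumption-laden stand-in: the function names open_listener and send_lines are my own, and it serves a single client once and then closes, rather than staying open for interactive typing.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Create a TCP socket listening on host:port and return it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client, send it each line with a trailing newline, then close."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
    server.close()
```

Calling send_lines(open_listener(), ["a b a d d d d"]) blocks until the streaming program connects, then delivers that one line as input.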
You can then right-click the file in PyCharm and run it. Next, in the terminal where you ran the command above, type some words separated by spaces.
I entered the following:
a b a d d d d
Then press Enter. You should see the following output in PyCharm:
-------------------------------------------
Time: ...
-------------------------------------------
('b', 1)
('d', 4)
('a', 2)
At this point, the setup is complete.
Integrating PySpark with PyCharm on Mac