1. Installing Anaconda2
Once Anaconda2 is installed, the machine's default Python environment is Anaconda's Python 2.7.
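To confirm which interpreter is actually in use after installation, a small check like the one below may help. It is only a sketch: the function name `describe_interpreter` is made up for illustration, and the "conda" test is a heuristic that assumes Anaconda builds mention "conda" in their version banner or install path.

```python
from __future__ import print_function
import sys


def describe_interpreter():
    """Return (executable path, version string, looks-like-Anaconda flag)."""
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    # Heuristic (assumption): Anaconda builds usually mention "conda"
    # in sys.version or in the path of the interpreter.
    banner = (sys.version + " " + (sys.executable or "")).lower()
    looks_like_conda = "conda" in banner
    return sys.executable, version, looks_like_conda


path, version, looks_like_conda = describe_interpreter()
print("Interpreter:", path)
print("Version:", version)
print("Looks like Anaconda:", looks_like_conda)
```

If the version does not start with 2.7 or the path does not point into the Anaconda directory, another Python installation is probably ahead of it on the system path.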
2. Installing py4j
Open a local console (for example via the Run dialog) and install py4j directly with pip. Anaconda ships with pip by default; you can of course also install it with conda.
Installation command: pip install py4j
Q: What happens if py4j is not installed?
A: The Python API for Spark depends on py4j, so running a Spark program without it installed throws an error.
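Before running any Spark code, it can be useful to check whether py4j is importable from the interpreter PyCharm uses. The following is a minimal sketch; the helper name `is_importable` is hypothetical, and it only reports whether the import succeeds.

```python
from __future__ import print_function
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Prints False when py4j is missing from the current environment.
print("py4j importable:", is_importable("py4j"))
```

If this prints False, run the pip command above in the same Anaconda environment and try again.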
3. Configure environment variables
In PyCharm, two environment variables need to be configured: SPARK_HOME and PYTHONPATH.
(1). Open Run Configurations
(This option appears in the toolbar once a project or Python file has been created.)
(2). Edit Environment variables
Alternatively, go through the menu:
Menu: File --> Settings (the screenshot is from the Internet; I use Python 2 here)
(3). Add the Spark and Python entries under Environment variables
Add SPARK_HOME and PYTHONPATH:
- SPARK_HOME: the Spark installation directory
- PYTHONPATH: the python directory under the Spark installation directory
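The same two variables can also be set from inside a script, which is handy when testing outside PyCharm. This is a sketch under assumptions: `configure_spark_env` is a hypothetical helper, and the path in the usage comment is a placeholder, not the author's actual install location. Because PYTHONPATH is only read when the interpreter starts, the sketch also extends `sys.path` for the running process.

```python
import os
import sys


def configure_spark_env(spark_home):
    """Set SPARK_HOME and make <spark_home>/python importable."""
    python_dir = os.path.join(spark_home, "python")
    os.environ["SPARK_HOME"] = spark_home

    # Prepend Spark's python directory to PYTHONPATH (for child processes).
    existing = os.environ.get("PYTHONPATH", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if python_dir not in parts:
        parts.insert(0, python_dir)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)

    # PYTHONPATH is read at startup, so update sys.path for this process too.
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    return python_dir


# Usage (hypothetical path, replace with your own Spark directory):
# configure_spark_env(r"D:\spark")
```

Call the helper before the first `import pyspark` so that the import can be resolved.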
4. Copy the Pyspark package
Copying the pyspark package gives you code hints when writing Spark programs.
To get code hints and completion when writing Spark programs in PyCharm, Spark's pyspark package has to be importable from Python. The Spark distribution ships a Python package called pyspark.
Importing a third-party package in Python is easy: just place the corresponding module in the designated folder.
On Windows, copy pyspark into Python's site-packages directory (Anaconda's site-packages is used here).
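The copy step can also be scripted. The sketch below assumes the package lives at `python/pyspark` under the Spark installation directory, as described above; `copy_pyspark` is a hypothetical helper, and it leaves an existing copy untouched rather than overwriting it.

```python
import os
import shutil
import sysconfig


def copy_pyspark(spark_home, site_packages=None):
    """Copy <spark_home>/python/pyspark into site-packages and return the target."""
    src = os.path.join(spark_home, "python", "pyspark")
    if site_packages is None:
        # site-packages directory of the interpreter running this script
        site_packages = sysconfig.get_paths()["purelib"]
    dst = os.path.join(site_packages, "pyspark")

    if not os.path.isdir(src):
        raise IOError("pyspark package not found at %s" % src)
    if os.path.exists(dst):
        return dst  # already present, do not overwrite
    shutil.copytree(src, dst)
    return dst


# Usage (hypothetical path, replace with your own Spark directory):
# copy_pyspark(r"D:\spark")
```

Run it with the Anaconda interpreter so that the default target is Anaconda's site-packages directory.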
5. Test code
import sys
from operator import add
from pyspark import SparkContext

# Path to the input text file
logFile = "D:\\bigdata\\workspace\\pycharmprojects\\machinelearning1\\word.txt"
# Run Spark locally with the application name "PythonWordCount"
sc = SparkContext("local", "PythonWordCount")
logData = sc.textFile(logFile).cache()
# Count the lines that contain the letter 'a' and the letter 'b'
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
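To sanity-check the numbers printed by the test program above, the same counts can be computed without Spark. This is only an illustrative sketch: `count_lines_containing` is a hypothetical helper and the sample lines are invented, not the contents of word.txt.

```python
def count_lines_containing(lines, needle):
    """Count how many lines contain the given substring."""
    return sum(1 for line in lines if needle in line)


# Invented sample data standing in for the lines of a text file.
sample = ["apple banana", "cherry", "blueberry", "kiwi"]

num_a = count_lines_containing(sample, "a")
num_b = count_lines_containing(sample, "b")
print("Lines with a: %i, lines with b: %i" % (num_a, num_b))
# prints: Lines with a: 1, lines with b: 2
```

Reading word.txt with plain `open()` and passing its lines to the helper should give the same two counts as the Spark job.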
PyCharm + Eclipse shared Anaconda data science environment