The specific problems:
- Different data analysts/development teams need different Python versions to run PySpark.
- Even under the same Python version, different teams need different sets of Python libraries, or even different versions of the same library.
One workaround for problem 2 is to package the Python dependencies into a *.egg file and load it with --py-files when running pyspark or spark-submit. The drawback of this approach is that many Python libraries contain native code whose compilation is platform-dependent, so for a complex dependency such as pandas the egg has to be built by hand, as the following steps show (see the sketch after the steps):
1. Download the pandas source from GitHub: https://codeload.github.com/pandas-dev/pandas/zip/master
2. Build the compiled egg: running python setup.py bdist_egg in the source directory creates an .egg file under dist/.
3. If GCC is needed to compile the native code, install it first:
   yum -y install gcc gcc-c++ kernel-devel
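
Putting the steps together, here is a minimal sketch of the whole flow. It assumes a CentOS/RHEL build host (hence yum) whose platform and Python version match the cluster nodes, and a hypothetical job script my_job.py; the exact egg filename depends on the pandas and Python versions used for the build, so the wildcard below is illustrative.

    # Step 3 first: toolchain for compiling pandas' native extensions
    yum -y install gcc gcc-c++ kernel-devel

    # Step 1: fetch the pandas source
    curl -L -o pandas.zip https://codeload.github.com/pandas-dev/pandas/zip/master
    unzip pandas.zip && cd pandas-master

    # Step 2: build the egg; the artifact is written to dist/
    python setup.py bdist_egg

    # Ship the egg with the job; --py-files distributes it to the executors
    spark-submit --py-files dist/pandas-*.egg my_job.py

Because the egg contains compiled native code, it only works on executors whose platform and Python version match the build host.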
References:
- http://blog.csdn.net/gongbi917/article/details/52369025
- http://blog.csdn.net/willdeamon/article/details/53159548