Overview
Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, eliminating the need to read from and write to HDFS. This makes Spark better suited to algorithms that require iterative MapReduce-style computation, such as data mining and machine learning. Since Spark provides a Python API and Python is the language I know best, this post describes how I configured Spark and what I learned along the way.
Configuration process
Step One:
Download the Scala package: go to http://www.scala-lang.org/, click Download to get Scala, and extract it into the current directory.
Download the JDK: go to http://www.oracle.com/technetwork/java/javase/downloads/index.html and download the latest JDK. For a 64-bit system download jdk-8u91-linux-x64.tar.gz (I downloaded version 8u91 on a 64-bit system); for a 32-bit system download jdk-8u91-linux-i586.tar.gz. Extract the archive into the current directory.
Download the Spark package: go to https://spark.apache.org/downloads.html, select the latest version (1.6.2 at the time of writing), and click Download. A sample extraction sequence is sketched after this list.
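Assuming the three archives have already been downloaded into the current directory, extracting them from a terminal looks roughly like the following; the file names are assumptions based on the versions mentioned above, so adjust them to whatever you actually downloaded.
tar -xzf jdk-8u91-linux-x64.tar.gz        # produces jdk1.8.0_91/
tar -xzf scala-2.11.8.tgz                 # produces scala-2.11.8/
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz    # produces spark-1.6.2-bin-hadoop2.6/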
Step Two:
1. Open a terminal window.
2. Run sudo -i to switch to a root shell.
3. Change to the directory containing the extracted files.
4. Move the extracted directories to /opt by running:
mv jdk1.8.0_91 /opt/jdk1.8.0_91
mv scala-2.11.8 /opt/scala-2.11.8
mv spark-1.6.2-bin-hadoop2.6 /opt/spark-hadoop
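To confirm the move worked, list /opt; the directory names below follow the versions used in this post.
ls /opt
# expected to include: jdk1.8.0_91  scala-2.11.8  spark-hadoop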
Step Three:
Configure the environment variables by editing /etc/profile. Run the following command:
sudo gedit /etc/profile
Add the following at the bottom of the file (adjust the version numbers to match what you installed):
#Setting JDK environment variables
export JAVA_HOME=/opt/jdk1.8.0_91
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin:$PATH
#Setting Scala environment variables
export SCALA_HOME=/opt/scala-2.11.8
export PATH=${SCALA_HOME}/bin:$PATH
#Setting Spark environment variables
export SPARK_HOME=/opt/spark-hadoop/
#PYTHONPATH: add the pyspark module shipped with Spark to the Python environment
export PYTHONPATH=/opt/spark-hadoop/python
Save the file. Rebooting the computer makes the /etc/profile changes take effect permanently; to apply them immediately in the current session, open a terminal window and run source /etc/profile.
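A quick sanity check is to confirm the variables are visible in a shell; these are the checks I would expect to pass with the versions used above.
source /etc/profile
java -version       # should report java version "1.8.0_91"
scala -version      # should report Scala 2.11.8
echo $SPARK_HOME    # should print /opt/spark-hadoop/
echo $PYTHONPATH    # should print /opt/spark-hadoop/python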
Step Four:
Test the installation results.
Open a terminal window and change to the Spark root directory (/opt/spark-hadoop).
Run ./bin/spark-shell to open the Scala shell connected to Spark.
If the shell starts without errors and shows the Spark prompt, the installation is correct.
Run ./bin/pyspark to open the Python shell connected to Spark.
If it likewise starts and shows the interactive prompt, the installation is correct.
Finally, test in PyCharm: if Spark's startup log output appears (PyCharm renders it as red text in the console), the configuration is successful.
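As one more end-to-end check, a bundled Python example can be submitted to the local Spark installation. The path below assumes the examples directory shipped with the 1.6.2 binary distribution; adjust it if your layout differs.
cd /opt/spark-hadoop
./bin/spark-submit examples/src/main/python/pi.py 10
# successful output ends with a line similar to: Pi is roughly 3.14...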
Reference: http://www.open-open.com/lib/view/open1432192407317.html