Apache Spark: Brief Introduction, Installation, and Use
Apache Spark Introduction
Apache Spark is a fast, general-purpose computing engine for distributed large-scale data processing. Distributed processing makes it possible to handle data sets that are too large for a single computer to process on its own.
Install and configure Apache Spark (Ubuntu Virtual Machine Under OS X)
When learning something new, it is best to work inside a virtual machine to avoid affecting your current development environment. My host system is OS X, so I installed VirtualBox and then installed Ubuntu inside the virtual machine. For details on installing VirtualBox and Ubuntu, see the tutorial: YouTube: Install Ubuntu in Mac with Virtual Box. During installation, allocate at least 4 GB of RAM and 20 GB of disk space; otherwise the virtual machine may run out of resources.
Install Anaconda
Anaconda is a collection of Python scientific-computing packages. In the example below, matplotlib is used to generate a bar chart. Download the installer from https://www.continuum.io/downloads, then run the following command in a terminal:
bash Anaconda2-4.1.1-Linux-x86_64.sh
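If you want to confirm that Anaconda's Python and the packages used later in this post are available, a quick check like the following should work (a minimal sketch; it only assumes numpy and matplotlib shipped with the Anaconda installer, and the file name is illustrative):

# check_anaconda.py - quick sanity check for the Anaconda environment (illustrative name)
import sys
import numpy
import matplotlib

print(sys.version)             # should mention Anaconda / Continuum Analytics
print(numpy.__version__)       # numpy is bundled with Anaconda
print(matplotlib.__version__)  # matplotlib is used later to draw the bar chart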
Install Java SDK
Spark runs on the JVM, so you also need to install a Java JDK:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Set JAVA_HOME
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
Install Spark
Download the compressed package from the official website, http://spark.apache.org/downloads.html, then extract it and remove the archive with the following commands:
$ tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
$ rm spark-2.0.0-bin-hadoop2.7.tgz
Enable IPython Notebook
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
Check whether the installation is successful (restart the terminal first)
cd ~/spark-2.0.0-bin-hadoop2.7
./bin/pyspark
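If everything is configured correctly, ./bin/pyspark should open an IPython/Jupyter Notebook in the browser with a SparkContext already available as sc. As a smoke test, you can run something like this in the first notebook cell (a minimal sketch; it assumes sc was created for you by pyspark, which is the default when launching the notebook this way):

# run in the first notebook cell
print(sc.version)                 # should print the Spark version, e.g. 2.0.0
rdd = sc.parallelize(range(100))  # distribute a small dataset
print(rdd.sum())                  # expect 4950 if Spark is working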
Simple use of Apache Spark
Start the Spark service (./bin/pyspark) and click New - Notebooks - Python to create a notebook file. In this small example, we read the contents of the NOTICE file in the Spark folder, compute the word frequencies, and finally generate a bar chart. The example is very simple, so the code and the final result are pasted directly:

# coding: utf-8

# In[1]:
import re
from operator import add

# In[13]:
file_in = sc.textFile("/home/carl/spark/NOTICE")

# In[3]:
words = file_in.flatMap(lambda line: re.split(' ', line.lower().strip()))

# In[4]:
words = words.filter(lambda w: len(w) > 3)

# In[5]:
words = words.map(lambda w: (w, 1))

# In[6]:
words = words.reduceByKey(add)

# In[7]:
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# In[8]:
words.take(15)

# In[9]:
get_ipython().magic(u'matplotlib inline')
import matplotlib.pyplot as plt

def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color="green")
    plt.yticks(range(len(count)), word)

# In[10]:
words = words.map(lambda x: (x[1], x[0]))

# In[11]:
words.take(15)

# In[12]:
histogram(words.take(15))
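The same word count can also be run outside the notebook as a standalone script submitted with spark-submit. The sketch below adapts the notebook code to the Spark 2.0 SparkSession entry point (the script name wordcount.py and the output formatting are my own illustrative choices, not part of the original example):

# wordcount.py - standalone adaptation of the notebook example (illustrative name)
import re
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("NoticeWordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("/home/carl/spark/NOTICE")
    counts = (lines.flatMap(lambda line: re.split(' ', line.lower().strip()))
                   .filter(lambda w: len(w) > 3)       # keep only words longer than 3 characters
                   .map(lambda w: (w, 1))
                   .reduceByKey(add)                   # sum the counts per word
                   .map(lambda x: (x[1], x[0]))
                   .sortByKey(False))                  # sort by count, descending

    for count, word in counts.take(15):
        print("%s\t%d" % (word, count))

    spark.stop()

Run it with ./bin/spark-submit wordcount.py from the Spark directory.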
Spark for Python Developers
In upcoming posts I will continue to share Spark-related knowledge based on this book. If you are interested, please follow this blog and leave a comment for discussion.
Bonus:
Download link for the electronic version of Spark for Python Developers: Spark for Python Developers.pdf. We are in the big data age; if you are interested in data processing, see my other series of posts: Using Python for Data Analysis (basic series).
If you are interested in web crawlers, see another article: Web Crawlers: Using the Scrapy framework to write a crawler service that scrapes book information.