Apache Spark: Brief Introduction, Installation, and Use

Apache Spark Introduction
Apache Spark is a high-speed, general-purpose computing engine for distributed large-scale data processing. By distributing work across a cluster, it can handle datasets that are too large for a single computer to process.
Install and configure Apache Spark (Ubuntu Virtual Machine Under OS X)
When learning new tools, it is best to work inside a virtual machine so you don't disturb your current development environment. My host system is OS X; I installed the VirtualBox virtual machine and then installed Ubuntu inside it. For details on installing VirtualBox, see this tutorial: YouTube: Install Ubuntu in Mac with Virtual Box. During installation, be sure to allocate 4 GB of RAM and 20 GB of disk space; otherwise the VM may run out of resources.
Install Anaconda
Anaconda is a distribution of Python scientific computing packages. In the example below, matplotlib is used to generate a bar chart. Download the installer from https://www.continuum.io/downloads, then run the following command in Terminal:

bash Anaconda2-4.1.1-Linux-x86_64.sh

 

Install the Java JDK
Spark runs on the JVM, so you also need to install a Java JDK:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Set JAVA_HOME
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
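After restarting Terminal (or sourcing .bashrc), you can sanity-check the Java setup. The helper below is a small stdlib-Python sketch, not part of the original article; it only reports what it finds, so it works whether or not Java is actually installed.

```python
import os
import shutil

def check_java_setup():
    """Report whether JAVA_HOME is set and the java binary is on PATH."""
    return {
        "java_home": os.environ.get("JAVA_HOME"),        # None if unset
        "java_on_path": shutil.which("java") is not None,
    }

print(check_java_setup())
```

If `java_on_path` is False after editing .bashrc, the usual cause is that the shell has not re-read the file yet.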

 

Install Spark
Download the compressed package from the official site, http://spark.apache.org/downloads.html, then unpack it and remove the archive with the following commands:
$ tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
$ rm spark-2.0.0-bin-hadoop2.7.tgz

 

Enable IPython Notebook
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

 

Check whether the installation succeeded (restart Terminal first):
cd ~/spark-2.0.0-bin-hadoop2.7
./bin/pyspark
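If the shell fails to start, the archive was usually unpacked somewhere other than the expected location. The helper below is a hypothetical stdlib-Python sketch (the directory name comes from the download step above) that locates the pyspark launcher:

```python
from pathlib import Path

def find_pyspark(home_dir, spark_dir="spark-2.0.0-bin-hadoop2.7"):
    """Return the path to the pyspark launcher if it exists, else None."""
    launcher = Path(home_dir) / spark_dir / "bin" / "pyspark"
    return launcher if launcher.is_file() else None
```

For example, `find_pyspark(Path.home())` returns the launcher path when Spark was unpacked into your home directory, and None otherwise.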

Simple Use of Apache Spark
Start the Spark service, then click New > Notebooks > Python to create a Notebook file. In this small example, we read the contents of the NOTICE file in the Spark folder, count word frequencies, and finally generate a chart. The example is very simple, so here is the code:

# coding: utf-8
# In[1]:
import re
from operator import add

# In[13]:
file_in = sc.textFile("/home/carl/spark/NOTICE")

# In[3]:
words = file_in.flatMap(lambda line: re.split(' ', line.lower().strip()))

# In[4]:
words = words.filter(lambda w: len(w) > 3)

# In[5]:
words = words.map(lambda w: (w, 1))

# In[6]:
words = words.reduceByKey(add)

# In[7]:
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# In[8]:
words.take(15)

# In[9]:
get_ipython().magic(u'matplotlib inline')
import matplotlib.pyplot as plt

def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color="green")
    plt.yticks(range(len(count)), word)

# In[10]:
words = words.map(lambda x: (x[1], x[0]))

# In[11]:
words.take(15)

# In[12]:
histogram(words.take(15))

In the book Spark for Python Developers, we will continue to share Spark-related knowledge; if you are interested, please follow this blog and leave a message for discussion. Bonus: a download link for the electronic version: Spark for Python Developers.pdf. We are in the big data age; if you are interested in data processing, see another series of essays: Using Python for Data Analysis.
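The Spark pipeline above maps cleanly onto plain Python, which makes the roles of flatMap, filter, map, and reduceByKey easier to see. The sketch below is my own stdlib-only equivalent (no Spark required); the function name word_count and the sample input are illustrative, not from the article.

```python
import re
from collections import Counter

def word_count(lines, top_n=15):
    """Mimic the notebook's pipeline: split on spaces, lowercase, drop words
    of length <= 3, count occurrences, return (count, word) pairs sorted
    with the highest counts first (like sortByKey(False))."""
    counts = Counter()
    for line in lines:                                   # sc.textFile -> lines
        for w in re.split(' ', line.lower().strip()):    # flatMap
            if len(w) > 3:                               # filter
                counts[w] += 1                           # map + reduceByKey(add)
    return sorted(((c, w) for w, c in counts.items()), reverse=True)[:top_n]

print(word_count(["Apache Spark Apache Spark Spark engine"]))
# -> [(3, 'spark'), (2, 'apache'), (1, 'engine')]
```

The main difference from the Spark version is that this runs on one machine in one pass, while Spark partitions the lines across the cluster and merges the per-partition counts in reduceByKey.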
If you are interested in web crawlers, see another article: Web Crawlers: Using the Scrapy Framework to Write a Crawler Service That Scrapes Book Information.
