Apache Spark: Brief Introduction, Installation, and Use
Apache Spark Introduction
Apache Spark is a fast, general-purpose computing engine for distributed large-scale data processing. Distributed processing makes it possible to handle data sets that are too large for a single computer to process on its own.
Install and configure Apache Spark (Ubuntu Virtual Machine Under OS X)
When learning something new, it is best to work inside a virtual machine to avoid affecting your current development environment. My host system is OS X, so I installed VirtualBox and then installed Ubuntu inside the virtual machine. For details on installing VirtualBox and Ubuntu, see the tutorial: YouTube: Install Ubuntu in Mac with Virtual Box. During installation, allocate at least 4 GB of RAM and 20 GB of disk space; otherwise the virtual machine may run out of resources.
Install Anaconda
Anaconda is a collection of Python scientific-computing packages. In the example below, matplotlib is used to generate a bar chart. Download the installer from https://www.continuum.io/downloads, then run the following command in a terminal:
bash Anaconda2-4.1.1-Linux-x86_64.sh
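If you want to confirm that Anaconda's Python and the packages used later in this post are available, a quick check like the following should work (a minimal sketch; it only assumes numpy and matplotlib shipped with the Anaconda installer, and the file name is illustrative):

# check_anaconda.py - quick sanity check for the Anaconda environment (illustrative name)
import sys
import numpy
import matplotlib

print(sys.version)             # should mention Anaconda / Continuum Analytics
print(numpy.__version__)       # numpy is bundled with Anaconda
print(matplotlib.__version__)  # matplotlib is used later to draw the bar chart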
Install Java SDK
Spark runs on the JVM, so you also need to install a Java JDK:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Set JAVA_HOME
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
Install Spark
Download the compressed package from the official website, http://spark.apache.org/downloads.html, then extract it and remove the archive with the following commands:
$ tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
$ rm spark-2.0.0-bin-hadoop2.7.tgz
Enable IPython Notebook
Open the .bashrc file:
gedit .bashrc
Add the following settings to .bashrc:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
Check whether the installation is successful (restart the terminal first)
cd ~/spark-2.0.0-bin-hadoop2.7
./bin/pyspark
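If everything is configured correctly, ./bin/pyspark should open an IPython/Jupyter Notebook in the browser with a SparkContext already available as sc. As a smoke test, you can run something like this in the first notebook cell (a minimal sketch; it assumes sc was created for you by pyspark, which is the default when launching the notebook this way):

# run in the first notebook cell
print(sc.version)                 # should print the Spark version, e.g. 2.0.0
rdd = sc.parallelize(range(100))  # distribute a small dataset
print(rdd.sum())                  # expect 4950 if Spark is working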
Simple use of Apache Spark
Start the Spark service (./bin/pyspark) and click New - Notebooks - Python to create a notebook file. In this small example, we read the contents of the NOTICE file in the Spark folder, compute the word frequencies, and finally generate a bar chart. The example is very simple, so the code and the final result are pasted directly:

# coding: utf-8

# In[1]:
import re
from operator import add

# In[13]:
file_in = sc.textFile("/home/carl/spark/NOTICE")

# In[3]:
words = file_in.flatMap(lambda line: re.split(' ', line.lower().strip()))

# In[4]:
words = words.filter(lambda w: len(w) > 3)

# In[5]:
words = words.map(lambda w: (w, 1))

# In[6]:
words = words.reduceByKey(add)

# In[7]:
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# In[8]:
words.take(15)

# In[9]:
get_ipython().magic(u'matplotlib inline')
import matplotlib.pyplot as plt

def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color="green")
    plt.yticks(range(len(count)), word)

# In[10]:
words = words.map(lambda x: (x[1], x[0]))

# In[11]:
words.take(15)

# In[12]:
histogram(words.take(15))
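The same word count can also be run outside the notebook as a standalone script submitted with spark-submit. The sketch below adapts the notebook code to the Spark 2.0 SparkSession entry point (the script name wordcount.py and the output formatting are my own illustrative choices, not part of the original example):

# wordcount.py - standalone adaptation of the notebook example (illustrative name)
import re
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("NoticeWordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("/home/carl/spark/NOTICE")
    counts = (lines.flatMap(lambda line: re.split(' ', line.lower().strip()))
                   .filter(lambda w: len(w) > 3)       # keep only words longer than 3 characters
                   .map(lambda w: (w, 1))
                   .reduceByKey(add)                   # sum the counts per word
                   .map(lambda x: (x[1], x[0]))
                   .sortByKey(False))                  # sort by count, descending

    for count, word in counts.take(15):
        print("%s\t%d" % (word, count))

    spark.stop()

Run it with ./bin/spark-submit wordcount.py from the Spark directory.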
Spark for Python Developers
In upcoming posts I will continue to share Spark-related knowledge based on this book. If you are interested, please follow this blog and leave a comment for discussion.
Bonus:
Download link for the electronic version of Spark for Python Developers: Spark for Python Developers.pdf. We are in the big data age; if you are interested in data processing, see my other series of posts: Using Python for Data Analysis (basic series).
If you are interested in web crawlers, see another article: Web Crawlers: Using the Scrapy framework to write a crawler service that scrapes book information.