Spark for Python Developers --- Build Spark Virtual Environment 3

Source: Internet
Author: User
Tags: cassandra, virtual environment, pyspark, oracle vm virtualbox, databricks

Build an Ubuntu machine on VirtualBox, install Anaconda, Java 8, Spark, and IPython Notebook, and run a word count example program as the "Hello World".

Build Spark Environment

In this section we learn to build a Spark environment:

    • Create an isolated development environment on an Ubuntu 14.04 virtual machine without affecting any existing systems
    • Install Spark 1.3.0 and its dependencies
    • Install the Anaconda Python 2.7 environment, which contains the required libraries such as Pandas, Scikit-learn, Blaze, and Bokeh, and use PySpark, which can be accessed via IPython Notebooks
    • Build the back end or data store of our environment, using MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database. Each type of storage serves a different purpose, depending on the data we need to process. The MySQL RDBMS makes it easy to query tabular information with SQL; if we handle a lot of JSON-type data obtained from various APIs, the simplest way is to store it in a document store; and for real-time and time-series information, Cassandra is the most appropriate columnar database.
The following figure gives a view of the environment we are going to build, which will be used throughout this book:
Build Ubuntu in Oracle VirtualBox

Setting up a VirtualBox environment that runs Ubuntu 14.04 is the safest way to build a development environment that avoids conflicts with existing libraries, and you can use similar commands to replicate your environment to the cloud.

To set up the Anaconda and Spark environment, we are going to create a VirtualBox VM that runs Ubuntu 14.04.
The steps are as follows:
1. Oracle VirtualBox can be downloaded free of charge from https://www.virtualbox.org/wiki/Downloads and installed directly.
2. After installing VirtualBox, open the Oracle VM VirtualBox Manager and click the New button.
3. Specify a name for the new VM and select the Linux type and the Ubuntu version.
4. Allocate enough memory (4 GB recommended) and hard disk space (20 GB recommended). Download the ISO file from the Ubuntu website; we use the Ubuntu 14.04.1 LTS release: http://www.ubuntu.com/download/desktop.
5. Once the installation is complete, install the VirtualBox Guest Additions: from the VirtualBox menu of the newly running VM, select Devices | Insert Guest Additions CD image. The installation may fail because the Windows host restricts the user interface.
6. Once the Guest Additions installation is complete, restart the VM and it is ready to use. It is very helpful to turn on the shared clipboard: select the VM, click Settings, then General | Advanced | Shared Clipboard, and select Bidirectional.

Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7 (the community needs it upgraded to Python 3.3). To install Anaconda, follow these steps:

1. Download the Anaconda installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.

2. After downloading the Anaconda installer, open a terminal in the download location and run the following command, replacing the 2.x.x in the command with the version number of the installer:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

3. After accepting the license agreement, you will be asked to confirm the installation path (the default is ~/anaconda).
4. After the self-extraction is complete, you need to add the Anaconda executable path to the PATH environment variable:

# add anaconda to PATH
export PATH="$HOME/anaconda/bin:$PATH"
Installing Java 8

Spark runs on top of the JVM, so we need to install the Java SDK (not just the JRE), since we will use it to build Spark applications. The recommended version is Java 7 or higher; Java 8 is the most suitable. To install Java 8, follow these steps:
1. Use the following commands to install Oracle Java 8:

# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

2. Set the JAVA_HOME environment variable and make sure the Java executable is on the PATH.
3. Check that JAVA_HOME is properly set:

# check JAVA_HOME
$ echo $JAVA_HOME
Install Spark

First, browse the Spark download page at http://spark.apache.org/downloads.html. It offers a variety of choices: earlier versions of Spark, different package types, and different download types. We choose the latest version, pre-built for Hadoop 2.6 and later. The simplest way to install Spark is to use the package pre-built for Hadoop 2.6 and later rather than compiling it from source, and then to move it into the ~/spark directory under the home directory. Download the latest version, Spark 1.5.2, released on November 9, 2015:
1. Select Spark release 1.5.2 (Nov 09 2015),
2. Select the package type Pre-built for Hadoop 2.6 and later,
3. Select the download type Direct Download,
4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz,
5. Verify the 1.5.2 signatures and checksums. The download can also be done from the command line:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz

Next, we extract the archive and clean up:

# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark

Now we are able to run the Python interpreter for Spark:

# run spark
$ cd ~/spark
$ ./bin/pyspark

You should see something like this:

The interpreter has already provided us with a SparkContext object, sc, which we can inspect:

<pyspark.context.SparkContext object at 0x7f34b61c4e50>
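
As a quick check that sc is live, you can run a small job directly in the PySpark shell. The following is only a minimal sketch (the numbers are arbitrary and not part of the book's example):

# hypothetical sanity check using the sc object provided by the PySpark shell
rdd = sc.parallelize(range(10))            # distribute a small list as an RDD
print(rdd.count())                         # expect: 10
print(rdd.map(lambda x: x * x).take(3))    # expect: [0, 1, 4]
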
Using IPython Notebook

The IPython Notebook offers a more user-friendly experience than the console. You can start IPython Notebook with the following command:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Launch PySpark with IPython Notebook in the directory examples/AN_Spark where the notebooks (ipynb files) are stored, or in the directory where your Jupyter or IPython Notebooks are installed:

# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
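
Since the launch command loads the com.databricks:spark-csv package, CSV files can then be read into a DataFrame from inside the notebook. The following is a minimal, hypothetical sketch (the file name data.csv is a placeholder, not a dataset from the book), using the sqlContext object that the PySpark shell provides:

# hypothetical example: load a CSV file with the spark-csv package (Spark 1.5.x)
df = sqlContext.read.format('com.databricks.spark.csv') \
                    .options(header='true', inferschema='true') \
                    .load('data.csv')    # 'data.csv' is a placeholder path
df.printSchema()
df.show(5)
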
The first application built with PySpark

We have checked that everything is working properly, so it is obligatory to use word count as the first experiment of this book:

# Word count on 1st Chapter of the book using PySpark
# import the regular expression module
import re
# import add from the operator module
from operator import add

# read the input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py 20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up the lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))

# keep words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))

# reduce phase - sum the counts of all the words
words = words.reduceByKey(add)

In this program, we first read the file from the directory /home/an/Documents/A00_Documents/Spark4Py 20150315 into file_in. We then count the number of lines in the file and compute the number of characters in the file by adding up the length of each line.

We split the file into words and convert them to lowercase. For the purpose of counting words, we keep only words of more than three characters in order to avoid high-frequency words such as the, and, and for; these are generally considered stop words and should be filtered out by a language processing task. At this stage we prepare the MapReduce steps: each word is mapped to a value of 1, and the values are summed to obtain the number of occurrences of each unique word. This is the code as described in the IPython Notebook; the first ten cells preprocess the word count on the dataset, which is read from a local file.
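
To make the map and reduce steps concrete, here is a minimal, self-contained sketch of the same pattern applied to an in-memory list (the sample words are made up and are not part of the book's dataset; sc is the SparkContext of the shell):

from operator import add

# toy data standing in for the words extracted from the file
sample = ['spark', 'data', 'spark', 'anaconda', 'data', 'spark']

# map each word to (word, 1), then sum the 1s per word
pairs = sc.parallelize(sample).map(lambda w: (w, 1))
counts = pairs.reduceByKey(add)

print(counts.collect())
# e.g. [('data', 2), ('anaconda', 1), ('spark', 3)]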

The word frequency tuples are swapped into the (count, word) format so that count becomes the key of the tuple for sorting:

# create tuple (count, word) and sort in descending order
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# take top 20 words by frequency
words.take(20)

Further on, we create the (count, word) tuples in order to display the top 20 words in descending order of frequency:

Generate the histogram:

# create function for histogram of the most frequent words
%matplotlib inline
import matplotlib.pyplot as plt

def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# change the order of the tuples from (count, word) to (word, count)
words = words.map(lambda x: (x[1], x[0]))
words.take(20)

# display histogram
histogram(words.take(20))

We can see the high-frequency words drawn in the form of a histogram; note that we have exchanged the original (count, word) tuples for (word, count):

So we have also checked that the high-frequency words of this chapter are Spark, Data, and Anaconda.

