Spark for Python Developers --- Build Spark Virtual Environment 3


Build an Ubuntu machine on VirtualBox, install Anaconda, Java 8, Spark, and IPython Notebook, and run a word count example program as our Hello World.

Build Spark Environment

In this section, we will learn to build a Spark environment:

    • Create an isolated development environment on an Ubuntu 14.04 virtual machine without affecting any existing systems
    • Install Spark 1.5.2 and its dependencies
    • Install the Anaconda Python 2.7 environment, which contains the required libraries such as pandas, scikit-learn, Blaze, and Bokeh, and enable PySpark, which will be accessed through IPython Notebooks
    • Build the back end, or data store, of our environment, using MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database. Each type of storage serves a different purpose, depending on the data to be processed: the MySQL RDBMS makes it easy to query table information with SQL; if we handle a lot of JSON-type data obtained from various APIs, the simplest approach is to store it in a document store such as MongoDB; and for real-time and time-series information, Cassandra is the most appropriate columnar database.
      The following figure gives a view of the environment we will build and use throughout this book:
      [Figure: overview of the development environment]
Build Ubuntu in Oracle VirtualBox

Setting up a VirtualBox environment that runs Ubuntu 14.04 is the safest way to build a development environment that avoids conflicts with existing libraries, and you can use similar commands to replicate your environment to the cloud.

To build the Anaconda and Spark environment, we will create a VirtualBox VM that runs Ubuntu 14.04.
The steps are as follows:
1. Oracle VirtualBox can be downloaded free of charge from https://www.virtualbox.org/wiki/Downloads and installed directly.
2. After installing VirtualBox, open the Oracle VM VirtualBox Manager and click the New button.
3. Specify a name for the new VM, and select the Linux type and the Ubuntu version.
4. Allocate enough memory (4 GB recommended) and hard disk space (20 GB recommended), and download the ISO file from the Ubuntu website. We use the Ubuntu 14.04.1 LTS release, available at http://www.ubuntu.com/download/desktop (a command-line sketch using VBoxManage follows this list).
5. Once the installation is complete, you can install the VirtualBox Guest Additions: from the VirtualBox menu, with the new VM running, select Devices | Insert Guest Additions CD image. The installation may fail because the Windows system restricts the user interface.
6. Once the Guest Additions installation is complete, restart the VM and it is ready to use. It is very helpful to turn on the shared clipboard: select the VM, click Settings, then General | Advanced | Shared Clipboard, and select Bidirectional.
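If you prefer to script the VM creation rather than click through the GUI, steps 3 and 4 can be approximated with the VBoxManage command-line tool that ships with VirtualBox. This is only a sketch: the VM name, disk file name, and ISO file name below are illustrative assumptions, while the memory and disk sizes are the values recommended above.

# create and register the VM (name and OS type are illustrative)
VBoxManage createvm --name "ubuntu14" --ostype Ubuntu_64 --register

# allocate the recommended 4 GB of memory
VBoxManage modifyvm "ubuntu14" --memory 4096

# create the recommended 20 GB virtual disk and attach it to a SATA controller
VBoxManage createhd --filename ubuntu14.vdi --size 20480
VBoxManage storagectl "ubuntu14" --name "SATA" --add sata
VBoxManage storageattach "ubuntu14" --storagectl "SATA" --port 0 --device 0 --type hdd --medium ubuntu14.vdi

# attach the downloaded Ubuntu 14.04.1 ISO to an IDE controller so the VM boots the installer
VBoxManage storagectl "ubuntu14" --name "IDE" --add ide
VBoxManage storageattach "ubuntu14" --storagectl "IDE" --port 0 --device 0 --type dvddrive --medium ubuntu-14.04.1-desktop-amd64.iso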

Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7 (the community still needs to upgrade it to Python 3.3). To install Anaconda, follow these steps:
1. Download the Anaconda installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.

2. After downloading the Anaconda installer, open a terminal at the download location and run the following command, replacing 2.x.x with the version number of your installer:

    # install anaconda 2.x.x
    bash Anaconda-2.x.x-Linux-x86[_64].sh

3. After accepting the license agreement, you will be asked to confirm the installation path (the default is ~/anaconda).
4. After the self-extraction completes, you need to add the Anaconda executable path to the PATH environment variable:

# add anaconda to PATH
export PATH="$HOME/anaconda/bin:$PATH"
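To make the PATH change permanent and confirm that the Anaconda interpreter is the one being picked up, something along these lines can be used (assuming the default ~/anaconda install location from step 3):

# persist the PATH change and reload the shell configuration
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# both commands should now report the Anaconda-provided versions
conda --version
python --version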
Installing Java 8

Spark runs on top of the JVM, so we need to install the Java SDK (JDK), not just the JRE, since we will build Spark applications on it. The recommended version is Java 7 or higher; Java 8 is the most suitable. To install Java 8 as a package, follow these steps:

1. Install Oracle Java 8 with the following commands:

# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

2. Set the JAVA_HOME environment variable and make sure the Java executables are on the PATH (see the sketch after this list).
3. Check that JAVA_HOME is properly set:

# check JAVA_HOME
$ echo $JAVA_HOME
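As a sketch of step 2: the oracle-java8-installer package from the webupd8team PPA normally installs the JDK under /usr/lib/jvm/java-8-oracle (this path is an assumption; verify it on your machine), so the environment can be set up and checked like this:

# set JAVA_HOME to the usual Oracle Java 8 location and put its binaries on the PATH
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$JAVA_HOME/bin:$PATH

# confirm the JDK is visible
echo $JAVA_HOME
java -version
javac -version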
Install Spark

First, browse Spark's download page at http://spark.apache.org/downloads.html. It offers several choices: earlier versions of Spark, different package types, and different download types. We choose the latest release, pre-built for Hadoop 2.6 and later. The simplest way to install Spark is to use the package prebuilt for Hadoop 2.6 and later rather than compiling from source, and then to move the files under ~/spark in the home directory. Download the latest version, Spark 1.5.2, released on November 9, 2015:
1. Select the Spark release 1.5.2 (Nov 09 2015).
2. Select the package type Pre-built for Hadoop 2.6 and later.
3. Select the download type Direct Download.
4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz.
5. Verify the download against its signatures and checksums (a checksum sketch follows the download command); it can also be fetched by running:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
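As step 5 suggests, the download should be verified. A minimal check is to compute a checksum locally and compare it with the value published on the Spark download page (the signature files published there allow a stronger GPG verification):

# compute the MD5 checksum and compare it with the published value
$ md5sum spark-1.5.2-bin-hadoop2.6.tgz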

Next, we extract the archive, clean up, and move the unzipped files into place:

# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark

Now we are able to run the Python interpreter for Spark:

# run spark
$ cd ~/spark
$ ./bin/pyspark

You should see the PySpark welcome banner followed by the interactive prompt.

The interpreter provides a Spark context object, sc, which we can inspect:

>>> sc
<pyspark.context.SparkContext object at 0x7f34b61c4e50>
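As a quick sanity check that the context is functional, a small, purely illustrative job can be run directly in the shell (the RDD and the numbers here are not from the book):

# tiny illustrative job using the sc object provided by the shell
rdd = sc.parallelize(range(1000))
# count the even numbers; 500 is expected
print(rdd.filter(lambda x: x % 2 == 0).count())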
Using IPython Notebook

The IPython Notebook offers a more user-friendly experience than the console. You can start IPython Notebook with the following command:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Alternatively, launch PySpark with IPython Notebook from the directory examples/AN_Spark, where the Jupyter or IPython notebooks are stored:

# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
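Once the notebook is running with the spark-csv package loaded, the package is normally used through the DataFrame reader. The following is a minimal, hypothetical example; the file path and the data it points to are placeholders, not part of the book:

# read a CSV file into a DataFrame using the spark-csv package loaded above
# (the path below is a placeholder)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('/home/an/data/sample.csv')

# inspect the inferred schema and the first rows
df.printSchema()
df.show(5)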
The first application built on PySpark

Now that we have checked that everything is working properly, it is obligatory to use word count as the first experiment of this book:

# Word count on 1st Chapter of the book using PySpark

# import the regex module
import re
# import add from the operator module
from operator import add

# read the input file
file_in = sc.textFile('/home/an/documents/a00_documents/spark4py 20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up the lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))

# keep words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))

# reduce phase - sum the counts of all the words
words = words.reduceByKey(add)

In this program, we first read the file from the directory /home/an/documents/a00_documents/spark4py 20150315 into file_in. We then calculate the number of lines and the number of characters in the file.

We split the file into words and convert them to lowercase. For the purpose of counting words, we keep only words longer than three characters, to avoid high-frequency words such as the, and, and for. Generally, these are considered stop words and should be filtered out by any language-processing task. At this stage we prepare the MapReduce steps: each word is mapped to the value 1, and the values are summed to obtain the number of occurrences of each unique word. This is what the code in the IPython Notebook describes; the first ten cells preprocess the word count on the dataset, which is read from a local file.

The word-frequency tuples are swapped into (count, word) format so that count can be used as the key when sorting the tuples:

 # create tuple (count, word) and sort in descending order
 words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

 # take top 20 words by frequency
 words.take(20)

Here, we created (count, word) tuples so as to display the 20 most frequent words in descending order of word frequency:

Generate the histogram:

# create function for histogram of the most frequent words
%matplotlib inline
import matplotlib.pyplot as plt

def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# change the order of the tuples from (count, word) to (word, count)
words = words.map(lambda x: (x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))

We can see the high-frequency words plotted as a histogram; note that we have swapped the tuples from (count, word) back to (word, count):

As expected, the high-frequency words of this chapter are Spark, Data, and Anaconda.

