Spark provides an interactive shell that makes ad hoc data analysis easy. If you have used the shells of R, Python, or Scala, or an operating-system shell such as Bash or the Windows command prompt, Spark's shell will feel familiar.
Unlike most other shells, however, which only let you manipulate data on the disk or in the memory of a single machine, Spark's shells let you work with data that is distributed on disk or in memory across many machines, and Spark takes care of scheduling those operations across the cluster.
Because Spark can load data into memory on the worker nodes, many distributed computations, even ones spread across many machines and touching terabytes of data, can run in a matter of seconds. This makes the Spark shell well suited to iterative, ad hoc, and exploratory analysis. Spark provides both a Python shell and a Scala shell, which have been enhanced to support connecting to a cluster.
Tip
The Spark shell is available only in Python and Scala. However, because a shell is very useful for learning the API, we recommend running these examples in one of these two languages even if you are a Java developer; the API is similar in every language.
The simplest way to demonstrate the power of the Spark shell is to use it for some simple data analysis. Let's walk through an example from the Quick Start Guide in the official documentation.
The first step is to open a shell. To open the Python version of the Spark shell (also called PySpark), go into your Spark directory and type:
bin/pyspark
(On Windows, type: bin\pyspark)
To open the Scala version of the shell, type:
bin/spark-shell
The shell prompt should appear within a few seconds. When the shell starts, you will see a lot of log messages; you may need to press Enter to clear the log output and get to the shell prompt. Figure 2-1 shows what PySpark looks like when you open it.
Figure 2-1
You may find the log messages printed in the shell distracting. You can reduce the verbosity of the logging by creating a file called log4j.properties in the conf directory. The conf directory already contains a template for this file, called log4j.properties.template. To make the logging less verbose, make a copy of log4j.properties.template named log4j.properties, then find the following line in the file:
log4j.rootCategory=INFO, console
Then lower the log level by changing the line to look like this:
log4j.rootCategory=WARN, console
This way you will see only WARN and higher-level log messages.
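If you prefer the command line, you can make this copy from your Spark directory with a single command (a minimal sketch assuming a Unix-like shell; the file names are the ones mentioned above):
cp conf/log4j.properties.template conf/log4j.properties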
When you open the shell again, you should see less log output (Figure 2-2).
Figure 2-2
Using IPython
IPython is an enhanced Python shell, loved by many Python users, that offers features such as tab completion. You can find installation instructions at http://ipython.org. You can use IPython with Spark by setting the IPYTHON environment variable:
IPYTHON=1 ./bin/pyspark
To use the IPython Notebook (a web-browser-based version of IPython), use the following command:
IPYTHON_OPTS="notebook" ./bin/pyspark
On Windows, set the environment variable and then run the shell with these commands:
set IPYTHON=1
bin\pyspark
In Spark, we express our computations by performing operations on distributed datasets that are executed in parallel on the cluster. These distributed datasets are called resilient distributed datasets, or RDDs. RDDs are Spark's fundamental abstraction for distributed data and computation.
Before we say more about RDDs, let's create one in the shell from a local text file and do some very simple ad hoc analysis on it, as shown in Example 2-1 (Python) and Example 2-2 (Scala):
Example 2-1. Python line count
>>> lines = sc.textFile("README.md") # Create an RDD called lines
>>> lines.count() # Count the number of elements in this RDD
127
>>> lines.first() # Get the first element in this RDD
u'# Apache Spark'
Example 2-2. Scala line count
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> lines.count() // Count the number of elements in this RDD
res0: Long = 127
scala> lines.first() // Get the first element in this RDD
res1: String = # Apache Spark
To exit either shell, press Ctrl-D.
Tip
We will say more about analyzing data on a cluster in Chapter 7, but you may already have noticed this message in the log output when you started the shell:
INFO SparkUI: Started SparkUI at http://[ipaddress]:4040
Here you can access the Spark UI and see all sorts of information about your tasks and cluster.
In Examples 2-1 and 2-2, the variable lines is an RDD created from a text file on our local machine. We can run various parallel operations on this RDD, such as counting the number of elements in the dataset (here, the number of lines of text in the file) or printing the first element. We will dig much deeper into RDDs in later chapters, but before going any further, let's take a moment to introduce Spark's basic concepts.
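To give a taste of other parallel operations, the snippet below continues Example 2-1 in the Python shell by filtering lines down to those containing the word "Python" (a minimal sketch; the search word and the pythonLines name are our own illustration, not part of the examples above):
>>> pythonLines = lines.filter(lambda line: "Python" in line) # New RDD holding only the matching lines
>>> pythonLines.first() # Get the first matching line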