Getting Started with Spark

What is Spark

Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at UC Berkeley's AMPLab and became one of Apache's open-source projects in 2010.

Compared to other big data and MapReduce technologies such as Hadoop and Storm, Spark has the following advantages.

First, Spark gives us a comprehensive, unified framework for managing big data processing requirements across datasets that are diverse in nature (text data, graph data, and so on) and in source (batch data or real-time streaming data).

Spark can run applications in a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster on disk.

Spark lets developers quickly write programs in Java, Scala, or Python. It comes with a built-in set of more than 80 high-level operators, and you can also use it to query data interactively from the shell.

In addition to map and reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities individually or combine them in a single data pipeline use case.

In the first part of this Apache Spark article series, we'll look at what Spark is, how it compares to a typical MapReduce solution, and how it provides a complete set of tools for big data processing.

Hadoop and Spark

Hadoop, the big data processing technology, has been around for about 10 years and is considered the solution of choice for processing large datasets. MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow requires a map phase and a reduce phase, and to take advantage of this solution you need to convert every use case into the MapReduce pattern.

The job output data from each step has to be stored in the distributed file system before the next step can begin, so this approach tends to be slow because of replication and disk storage. Hadoop solutions also typically involve clusters that are hard to set up and manage, and they require integrating several different tools for different big data use cases (such as Mahout for machine learning and Storm for streaming data processing).

If you want to do something more complex, you have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs has high latency, and the next job can start only after the previous one has completed.

Spark, on the other hand, allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
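
As a minimal sketch of such a multi-step pipeline (assuming the Scala shell, where sc is the preconfigured SparkContext, and a hypothetical comma-separated events.log whose first field is a log level and second field a user id), two different jobs can reuse the same cached intermediate RDD:

val events = sc.textFile("events.log")                    // hypothetical comma-separated input file
val parsed = events.map(_.split(",")).cache()             // intermediate result kept in memory
val errorCount = parsed.filter(_(0) == "ERROR").count()   // first job
val userCount  = parsed.map(_(1)).distinct().count()      // second job reuses the same cached data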

Spark runs on top of the existing Hadoop Distributed File System (HDFS) infrastructure to provide additional functionality. It supports deploying Spark applications to an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), to a Hadoop v2 YARN cluster, or even to Apache Mesos.

We should think of Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop. The intent is not to replace Hadoop, but to provide a comprehensive and unified solution for managing different big data use cases and requirements.

Spark Features

Spark takes MapReduce to the next level by using a less expensive shuffle during data processing. With in-memory data storage and near real-time processing capability, Spark can run many times faster than other big data processing technologies.

Spark also supports lazy evaluation of big data queries, which helps optimize the steps in a data processing workflow. It provides a higher-level API to improve developer productivity, as well as a consistent architectural model for big data solutions.

Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when you need to work on the same dataset multiple times. It is designed to be an execution engine that works both in memory and on disk: Spark operators perform external operations when the data does not fit in memory, so Spark can be used to process datasets larger than the aggregate memory of the cluster.

Spark tries to keep as much data as possible in memory and then spills to disk; it can store part of a dataset in memory and the rest on disk. Developers have to evaluate the memory requirements based on their data and use cases. This in-memory data storage is where Spark's performance advantage comes from.
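
As an illustrative sketch (assuming the Scala shell and a hypothetical input file), the persist method with the MEMORY_AND_DISK storage level is one way to express this memory-plus-spill-to-disk behavior explicitly:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("big-dataset.txt")      // hypothetical large input file
logs.persist(StorageLevel.MEMORY_AND_DISK)     // keep what fits in memory, spill the rest to disk
logs.count()                                   // the first action materializes (and caches) the data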

Other features of Spark include:

    • Supports more than just map and reduce functions.
    • Optimizes arbitrary operator graphs.
    • Lazy evaluation of big data queries, which helps optimize the overall data processing workflow.
    • Provides concise, consistent APIs in Scala, Java, and Python.
    • Offers interactive Scala and Python shells; an interactive shell is not currently available for Java.

Spark is written in the Scala programming language and runs on the Java Virtual Machine (JVM). Spark applications can currently be written in the following programming languages:

    • Scala
    • Java
    • Python
    • Clojure
    • R

Spark Ecosystem

In addition to the Spark core API, the Spark ecosystem includes additional libraries that provide extra capabilities for big data analytics and machine learning.

These libraries include:

    • Spark Streaming:
      • Spark Streaming can be used for processing real-time streaming data, based on micro-batch computing and processing. It uses the DStream, which is basically a series of resilient distributed datasets (RDDs), to process real-time data.
    • Spark SQL:
      • Spark SQL can expose Spark datasets over a JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. Users can also use Spark SQL to ETL data from different formats (such as JSON, Parquet, or a database), transform it, and expose it for ad-hoc querying. A minimal query sketch appears after this list.
    • Spark MLlib:
      • MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including binary classification, linear regression, clustering, collaborative filtering, gradient descent, and underlying optimization primitives.
    • Spark GraphX:
      • GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. It extends the Spark RDD by introducing the Resilient Distributed Property Graph, a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
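
The following is a minimal Spark SQL sketch against the Spark 1.2-era API used in this article; the people.json file and its name and age fields are hypothetical:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("people.json")     // hypothetical JSON file with name and age fields
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)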

In addition to these libraries, there are others such as BlinkDB and Tachyon.

BlinkDB is an approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade query accuracy for response time, operating on large datasets by running queries on data samples and presenting results annotated with meaningful error bars.

Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. It caches working-set files in memory, so that datasets that are read frequently do not have to be loaded from disk. With this mechanism, different jobs/queries and frameworks can access cached files at memory speed.

There are also integration adapters with other products, such as Cassandra (Spark Cassandra Connector) and R (SparkR). With the Cassandra Connector, you can use Spark to access data stored in a Cassandra database and perform data analytics on that data.

Figure 1 below shows how these different libraries in the Spark ecosystem relate to each other.

Figure 1. Libraries in the Spark framework

We'll explore these Spark libraries step by step in this series of articles.

Spark Architecture

The Spark architecture includes the following three main components:

    • Data storage
    • API
    • Resource management

Let's look at these components in more detail next.

Data storage:

Spark uses the HDFS file system for data storage. It works with any Hadoop-compatible data source, including HDFS, HBase, and Cassandra.

API:

The API provides application developers with a way to create Spark-based applications using a standard API interface. Spark provides APIs for three programming languages: Scala, Java, and Python.

Below are links to the website documentation for the Spark API in each of the three languages.

    • Scala API
    • Java
    • Python

Resource management:

Spark can be deployed as a standalone server or on a distributed computing framework such as Mesos or YARN.

Figure 2 below shows the various components of the Spark architecture model.

Figure 2 Spark System Architecture

Resilient Distributed Datasets

Resilient Distributed Dataset (based on Matei Zaharia's research paper), or RDD, is the core concept in the Spark framework. You can think of an RDD as a table in a database that can hold any type of data. Spark stores the data of an RDD in different partitions.

RDDs help with rearranging computations and optimizing the data processing workflow.

RDDs are also fault tolerant, because an RDD knows how to recreate and recompute its dataset.

RDDs are immutable. You can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD stays the same.
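
A quick sketch in the Scala shell (with made-up values) shows that a transformation returns a new RDD and leaves the original untouched:

val nums = sc.parallelize(Array(1, 2, 3, 4))
val doubled = nums.map(_ * 2)   // a new RDD; nums itself is unchanged
nums.collect()                  // Array(1, 2, 3, 4)
doubled.collect()               // Array(2, 4, 6, 8)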

The RDD supports two types of operations:

    • Transformation
    • Action

Transformation: a transformation returns a new RDD, not a single value. Nothing gets evaluated when a transformation method is called; it just takes an RDD as a parameter and returns a new RDD.

Transformation functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

Action: an action operation evaluates and returns a new value. When an action function is called on an RDD object, all of the data processing queries are computed at that time and the result value is returned.

Action functions include reduce, collect, count, first, take, countByKey, and foreach.
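
A small sketch of the difference, using a hypothetical input file: the filter transformation only describes the computation, and nothing runs until the count action is called.

val lines = sc.textFile("data.txt")               // hypothetical input file
val errors = lines.filter(_.contains("ERROR"))    // transformation: nothing is computed yet
val numErrors = errors.count()                    // action: triggers the actual computation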

How to install Spark

There are a few different ways to install and use Spark. You can install it on your computer as a standalone framework, get a Spark virtual machine image from a vendor such as Cloudera, Hortonworks, or MapR, or use Spark installed and configured in a cloud environment such as Databricks Cloud.

In this article, we will install Spark as a standalone framework and launch it locally. Spark recently released version 1.2.0, and we will use this version for the sample application code demonstration.

How to Run Spark

Whether you install Spark on your local machine or use a cloud-based Spark, there are several different modes for connecting to the Spark engine.

Different Spark run modes require different master URL parameters, as listed below.
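
The commonly used master URL values, per the standard Spark 1.2 documentation, are:

    • local — run Spark locally with one worker thread (no parallelism at all).
    • local[K] — run Spark locally with K worker threads (ideally, set K to the number of cores on your machine).
    • local[*] — run Spark locally with as many worker threads as there are logical cores on your machine.
    • spark://HOST:PORT — connect to a Spark standalone cluster master; the port defaults to 7077.
    • mesos://HOST:PORT — connect to a Mesos cluster; the port defaults to 5050.
    • yarn-client or yarn-cluster — connect to a YARN cluster in client or cluster mode; the cluster location is taken from the Hadoop configuration.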

How to interact with Spark

Once Spark is up and running, you can connect to it using the Spark shell for interactive data analysis. The Spark shell is available in two languages, Scala and Python. Java does not support an interactive shell, so this feature is currently not available in Java.

You can launch the Scala and Python versions of the Spark shell with the spark-shell.cmd and pyspark.cmd commands, respectively.

Spark Web Console

No matter which mode Spark runs in, you can view the Spark job results and other statistics by visiting the Spark web console at the following URL:

http://localhost:4040

The Spark console, shown in Figure 3, has four tabs: Stages, Storage, Environment, and Executors.


Figure 3. Spark Web Console

Shared variables

Spark provides two types of shared variables to make Spark programs in a clustered environment more efficient: broadcast variables and accumulators.

Broadcast variables: broadcast variables allow you to keep a read-only variable cached on each machine rather than shipping a copy of it with every task. They can be used to give every node a copy of a large input dataset in an efficient manner.

The following code snippet shows how to use a broadcast variable.

// Broadcast variables
val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar.value

Accumulators: accumulators are variables that are only added to through an associative operation, and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Tasks running on the cluster can add to an accumulator variable using the add method (or the += operator); however, they cannot read its value. Only the driver program can read the accumulator's value.

The following code snippet shows how to share a variable using an accumulator:

// Accumulators
val accum = sc.accumulator(0, "My Accumulator")

sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

Spark Application Example

The sample application in this article is a simple word count application, the same example used when learning big data processing with Hadoop. We will run some data analysis queries on a text file. The text file and the dataset in this example are small, but the same Spark queries can be used on large datasets without modifying any code.

To keep the discussion as simple as possible, we'll use the Spark Scala shell.

First, let's look at how to install spark on your own computer.

Prerequisites:

    • For Spark to work on your machine, you need the Java Development Kit (JDK) installed. This is covered in the first step below.
    • You also need the Spark software installed on your computer. The second step below shows how to do that.

Note: the following instructions use a Windows environment as an example. If you use a different operating system, you will need to adjust the system variables and directory paths to match your environment.

I. Installing the JDK

1) Download the JDK from the Oracle website. JDK version 1.7 is recommended.

Install the JDK in a directory whose path contains no spaces. Windows users should install the JDK in a folder like c:\dev, not under "C:\Program Files": the "C:\Program Files" folder name contains a space, which can cause problems when software is installed there.

Note: do not install the JDK or the Spark software (described in step two) in the "C:\Program Files" folder.

2) After installing the JDK, verify the installation by switching to the "bin" folder under the JDK 1.7 directory and typing the following command:

java -version

If the JDK is installed correctly, the above command will show the Java version.

II. Install the Spark software:

Download the latest Spark version from the Spark website. At the time of writing, the latest version is 1.2. You can choose a specific Spark distribution based on your Hadoop version. I downloaded the Spark package for Hadoop 2.4 or later, and the file name is spark-1.2.0-bin-hadoop2.4.tgz.

Unpack the installation file to a local folder (for example, c:\dev).

To verify the Spark installation, navigate to the Spark folder and launch the Spark shell with the following commands. These are Windows commands; if you are using Linux or Mac OS, edit them to run correctly on your platform.

c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\spark-shell

If Spark is installed correctly, you will be able to see the following information in the console output.

....
15/01/17 23:17:46 INFO HttpServer: Starting HTTP Server
15/01/17 23:17:46 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
....
15/01/17 23:17:53 INFO BlockManagerMaster: Registered BlockManager
15/01/17 23:17:53 INFO SparkILoop: Created spark context.
Spark context available as sc.

You can type the following commands to check whether the Spark shell is working properly.

sc.version

Or

sc.appName

After you complete the steps above, you can exit the Spark shell window by typing the following command:

:quit

If you want to launch the Spark Python shell, you need Python installed on your machine. You can download and install Anaconda, a free Python distribution that includes several popular Python packages for science, math, engineering, and data analysis.

You can then start the Spark Python Shell by running the following command:

c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\pyspark

Spark Sample App

Once Spark is installed and started, you can run data analytics queries with the Spark API.

These are simple commands to read data from a text file and process it. In a follow-up article in this series, we'll introduce more advanced use cases of the Spark framework.

First, let's use the Spark API to run the popular word count example. If you are not already running the Spark Scala shell, open a new Scala shell window. The commands for this example are as follows:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val txtFile = "README.md"
val txtData = sc.textFile(txtFile)
txtData.cache()

We can call the cache function to save the RDD object produced in the previous step, so that Spark does not have to recompute it every time we query the data. Note that cache() is a lazy operation: Spark does not store the data in memory immediately when we call it. That happens only when an action is invoked on the RDD.

Now we can call the count function to see how many lines of data the text file has.

txtData.count()

We can then run the following commands to perform the word count. The count shows up next to each word in the text file.

val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

wcData.collect().foreach(println)

If you want to see more code examples of how to use the Spark core API, refer to the Spark documentation on its website.

Follow-up plan

In the following articles in this series, we'll start with Spark SQL and learn more about the other parts of the Spark ecosystem. After that, we'll move on to Spark Streaming, Spark MLlib, and Spark GraphX. We will also get a chance to look at frameworks like Tachyon and BlinkDB.

Summary

In this article, we looked at how the Apache Spark framework helps with big data processing and analytics through its standard API. We also compared Spark with a traditional MapReduce implementation such as Apache Hadoop. Spark is based on the same HDFS file storage system as Hadoop, so if you already have a large investment and infrastructure in Hadoop, you can use Spark and MapReduce together.

In addition, Spark processing can be combined with Spark SQL, machine learning, and Spark Streaming, which we will cover in follow-up articles.

With several of Spark's integrations and adapters, you can combine other technologies with Spark. One example is using Spark, Kafka, and Apache Cassandra together: Kafka handles the incoming streaming data, Spark performs the computation, and finally Cassandra, the NoSQL database, stores the resulting data.

Keep in mind, however, that the Spark ecosystem is still maturing, and further improvement is needed in areas such as security and integration with BI tools.

Reference documents
    • Spark main site
    • Spark examples
    • 2014 Spark Summit presentations and videos
    • Spark on the Databricks website
    • Spark columns on the Databricks website
