Getting Started with Apache Spark Big Data Analysis (I)

The advent of Apache Spark has put big data and real-time data analysis capabilities within the reach of ordinary developers. With that in mind, this article walks you through hands-on demonstrations to get you started with Spark quickly. It is the first part of a four-part introductory tutorial series on Apache Spark.

The full text consists of four parts:

    • Part I: Getting started with Spark, covering how to use the shell and RDDs
    • Part II: An introduction to Spark SQL and DataFrames, and how to combine Spark with Cassandra
    • Part III: An introduction to Spark MLlib and Spark Streaming
    • Part IV: An introduction to Spark GraphX graph computation

The first part of this tutorial

For the full summary and outline, please visit our website: Apache Spark QuickStart for real-time data-analytics.

On the website you can find more articles and tutorials on this topic, for example Java Reactive Microservice Training and Microservices Architecture | Consul Service Discovery and Health For Microservices Architecture Tutorial. There is plenty more there that is worth a look.

Spark Overview

Apache Spark is a fast-growing, open-source cluster computing system. The growing range of packages and frameworks in the Apache Spark ecosystem enables Spark to perform advanced data analysis. Apache Spark's rapid success is due to its power and ease of use. Compared with traditional MapReduce big data analysis, Spark is more efficient and runs faster. Apache Spark provides in-memory distributed computing and offers API programming interfaces in four languages: Java, Scala, Python, and R. The Spark ecosystem looks like this:

[Figure: the Spark ecosystem]

The entire ecosystem is built on top of the Spark core engine. The core gives Spark its fast in-memory computing capability, and its API supports Java, Scala, Python, and R. Spark Streaming handles real-time streaming data. Spark SQL lets users query structured data in the language they know best; the DataFrame sits at the heart of Spark SQL. A DataFrame treats data as a collection of rows in which each column is named, so you can easily query, plot, and filter data. MLlib is Spark's machine learning framework. GraphX is a graph computation framework that provides graph processing over structured data. That is an overview of the entire ecosystem.
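Spark SQL and DataFrames are covered in detail in Part II, but as a small taste of the idea just described, here is a minimal sketch, assuming the Spark 1.5 shell (where sqlContext is created automatically) and a hypothetical people.json file with name and age columns:

// Assumes a hypothetical people.json file with "name" and "age" columns.
val df = sqlContext.read.json("people.json")      // rows with named columns
df.printSchema()                                   // inspect the inferred schema
df.filter(df("age") > 21).select("name").show()    // query and filter by column name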

Apache Spark's development history

    • Originally developed at UC Berkeley's AMPLab and open-sourced in 2010, Spark has become a top-level project of the Apache Software Foundation.
    • 12,500 commits have been made by 630 source contributors (see the Apache Spark GitHub repo).
    • Most of the code is written in Scala.
    • The recent surge in Google searches for Apache Spark shows how much attention it is receiving (the Google AdWords tool shows up to 108,000 searches in July alone, ten times the search volume for microservices).
    • Some Spark source contributors come from IBM, Oracle, DataStax, BlueData, Cloudera ...
    • Applications built on Spark include: Qlik, Talend, Tresata, AtScale, Platfora ...
    • Companies that use Spark include: Verizon, NBC, Yahoo, Spotify ...

The reason people are so interested in Apache Spark is that it puts Hadoop-scale data processing power in the hands of ordinary developers. Configuring a Spark cluster is simpler than configuring a Hadoop cluster, and Spark runs faster and is easier to program. Spark gives most developers big data and real-time data analysis capabilities. With that in mind, this article uses hands-on demonstrations to get everyone started with Apache Spark quickly.

Downloading Spark and using the interactive shell command line

The best way to experiment with Apache Spark is to use an interactive shell. Spark currently provides two: a Python shell and a Scala shell.

You can download Apache Spark from here; choose the most recent precompiled version so that you can run the shell right away.

At the time of writing, the latest version of Apache Spark is 1.5.0, released on September 9, 2015.

tar -xvzf ~/spark-1.5.0-bin-hadoop2.4.tgz

Run the Python shell:

cd spark-1.5.0-bin-hadoop2.4
./bin/pyspark

The Python shell is not used for the demonstrations in this section.

Because it runs on the JVM, the Scala interactive shell can use existing Java libraries.

Run the Scala shell:

cd spark-1.5.0-bin-hadoop2.4
./bin/spark-shell

After executing the above command line, you can see the following output:

Scala shell welcome message:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/24 21:58:29 INFO SparkContext: Running Spark version 1.5.0

Here are some simple exercises to help you get familiar with the shell. You may not understand what we are doing just yet; we will analyze it in detail later. In the Scala shell, do the following:

Create the textFile RDD using the README file that ships with Spark:

val textFile = sc.textFile("README.md")

Get the first element of the textFile RDD:

textFile.first()
res3: String = # Apache Spark

Filter the data in the textFile RDD, returning all lines containing the keyword "Spark". The operation returns a new RDD; when it is complete, count the lines of the returned RDD.

Filter out the lines that include the "Spark" keyword, then count them:

val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19

To find the line in the textFile RDD with the most words, you can use the following operations. Using the map method, map each line of the RDD to a number (its word count), then use the reduce method to find the largest of those counts.

Find the largest number of words on any line of the RDD textFile:

textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res11: Int = 14

The result shows that the longest line contains 14 words.

Because the map and reduce methods accept Scala function literals as arguments, we can also use methods from Java packages, such as Math.max().

Import a Java method into the Scala shell:

import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res12: Int = 14

We can easily cache the data in memory.

Cache the RDD linesWithSpark, then count its rows:

linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD[8] at filter at <console>:23
linesWithSpark.count()
res15: Long = 19

The above briefly shows how to use the Spark interactive command line.

Resilient Distributed Datasets (RDDs)

Spark can perform tasks in parallel across a cluster, and the degree of parallelism is determined by one of Spark's major components: the RDD. A resilient distributed dataset (RDD) is a representation of data in which the data is partitioned and stored across the cluster, and it is precisely this partitioned storage that allows tasks to be executed in parallel. The more partitions there are, the higher the parallelism. A representation of an RDD is shown below:

[Figure: an RDD with its data split across partitions]

Imagine that each column is a partition; you can easily distribute the partitioned data across the individual nodes of the cluster.
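As a small sketch of this idea in the spark-shell (sc already exists there; the choice of 8 partitions is just for illustration), you can control and inspect the number of partitions like this:

// The second argument to parallelize is the number of partitions (slices).
val rdd = sc.parallelize(1 to 1000, 8)
rdd.partitions.size   // => 8, roughly one partition per "column" in the picture above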

To create an RDD, you can read data from external storage, for example from Cassandra, Amazon Simple Storage Service (Amazon S3), HDFS, or any other input format that Hadoop supports. You can also create an RDD by reading data from a file, an array, or JSON. On the other hand, if the data is local to the application, you only need to call the parallelize method to apply Spark's features to the data and parallelize it across the Apache Spark cluster. To verify this, we use the Scala Spark shell to demonstrate:

Create an RDD thingsRDD from a list of words:

val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:24

Count the number of things in the RDD thingsRDD:

thingsRDD.count()
res16: Long = 5

When you run Spark, you need to create a Spark context. The Spark context is created automatically for you when you use the Spark shell interactive command line. When we invoke the parallelize method on the Spark context object, we get a partitioned RDD that will be distributed across the nodes of the cluster.
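Outside the shell you have to create the Spark context yourself. Here is a minimal sketch of a standalone application that does so; the application name and the local[*] master URL are placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object ThingsApp {
  def main(args: Array[String]): Unit = {
    // In the spark-shell this is done for you and exposed as "sc".
    val conf = new SparkConf().setAppName("ThingsApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))
    println(thingsRDD.count())   // prints 5

    sc.stop()
  }
}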

What can we do with the RDD?

With an RDD, you can either transform the data or perform actions on it. A transformation changes the data format, queries the data, or filters it; an action triggers the computation and extracts, collects, or even counts the data.
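As a small sketch of the difference, reusing the textFile RDD from above: a transformation such as filter() is lazy and only describes a new RDD, while an action such as count() or first() actually triggers the computation.

// Transformation: nothing is computed yet; Spark only records the lineage.
val sparkLines = textFile.filter(line => line.contains("Spark"))

// Actions: these trigger the actual computation.
sparkLines.count()
sparkLines.first()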

For example, we can use Spark's text file README.md to create an RDD textFile whose contents are the lines of text read from the file. The lines are partitioned so that they can be distributed across the cluster and processed in parallel.

Create an RDD textFile from the README.md file:

val textFile = sc.textFile("README.md")

Count the rows:

textFile.count()
res17: Long = 98

There are 98 lines of data in the README.md file.

The results are as follows:

[Figure: the textFile RDD]

We can then filter out all the lines that contain the keyword "Spark"; when the operation completes, a new RDD, linesWithSpark, is generated:

Create a filtered RDD linesWithSpark:

val linesWithSpark = textFile.filter(line => line.contains("Spark"))

In the previous figure we gave a representation of the textFile RDD; the following figure shows the RDD linesWithSpark:

[Figure: the linesWithSpark RDD]

It is worth noting that Spark also has key-value pair RDDs (pair RDDs), whose data is in key/value pair format. For example, the data in the following table represents the relationship between fruits and colors:

[Figure: a table of fruits (keys) and their colors (values)]

Applying the groupByKey() transformation to the data in the table gives the following result:

groupByKey() transformation:

pairRDD.groupByKey()
Banana  [Yellow]
Apple   [Red, Green]
Kiwi    [Green]
Figs    [Black]

This transformation groups the values by key; only the key Apple ends up with two values, Red and Green, grouped together. These are the transformation examples we have given so far.
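For completeness, here is a sketch of how a pair RDD like the one above could be built in the shell; the tuples simply mirror the fruit/color table:

// Build key/value pairs from the table, then group the colors by fruit.
val pairRDD = sc.parallelize(Seq(
  ("Banana", "Yellow"), ("Apple", "Red"), ("Kiwi", "Green"),
  ("Figs", "Black"), ("Apple", "Green")))

pairRDD.groupByKey().collect().foreach { case (fruit, colors) =>
  println(s"$fruit ${colors.mkString("[", ", ", "]")}")
}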

Once you have a filtered RDD, you can collect/materialize the corresponding data and have it flow to your application; this is an example of an action operation. After that, all the data has left the RDD, but because it is still in memory we can continue to perform operations on the RDD's data.

Collect or materialize the data in the linesWithSpark RDD:

linesWithSpark.collect()

It is worth mentioning that every time you perform a Spark action, such as count(), Spark re-runs all the transformations up to the last one and only then returns the result of the count, which makes repeated runs slow. To solve this problem and increase speed, you can cache the RDD's data in memory. That way, when you run action operations repeatedly, you avoid recomputing from the beginning and get the result directly from the cached, in-memory RDD.

Cache the RDD linesWithSpark:

linesWithSpark.cache()

If you want to remove the RDD linesWithSpark from the cache, you can use the unpersist() method.

Remove linesWithSpark from memory:

linesWithSpark.unpersist()

If you do not remove it manually, then when memory becomes tight, Spark uses a least-recently-used (LRU) policy to evict the least recently used cached RDDs from memory.

Here's a summary of how Spark works from start to finish (a short end-to-end sketch follows the list):

    • Create an RDD of some data type
    • Transform the data in the RDD, for example with a filter operation
    • Cache the transformed or filtered RDD if it needs to be reused
    • Perform actions on the RDD, such as extracting data, counting, storing data to Cassandra, and so on
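Here is a minimal end-to-end sketch of those four steps, assuming the spark-shell and a local README.md. Writing to Cassandra would additionally require the spark-cassandra-connector, so a plain saveAsTextFile() stands in for the storage step:

val textFile = sc.textFile("README.md")                     // 1. create an RDD
val linesWithSpark = textFile.filter(_.contains("Spark"))   // 2. transform (filter)
linesWithSpark.cache()                                      // 3. cache for reuse
println(linesWithSpark.count())                             // 4. action: count
linesWithSpark.saveAsTextFile("lines-with-spark")           // 4. action: store the result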

The following is a partial list of RDD transformation operations (a short sketch follows the list):

    • filter()
    • map()
    • sample()
    • union()
    • groupByKey()
    • sortByKey()
    • combineByKey()
    • subtractByKey()
    • mapValues()
    • keys()
    • values()
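As a quick sketch, here are a few of these transformations in the shell; the sample collections are made up for illustration, and collect() is used only to materialize the results for display:

val nums  = sc.parallelize(List(3, 1, 2, 3))
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

nums.map(_ * 10).collect()                     // Array(30, 10, 20, 30)
nums.filter(_ > 1).collect()                   // Array(3, 2, 3)
nums.union(sc.parallelize(List(9))).collect()  // Array(3, 1, 2, 3, 9)
pairs.mapValues(_ + 100).collect()             // Array((a,101), (b,102), (a,103))
pairs.sortByKey().keys.collect()               // Array(a, a, b)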

Here is a partial list of RDD action operations (a short sketch follows the list):

    • collect()
    • count()
    • first()
    • countByKey()
    • saveAsTextFile()
    • reduce()
    • take(n)
    • collectAsMap()
    • lookup(key)
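And a quick sketch of a few of these actions, on the same made-up sample data:

val nums  = sc.parallelize(List(3, 1, 2, 3))
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

nums.collect()        // Array(3, 1, 2, 3)
nums.count()          // 4
nums.first()          // 3
nums.take(2)          // Array(3, 1)
nums.reduce(_ + _)    // 9
pairs.countByKey()    // Map(a -> 2, b -> 1)
pairs.lookup("a")     // Seq(1, 3)
pairs.collectAsMap()  // a Map that keeps only one value per duplicate key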

For a list and description of all RDD operations, refer to the Spark documentation.

Conclusion

This article introduced Apache Spark, a fast-growing, open-source cluster computing system. We showed some of the Apache Spark libraries and frameworks that can perform advanced data analysis, briefly analyzed why Apache Spark has been so successful, and demonstrated the power and ease of use of the in-memory, distributed computing environment that Apache Spark provides.

In the second part of this tutorial series, we will take a more in-depth look at Spark.

Related links:

    • Reactive Microservices
    • High-Speed Microservices
    • Real-Time, Fast-Lane Data Analytics for Microservices Metrics

Original address: Introduction to Big Data Analytics w/ Apache Spark Pt. 1 (Translator: Nuo Yajen; Reviewer: Zhu Zhengju; Editor: Zhonghao)

Translator profile: Nuo Yajen received a Bachelor of Science from the School of Computer and Information Technology, Southwest University, in 2010, and a graduate degree in information science from the University of Chinese Academy of Sciences in 2013, specializing in computer information processing and retrieval.
