Today we come to the second chapter of our Spark learning. A lot of things have begun to change, and life doesn't simply go in the direction you want, but we still need to work hard; no more chicken soup.
Let's start today's Spark journey.
One. What is an RDD?
RDD is short for Resilient Distributed Dataset, rendered in Chinese as "elastic distributed dataset". It is an in-memory data set.
An RDD is read-only and can be partitioned, and all or part of the data set can be cached in memory and reused across computations. The so-called
"elasticity" refers to spilling to disk when memory is insufficient.
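As a small illustration of the caching and elasticity described above, here is a minimal sketch (the HDFS path is only a placeholder; in the spark-shell, sc is already available):

import org.apache.spark.storage.StorageLevel

// persist the dataset so later actions can reuse it;
// MEMORY_AND_DISK is the "elastic" level: partitions that do not fit in memory spill to disk
val lines = sc.textFile("hdfs://192.168.109.136:9000/wc/*")
lines.persist(StorageLevel.MEMORY_AND_DISK)
lines.count() // first action materializes and caches the data
lines.count() // second action reuses the cached partitions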
Two. Spark operators
Spark operators are divided into two categories: one is called transformation (conversion), the other is called action.
Transformations are executed lazily: a transformation only records metadata (the lineage), and real execution starts only when the computation reaches an action (as also described in the previous section).
For example, both map and filter are transformations, so the values are not actually computed; only when collect, an action, is called are the real values produced. A minimal sketch follows.
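Here is a small runnable sketch of this lazy behavior (the local[2] master is only for local demonstration):

import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // transformations: nothing is computed yet, only the lineage is recorded
    val nums = sc.parallelize(1 to 10)
    val doubled = nums.map(_ * 2)    // lazy
    val big = doubled.filter(_ > 10) // still lazy

    // action: collect triggers the actual computation of the whole lineage
    println(big.collect().mkString(", ")) // 12, 14, 16, 18, 20

    sc.stop()
  }
}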
Three. Two ways to create an RDD
1. Create an RDD from a file system that Hadoop supports, such as HDFS. At this point there is no real data to be computed; only the metadata is recorded.
2. Create an RDD in a parallel way from a Scala collection or array.
Both ways are shown in the sketch below.
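A minimal sketch of both creation paths (assuming a spark-shell where sc already exists; the HDFS path is a placeholder):

// 1. from HDFS: no data is read here, only metadata such as the file location is recorded
val lines = sc.textFile("hdfs://192.168.109.136:9000/wc/*")

// 2. from a Scala collection or array, distributed in parallel
val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
val nums2 = sc.makeRDD(1 to 5) // equivalent convenience method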
Looking at the RDD's internal implementation, the source code summarizes five main properties (a sketch for inspecting them follows the list):
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
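Most of these properties can be inspected directly from the shell; a quick sketch, using the same placeholder HDFS path as above:

val pairs = sc.textFile("hdfs://192.168.109.136:9000/wc/*")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

println(pairs.partitions.length) // 1. the list of partitions
// 2. the compute function runs once per split when an action is triggered
println(pairs.dependencies)      // 3. dependencies on parent RDDs (the lineage)
println(pairs.partitioner)       // 4. Some(HashPartitioner) after reduceByKey
println(pairs.preferredLocations(pairs.partitions(0))) // 5. e.g. HDFS block locations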
Four. Our first Spark program in IDEA
1. First we write a Spark program in IDEA and then package it.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // very important: the entry point to the Spark cluster
    val conf = new SparkConf().setAppName("WC")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}
The first thing to clarify is that our project is built with Maven, so our pom file must add the Spark dependency (e.g. spark-core).
When we package, two jar packages are generated under target; we choose the larger one, which bundles the dependency libraries.
2. Upload the jar to Linux and submit it (this is similar to executing jar packages on Hadoop):
./spark-submit --master spark://192.168.109.136:7077 --class cn.wj.spark.WordCount --executor-memory 512m --total-executor-cores 2 /tmp/hello-spark-1.0.jar hdfs://192.168.109.136:9000/wc/* hdfs://192.168.109.136:9000/wc/out
We can then view the currently running Spark application through the master's web UI at 192.168.109.136:8080.
Five. The relationship between master and worker
The master manages all the workers and schedules resources; each worker manages its own node and starts executors, which perform the actual computation.