SPARK-02 (RDD and simple operators)

Source: Internet
Author: User

Today we arrive at the second chapter of our Spark learning series. Many things have begun to change, and life rarely goes in exactly the direction you want, but we still need to keep working hard. Enough chicken soup for the soul.

Let's start today's Spark journey.

I. What is an RDD?

RDD stands for Resilient Distributed Dataset: a distributed, in-memory collection of data.

An RDD is read-only and partitioned, and all or part of the dataset can be cached in memory and reused across computations. The "resilient" part refers to the fact that when memory is insufficient, data can be spilled to disk and swapped back in as needed.

II. Spark operators

Spark operators are divided into two categories: one called transformations, the other called actions.

Transformations are executed lazily: a transformation only records metadata (lineage) information, and real execution starts only when the computation encounters an action (as also described in the previous section).

Both the map and the filter methods are transformations, so calling them does not actually compute a value; only when collect, an action, is invoked is the real value computed.
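The laziness of transformations can be sketched in the spark-shell; the sample data below is made up purely for illustration:

```scala
// Assumes a live SparkContext `sc`, e.g. inside spark-shell
val lines  = sc.parallelize(Seq("a b", "b c a"))
val pairs  = lines.flatMap(_.split(" ")).map((_, 1)) // transformations: only lineage is recorded
val counts = pairs.reduceByKey(_ + _)                // still lazy, no job has run yet
val result = counts.collect()                        // action: the job actually executes now
```

Until `collect()` runs, no task is submitted to the cluster; Spark has only built up the lineage graph.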

III. Two ways to create an RDD

1. Create an RDD from a file system supported by HDFS (or any Hadoop-compatible storage). At this point no data is actually read or computed; only the metadata is recorded.

2. Create an RDD by parallelizing a Scala collection or array.
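The two creation paths can be sketched as follows (the HDFS path and values are illustrative):

```scala
// Assumes a live SparkContext `sc`
// 1. From an HDFS-backed file: only metadata is recorded, no data is read yet
val fromFile = sc.textFile("hdfs://192.168.109.136:9000/wc/*")

// 2. By parallelizing a local Scala collection or array across the cluster
val fromArray = sc.parallelize(Array(1, 2, 3, 4, 5))
```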

Let's look at a summary of the RDD's internal implementation (five features), from the comments in the Spark source:

Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
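Most of these five properties can be inspected directly through the RDD API in the spark-shell (assuming a live `sc`; the HDFS path is illustrative):

```scala
val rdd = sc.textFile("hdfs://192.168.109.136:9000/wc/*")
rdd.partitions.length                     // the list of partitions
rdd.dependencies                          // dependencies on parent RDDs
rdd.partitioner                           // Option[Partitioner]; None for a plain text RDD
rdd.preferredLocations(rdd.partitions(0)) // preferred (data-local) locations for a split
```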

IV. Spark's first program in IDEA

1. First we write a Spark program in IDEA, then package it.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Very important: the entry point to the Spark cluster
    val conf = new SparkConf().setAppName("WC")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}

The first thing to clarify is that our Spark project is created as a Maven project, so our pom file must add the Spark dependency.
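The dependency entry looks roughly like this; the Scala suffix and Spark version below are placeholders and should be matched to your cluster's build:

```xml
<!-- Hypothetical versions: adjust to your cluster's Spark and Scala versions -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
</dependency>
```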

When we package the project, two jar files are generated under target; we choose the larger one, which includes the dependency libraries.

2. Upload the jar to Linux and submit it (this is similar to executing a jar on Hadoop):

./spark-submit \
  --master spark://192.168.109.136:7077 \
  --class cn.wj.spark.WordCount \
  --executor-memory 512m \
  --total-executor-cores 2 \
  /tmp/hello-spark-1.0.jar \
  hdfs://192.168.109.136:9000/wc/* \
  hdfs://192.168.109.136:9000/wc/out

We can then view the execution of the running Spark application through the master's web UI at 192.168.109.136:8080.

V. The relationship between master and worker

The master manages all the workers and schedules resources; each worker manages its own node and starts executors, which perform the actual computation.

