Introduction to Spark Basics (i) -- RDD Foundation

(i) RDD definition

An RDD is an immutable, distributed collection of objects.

For example, the following figure shows the data of RDD1: each record is a number, the data is distributed across three nodes, and its contents are immutable.

There are two ways to create an RDD:

1) Distributing a collection from the driver (the parallelize method)

Use the parallelize method to turn a collection that exists in the driver into a distributed dataset (by default, the number of partitions matches the execution resources allocated).

List<Integer> data = Arrays.asList(10, 34, 567, 53, 9, 3);

JavaRDD<Integer> distData = sc.parallelize(data); // each record is an Integer



2) Reading an external dataset (local file, HDFS, ...)

JavaRDD<String> distFile = sc.textFile("data.txt"); // each record is a String; the file is split by lines by default


Textfile ("/my/directory"), Textfile ("/my/directory/*.txt"),

You can use wildcards to read all the data in all the files in the folder, return the Rdd, and count a record for each line of string data in each file.

Wholetextfiles () also reads all the data in all the files in the folder, but returns the Pairrdd, each record is (filename, content) pair
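A minimal sketch of the difference, assuming sc is an existing JavaSparkContext and the directory path is hypothetical:

JavaRDD<String> lines = sc.textFile("/my/directory");                   // one record per line
JavaPairRDD<String, String> files = sc.wholeTextFiles("/my/directory"); // one record per file: (filename, content)
System.out.println("line records: " + lines.count());
System.out.println("file records: " + files.count());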


(ii) RDD partition count

1) RDDs created by parallelize

Default number of partitions = number of parallel threads (i.e. the number of CPU cores allocated).

(The Spark website says that, if not specified, the number of partitions defaults to a value based on the allocated resources.)

bin/spark-shell --master yarn --num-executors 2 --executor-memory 512m

bin/spark-shell --master yarn-client --num-executors 4 --executor-memory 512m

If you specify the number of partitions explicitly, that is exactly how many partitions you get.
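A small sketch of the two cases (the data values are illustrative):

List<Integer> data = Arrays.asList(10, 34, 567, 53, 9, 3);
JavaRDD<Integer> defaultRdd = sc.parallelize(data);     // partition count follows the allocated cores
JavaRDD<Integer> fourPartRdd = sc.parallelize(data, 4); // explicitly request 4 partitions
System.out.println(fourPartRdd.partitions().size());    // prints 4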



2) RDDs created by reading an external dataset

① Files on HDFS

If there are N files on HDFS, N partitions are generated by default.

Small experiment 1: an HDFS directory contains 533 files (each smaller than 128 MB, so each file occupies only one block). Reading this directory and running an operation on it, you can see that 533 tasks are generated.


Small experiment 2: a single 1.81 GB file on HDFS (larger than 128 MB, so the file occupies more than one block).


Testing with spark-shell (allocated 2 cores) and running a count operation on it, you can see that only one task is generated and only one executor is used. So the number of partitions here is indeed 1, regardless of how many blocks the file occupies.



② A local folder

For example, a folder with 4 files generates 4 partitions by default.

The API lets you specify a minimum number of partitions: sc.textFile(path, minPartitions).

If minPartitions is less than the number of files (4), the number of partitions is 4.

If minPartitions is greater than the number of files (4), the number of partitions is a multiple of the file count: 8, 12, ...


(Is partitions.size an action, since it returns a result to the driver? Experiment shows that no job is submitted, so it is not an action.)
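A sketch of checking the partition count, with a hypothetical directory and minPartitions value:

JavaRDD<String> dirRdd = sc.textFile("/my/local/dir", 6); // ask for at least 6 partitions
// partitions() is answered on the driver; no job appears in the UI, so it is not an action
System.out.println("partitions: " + dirRdd.partitions().size());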


(iii) RDD operations

RDDs support two types of operations: 1) transformations and 2) actions.

① Transformation: creates a new RDD from an existing RDD.

② Action: computes a result from the RDD and returns it to the driver, stores it in an external storage system (such as HDFS), or applies an operation to every element.

Transformations are lazy: nothing is computed when they are called. The preceding transformations are evaluated only when an action is invoked.

A transformation only records a step in the logical chain describing how the RDD evolves; the computation starts only when an action is invoked.
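A small sketch of this laziness (the data is illustrative):

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
JavaRDD<Integer> squares = numbers.map(new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer v) throws Exception {
        return v * v; // not executed yet: the map is only recorded
    }
});
long n = squares.count(); // count is an action, so the recorded map is evaluated here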

Some operations take a function object: Function<T, R>, FlatMapFunction<T, R>, PairFunction<T, R>, Function2<T1, T2, R>. The correspondence is shown in the figure (the code involved is posted at the end of this article).

1) Transformation operations:



map, filter, flatMap, union (the return value of these methods is an RDD)

2) Action operations

A. Returning a value to the driver:

reduce, fold, aggregate, take(n), top(), first(), collect(), count(), countByValue()

(The return value of these methods is a String, an Integer, a List, and so on.)

Both reduce and fold require that the function's return type be the same as the RDD's record type; for example, on an RDD of Integers, the result of reduce and fold is an Integer.

aggregate removes this restriction and can return any type to the driver (see the sketch at the end of this subsection).

collect, take, and top return List data to the driver.

Note: collect returns all of the RDD's data to the driver, so all of it has to fit in a single machine's memory; otherwise the driver runs out of memory and crashes.

The first() method of the RDD takes whichever record comes first: if the data is unordered, which record you get is arbitrary; if the data defines an order, you get the first record.

count() returns the count as a long.
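Here is a minimal sketch of aggregate returning a different type than the RDD's elements, computing an average as a (sum, count) pair; the data values are illustrative, and scala.Tuple2 is used as the accumulator type:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(10, 34, 567, 53, 9, 3));
Tuple2<Integer, Integer> sumCount = nums.aggregate(
        new Tuple2<Integer, Integer>(0, 0), // zero value: (sum, count)
        new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc, Integer v) throws Exception {
                return new Tuple2<Integer, Integer>(acc._1() + v, acc._2() + 1); // fold one element into the accumulator
            }
        },
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) throws Exception {
                return new Tuple2<Integer, Integer>(a._1() + b._1(), a._2() + b._2()); // merge two accumulators
            }
        });
double avg = (double) sumCount._1() / sumCount._2(); // the driver receives a Tuple2, not an Integer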

B. Storing the results in an external storage system

saveAsTextFile
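A small sketch, with a hypothetical output path:

JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
lines.saveAsTextFile("hdfs:///tmp/demo-output"); // one part-NNNNN file is written per partition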

C. foreach:

Applies an operation to every element of the RDD without sending the RDD back to the local driver, for example to send the data over the network or store it in a database.

Note: if you call println inside foreach, the output is not sent back to the driver; it is written to the executor's stdout, and that is where you can see it.

If you want to see the output on the driver, print rdd.collect() instead.
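A small sketch of the two ways to see the data, assuming sc is an existing JavaSparkContext:

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(5, 1, 4, 2), 3);
rdd.foreach(new VoidFunction<Integer>() {
    @Override
    public void call(Integer v) throws Exception {
        System.out.println(v); // shows up in each executor's stdout, not on the driver
    }
});
System.out.println(rdd.collect()); // collect brings the data back, so this prints on the driver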


3) Special operation

Persist

When we use the same RDD multiple times, doing it the usual way means every computation has to recompute the RDD and its dependencies. For example (this goes slightly beyond this chapter; the underlying principle is covered in the next chapter of the tutorial):

rdd2 = rdd1.map(...).filter(...)
int a = rdd2.reduce(...)
rdd3 = rdd2.filter(...)
int b = rdd3.reduce(...)

There are two actions here, so two jobs are submitted in order; the second job is submitted only after the first one has finished.

The DAG of the first job is rdd1 --> map --> filter --> rdd2 --> reduce. It starts from rdd1 and runs until reduce returns the result to the driver.

The DAG of the second job is rdd1 --> map --> filter --> rdd2 --> filter --> rdd3 --> reduce. It also starts from rdd1 and runs until reduce returns the result to the driver.

Both jobs execute rdd1 --> map --> filter --> rdd2 all over again, which is repeated computation. You can avoid it with rdd2.persist(): once rdd2 has been computed, it can be reused directly without being recomputed, so the second job no longer needs to compute rdd2 and simply runs rdd2 --> filter --> rdd3 --> reduce.

The RDD cache() method actually calls persist() with the MEMORY_ONLY storage level; with persist() you can set the StorageLevel manually to match your project's needs: MEMORY_ONLY (in the JVM heap), MEMORY_AND_DISK (if the data does not fit in memory, it spills to disk), and so on.

Note: cache and persist are not actions.
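A sketch of the pattern, assuming sc is an existing JavaSparkContext (requires org.apache.spark.storage.StorageLevel); the data and filter functions are illustrative:

JavaRDD<Integer> rdd1 = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
JavaRDD<Integer> rdd2 = rdd1.filter(new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer v) throws Exception { return v % 2 == 0; }
});
rdd2.persist(StorageLevel.MEMORY_AND_DISK()); // not an action: nothing is computed yet
long a = rdd2.count();                        // job 1 computes rdd2 and caches it
long b = rdd2.filter(new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer v) throws Exception { return v > 2; }
}).count();                                   // job 2 starts from the cached rdd2
rdd2.unpersist();                             // free the cached data when it is no longer needed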

(iv) PairRDD

Creating a PairRDD: mapToPair (a short sketch follows the list of operations below).

PairRDD transformation operations:

filter(func), reduceByKey(func), mapValues(func)

keys(), values()

sortByKey()
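A minimal sketch of creating and transforming a PairRDD (the data is illustrative; Tuple2 is scala.Tuple2):

JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdd", "spark"));
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String w) throws Exception {
        return new Tuple2<String, Integer>(w, 1); // each word becomes a (word, 1) pair
    }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b; // sum the counts for each key
    }
});
System.out.println(counts.sortByKey().collect()); // e.g. [(rdd,1), (spark,2)]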


(v) The pitfall of reading local files with textFile

The official documentation notes that, when using the local file system, the path must be accessible on the worker nodes in order to read local files:

"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."

But it doesn't seem to work.

When the job is submitted in YARN mode, it always reports that the file cannot be found, even if the file is placed on all nodes.

In standalone mode, on a single-node cluster the file can be found; on a multi-node cluster, if different machines have different content at the same file path, the result read back is only the content from one particular machine.

I have not figured this out yet; advice from experts is welcome.


// These demos assume a JavaSparkContext field named sc.
/** Square the values in the RDD. */
public static void mapDemo() {
    JavaRDD<Integer> numberRdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
    JavaRDD<Integer> resultRddNumber = numberRdd.map(new Function<Integer, Integer>() {
        @Override
        public Integer call(Integer v1) throws Exception {
            return v1 * v1;
        }
    });
    List<Integer> result = resultRddNumber.collect();
    long count = resultRddNumber.count();
    System.out.println(StringUtils.join(result, ","));
    System.out.println("count is: " + count);
}
/** flatMap method. */
public static void flatMapDemo() {
    JavaRDD<String> linesRdd = sc.parallelize(Arrays.asList("sentences", "Another sentences", "What are you saying"));
    JavaRDD<String> wordRdd = linesRdd.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
            return Arrays.asList(s.split(" "));
        }
    });
    System.out.println(StringUtils.join(wordRdd.collect(), ","));
    System.out.println("count is: " + wordRdd.count());
}
/** filter. */
public static void filterDemo() {
    JavaRDD<String> linesRdd = sc.parallelize(Arrays.asList("This is a sentences", "Another Wangke sentences", "What are you saying"));
    JavaRDD<String> filterRdd = linesRdd.filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String v1) throws Exception {
            return v1.contains("Wangke");
        }
    });
    System.out.println(StringUtils.join(filterRdd.collect(), ","));
    System.out.println("count is: " + filterRdd.count());
}
/** union. */
public static void unionDemo() {
    JavaRDD<String> linesRdd = sc.parallelize(Arrays.asList("This is a sentences", "Another Wangke sentences", "What are you saying"));
    JavaRDD<String> linesRdd2 = sc.parallelize(Arrays.asList("I'm saying", "we can be together"));
    JavaRDD<String> unionRdd = linesRdd.union(linesRdd2);
    System.out.println(StringUtils.join(unionRdd.collect(), ","));
    System.out.println("count is: " + unionRdd.count());
}
/** reduce: the return value type must be the same as the element type of the RDD we operate on. */
public static void reduceDemo() {
    JavaRDD<Integer> numberRdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
    Integer result = numberRdd.reduce(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
        }
    });
    System.out.println("result is: " + result);
}
/** reduce on Strings. */
public static void reduceDemo2() {
    JavaRDD<String> linesRdd = sc.parallelize(Arrays.asList("This", "is a sentences", "What are you saying"));
    String result = linesRdd.reduce(new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            return v1 + v2;
        }
    });
    System.out.println("result is: " + result);
}
/** foreach. */
public static void foreachDemo() {
    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRdd = sc.parallelize(data, 3);
    javaRdd.foreach(new VoidFunction<Integer>() {
        public void call(Integer integer) throws Exception {
            System.out.println(integer);
            // send this data to a web server / store the data in a database
        }
    });
}
/** aggregate. */
class AvgCount implements Serializable {
    public int total;
    // ... (the rest of the aggregate example is truncated in the source)
