Spark 2.x In Depth, Series Seven: The RDD Python API Explained (Part One)


Before learning any Spark technology, please make sure you understand Spark correctly; see: Understanding Spark Correctly


The following describes, in terms of the Python API, the three ways to create an RDD, the basic Transformation API for single-type RDDs, the sampling API, and the pipe operation.


One, three ways to create an RDD

    1. Create an RDD from a stable file storage system, such as local filesystem or HDFS, as follows:

"""
How to create an RDD, way 1: from a stable storage system, such as an HDFS file or the local file system
"""
text_file_rdd = sc.textFile("file:///Users/tangweiqun/spark-course/word.txt")
print "text_file_rdd = {0}".format(",".join(text_file_rdd.collect()))
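Reading from HDFS only changes the URI scheme. A minimal sketch, where the namenode host, port, and file path are placeholders rather than values from this article:

# The HDFS address and path below are illustrative placeholders
hdfs_rdd = sc.textFile("hdfs://namenode:8020/user/tangweiqun/word.txt")
print("hdfs_rdd = {0}".format(",".join(hdfs_rdd.collect())))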

2. You can create a new RDD from an already existing RDD through the Transformation API; here the map transformation is used:

"" "2: From an already existing RDD, the RDD Transformation API" "" Map_rdd = Text_file_rdd.map (lambda line: "{0}-{1}". Format (line, "Test")) Print "Map_rdd = {0}". Format (",". Join (Map_rdd.collect ()))

3. Create an RDD from a list in memory. You can specify the number of partitions for the RDD; if you do not, it defaults to the total number of cores across all executors:

"" "3: From a list that already exists in memory, you can specify the partition, if you do not specify the number of partitions for all executor cores, the following API specifies 2 partitions" "" Parallelize_rdd = Sc.parallelize ([1, 2, 3 , 3, 4], 2) print "Parallelize_rdd = {0}". Format (Parallelize_rdd.glom (). Collect ())


Note: for the third case, Scala also provides the makeRDD API, which lets you specify the machine on which each partition of the RDD is placed. The principle behind this API is described in Spark core RDD Scala API.


Two, basic Transformation API for single-type RDDs

First create an RDD from in-memory data:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize([1, 2, 3, 3, 4], 2)
    1. The map operation applies our custom function to each element of parallelize_rdd; here it adds 1 to each element, as follows:

map_rdd = parallelize_rdd.map(lambda x: x + 1)
"""
Result: [[2, 3], [4, 4, 5]]
"""
print "map_rdd = {0}".format(map_rdd.glom().collect())

Note that the map operation can return an RDD with a different element type; for example, the following returns String objects:

map_string_rdd = parallelize_rdd.map(lambda x: "{0}-{1}".format(x, "test"))
"""
Result: [['1-test', '2-test'], ['3-test', '3-test', '4-test']]
"""
print "map_string_rdd = {0}".format(map_string_rdd.glom().collect())

2. The flatMap operation applies our custom lambda function to each element of parallelize_rdd; the function returns a list for each element, and flatMap flattens these output lists:

flatmap_rdd = parallelize_rdd.flatMap(lambda x: range(x))
"""
Result: [[0, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]]
"""
print "flatmap_rdd = {0}".format(flatmap_rdd.glom().collect())

3. The filter operation applies our custom predicate to each element of parallelize_rdd and drops the elements we do not need; the example below keeps only the elements not equal to 1:

filter_rdd = parallelize_rdd.filter(lambda x: x != 1)
"""
Result: [[2], [3, 3, 4]]
"""
print "filter_rdd = {0}".format(filter_rdd.glom().collect())

4. The glom operation lets you view the elements contained in each partition of parallelize_rdd:

glom_rdd = parallelize_rdd.glom()
"""
Result: [[1, 2], [3, 3, 4]], which shows that parallelize_rdd has two partitions:
the first partition holds 1 and 2, the second holds 3, 3 and 4
"""
print "glom_rdd = {0}".format(glom_rdd.collect())


5. The mapPartitions operation applies our custom function to the data of each partition of parallelize_rdd. Suppose we need to add an initial value to each element, and obtaining that initial value is very time consuming; in that case mapPartitions has a big advantage, as follows:

import time

def get_init_number(source):
    # Simulates obtaining an initial value, which is assumed to be time consuming
    print "get init number from {0}, may take much time...".format(source)
    time.sleep(1)
    return 1

def map_partition_func(iterator):
    """
    The initial value is fetched once per partition. parallelize_rdd has two partitions,
    so get_init_number is called only twice.
    Therefore, for time-consuming initialization such as opening a database connection,
    mapPartitions is generally used so that each partition is initialized once,
    instead of using the map operation.
    :param iterator:
    :return:
    """
    init_number = get_init_number("map_partition_func")
    yield map(lambda x: x + init_number, iterator)

map_partition_rdd = parallelize_rdd.mapPartitions(map_partition_func)
"""
Result: [[[2, 3]], [[4, 4, 5]]]
"""
print "map_partition_rdd = {0}".format(map_partition_rdd.glom().collect())

def map_func(x):
    """
    The initial value is fetched for every element. parallelize_rdd contains 5 elements,
    so get_init_number is called 5 times, which seriously hurts performance;
    this is not as good as mapPartitions.
    :param x:
    :return:
    """
    init_number = get_init_number("map_func")
    return x + init_number

map_rdd_init_number = parallelize_rdd.map(map_func)
"""
Result: [[2, 3], [4, 4, 5]]
"""
print "map_rdd_init_number = {0}".format(map_rdd_init_number.glom().collect())

6. The mapPartitionsWithIndex operation applies our custom function to the data of each partition of parallelize_rdd, and the function also receives the partition index, i.e. which partition's data is currently being processed:

def map_partition_with_index_func(partition_index, iterator):
    yield (partition_index, sum(iterator))

map_partition_with_index_rdd = parallelize_rdd.mapPartitionsWithIndex(map_partition_with_index_func)
"""
Result: [[(0, 3)], [(1, 10)]]
"""
print "map_partition_with_index_rdd = {0}".format(map_partition_with_index_rdd.glom().collect())
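As a further illustration (not from the original article), a common use of the partition index is simply to tag each element with the partition it lives in. A minimal sketch; tag_with_partition_func is a hypothetical helper added here:

def tag_with_partition_func(partition_index, iterator):
    # Pair every element with the index of its partition
    for x in iterator:
        yield (partition_index, x)

print(parallelize_rdd.mapPartitionsWithIndex(tag_with_partition_func).collect())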

Three, sampling API

First create an RDD from in-memory data:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize([1, 2, 3, 3, 4], 2)
    1. sample

"" "The first parameter is withreplacement if the withreplacement=true expression has put back the sampling, uses the Poisson sampling algorithm realizes if the Withreplacement=false expression does not put back the sampling, Using the Bernoulli sampling algorithm to implement the second parameter is: fraction, indicating that each element is extracted as the probability of the sample, not to indicate the amount of data to be extracted, such as sampling from 100 data, fraction=0.2, does not mean to extract 100 * 0.2 = 20 data, Instead, the 100 elements are extracted as a sample probability of 0.2, the size of the sample is not fixed, but is subject to two distributions when the Withreplacement=true fraction>=0 when the Withreplacement=false 0 < fraction < 1 The third parameter is: Reed represents the seed that generates a random number, which generates a random seed "" "Sample_rdd = Parallelize_rdd.sample (False) based on the reed for each partition of the RDD. 0.5, 100) "" "Result: [[1], [3, 4]]" "" print "Sample_rdd = {0}". Format (Sample_rdd.glom (). Collect ())

2. randomSplit

"" "//according to the weight of the rdd random sampling segmentation, there are several weights are divided into several rdd//random sampling using the Bernoulli sampling algorithm, the following is a two weight, will be cut into two Rdd" "" Split_rdds = Parallelize_ Rdd.randomsplit ([0.2, 0.8]) print len (split_rdds) "" "[[], [3, 4]]" "" print "split_rdds[0] = {0}". Format (Split_rdds[0]. Glom (). Collect ()) "" "[[1, 2], [3]]" "" print "split_rdds[1] = {0}". Format (Split_rdds[1].glom (). Collect ())

3. takeSample

"" "//random sampling of the specified number of sample data///The first parameter is withreplacement//if the withreplacement=true means that there is a sample of the put back, using the Poisson sampling algorithm to achieve//if withreplacement= False if the sample is not put back, using the Bernoulli sampling algorithm to achieve//The second parameter is specified, then how many samples "" "" "randomly sampled a specified number of sample data results: [1]" "" Print parallelize_rdd.takesample ( False, 1)

4. sampleByKey: stratified sampling on a key-value RDD

"" "Create a key value of type Rdd" "" Pair_rdd = Sc.parallelize ([' (' A ', 1 '), (' B ', 2), (' C ', 3), (' B ', 4), (' A ', 5)]) Samplebykey_rdd = PA Ir_rdd.samplebykey (Withreplacement=false, fractions={' a ': 0.5, ' B ': 1, ' C ': 0.2}) "" Result: [[(' A ', 1], (' B ', 2), (' B ', 4)]] "" "Print" Samplebykey_rdd = {0} ". Format (Samplebykey_rdd.glom (). Collect ())

The principles behind these sampling algorithms are explained in detail in Spark core RDD API; they are hard to convey well in words alone.

Four, the pipe operation, which runs an external script (such as a Python or shell script) as one step of the RDD execution flow

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize(["test1", "test2", "test3", "test4", "test5"], 2)
"""
If this runs on a real Spark cluster, echo.py must exist in the same directory on every machine
in the cluster. The second parameter is the environment variables passed to the script.
"""
pipe_rdd = parallelize_rdd.pipe("python /Users/tangweiqun/spark/source/spark-course/spark-rdd-java/src/main/resources/echo.py", {"env": "env"})
"""
Result: slave1-test1-env slave1-test2-env slave1-test3-env slave1-test4-env slave1-test5-env
"""
print "pipe_rdd = {0}".format(" ".join(pipe_rdd.collect()))

The contents of echo.py are as follows:

import sys
import os

# input = "test"
input = sys.stdin
env_keys = os.environ.keys()
env = ""
if "env" in env_keys:
    env = os.environ["env"]
for ele in input:
    output = "slave1-" + ele.strip('\n') + "-" + env
    print(output)

input.close()

For the principle of pipe and how it is implemented, refer to Spark core RDD API, which also clearly explains how to avoid manually copying the script to every machine.

