Spark 2.x In Depth, Series Seven: The RDD Python API Explained (Part One)


Before learning any Spark technology, please make sure you understand Spark correctly; see: Understanding Spark Correctly


The following describes, in terms of the Python API, the three ways to create an RDD, the basic Transformation API for single-type RDDs, the sampling API, and the pipe operation.


One, three ways to create an RDD

    1. Create an RDD from a stable file storage system, such as local filesystem or HDFS, as follows:

"""
How to create an RDD, way 1: from a stable storage system, such as an HDFS file or the local file system
"""
text_file_rdd = sc.textFile("file:///Users/tangweiqun/spark-course/word.txt")
print "text_file_rdd = {0}".format(",".join(text_file_rdd.collect()))
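Reading from HDFS only changes the URI scheme. A minimal sketch, where the namenode host, port, and file path are placeholders rather than values from this article:

# The HDFS address and path below are illustrative placeholders
hdfs_rdd = sc.textFile("hdfs://namenode:8020/user/tangweiqun/word.txt")
print("hdfs_rdd = {0}".format(",".join(hdfs_rdd.collect())))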

2. You can create a new RDD from an already existing RDD through the Transformation API; here the map transformation is used:

"" "2: From an already existing RDD, the RDD Transformation API" "" Map_rdd = Text_file_rdd.map (lambda line: "{0}-{1}". Format (line, "Test")) Print "Map_rdd = {0}". Format (",". Join (Map_rdd.collect ()))

3. Create an RDD from a list in memory. You can specify the number of partitions for the RDD; if you do not, it defaults to the total number of cores across all executors:

"" "3: From a list that already exists in memory, you can specify the partition, if you do not specify the number of partitions for all executor cores, the following API specifies 2 partitions" "" Parallelize_rdd = Sc.parallelize ([1, 2, 3 , 3, 4], 2) print "Parallelize_rdd = {0}". Format (Parallelize_rdd.glom (). Collect ())


Note: for the third case, Scala also provides the makeRDD API, which lets you specify the machine on which each partition of the RDD is placed. The principle behind this API is described in Spark core RDD Scala API.


Two, basic Transformation API for single-type RDDs

First create an RDD from in-memory data:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize([1, 2, 3, 3, 4], 2)
    1. The map operation applies our custom function to each element of parallelize_rdd; here it adds 1 to each element, as follows:

map_rdd = parallelize_rdd.map(lambda x: x + 1)
"""
Result: [[2, 3], [4, 4, 5]]
"""
print "map_rdd = {0}".format(map_rdd.glom().collect())

Note that the map operation can return an RDD with a different element type; for example, the following returns String objects:

map_string_rdd = parallelize_rdd.map(lambda x: "{0}-{1}".format(x, "test"))
"""
Result: [['1-test', '2-test'], ['3-test', '3-test', '4-test']]
"""
print "map_string_rdd = {0}".format(map_string_rdd.glom().collect())

2. The flatMap operation applies our custom lambda function to each element of parallelize_rdd; the function returns a list for each element, and flatMap flattens these output lists:

flatmap_rdd = parallelize_rdd.flatMap(lambda x: range(x))
"""
Result: [[0, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]]
"""
print "flatmap_rdd = {0}".format(flatmap_rdd.glom().collect())

3. The filter operation applies our custom predicate to each element of parallelize_rdd and drops the elements we do not need; the example below keeps only the elements not equal to 1:

filter_rdd = parallelize_rdd.filter(lambda x: x != 1)
"""
Result: [[2], [3, 3, 4]]
"""
print "filter_rdd = {0}".format(filter_rdd.glom().collect())

4. The glom operation lets you view the elements contained in each partition of parallelize_rdd:

glom_rdd = parallelize_rdd.glom()
"""
Result: [[1, 2], [3, 3, 4]], which shows that parallelize_rdd has two partitions:
the first partition holds 1 and 2, the second holds 3, 3 and 4
"""
print "glom_rdd = {0}".format(glom_rdd.collect())


5. The mapPartitions operation applies our custom function to the data of each partition of parallelize_rdd. Suppose we need to add an initial value to each element, and obtaining that initial value is very time consuming; in that case mapPartitions has a big advantage, as follows:

import time

def get_init_number(source):
    # Simulates obtaining an initial value, which is assumed to be time consuming
    print "get init number from {0}, may take much time...".format(source)
    time.sleep(1)
    return 1

def map_partition_func(iterator):
    """
    The initial value is fetched once per partition. parallelize_rdd has two partitions,
    so get_init_number is called only twice.
    Therefore, for time-consuming initialization such as opening a database connection,
    mapPartitions is generally used so that each partition is initialized once,
    instead of using the map operation.
    :param iterator:
    :return:
    """
    init_number = get_init_number("map_partition_func")
    yield map(lambda x: x + init_number, iterator)

map_partition_rdd = parallelize_rdd.mapPartitions(map_partition_func)
"""
Result: [[[2, 3]], [[4, 4, 5]]]
"""
print "map_partition_rdd = {0}".format(map_partition_rdd.glom().collect())

def map_func(x):
    """
    The initial value is fetched for every element. parallelize_rdd contains 5 elements,
    so get_init_number is called 5 times, which seriously hurts performance;
    this is not as good as mapPartitions.
    :param x:
    :return:
    """
    init_number = get_init_number("map_func")
    return x + init_number

map_rdd_init_number = parallelize_rdd.map(map_func)
"""
Result: [[2, 3], [4, 4, 5]]
"""
print "map_rdd_init_number = {0}".format(map_rdd_init_number.glom().collect())

6. The mapPartitionsWithIndex operation applies our custom function to the data of each partition of parallelize_rdd, and the function also receives the partition index, i.e. which partition's data is currently being processed:

def map_partition_with_index_func(partition_index, iterator):
    yield (partition_index, sum(iterator))

map_partition_with_index_rdd = parallelize_rdd.mapPartitionsWithIndex(map_partition_with_index_func)
"""
Result: [[(0, 3)], [(1, 10)]]
"""
print "map_partition_with_index_rdd = {0}".format(map_partition_with_index_rdd.glom().collect())
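As a further illustration (not from the original article), a common use of the partition index is simply to tag each element with the partition it lives in. A minimal sketch; tag_with_partition_func is a hypothetical helper added here:

def tag_with_partition_func(partition_index, iterator):
    # Pair every element with the index of its partition
    for x in iterator:
        yield (partition_index, x)

print(parallelize_rdd.mapPartitionsWithIndex(tag_with_partition_func).collect())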

Three, sampling API

First create an RDD from in-memory data:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize([1, 2, 3, 3, 4], 2)
    1. sample

"" "The first parameter is withreplacement if the withreplacement=true expression has put back the sampling, uses the Poisson sampling algorithm realizes if the Withreplacement=false expression does not put back the sampling, Using the Bernoulli sampling algorithm to implement the second parameter is: fraction, indicating that each element is extracted as the probability of the sample, not to indicate the amount of data to be extracted, such as sampling from 100 data, fraction=0.2, does not mean to extract 100 * 0.2 = 20 data, Instead, the 100 elements are extracted as a sample probability of 0.2, the size of the sample is not fixed, but is subject to two distributions when the Withreplacement=true fraction>=0 when the Withreplacement=false 0 < fraction < 1 The third parameter is: Reed represents the seed that generates a random number, which generates a random seed "" "Sample_rdd = Parallelize_rdd.sample (False) based on the reed for each partition of the RDD. 0.5, 100) "" "Result: [[1], [3, 4]]" "" print "Sample_rdd = {0}". Format (Sample_rdd.glom (). Collect ())

2. randomSplit

"" "//according to the weight of the rdd random sampling segmentation, there are several weights are divided into several rdd//random sampling using the Bernoulli sampling algorithm, the following is a two weight, will be cut into two Rdd" "" Split_rdds = Parallelize_ Rdd.randomsplit ([0.2, 0.8]) print len (split_rdds) "" "[[], [3, 4]]" "" print "split_rdds[0] = {0}". Format (Split_rdds[0]. Glom (). Collect ()) "" "[[1, 2], [3]]" "" print "split_rdds[1] = {0}". Format (Split_rdds[1].glom (). Collect ())

3. takeSample

"" "//random sampling of the specified number of sample data///The first parameter is withreplacement//if the withreplacement=true means that there is a sample of the put back, using the Poisson sampling algorithm to achieve//if withreplacement= False if the sample is not put back, using the Bernoulli sampling algorithm to achieve//The second parameter is specified, then how many samples "" "" "randomly sampled a specified number of sample data results: [1]" "" Print parallelize_rdd.takesample ( False, 1)

4. sampleByKey: stratified sampling on a key-value RDD

"" "Create a key value of type Rdd" "" Pair_rdd = Sc.parallelize ([' (' A ', 1 '), (' B ', 2), (' C ', 3), (' B ', 4), (' A ', 5)]) Samplebykey_rdd = PA Ir_rdd.samplebykey (Withreplacement=false, fractions={' a ': 0.5, ' B ': 1, ' C ': 0.2}) "" Result: [[(' A ', 1], (' B ', 2), (' B ', 4)]] "" "Print" Samplebykey_rdd = {0} ". Format (Samplebykey_rdd.glom (). Collect ())

The principles behind these sampling algorithms are explained in detail in Spark core RDD API; they are hard to convey well in words alone.

Four, the pipe operation, which runs an external script (such as a Python or shell script) as one step of the RDD execution flow

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
parallelize_rdd = sc.parallelize(["test1", "test2", "test3", "test4", "test5"], 2)
"""
If this runs on a real Spark cluster, echo.py must exist in the same directory on every machine
in the cluster. The second parameter is the environment variables passed to the script.
"""
pipe_rdd = parallelize_rdd.pipe("python /Users/tangweiqun/spark/source/spark-course/spark-rdd-java/src/main/resources/echo.py", {"env": "env"})
"""
Result: slave1-test1-env slave1-test2-env slave1-test3-env slave1-test4-env slave1-test5-env
"""
print "pipe_rdd = {0}".format(" ".join(pipe_rdd.collect()))

The contents of echo.py are as follows:

import sys
import os

# input = "test"
input = sys.stdin
env_keys = os.environ.keys()
env = ""
if "env" in env_keys:
    env = os.environ["env"]
for ele in input:
    output = "slave1-" + ele.strip('\n') + "-" + env
    print(output)

input.close()

For the principle of pipe and how it is implemented, refer to Spark core RDD API, which also clearly explains how to avoid manually copying the script to every machine.

