Spark RDD API Extension Development (1)

Tags: spark, rdd

As we all know, Apache Spark has a lot of built-in APIs for manipulating data. But in real-world application development we often hit problems that Spark's APIs don't cover, and we need to extend the Spark API to implement our own approach.
There are two ways to extend the Spark API: (1) add custom methods to an existing RDD, or (2) create our own RDD. In this article, I'll elaborate on both of these methods and give the code, starting with the first.

Suppose we have some product sales data in CSV format. For the sake of simplicity, assume each row consists of four fields, id, customerId, itemId, and itemValue, which we represent with a SalesRecord class:

class SalesRecord(val id: String, val customerId: String,
                  val itemId: String, val itemValue: Double)
  extends Comparable[SalesRecord] with Serializable {
  // Comparable requires a compareTo implementation; comparing by id is one reasonable choice
  override def compareTo(other: SalesRecord): Int = id.compareTo(other.id)
}

We can then parse the sales data and store it in an RDD[SalesRecord]:

val sc = new SparkContext(args(0), "iteblogRDDExtending")
val dataRDD = sc.textFile("file:///www/iteblog.csv")
val salesRecordRDD = dataRDD.map { row =>
  val colValues = row.split(",")
  new SalesRecord(colValues(0), colValues(1),
    colValues(2), colValues(3).toDouble)
}

If we want to calculate the total sales of these goods, we would write:

salesRecordRDD.map(_.itemValue).sum

Although this is concise, the intent is a bit hard to read at a glance. It would be much easier to understand if we could write:

salesRecordRDD.totalSales

In the snippet above, totalSales looks just like a built-in Spark method, but Spark does not provide it; we need to implement this custom operation on the existing RDD ourselves.

Let me show you how to add our own custom methods to an existing RDD.

  First, define a helper class to hold all of our custom operations

Strictly speaking, you don't need a dedicated class just to add your own methods, but for the sake of organization it is recommended. Let's define an IteblogCustomFunctions class to hold all of our custom methods. It is designed to work on RDD[SalesRecord], so every operation it provides deals with sales data:

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum
}
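
One benefit of collecting these operations in a single class is that new ones slot in alongside totalSales. As an illustration only (this method is not part of the original article, and the name totalSalesByCustomer is hypothetical), a per-customer aggregate could be added like so:

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum

  // Hypothetical extra operation: total sales per customer,
  // built on the standard pair-RDD reduceByKey
  def totalSalesByCustomer: RDD[(String, Double)] =
    rdd.map(r => (r.customerId, r.itemValue)).reduceByKey(_ + _)
}

Note that on Spark 1.3 and later the pair-RDD functions such as reduceByKey are available without extra imports; on older versions you also need import org.apache.spark.SparkContext._.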

  Second, add the methods to the RDD via an implicit conversion

We define the implicit conversion addIteblogCustomFunctions, which makes all of the sales-data methods available on any RDD[SalesRecord]:

object IteblogCustomFunctions {
  implicit def addIteblogCustomFunctions(rdd: RDD[SalesRecord]): IteblogCustomFunctions =
    new IteblogCustomFunctions(rdd)
}
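
As a side note (not from the original article): on Scala 2.10 and later, the helper class and its implicit conversion can be collapsed into a single implicit class; the name SalesRecordRDDFunctions below is just illustrative:

object IteblogCustomFunctions {
  // Equivalent to the separate class + implicit def pair above
  implicit class SalesRecordRDDFunctions(rdd: RDD[SalesRecord]) {
    def totalSales: Double = rdd.map(_.itemValue).sum
  }
}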

  Third, use the custom method

Now, by importing the members of IteblogCustomFunctions, we can call our custom method as if it were built in:

import IteblogCustomFunctions._
println(salesRecordRDD.totalSales)

With the three steps above, we have added our own custom method to an existing RDD.
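
For reference, here is a minimal end-to-end sketch that assembles the snippets above into one runnable application. The object name IteblogRDDExtending and the compareTo implementation are my additions; args(0) is assumed to be the Spark master URL and file:///www/iteblog.csv is assumed to exist, as in the snippets above:

import scala.language.implicitConversions
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class SalesRecord(val id: String, val customerId: String,
                  val itemId: String, val itemValue: Double)
  extends Comparable[SalesRecord] with Serializable {
  // Comparable requires compareTo; comparing by id is one reasonable choice
  override def compareTo(other: SalesRecord): Int = id.compareTo(other.id)
}

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum
}

object IteblogCustomFunctions {
  implicit def addIteblogCustomFunctions(rdd: RDD[SalesRecord]): IteblogCustomFunctions =
    new IteblogCustomFunctions(rdd)
}

object IteblogRDDExtending {
  def main(args: Array[String]): Unit = {
    // args(0) is the Spark master URL, as in the snippets above
    val sc = new SparkContext(args(0), "iteblogRDDExtending")

    val dataRDD = sc.textFile("file:///www/iteblog.csv")
    val salesRecordRDD = dataRDD.map { row =>
      val colValues = row.split(",")
      new SalesRecord(colValues(0), colValues(1),
        colValues(2), colValues(3).toDouble)
    }

    // The implicit conversion makes totalSales available on RDD[SalesRecord]
    import IteblogCustomFunctions._
    println(salesRecordRDD.totalSales)

    sc.stop()
  }
}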

Reprinted from: http://www.iteblog.com/archives/1298
