Spark RDD API Extension Development (1)

Tags: spark, rdd

As we all know, Apache Spark has a lot of built-in APIs for manipulating data. But in real-world application development we often hit problems that Spark's APIs don't cover, and we need to extend the Spark API to implement our own approach.
There are two ways to extend the Spark API: (1) add custom methods to an existing RDD, or (2) create our own RDD. In this article, I'll elaborate on both of these methods and give the code, starting with the first.

Suppose we have some product sales data in CSV format. For the sake of simplicity, assume each row consists of four fields, id, customerId, itemId, and itemValue, which we represent with a SalesRecord class:

class SalesRecord(val id: String, val customerId: String,
                  val itemId: String, val itemValue: Double)
  extends Comparable[SalesRecord] with Serializable {
  // Comparable requires a compareTo implementation; comparing by id is one reasonable choice
  override def compareTo(other: SalesRecord): Int = id.compareTo(other.id)
}

We can then parse the sales data and store it in an RDD[SalesRecord]:

val sc = new SparkContext(args(0), "iteblogRDDExtending")
val dataRDD = sc.textFile("file:///www/iteblog.csv")
val salesRecordRDD = dataRDD.map { row =>
  val colValues = row.split(",")
  new SalesRecord(colValues(0), colValues(1),
    colValues(2), colValues(3).toDouble)
}

If we want to calculate the total sales of these goods, we would write:

salesRecordRDD.map(_.itemValue).sum

Although this is concise, the intent is a bit hard to read at a glance. It would be much easier to understand if we could write:

salesRecordRDD.totalSales

In the snippet above, totalSales looks just like a built-in Spark method, but Spark does not provide it; we need to implement this custom operation on the existing RDD ourselves.

Let me show you how to add our own custom methods to an existing RDD.

  First, define a helper class to hold all of our custom operations

Strictly speaking, you don't need a dedicated class just to add your own methods, but for the sake of organization it is recommended. Let's define an IteblogCustomFunctions class to hold all of our custom methods. It is designed to work on RDD[SalesRecord], so every operation it provides deals with sales data:

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum
}
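
One benefit of collecting these operations in a single class is that new ones slot in alongside totalSales. As an illustration only (this method is not part of the original article, and the name totalSalesByCustomer is hypothetical), a per-customer aggregate could be added like so:

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum

  // Hypothetical extra operation: total sales per customer,
  // built on the standard pair-RDD reduceByKey
  def totalSalesByCustomer: RDD[(String, Double)] =
    rdd.map(r => (r.customerId, r.itemValue)).reduceByKey(_ + _)
}

Note that on Spark 1.3 and later the pair-RDD functions such as reduceByKey are available without extra imports; on older versions you also need import org.apache.spark.SparkContext._.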

  Second, add the methods to the RDD via an implicit conversion

We define the implicit conversion addIteblogCustomFunctions, which makes all of the sales-data methods available on any RDD[SalesRecord]:

object IteblogCustomFunctions {
  implicit def addIteblogCustomFunctions(rdd: RDD[SalesRecord]): IteblogCustomFunctions =
    new IteblogCustomFunctions(rdd)
}
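
As a side note (not from the original article): on Scala 2.10 and later, the helper class and its implicit conversion can be collapsed into a single implicit class; the name SalesRecordRDDFunctions below is just illustrative:

object IteblogCustomFunctions {
  // Equivalent to the separate class + implicit def pair above
  implicit class SalesRecordRDDFunctions(rdd: RDD[SalesRecord]) {
    def totalSales: Double = rdd.map(_.itemValue).sum
  }
}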

  Third, use the custom method

Now, by importing the members of IteblogCustomFunctions, we can call our custom method as if it were built in:

import IteblogCustomFunctions._
println(salesRecordRDD.totalSales)

With the three steps above, we have added our own custom method to an existing RDD.
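
For reference, here is a minimal end-to-end sketch that assembles the snippets above into one runnable application. The object name IteblogRDDExtending and the compareTo implementation are my additions; args(0) is assumed to be the Spark master URL and file:///www/iteblog.csv is assumed to exist, as in the snippets above:

import scala.language.implicitConversions
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class SalesRecord(val id: String, val customerId: String,
                  val itemId: String, val itemValue: Double)
  extends Comparable[SalesRecord] with Serializable {
  // Comparable requires compareTo; comparing by id is one reasonable choice
  override def compareTo(other: SalesRecord): Int = id.compareTo(other.id)
}

class IteblogCustomFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum
}

object IteblogCustomFunctions {
  implicit def addIteblogCustomFunctions(rdd: RDD[SalesRecord]): IteblogCustomFunctions =
    new IteblogCustomFunctions(rdd)
}

object IteblogRDDExtending {
  def main(args: Array[String]): Unit = {
    // args(0) is the Spark master URL, as in the snippets above
    val sc = new SparkContext(args(0), "iteblogRDDExtending")

    val dataRDD = sc.textFile("file:///www/iteblog.csv")
    val salesRecordRDD = dataRDD.map { row =>
      val colValues = row.split(",")
      new SalesRecord(colValues(0), colValues(1),
        colValues(2), colValues(3).toDouble)
    }

    // The implicit conversion makes totalSales available on RDD[SalesRecord]
    import IteblogCustomFunctions._
    println(salesRecordRDD.totalSales)

    sc.stop()
  }
}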

Reprinted from: http://www.iteblog.com/archives/1298
