Spark Usage Summary and Sharing (Repost)

Source: Internet
Author: User
Tags: types of functions, spark, rdd

Background

I have been developing with Spark for several months now. The learning curve of Scala/Spark is steeper than that of Python/Hive, and I remember how slow my progress was when I first started. Thankfully, those hard days are behind me. Looking back on them, and hoping to spare the other students on the project team the same detours, I decided to summarize and organize my experience with Spark.

Spark Basics

The cornerstone: the RDD

The core of Spark is the RDD (Resilient Distributed Dataset), a general data abstraction that encapsulates the underlying data operations such as map, filter, and reduce. The RDD provides an abstraction for data sharing, something that other big data processing frameworks such as MapReduce, Pregel, DryadLINQ, and Hive lack, which makes the RDD more general.

To summarize briefly: an RDD is an immutable, distributed collection of objects. Each RDD consists of multiple partitions, and each partition can be computed at the same time on different nodes in the cluster. An RDD can contain any Python, Java, or Scala object.

The applications in the Spark ecosystem are all built on top of the RDD, which shows that the RDD abstraction is general enough to describe most scenarios.

RDD operation types: transformations and actions

RDD operations fall into two main categories: transformations and actions. The main difference is that a transformation takes an RDD and returns an RDD, while an action takes an RDD but returns a non-RDD value. Transformations are lazily evaluated: each one only records how it was derived from its parent RDD, a record called the lineage. An action triggers the actual computation.
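As a minimal sketch of the distinction (assuming an existing SparkContext named sc), the map and filter calls below only record lineage, while count triggers the job:

// Transformations are lazy: nothing runs when map/filter are called.
val nums = sc.parallelize(1 to 100)      // create an RDD
val doubled = nums.map(_ * 2)            // transformation: RDD => RDD, only lineage is recorded
val evens = doubled.filter(_ % 4 == 0)   // another transformation, still no job submitted

// Actions return a non-RDD value and trigger the actual computation.
val total = evens.count()                // action: the whole pipeline runs here
println(total)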

With lazy evaluation, RDD operations linked by lineage can be pipelined: the operations in a pipeline can be executed directly on a single node, avoiding the synchronization waits that would otherwise occur between successive transformations.

Chaining operations through lineage also keeps each computation step simple, without worrying about producing too much intermediate data, because the chained operations are pipelined. It keeps each step single-purpose as well: there is no temptation to cram overly complex logic into a single map or reduce just to cut down the number of MapReduce rounds, as often happens in MapReduce.

RDD usage pattern

RDDs are used in a generic pattern that can be abstracted into the following steps (a sketch follows the list):

    1. Load external data and create an RDD object
    2. Create a new RDD with a transformation such as filter
    3. Cache any RDD that needs to be reused
    4. Use an action, such as count, to kick off a parallel computation
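A minimal sketch of the four steps, assuming a SparkContext named sc and a hypothetical HDFS path:

// 1. Load external data and create an RDD (the path is hypothetical)
val lines = sc.textFile("hdfs://path/to/logs")

// 2. Create a new RDD with a transformation
val errors = lines.filter(_.contains("error"))

// 3. Cache the RDD that will be reused
errors.cache()

// 4. Use actions to kick off parallel computations
println(errors.count())
println(errors.filter(_.contains("timeout")).count())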

Why the RDD is efficient

Spark's official figures claim that, in some scenarios, RDD computations are 20x as fast as Hadoop. Whether that figure is inflated is not our concern here; what matters is that the RDD's efficiency is guaranteed by specific mechanisms:

    1. RDD data is read-only and cannot be modified. If you need to modify the data, you must transform the parent RDD into a child RDD. Consequently, the fault-tolerance strategy does not rely on data redundancy; instead, fault tolerance is achieved through the parent-child (lineage) dependencies between RDDs.
    2. RDD data lives in memory, and between successive RDD operations the data is not written to disk, avoiding unnecessary I/O.
    3. The data stored in an RDD can be plain Java objects, avoiding unnecessary serialization and deserialization.

All in all, the RDD's efficiency comes mainly from avoiding unnecessary operations and from trading fine-grained data mutation for computational efficiency.

Tips for using Spark

Extending the RDD's basic functions

Although the RDD provides many functions, they are still limited, and sometimes you need to extend them with custom functions of your own. In Spark, the RDD can be extended easily with implicit conversions. During development of the user-portrait project, we frequently needed a rollup operation (similar to rollup in Hive) to compute aggregates at multiple levels. A concrete implementation follows.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Implicit classes cannot be top-level, so the extension lives in a wrapper object.
object RddExtensions {

  /**
   * Extend the Spark RDD with a rollup method.
   */
  implicit class RollupRDD[T: ClassTag](rdd: RDD[(Array[String], T)]) extends Serializable {

    /**
     * Rollup operation similar to SQL ROLLUP.
     *
     * @param aggregate    aggregation function
     * @param keyPlaceHold key placeholder, defaults to FaceConf.STAT_SUMMARY
     * @param isCache      whether to cache the data
     * @return the aggregated data
     */
    def rollup[U: ClassTag](
        aggregate: Iterable[T] => U,
        keyPlaceHold: String = FaceConf.STAT_SUMMARY,
        isCache: Boolean = true): RDD[(Array[String], U)] = {

      if (rdd.take(1).isEmpty) {
        return rdd.map(x => (Array[String](), aggregate(Array[T](x._2))))
      }

      if (isCache) {
        rdd.cache() // improve computational efficiency
      }

      val totalKeyCount = rdd.first._1.size
      val result = (1 to totalKeyCount).par.map(untilKeyIndex => { // parallel computation over key levels
        rdd.map(row => {
          val combineKey = row._1.slice(0, untilKeyIndex).mkString(FaceConf.KEY_SEP) // combined key
          (combineKey, row._2)
        }).groupByKey.map(row => { // aggregate computation
          val oldKeyList = row._1.split(FaceConf.KEY_SEP)
          val newKeyList = oldKeyList ++ Array.fill(totalKeyCount - oldKeyList.size)(keyPlaceHold)
          (newKeyList, aggregate(row._2))
        })
      }).reduce(_ ++ _) // union the per-level results

      result
    }
  }
}

The code above declares an implicit class whose member variable rdd has type RDD[(Array[String], T)]. If an RDD of this type appears anywhere in application code and the implicit conversion is imported, the compiler will wrap that RDD in the implicit class, so the rollup method can be called on it just like ordinary methods such as map and filter.
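A hedged usage sketch (assuming the FaceConf constants referenced above are defined in the project and sc is an existing SparkContext):

import RddExtensions.RollupRDD   // bring the implicit conversion into scope

// keys are dimension values, e.g. (city, gender); values are counts
val detail: RDD[(Array[String], Long)] = sc.parallelize(Seq(
  (Array("shenzhen", "male"), 10L),
  (Array("shenzhen", "female"), 20L),
  (Array("beijing", "male"), 30L)))

// rollup is now available on the RDD, just like map or filter
val rolled = detail.rollup[Long](values => values.sum)
rolled.collect().foreach { case (keys, total) => println(keys.mkString("|") + " -> " + total) }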

How closures access external variables in RDD operations

RDD operations take a user-defined closure function as an argument. If that closure needs to access external variables, certain rules must be followed, otherwise a runtime exception is thrown. When a closure is shipped to a node, the following steps occur:

    1. The driver uses reflection to find all variables accessed by the closure, wraps them into an object, and serializes that object
    2. The serialized object is transferred over the network to the worker node
    3. The worker node deserializes the closure object
    4. The worker node executes the closure function

Note: changes made to external variables inside the closure are not propagated back to the driver.

In short, the function is shipped over the network and then executed remotely. Therefore any captured variable must be serializable, otherwise shipping it fails. Even when the job runs locally, the same four steps are still performed.
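A minimal sketch of the rules (assuming an existing SparkContext sc): the threshold value captured by the closure must be serializable, a non-serializable object cannot be shipped, and driver-side variables are not updated by the tasks.

// OK: Int is serializable, so the closure can be shipped to the workers.
val threshold = 10
val big = sc.parallelize(1 to 100).filter(_ > threshold)

// Not OK: capturing a non-serializable object fails when the closure is serialized.
class ConnectionPool   // hypothetical class that does not extend Serializable
val pool = new ConnectionPool
// sc.parallelize(1 to 100).map(x => { pool.hashCode; x })   // would throw a NotSerializableException

// Note: mutating a captured variable on the workers is NOT visible on the driver.
var counter = 0
sc.parallelize(1 to 100).foreach(_ => counter += 1)
println(counter)   // still 0 (or undefined) on the driver; use an accumulator instead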

Broadcast variables can also be used for this, but using broadcasts everywhere makes the code less concise. Broadcast was designed to cache large data on each node and avoid shipping it repeatedly, thereby improving computational efficiency, not as a general mechanism for accessing external variables.

RDD Data Synchronization

The RDD currently offers two mechanisms for sharing and synchronizing data: broadcast variables and accumulators.

Broadcast variables

As mentioned earlier, a broadcast can deliver a variable to a closure, to be used by that closure. But broadcast also plays a role in sharing large data. For example, suppose you have an IP library of several GB that a map operation needs to consult. The IP library can be broadcast to the closures and used by the parallel tasks. Broadcast improves the efficiency of data sharing in two ways: 1) each node (physical machine) in the cluster keeps only one copy, whereas by default every task gets its own copy of the closure; 2) the broadcast is transmitted in a BitTorrent-like, peer-to-peer fashion, which greatly improves the transfer rate on large clusters. When a broadcast variable is modified on a node, the change is not propagated to other nodes.
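A minimal sketch of a broadcast lookup, where the small ipLibrary map stands in for the multi-GB IP library (assuming an existing SparkContext sc):

// The lookup table is built once on the driver ...
val ipLibrary: Map[String, String] = Map("10.0.0.1" -> "shenzhen", "10.0.0.2" -> "beijing")

// ... and broadcast once per node instead of being shipped with every task.
val ipLibraryBc = sc.broadcast(ipLibrary)

val accessLogs = sc.parallelize(Seq("10.0.0.1", "10.0.0.2", "10.0.0.1"))
val regions = accessLogs.map(ip => ipLibraryBc.value.getOrElse(ip, "unknown"))   // read-only access
println(regions.countByValue())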

Accumulators

An accumulator is a write-only (from the tasks' point of view) variable that accumulates state across tasks; only the driver can read its value. Note that, as of release 1.2, the accumulator has a known flaw: inside an action, Spark guarantees that an RDD of n elements updates the accumulator exactly n times, but inside a transformation there is no such guarantee, and the accumulator may be updated n+1 times or more.
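A minimal sketch using the Spark 1.x accumulator API (sc.accumulator is deprecated in later releases in favour of sc.longAccumulator), assuming an existing SparkContext sc:

// Count malformed lines as a side effect; only the driver may read the value.
val badLines = sc.accumulator(0)               // Spark 1.x API; use sc.longAccumulator("badLines") on 2.x+

val lines = sc.parallelize(Seq("1", "2", "oops", "4"))
val nums = lines.map { s =>
  if (!s.forall(_.isDigit)) badLines += 1      // write-only from the tasks' side
  s
}

nums.count()                                   // the action triggers the accumulator updates
println("bad lines: " + badLines.value)        // reading the value is only reliable on the driver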

At present, the synchronization mechanisms offered by the RDD are quite coarse; in particular, variable state cannot be synchronized inside transformations, so the RDD cannot support complex, stateful transactional operations. But the RDD's purpose is to provide a general parallel computing framework; it was never intended to provide fine-grained data synchronization, since that would run counter to its design.

RDD optimization tips

RDD Cache

Cache an RDD that will be used more than once, otherwise the computation is repeated unnecessarily. For example:

val data = ... // read from TDW
println(data.filter(_.contains("error")).count)
println(data.filter(_.contains("Warning")).count)

In the three lines of code above, data is computed twice, once for each count. It is more efficient to persist it in memory immediately after it is loaded, as follows:

val data = ... // read from TDW
data.cache()
println(data.filter(_.contains("error")).count)
println(data.filter(_.contains("Warning")).count)

This way, data is cached in memory once the first action has computed it, and the second count uses the in-memory data directly.

Parallelizing operations across RDDs

Operations on a single RDD are already executed in parallel, but operations across multiple RDDs can be parallelized as well. Consider the following:

val dataList: Array[RDD[Int]] = ...
val sumList = dataList.map(_.sum)   // serial: one Spark job after another

In the example above, the map traverses the array variable and computes the sum of each RDD serially. Since the RDDs have no logical dependency on one another, their computations can in principle run in parallel, which is easy to do in Scala, as follows:

val dataList: Array[RDD[Int]] = ...
val sumList = dataList.par.map(_.sum)   // parallel: the jobs are submitted concurrently

Note the added .par call, which turns the serial map over the array into a parallel one.

Reducing shuffle network transfer

In general, network I/O is expensive, so reducing network traffic can significantly speed up a computation. A shuffle operation between two RDDs (join, etc.) proceeds roughly as follows.

The user data userData and the event data events are joined by user ID, which means both sides are shuffled over the network to other nodes, so there are two network transfers. By default Spark performs both. However, if you give Spark more information, it can optimize the join so that only one transfer happens: use a HashPartitioner to partition userData "locally" first, and then events is shuffled directly to the nodes holding userData, cutting out part of the network transfer.

The userData side is then handled locally, with no network transfer. By partitioning the data by key as soon as it is loaded, the local hash-partitioning step is folded into the loading step. Sample code follows:

val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(100))   // create 100 partitions
  .persist()

Note that persist is required here; otherwise the partitioned RDD would be recomputed (and reshuffled) every time it is used. The 100 specifies the number of partitions, i.e. the degree of parallelism.
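A hedged sketch of the join that benefits from this pre-partitioning (the events RDD, its path, and the LinkInfo type are hypothetical):

// events arrive periodically; only this smaller RDD is shuffled,
// because userData already has a known HashPartitioner.
val events = sc.sequenceFile[UserID, LinkInfo]("hdfs://.../events")

val joined = userData.join(events)   // userData stays put; events is shuffled to its partitions
println(joined.count())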

Other notes on Spark

Spark development model

Because a Spark application has to be deployed to and run on a cluster, local debugging is troublesome. After this period of accumulated experience, I have settled on a development workflow whose goal is to maximize development and debugging efficiency while ensuring development quality. Of course, this workflow may not be optimal and will need continuous improvement.

The workflow itself is straightforward, so here I only want to explain why unit testing is needed. Most projects in the company do not really advocate unit testing, and under schedule pressure developers resist it because it costs "extra" effort. But bugs do not disappear because the schedule is tight; on the contrary, tight schedules tend to produce more of them than usual. So if you do not spend the time on unit tests, you will spend as much or more time debugging. Very often a tiny bug costs you a long debugging session, yet it would have been caught easily by a unit test. Unit tests also bring two extra benefits: 1) they act as API usage examples; 2) they serve as regression tests. So unit testing is an investment, and the ROI is quite high. That said, everything needs a sense of proportion: adjust the granularity of unit testing to the urgency of the project, testing some things and leaving others out.
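A minimal local-mode test sketch (plain assertions around hypothetical error-counting logic; a test framework such as ScalaTest could wrap the same idea):

import org.apache.spark.{SparkConf, SparkContext}

object ErrorCountSpec {
  def main(args: Array[String]): Unit = {
    // A local SparkContext is enough to exercise RDD logic without a cluster.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
    try {
      val lines = sc.parallelize(Seq("ok", "error: disk", "error: net", "warn"))
      val errorCount = lines.filter(_.startsWith("error")).count()
      assert(errorCount == 2, s"expected 2 errors, got $errorCount")   // regression check doubles as a usage example
      println("test passed")
    } finally {
      sc.stop()
    }
  }
}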

Other Spark features

As mentioned when introducing the Spark ecosystem, beyond the core RDD Spark ships several very useful components:

    1. Spark SQL: SQL queries implemented on top of the RDD, similar to Hive
    2. Spark Streaming: stream processing, providing near-real-time computation similar to Storm
    3. MLlib: machine learning library, providing parallel implementations of common classification, clustering, regression, cross-validation and other algorithms
    4. GraphX: graph computation framework implementing basic graph operations, common graph algorithms, and the Pregel graph programming model

These features, especially the ones closely related to data mining (MLlib), are worth continued study and use.

Resources

    1. An Architecture for Fast and General Data Processing on Large Clusters, Matei Zaharia
    2. Spark official website
    3. Spark closure function external variable access problem
    4. Learning Spark: Lightning-Fast Big Data Analysis
