Converting a MapReduce Program to a Spark Program

Source: Internet
Author: User
Tags: iterable, hadoop, mapreduce

To compare MapReduce and Spark, first note that current big data processing can be divided into the following three types:
1. Complex batch data processing, with a typical time span of ten minutes to a few hours;
2. Interactive query over historical data, with a typical time span of ten seconds to a few minutes;
3. Processing of real-time data streams (streaming data processing), with a typical time span of hundreds of milliseconds to a few seconds.

Big data processing inevitably relies on a cluster environment, and a cluster environment brings three major challenges: parallelization, handling of single-point failures, and resource sharing. These can be addressed, respectively, by rewriting applications in a parallel fashion, handling single points of failure, and dynamically allocating compute resources.

There are many big data programming frameworks for cluster environments. The first was Google's MapReduce, which showed us a simple, general-purpose, and automatically fault-tolerant batch processing model. However, MapReduce is not well suited to other types of computation, such as interactive and streaming workloads. This led to a large number of proprietary data processing systems that differ from MapReduce, such as Storm, Impala, and so on. But these proprietary systems have some shortcomings:
1. Repetitive work: many proprietary systems solve the same problems, such as distributed execution and fault tolerance. For example, a distributed SQL engine and a machine learning system both need parallel aggregation, and each proprietary system solves this problem all over again.

2. Composition problems: combining computations across different systems is a hassle. For a typical big data application, the intermediate datasets are very large and moving them is expensive. In the current environment, data must be replicated to a stable storage system such as HDFS in order to be shared between different computing engines. Such replication may cost more than the actual computation, so composing multiple systems into a pipeline is not very efficient.

3. Limited scope: if an application is not suited to a given proprietary computing system, the user can only switch to another system or write a new one.

4. Resource allocation: dynamically sharing resources between different computing engines is difficult, because most engines assume they own their machine resources for the duration of a program run.

5. Management issues: managing and deploying multiple proprietary systems takes more effort and time, and end users in particular must learn multiple APIs and system models.

Spark is a big data processing framework from Berkeley that introduces the concept of RDDs (Resilient Distributed Datasets), an abstraction for resilient, distributed datasets. Spark is an extension of the MapReduce model. Workloads that MapReduce handles poorly, such as iterative, interactive, and streaming computation, are hard to implement because MapReduce lacks an efficient way to share data across the stages of a parallel computation, and efficient data sharing is exactly the essence of RDDs. With this efficient data sharing and a MapReduce-like operation interface, all of the proprietary computation types described above can be expressed effectively, with performance comparable to the proprietary systems.

Introduction to MapReduce and Spark. MapReduce is tailor-made for Apache Hadoop and is ideal for Hadoop scenarios such as large-scale log processing systems and bulk data extraction and loading tools (ETL tools). But as Hadoop deployments grew, Hadoop developers found that MapReduce was not the best choice in many scenarios, and Hadoop moved resource management into its own component, YARN. In addition, projects like Impala began to enter the architecture: Impala provides SQL semantics for querying petabytes of big data stored in Hadoop's HDFS and HBase. There have been similar projects before, such as Hive. Although Hive also provides SQL semantics, it remains a batch system that has difficulty meeting the needs of interactive query, because Hive's underlying implementation uses the MapReduce engine. By contrast, Impala's biggest feature is its efficiency.

The first generation of Hadoop MapReduce is a software framework for distributed processing of massive datasets on a computer cluster. It consists of one JobTracker and a number of TaskTrackers. Its run flow is shown in Figure 1:


There are four independent entities at the top level: the client, the JobTracker, the TaskTrackers, and the distributed file system. The client submits the MapReduce job; the JobTracker coordinates the running of the job and is a Java application whose main class is JobTracker; the TaskTrackers run the tasks after the job has been split up, and each TaskTracker is also a Java application whose main class is TaskTracker. Running a MapReduce job on Hadoop involves six steps: submitting the job, initializing the job, assigning tasks, executing tasks, updating progress and status, and completing the job.

Introduction to Spark. The goal of the Spark ecosystem is to integrate batch processing, interactive processing, and stream processing into a single software framework. Spark is an open-source cluster computing system based on in-memory computing that is designed to make data analysis faster. Spark is very small; it was developed by a team led by Matei Zaharia at the AMP Lab at the University of California, Berkeley. It is written in Scala, and the core part of the project is only 63 Scala files, very short and concise. Spark provides an in-memory distributed dataset that optimizes iterative workloads in addition to supporting interactive queries. Spark offers a memory-based compute cluster: when analyzing data, it imports the data into memory for fast querying, which is much faster than disk-based systems such as Hadoop. Spark was originally developed for iterative algorithms, such as machine learning and graph mining algorithms, and for interactive data mining. In these scenarios, Spark can run up to hundreds of times faster than Hadoop.

Spark allows applications to keep working sets in memory for efficient reuse, which supports a wide range of data processing applications while retaining the important properties of MapReduce, such as high fault tolerance, data locality, and large-scale data processing. In addition, it introduces the concept of Resilient Distributed Datasets (RDDs):

1. An RDD appears as a Scala object and can be created from a file;
2. It is an immutable, partitioned collection of objects distributed across a cluster;
3. It is built through parallel transformations (map, filter, groupBy, join) on base data (a base RDD), producing transformed RDDs;
4. On failure, an RDD can be reconstructed from its lineage information;
5. An RDD can be cached for reuse.
Figure 2 shows sample code for log mining that first loads the error messages from the log data into memory and then searches them interactively:
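The figure itself is not reproduced here, but a minimal sketch of such log-mining code might look like the following; the HDFS path, the "ERROR" prefix, the "MySQL"/"HDFS" search strings, and the tab-separated format are assumptions for illustration, not part of the original example.

import org.apache.spark.{SparkConf, SparkContext}

object LogMiningSketch {
  def main(args: Array[String]): Unit = {
    // Driver program: creates the SparkContext for this application.
    val sc = new SparkContext(new SparkConf().setAppName("LogMiningSketch"))

    val lines  = sc.textFile("hdfs://...")            // base RDD created from a file (placeholder path)
    val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD; "ERROR" prefix is assumed
    errors.cache()                                    // keep the working set in memory for reuse

    // Interactive-style queries over the cached error lines.
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val hdfsFields  = errors.filter(_.contains("HDFS"))
                            .map(_.split("\t")(3))    // assumed tab-separated log format
                            .collect()

    println(s"MySQL errors: $mysqlErrors, HDFS fields collected: ${hdfsFields.length}")
    sc.stop()
  }
}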


When the data is loaded, it resides on the workers in the form of blocks. The driver distributes tasks to the workers, and the workers process them and feed the results back to the driver. You can also build a cache on the workers for the dataset; the cache is handled like a block, with the same distribute-and-feed-back process.

Spark's RDD concept provides performance comparable to proprietary systems, along with features those systems lack, including fault tolerance and handling of straggler nodes.

1. Iterative algorithms: this is a very common application scenario for current proprietary systems, for example iterative computation for graph processing and machine learning. RDDs implement these models well, including Pregel, HaLoop, and GraphLab.
2. Relational queries: one of the most important workloads for MapReduce is running SQL queries, from long-running batch jobs lasting hours to interactive queries. However, compared with parallel databases, MapReduce has inherent disadvantages for interactive queries; for example, its fault-tolerance model makes it very slow. With the RDD model, many common database engine features can be implemented with good performance.
3. MapReduce batch processing: the interface provided by RDDs is a superset of MapReduce, so RDDs can efficiently run applications written for MapReduce, and RDDs are also suitable for more abstract DAG-based applications.
4. Streaming: current streaming systems provide only limited fault tolerance, paying for it either with heavy replication or with long recovery times. In particular, most current systems are based on a continuous-operator model, in which long-lived stateful operators process each record as it arrives. To recover a failed node, they either keep two copies of every operator or replay upstream data, both of which are costly. RDDs can instead be used to implement discretized streams, which overcome these problems. A discretized stream treats a streaming computation as a series of short, deterministic batch computations rather than long-lived stateful operators, keeping the state between batches in RDDs. The discretized-stream model allows parallel recovery through the RDD lineage graph without data replication, as sketched below.
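As a rough illustration of the discretized-stream idea, here is a minimal Spark Streaming word-count sketch. The socket source on localhost:9999 and the one-second batch interval are assumptions chosen for the example, not part of the original article.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
    // Each 1-second batch of the stream becomes a short, deterministic RDD computation.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)  // assumed text source for illustration
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()   // emit the per-batch word counts

    ssc.start()
    ssc.awaitTermination()
  }
}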

Explanation of Spark internal terminology:
Application: a user program built on Spark, consisting of a driver program and executors on the cluster;
Driver program: runs the main function and creates the SparkContext;
Cluster manager: an external service for acquiring resources on the cluster (for example Standalone, Mesos, or YARN);
Worker node: any node in the cluster that can run application code;
Executor: a process started for an application on a worker node; it runs tasks and keeps data in memory or on disk. Each application has its own independent executors;
Task: a unit of work sent to an executor;
Job: a parallel computation consisting of many tasks, corresponding to a Spark action;
Stage: a job is split into groups of tasks, each group called a stage (analogous to the map and reduce tasks of MapReduce).
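To relate these terms to code, here is a minimal driver-program sketch; the input path is a placeholder and the program itself is only an assumed example. Creating the SparkContext makes the program part of an application, the count() action triggers a job, and Spark splits that job into stages made of tasks that run inside the executors.

import org.apache.spark.{SparkConf, SparkContext}

object TerminologySketch {
  def main(args: Array[String]): Unit = {
    // Driver program: runs main() and creates the SparkContext; together with its
    // executors on the cluster it forms one application.
    val conf = new SparkConf().setAppName("TerminologySketch")
    val sc   = new SparkContext(conf)

    val lines = sc.textFile("hdfs://...")              // placeholder input path
    val pairs = lines.map(line => (line.length, 1))    // transformation only, no job yet
    val total = pairs.count()                          // action: triggers a job, which Spark
                                                       // splits into stages made of tasks
                                                       // that run inside the executors
    println(s"number of lines: $total")
    sc.stop()
  }
}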

Converting MapReduce to Spark. Spark is a compute engine similar to MapReduce; its in-memory approach avoids MapReduce's slow disk reads, and its Scala-based functional programming style and API make parallel computation highly efficient.

Since Spark computes over data with RDDs (resilient distributed datasets), which differ considerably from MapReduce's map() and reduce(), it is difficult to use the Mapper and Reducer APIs directly, and this is a stumbling block when moving from MapReduce to Spark.

The map() and reduce() methods in Scala or Spark are more flexible than the map() and reduce() methods in Hadoop MapReduce. Some features of Hadoop MapReduce are listed below:
Mappers and Reducers typically use key-value pairs as inputs and outputs;
1. Each Reducer reduces the values for one key at a time;
2. Each Mapper or Reducer may emit zero, one, or more key-value pairs for each input;
3. Mappers and Reducers may emit arbitrary keys or values, rather than following a fixed dataset schema;
4. Mapper and Reducer objects have a lifecycle that spans many calls to map() and reduce(); they support setup() and cleanup() methods, which can be used to take actions before or after a batch of records is processed.

Imagine a scenario in which we need to count the number of characters in each line of a text file. In Hadoop MapReduce, the Mapper receives a key-value pair for each line, where the key is the position of the line in the file and the value is the text of the line, and it emits the line's length together with a count of 1. The code is as follows:

public class LineLengthCountMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable lineNumber, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(new IntWritable(line.getLength()), new IntWritable(1));
    }
}
In the code shown above, because Mappers and Reducers only deal with key-value pairs, the input to LineLengthCountMapper comes from a TextInputFormat object: the key is the position of the line in the file, and the value is the text of that line. The corresponding Spark code is as follows:
lines.map(line => (line.length, 1))
In Spark, the input is a Resilient Distributed Dataset (RDD), and Spark does not require key-value pairs; instead it uses Scala tuples, created with the (a, b) syntax, which is how (line.length, 1) above is built. The map() operation above produces an RDD of (line.length, 1) tuples. When an RDD contains tuples, it gains additional methods, such as reduceByKey(), which is important for reproducing MapReduce behavior.

The code shown below is the Hadoop MapReduce Reducer that sums the counts for each line length and writes out the result:
public class LineLengthReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(length, new IntWritable(sum));
    }
}
The corresponding code in Spark is as follows:

val lengthCounts = lines.map(line => (line.length, 1)).reduceByKey(_ + _)
Spark's RDD API does have a reduce() method, but it would reduce the entire set of key-value pairs to one single value, which is why reduceByKey() is used here to combine the values for each key separately.
Now suppose we need to count the number of words that begin with an uppercase letter. For each line of text, a Mapper may emit several key-value pairs, as in the following code:
public class CountUppercaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable lineNumber, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split(" ")) {
            if (Character.isUpperCase(word.charAt(0))) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}
In Spark, the corresponding code is as follows:
lines.flatMap(_.split(" ").filter(word => Character.isUpperCase(word(0))).map(word => (word, 1)))
The map() semantics that MapReduce relies on do not apply here, because each input must correspond to exactly one output, whereas here each line may produce many outputs (or none). By contrast, the Spark approach is relatively simple: each line is first mapped to an array of output values, which may be empty or contain many elements, and the arrays are then flattened into a single RDD. That is what flatMap() does; within the function passed to it, the words of each line are filtered and turned into tuples.

In Spark, reduceByKey() can be used to count the occurrences of each word. But if we want to output each word in uppercase form together with its count, the MapReduce program would be as follows:
public class CountUppercaseReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(new Text(word.toString().toUpperCase()), new IntWritable(sum));
    }
}
In Spark, the code looks like this:
groupByKey().map { case (word, ones) => (word.toUpperCase, ones.sum) }
The groupByKey() method collects all the values for a key; there is no reduce function involved. In this example, the key is converted to uppercase and the values are summed directly. Note, however, that if a key is associated with a very large number of values, an OutOfMemory error may occur.

Spark provides a simpler way that converts the value for each key and hands the reduction over to Spark, avoiding the OOM exception, as follows:
reduceByKey(_ + _).map { case (word, total) => (word.toUpperCase, total) }
The main purpose of the setup() method in MapReduce is to do preparatory work before map() starts processing input; a common scenario is opening a database connection, with the occupied resources released in the cleanup() method.
public class SetupCleanupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Connection dbConnection;

    @Override
    protected void setup(Context context) {
        dbConnection = ...;
    }

    ...

    @Override
    protected void cleanup(Context context) {
        dbConnection.close();
    }
}
There's no such method in Spark!
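One common workaround, not covered in the article itself, is to do per-partition setup and teardown with mapPartitions(). In the sketch below, createConnection() and lookup() are hypothetical helpers standing in for the real database code.

lines.mapPartitions { partition =>
  // Open one connection per partition instead of one per record (hypothetical helper).
  val dbConnection = createConnection()
  // Materialize the results before closing the connection, because the returned
  // iterator may be consumed after this function has finished.
  val results = partition.map(line => lookup(dbConnection, line)).toList
  dbConnection.close()  // release the resource, playing the role of cleanup()
  results.iterator
}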
