Spark vs. Hadoop

1. What problem Hadoop solved

Hadoop is a solution for the reliable storage and processing of big data (data too large for one computer to store, or that one computer cannot process within the required time).
HDFS provides highly reliable file storage on a cluster of ordinary PCs, coping with server or hard drive failures by keeping multiple copies of each block.
MapReduce, through the simple abstractions of mapper and reducer, provides a programming model that can process very large datasets distributed across an unreliable cluster of hundreds of ordinary PCs, while hiding computational details such as distribution (inter-machine communication) and fault recovery. The mapper and reducer abstractions are the basic building blocks into which all kinds of complex data processing can be decomposed. In this way, a complex computation is broken down into a directed acyclic graph (DAG) of multiple jobs (each with one mapper and one reducer), and each mapper and reducer is then run on the Hadoop cluster to produce the result.
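
To make the mapper/reducer contract concrete, here is a minimal sketch in plain Scala (not the actual Hadoop API; the input lines are made up) showing how word count decomposes into a map step, a group-by-key step (the shuffle), and a reduce step:

// Mapper: emit (key, value) pairs for each input line
def mapper(line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toSeq

// Reducer: aggregate all values that share a key
def reducer(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

val lines = Seq("spark vs hadoop", "hadoop hdfs mapreduce")
val result = lines.flatMap(mapper)                          // map phase
  .groupBy(_._1)                                            // shuffle: group by key
  .map { case (w, pairs) => reducer(w, pairs.map(_._2)) }   // reduce phase
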
In MapReduce, shuffle is a very important step; it is this invisible shuffle process that allows developers writing MapReduce code to remain completely unaware of distribution and concurrency.

In the broad sense, shuffle refers to the whole series of steps between map and reduce shown in the figure.
Before YARN, task scheduling in MRv1 was handled by the framework itself; in MRv2 this was refactored and replaced by YARN. The basic idea of the refactoring was to split the two main functions of the JobTracker, resource management and task scheduling/monitoring, into separate components. The new ResourceManager globally manages the allocation of compute resources for all applications, while each application's ApplicationMaster is responsible for the corresponding scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the node management server (NodeManager) on each machine, manages the user's processes on that machine and organizes the computation.

2. Limitations and deficiencies of Hadoop

However, MapReduce has the following limitations, which make it difficult to use.
1. Low level of abstraction: code has to be written by hand even for simple operations, so it is hard to get started.
2. Only two operations, map and reduce, which lack expressive power.
3. A job has only two phases, map and reduce; complex computations require many jobs, and the dependencies between jobs have to be managed by the developers themselves.
4. The processing logic is hidden in code details; there is no view of the overall logic.
5. Intermediate results are also stored in the HDFS file system.
6. A ReduceTask has to wait until all MapTasks are complete before it can start.
7. Suitable only for batch data processing; support for interactive and real-time data processing is insufficient.
8. Poor performance for iterative data processing.

For example, using MapReduce to join two tables is a tricky process, as shown in the following illustration:

Join in MapReduce is a very laborious operation, as anyone who has written MapReduce code can appreciate.

3. The first advantage of Spark

Apache Spark is an emerging big data processing engine. Its main feature is a distributed in-memory abstraction over a cluster, supporting applications that need working sets.

This abstraction is the RDD (Resilient Distributed Dataset). An RDD is an immutable, partitioned collection of records, and the RDD is the programming model in Spark. Spark provides two kinds of operations on RDDs: transformations and actions. Transformations define a new RDD and include map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, etc. Actions return a result and include collect, reduce, count, save, lookup(key).
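
As a quick illustration (assuming an existing SparkContext named sc), a transformation such as filter or map only defines a new RDD, while an action such as count actually triggers a computation:

val nums = sc.parallelize(1 to 100)   // create an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)   // transformation: lazy, nothing runs yet
val doubled = evens.map(_ * 2)        // another transformation, still lazy
println(doubled.count())              // action: triggers the actual job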

The Spark API is very easy to use; Spark's word count example looks like this:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Here file is the RDD created from a file on HDFS, while flatMap, map, and reduceByKey each create a new RDD; a short program performs many transformations and actions.

In Spark, all RDD transformations are lazily evaluated. A transformation produces a new RDD whose data depends on the data of the original RDD, and each RDD consists of multiple partitions. A program therefore actually constructs a directed acyclic graph (DAG) of multiple, interdependent RDDs. Executing an action on an RDD submits this DAG to Spark as a job.
For example, the WordCount program above generates the following DAG:

scala> counts.toDebugString
res0: String =
MapPartitionsRDD[7] at reduceByKey at <console>:14 (1 partitions)
  ShuffledRDD[6] at reduceByKey at <console>:14 (1 partitions)
    MapPartitionsRDD[5] at reduceByKey at <console>:14 (1 partitions)
      MappedRDD[4] at map at <console>:14 (1 partitions)
        FlatMappedRDD[3] at flatMap at <console>:14 (1 partitions)
          MappedRDD[1] at textFile at <console>:12 (1 partitions)
            HadoopRDD[0] at textFile at <console>:12 (1 partitions)

Spark schedules this DAG as a job: it identifies stages, partitions, pipelines, tasks and caches, performs optimizations, and runs the job on the Spark cluster. Dependencies between RDDs are divided into wide dependencies (a child partition depends on multiple parent partitions) and narrow dependencies (a child partition depends on only one parent partition); stages are split at wide dependencies, and tasks are divided by partition.
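
As a rough sketch of how the stage split falls out of the dependency types (assuming a SparkContext sc and the same word count data as above):

val pairs = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))               // narrow dependency: stays within a partition
  .map(word => (word, 1))              // narrow dependency: pipelined into the same stage
val counts = pairs.reduceByKey(_ + _)  // wide dependency: shuffle, a new stage starts here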

Spark supports fault recovery in two different ways: lineage, which uses the data's provenance to re-execute the earlier processing, and checkpointing, which stores the dataset in persistent storage.
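
A short sketch of the two mechanisms (the checkpoint directory below is made up): lineage is recorded automatically and can be inspected with toDebugString, while checkpointing must be requested explicitly:

sc.setCheckpointDir("hdfs://.../checkpoints")   // where checkpointed data is persisted (made-up path)
val cleaned = sc.textFile("hdfs://...").filter(_.nonEmpty)
cleaned.checkpoint()            // save the dataset so recovery need not replay the full lineage
println(cleaned.toDebugString)  // the lineage Spark would otherwise use to recompute lost partitions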

Spark provides better support for iterative data processing: the data for each iteration can be kept in memory instead of being written to files.
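
A minimal sketch of why this matters for iteration (the update logic is a made-up placeholder): cache() keeps the working set in memory, so each pass reads RAM rather than re-reading HDFS as a chain of MapReduce jobs would:

val data = sc.textFile("hdfs://...").map(_.toDouble).cache()  // working set kept in memory
var weight = 0.0
for (i <- 1 to 10) {
  // each iteration reuses the cached RDD instead of re-reading the input
  weight += data.map(x => x * 0.01).sum()
}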

Spark's performance has improved significantly compared to Hadoop. In October 2014, Spark completed the Daytona GraySort benchmark entirely on disk, and compared its result with the earlier Hadoop test, as shown in the table:

As the table shows, to sort 100 TB of data (1 trillion records), Spark used only 1/10 of the computing resources that Hadoop used and took only 1/3 of the time.

4. The second advantage of Spark

Spark's benefits are not only reflected in performance gains. The Spark framework provides a unified data processing platform for batch processing (Spark Core), interactive queries (Spark SQL), stream processing (Spark Streaming), machine learning (MLlib), and graph computing (GraphX), which is a significant advantage over using Hadoop.

In the words of Databricks's Liancheng: One Stack to Rule Them All.

In particular, in cases where you need to do some ETL work, then train a machine learning model, and finally run some queries, if you use Spark you can write the logic of all three parts in one program, forming one large directed acyclic graph (DAG), and Spark will optimize that large DAG as a whole.

For example, the following program:

val points = sqlContext.sql("select latitude, longitude from historic_tweets")
val model = KMeans.train(points, ...)
sc.twitterStream(...).map(t => (model.closestCenter(t.location), 1)).reduceByWindow("5s", _ + _)

The first line of this program uses Spark SQL to query some points, the second line uses the K-means algorithm in MLlib to train a model on them, and the third line uses Spark Streaming to process messages from the stream, applying the trained model.
