Compare Hadoop with Spark

Source: Internet
Author: User
Tags: cassandra, hadoop, mapreduce, hadoop ecosystem

Read this article first: http://www.huochai.mobi/p/d/3967708/?share_tid=86bc0ba46c64&fmid=0

It is difficult to compare Hadoop and Spark directly: they handle many of the same tasks, but in some respects they do not overlap at all.

For example, Spark does not have file management capabilities and must rely on Hadoop Distributed File System (HDFS) or some other solution.

The main modules of the Hadoop framework include the following:

    • Hadoop Common
    • Hadoop Distributed File System (HDFS)
    • Hadoop YARN
    • Hadoop MapReduce

While the four modules above make up the core of Hadoop, there are several other modules. These modules include: Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend the capabilities of Hadoop.

Spark is really fast (up to 100 times faster than Hadoop MapReduce). Spark can also perform batch processing, but it really excels at streaming workloads, interactive queries, and machine learning.

Compared with MapReduce's disk-based batch processing engine, Spark is known for its real-time data processing capability. Spark is compatible with Hadoop and its modules; in fact, Spark is listed as a module on the Hadoop project page.

Spark has its own page because, although it can run in a Hadoop cluster through YARN (Yet Another Resource Negotiator), it also has a standalone mode. It can be run as a Hadoop module or as a standalone solution.

The main difference between MapReduce and Spark is that MapReduce uses persistent storage, while Spark uses Resilient Distributed Datasets (RDDs).

Performance

The reason Spark is so fast is that it processes all the data in memory; it can also use disk for data that does not fit entirely in memory.

Spark's in-memory processing delivers near-real-time analytics for data from many sources: marketing campaigns, machine learning, IoT sensors, log monitoring, security analytics, and social media sites. MapReduce, by contrast, uses batch processing and was never designed for blazing speed; it was designed to continually collect information from websites, with no requirement that the data be processed in real time or near real time.

Ease of Use

Spark supports Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL-92, so almost no extra learning is needed to get started.
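
As a rough illustration of how close this feels to plain SQL, here is a minimal sketch in Scala, assuming a SparkSession and a hypothetical people.json file with name and age fields:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").getOrCreate()
val people = spark.read.json("people.json")   // hypothetical input file
people.createOrReplaceTempView("people")

// The query itself is ordinary SQL-92-style syntax.
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()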

Spark also has an interactive mode, so developers and users can get immediate feedback on queries and other actions. MapReduce has no interactive mode, but add-on modules such as Hive and Pig make it easier to work with.

Cost

"Spark has proven easy to use with data up to petabyte scale. It was used to sort 100 TB of data three times faster than Hadoop MapReduce, on one-tenth the number of machines." This achievement won Spark the 2014 Daytona GraySort benchmark.

Compatibility

MapReduce and Spark are compatible with each other. Spark shares MapReduce's compatibility with many data sources, file formats, and business intelligence tools through JDBC and ODBC.

Data Processing

MapReduce is a batch processing engine. It operates in sequential steps: read data from the cluster, perform an operation on the data, write the results back to the cluster, read the updated data from the cluster, perform the next operation, write those results back to the cluster, and so on. Spark performs similar operations, but it does them in one step, in memory: it reads the data from the cluster, performs its operations on the data, and then writes the results back to the cluster.
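
As a small sketch of that difference (assuming a spark-shell session with sc available and a hypothetical HDFS log file), a multi-step analysis in Spark is expressed as one chained pipeline, with no intermediate writes back to the cluster between steps:

// Read once from the cluster, then chain the steps in memory.
val logs = sc.textFile("hdfs:///logs/access.log")          // hypothetical path
val errorCounts = logs
  .filter(line => line.contains("ERROR"))                  // step 1: select error lines
  .map(line => (line.split(" ")(0), 1))                    // step 2: key by the first field
  .reduceByKey(_ + _)                                      // step 3: aggregate
errorCounts.saveAsTextFile("hdfs:///logs/error-counts")    // single write back to the cluster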

Spark also includes its own graph computation library, GraphX. GraphX lets users view the same data as both graphs and collections. Users can also transform and join graphs using Resilient Distributed Datasets (RDDs); fault tolerance is discussed in its own section below.

Fault Tolerance

As for fault tolerance, MapReduce and Spark solve the problem in two different ways. MapReduce uses TaskTracker nodes that send a heartbeat to the JobTracker node. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations to another TaskTracker node. This approach is effective at providing fault tolerance, but it can greatly prolong the completion time of some operations, even when there is only a single failure.

Spark uses Resilient Distributed Datasets (RDDs), fault-tolerant collections of elements that can be operated on in parallel. An RDD can reference a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including the local file system and the file systems listed earlier.
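
For illustration, a brief sketch (assuming a spark-shell session; the paths are hypothetical) of creating RDDs from different storage sources in the same way:

// From the local file system
val local = sc.textFile("file:///tmp/data.txt")
// From HDFS
val hdfs  = sc.textFile("hdfs:///data/input.txt")
// From an in-memory Scala collection, handy for testing
val small = sc.parallelize(Seq(1, 2, 3, 4, 5))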

An RDD has five main properties:

    • A list of partitions
    • A function for computing each partition
    • A list of dependencies on other RDDs
    • Optionally, a partitioner for key-value RDDs (for example, saying the RDD is hash-partitioned)
    • Optionally, a list of preferred locations for computing each partition (such as the block locations of an HDFS file)
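
Several of these properties can be inspected directly on an RDD. A minimal sketch, assuming a spark-shell session:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.partitions.length                          // the partition list (here, its size)
pairs.partitioner                                // None: no partitioner assigned yet

val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner                               // Some(HashPartitioner): set for this key-value RDD
hashed.preferredLocations(hashed.partitions(0))  // preferred locations for one partition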

An RDD can be persisted in order to cache the dataset in memory. This makes future operations much faster, up to 10 times. Spark's cache is fault tolerant: if any partition of an RDD is lost, it is automatically recomputed using the original transformations.

Scalability

By definition, both MapReduce and Spark scale out using HDFS. So how big can a Hadoop cluster get?

Yahoo is said to run a 42,000-node Hadoop cluster, so you could say there is practically no limit to its scale. The largest known Spark cluster is 8,000 nodes, but as big data grows, cluster sizes are expected to grow as well in order to keep meeting throughput expectations.

Security

Hadoop supports Kerberos authentication, which is cumbersome to manage. However, third-party vendors let enterprise organizations use Active Directory Kerberos and LDAP for authentication. Third-party vendors also provide encryption for data in transit and data at rest.

The Hadoop Distributed File System supports access control lists (ACLs) and the traditional file permission model. Hadoop also provides service-level authorization for controlling which users may submit jobs, ensuring that clients have the correct permissions.

Spark's security is thinner: it currently supports only authentication via a shared secret (password authentication). The security benefit of Spark is that if you run it on HDFS, it can use HDFS ACLs and file-level permissions. In addition, Spark can run on YARN, so it can use Kerberos authentication.

Summary

Spark and MapReduce have a symbiotic relationship. Hadoop provides features that Spark does not, such as a distributed file system, while Spark provides real-time, in-memory processing for the datasets that need it. The perfect big data scenario is exactly what the designers originally envisioned: Hadoop and Spark working together on the same team.

Then read this article: Link

In 2009, a team at UC Berkeley started the Apache Spark project, aiming to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce, but extends it with a data-sharing abstraction called Resilient Distributed Datasets (RDDs).

The versatility of Spark has several important benefits.

First, applications are easier to develop because they use a unified API.

Second, it is more efficient to combine processing tasks: whereas earlier systems had to write data out to storage to pass it to another engine, Spark can run different functions over the same data, often in memory.

Finally, Spark enables new applications (such as interactive queries on a stream and streaming machine learning) that earlier systems could not support. Since its release in 2010, Spark has grown into the most active open source project for big data processing, with more than 1,000 contributors. The project is used in more than 1,000 organizations, from technology companies to banking, retail, biotechnology, and astronomy.

The key programming abstraction in Spark is the RDD, a fault-tolerant collection of objects that can be processed in parallel across a cluster. Users create RDDs by applying "transformations" (such as map, filter, and groupBy) to their data.

lines = spark.textFile("hdfs://...")
errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())

Spark evaluates RDDs lazily, which lets it find an efficient plan for the user's operations. Each transformation returns a new RDD object representing the result of the computation, without computing it immediately. When an action is invoked, Spark looks at the whole graph of transformations to build an execution plan. For example, if there were several filter or map operations in a row, Spark can fuse them into one pass over the data, or, if it knows the data is already partitioned, it can avoid moving it over the network for a groupBy. Users can therefore build programs out of small, modular pieces without losing performance.
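
A small sketch of this lazy behavior (assuming a spark-shell session and a hypothetical input file): the chained map and filter below only build up the transformation graph; nothing runs until the action at the end.

val lines  = sc.textFile("hdfs:///data/events.txt")   // hypothetical path
val parsed = lines.map(_.toLowerCase)                 // transformation: returns a new RDD, no work yet
val hits   = parsed.filter(_.contains("error"))       // transformation: still no work
// Only now does Spark look at the whole graph, fuse the map and
// filter into a single pass over the data, and run the job.
val n = hits.count()                                   // action: triggers execution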

Finally, RDDs provide explicit support for sharing data between computations. By default, RDDs are "ephemeral": they are recomputed each time an action (for example, count) uses them. However, users can also persist selected RDDs in memory for rapid reuse. (If the data does not fit in memory, Spark spills it to disk.) For example, a user searching a large set of log data in HDFS to debug an error could load just the error messages into memory across the cluster by calling:

errors.persist()

The user can then run different queries on that in-memory data:

// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()
// Fetch the time fields of errors mentioning PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP")).map(line => line.split('\t')(3)).collect()

Fault Tolerance

In addition to providing data sharing and a variety of parallel operations, RDDs also recover automatically from failures. Traditionally, distributed computing systems have provided fault tolerance through data replication or checkpointing. Spark uses a different approach, called "lineage": each RDD tracks the graph of transformations used to build it, and reruns those operations on the base data to reconstruct any lost partitions.

Consider the RDDs in our earlier query, where we obtained the time fields of the errors by applying two filters and a map. If any partition of an RDD is lost (for example, because a node holding an in-memory partition fails), Spark rebuilds it by reapplying the filter to the corresponding block of the HDFS file. For "shuffle" operations that send data from all nodes to all other nodes (such as reduceByKey), senders keep their output data locally in case a receiver fails.

Lineage-based recovery is significantly more efficient than replication for data-intensive workloads. It saves time because writing to RAM is faster than replicating data over the network, and recovery is usually much faster than simply rerunning the program, because a failed node typically holds multiple RDD partitions, which can be rebuilt in parallel on other nodes.

Another more complicated example:

Consider the implementation of logistic regression in Spark. It uses batch gradient descent, a simple iterative algorithm that repeatedly evaluates a gradient function over the data as a parallel sum. Spark makes it easy to load the data into RAM once and then run many such sums over it, so it runs far faster than a traditional MapReduce implementation. For example, in a 100 GB job, MapReduce takes 110 seconds per iteration because every iteration loads the data from disk, while Spark takes only one second per iteration after the first load.
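
A minimal sketch of this pattern (not the exact code from the paper; it assumes a spark-shell session, a single feature for simplicity, and a hypothetical "points.txt" file with one "x y" pair per line, where y is +1 or -1):

import scala.math.exp

case class Point(x: Double, y: Double)
val points = sc.textFile("points.txt")
  .map { line => val t = line.split(" "); Point(t(0).toDouble, t(1).toDouble) }
  .persist()                                  // load once, keep in memory across iterations

var w = 0.0                                   // model weight (one feature for simplicity)
for (_ <- 1 to 10) {
  // each iteration is one parallel map + reduce over the cached RDD
  val gradient = points.map { p =>
    (1.0 / (1.0 + exp(-p.y * w * p.x)) - 1.0) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}
println(s"Final weight: $w")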

Integration with Storage Systems

Much like Google's MapReduce, Spark is designed to work with persistent storage in multiple external systems. Spark is most often used with cluster file systems such as HDFS and with key-value stores such as S3 and Cassandra. It can also connect to Apache Hive as a data catalog. RDDs usually hold only temporary data within an application, although some applications, such as the Spark SQL JDBC server, also share RDDs across multiple users. Being a storage-system-agnostic engine lets users easily operate on existing data and connect to a wide variety of data sources.

Libraries

One technique that has not yet been implemented in Spark SQL is indexing, although other libraries built on Spark (such as IndexedRDD) do use it.

Spark Streaming. Spark Streaming implements incremental stream processing using a model called "discretized streams." To stream over Spark, the input data is divided into small batches (for example, every 200 milliseconds) that are regularly combined with state stored in RDDs to produce new results. Running stream computations this way has several advantages over traditional distributed streaming systems: for example, failure recovery is cheaper thanks to lineage, and streams can be combined with batch processing and interactive queries.
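
A minimal streaming word-count sketch along these lines (assuming a spark-shell-like environment with sc available and a hypothetical socket source on localhost:9999; one-second batches are used here instead of 200 ms for readability):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))           // each 1 s micro-batch becomes an RDD
val lines = ssc.socketTextStream("localhost", 9999)      // hypothetical text source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                     // per-batch word counts
counts.print()

ssc.start()
ssc.awaitTermination()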

GraphX. GraphX provides a graph computation interface similar to Pregel and GraphLab, and implements the same placement optimizations as those systems (such as vertex partitioning schemes) through the choice of partitioning function for the RDDs it builds on.

MLlib. MLlib, Spark's machine learning library, implements more than 50 common algorithms for distributed model training. For example, it includes the common distributed algorithms for decision trees (PLANET), latent Dirichlet allocation, and alternating least squares matrix factorization.

Spark's libraries all operate on RDDs as their data abstraction, which makes them easy to combine in one application. For example, consider a program that uses Spark SQL to read some historical Twitter data, uses MLlib to train a K-means clustering model on it, and then applies the model to a new stream of tweets. The data objects returned by each library (here, the RDD of historical tweets and the K-means model) are easily passed on to the other libraries.
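
A rough sketch of that kind of composition (not the paper's exact program; it assumes a SparkSession, a hypothetical "tweets.json" file with latitude/longitude fields, and MLlib's RDD-based K-means API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val spark = SparkSession.builder().appName("compose-sketch").getOrCreate()

// Spark SQL: load historical tweets and pull out the fields we need.
val points = spark.read.json("tweets.json")
  .select("latitude", "longitude")
  .rdd
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))
  .cache()

// MLlib: train a K-means model on the historical data.
val model = KMeans.train(points, 10, 20)    // 10 clusters, 20 iterations

// The same model object could now be applied to a live tweet stream
// with Spark Streaming (omitted here).
model.clusterCenters.foreach(println)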

Performance

We compared the performance of Spark with other engines on three simple tasks (a SQL query, streaming word count, and alternating least squares matrix factorization). Although results vary with the workload, Spark is generally comparable to specialized systems such as Storm, GraphLab, and Impala. For stream processing, although the comparison shown is against a distributed implementation on Storm, per-node throughput is also comparable to that of commercial streaming engines such as Oracle CEP.

Interactive Queries

Interactive use of Spark falls into three main categories. First, organizations often use Spark SQL for relational queries through business intelligence tools such as Tableau; examples include eBay and Baidu. Second, developers and data scientists can use Spark's Scala, Python, and R interfaces interactively through a shell or a visual notebook environment. This kind of interactive use is critical for asking more advanced questions and for designing the models that eventually become production applications, and it is common across all deployments. Third, some vendors have built domain-specific interactive applications that run on Spark; examples include Tresata (anti-money laundering), Trifacta (data cleansing), and Pantera.

Use of Spark Components

We see that many components are widely used, with Spark Core and SQL being the most popular. Streaming is used in 46% of organizations and machine learning in 54%. Although not shown directly in Figure 9, most organizations use multiple components: 88% use at least two, 60% use at least three (such as Spark Core plus two libraries), and 27% use at least four.

Deployment Environment

Although the first Spark deployments were typically in Hadoop environments, in a July 2015 Spark survey only 40% of deployments ran on the Hadoop YARN cluster manager. In addition, 52% of respondents run Spark on a public cloud.

Model Capabilities

MapReduce is inefficient at sharing data across computation steps, because it relies on replicated external storage systems to do so.

RDDs build on MapReduce's ability to emulate any distributed computation, but do so more efficiently. Their main limitation is added latency due to synchronization at each communication step, but this cost is negligible compared with the gains.

A typical Hadoop cluster might have the following features:

Local storage. Each node has local memory with roughly 50 GB/s of bandwidth, and 10 to 20 local disks with roughly 1 GB/s to 2 GB/s of aggregate disk bandwidth.

Links. Each node has a 10 Gbps (1.3 GB/s) link, about 40 times less than its memory bandwidth and about 2 times less than its aggregate disk bandwidth.

Racks. Nodes are organized into racks of 20 to 40 machines, with 40 Gbps to 80 Gbps of bandwidth out of each rack, that is, 2 to 5 times lower than the network bandwidth within a rack.

Given these properties, the most important performance concern in many applications is the placement of data and computation in the network. Fortunately, RDDs provide the facilities to control this placement: they let applications place computation near input data (via an API for "preferred locations" on input sources), and they give control over data partitioning and co-location (for example, specifying that data should be hash-partitioned by a given key).
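
For example (a sketch assuming a spark-shell session and hypothetical comma-separated input files), hash-partitioning a large pair RDD by key and persisting it lets a later join avoid reshuffling that dataset across the network:

import org.apache.spark.HashPartitioner

val userData = sc.textFile("hdfs:///users.txt")           // hypothetical "id,name" lines
  .map { l => val f = l.split(","); (f(0), f(1)) }
  .partitionBy(new HashPartitioner(100))                   // control partitioning by key
  .persist()

val events = sc.textFile("hdfs:///events.txt")             // hypothetical "id,event" lines
  .map { l => val f = l.split(","); (f(0), f(1)) }

// The join runs alongside userData's existing partitions, so only
// the (smaller) events RDD is shuffled across the network.
val joined = userData.join(events)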

Beyond network and I/O bandwidth, the most common bottleneck tends to be CPU time, especially when the data is in memory. In this case, Spark can run, on each node, the same algorithms and libraries used in specialized systems: for example, columnar storage and processing in Spark SQL, and native BLAS libraries in MLlib. As discussed earlier, the only area where RDDs clearly add cost is network latency.

Spark implements a barrier at the shuffle phase, so reduce tasks do not start until all the map tasks have completed. This avoids some of the complexity that would otherwise be needed for failure recovery. Although removing some of these features would speed the system up, fault tolerance is kept on by default in Spark so that applications can handle failures.

Conclusion

Scalable data processing is essential for the next generation of computer applications, but it has typically meant stitching together different computing systems. To simplify this task, the Spark project introduced a unified programming model and engine for big data applications. Practice has shown that such a model can efficiently support today's workloads and brings substantial benefits to users.

Then read this article (I would rank them: this article > the first > the second):

http://blog.csdn.net/archleaner/article/details/50988258

The current Hadoop ecosystem mainly includes:

    1. HDFS - Hadoop Distributed File System. A distributed, block-oriented, non-updatable (an HDFS file can only be written once; after it is closed it cannot be modified), highly scalable file system that can run on ordinary disks in a cluster. HDFS is also a standalone tool: it can run independently of the other components in the Hadoop ecosystem (although making HDFS highly available additionally requires ZooKeeper and a journal manager, but that is another matter).
    2. MapReduce framework - a basic distributed computing framework that runs on a set of standard hardware in a cluster. We do not have to use it with HDFS, because the file system is pluggable; likewise, we do not have to use it with YARN, because the resource manager is pluggable and can, for example, be replaced with Mesos.
    3. YARN - the default resource manager in a Hadoop cluster. But we do not have to use YARN in the cluster: we can run our MR (Map/Reduce) jobs on Mesos instead, or run only HBase, which does not need YARN at all.
    4. Hive - an SQL-like query engine built on the MapReduce framework that translates HiveQL statements into a series of MapReduce jobs running in the cluster. Also, HDFS is not the only storage system it can use, and it does not have to use the MapReduce framework; for example, it can run on Tez instead.
    5. HBase - an HDFS-based key-value store that gives Hadoop online transaction processing (OLTP) capabilities. HBase depends only on HDFS and ZooKeeper; but does it really depend on HDFS? No: besides HDFS, HBase can run on Tachyon (a memory-based file system), MapR-FS, IBM GPFS, and some other frameworks.

You might also think of Storm, which can process data streams, but it is completely independent of Hadoop and can run on its own. You might also think of Mahout, the machine learning framework that runs on MapReduce, but it has been getting less and less attention from the community, as shown by the trend graph of Mahout's reported issues (red) versus resolved issues (green).

Now let's talk about Spark, which consists mainly of the following:

    1. Spark Core - the engine for general distributed data processing. It does not depend on any other component and can run on any commodity server cluster.
    2. Spark SQL - SQL queries running on Spark, supporting a range of SQL functions and HiveQL. It is not yet very mature, so it should not be used in production systems; HiveQL support also pulls in the required Hive metadata and Hive-related JAR packages.
    3. Spark Streaming - a micro-batch processing engine built on Spark that supports importing from a wide variety of data sources. The only thing it depends on is the Spark Core engine.
    4. MLlib - a machine learning library built on top of Spark that supports a range of data mining algorithms.

Note: I have reservations about the following paragraph:

Moreover, let's address an important myth about Spark: "Spark is an in-memory technology." It is not; Spark is a pipelined execution engine that writes data to disk during shuffles (for example, when we aggregate on a field) and spills to disk when memory runs short (the amount of memory used can be tuned). So Spark is faster than MapReduce mainly because it pipelines execution, not because of "in-memory optimization," as some people say. Of course, Spark does cache data in memory to improve performance, but that is not the main reason it is fast.
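
As an illustration of that point (a sketch, assuming a spark-shell session and a hypothetical input file), the narrow transformations below are pipelined into a single pass over the data, while the aggregation forms a shuffle boundary whose intermediate output is written to local disk regardless of available memory:

val words = sc.textFile("hdfs:///corpus.txt")      // hypothetical path
  .flatMap(_.split(" "))                            // pipelined with the read: no materialization
  .map(w => (w, 1))                                 // still the same pass over the data
// reduceByKey is a shuffle boundary: map-side output is written to
// local disk before being fetched by the reducers.
val counts = words.reduceByKey(_ + _)
counts.take(10).foreach(println)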

Now, let's complete the comparison:

1. Can MapReduce be replaced by Spark Core? Yes, it will be replaced over time, and the replacement is justified. But Spark is not yet mature enough to replace MapReduce completely. Besides, nobody will abandon MapReduce entirely until every tool that depends on it has an alternative; for example, a script you run on Pig today may still need extra work before it can run on Spark.

(Note: Pig is a data-flow language for processing huge datasets quickly and easily, promoted by Yahoo!, though it is now in decline. Pig handles HDFS and HBase data very easily and, like Hive, can do what it needs to do very efficiently, saving a lot of labor and time through direct Pig queries. When you want to do some transformations on your data and do not want to write MapReduce jobs, Pig is a good fit.)

2. Can Hive be replaced by Spark SQL? Yes, right again. But we need to understand that Spark SQL is still very young relative to Spark itself, only about a year and a half old. Compared with the much more mature Hive it is still a toy; I will check back on Spark SQL in a year and a half to two years. If we remember, three years ago Impala was supposed to finish off Hive, yet today the two technologies still coexist and Impala has not killed Hive. The same will hold for Spark SQL.

3. Can Storm be replaced by Spark Streaming? Yes, it can. To be fair, though, Storm is not part of the Hadoop ecosystem; it is a completely separate tool. Their computational models are not very similar, so I do not think Storm will disappear; rather, it will live on as a commercial product.

4. Can Mahout be replaced by MLlib? To be fair, Mahout has already lost the market, and it has been losing it rapidly over the past few years. For this tool, we can say that this is one place where Spark can truly replace part of the Hadoop ecosystem. (Note: Agreed! Spark's ML libraries are very useful and well worth learning!)

So, overall, the conclusion of this article is:

1. Don't be fooled by big data vendors' packaging; they are pushing the market, not the final truth. Hadoop was designed from the start as an extensible framework, and many of its parts are replaceable: you can replace HDFS with Tachyon (now renamed Alluxio), replace YARN with Mesos, replace MapReduce with Tez, and run Hive on top of Tez. Is that still a variant of the Hadoop technology stack, or a complete replacement for it? If we give up MapReduce and use Tez, is it still Hadoop?

2. Spark does not give us a complete technology stack. It lets us integrate its capabilities into our Hadoop cluster and benefit from them, without abandoning our old cluster setup entirely.

3. Spark is not mature enough yet. I think that in three to four years we will no longer call it the "Hadoop stack" but rather the "big data stack" or something similar, because in a big data stack we have a broad range of open source products to choose from and combine into our own technology stack.
