Comparison of the core components of Hadoop and Spark

I. The core components of Hadoop

The components of Hadoop are shown in the figure below; the core components are MapReduce and HDFS.


1. The architecture of HDFS
We first introduce the architecture of HDFS, which uses a master/slave model: an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode acts as the master server; it manages the file system namespace and client access to files, while the DataNodes manage the data stored on their nodes. HDFS lets users store data in the form of files. Internally, a file is split into data blocks, and those blocks are stored on a set of DataNodes. The NameNode performs namespace operations on the file system, such as opening, closing, and renaming files or directories, and it maintains the mapping from data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients and carry out block creation, deletion, and replication under the direction of the NameNode.
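To make that division of labor concrete, here is a minimal sketch of an HDFS client using the Hadoop FileSystem API (Scala is used for all code examples in this article; the NameNode address and file path are hypothetical). The client contacts the NameNode for metadata and namespace operations, while block data is streamed to and from DataNodes behind the scenes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumption: the cluster's NameNode is reachable at this (made-up) address.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020")
    val fs = FileSystem.get(conf)

    // Writing a file: HDFS splits the stream into blocks and places them
    // on DataNodes chosen by the NameNode.
    val out = fs.create(new Path("/tmp/hello.txt"))
    out.writeUTF("hello hdfs")
    out.close()

    // Namespace operations (status, rename, delete) are served by the NameNode.
    println(fs.getFileStatus(new Path("/tmp/hello.txt")).getLen)

    fs.close()
  }
}
```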

Both the NameNode and the DataNodes are designed to run on commodity hardware, typically machines running a GNU/Linux operating system. HDFS is written in Java, so any machine that supports Java can run a NameNode or a DataNode. A typical deployment runs a single NameNode instance on one machine in the cluster and one DataNode instance on each of the other machines; running multiple DataNode instances on one machine is possible but uncommon. Having a single NameNode greatly simplifies the architecture of the system. The NameNode is the manager of all HDFS metadata, and user data never flows through the NameNode.

2. MapReduce

Next, the architecture of MapReduce. MapReduce is a parallel programming model that allows software developers to easily write distributed parallel programs. In the Hadoop architecture, MapReduce is an easy-to-use software framework that distributes tasks across a cluster of up to thousands of commodity machines and processes large datasets in parallel in a highly fault-tolerant manner, providing Hadoop's parallel task processing. The MapReduce framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node. The master node schedules all the tasks that make up a job and distributes them across the slave nodes; it monitors their execution and re-executes failed tasks, while the slave nodes are responsible only for the tasks assigned to them by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.
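As an illustration, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce API (in Scala, to keep one language across the examples in this article). The input and output HDFS paths come from the command line; the class and object names are only illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map phase: split each input line into words and emit (word, 1) pairs.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts collected for each word.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The framework schedules the map and reduce tasks of this job
    // across the cluster's slave nodes.
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCount.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // output directory on HDFS
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```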

As the introduction above shows, HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS provides a distributed file system on the cluster, while MapReduce provides distributed computation and task processing on the cluster. HDFS supplies file storage and access during MapReduce task processing, and MapReduce distributes, tracks, and executes tasks on top of HDFS and collects the results; the two interact with each other to accomplish the main work of a Hadoop distributed cluster.

The other components of Hadoop are described in the separate article on Hadoop core components.
II. Overview of Spark's core components

First, the core components of Spark: Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, as shown in the figure above.


1. Spark SQL

Since the Spark 1.0 release, Spark SQL has been a main conduit for bringing data into the Spark platform. Early users valued its support for reading data from existing Apache Hive tables and from the popular Parquet columnar storage format. Support for other formats was added later (such as the widely used JSON format), making it even easier to bring data from different sources onto the Spark platform. Moreover, the rich set of optimizations behind the API means that filtering and column pruning can in many cases be pushed down to the data source, greatly reducing the amount of data that has to be processed and significantly improving Spark's efficiency. For a more detailed treatment, see the separate article on understanding Spark SQL.
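A minimal sketch of that usage follows, assuming a local SparkSession and made-up file paths and column names: Parquet and JSON sources are read through the same API, columns are pruned with select, and the filter can be pushed down to the source.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")          // assumption: run locally for the example
      .getOrCreate()

    // Reading a columnar Parquet source; selecting two columns lets Spark SQL
    // prune the rest, and the filter may be pushed down to the data source.
    val users = spark.read.parquet("/data/users.parquet")   // hypothetical path
    users.select("name", "age").filter("age > 30").show()

    // The same API reads JSON, and the result can be queried with SQL.
    val events = spark.read.json("/data/events.json")       // hypothetical path
    events.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()

    spark.stop()
  }
}
```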

2. Spark Streaming

Spark Streaming builds on Spark Core to provide scalable, high-throughput, fault-tolerant processing of real-time data streams. It supports a great many data sources (including Kafka, Flume, HDFS, S3, and others), and results can be stored in HDFS, databases, and other sinks. Its principle is to decompose a streaming computation into a series of short batch jobs: with Spark as the batch engine, the input data of Spark Streaming is divided into segments according to the batch interval, each segment is converted into an RDD, transformations on DStreams are translated into transformations on the underlying RDDs, and the intermediate results of those RDD operations are kept in memory. Taken as a whole, Spark Streaming provides an efficient, fault-tolerant, real-time, large-scale streaming framework. To learn more, see the introduction to the Spark Streaming real-time computing framework and the Spark Streaming example analysis.
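The sketch below shows that batch-of-RDDs model with a DStream word count; the socket source and port are hypothetical, and in practice the source would more likely be Kafka, Flume, or HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Each 5-second batch of input becomes one RDD inside the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)          // DStream operation, applied per batch as RDD operations
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```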

3. Spark MLlib

MLlib is Spark's library of commonly used machine learning algorithms, together with the related tests and data generators. It covers classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. The classification and regression algorithms include SVM, logistic regression, linear regression, naive Bayes, and decision trees; collaborative filtering is implemented with alternating least squares (ALS); clustering includes k-means, a streaming version of k-means, Gaussian mixture models, PIC (Power Iteration Clustering), and LDA (Latent Dirichlet Allocation); dimensionality reduction provides SVD (Singular Value Decomposition) and PCA (Principal Component Analysis); and frequent pattern mining is implemented with FP-Growth. The machine learning library will no doubt become more and more complete as time goes on; see the separate article on Spark MLlib.
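As one small example from that list, here is a sketch of k-means clustering with the DataFrame-based ML API on a tiny in-memory dataset; the column names and data points are made up for illustration.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two obvious clusters in a toy two-dimensional dataset.
    val df = Seq((1.0, 1.1), (1.2, 0.9), (9.0, 9.1), (8.8, 9.3)).toDF("x", "y")
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
      .transform(df)

    // Fit a 2-cluster k-means model and print the learned centers.
    val model = new KMeans().setK(2).setSeed(42L).fit(features)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```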

4. GraphX

GraphX is Spark's component for graph computation, and interest in it has grown alongside research in AI and machine learning.
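A minimal sketch of the GraphX API, with hypothetical vertices and edges, building a small graph and running PageRank on it:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // A tiny "follows" graph: vertex IDs with names, plus directed edges.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph    = Graph(vertices, edges)

    // Run PageRank until it converges within a tolerance of 0.001.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)

    sc.stop()
  }
}
```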


The overall architecture of Spark is shown in the following figure:


The Driver is the user-written data processing logic, which contains the SparkContext created by the user. The SparkContext is the main interface between the user's logic and the Spark cluster; it interacts with the Cluster Manager to request compute resources and so on. The Cluster Manager is responsible for cluster resource management and scheduling (Standalone, Apache Mesos, and Hadoop YARN are supported). Worker nodes are the nodes in the cluster that can run compute tasks. An Executor is a process started on a worker node for a specific application; it runs tasks and holds data in memory or on disk. A Task is the unit of computation that is sent to an Executor. Each application has its own set of Executors, and the computation is ultimately carried out in the Executors on the compute nodes:
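The sketch below shows the driver side of that picture under stated assumptions: creating the SparkSession (and its SparkContext) is what connects the user logic to the Cluster Manager. The master URL and resource settings are hypothetical; a standalone URL is shown, but "yarn", a Mesos URL, or "local[*]" for testing would work the same way.

```scala
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("driver-sketch")
      .master("spark://master-host:7077")       // hypothetical standalone master; use "local[*]" to test
      .config("spark.executor.memory", "2g")    // resources the Cluster Manager allocates per Executor
      .config("spark.executor.cores", "2")
      .getOrCreate()
    val sc = spark.sparkContext

    // The action below is broken into tasks that run inside the Executors
    // on the worker nodes; the result is returned to the Driver.
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(total)

    spark.stop()
  }
}
```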

A user program goes through the following stages from submission to execution (a small example that exercises these stages follows the list):

1) When the user program creates a SparkContext, the newly created SparkContext instance connects to the Cluster Manager. The Cluster Manager allocates compute resources for this submission based on the CPU and memory settings supplied by the user at submission time, and starts the Executors.

2) The Driver divides the user program into different execution stages. Each stage consists of a set of identical tasks, each of which operates on a different partition of the data to be processed. Once stage partitioning is complete and the tasks have been created, the Driver sends the tasks to the Executors.

3) When an Executor receives a task, it downloads the task's runtime dependencies, prepares the task's execution environment, starts executing the task, and reports the task's status back to the Driver.

4) The Driver handles the status updates it receives for each task. Tasks fall into two types: a shuffle map task performs the data shuffle and saves the shuffle output to the file system of the Executor's node, while a result task is responsible for producing the result data.

5) The Driver keeps scheduling tasks and sending them to the Executors for execution, and stops when all tasks have completed successfully or when a task still fails after the retry limit has been reached.
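To tie the steps above together, here is a sketch of a two-stage job on a made-up dataset: reduceByKey forces a shuffle, so the first stage runs shuffle map tasks whose output is written to the Executors' local storage, and the final stage runs result tasks that produce the data returned by collect().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StagesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stages-sketch").setMaster("local[*]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    // Stage 1: map + shuffle write (shuffle map tasks).
    val pairs = words.map(w => (w, 1))
    // Stage 2: shuffle read + aggregation (result tasks, triggered by collect()).
    val counts = pairs.reduceByKey(_ + _).collect()
    counts.foreach(println)

    sc.stop()
  }
}
```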



