Hadoop Learning Notes (1): Concepts and Overall Architecture


    • Introduction and history of Hadoop
    • Hadoop architecture
    • Master and slave nodes
    • The problems of data analysis and the ideas behind Hadoop

For work reasons, I need to learn Hadoop in depth, so I am taking notes as I go.

  What is Hadoop?

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, released under the Apache License 2.0. It supports applications running on large clusters built from commodity hardware. Hadoop is based on Google's published papers on MapReduce and the Google File System (GFS).

  The Hadoop framework transparently provides reliability and data movement for applications. It implements a programming paradigm called MapReduce: an application is split into many small pieces of work, each of which can be executed or re-executed on any node in the cluster. In addition, Hadoop provides a distributed file system that stores data on all compute nodes, giving the whole cluster very high aggregate bandwidth. The combined design of MapReduce and the distributed file system lets the framework handle node failures automatically. It enables applications to work with thousands of independent computers and petabytes of data.
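The MapReduce paradigm described above can be pictured with a tiny word-count sketch in pure Python. This is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the in-memory "splits" are illustrative assumptions that stand in for input splits stored on different nodes.

```python
from collections import defaultdict

# Map phase: each input split is processed independently and could run
# on any node in the cluster; a mapper emits (word, 1) pairs.
def map_phase(split):
    return [(word, 1) for line in split for word in line.split()]

# Shuffle: group intermediate pairs by key, as the framework does
# between the map and reduce phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values for one key into a final result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

# Two "splits" of an input file, as if stored on different nodes.
splits = [["hadoop stores data", "hadoop computes data"],
          ["data moves to code"]]
intermediate = [pair for s in splits for pair in map_phase(s)]
counts = reduce_phase(shuffle(intermediate))
```

Because each call to `map_phase` depends only on its own split, a failed piece can simply be re-executed on another node, which is exactly what makes the framework fault-tolerant.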


  Hadoop history

Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005, as part of Nutch, a sub-project of Lucene. It was inspired by MapReduce and the Google File System (GFS), first developed at Google Labs.

In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were folded into a project called Hadoop. Hadoop is the most popular tool for categorizing search keyword content on the Internet, but it can also address many problems that require extreme scalability. For example, what would happen if you had to grep a 10 TB file? On a traditional system this would take a very long time. Hadoop, however, was designed with such problems in mind, and its parallel execution model can greatly improve efficiency.
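The "grep a huge file" idea can be sketched as divide and conquer: split the input into chunks that could each be scanned on the node holding them, then merge the partial matches. A minimal simulation, with `chunk_grep`, `parallel_grep`, and the sample data all being illustrative assumptions rather than any Hadoop tool:

```python
# Map step: each chunk is scanned for the pattern independently.
def chunk_grep(chunk, pattern):
    return [line for line in chunk if pattern in line]

# On a real cluster each chunk_grep call would run on the node that
# stores that chunk; here we iterate sequentially and then concatenate
# the partial results (the reduce step).
def parallel_grep(chunks, pattern):
    matches = []
    for chunk in chunks:
        matches.extend(chunk_grep(chunk, pattern))
    return matches

chunks = [["error: disk full", "ok"], ["warn", "error: timeout"]]
found = parallel_grep(chunks, "error")
```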

    • Hadoop Common: in 0.20 and earlier releases this included HDFS, MapReduce, and other shared project code; starting with 0.21, HDFS and MapReduce were split out into separate sub-projects, and the remainder became Hadoop Common.
    • HDFS: the Hadoop Distributed File System.
    • MapReduce: a parallel computing framework; before 0.20 it used the legacy org.apache.hadoop.mapred interface, and the 0.20 release introduced the new org.apache.hadoop.mapreduce API.
    • Apache HBase: a distributed NoSQL column-oriented database, similar to Google's BigTable.
    • Apache Hive: a data warehouse built on top of Hadoop that lets users summarize, query, and analyze data through HiveQL, an SQL-like language. Hive was originally contributed by Facebook.
    • Apache Mahout: a machine learning algorithm library.
    • Apache Sqoop: a data transfer tool between structured data stores (such as relational databases) and Apache Hadoop.
    • Apache ZooKeeper: a distributed coordination and lock service offering features similar to Google's Chubby; it was originally contributed by Yahoo!.
    • Apache Avro: a new data serialization format and transfer tool that is gradually replacing Hadoop's original IPC mechanism.

  Hadoop Platform Sub-projects

  It is now widely accepted that the entire Apache Hadoop "platform" includes the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and related projects such as Apache Hive and Apache HBase. The core design of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over it.

At the bottom layer sits the Hadoop core code, which implements the two most central functions, MapReduce and HDFS: the two pillars of Hadoop. Because Hadoop is written in Java, and to make life easier for programmers unfamiliar with Java, there is Pig, a lightweight language; users can write data analysis and processing in Pig, and the system automatically converts it into MapReduce programs.

Then there is Hive, which is important: it maps traditional SQL onto MapReduce, for the benefit of traditional database engineers. However, not all of SQL is supported. There is also a sub-project called HBase, a non-relational (NoSQL) database; data is stored by column, which improves response speed and reduces the amount of I/O, and it can be deployed as a distributed cluster.
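The SQL-to-MapReduce mapping that Hive performs can be pictured with a toy example: a query like `SELECT dept, COUNT(*) FROM emp GROUP BY dept` becomes a map that emits the grouping key with a count of 1, and a reduce that sums the counts. A pure-Python sketch of that idea, not HiveQL and not the real Hive query planner; the `emp` rows are made up:

```python
from collections import defaultdict

# Rows of a toy "emp" table: (name, dept).
rows = [("ann", "eng"), ("bob", "eng"), ("cal", "ops")]

# Map: emit the GROUP BY key with a count of 1 per row.
mapped = [(dept, 1) for _name, dept in rows]

# Shuffle + reduce: group by key and apply COUNT(*) as a sum.
by_dept = defaultdict(int)
for dept, one in mapped:
    by_dept[dept] += one
```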

ZooKeeper is responsible for coordination between server nodes and processes; it is a coordination tool. Because almost every sub-project in Hadoop has an animal logo, this coordination software is named after the zoo keeper.

  Hadoop architecture

Picture two server racks in which each unit is a physical machine. The physical nodes are connected by network cables to a switch, and clients access the cluster over the Internet. On each physical machine, several Hadoop background processes run.



  Namenode

The NameNode, or name node, is the HDFS daemon (a core process) that controls the entire distributed file system. It records all the metadata about how data is distributed: how files are divided into data blocks and which nodes store each block, and it centrally manages memory and I/O. A user first accesses the NameNode, obtains the file-distribution state from this master node, finds out which data nodes hold the file's blocks, and then talks to those nodes directly to retrieve the file. It is therefore a core node.

  However, it is a single point of failure: if it goes down, the whole cluster is unusable.
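The metadata the NameNode keeps can be pictured as two lookup tables: file to block list, and block to the DataNodes holding its replicas. A toy in-memory sketch, with the tiny block size, round-robin placement, and all names being illustrative assumptions (the real HDFS default block size is 64 or 128 MB and placement is replica-aware):

```python
# Toy NameNode metadata tables.
BLOCK_SIZE = 4  # bytes, tiny for the example

block_map = {}        # file name -> ordered list of block ids
block_locations = {}  # block id -> list of DataNodes holding a replica

def add_file(name, size, datanodes):
    # Split the file into fixed-size blocks (last block may be partial).
    n_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = [f"{name}_blk{i}" for i in range(n_blocks)]
    block_map[name] = blocks
    for i, blk in enumerate(blocks):
        # Round-robin placement stands in for HDFS's real placement policy.
        block_locations[blk] = [datanodes[i % len(datanodes)]]

add_file("log.txt", 10, ["dn1", "dn2", "dn3"])
```

A client reading `log.txt` would consult these tables first, then fetch each block from the listed DataNode, which is exactly the two-step access pattern described above.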

  Secondary Namenode


Hadoop has some poorly named modules, and the Secondary NameNode is one of them. From its name it sounds like a backup of the NameNode; some people call it the "second name node", as if it were a standby ready to take over... but that is not exactly what it is.

It is better understood as an auxiliary name node, or checkpoint node: an auxiliary daemon that monitors the state of HDFS and keeps a copy of the NameNode's metadata. Each cluster has one; it communicates with the NameNode and periodically saves snapshots of the HDFS metadata. When the NameNode fails it can serve as a standby NameNode, although it cannot take over automatically. But its functionality is not limited to this; being a standby is not its main job. A detailed explanation follows.
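The checkpointing just mentioned can be sketched as merging the NameNode's edit log into the metadata snapshot (the fsimage) so the log stays short. The dict-based `fsimage` and `edits` structures below are illustrative assumptions, not the real on-disk formats:

```python
# The last saved snapshot: file path -> block count.
fsimage = {"/a.txt": 3}
# Operations logged since that snapshot.
edits = [("add", "/b.txt", 2), ("delete", "/a.txt", None)]

def checkpoint(fsimage, edits):
    # Replay the edit log on top of the old snapshot.
    merged = dict(fsimage)
    for op, path, blocks in edits:
        if op == "add":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    # Return the new snapshot and an emptied edit log.
    return merged, []

fsimage, edits = checkpoint(fsimage, edits)
```

Doing this merge off the NameNode is the Secondary NameNode's real job; the snapshot it holds is merely also usable for manual recovery.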




  Datanode

The DataNodes, or data nodes, run on each slave server node and are responsible for reading and writing HDFS data blocks to the local file system. Together, these three components form one of the two pillars of the Hadoop platform: the HDFS system.

Now look at the other pillar, MapReduce, which has two background processes.



  Jobtracker

The JobTracker is a very important process that runs on the master node (alongside the NameNode); it is the scheduler of the MapReduce system. This daemon processes jobs (user-submitted code), determines which files are involved, then splits the job into small tasks and assigns them to the child nodes that hold the required data.

Hadoop's principle is to run computation near the data: data and program sit on the same physical node, so wherever the data is, that is where the program runs. The JobTracker does this scheduling, monitors tasks, and restarts failed tasks (on different nodes). Each cluster has only one JobTracker, a single point like the NameNode, located on the master node (master and slave nodes are explained later).
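The "run the code where the data is" rule can be sketched as a scheduling choice: for each task, prefer a free node that already holds a replica of the task's input block. A minimal sketch, where `block_hosts`, the node names, and the fallback-to-any-free-node policy are illustrative assumptions rather than the real JobTracker algorithm:

```python
# Which nodes hold a replica of each input block.
block_hosts = {"blk1": ["nodeA", "nodeB"], "blk2": ["nodeC"]}

def schedule(task_blocks, free_nodes):
    assignment = {}
    for blk in task_blocks:
        # Prefer a free node holding a replica (a data-local task);
        # otherwise fall back to any free node and ship the data.
        local = [n for n in block_hosts[blk] if n in free_nodes]
        assignment[blk] = local[0] if local else free_nodes[0]
    return assignment

plan = schedule(["blk1", "blk2"], ["nodeB", "nodeC"])
```

Data-local assignments avoid moving large blocks across the network, which is why this heuristic matters so much at cluster scale.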




  Tasktracker

The TaskTracker is the last background process of the MapReduce system. It runs on every slave node, together with the DataNode (following the code-near-data principle), and manages the tasks (assigned by the JobTracker) on its own node. There is only one TaskTracker per node, but one TaskTracker can launch multiple JVMs to run map or reduce tasks in parallel. It communicates with the JobTracker and informs it when a task completes.

  Master and slave

Master node: a node that runs the NameNode, the Secondary NameNode, or the JobTracker. It may also run a browser (for viewing the administration interface) and other Hadoop tools. There need not be only one master machine!

Slave node: a machine that runs the TaskTracker and the DataNode.

  The problems faced by data analysts and the thinking of Hadoop

At present, the data we need to handle grows ever larger, and there are performance bottlenecks in both storage and querying. Users' applications and analysis results are trending toward integration, with ever higher demands on real-time behavior and response times. The models in use are becoming more and more complex, and the amount of computation grows exponentially.

People therefore want a technology or tool that removes these performance bottlenecks, is unlikely to hit new bottlenecks in the foreseeable future, and keeps the learning cost as low as possible so that existing skills (such as SQL and R) transfer smoothly, while minimizing the cost of switching platforms: hardware and software costs, redevelopment costs, retraining costs, and maintenance costs.

Hadoop can solve these problems: divide and conquer, turning the large and complex into the small and simple.

