Get a little bit every day: Hadoop overview


First, the Origin of Hadoop

The idea behind Hadoop traces back to a big problem Google faced in search: how to search an enormous number of web pages as fast as possible. To solve it, Google relied on the inverted-index technique and applied the MapReduce idea to compute PageRank. Through continuous evolution, Google arrived at three key technologies and ideas: GFS, MapReduce, and BigTable. Google, however, did not open-source these technologies. Meanwhile, a developer (Doug Cutting) had built Lucene, a full-text search framework that provides the architecture of a full-text search engine, including the query engine and the indexing engine. Faced with big data, Lucene ran into the same difficulties Google had. This led Lucene's author to imitate Google's solutions in Nutch, a sub-project under the Lucene project. A few years later, Google published some details of GFS and MapReduce; the author implemented these ideas, and that work was formally contributed to the Apache Foundation as Hadoop, while Nutch remained a sub-project of Lucene.

Second, what problems does Hadoop solve?

As it has evolved, Hadoop has come to solve several problems:

1. Timely analysis and processing of massive data.

2. Deep analysis and mining of massive data.

3. Long-term preservation of data.

4. Enabling cloud computing.

5. Running on thousands of nodes, with ever-growing data volumes processed and ever-shorter sort times.

Third, the basic architecture of Hadoop.

3.1 The basic composition of the Hadoop framework.

HBase: a NoSQL database with key-value, column-oriented storage; it speeds up responses for data analysis and maximizes memory utilization.

HDFS: the Hadoop Distributed File System; maximizes disk utilization.

MapReduce: a programming model used primarily for data analysis; maximizes CPU utilization.

Pig: a converter from Pig Latin scripts to MapReduce jobs.

Hive: a converter from SQL (HiveQL) to MapReduce jobs.

ZooKeeper: coordination and communication between server nodes and processes.

Chukwa: data collection and integration.
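As a rough illustration of the MapReduce programming model listed above, here is a minimal, in-memory word-count sketch in Python. All function names are illustrative, and the real framework distributes each phase across many nodes; this only shows the map → shuffle → reduce flow:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analysis", "big data"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'big': 3, 'data': 2, 'analysis': 1}
```

The same three-phase structure underlies a real Hadoop job; only the scale and the distribution across nodes differ.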

3.2 Hadoop Framework cluster architecture

NameNode: the HDFS daemon that records how files are split into data blocks and which nodes store each block. It centrally manages memory and I/O. It is a single point of failure: if it goes down, the cluster becomes unavailable.

Secondary NameNode: a helper daemon that monitors the state of HDFS; each cluster has one. It communicates with the NameNode to save snapshots of HDFS metadata, which can be used to recover when the NameNode fails.

DataNode: runs on each slave server and is responsible for reading and writing HDFS data blocks on the local file system.

JobTracker: a daemon that handles user-submitted code, determines which files are involved in processing, then splits the work into tasks and assigns them to nodes. It monitors tasks and restarts failed ones. There is only one JobTracker per cluster, located on the master node.
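To make the NameNode's bookkeeping concrete, here is a minimal sketch of how a file is split into blocks and each block is mapped to DataNodes. The node names, the round-robin placement, and the helper functions are all assumptions for illustration; real HDFS uses rack-aware placement:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default: 64 MB blocks
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Simplified round-robin placement of block replicas on DataNodes."""
    mapping = {}
    nodes = itertools.cycle(datanodes)
    for block_id in range(num_blocks):
        mapping[block_id] = [next(nodes) for _ in range(replication)]
    return mapping

datanodes = ["dn1", "dn2", "dn3", "dn4"]
num_blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
mapping = place_blocks(num_blocks, datanodes)
print(num_blocks)   # 4
print(mapping[0])   # ['dn1', 'dn2', 'dn3']
```

The `mapping` dictionary plays the role of the NameNode's block map: losing it means losing track of where every block lives, which is why the NameNode is a single point of failure.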

Fourth, Summary.

The advent of Hadoop solved big-data analysis and mining for us while greatly reducing cost: there is no need to buy an extremely powerful server, because any ordinary PC can be attached to a Hadoop cluster as a node and contribute to analyzing and mining big data. Hadoop also solves big-data storage, so we no longer need to worry about the disk I/O bottleneck that big data imposes.

You are welcome to discuss and exchange ideas: QQ: 747861092

QQ group: 163354117 (group name: codeforfuture)

