An Introduction to Hadoop: Overview of the Hadoop Framework
I. Hadoop history
The idea of Hadoop grew out of a big problem Google faced while building its search engine: how to search such an enormous number of web pages as quickly as possible. To solve it, Google invented the inverted-index algorithm and used the MapReduce idea to calculate PageRank. Through continuous evolution, Google produced three key technologies and ideas: GFS, MapReduce, and Bigtable. Google did not open-source these technologies, so others imitated it and built a framework offering Google-like full-text search: Lucene. Lucene provides a full-text search engine architecture, including complete query and indexing engines. When facing big data, however, Lucene ran into the same difficulties as Google, so Lucene's author imitated the solutions Google had described and started a sub-project named Nutch under the Lucene project. A few years later, Google published some details about GFS and the MapReduce idea, and on this basis the author created Hadoop. Hadoop was then officially brought into the Apache Software Foundation, originally as part of the Nutch sub-project.
II. What problems does Hadoop solve?
With the passage of time, Hadoop has solved several problems step by step:
1. Timely analysis and processing of massive data.
2. In-depth analysis and mining of massive data.
3. Long-term data storage.
4. Implement cloud computing.
5. It can run on thousands of nodes, greatly shortening the time needed to sort large volumes of data.
III. Hadoop basic architecture.
3.1 Basic components of the Hadoop framework.
HBase: a NoSQL database with key-value, column-oriented storage; it improves the response speed of data analysis and maximizes memory utilization.
HDFS: the Hadoop Distributed File System, which maximizes disk utilization.
MapReduce: a programming model used mainly for data analysis, maximizing CPU utilization (a word-count sketch follows this list).
Pig: a converter from user scripts (Pig Latin) to MapReduce jobs.
Hive: SQL-to-MapReduce converter.
ZooKeeper: coordination and communication between server nodes and processes.
Chukwa: data collection and integration.
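To make the MapReduce programming model concrete, the following is a minimal word-count sketch written against the classic Hadoop MapReduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line input/output paths are illustrative assumptions for this article, not anything mandated by Hadoop.

// A minimal word-count sketch using the Hadoop MapReduce Java API.
// Class names and input/output paths are illustrative assumptions.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits a (word, 1) pair for every token, Hadoop shuffles and groups the pairs by key, and the reducer sums the counts for each word; the same reducer class is also used as a combiner so that partial sums are computed locally before the shuffle.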
3.2 Hadoop framework cluster architecture
NameNode: the HDFS daemon that records how files are divided into data blocks and which nodes store those blocks; it centrally manages memory and I/O. It is a single point of failure: if it goes down, the cluster crashes.
Secondary NameNode: an auxiliary daemon that monitors HDFS status; each cluster has one. It communicates with the NameNode to save snapshots of the HDFS metadata. When the NameNode fails, it can be used as a backup NameNode.
DataNode: each slave server runs this daemon, which reads and writes HDFS data blocks to the local file system (a client sketch follows this list).
JobTracker: the daemon that processes the code submitted by users. It decides which files are involved in processing, splits the work into tasks, and assigns them to nodes; it also monitors tasks and restarts failed ones. Each cluster has only one JobTracker, located on the master node.
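To show how a client program talks to the NameNode and DataNodes described above, here is a minimal sketch that writes and then reads a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode-host:9000) and the file path (/user/demo/hello.txt) are assumptions made for illustration; in a real cluster the address is normally taken from core-site.xml.

// A minimal HDFS client sketch using the Hadoop FileSystem API.
// The NameNode address and file path below are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally configured in core-site.xml
        // (fs.defaultFS, or fs.default.name in older releases).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client obtains block placement from the NameNode,
        // then streams the bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for the block locations,
        // then fetches the block data directly from DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}

On the write path the client asks the NameNode where to place the blocks and then streams the bytes to DataNodes; on the read path it asks the NameNode for the block locations and fetches the block data directly from the DataNodes.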
IV. Summary.
The emergence of Hadoop solved the problem of big data analysis and mining, and it greatly reduced the cost. We do not have to buy especially powerful servers; as long as a machine is an ordinary PC, we can mount it as a Hadoop node and let it contribute to analyzing and mining big data. Hadoop also solves the big data storage problem, so we no longer have to worry about disk I/O becoming the bottleneck for big data.