Hadoop Origin and System Overview
Hadoop's origin: Lucene
Lucene is open-source software developed by Doug Cutting. Written in Java, it implements full-text search functions similar to Google's, providing a full-text search engine architecture that includes a complete query engine and index engine.
It was released on Cutting's personal website and on SourceForge, and in 2001 became a subproject of the Apache Software Foundation's Jakarta project.
Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.
In big-data scenarios, Lucene faced the same difficulties as Google, forcing Doug Cutting to study and imitate Google's solutions to these problems [it could be said that Doug built an open-source version of Google's stack].
Lucene has a small-scale derivative: Nutch [a complete web search engine built on top of Lucene].
From Lucene to Nutch, and from Nutch to Hadoop.
In 2003-2004, Google published some details of its GFS and MapReduce designs. Building on these, Doug Cutting and others spent two years implementing DFS and MapReduce mechanisms, dramatically improving Nutch's performance.
Later, Yahoo! recruited Doug Cutting and took in his project.
Hadoop was officially introduced to the Apache Foundation in the fall of 2005, as part of Nutch, a Lucene subproject. In early 2006, MapReduce and the Nutch Distributed File System (NDFS) were split out into a separate project named Hadoop.
Hadoop's name comes from Doug Cutting's son's toy elephant.
Hadoop's current standing
1. Hadoop has become the de facto open-source standard for cloud computing.
2. It contains dozens of thriving subprojects (such as HBase, Hive, and Pig).
3. It can run on clusters of thousands of nodes, repeatedly breaking world records for data processing and sorting time.
Hadoop subproject family
MapReduce and HDFS: open-source implementations of Google's MapReduce and GFS; together they form the two pillars of Hadoop technology.
HBase: an open-source implementation of BigTable; a NoSQL [non-relational] database that stores data by column [unlike traditional row-oriented databases such as Oracle]. It is mainly used for data analysis, where column-oriented storage improves response speed and reduces I/O volume.
Pig: a lightweight scripting language; its shell-like commands are translated into MapReduce programs, which it drives on the user's behalf before returning the results [in effect, a translator sitting between the user and MapReduce].
Hive: effectively a translator from the SQL language to MapReduce programs, aimed at database engineers. [Hive can be thought of as a relational database, but its functionality is relatively simple: it supports only part of the SQL-92 standard.]
Chukwa: roughly equivalent to a data integration tool.
ZooKeeper: mainly used for coordination and communication among server node processes.
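To make the map-then-reduce idea behind the first subproject concrete, here is a minimal in-memory sketch of a MapReduce word count in plain Python. This is purely illustrative, not the Hadoop API: map emits (key, value) pairs, a shuffle step groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Emit (word, 1) for every word, like a WordCount mapper.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, like a WordCount reducer.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big clusters"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In real Hadoop, the map and reduce phases run on many nodes and the shuffle moves data across the network; the structure of the computation, however, is the same.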
Hadoop Architecture
Rack: a physical grouping of cluster nodes.
1. namenode [name node]
A) The HDFS daemon
B) Records how files are divided into data blocks and which nodes store those blocks
C) Centralizes management of memory and I/O
D) A single point of failure: if it goes down, the cluster goes down
2. secondary namenode [secondary name node]
A) An auxiliary daemon that monitors the status of HDFS
B) Each cluster has one
C) Communicates with the namenode and periodically saves snapshots of HDFS metadata
D) When the namenode fails, it can serve as a backup namenode
3. datanode
A) Runs on each slave server
B) Reads and writes HDFS data blocks to the local file system
HDFS = namenode + secondary namenode + datanodes
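The namenode's bookkeeping described above can be sketched as a toy simulation: a file is split into fixed-size blocks, and each block is mapped to the datanodes holding its replicas. The 64 MB block size (the classic HDFS default), the replication factor of 3, and the round-robin placement are simplifying assumptions; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3                 # default number of replicas per block

def split_into_blocks(file_size):
    # Number of fixed-size blocks needed to cover the file.
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_blocks(file_size, datanodes):
    # Toy namenode metadata: block index -> list of datanodes holding
    # a replica, assigned round-robin (real HDFS is rack-aware).
    placement = {}
    for block in range(split_into_blocks(file_size)):
        placement[block] = [datanodes[(block + i) % len(datanodes)]
                            for i in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
meta = place_blocks(200 * 1024 * 1024, nodes)   # a 200 MB file -> 4 blocks
print(meta)
```

Note that the namenode stores only this kind of metadata; the block contents themselves live on the datanodes.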
4. jobtracker [job tracker, on the master server]
A) A daemon that processes jobs (user-submitted code)
B) Decides which files are involved in processing, splits the job into tasks, and assigns them to nodes
C) Monitors tasks and restarts failed tasks (on different nodes)
D) Each cluster has only one jobtracker, located on the master node
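The jobtracker's scheduling idea in B) can be sketched as follows. This is an illustrative toy, not real Hadoop code: one map task per input block, assigned preferentially to a free node that already stores a replica of that block ("move the computation to the data"); the sample block locations and node names are hypothetical.

```python
def assign_tasks(block_locations, free_nodes):
    # block_locations: block id -> nodes holding a replica (hypothetical data)
    # free_nodes: nodes that currently have a free task slot
    assignments = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in free_nodes]
        # Prefer a data-local node; otherwise fall back to any free node.
        assignments[block] = local[0] if local else free_nodes[0]
    return assignments

locations = {0: ["dn1", "dn2"], 1: ["dn3", "dn4"], 2: ["dn5", "dn6"]}
free = ["dn2", "dn3", "dn7"]
print(assign_tasks(locations, free))  # {0: 'dn2', 1: 'dn3', 2: 'dn2'}
```

Blocks 0 and 1 land on nodes that hold their data; block 2 has no free replica holder, so it falls back to a non-local node and its data must travel over the network.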
5. tasktracker
A) Located on slave nodes, co-located with the datanode (the principle that code follows data)
B) Manages the tasks on its node (assigned by the jobtracker)
C) Each node has only one tasktracker, but a tasktracker can start multiple JVMs to run map or reduce tasks in parallel
D) Interacts with the jobtracker [reports its own status to the jobtracker]
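Point C) above, a single tasktracker running several tasks at once, can be sketched with a thread pool. Real Hadoop launches separate child JVMs per task slot; here two worker threads stand in for two slots, and the trivial "map task" simply counts words in its input chunk.

```python
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    # A trivial "map task" over one input chunk: count its words.
    return len(chunk.split())

chunks = ["one two three", "four five", "six"]
with ThreadPoolExecutor(max_workers=2) as pool:   # 2 parallel task "slots"
    counts = list(pool.map(map_task, chunks))
print(counts)  # [3, 2, 1]
```

With three chunks but only two slots, the third task waits for a slot to free up, which is exactly how a tasktracker queues tasks beyond its slot count.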
6. master and slave
A) Master: runs the namenode, secondary namenode, and jobtracker; a browser (for viewing the management interface) and other Hadoop tools can also be used here
B) Slave: runs the tasktracker and datanode
C) The master is not necessarily a single machine: generally, the namenode and datanode are placed on the same server, and the secondary namenode on a separate server
Hadoop ideas
Analysis methods in the Hadoop system
1) Mainstream: Java programs
2) Lightweight scripting: Pig
3) A transition path for existing SQL skills: Hive
4) NoSQL: HBase