First, Hadoop Project Overview
1. What is Hadoop
Hadoop is a distributed storage and computing platform for big data.
Author: Doug Cutting, who also created Lucene and Nutch.
Inspired by three Google papers (on GFS, MapReduce, and Bigtable).
2. Hadoop core projects
HDFS (Hadoop Distributed File System): a distributed file system
MapReduce: parallel computing framework
3. Hadoop Architecture
3.1 HDFS architecture
(1) Master-slave structure
• Master node, only one: the NameNode
• Slave nodes, there are many: the DataNodes
(2) The NameNode is responsible for management:
• Receives user requests to operate on the file system (there are two common access modes, the command-line shell and the Java API; a small Java API sketch follows this section).
• Maintains the file system's directory structure (the namespace used to organize files).
• Manages the mapping between files and blocks (a file is split into blocks; the NameNode records which file each block belongs to and the order of the blocks, much like the ordered clips of a movie), as well as the mapping between blocks and DataNodes.
(3) The DataNode is responsible for storage:
• Stores the file data.
• Files are split into blocks and stored on disk (the default block size is 64 MB, but a block only occupies as much disk space as the data it actually contains). Splitting big data into relatively small blocks makes full use of the disk space and keeps the data easy to manage.
• To keep data safe, each block is replicated (much like cutting spare keys) and the copies are stored on different DataNodes.
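As a concrete illustration of the command-line and Java API access modes mentioned above, here is a minimal, hedged sketch that writes a file into HDFS with an explicit replication factor and then asks the NameNode which DataNodes hold its blocks. The NameNode address hdfs://namenode:9000 and the path /demo/hello.txt are made-up examples, and the code assumes the classic Hadoop FileSystem API.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; on a real cluster this comes from core-site.xml.
    conf.set("fs.default.name", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/demo/hello.txt");

    // Create the file with 3 replicas; the NameNode decides which DataNodes store them.
    FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, fs.getDefaultBlockSize());
    out.write("hello hadoop\n".getBytes("UTF-8"));
    out.close();

    // Ask the NameNode which DataNodes hold the blocks of the file.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block hosts: " + Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}
```

The same operations are available from the command line, for example hadoop fs -put to copy a file in and hadoop fsck to inspect its blocks and replicas.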
3.2 MapReduce architecture
(1) Master-slave structure
• Master node, only one: the JobTracker
• Slave nodes, there are many: the TaskTrackers
(2) The JobTracker is responsible for:
• Receives computing jobs submitted by clients
• Assigns tasks to the TaskTrackers
• Monitors the execution of tasks on the TaskTrackers
(3) The TaskTrackers are responsible for:
• Executes the tasks assigned by the JobTracker
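To make this division of labor concrete, below is a minimal sketch of the classic WordCount job using the org.apache.hadoop.mapreduce API: the client builds the job and submits it to the cluster, where (in Hadoop 1.x) the JobTracker splits it into map and reduce tasks and hands them to TaskTrackers. It is a trimmed-down version of the standard example that ships with Hadoop.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts of each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: the client side that configures and submits the job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In practice the class is packaged into a jar and launched with the hadoop jar command, passing HDFS input and output paths as arguments.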
4. Hadoop features
(1) Scalable: reliable storage and processing of petabyte-scale (PB) data.
(2) Economical (low cost): data can be distributed and processed on clusters of ordinary servers, and these server farms can grow to thousands of nodes.
(3) Efficient: by distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.
(4) Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys compute tasks when a task fails.
5. Hadoop cluster physical distribution
As shown in Figure 1.
Figure 1 Hadoop cluster physical distribution
Here is a cluster consisting of two racks. There are two colors in the figure, green and yellow, and it is easy to see that yellow is the master: the NameNode and the JobTracker each occupy one server, and each of them is unique. Green is the slaves (Slave), of which there are many. The JobTracker, NameNode, DataNode, and TaskTracker mentioned above are essentially Java processes; these processes call one another to carry out their respective functions. The master node and the slave nodes generally run in different Java virtual machines, so the communication between them is cross-JVM communication.
The cluster is deployed on servers, and a server is essentially just physical hardware. Whether a server is a master node or a slave node depends on the role or process running on it: a server running Tomcat is a web server and one running a database is a database server; likewise, a server running the NameNode or the JobTracker is a master node, while one running a DataNode or a TaskTracker is a slave node.
To achieve high-speed communication, the nodes are generally connected over a local area network, which may use Gigabit Ethernet, high-performance switches, optical fiber, and so on.
6. Hadoop cluster single-node physical structure
Figure 2 Hadoop cluster single-node physical structure
Second, the Hadoop Ecosystem
1. Hadoop ecosystem overview
Hadoop is a software framework for the distributed processing of large amounts of data, with reliable, efficient, and scalable characteristics. The core of Hadoop is HDFS and MapReduce; Hadoop 2.0 also includes YARN. The picture below shows the Hadoop ecosystem:
Figure 3 The Hadoop ecosystem
2. HDFS (Hadoop Distributed File System)
HDFS comes from Google's GFS paper, published in October 2003; HDFS is a GFS clone. It is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system that detects and responds to hardware failures and is designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data through streaming data access and simplifies the file consistency model, which makes it suitable for applications with large data sets.
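The streaming data access mentioned above is visible in the client read path: the client asks the NameNode where the blocks of a file live and then streams the bytes directly from the DataNodes. Here is a minimal, hedged sketch reading back the file written in the earlier HDFS example (again assuming the hypothetical hdfs://namenode:9000 address).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical NameNode address

    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for block locations; the bytes themselves are then
    // streamed from the DataNodes that hold the blocks.
    FSDataInputStream in = fs.open(new Path("/demo/hello.txt"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}
```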
Figure 4
Client: splits files; accesses HDFS; interacts with the NameNode to obtain file location information; interacts with DataNodes to read and write data.
NameNode: the master node; there is only one in Hadoop 1.x. It manages the HDFS namespace and the block mapping information, configures the replication policy, and handles client requests.
DataNode: a slave node that stores the actual data and reports the stored blocks to the NameNode.
Secondary NameNode: assists the NameNode and shares part of its workload. It periodically merges the fsimage and fsedits files and pushes the result back to the NameNode, and it can help recover the NameNode in an emergency; however, the Secondary NameNode is not a hot standby for the NameNode.

3. MapReduce (distributed computing framework)
MapReduce comes from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google MapReduce. MapReduce is a distributed computing model for processing large amounts of data. Map performs a specified operation on each individual element of the data set and produces intermediate results as key-value pairs. Reduce combines all the "values" that share the same "key" in the intermediate results to obtain the final result. This division of functions is ideal for data processing in a distributed parallel environment made up of a large number of computers.
JobTracker: the master node; there is only one. It manages all jobs, monitors jobs and tasks, handles errors, and so on; it decomposes a job into a series of tasks and assigns them to TaskTrackers.
TaskTracker: a slave node that runs Map Tasks and Reduce Tasks and interacts with the JobTracker to report task status.
Map Task: parses each data record, passes it to the user-written map() function, and writes the output to the local disk (or directly into HDFS for a map-only job).
Reduce Task: takes the results of the Map Tasks, reads its input data remotely, sorts the data, and passes it to the user-written reduce() function.
MapReduce processing flow, with WordCount as an example (see the WordCount sketch in the MapReduce architecture section above).

4. Hive (Hadoop-based data warehouse)
Hive was open-sourced by Facebook and was originally used for statistics over massive amounts of structured log data. Hive defines a SQL-like query language (HQL) that translates SQL into MapReduce tasks executed on Hadoop. It is usually used for offline analysis.

5. HBase (distributed column-oriented database)
HBase comes from Google's Bigtable paper, published in November 2006; HBase is a clone of Google Bigtable. HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data. Unlike a traditional relational database, HBase uses Bigtable's data model: an enhanced sparse sorted map (key/value) in which the key is made up of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data, and the data stored in HBase can be processed with MapReduce, which combines data storage and parallel computing well.
Data model: Schema -> Table -> Column Family -> Column -> RowKey -> TimeStamp -> Value
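To make the data model above concrete, here is a hedged sketch using the classic HBase Java client (HTable-style) API. The table name "user", the column family "info", and the column "name" are made up for illustration, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // The HBase client locates the cluster through ZooKeeper (see item 6 below).
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical ZooKeeper hosts

    // Assumes a table named "user" with a column family "info" already exists.
    HTable table = new HTable(conf, "user");

    // Put: row key "row1", column family "info", column "name", value "alice".
    // HBase attaches the timestamp itself, completing the
    // RowKey -> Column Family -> Column -> TimeStamp -> Value chain.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
    table.put(put);

    // Get: random, real-time read of the same cell.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println("name = " + Bytes.toString(value));

    table.close();
  }
}
```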
6. ZooKeeper (distributed coordination service)
ZooKeeper comes from Google's Chubby paper, published in November 2006; ZooKeeper is a Chubby clone. It solves data management problems in distributed environments: unified naming, state synchronization, cluster management, configuration synchronization, and so on.

7. Sqoop (data synchronization tool)
Sqoop is an abbreviation of SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. The import and export of data is essentially a MapReduce program, which takes full advantage of MapReduce's parallelism and fault tolerance.

8. Pig (Hadoop-based data flow system)
Pig was open-sourced by Yahoo!. Its design motivation was to provide a MapReduce-based ad-hoc data analysis tool (computed at query time). It defines a data flow language, Pig Latin, and converts scripts into MapReduce tasks executed on Hadoop. It is usually used for offline analysis.

9. Mahout (data mining algorithm library)
Mahout originated in 2008, initially as a sub-project of Apache Lucene. It developed rapidly in a very short time and is now a top-level Apache project. Mahout's main goal is to create scalable implementations of classic machine learning algorithms, intended to help developers create intelligent applications more quickly and easily. Mahout now includes widely used data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. In addition to the algorithms, Mahout also includes data input/output tools and data mining support integrated with other storage systems such as databases, MongoDB, or Cassandra.

10. Flume (log collection tool)
Flume is a log collection system open-sourced by Cloudera; it is distributed, highly reliable, highly fault tolerant, and easy to customize and extend. Flume abstracts the path along which data is generated, transmitted, processed, and finally written to its destination into a data stream, in which the data source supports custom data senders, so that data from various protocols can be collected. At the same time, Flume provides simple processing of the data in the stream, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets. In short, Flume is a scalable, massive-scale log collection system suitable for complex environments.

Third, using Eclipse to view the Hadoop source code
The Hadoop source code is located in the src directory of the Hadoop distribution; import it into Eclipse and add the required jar packages (from the Ant lib directory, the Hadoop directory, and the Hadoop lib directory).
See: http://pan.baidu.com/s/1eQCcdcm
Note: This article is excerpted from http://blog.csdn.net/woshiwanxin102213/article/details/19688393