Hadoop: an open-source implementation of Google's cloud computing technologies
Hadoop is a distributed computing framework maintained by the Apache open-source community. It can run applications on clusters built from a large number of inexpensive hardware devices, and it provides a set of stable, reliable interfaces for building distributed systems with high reliability and good scalability.
The core of Hadoop is HDFS, MapReduce, and HBase, which correspond to Google's GFS, MapReduce, and Bigtable respectively.
Hadoop consists of the following sub-projects:
1) Hadoop Common: the original Hadoop Core, which is the core of the entire Hadoop project
2) Avro: Hadoop's RPC solution
3) Chukwa: a data collection system for managing large-scale distributed systems
4) HBase: a distributed database that supports structured data storage. It is an open-source implementation of Bigtable.
5) HDFS: a high-throughput distributed file system. It is an open-source implementation of GFS.
6) Hive: a data warehouse that provides data summarization and query functions
7) MapReduce: a distributed processing model for large data sets
8) Pig: a high-level data-flow language built on MapReduce
9) ZooKeeper: used to solve consistency problems in distributed systems. It is an open-source implementation of Chubby.
In addition to being open source, Hadoop has many advantages:
1) Scalable
2) Economical
3) Reliable
4) Efficient
Distributed File System (HDFS)
Design prerequisites and objectives:
1) Hardware failure handling. Failures are detected and recovered from quickly and automatically.
2) Streaming data access. Applications mainly read data as a stream for batch processing, so high throughput matters more than low latency.
3) Very large data sets. HDFS supports large files; a single HDFS instance can hold tens of millions of files and scale to hundreds of nodes in a cluster.
4) Simple consistency model. Applications generally write a file once and read it many times.
5) Moving computation is cheaper than moving data
6) portability between heterogeneous software and hardware platforms
Architecture:
HDFS uses a master-slave architecture. An HDFS cluster consists of one NameNode and many DataNodes.
The NameNode manages the file system metadata, and the DataNodes store the actual data. A client contacts the NameNode to obtain a file's metadata, while the actual file I/O is performed directly against the DataNodes.
HDFS data is written once and read many times. The typical block size is 64 MB.
The client obtains from the NameNode the list of locations of the data blocks that make up a file, that is, which DataNodes hold each block, and then reads the file directly from those DataNodes. The NameNode does not take part in the file transfer.
The NameNode uses a transaction log (the EditLog) to record changes to HDFS metadata and stores the file system namespace in an image file (the FsImage).
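The read path described above can be illustrated with a small in-memory model. This is only a sketch: ToyNameNode, ToyDataNode, read_file, and block_count are hypothetical names invented for the illustration and are not part of any Hadoop API. The 64 MB figure is the typical block size mentioned above.

```python
# Toy model of the HDFS read path. All names here are illustrative
# assumptions, not real Hadoop classes.
BLOCK_SIZE = 64 * 1024 * 1024  # typical block size: 64 MB


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def add_file(self, path, locations):
        self.block_map[path] = locations

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block bytes."""

    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Ask the NameNode for block locations, then read from the DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # The NameNode is not involved in the transfer itself.
        data += datanodes[holders[0]].read_block(block_id)
    return data


if __name__ == "__main__":
    nn = ToyNameNode()
    dns = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
    dns["dn1"].blocks["blk_0"] = b"hello "
    dns["dn2"].blocks["blk_1"] = b"world"
    nn.add_file("/demo.txt", [("blk_0", ["dn1"]), ("blk_1", ["dn2"])])
    print(read_file(nn, dns, "/demo.txt"))   # b'hello world'
    print(block_count(200 * 1024 * 1024))    # 4
```

A 200 MB file needs four 64 MB blocks, the last one only partly filled. The sketch always reads the first listed replica; a real client would choose among replicas.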
Reliability Assurance Measures:
1) Redundant replication
2) Replica placement
3) Heartbeat detection. The NameNode periodically receives a heartbeat and a block report from each DataNode. Receiving a heartbeat indicates that the DataNode is working normally.
4) Safe mode: when the system starts, it enters safe mode. No data block writes are performed during this time.
5) Data integrity checking: when an HDFS file is created, a checksum is computed for each data block and saved as a separate hidden file in the namespace. When a client retrieves the file, it verifies the data against the checksum. If the client finds a data block to be corrupt, it fetches a replica of that block from another DataNode.
6) Space reclamation: when a file is deleted, it is first moved to the /trash directory. As long as the file remains in this directory, it can be restored quickly. The time after which the directory is emptied is configurable.
7) Metadata disk failure: the image file and the transaction log are the core data structures of HDFS, so their corruption can make the whole file system unusable.
8) Snapshots: a snapshot stores a copy of the data as of a particular point in time. If HDFS data is corrupted, it can be rolled back to a known good point in time.
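The data integrity check in item 5 can be sketched as follows. This is a minimal illustration, assuming CRC32 as the checksum and an ordered list of replicas; the function names checksum and read_verified are hypothetical and do not come from the Hadoop API.

```python
import zlib


def checksum(block):
    """CRC32 of a block; an assumed stand-in for the stored checksum."""
    return zlib.crc32(block)


def read_verified(replicas, expected):
    """Return the first (datanode, block) whose checksum matches expected.

    replicas is a list of (datanode_name, block_bytes) pairs. A replica
    that fails verification is skipped and the next one is tried.
    """
    for name, block in replicas:
        if checksum(block) == expected:
            return name, block
    raise IOError("all replicas failed checksum verification")


if __name__ == "__main__":
    original = b"block data"
    stored = checksum(original)          # saved when the file was created
    replicas = [
        ("dn1", b"block dat\x00"),       # corrupted copy
        ("dn2", original),               # healthy copy
    ]
    print(read_verified(replicas, stored))  # ('dn2', b'block data')
```

The client never trusts a block it cannot verify: a mismatch on one DataNode simply moves the read to the next replica.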
Performance Improvement Measures:
1) Replica selection
2) Load balancing
3) Client-side caching
4) Pipelined replication
Access interface:
You can use Java APIs or APIs encapsulated in C language.
1) org.apache.hadoop.conf: defines the API for processing configuration files of system parameters.
2) org.apache.hadoop.dfs: implementation of the Hadoop Distributed File System (HDFS) module
3) org.apache.hadoop.fs: defines the abstract file system API
4) org.apache.hadoop.io: defines general I/O APIs for reading and writing data objects such as networks, databases, and files.
5) org.apache.hadoop.ipc: tools for network servers and clients. It encapsulates the basic modules of asynchronous network I/O.
6) org.apache.hadoop.mapred: implementation of the Hadoop distributed computing system (MapReduce) module
7) org.apache.hadoop.metrics: defines an API for performance statistics, mainly used by the mapred and dfs modules.
8) org.apache.hadoop.record: defines the record I/O API classes and a translator for the record description language.
9) org.apache.hadoop.tools: defines some common tools.
10) org.apache.hadoop.util: defines some common utility APIs.
Distributed Data Processing: MapReduce
MapReduce is a distributed computing model and the core of Hadoop.
Logical Model:
The parallel computing process running on a large-scale cluster is abstracted into two functions, map and reduce, that is, mapping and reduction.
In short, MapReduce is the decomposition of a task and the aggregation of its results. Map splits the task into multiple sub-tasks, and reduce aggregates their results to obtain the final result.
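As an illustration of the two functions, here is a minimal single-process word count in plain Python. It does not use Hadoop at all; map_words, reduce_counts, and run_job are hypothetical names chosen for this sketch, and the grouping step in the middle stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            groups[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    print(run_job(["Hello World", "Hello hadoop"]))
    # {'Hello': 2, 'World': 1, 'hadoop': 1}
```

Each call to map_words depends only on its own line, and each call to reduce_counts depends only on its own key, which is what lets a framework run them in parallel on different machines.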
Implementation Mechanism:
1. Distributed parallel computing: the MapReduce framework is driven by the JobTracker and the TaskTrackers. The JobTracker is the master control service, and there is only one instance of it. It is responsible for scheduling and managing the TaskTrackers.
2. Local computing
3. Task granularity: keeping each input split small facilitates data locality. One map task is started for each small data set, and the M map tasks can run concurrently on N computers. The number of reduce tasks can be specified by the user.
4. Combine
5. Partition. After the combine step, the intermediate results can be divided into R parts according to key range.
6. Reading intermediate results
7. Task pipeline
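The combine and partition steps (items 4 and 5) can be sketched in a few lines. This is an illustration under stated assumptions: the intermediate data are (key, count) pairs, the key ranges are given as a sorted list of boundary keys, and the names combine and partition are hypothetical rather than taken from Hadoop.

```python
from bisect import bisect_right
from collections import Counter


def combine(pairs):
    """Combine: pre-aggregate (key, count) pairs on the map side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


def partition(pairs, boundaries):
    """Split pairs into R = len(boundaries) + 1 parts by key range.

    A key goes to part i, where i is the number of boundaries
    that are less than or equal to the key.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, value in pairs:
        parts[bisect_right(boundaries, key)].append((key, value))
    return parts


if __name__ == "__main__":
    map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("kiwi", 1)]
    combined = combine(map_output)
    print(combined)                         # [('apple', 2), ('kiwi', 1), ('pear', 1)]
    print(partition(combined, ["h", "n"]))  # R = 3 parts
    # [[('apple', 2)], [('kiwi', 1)], [('pear', 1)]]
```

Combining first shrinks the data that has to cross the network, and partitioning by key guarantees that every pair with the same key reaches the same reduce task.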
Distributed Data Table: HBase
HBase is a Hadoop-based database project and an open-source implementation of Google's Bigtable, which it closely resembles.
Logical Model:
A table stores a series of data rows. Each row contains a sortable row key, an optional timestamp, and a set of columns, any of which may or may not hold data.
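One way to picture this logical model is as nested maps: row key, then column, then timestamp, then value. The following is a sketch in plain Python, not HBase client code; ToyTable and its methods are hypothetical names, and the column name "info:name" is only an example.

```python
import time


class ToyTable:
    """Rows keyed by a sortable row key; each cell keeps timestamped versions."""

    def __init__(self):
        self.rows = {}  # row_key -> {column -> {timestamp -> value}}

    def put(self, row_key, column, value, timestamp=None):
        """Store a value; the timestamp is optional and defaults to now."""
        ts = timestamp if timestamp is not None else time.time_ns()
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key, column, timestamp=None):
        """Return the newest version, or the newest at or before timestamp."""
        versions = self.rows.get(row_key, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        """Yield (row_key, columns) in sorted row-key order."""
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


if __name__ == "__main__":
    table = ToyTable()
    table.put("row2", "info:name", "bob", timestamp=1)
    table.put("row1", "info:name", "alice", timestamp=1)
    table.put("row1", "info:name", "alicia", timestamp=2)
    print(table.get("row1", "info:name"))               # alicia
    print(table.get("row1", "info:name", timestamp=1))  # alice
    print(table.get("row1", "info:age"))                # None
    print([key for key, _ in table.scan()])             # ['row1', 'row2']
```

Columns without data simply have no entry, so sparse rows cost nothing, and older values stay reachable through their timestamps.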
Hadoop Installation
You can use a virtual machine to simulate a cluster. If SSH is not installed on the machine, install it first.
Install and configure the JDK. Detailed steps are widely available online and are not repeated here.
Download hadoop-0.20.2.tar.gz from the Apache Hadoop website.
Upload it to the target server with FTP. On a virtual machine you can simply copy the file over.
Copy it to the /usr directory under Linux and unpack it:
tar -zxvf hadoop-0.20.2.tar.gz
After unpacking, open the conf/hadoop-env.sh file with vi and set JAVA_HOME:
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/opt/jdk1.6.0_27
# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=
Installation steps:
Hadoop has three operating modes: standalone mode, pseudo-distributed mode, and fully distributed mode.
Standalone mode is used for debugging.
Run the example and view its output:
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
cat output/*
The result is as follows:
[root@linuxserver hadoop-0.20.2]# cat output/*
hadoop 1
Hello 2
World 1