Hadoop is an open source distributed computing platform maintained by the Apache Software Foundation. It supports data-intensive distributed applications and is released under the Apache 2.0 license.
The core of Hadoop consists of the Hadoop Distributed File System (HDFS) and MapReduce (an open source implementation of Google's MapReduce). Together they provide the user with a distributed infrastructure whose low-level system details are transparent to applications.
1. Hadoop implements the MapReduce programming paradigm: an application is divided into many small pieces of work, each of which can be executed or re-executed on any node in the cluster.
2. HDFS stores the data for all of the compute nodes, which gives the whole cluster very high aggregate bandwidth.
3. A Hadoop cluster has a master/slave structure. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode, as the primary server, manages the file system namespace and client access to the file system; the DataNodes manage the data stored on their nodes.
4. The MapReduce framework is composed of a JobTracker running on the master node and a TaskTracker running on each slave node. The master is responsible for scheduling all of the tasks that make up a job and distributing them across the different nodes; it monitors their execution and restarts any tasks that fail, while the slave nodes only execute the tasks assigned by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the work to the slave nodes, schedules the tasks, and monitors their execution by the TaskTrackers.
5. HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS provides a distributed file system on the cluster, while MapReduce provides distributed computing and task processing on top of it. HDFS supplies file storage and file operations during MapReduce jobs, and MapReduce handles the distribution, tracking and execution of tasks on top of HDFS and collects the results; these are the main functions of a Hadoop distributed cluster.
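To make the HDFS side of this concrete, here is a minimal sketch of working with the distributed file system from the shell once a cluster such as the one configured below is running (the paths and file name are placeholders):
hadoop fs -mkdir /user/hadoop/input                # create a directory in HDFS
hadoop fs -put localfile.txt /user/hadoop/input    # copy a local file into HDFS
hadoop fs -ls /user/hadoop/input                   # list the directory contents
hadoop fs -cat /user/hadoop/input/localfile.txt    # read the file back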
The five advantages of Hadoop
High scalability
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across clusters of hundreds of inexpensive servers operating in parallel. Unlike traditional relational database systems, which cannot scale to handle very large amounts of data, Hadoop lets an enterprise run applications over hundreds of nodes holding hundreds of terabytes of data.
Cost effectiveness
Hadoop also provides an extremely cost-effective storage solution for enterprise users. The problem with traditional relational database management systems is that scaling them to massive data volumes is prohibitively expensive for the enterprise. In the past, many companies would decide which data was most valuable, classify and keep only that, and discard the rest, because saving everything would cost too much. This approach may work in the short term, but as data volumes grow it no longer solves the problem.
Hadoop takes a different, scale-out approach: it is designed to economically store all of a company's data for later use. The cost savings are staggering; Hadoop can offer hundreds of terabytes of storage and computing capacity at a cost measured in thousands of dollars rather than the far larger sums traditional systems require.
Greater flexibility
Hadoop makes it easy for businesses to tap new data sources and to analyze different types of data in order to generate value from them. This means businesses can use Hadoop's flexibility to extract valuable business insight from sources such as social media, e-mail, or clickstream data.
In addition, Hadoop has a wide range of applications, such as log processing, recommendation systems, data warehousing, campaign analysis, and fraud detection.
Fast
Hadoop also stores data in a distinctive way: the tools used for data processing usually run on the same servers where the data is located, which makes processing much faster. When working with large volumes of unstructured data, Hadoop can process terabytes of data in minutes and petabyte-scale data in hours, where the same work previously took far longer.
Fault tolerance
One of the key advantages of Hadoop is its fault tolerance. When data is sent to an individual node it is also replicated to other nodes in the cluster, so if that node fails another copy is still available; there is no single point of failure for the data.
Hadoop cluster configuration example: architecture
1 master, 1 backup (standby for the master) and 3 slaves (created as virtual machines).
Node IP addresses:
Rango (master)  192.168.56.1    NameNode
VM1 (backup)    192.168.56.101  SecondaryNameNode
VM2 (slave1)    192.168.56.102  DataNode
VM3 (slave2)    192.168.56.103  DataNode
VM4 (slave3)    192.168.56.104  DataNode
PS: Hadoop is best run under a dedicated user, and that user should be consistent across the whole cluster, i.e. the same user name on every machine.
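For example, a minimal sketch of creating such a user on every machine (the user name hadoop is an assumption carried through the rest of this article):
# run as root on the master, the backup and every slave
useradd hadoop    # create the dedicated hadoop user
passwd hadoop     # give it the same password on every machine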
Master machine configuration files: the masters file specifies the SecondaryNameNode, and the slaves file specifies the nodes on which DataNode and TaskTracker run.
The master machine is configured with the NameNode and JobTracker roles, responsible for managing the distributed data and for decomposing and dispatching tasks; the slave machines are configured with the DataNode and TaskTracker roles, responsible for distributed data storage and task execution.
In a Hadoop cluster configuration, the IP address and host name of every machine in the cluster must be added to the hosts file of every machine, so that the master and all slaves can reach each other by host name as well as by IP. The JDK (Java Development Kit) and Hadoop then need to be installed and configured.
MapReduce: "Breakdown of tasks and aggregation of results". There are two machine roles for performing mapreduce tasks: One is Jobtracker, and the other is Tasktracker,jobtracker is for dispatch, Tasktracker is for performing work. Only one jobtracker in a Hadoop cluster (in master)
The MapReduce framework takes care of the complex issues of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, and abstracts processing into two functions, map and reduce: map decomposes a task into multiple sub-tasks, and reduce aggregates the results of those sub-tasks into the final answer.
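As a rough illustration of the map/reduce split, the sketch below runs a word count with Hadoop Streaming, using two small shell scripts as the map and reduce steps (the streaming jar path is where the Hadoop 1.2.1 tarball normally places it but may differ in your installation, and the input directory is assumed to already exist in HDFS).
Contents of mapper.sh (the map step: emit one word per line):
#!/bin/bash
tr -s ' ' '\n'
Contents of reducer.sh (the reduce step: the framework sorts the words, so counting runs of identical lines gives per-word counts):
#!/bin/bash
uniq -c
Submit the job and read the result:
hadoop jar /usr/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input /user/hadoop/input -output /user/hadoop/output \
    -mapper mapper.sh -reducer reducer.sh \
    -file mapper.sh -file reducer.sh
hadoop fs -cat /user/hadoop/output/part-00000    # view the word counts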
Hadoop configuration example: step-by-step process
1. Network and host configuration: configure the host name on every host.
Hosts: add the host names and corresponding IP addresses of all machines in the cluster to the hosts file of every machine, so that the cluster nodes can communicate and authenticate by host name.
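For the five machines in this example, the /etc/hosts entries on every machine would look roughly like this (the host names follow the node list above and are otherwise placeholders):
192.168.56.1    rango    # master - NameNode / JobTracker
192.168.56.101  vm1      # backup - SecondaryNameNode
192.168.56.102  vm2      # slave1 - DataNode / TaskTracker
192.168.56.103  vm3      # slave2 - DataNode / TaskTracker
192.168.56.104  vm4      # slave3 - DataNode / TaskTracker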
2. Configure SSH password-free login
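A minimal sketch of the key exchange, run as the hadoop user (repeat from every node that needs password-free access to the others; the slave IP is just an example):
ssh-keygen -t rsa -P ""                            # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # allow password-free login to the local machine
chmod 600 ~/.ssh/authorized_keys
scp ~/.ssh/id_rsa.pub hadoop@192.168.56.102:~/     # copy the public key to a slave
# then, on that slave:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys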
3. Java environment installation
Install the JDK on every machine in the cluster (JDK version: jdk1.7.0_45) and configure the environment variables in /etc/profile:
# set Java environment
export JAVA_HOME=/usr/java/jdk1.7.0_45
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
Then run source /etc/profile to make the changes take effect.
4. Hadoop installation and configuration: Hadoop must be installed on every machine (Hadoop version: hadoop-1.2.1).
4.1 Installation: tar zxvf hadoop-1.2.1.tar.gz; mv hadoop-1.2.1 /usr/hadoop;
Assign ownership of the /usr/hadoop folder to the hadoop user.
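For example (assuming the dedicated hadoop user created earlier):
chown -R hadoop:hadoop /usr/hadoop    # give the hadoop user ownership of the installation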
4.2 Hadoop environment variables, also in /etc/profile:
# set Hadoop path
export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Create a "TMP" folder in "/usr/hadoop": mkdir/usr/hadoop/tmp
4.3 Configuring Hadoop
1) Configure hadoop-env.sh:
# set Java environment
export JAVA_HOME=/usr/java/jdk1.7.0_45
2) Configure the core-site.xml file (a sketch of this and the following files is shown after this list).
3) Configure the hdfs-site.xml file.
4) Configure the mapred-site.xml file.
5) Configure the masters file: add the SecondaryNameNode's IP address.
6) Configure the slaves file (on the master host only): add the host names or IP addresses of the DataNode nodes.
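The article does not reproduce the contents of these files, so the following is only a minimal sketch of what they typically contain for the topology above (the host name and tmp path follow the settings used elsewhere in this article; the ports 9000 and 9001 are common choices rather than the defaults listed in the port table at the end, and all values should be adapted to your environment):
core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://rango:9000</value>   <!-- address of the NameNode -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>     <!-- the tmp folder created above -->
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>                   <!-- one replica on each of the three DataNodes -->
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>rango:9001</value>          <!-- address of the JobTracker -->
  </property>
</configuration>
masters (on the master host):
192.168.56.101
slaves (on the master host):
192.168.56.102
192.168.56.103
192.168.56.104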
PS: You can install and configure Hadoop on the master first, and then copy the configured /usr/hadoop folder from the master to the /usr directory of every slave with scp -r /usr/hadoop root@<slave-ip>:/usr/. Then give ownership of the Hadoop folder to the hadoop user on each machine and configure the environment variables there as well.
5 Startup and Validation
5.1 Format the HDFS file system
Run as the hadoop user on the master:
hadoop namenode -format
PS: Format only once; subsequent starts do not need another format, just run start-all.sh.
5.2 Start Hadoop:
Before starting, turn off the firewall on every machine in the cluster, otherwise the DataNodes will start and then shut themselves down:
service iptables stop
Use the following command to start:
start-all.sh
When Hadoop has started successfully, a dfs folder is generated inside the tmp folder on the master, and dfs and mapred folders are generated inside the tmp folder on the slaves.
5.3 Verify Hadoop:
(1) Verification method one: use the jps command.
(2) Verification method two: use hadoop dfsadmin -report.
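With the topology used in this article, a successful start should show roughly the following daemons in the jps output on each machine (process IDs omitted; the exact set depends on your configuration):
master (Rango):        NameNode, JobTracker, Jps
backup (VM1):          SecondaryNameNode, Jps
slaves (VM2 to VM4):   DataNode, TaskTracker, Jps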
6 Web view: visit http://<master-ip>:50030, the JobTracker's administrative web GUI (the NameNode's web GUI is at http://<master-ip>:50070).
Ports used by Hadoop
Default port / configuration setting / description:
8020 - NameNode RPC port
8021 - JobTracker RPC port
50030 - mapred.job.tracker.http.address - JobTracker administrative web GUI (JobTracker HTTP server and port)
50070 - dfs.http.address - NameNode administrative web GUI (NameNode HTTP server and port)
50010 - dfs.datanode.address - DataNode control port; each DataNode listens on this port and registers it with the NameNode on startup
50020 - dfs.datanode.ipc.address - DataNode IPC port (DataNode RPC server and port)
50060 - mapred.task.tracker.http.address - per-TaskTracker web interface (TaskTracker HTTP server and port)
50075 - dfs.datanode.http.address - per-DataNode web interface (DataNode HTTP server and port)
50090 - dfs.secondary.http.address - SecondaryNameNode web interface (SecondaryNameNode HTTP server and port)
Summary