I. Introduction to Hadoop
1. Hadoop was originally created to meet Nutch's need to crawl and store massive amounts of data. HDFS derives from Google's GFS, MapReduce from Google's MapReduce, and HBase from Google's Bigtable; Hadoop was later brought into the Apache Foundation.
2. Hadoop has two core designs: HDFS and MapReduce. HDFS is a distributed storage system that provides highly reliable, highly scalable, high-throughput data storage services; MapReduce is a distributed computing framework with the advantages of easy programming, high fault tolerance, and high scalability.
3. Compared with traditional databases, Hadoop stores larger volumes of data, including semi-structured and unstructured data, and is more oriented toward data mining and predictive analytics; it is also fast, easy to maintain, and inexpensive.
4. Hadoop has evolved along several parallel version lines: 0.x, 1.x, and 2.x. Version 0.23 added user authentication management (access to Hadoop via password), and 2.x added NameNode HA; enterprises currently use 2.x.
II. Introduction to HDFS
1. Advantages and disadvantages of HDFS
Advantages:
High fault tolerance: data is automatically saved in multiple replicas, and a lost replica is automatically recovered. This reliability also brings faster processing: if node A is under high load, the data can be read from node B instead.
Suitable for batch processing: move the computation rather than the data; block locations are exposed to the computing framework.
Suitable for big data processing: even petabytes of data, file counts in the millions, and 10k+ nodes.
Can be built on inexpensive machines: reliability is improved through multiple replicas, with fault-tolerance and recovery mechanisms.
Disadvantages:
Low-latency data access: data that must be retrieved within milliseconds, such as orders, is not suitable for storage in HDFS.
Small file access: not suitable for storing large numbers of small files; if there is such a need, compress the small files first.
Concurrent writes and random file modification: not suitable for modification. In practice, network disks and cloud drives built on Hadoop do not allow content to be modified; you can only delete and re-upload.
2. HDFS Architecture
1> HDFS storage unit (block): a file is split into a number of fixed-size blocks (64 MB by default, configurable; if the file is smaller than 64 MB it occupies a single block), which are stored on different nodes. Each block has three replicas by default (the more replicas, the lower the disk utilization). The block size and replica count are set by the client when uploading the file; after the upload succeeds the replica count can still be changed, but the block size cannot. For example, a 200 MB file is split into 4 blocks stored on different nodes; if one machine goes down, the lost replicas are automatically re-replicated and the file returns to its normal state. As long as the three machines holding all replicas of a block do not fail at the same time, no data is lost.
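As a rough sketch (assuming the Hadoop 1.x property names dfs.block.size and dfs.replication, and a made-up file bigfile.log), a client could set both values when uploading and later change only the replica count:
hadoop fs -D dfs.block.size=67108864 -D dfs.replication=3 -put bigfile.log /data/bigfile.log    # 64 MB blocks, 3 replicas
hadoop fs -setrep -w 2 /data/bigfile.log    # the replica count can still be changed after upload; the block size cannot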
2> HDFS contains 3 kinds of nodes: NameNode (NN), Secondary NameNode (SNN), and DataNode (DN).
NN node function
Receives the client's read and write requests and holds the metadata (the metadata is the most important part: if it is lost, the data on the DataNodes becomes garbage). The metadata includes all file information except the file contents: which blocks a file consists of, and which DNs each block is stored on (this is reported by the DNs when they start; why must it be reported? because, as noted below, the block locations are not persisted). The metadata in the NN is loaded into memory at startup and is persisted on disk in a file named fsimage, but the block location information is not saved into fsimage. The edits log records changes to the metadata: for example, when a file is inserted, Hadoop does not modify fsimage directly but records the operation in the edits log file, while the metadata in NN memory is modified in real time. At intervals, edits and fsimage are merged to generate a new fsimage; this edits mechanism works like the pre-commit (write-ahead) log of a relational database transaction.
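For reference (a sketch that assumes the hadoop.tmp.dir of /opt/hadoop-1.2 configured later in this article, under which the NN's name directory defaults to dfs/name), the metadata files can simply be listed on the NN:
ls /opt/hadoop-1.2/dfs/name/current    # typically shows fsimage, edits, fstime, VERSION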
SNN node function
Its main job is to help the NN merge the edits log, which reduces NN startup time. The merge involves a lot of IO, while the NN's main role is to serve the users' read and write requests, so the NN cannot devote large amounts of resources to it. The SNN is not a backup of the NN, but it can serve as a partial backup of the metadata; it is not a real-time backup (not hot standby).
SNN merge process
When the checkpoint condition is met (the trigger is configurable: fs.checkpoint.period, default 3,600 seconds, or the edits log size, up to 64 MB), the SNN copies the NN's edits log and fsimage metadata files to itself, possibly across the network. At the same time the NN creates a new edits file to record the users' subsequent read and write operations. The SNN then merges the copied files into a new fsimage and sends it back to the NN, and the NN replaces the old fsimage with the new one. The cycle then repeats.
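Both checkpoint triggers can be set in core-site.xml; this is only a sketch spelling out the Hadoop 1.x defaults (fs.checkpoint.size is in bytes):
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>    <!-- merge at least once per hour -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>    <!-- or whenever the edits log reaches 64 MB -->
</property>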
Interview questions:
1. What is the role of the SNN?
When the NameNode starts, it first loads the fsimage file, then applies the edits file, writes the latest directory tree information into a new fsimage file, and switches to a new edits file. This process is fine, but it has a small flaw: if many changes occur after the NameNode has started, the edits file grows very large, to a degree that depends on the update frequency. During the next NameNode startup, after reading the fsimage file it must apply this enormous edits file, which makes the startup time long and hard to control; it may take hours to start. This "edits file too large" problem is precisely the main problem the Secondary NameNode solves.
2. If the NN goes down (e.g. its hard drive fails), can the data be recovered?
If the SNN and NN are not on the same machine, the data can be recovered from the SNN (except for the changes made since the last checkpoint), which is why the NN and SNN are best not placed on a single machine.
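One hedged recovery sketch: copy the SNN's checkpoint directory to the replacement NN machine (or point fs.checkpoint.dir at it), leave the NN's name directory empty, and start the NameNode with the -importCheckpoint option; edits written after the last checkpoint are still lost.
hadoop namenode -importCheckpoint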
DN node function
Stores the data; reports its block information to the NN when the DN process starts, and keeps in touch with the NN by sending heartbeats (once every 3 seconds). If the NN receives no heartbeat from a DN for 10 minutes, it considers the DN lost and copies the blocks that were on it to other DNs.
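For reference, both intervals are configurable in hdfs-site.xml. This is only a sketch with the Hadoop 1.x property names (they differ in later versions); the roughly 10-minute timeout comes from 2 x the recheck interval plus 10 x the heartbeat interval:
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>    <!-- heartbeat every 3 seconds -->
</property>
<property>
  <name>heartbeat.recheck.interval</name>
  <value>300000</value>    <!-- recheck dead DNs every 5 minutes (milliseconds) -->
</property>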
PS: Interview question: block replica placement strategy
The first replica: placed on the DN that uploads the file, or, if the write is submitted from outside the cluster, on a node that is not too full and whose CPU is not too busy. The second replica: placed on a node in a different rack from the first replica. The third replica: placed on a node in the same rack as the second replica (a whole rack can lose power, so this ensures safety while increasing speed).
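To check where the replicas of a file actually landed (using the made-up path /data/bigfile.log from the earlier sketch):
hadoop fsck /data/bigfile.log -files -blocks -locations -racks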
3. HDFS read and write process
Reads are concurrent: a client can read different blocks from multiple DNs in parallel.
Writes go to only one DN per block; that DN then starts a thread to replicate the block to the other DNs, which is fast.
4. HDFS file permissions and authentication
Permissions are similar to Linux: if the Linux user wangwei creates a file with a Hadoop command, the owner of that file in HDFS is wangwei. HDFS does not do password authentication; the benefit is speed, since otherwise every read and write would have to verify a password, and the data stored in HDFS is generally not highly sensitive. A short sketch of the permission commands follows, and that concludes the HDFS theory.
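The user, group, and paths below are made up for illustration; only standard fs shell commands are used:
hadoop fs -ls /user/wangwei                                # shows owner, group and rwx bits
hadoop fs -chmod 640 /user/wangwei/report.txt
hadoop fs -chown wangwei:hadoop /user/wangwei/report.txt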
III. HDFS Installation and Deployment
1. Download Hadoop 1.2 and unpack it: tar zxvf hadoop-1.2.1.tar.gz.
2. Configure conf/core-site.xml on the node1 node
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- Which machine the NN runs on and its port; this can be regarded as the entry point of HDFS -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:9000</value>
  </property>
  <!-- The HDFS working directory; by default it is under Linux /tmp, which is emptied on every reboot, so it is best to set it. -->
  <!-- Some other directories, such as dfs.name.dir and dfs.name.edits.dir, are based on this temporary directory. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-1.2</value>
  </property>
</configuration>
3. Configure conf/hdfs-site.xml on the node1 node
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- Number of replicas; default is 3, and it must be <= the number of DN nodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
4. On node1, configure the DNs in conf/slaves and the SNN in conf/masters (try not to put the SNN on the same node as the NN); the NN has already been configured in core-site.xml.
[root@node1 conf]# vim masters
node2
[root@node1 conf]# vim slaves
node2
node3
5. Passwordless login configuration
1> Passwordless login means that running ssh node2 on node1 logs in to node2 without asking for a password.
2> Why is passwordless login needed? With HDFS you can run one command on any machine to start HDFS, and it then starts the Java process on every node (each node is really just a Java process), that is, it starts the whole cluster; under the hood it remotely logs in to the other machines to start those node processes (the start-all.sh command). It is really just a convenience; otherwise you would need to start the nodes one at a time.
3> To start the cluster with start-all.sh on node1, you need passwordless login from node1 to node2 and node3. First add the hostname-to-IP mappings to /etc/hosts on each of node1, node2 and node3:
192.168.144.11 node1
192.168.144.12 node2
192.168.144.13 node3
Then each node generates its own public and private keys and appends the public key to its local authorized_keys, first completing passwordless login to itself:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Finally, copy node1's id_dsa.pub to node2 and node3 and append it to each machine's local authorized_keys; copying and pasting the key contents directly can cause problems.
scp ~/.ssh/id_dsa.pub root@node2:~
scp ~/.ssh/id_dsa.pub root@node3:~
# then, on node2 and on node3:
cat id_dsa.pub >> ~/.ssh/authorized_keys
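A quick check that the setup worked (assuming the hostnames above): each command should print the remote hostname without prompting for a password.
ssh node2 hostname
ssh node3 hostname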
6. Set JAVA_HOME
1> The first way only lasts for the current session; after the terminal is closed it no longer takes effect.
declare -x JAVA_HOME=/usr/local/jdk1.7.0_03
export PATH=$PATH:$JAVA_HOME/bin
The second, permanent way:
vi /etc/profile and add the following content:
JAVA_HOME=/usr/local/jdk1.7.0_03
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME
Save the profile file and execute source /etc/profile to make the settings take effect immediately.
2> However, the JAVA_HOME configured here does not take effect for Hadoop; it must also be configured in conf/hadoop-env.sh. Then copy all the configuration files under node1's conf/ to node2 and node3:
[root@node1 conf]# scp ./* root@node2:~/hadoop-1.2.1/conf/
[root@node1 conf]# scp ./* root@node3:~/hadoop-1.2.1/conf/
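For reference, the line to edit in conf/hadoop-env.sh looks like the following; the JDK path is the one used above, so adjust it to your own installation:
# conf/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.7.0_03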
7. Format the NN under the bin directory and start HDFS
./hadoop namenode -format    # only needs to be run once
./start-dfs.sh               # start HDFS
After a successful start, view the Java processes with the jps command (the JDK must be installed and JAVA_HOME configured in /etc/profile). The NN's log says everything started successfully, but running jps on node2 and node3 shows that nothing started there; this is caused by the firewall. On the NN, execute service iptables stop to turn off the firewall, stop HDFS, and start it again. Then, after configuring the hosts file, HDFS can be accessed at http://node1:50070.
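A rough sketch of what a healthy start looks like for this particular layout (NN on node1, SNN and a DN on node2, a DN on node3); the exact process list depends on your configuration:
[root@node1 bin]# jps        # expect: NameNode, Jps
[root@node2 ~]# jps          # expect: DataNode, SecondaryNameNode, Jps
[root@node3 ~]# jps          # expect: DataNode, Jps
[root@node1 bin]# ./hadoop dfsadmin -report    # should list 2 live DataNodes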