Things about Hadoop (1): A Preliminary Study of Hadoop


Objective

What is Hadoop?
According to the encyclopedia: "Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distributed system, taking advantage of the power of a cluster for high-speed computation and storage."
This may sound abstract; we can come back to this question after working through the various parts of Hadoop step by step.

Hadoop Big Family

Hadoop is not a single project. After more than 10 years of development, Hadoop has grown into a huge family with nearly 20 products.
The most central of these are the following 9 products, which we will learn in the following order.

Hadoop: an open-source distributed computing framework from the Apache open-source organization that provides the distributed file system subproject (HDFS) and a software architecture supporting MapReduce distributed computing
Hive: a data warehousing tool based on Hadoop
Pig: a large-scale data analysis tool based on Hadoop
ZooKeeper: a distributed, open-source coordination service designed for distributed applications; it is mainly used to solve data management problems frequently encountered in distributed applications, simplifying their coordination and management and providing high-performance distributed services
HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system; HBase can be used to build large, structured storage clusters on inexpensive PC servers
Mahout: a distributed framework for machine learning and data mining based on Hadoop
Sqoop: a tool for transferring data between Hadoop and relational databases (MySQL, Oracle, Postgres, etc.); data can be imported from a relational database into HDFS, and HDFS data can also be exported to a relational database
Cassandra: an open-source distributed NoSQL database system
Flume: a distributed, reliable, highly available mass log aggregation system that can be used for log data collection, log processing, and log data transfer

OK, let's formally start learning Hadoop.

1 Environment Construction

There are three ways to install Hadoop:

Stand-alone mode: easy to install, with almost no configuration, but limited to debugging purposes;
Pseudo-distributed mode: the NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode processes (5 in total) are started on a single node, simulating the various nodes of a distributed deployment;
Fully distributed mode: a normal Hadoop cluster consisting of multiple nodes, each performing its own duties

Next, we build a pseudo-distributed environment.

Operating system: Ubuntu 16.04 LTS
JDK: JDK1.8
Hadoop: 2.6.0

S1 Creating a Hadoop user

First switch to root user

su root

Create a user named hadoop, set a password for it, and grant it administrator privileges:

sudo useradd -m hadoop -s /bin/bash
sudo passwd hadoop
sudo adduser hadoop sudo

After the user has been created successfully, log out of the current account and log back in as the hadoop user.

S2 Update apt

Since we will be installing software with apt, we first update apt by executing the following command:

sudo apt-get update
S3 Installing VIM

Install vim, which we will use to edit files:

sudo apt-get install vim
S4 Installing SSH

Install SSH so that we can log in and control machines remotely.
Ubuntu has the SSH client installed by default; we also need to install the SSH server, using the following command:

sudo apt-get install openssh-server

Log in to the machine using the following command

ssh localhost
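With this alone, ssh localhost asks for a password on every login. A common additional step (not part of the original text, shown here only as a minimal sketch) is to set up passwordless SSH to localhost, which the Hadoop start-up scripts later rely on:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa          # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorize that key for login on this machine
chmod 600 ~/.ssh/authorized_keys

After this, ssh localhost should log in without prompting for a password.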
S5 Installing the JDK

First download the JDK 1.8 installation package from the official website; I downloaded:
jdk-8u111-linux-x64.tar.gz
Create a new jvm folder under the /usr/lib directory and grant permissions on it:

cd /usr/lib
sudo mkdir jvm
sudo chown hadoop ./jvm
sudo chmod 777 ./jvm

Copy the downloaded installation package into the jvm directory.
In the jvm directory, run the following commands to decompress the archive and rename the resulting folder:

sudo tar zxvf jdk-8u111-linux-x64.tar.gz
sudo mv jdk1.8.0_111 java

Continue in the jvm directory and execute the following command to open the file in the vim editor, then press "i" to enter insert mode:

vim ~/.bashrc

Move the cursor to the very top of the file and enter the following lines (they must be written at the beginning; if written at the end, JAVA_HOME may not be found):

export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Press the ESC key first, then press Shift+ZZ to save and exit back to command-line mode, and enter the following command to make the changes take effect:

source ~/.bashrc

The configuration is now done, and you can use the following commands to verify that the installation was successful:

java -version
$JAVA_HOME/bin/java -version

S6 Installing Hadoop 2

Download hadoop-2.6.0.tar.gz and hadoop-2.6.0.tar.gz.mds from the page http://mirror.bit.edu.cn/apache/hadoop/common/

Create a new hadoop directory under /usr/local and grant permissions on it:

cd /usr/local
sudo mkdir hadoop
sudo chown hadoop ./hadoop
sudo chmod 777 ./hadoop

Copy the downloaded files into this directory.
Execute the following commands to verify the download: the first prints the MD5 value recorded in the .mds file, the second computes the MD5 checksum of the archive itself, and the two values should match.

cat /usr/local/hadoop/hadoop-2.6.0.tar.gz.mds | grep 'MD5'
md5sum /usr/local/hadoop/hadoop-2.6.0.tar.gz | tr "a-z" "A-Z"

Switch to the /usr/local/hadoop directory, unpack the archive, and rename the resulting folder:

cd /usr/local/hadoop
sudo tar -zxf hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 hadoop2

After Hadoop is unpacked, use the following command to verify:

cd /usr/local/hadoop/hadoop2
./bin/hadoop version
S7 Pseudo-distributed configuration

How Hadoop runs is determined by its configuration files.
The files are located in /usr/local/hadoop/hadoop2/etc/hadoop/, and we need to modify the core-site.xml and hdfs-site.xml files.
First grant permissions to the user:

cd /usr/local/hadoop
sudo chown hadoop ./hadoop2
sudo chmod 777 ./hadoop2

Execute the following commands:

cd /usr/local/hadoop/hadoop2
vim ./etc/hadoop/core-site.xml

Modify the configuration as follows and save the file (core-site.xml sets the default file system address and the Hadoop temporary directory):

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/hadoop2/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Then execute the following command:

vim ./etc/hadoop/hdfs-site.xml

Modify the configuration as follows and save the file:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/hadoop2/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/hadoop2/tmp/dfs/data</value>
    </property>
</configuration>

If you need to switch back to non-distributed mode, simply delete the properties you added.
Execute the following command to format the NameNode (run it from the hadoop2 directory):

./bin/hdfs namenode -format

If you see "successfully formatted" in the output, the format succeeded.
Execute the following command to start the NameNode and DataNode daemons:

./sbin/start-dfs.sh

Follow the prompts; when start-up is complete, verify with the following command:

jps

"NameNode", "DataNode", and "Secondarynamenode" appear to indicate a successful start. Secondarynamenode does not start, please restart, the other two no boot words before checking the configuration.
After successful startup, enter the browser address bar: localhost:50070 can view information for Namenode and Datanode.
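To get a first feel for HDFS at this point (an optional extra step, not part of the original walkthrough; the commands below are standard HDFS shell commands and assume you are still in the hadoop2 directory), you can create a directory in HDFS and list it:

./bin/hdfs dfs -mkdir -p /user/hadoop   # create a home directory for the hadoop user in HDFS
./bin/hdfs dfs -ls /user                # list it to confirm it exists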

2 Hadoop Exploration

With the environment set up we can feel at ease; next, let's go over some necessary theory.

2.1 Why use Hadoop?

The creation of anything has its inevitability.
Since 2012, the term "big data" has been mentioned more and more often, and we have now entered the era of big data. In this age of information explosion, the amount of data generated every day is enormous. Big data is about more than just data volume; it has four characteristics:
large data volume, wide variety, low value density, and high velocity (the data ages quickly).
Based on these characteristics, we need something with the following capabilities:
1. It can store large amounts of data
2. It can process large amounts of data quickly
3. It can run analysis over large amounts of data
This leads to the model of Hadoop:

It looks like a layered system: HDFS and MapReduce form the bottom layer, while Hive, Pig, Mahout, ZooKeeper, Flume, Sqoop, and HBase are software built on top of that underlying layer.
So the core of Hadoop is HDFS and MapReduce.

So:
Hadoop is a computing framework for working with big data, and it is distributed, reliable, and scalable.
Distributed: Hadoop improves efficiency by distributing files and tasks across a multitude of computer nodes.
Reliable: because it is distributed, the failure of one or a few nodes does not affect the whole program.
Scalable: adding or removing any single node does not affect the running of the program.

2.2 Single-node system

In practice we run Hadoop on a cluster of n computers (or several virtual machines simulating one); with only a single computer, Hadoop loses its meaning.
We call each computer a node, and Hadoop needs to be configured on every node. From the configuration process above, we can see that the Hadoop architecture on a single node (that is, on each computer) looks like the following diagram:

2.3 HDFS

HDFS stands for Hadoop Distributed File System.
Data volumes have now reached the petabyte scale (1 PB = 1024 TB); for a single hard disk, storing data of this order of magnitude is a considerable strain. So it is natural to think of splitting the data up and storing it across multiple computers, which gives rise to the distributed file system.
To achieve data reliability and safety, HDFS creates replicas of each piece of data; the default replication factor is 3, so by default each piece of data is stored on three different nodes. As a result, HDFS has the following characteristics:
1. High data redundancy: there is no way around it; to improve reliability, redundancy is essential.
2. Suited to write-once, read-many workloads: because every write must also create replicas, writes are expensive; workloads with high-frequency write I/O are not a good fit for Hadoop.
3. Suited to large files, not small files: HDFS has the concept of a block, its unit of storage; the default block size was 64 MB in older versions (128 MB in Hadoop 2.x), and many enterprises tune it even higher. When a file is stored, large files are split up and stored across blocks, and a block holds data from only one file. So storing too many small files wastes a lot of storage space. (A quick way to check these values is shown right after this list.)
4. Low hardware requirements: it can run on inexpensive commodity servers, without the need for high-performance but expensive servers.
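As a quick check of the numbers mentioned above (a small sketch, not part of the original text; it assumes the pseudo-distributed setup from section 1 and is run from the hadoop2 directory), you can ask HDFS for the effective replication factor and block size:

./bin/hdfs getconf -confKey dfs.replication   # 1 in our pseudo-distributed configuration
./bin/hdfs getconf -confKey dfs.blocksize     # 134217728 bytes = 128 MB, the Hadoop 2.x default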

2.3.1 NameNode, DataNode and Secondary NameNode

An HDFS cluster runs in master-slave mode, much in the same sense as the master-worker model in Nginx. Two kinds of nodes are mainly involved: the NameNode, i.e. the master, and a large number of DataNodes, i.e. the workers.
Below is the diagram from the Apache official website showing the NameNode and DataNodes:

NameNode: there is only one, and it is responsible for maintaining the metadata of the entire file system; the metadata records, for each file, its information in the namespace, the location of the file, and the mapping to all of its data blocks on the DataNodes.
DataNode: present on every node and responsible for the concrete work, including serving read and write requests and performing block creation, deletion, and replication according to the NameNode's instructions.
Client: interacts with the NameNode and DataNodes on behalf of the user to access the entire file system. The client provides a series of file system interfaces, so when programming we can implement the functionality we want without needing to know about the DataNodes and the NameNode.
Secondary NameNode: to understand the Secondary NameNode, you first need to understand the NameNode's fault-tolerance mechanism.
As we have seen, a Hadoop cluster has only one NameNode; once the NameNode fails, the whole cluster is paralyzed, so the NameNode must have a good fault-tolerance mechanism.
The first mechanism is remote backup: when the NameNode writes data to disk, it synchronously writes a copy of that data to a remote server.
The second is to use an auxiliary NameNode, i.e. the Secondary NameNode: the key thing to know is that the NameNode stores its data as a namespace image plus an edit log, and the Secondary NameNode's main role is to merge these two files periodically.
However, because the merge only runs periodically, the Secondary NameNode cannot stay in sync with the primary NameNode in real time, so if the NameNode goes down, some data loss is inevitable. The safe approach is therefore to combine the two methods: when the primary NameNode goes down, copy the metadata from the remote backup to the Secondary NameNode, and then let the Secondary NameNode act as the NameNode.
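As an aside (a small sketch, not from the original text, again assuming the setup from section 1), the "periodically" above is driven by configuration; you can inspect the checkpoint interval like this:

./bin/hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between merges, 3600 by default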

2.4 MapReduce

MapReduce is Hadoop's distributed computing framework; for now we only need a preliminary understanding of it.
MapReduce = Map + Reduce, that is, the MapReduce process is divided into a Map phase and a Reduce phase, and these two phases are exactly what we control through our program code.
Let's look at the following picture:

This picture was found on the Internet.
It shows a complete MapReduce processing flow.
Input: the far left is the input process, which reads in the data shown in the diagram.
Split (sharding): MapReduce computes input splits from the input files, and each split corresponds to one map task. The splitting process is closely tied to HDFS: for example, with an HDFS block size of 64 MB, if we input three files of 10 MB, 65 MB, and 128 MB, the first file produces one 10 MB split, the second produces one 64 MB split plus one 1 MB split, and the third produces two 64 MB splits, giving 5 splits in total.
Map: the map phase is controlled by the programmer through code; in the figure it roughly splits the text into words and, for each word, stores the word as a key with a value of 1, indicating one occurrence.
Shuffle: in the shuffle phase, because the map phase produces many entries with the same key, the entries with the same key are merged together.
Reduce: the reduce phase is also controlled by the developer through code; in this example, the values of entries sharing the same key are summed, producing the final result.
The final output is the number of occurrences of each word.
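You can reproduce exactly this word-count flow with the example jar that ships with Hadoop (a sketch under the assumptions of the setup above: Hadoop 2.6.0 unpacked to /usr/local/hadoop/hadoop2, HDFS started, and the commands run from that directory):

./bin/hdfs dfs -mkdir -p /user/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml /user/hadoop/input    # use the config files as sample text
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/hadoop/input /user/hadoop/output
./bin/hdfs dfs -cat /user/hadoop/output/*                    # each output line is a word and its count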

Summarize

Having written this far, the article is basically over. It mainly covered two things:
1. Building a pseudo-distributed Hadoop environment on Ubuntu.
2. Some simple but necessary theory about Hadoop, mainly HDFS and MapReduce.
If there is anything inappropriate, I hope you will not hesitate to point it out.
