Preface
What is Hadoop?
In the words of the encyclopedia: "Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distributed system, making full use of the power of clusters for high-speed computation and storage."
This may sound a little abstract; we can revisit the question after learning the various parts of Hadoop step by step.
Hadoop family
Hadoop is not a single project. After 10 years of development, it has grown into a large family of nearly 20 products.
The core consists of the following 9 products, which we will learn step by step in this order.
Hadoop: An open-source distributed computing framework from the Apache open source organization, providing a distributed file system sub-project (HDFS) and a software architecture that supports MapReduce distributed computing
Hive: A data warehouse tool based on Hadoop
Pig: A large-scale data analysis tool based on Hadoop
Zookeeper: A distributed, open-source coordination service designed for distributed applications. It solves data-management problems commonly encountered in distributed applications, simplifying coordination and management while providing high-performance distributed services
HBase: A highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, a large-scale structured storage cluster can be built on inexpensive PC servers.
Mahout: A distributed framework for machine learning and data mining based on Hadoop
Sqoop: A tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, PostgreSQL, etc.) into HDFS, or export data from HDFS back into a relational database.
Cassandra: An open-source distributed NoSQL database system
Flume: A distributed, reliable, and highly available system for massive log aggregation, supporting log data collection, processing, and transmission.
OK, let's officially start learning Hadoop.
1 Environment setup
There are three ways to install Hadoop:
Stand-alone mode: simple installation requiring almost no configuration; used only for debugging;
Pseudo-distributed mode: the five daemons NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode all run on a single node, simulating the distributed operation of separate nodes;
Fully distributed mode: a normal Hadoop cluster consisting of multiple nodes, each performing its own duties
Next, we build a pseudo-distributed environment.
Operating system: Ubuntu 16.04 LTS
JDK: JDK1.8
Hadoop: 2.6.0
S1 Create hadoop user
First, switch to the root user:
su root
Create the user, set a password, and grant administrator rights:
useradd -m hadoop -s /bin/bash
passwd hadoop
adduser hadoop sudo
After the user is created successfully, log out and log back in as the hadoop user.
S2 Update apt
We will need apt to install software later, so update it first by executing the following command:
sudo apt-get update
S3 Install vim
Install vim for editing files:
sudo apt-get install vim
S4 Install SSH
Install SSH for remote login control.
Ubuntu installs the SSH client by default; we also need to install the SSH server, using the following command:
sudo apt-get install openssh-server
Use the following command to log in to the local machine:
ssh localhost
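If `ssh localhost` prompts for a password every time, it is common (and convenient for Hadoop's start-up scripts later) to set up passwordless key-based login. A minimal sketch, assuming the default OpenSSH file paths:

```shell
# Create the .ssh directory if it does not exist, with the permissions sshd expects.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
# Generate an RSA key pair with an empty passphrase (skipped if a key already exists).
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize the public key for logins to this machine.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After this, `ssh localhost` should log in without asking for a password.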
S5 Install JDK
First, download the JDK 1.8 archive from the official website; the version downloaded here is:
jdk-8u111-linux-x64.tar.gz
Create a new jvm folder in the /usr/lib directory and grant permissions:
cd /usr/lib
sudo mkdir jvm
sudo chown hadoop ./jvm
sudo chmod -R 777 ./jvm
Copy the downloaded installation package to the jvm directory.
In the jvm directory, execute the following commands to decompress the archive and rename the result:
sudo tar zxvf jdk-8u111-linux-x64.tar.gz
sudo mv jdk1.8.0_111 java
Still in the jvm directory, execute the following command to open the vim editor, then press "i" to enter insert mode:
vim ~/.bashrc
Move the cursor to the beginning of the file and enter the following content (it must go at the beginning; if placed at the end, JAVA_HOME may not be found):
export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Press the Esc key, then press Shift+ZZ to save and exit. Back at the command line, enter the following command to make the changes take effect:
source ~/.bashrc
The configuration is now complete; you can use the following commands to verify that the installation succeeded:
echo $JAVA_HOME
java -version
$JAVA_HOME/bin/java -version
S6 Install Hadoop 2
Download hadoop-2.6.0.tar.gz and hadoop-2.6.0.tar.gz.mds from the http://mirror.bit.edu.cn/apache/hadoop/common/ page.
Create a new hadoop directory and grant permissions:
cd /usr/local
sudo mkdir hadoop
sudo chown hadoop ./hadoop
sudo chmod -R 777 ./hadoop
Copy the downloaded files to this directory.
Execute the following commands to verify the download's integrity: the first prints the MD5 checksum recorded in the .mds file, the second computes the actual checksum of the archive, and the two values should match.
cat /usr/local/hadoop/hadoop-2.6.0.tar.gz.mds | grep 'MD5'
md5sum /usr/local/hadoop/hadoop-2.6.0.tar.gz | tr "a-z" "A-Z"
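Comparing the two checksums by eye is error-prone. As a convenience, here is a small hypothetical helper function, `check_md5`, that does the comparison automatically; it assumes a single-line .mds entry of the form `file: MD5 = <hex>` (the exact .mds layout can vary between Apache releases, so adjust the parsing if needed):

```shell
# check_md5 FILE MDSFILE
# Succeeds if the MD5 recorded in MDSFILE matches the actual checksum of FILE.
check_md5() {
    # Take the text after the last '=' or ':' on the MD5 line, strip whitespace, uppercase.
    recorded=$(grep -i 'MD5' "$2" | sed 's/.*[=:]//' | tr -d ' \t' | tr 'a-z' 'A-Z')
    # Compute the actual checksum and uppercase it for comparison.
    actual=$(md5sum "$1" | cut -d' ' -f1 | tr 'a-z' 'A-Z')
    [ "$recorded" = "$actual" ]
}

# Example (paths as used above):
# check_md5 /usr/local/hadoop/hadoop-2.6.0.tar.gz /usr/local/hadoop/hadoop-2.6.0.tar.gz.mds \
#     && echo "checksum OK" || echo "checksum MISMATCH"
```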
Switch to the /usr/local/hadoop directory, unzip and rename:
cd /usr/local/hadoop
sudo tar -zxf hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 hadoop2
Hadoop is ready to use after decompression; verify it with the following commands:
cd /usr/local/hadoop/hadoop2
./bin/hadoop version