Hadoop 2.7.5 + HBase 1.4.0 Fully Distributed Cluster Setup

I. Recommended reading for beginners before tackling the fully distributed setup:
For more on the principles behind HBase, see the PDF.

II. Before installing the fully distributed cluster, you should have a basic grasp of the concepts:
1. Hadoop excels at storing arbitrary, semi-structured, and even unstructured data; it complements almost every kind of database.
2. HBase is the Hadoop database. HBase is not a column-oriented database in the classic RDBMS sense, although it does use a column-oriented storage format on disk.
3. A column-oriented database aggregates data in units of columns and stores those columns sequentially on disk.
4. Differences between an RDBMS and a column-oriented store:
An RDBMS suits data that is limited in size and follows strict rules (normal forms, Codd's 12 rules), and it fits real-time data access scenarios.
HBase suits key-value data access or ordered (sequential) data access.
5. The most basic unit of the HBase data model is the "column".
6. One or more columns form a row, which is addressed by a unique "row key" (the row key is roughly equivalent to the primary key in an RDBMS).
7. A table can have many rows, and each column can have multiple versions, with each version's value stored in a separate cell.
8. Several columns make up a "column family".
9. All columns of a column family are stored together in the same low-level storage files, called HFiles.
10. Column families must be defined when the table is created; they should not be too numerous or changed too often. Generally only 1-2 are recommended, ideally just one.
11. In contrast to column families, the number of columns is unlimited, and column values have no restrictions on type or length.
12. Every column value (cell) carries a timestamp, assigned by the system by default or set explicitly by the user.
13. The versions within a cell are sorted in descending timestamp order, so by default only the newest value is returned on access.
14. HBase's data storage model (a runnable illustration follows this list):
SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>
The outer map is the table: each row key maps to a list of column families (the first List). Each column family is in turn a sorted map from column to that column's values, and each value is paired with its timestamp (the innermost List).
15. HBase is a distributed, persistent, strictly consistent storage system.
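
To make items 7, 13, and 14 concrete, here is a minimal sketch using the HBase shell, run from $HBASE_HOME once a cluster is available. The table name 'demo', the family 'cf', and all values are hypothetical, for illustration only:

# Create a table whose one column family keeps up to 3 versions per cell,
# write two versions of the same cell, then read them back newest-first:
bin/hbase shell <<'EOF'
create 'demo', {NAME => 'cf', VERSIONS => 3}
put 'demo', 'row1', 'cf:col1', 'v1'
put 'demo', 'row1', 'cf:col1', 'v2'
get 'demo', 'row1', {COLUMN => 'cf:col1', VERSIONS => 3}
EOF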

III. Build the HBase distributed environment:
1. Not every HBase run mode requires a cluster. If you only want to test locally, the standalone mode needs nothing more than Java installed.
2. Fully distributed architecture:
(1) Environment:
System: CentOS 6.8
Hadoop version: 2.7.5
ZooKeeper version: 3.4.11
HBase version: 1.4.0
(2) Download the HBase installation package from:
http://apache.spinellicreations.com/hbase/
The HBase version must be compatible with the Hadoop version in use.
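
For example, on a machine with Internet access the package can be fetched directly on the server. The exact path below is an assumption based on the usual Apache mirror layout, so check the mirror's directory listing first:

# Hypothetical mirror path; verify it against the actual listing:
wget http://apache.spinellicreations.com/hbase/1.4.0/hbase-1.4.0-bin.tar.gz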

(3) Upload the downloaded HBase package to the server and extract it to a suitable directory. The version used here is hbase-1.4.0-bin.tar.gz:
tar -zxvf hbase-1.4.0-bin.tar.gz -C /root/apps/
(4) After decompression, you can list the files in the project directory with ll or ls -l.

(5) The contents of the directory:
a. Licensing details are in the LICENSE and NOTICE.txt files.
b. Other general information is in README.txt.
c. CHANGES.txt is a static snapshot of the change log page; it lists the changes that went into the downloaded release.
d. The bin directory contains all the scripts HBase provides, which can start and stop the cluster, run standalone daemons, or start additional master nodes.
e. The conf directory contains the files that define the HBase configuration.
f. The docs directory contains a copy of the HBase project web page, along with documentation for the tools, APIs, and the project itself.
g. The hbase-webapps directory contains the Java-implemented web interfaces.
h. The lib directory contains the Java libraries (JAR files) that hold the actual program code.
i. The logs directory does not exist at first startup; HBase creates it automatically through its logging framework. HBase processes usually run as daemons, i.e., in the background of the operating system, and over their lifetime they print status, progress, exceptions, and other information to the log files.
j. The src directory holds the source files and release information.
(6) HBase has two run modes: standalone mode and distributed mode.
(7) Distributed mode is further divided into pseudo-distributed and fully distributed mode; this article covers the fully distributed mode.
IV. Fully distributed HBase setup:
1. Assume the Linux environment is fully prepared: a four-node Hadoop/HDFS cluster with hosts named Sparknode1, Sparknode2, Sparknode3, and Sparknode4, and a three-node ZooKeeper cluster with hosts named zookeeper1, zookeeper2, and zookeeper3. I did not use the ZooKeeper instance that ships with HBase; a separate ZooKeeper cluster was set up instead.
2. Before installing the HBase cluster, make sure the HDFS cluster and the ZooKeeper cluster are both started. HBase depends on HDFS and ZooKeeper, and neither can be missing.
3. Start ZooKeeper with the following command (on each ZooKeeper node):
bin/zkServer.sh start
Start HDFS with:
sbin/start-dfs.sh
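
As a quick sanity check before continuing, it is worth confirming that both clusters are really up. The installation paths below are assumptions based on the versions used in this article; adjust them to your layout:

# Should report "Mode: leader" or "Mode: follower" on each ZooKeeper node:
/opt/zookeeper-3.4.11/bin/zkServer.sh status
# Should list the live DataNodes of the HDFS cluster:
/opt/hadoop-2.7.5/bin/hdfs dfsadmin -report
# Expect NameNode/DataNode and ZooKeeper's QuorumPeerMain among the Java processes:
jps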
4. With the HBase package extracted to its target directory, modify the configuration file conf/hbase-env.sh as follows:
# Set JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.8.0_101/
# HBase should use the separately built ZooKeeper cluster rather than its embedded one,
# so set HBASE_MANAGES_ZK to false (the default is true)
export HBASE_MANAGES_ZK=false
# Adjust the HBase heap size: the default is 1 GB, raised here to 4 GB
export HBASE_HEAPSIZE=4G
5. Modify the configuration file hbase-site.xml as follows:
<configuration>
  <property>
    <!-- Shared directory of the region servers, used to persist HBase data; it points at
         the HDFS instance, i.e. the address of the HDFS master node (NameNode) -->
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <!-- HBase cluster run mode: false means standalone mode, true means distributed mode -->
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <!-- Servers of the ZooKeeper quorum, separated by commas. 2181 is ZooKeeper's default
         client port, so appending it is optional unless ZooKeeper listens on another port -->
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
  </property>
  <property>
    <!-- Web UI port of the HBase master; set it to -1 if you do not want the UI to start.
         The default is 60010 -->
    <name>hbase.master.info.port</name>
    <value>60010</value>
  </property>
</configuration>
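
One thing worth verifying: hbase.rootdir must agree with the fs.defaultFS address of the HDFS cluster, otherwise HBase cannot reach its storage. A quick way to check it (the Hadoop path is assumed from this setup):

# Prints the configured HDFS address; it should match the host:port in hbase.rootdir:
/opt/hadoop-2.7.5/bin/hdfs getconf -confKey fs.defaultFS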
6. Modify the configuration file regionservers as follows:
This file lists every host that runs an HRegionServer daemon, one host per line. The HBase cluster start and stop scripts work through the hosts listed in this file.
Sparknode1
Sparknode2
Sparknode3
Sparknode4
7. Add HBase to the environment variables:
export HBASE_HOME=/usr/soft/hbase-1.4.0
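
A common way to make this permanent, and to put the HBase scripts on the PATH as well, is to append the exports to /etc/profile. The HBase path is the one used in this article; adding bin to PATH is a convenience addition of my own:

# Append the variables to /etc/profile and reload it:
cat >> /etc/profile <<'EOF'
export HBASE_HOME=/usr/soft/hbase-1.4.0
export PATH=$PATH:$HBASE_HOME/bin
EOF
source /etc/profile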
8. Copy the Hadoop configuration files into HBase:
Copy core-site.xml and hdfs-site.xml from Hadoop's /opt/hadoop-2.7.5/etc/hadoop/ directory into HBase's /usr/soft/hbase-1.4.0/conf/ directory.
Instead of copying, I used a soft link (symlink), so that when the configuration files under the Hadoop directory change there is no need to update the copies under /usr/soft/hbase-1.4.0/conf/ as well:
ln -s /opt/hadoop-2.7.5/etc/hadoop/ /usr/soft/hbase-1.4.0/conf/
Usage: ln -s <source file or directory> <link name>
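
If you prefer not to link the whole directory, an alternative sketch is to link only the two files HBase actually reads; the paths below are the ones used in this setup:

# Link core-site.xml and hdfs-site.xml individually into HBase's conf directory:
ln -s /opt/hadoop-2.7.5/etc/hadoop/core-site.xml /usr/soft/hbase-1.4.0/conf/core-site.xml
ln -s /opt/hadoop-2.7.5/etc/hadoop/hdfs-site.xml /usr/soft/hbase-1.4.0/conf/hdfs-site.xml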
9. Copy the configured HBase directory to the other nodes.
Run the following commands on Sparknode1 (the master node) to copy its hbase-1.4.0 directory to the same path on Sparknode2, Sparknode3, and Sparknode4:
scp -r hbase-1.4.0/ Sparknode2:$PWD
scp -r hbase-1.4.0/ Sparknode3:$PWD
scp -r hbase-1.4.0/ Sparknode4:$PWD
10. Synchronize the clocks
When using HDFS, seemingly inexplicable problems often turn out to be caused by clock skew between the servers. HBase frequently works with timestamps, versions, and timeouts, so if the clocks of the servers drift too far apart, spurious errors can appear.
There are two ways to synchronize the time across multiple servers:
(1) Run a time synchronization service on each server and sync against a network time server;
(2) If the machines cannot reach the Internet, manually set the time of all servers to the same value.
The second method uses the date -s command to set the time:
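For example (the NTP host and the date string below are only placeholders; the first variant assumes the ntpdate package is installed):

# Method 1: one-off sync against a public NTP pool, run on every node:
ntpdate pool.ntp.org
# Method 2: without Internet access, set the same time manually on every node:
date -s "2018-01-15 10:00:00"   # example value only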

11. Start the HBase cluster

Before starting the HBase cluster, make sure that both the HDFS cluster and the ZooKeeper cluster have started successfully.
bin/start-hbase.sh
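Once the script finishes, a minimal smoke test is to ask the cluster for its status through the HBase shell; with the four region servers above, status should report 1 active master and 4 servers:

# Run from $HBASE_HOME; 'status' summarizes the cluster, 'list' shows the tables:
bin/hbase shell <<'EOF'
status
list
EOF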

12. Run the jps command to view the running processes
jps

On the master node there are two HBase processes (HMaster and HRegionServer), while on the other data nodes only the HRegionServer process runs.
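
For reference, the output on the master node should contain entries like the following; the PIDs are placeholders and will differ on your machines:

# Illustrative jps output on the master node:
#   2351 HMaster
#   2488 HRegionServer
#   1984 NameNode        (if the HDFS NameNode runs on this host)
#   2710 Jps
# The worker nodes show only HRegionServer (plus their DataNode process).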

13. Stop the HBase cluster
bin/stop-hbase.sh