HBase Daily Operation and Maintenance


1.1 Monitoring HBase Health
1.1.1 Operating System
1.1.1.1 IO
A. Cluster network IO, disk IO, HDFS IO
Higher IO means more file read and write operations. When IO suddenly increases, the likely causes are:
1. The compact queue is large and the cluster is performing many compaction operations.
2. A MapReduce job is running.
Data for a single machine can be viewed in the CDH UI, either for the whole cluster or by entering the page of a specific machine:

B. IO wait
Disk IO has a large impact on the cluster; if the IO wait time is too long, check the system and disks for anomalies. IO wait normally rises together with IO; currently a normal IO wait on the FMS machines is below 50 ms.
Host-level metrics can be viewed by clicking the Hosts tab in the upper-left corner of the CDH UI and selecting the host:

1.1.1.2 CPU
A high CPU load may indicate that something abnormal is consuming cluster resources; use the other metrics and the logs to determine what the cluster is doing.
1.1.1.3 Memory
1.1.2 Java
GC behavior
Long GC pauses on a RegionServer degrade cluster performance and may make the process appear dead.
1.1.3 Important HBase Metrics
1.1.3.1 Region Status
What to check:
1. the number of regions (total, and per RegionServer)
2. the size of each region
If an anomaly is found, it can be corrected by manually merging regions or manually reassigning regions.
The region count is visible in the CDH UI, the Master web UI, and each RegionServer web UI; for example, the Master web UI:
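Besides the web UIs, the per-server region count can also be pulled from the HBase shell. A minimal sketch, assuming the `status 'simple'` output format `host,port,startcode: ... numberOfOnlineRegions=N, ...` (the field names vary slightly between HBase versions, so check your own shell output first):

```shell
# Extract "server regioncount" pairs from `status 'simple'` output.
# The field name numberOfOnlineRegions is an assumption based on common
# HBase versions; adjust the pattern to match your cluster's output.
region_counts() {
  sed -n 's/^\(.*,[0-9]*,[0-9]*\):.*numberOfOnlineRegions=\([0-9]*\).*/\1 \2/p'
}

# On a live cluster (requires the hbase CLI on PATH):
if command -v hbase >/dev/null 2>&1; then
  echo "status 'simple'" | hbase shell 2>/dev/null | region_counts
fi
```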

The StoreFile sizes can be seen on the RegionServer web UI:

1.1.3.2 Cache Hit Ratio
The cache hit ratio has a significant effect on HBase reads; observe it to tune the size of the BlockCache.
The block cache statistics are shown on the RegionServer web page:

1.1.3.3 Number of Read and Write Requests
The read and write request counts show the load on each RegionServer; if the load is unevenly distributed, check the regions on the busy RegionServer along with the other metrics.
The Master web UI shows the read and write request counts per RegionServer:

The RegionServer web UI shows the read and write request counts per region:

1.1.3.4 Compaction Queue
The compaction queue holds the StoreFiles waiting to be compacted; compaction has a significant impact on HBase reads and writes.
The cluster's total compaction queue size can be seen in the CDH HBase chart library:

The compaction logs can be queried from the CDH HBase home page:

Click "Compaction" to enter:

1.1.3.5 Flush Queue
A region's MemStore is flushed when it reaches the per-region threshold (128 MB), or when the total size of all MemStores on the RegionServer reaches its threshold; each flush produces a new StoreFile.
The flush logs can also be viewed in CDH's HBase UI:

1.1.3.6 RPC Call Queue
RPC operations that cannot be processed immediately are placed in the RPC call queue; its length shows whether the server is keeping up with incoming requests.
1.1.3.7 Percentage of File Blocks Stored Locally
DataNodes and RegionServers are generally deployed on the same machines, so the regions a RegionServer manages are preferentially stored locally to save network overhead. A low block locality usually means a balance or a restart has just happened; after compaction, region data is rewritten to the local DataNode, and block locality slowly climbs back towards 100%:

1.1.3.8 Memory Usage
The main figures to watch are used heap and MemStore size. A used heap consistently above 80-85% is dangerous.
A MemStore that is persistently too small or too large is also abnormal.
Both can be seen on the RegionServer web UI:

1.1.3.9 slowHLogAppendCount
The number of slow HLog writes (> 1 s); this metric can be used to judge the health of HDFS.
View it on the RegionServer web UI:

1.1.4 Checking Logs in CDH
CDH has powerful system event and log search functions; each service's home page (e.g. Hadoop, HBase) provides event and alarm queries. In daily operations, besides watching the CDH home-page alarms, review these events to identify potential problems:

Select a label in the event search (alert, critical) to enter the corresponding event log, for example critical:

1.2 Checking Data Consistency and How to Fix It
Data consistency means:
1. Each region is correctly assigned to a RegionServer, and the region's location information and state are correct.
2. Each table is complete: every possible rowkey maps to exactly one region.
1.2.1 Check
hbase hbck
Note: inconsistencies are sometimes transient, for example while the cluster is starting up or a region is splitting.
hbase hbck -details
Adding -details lists more detailed inspection information, including any split tasks in progress.
hbase hbck Table1 Table2
If you only want to check specific tables, append the table names to the command; this saves time.
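A daily check can be scripted around the summary line hbck prints ("Status: OK" or "Status: INCONSISTENT"). A minimal sketch; the table names are the same placeholders used above:

```shell
# Pull the summary status line out of hbck output.
hbck_status() {
  grep -o 'Status: [A-Z]*' | head -n 1
}

# On a live cluster (requires the hbase CLI on PATH):
if command -v hbase >/dev/null 2>&1; then
  status=$(hbase hbck Table1 Table2 2>/dev/null | hbck_status)
  if [ "$status" != "Status: OK" ]; then
    echo "hbck reported: $status; investigate with 'hbase hbck -details' before repairing"
  fi
fi
```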
CDH
The hbck results can also be seen in the inspection reports provided by CDH; for daily operations it is enough to review the CDH hbck reports:

Select "Recent HBCK results":

1.2.2 Repair
1.2.2.1 Partial Repair
When data is inconsistent, minimize the risk by starting with the following lower-risk repair commands:
1.2.2.1.1 hbase hbck -fixAssignments
Repairs regions that are unassigned, incorrectly assigned, or multiply assigned.
1.2.2.1.2 hbase hbck -fixMeta
Deletes meta table entries for regions that have no data in HDFS.
Adds meta table entries for regions that exist in HDFS but are missing from the meta table.
1.2.2.1.3 hbase hbck -repairHoles
Equivalent to: hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
What -fixHdfsHoles does:
If a hole appears in the rowkey space, i.e. the rowkey ranges of two adjacent regions are not contiguous, this option creates a new region in HDFS to fill the hole. The -fixMeta and -fixAssignments options then register and assign the newly created region, so -fixHdfsHoles is generally used together with the other two.
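The low-risk commands can be applied in order, re-checking after each step so you stop as soon as the cluster is consistent again. A sketch that only prints the plan; pipe it to sh on the cluster to execute:

```shell
# Emit the low-risk repair sequence; each step is followed by a re-check
# so the repair can stop as soon as the cluster is consistent.
repair_plan() {
  cat <<'EOF'
hbase hbck -fixAssignments
hbase hbck
hbase hbck -fixMeta
hbase hbck
hbase hbck -repairHoles
hbase hbck
EOF
}
repair_plan
```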
1.2.2.2 Region Overlap Repair
The following operations are dangerous because they modify the file system; proceed carefully!
Use hbck -details to view the detailed problems before doing any of the following. If a repair is needed, stop the application first: executing these commands while data operations are in flight can cause unexpected exceptions.
1.2.2.2.1 hbase hbck -fixHdfsOrphans
Repairs region directories in the file system that lack a metadata file (.regioninfo): it recreates the .regioninfo file and assigns the region to a RegionServer.
1.2.2.2.2 hbase hbck -fixHdfsOverlaps
There are two ways to repair regions with overlapping rowkey ranges:
1. merge: merge the overlapping regions into one large region.
2. sideline: remove the overlapping parts, writing the overlapping data to temporary files for later re-import.
If the amount of overlapping data is large, merging directly into one large region causes a large number of split and compact operations; this can be controlled with the following parameters:
-maxMerge <n>: the maximum number of overlapping regions to merge.
-sidelineBigOverlaps: if more than maxMerge regions overlap, handle the overlaps with the other regions by sidelining.
-maxOverlapsToSideline <n>: when sidelining overlapping regions, sideline at most n regions.
1.2.2.2.3 hbase hbck -repair
An abbreviation for:
hbase hbck -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
Table names can be specified:
hbase hbck -repair Table1 Table2
1.2.2.2.4 hbase hbck -fixMetaOnly -fixAssignments
If only the region assignments recorded in the meta table are inconsistent, use this command to fix them.
1.2.2.2.5 hbase hbck -fixVersionFile
The HBase data directory contains a version file; if it is missing, this command creates a new one. Make sure the hbck version is the same as the HBase cluster version.
1.2.2.2.6 hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
If both the root and meta tables are corrupted and HBase cannot start, this command rebuilds them.
It must be run with HBase shut down. On execution it loads the region information (.regioninfo files) from HBase's home directory on HDFS and, if the table information is complete, creates new root and meta directories and data.
1.2.2.2.7 hbase hbck -fixSplitParents
When a region splits, the parent region is automatically cleaned up. But sometimes a daughter region splits again before the parent has been cleaned up, leaving lingering offline parent regions that exist in the meta table and HDFS but are not deployed, and that HBase cannot purge. In that case, use this command to reset those regions in the meta table to online and not-split; afterwards, the repair commands above can be used to fix them.
1.3 Manually Merging Regions
Close the balancer before this operation and reopen it afterwards.
Over time the system may accumulate small regions; periodically check for them and merge them with adjacent regions to reduce the total number of regions and the management overhead.
Merge method:
1. Find the encoded names of the regions to merge.
2. Enter the HBase shell.
3. Execute merge_region 'region1', 'region2'
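Putting the steps together with the balancer handling from section 1.6, a sketch; the encoded region names below are hypothetical placeholders, so look up the real ones in the Master web UI first:

```shell
# Hypothetical encoded region names, for illustration only:
REGION_A='d6c45d9d9ee99e5bd3496cd0c9d6cae8'
REGION_B='54fca23d09a595bd3496cd0c9d6cae85'

# Build the shell commands: balancer off, merge, balancer on.
merge_cmds() {
  echo "balance_switch false"
  printf "merge_region '%s', '%s'\n" "$REGION_A" "$REGION_B"
  echo "balance_switch true"
}

# On a live cluster (requires the hbase CLI on PATH):
if command -v hbase >/dev/null 2>&1; then
  merge_cmds | hbase shell
fi
```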
1.4 Assigning region manually
If you find that the Regionserver resource is particularly high, you can check if the region on this regionserver has too much large region, and the hbase shell allocates some of the larger region to other not very busy regions Server
Move ' RegionID ', ' serverName '
Cases:
Move ' 54fca23d09a595bd3496cd0c9d6cae85 ', ' vmcnod05,60020,1390211132297 '
1.5 Manual major_compact
Close the balancer before this operation and reopen it afterwards.
Pick a relatively idle time to run major_compact manually. If HBase updates are not too frequent, a major_compact of all tables once a week is enough. After a major_compact, watch the total StoreFile count; when it grows to nearly twice the post-compaction count, major_compact all tables again. The operation takes a long time, so avoid peak hours.
NOTE: FMS now runs automatic major_compact in production, so manual major compaction is not needed.
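For reference, a manual run looks like the sketch below ('TestTable' is a placeholder name); count the StoreFiles on the RegionServer UI before and after, as described above:

```shell
# Build the major_compact shell command for one table.
compact_cmd() { printf "major_compact '%s'\n" "$1"; }

# On a live cluster (requires the hbase CLI on PATH):
if command -v hbase >/dev/null 2>&1; then
  compact_cmd 'TestTable' | hbase shell
fi
```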
1.6 balance_switch
balance_switch true   # opens the balancer
balance_switch false  # closes the balancer
The Master balances the number of regions across RegionServers. When we need to maintain or restart a RegionServer, we turn the balancer off, which can leave regions unevenly distributed; afterwards the balancer must be reopened manually.
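The usual maintenance pattern, as a sketch; note that balance_switch prints the previous balancer state, which is worth recording so it can be restored:

```shell
# Commands to wrap a maintenance window: balancer off, work, balancer on.
balancer_off() { echo "balance_switch false"; }
balancer_on()  { echo "balance_switch true"; }

# On a live cluster (requires the hbase CLI on PATH):
if command -v hbase >/dev/null 2>&1; then
  balancer_off | hbase shell    # output shows the previous balancer state
  # ... maintain or restart the RegionServer here ...
  balancer_on | hbase shell
fi
```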

1.7 RegionServer Restart
graceful_stop.sh --restart --reload --debug nodename
Close the balancer before this operation and reopen it afterwards.
This performs a rolling restart of the RegionServer process without affecting service: it first migrates all regions on the RegionServer to other servers, restarts the process, and finally migrates the regions back. When a configuration is changed, every machine can be restarted this way. Do not simply kill the RegionServer process: that causes an interruption as long as zookeeper.session.timeout. Do not restart via bin/hbase-daemon.sh stop regionserver either; with bad luck the -ROOT- or .META. table is on that server and all requests will fail.
1.8 RegionServer Decommission
bin/graceful_stop.sh nodename
Close the balancer before this operation and reopen it afterwards.
As above, the script migrates all regions off the server before stopping the process.
1.9 Flushing Tables
flush writes all MemStores to HDFS. If a RegionServer's memory usage is found to be too high, causing many of its threads to block, a flush can be performed; however, it increases HBase's StoreFile count, so avoid it where possible. There is one other use case: during an HBase migration done by copying files, stop writes first, then flush all tables before copying the files.
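A sketch of the pre-migration flush ('TestTable' is a placeholder; in practice you would repeat this for each table in the output of `list`):

```shell
# Build the flush shell command for one table.
flush_cmd() { printf "flush '%s'\n" "$1"; }

# On a live cluster (requires the hbase CLI on PATH), after stopping writes:
if command -v hbase >/dev/null 2>&1; then
  flush_cmd 'TestTable' | hbase shell
fi
```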
1.10 HBase Migration
1.10.1 CopyTable
bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=zookeeper1,zookeeper2,zookeeper3:/hbase 'TestTable'
This requires adding conf/mapred-site.xml to the HBase directory so that it can use Hadoop MapReduce for the copy.
1.10.2 Export/Import
bin/hbase org.apache.hadoop.hbase.mapreduce.Export TestTable /user/testtable [versions] [starttime] [stoptime]
bin/hbase org.apache.hadoop.hbase.mapreduce.Import TestTable /user/testtable
1.10.3 Directly Copying the Corresponding HDFS Files
First copy the HDFS files, e.g.: bin/hadoop distcp hdfs://srcnamenode:9000/hbase/testtable/ hdfs://distnamenode:9000/hbase/testtable/
Then, on the destination cluster, execute: bin/hbase org.jruby.Main bin/add_table.rb /hbase/testtable
After the meta information has been generated, restart HBase.
2 Hadoop Daily Operation and Maintenance
2.1 Monitoring Hadoop Health
1. NameNode and ResourceManager memory (the NameNode must have enough memory)
2. DataNode and NodeManager operating status
3. Disk usage
4. Server load status
2.2 Checking HDFS File Health
Command: hadoop fsck
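hadoop fsck ends its report with a summary line such as "The filesystem under path '/' is HEALTHY" (or CORRUPT). A sketch of a scripted daily check built on that line:

```shell
# Return success only if the fsck report says the filesystem is HEALTHY.
fsck_healthy() { grep -q 'is HEALTHY'; }

# On a live cluster (requires the hadoop CLI on PATH):
if command -v hadoop >/dev/null 2>&1; then
  if ! hadoop fsck / 2>/dev/null | fsck_healthy; then
    echo "HDFS reports corruption; see: hadoop fsck / -list-corruptfileblocks"
  fi
fi
```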
2.3 Enabling the Trash Function
The trash function is turned off by default. When it is enabled, deleted data is moved (mv) into a .Trash folder under the operating user's home directory; how long it is kept is configurable, and the system automatically deletes data that has expired. This way, when a delete was a mistake, the data can simply be moved back.
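Restoring after a mistaken delete then comes down to moving the path back out of the trash. A sketch, assuming the default trash layout /user/&lt;user&gt;/.Trash/Current/&lt;original path&gt; (the trash is enabled by setting fs.trash.interval to a positive number of minutes in core-site.xml); the paths below are illustrative:

```shell
# Compute where a deleted path lands in the user's trash directory.
# Assumes the default trash layout; snapshots under .Trash may also be
# timestamped directories rather than "Current" on some versions.
trash_path() { printf "/user/%s/.Trash/Current%s\n" "$1" "$2"; }

# On a live cluster (requires the hadoop CLI on PATH): move the data back.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mv "$(trash_path "$USER" /data/testtable)" /data/testtable
fi
```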
3 HBase Parameter Adjustment in This Project Scenario
