Scenarios that cause HBase to hang
HMaster
The scenarios in which HMaster stops abnormally (executes abort()) are as follows:
1. A ZK exception causing the master to stop serving is the most common scenario; the operations involved include, but are not limited to, the following:
a) The ZK session times out. The timeout is configured via zookeeper.session.timeout and defaults to 3 minutes. If fail.fast.expired.active.master is set to false (the default), the master does not abort immediately but first tries to recover the expired ZK session;
b) After a region is opened, its OPENED node must be deleted from ZK; the node exists in ZK but the deletion fails;
c) During a region split, deleting the split node from ZK fails;
d) The master node in ZK changes;
e) Creating an unassigned node in ZK fails;
f) While taking a disabled region offline, a ZK exception occurs when removing the disabled region from ZK;
g) Many other operations that manipulate ZK nodes abort when an exception occurs.
2. During assign, the region is to be set to the OFFLINED state but its previous state is neither CLOSED nor OFFLINED;
3. Region information cannot be read from the .META. table during assignment;
4. When a newly started HBase cluster joins an already running HBase cluster, the /hbase/unassigned node in ZK has no data;
5. When a thread pool is used to bulk-assign regions and an uncaught exception occurs (see the sketch after this list);
6. An exception occurs while starting the master's service threads;
7. When checking the HBase log path in HDFS (when a dead server is found, its logs are read from HDFS): if an IOException occurs, the HDFS file system must be checked; if the fsOk flag is true and an IOException also occurs while checking through the FSUtils utility class, the master aborts;
8. When verifying and assigning the -ROOT- region, a ZK exception occurs, or another exception occurs (other exceptions are retried 10 times), for example: "-ROOT- is onlined on the dead server".
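The following is a minimal, hypothetical Java sketch of the pattern behind item 5: a thread pool whose worker threads carry an UncaughtExceptionHandler that aborts the process, similar in spirit to the master's bulk-assign pool. The class and thread names are made up for illustration and this is not the actual master code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical sketch: worker threads abort the process on any uncaught exception.
public class BulkAssignSketch {
  public static void main(String[] args) {
    Thread.UncaughtExceptionHandler handler = (t, e) -> {
      // The real master would call abort() here; the sketch just logs and exits.
      System.err.println("Uncaught exception in " + t.getName() + ": " + e);
      System.exit(1);
    };
    ThreadFactory factory = r -> {
      Thread worker = new Thread(r, "bulk-assign-worker");
      worker.setUncaughtExceptionHandler(handler);
      return worker;
    };
    ExecutorService pool = Executors.newFixedThreadPool(4, factory);
    // Use execute(): submit() would capture the exception in a Future and it
    // would never reach the UncaughtExceptionHandler.
    pool.execute(() -> { throw new IllegalStateException("assign failed"); });
    pool.shutdown();
  }
}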
HRegionServer
The scenarios in which HRegionServer stops serving abnormally (executes abort()) are as follows:
1. An IOException occurs while reading from or writing to HDFS; an HDFS file system check (checkFileSystem) is initiated at this point (see the sketch after this list);
2. An uncaught exception occurs in a RegionServer service thread;
3. An exception occurs while starting HRegionServer;
4. An exception occurs while rolling the HLog;
5. When flushing the MemStore, if persisting it fails, the RS is restarted; during the restart the contents of the HLog are replayed into the MemStore;
6. ZK exceptions, including but not limited to the following scenarios:
a) The ZK session times out. The timeout is configured via zookeeper.session.timeout and defaults to 3 minutes; unlike the master, the RS does not retry;
b) A KeeperException occurs while starting HRegionServer;
c) During a split, an exception triggers a rollback; the rollback must remove the region's SPLITTING state from ZK, and either that deletion throws a KeeperException or another rollback step throws an exception;
d) A KeeperException occurs while opening a region;
e) With HBase cluster replication, many operations that interact with ZK cause an abort when a KeeperException occurs;
7. While closing a region, an exception occurs, for example the MemStore flush cannot succeed;
8. When flushing the MemStore, if the HLog finds that the region has already been flushed, the JVM is forcibly terminated with Runtime.getRuntime().halt(1). This method does not run the shutdown hooks that perform a graceful exit, so the RS does not flush all of its regions and its regions are not migrated; the master only discovers that the RS is unavailable after the ZK session times out, and performs the migration then.
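As a rough illustration of item 1, the sketch below shows the idea behind a checkFileSystem-style probe using the plain Hadoop FileSystem API. It is a simplified assumption about the mechanism, not the actual HRegionServer code, and the class, field, and method names are invented.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified sketch of the "check HDFS and abort if it is gone" idea.
public class CheckFileSystemSketch {
  private volatile boolean fsOk = true;

  void checkFileSystem(Configuration conf) {
    if (!fsOk) {
      return; // already marked bad; an abort is already underway
    }
    try {
      FileSystem fs = FileSystem.get(conf);
      fs.exists(new Path("/")); // cheap probe against the NameNode
    } catch (IOException e) {
      fsOk = false;
      System.err.println("HDFS unavailable, aborting: " + e);
      System.exit(1); // the real RS calls abort() instead
    }
  }
}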
Summary
There are many ways HBase can hang, mostly because of problems with ZK or HDFS, so the availability of ZK and HDFS is extremely important to HBase. Regarding ZK:
1. If the ZK service stops, in many cases the master and RS will hang and the HBase cluster essentially loses its ability to serve, so ZK must be stable and reliable. If a client has already established a connection to an RS and ZK then goes down, the RS can still provide service as long as it does not perform operations such as split whose failed ZK interaction would cause it to abort();
2. If the RS/master goes through a long GC pause or the server time is changed, the resulting ZK session timeout causes the RS/master to stop serving; there have already been two incidents in which changing the server time caused HBase to stop serving (a configuration sketch follows this list);
3. Do not casually modify the HBase node data in ZK by hand; many master/RS operations depend heavily on the ZK data, and if it does not meet expectations the master/RS may stop serving, especially the master.
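As a small illustration of point 2, the sketch below shows where the session timeout lives on the configuration side. The 3-minute value simply mirrors the default mentioned earlier, and the class name is made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// The timeout must comfortably exceed any expected GC pause, otherwise the
// master/RS session expires and the server stops, as described above.
public class SessionTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("zookeeper.session.timeout", 180000); // 3 minutes, in ms
    System.out.println("zookeeper.session.timeout = "
        + conf.getInt("zookeeper.session.timeout", -1) + " ms");
  }
}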
The master learns whether an RS is available through ZK. Normally an RS exits cleanly when its service stops; on a clean exit the /hbase/rs/$regionserver node is removed from ZK, the master is notified of the deletion, and it quickly (how quickly depends on how long it takes to close all of the regions) reassigns the regions the RS was responsible for. If the RS exits forcibly, for example via kill -9 or HRegionServer scenario 8 above, the RS node in ZK is removed only after the ZK session times out (the RS creates its node with CreateMode.EPHEMERAL, and a node created in this mode is deleted automatically when the session closes); until then the master does not reassign.
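A minimal sketch of the ephemeral-node mechanism described above, using the standard ZooKeeper client; the connection string and node path are placeholders rather than the real cluster layout.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection string and session timeout.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 180000, event -> { });
    // An EPHEMERAL node lives only as long as this session does.
    zk.create("/hbase/rs/example-regionserver", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // Clean exit: closing the session deletes the node immediately, so the
    // master notices right away. After kill -9 the node lingers until the
    // session times out.
    zk.close();
  }
}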
Killing the RS process is also a clean exit (as long as kill -9 is not used to force it): the RS registers a JVM shutdown hook with Runtime's addShutdownHook method, and the hook runs the RS's exit logic. In fact, hbase-daemon.sh stops the RS with kill.
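A small sketch of this shutdown-hook behavior: the hook runs on a normal kill (SIGTERM) or System.exit(), but not after kill -9 or Runtime.getRuntime().halt(1). The class name and messages are illustrative only.

public class ShutdownHookSketch {
  public static void main(String[] args) throws InterruptedException {
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        // The real RS hook closes its regions and removes its ZK node here.
        System.out.println("shutdown hook: running exit logic")));
    System.out.println("running; send `kill <pid>` to see the hook fire");
    Thread.sleep(60_000);
    // Runtime.getRuntime().halt(1); // would terminate without running the hook
  }
}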