Our HBase cluster (CDH 4.6.0) recently ran into a troublesome problem, and I felt it was worth documenting the whole troubleshooting process.
Cause of the problem
A user's MapReduce job failed while reading files from HDFS and writing them into an HBase table (using the mapred integration that HBase provides). The problem showed up in environment A (a test environment) after Kerberos was enabled. After running both the user's program and a sample I wrote myself, I found that the job always dies on a NullPointerException. The NPE indicates that a server-side variable called currentKey is null.
[email protected], java.io.IOException: java.io.IOException: java.lang.NullPointerException
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.createPassword(AuthenticationTokenSecretManager.java:129)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.createPassword(AuthenticationTokenSecretManager.java:57)
	at org.apache.hadoop.security.token.Token.<init>(Token.java:70)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.generateToken(AuthenticationTokenSecretManager.java:162)
	at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:91)
	at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.regionserver.HRegion.exec(HRegion.java:5610)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.execCoprocessor(HRegionServer.java:3918)
	at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:311)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
AuthenticationTokenSecretManager:
@Override
protected byte[] createPassword(AuthenticationTokenIdentifier identifier) {
  long now = EnvironmentEdgeManager.currentTimeMillis();
  AuthenticationKey secretKey = currentKey; // currentKey is copied into secretKey; the NPE is thrown
                                            // below when secretKey.getKey() is called, i.e. currentKey is null
  identifier.setIssueDate(now);
  identifier.setExpirationDate(now + tokenMaxLifetime);
  identifier.setSequenceNumber(tokenSeq.getAndIncrement());
  return createPassword(WritableUtils.toByteArray(identifier), secretKey.getKey());
}
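For context, a job of this shape is typically wired up with TableMapReduceUtil, roughly as in the sketch below. This is a minimal sketch of mine, not the user's actual program; the mapper logic, table name, and paths are placeholders. The relevant point is that on a secure cluster, initTableReducerJob (via initCredentials, if I read the 0.94-era code correctly) also requests an HBase delegation token for the job, and that request is exactly the TokenProvider.getAuthenticationToken call path visible in the stack trace above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HdfsToHBase {
  // Turns each input line into a Put keyed by the line offset (illustrative only).
  static class LineToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(key.get());
      Put put = new Put(row);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("line"), Bytes.toBytes(value.toString()));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hdfs-to-hbase");
    job.setJarByClass(HdfsToHBase.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(LineToPutMapper.class);
    // With security on, this also obtains an HBase delegation token for the job --
    // the TokenProvider.getAuthenticationToken call that appears in the stack trace above.
    TableMapReduceUtil.initTableReducerJob(args[1], IdentityTableReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}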
Locating the problem
Since currentKey is null, the next step is to find out where it gets assigned. After reading the source code, I understood the whole process as follows:
1. Once Kerberos is enabled, every region server runs an AuthenticationTokenSecretManager to manage tokens.
2. Among these managers there is exactly one leader, and only the leader produces tokens and puts them into ZooKeeper. The other managers synchronize the leader-produced tokens by watching for ZooKeeper changes. The leader is chosen by competition: whoever creates the /hbase/tokenauth/keymaster node in ZK first becomes the leader.
AuthenticationTokenSecretManager$LeaderElector:
public void run() {
  zkLeader.start();
  zkLeader.waitToBecomeLeader(); // whoever fails to become leader blocks here until it senses the
                                 // current leader is dead, at which point a new election round starts
  isMaster = true;

  while (!stopped) {
    long now = EnvironmentEdgeManager.currentTimeMillis();

    // clear any expired
    removeExpiredKeys(); // remove expired tokens and delete them from ZK

    if (lastKeyUpdate + keyUpdateInterval < now) { // the default interval is 1 day
      // roll a new master key
      rollCurrentKey(); // generates a new token, replacing currentKey
    }

    try {
      Thread.sleep(5000);
    } catch (InterruptedException ie) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Interrupted waiting for next update", ie);
      }
    }
  }
}
AuthenticationTokenSecretManager:
synchronized void rollCurrentKey() {
  if (!leaderElector.isMaster()) {
    LOG.info("Skipping rollCurrentKey() because not running as master.");
    return;
  }
  long now = EnvironmentEdgeManager.currentTimeMillis();
  AuthenticationKey prev = currentKey;
  AuthenticationKey newKey = new AuthenticationKey(++idSeq,
      Long.MAX_VALUE, // don't allow to expire until it's replaced by a new key
      generateSecret());
  allKeys.put(newKey.getKeyId(), newKey);
  currentKey = newKey; // roll currentKey: point it at newKey
  zkWatcher.addKeyToZK(newKey); // put the new token into ZooKeeper
  lastKeyUpdate = now;

  if (prev != null) {
    // make sure previous key is still stored
    prev.setExpiration(now + tokenMaxLifetime); // prev is the previous newKey; it never expired on its
                                                // own, but once a new key replaces it, its remaining
                                                // lifetime defaults to 7 days
    allKeys.put(prev.getKeyId(), prev);
    zkWatcher.updateKeyInZK(prev);
  }
}
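To make the election in step 2 concrete, here is a toy sketch of mine of the "first to create the znode wins" pattern; the real implementation lives in HBase's ZKLeaderManager, and this simplified version ignores retries and connection loss.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ToyLeaderElection implements Watcher {
  // The same znode HBase uses; whoever creates it first is the leader.
  private static final String LEADER_ZNODE = "/hbase/tokenauth/keymaster";
  private final ZooKeeper zk;

  public ToyLeaderElection(ZooKeeper zk) {
    this.zk = zk;
  }

  // Returns true if we won the election by creating the ephemeral leader node first.
  public boolean tryToBecomeLeader(byte[] myServerName)
      throws KeeperException, InterruptedException {
    try {
      // EPHEMERAL: the znode disappears when the leader's session dies,
      // which is the signal for everyone else to start a new round.
      // (HBase passes CREATOR_ALL_ACL here; the sketch uses OPEN_ACL_UNSAFE so it
      // runs without SASL.)
      zk.create(LEADER_ZNODE, myServerName,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;
    } catch (KeeperException.NodeExistsException e) {
      // Someone beat us to it; set a watch so we hear about the leader's death.
      zk.exists(LEADER_ZNODE, this);
      return false;
    }
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDeleted) {
      // The leader is gone: a real implementation would call tryToBecomeLeader() again here.
    }
  }
}

Note that HBase creates its znodes with CREATOR_ALL_ACL, pinning the node to the creator's principal, which is exactly what this whole incident turns on.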
3. Since tokens are produced by the leader, nobody produces them if there is no leader. To verify this idea, I looked for evidence in ZooKeeper and in the logs some region servers still kept from the day the cluster started:
a) The /hbase/tokenauth/keymaster node in ZK stores the leader information. I went into zookeeper-client to take a look, and the node does not exist.
b) On the region servers that still kept logs from the day the cluster started, I found exceptions like the following:
org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase/tokenauth/keys
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:421)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:403)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1164)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1142)
	at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.start(ZKSecretWatcher.java:58)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.start(AuthenticationTokenSecretManager.java:105)
	at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.startThreads(SecureRpcEngine.java:275)
	at org.apache.hadoop.hbase.ipc.HBaseServer.start(HBaseServer.java:1650)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.startServiceThreads(HRegionServer.java:1728)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1105)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:753)
	at java.lang.Thread.run(Thread.java:662)
This is AuthenticationTokenSecretManager failing at startup. On startup it creates the /hbase/tokenauth/keys directory in ZK (it attempts the creation even if the directory already exists, as a safeguard); this directory stores the tokens generated by the leader. It turned out that nobody had permission on /hbase/tokenauth, so everybody failed. (The hint "NoAuth for /hbase/tokenauth/keys" is slightly misleading; it is actually the missing permission on /hbase/tokenauth that caused it.) Yet despite this serious error, server startup was not aborted; the servers kept running, leaving a hidden landmine.
AuthenticationTokenSecretManager:
public void start() {
  try {
    // populate any existing keys
    this.zkWatcher.start(); // the KeeperException is thrown here
    // try to become leader
    this.leaderElector.start(); // leader election starts here, but because of the exception this
                                // line is never reached, so nobody competes to become leader
  } catch (KeeperException ke) {
    LOG.error("Zookeeper initialization failed", ke); // on this exception only an error message is
                                // logged, with no abort; in many other places in HBase an error
                                // like this would abort the server
  }
}
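For comparison, a fail-fast variant might look like the sketch below; abortable is a hypothetical reference to the enclosing server, which in HBase implements the Abortable interface with abort(String, Throwable). Had start() aborted here, the misconfiguration would have surfaced on day 0 instead of lying dormant.

public void start() {
  try {
    // populate any existing keys
    this.zkWatcher.start();
    // try to become leader
    this.leaderElector.start();
  } catch (KeeperException ke) {
    // Fail fast instead of merely logging: without a leader, token creation
    // is silently broken for the whole cluster.
    abortable.abort("Zookeeper initialization failed", ke); // hypothetical Abortable reference
  }
}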
4. So the root cause is a permission problem on /hbase/tokenauth. I checked its ACL in zookeeper-client:
[zk: localhost:2181(CONNECTED) 0] getAcl /hbase/tokenauth
'sasl,'hbase/[email protected]
: cdrwa
Strangely, though, no matter which account I switched to, I could not access this node, and trying to set its permissions to anyone via setAcl also failed. The reason is obvious: since I am not "hbase/[email protected]", I have no permission whatsoever to operate on it.
Authentication is not valid : /hbase/tokenauth
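The failure is easy to reproduce with the plain ZooKeeper Java client: the node's only ACL entry grants cdrwa to that one SASL identity, and modifying an ACL itself requires the ADMIN permission on the node, so every other identity gets NoAuth. A minimal sketch (the connection string and timeout are illustrative):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class FixAcl {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
    try {
      // world:anyone gets cdrwa; -1 means "any ACL version"
      zk.setACL("/hbase/tokenauth", ZooDefs.Ids.OPEN_ACL_UNSAFE, -1);
    } catch (KeeperException.NoAuthException e) {
      // Thrown unless we authenticated (via SASL) as exactly that creator principal
      System.err.println("NoAuth: only the creator principal may change this ACL");
    }
  }
}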
But then why did the 4048 machine itself also fail to become leader (question [1])?
First fix (Day 0)
Having tried every way we could think of to regain control of /hbase/tokenauth, we had to temporarily add the parameter skipACL=true to the ZooKeeper configuration file zoo.cfg and restart ZooKeeper, so that ACLs would not be validated.
Restarting HBase triggered AuthenticationTokenSecretManager.start, everyone began competing to become leader, and so there was a leader again: the 4048 machine.
We then used the zookeeper-client setAcl command to change the node's permission to anyone, turned skipACL back off, and restarted ZooKeeper.
My colleagues carried out these operations; afterwards the cluster was completely normal and MapReduce jobs ran fine. But a hidden danger remained: I noticed that /hbase/tokenauth/keys was also owned exclusively by 4048, so if 4048 went down, nobody else could become leader. We figured the probability of it going down was fairly low and that we could deal with it if it ever happened, so we left it alone.
Problem 2 (Day 1)
At noon today the cluster suddenly collapsed: all of the region servers were down. I went up and checked the logs, and the cause was exactly the hidden danger I had worried about the day before: 4048 went down, and then the others, lacking permission, also died while competing for leader. But why did 4048 go down (question [2])? At the time I had not looked at 4048's log and did not know why it died; it just felt like a big coincidence.
Here are two log entries taken from the 4050 region server: it first became leader, and then, having no permission to maintain /hbase/tokenauth/keys, it failed when touching the keys under it and aborted. The other machines died for the same reason.
2015-08-25 14:35:08,273 DEBUG org.apache.hadoop.hbase.zookeeper.ZKLeaderManager: Claimed the leader znode as 'svr4050hw2285.hadoop.xxx.com,60020,1440397852179'
2015-08-25 14:35:08,288 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server svr4050hw2285.hadoop.xxx.com,60020,1440397852179: Unable to synchronize secret key 3 in zookeeper
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase/tokenauth/keys/3
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.updateExistingNodeData(ZKUtil.java:814)
	at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.updateKeyInZK(ZKSecretWatcher.java:197)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.rollCurrentKey(AuthenticationTokenSecretManager.java:257)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager$LeaderElector.run(AuthenticationTokenSecretManager.java:317)
Second fix (Day 1)
We added skipACL again, restarted ZK, and restarted HBase. For the time being we left skipACL enabled so that HBase could keep running normally.
Reflection (Day 2)
We cannot leave skipACL enabled forever; it is not exactly friendly to resource isolation. So I looked at the code in HBase's ZKUtil.java.
This is the function that builds the ACL when a znode is created. For a handful of specific nodes it uses the CREATOR_ALL_AND_WORLD_READABLE permission, and for everything else it uses CREATOR_ALL_ACL. With the former, the creator has all permissions and everyone else has read-only access; with the latter, only the creator has any permissions at all.
private static ArrayList<ACL> createACL(ZooKeeperWatcher zkw, String node) {
  if (isSecureZooKeeper(zkw.getConfiguration())) {
    // Certain znodes are accessed directly by the client,
    // so they must be readable by non-authenticated clients
    if ((node.equals(zkw.baseZNode) == true) ||
        (node.equals(zkw.rootServerZNode) == true) ||
        (node.equals(zkw.masterAddressZNode) == true) ||
        (node.equals(zkw.clusterIdZNode) == true) ||
        (node.equals(zkw.rsZNode) == true) ||
        (node.equals(zkw.backupMasterAddressesZNode) == true) ||
        (node.startsWith(zkw.assignmentZNode) == true) ||
        (node.startsWith(zkw.masterTableZNode) == true) ||
        (node.startsWith(zkw.masterTableZNode92) == true)) {
      return ZooKeeperWatcher.CREATOR_ALL_AND_WORLD_READABLE;
    }
    return Ids.CREATOR_ALL_ACL;
  } else {
    return Ids.OPEN_ACL_UNSAFE;
  }
}
/hbase/tokenauth and its child nodes clearly fall into the CREATOR_ALL_ACL case. So once 4048 had created the keys and then gone down, no other machine could possibly become leader. This permission scheme seems rather questionable.
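For reference, this is roughly what the two constants expand to; I am reproducing them from memory of ZooDefs.Ids and HBase's ZooKeeperWatcher, so treat the exact form as approximate.

import java.util.ArrayList;
import java.util.Arrays;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.data.ACL;

public class AclConstants {
  // CREATOR_ALL_ACL: only the authenticated creator can do anything (cdrwa).
  static final ArrayList<ACL> CREATOR_ALL_ACL = new ArrayList<ACL>(
      Arrays.asList(new ACL(ZooDefs.Perms.ALL, ZooDefs.Ids.AUTH_IDS)));

  // CREATOR_ALL_AND_WORLD_READABLE: creator gets everything, everyone else may only read.
  static final ArrayList<ACL> CREATOR_ALL_AND_WORLD_READABLE = new ArrayList<ACL>(
      Arrays.asList(new ACL(ZooDefs.Perms.READ, ZooDefs.Ids.ANYONE_ID_UNSAFE),
                    new ACL(ZooDefs.Perms.ALL, ZooDefs.Ids.AUTH_IDS)));
}

Under the sasl scheme, "the creator" is the fully resolved authenticated principal, which is why a principal that carries a hostname pins the znode to a single machine.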
Since environment B's permissions were normal and nothing had ever gone wrong there, I compared the permissions and configuration of A and B.
The ACL of a token produced by the leader in B:
[zk: localhost:2181(CONNECTED) 4] getAcl /hbase/tokenauth/keys/67
'sasl,'hbase
: cdrwa
The ACL of a token produced by the leader in A:
[zk: localhost:2181(CONNECTED) 1] getAcl /hbase/tokenauth/keys/2
'sasl,'hbase/[email protected]
: cdrwa
The former uniformly uses the plain hbase principal; the latter carries the hostname.
The problem must be here!
I also compared HBase's zk-jaas.conf in the two environments: no difference. This file configures the principal used to access ZK, and in both environments it carries the hostname.
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="/etc/hbase.keytab"
  principal="hbase/[email protected]";
};
So why does B's principal end up without the hostname? I compared the ZooKeeper configuration file zoo.cfg.
B has the following two lines set:
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true
And A does not have them.
I discussed this with a colleague, who told me that A did not have those two lines at the beginning; they were added later, back when A first enabled Kerberos and was having all sorts of problems. In a flash I understood: every mystery was solved.
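What those two lines do, in effect, is make the ZooKeeper server normalize every authenticated Kerberos principal before using it as the ACL identity. The real logic lives in ZooKeeper's SASL server callback handling; the sketch below is only my own illustration of the effect, with a made-up realm.

public class PrincipalNormalization {
  // Illustration only: the effective transformation applied to the client principal
  // when kerberos.removeHostFromPrincipal and kerberos.removeRealmFromPrincipal are true.
  static String normalize(String principal) {
    String noRealm = principal.split("@")[0]; // removeRealmFromPrincipal: drop "@XXX.COM"
    String noHost = noRealm.split("/")[0];    // removeHostFromPrincipal: drop "/svr4048hw2285..."
    return noHost;
  }

  public static void main(String[] args) {
    // prints "hbase" (realm here is hypothetical)
    System.out.println(normalize("hbase/svr4048hw2285.hadoop.xxx.com@XXX.COM"));
  }
}

With both set, every region server authenticates as plain hbase, so any znode one of them creates is accessible to all of them.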
Question [1]: why did the 4048 machine itself also fail to become leader?
When the cluster first enabled Kerberos, those two remove lines had not yet been added, so the ACLs on /hbase/tokenauth and /hbase/tokenauth/keys were exclusive to 4048's host principal. Later, because of some problem, the two lines were added and HBase was restarted. From then on everyone's principal became plain hbase (including 4048's), and nobody could access a directory exclusive to 4048's old principal. So nobody, 4048 included, could become leader.
Question [2]: why did 4048 go down?
Because in the first fix we only repaired /hbase/tokenauth and not /hbase/tokenauth/keys; its ACL still belonged entirely to 4048.
[zk: localhost:2181(CONNECTED) 0] getAcl /hbase/tokenauth/keys
'sasl,'hbase/[email protected]
: cdrwa
When we restarted HBase at that point, skipACL was still enabled, so the leader happily created tokens under /hbase/tokenauth/keys, the cluster started normally, and everything looked fine.
Then we turned skipACL off, and there still seemed to be no problem. So why did everything blow up exactly one day later?
Because the leader's default interval for rolling the token is exactly one day. The next day, when it tried to roll the key, it had no permission on /hbase/tokenauth/keys and aborted.
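If I remember the CDH4-era code correctly, both intervals come from the configuration used to construct AuthenticationTokenSecretManager; the key names and defaults below are from memory, so verify them against your version.

// From memory of the HBase security code that constructs AuthenticationTokenSecretManager:
long keyUpdateInterval = conf.getLong("hbase.auth.key.update.interval", 24 * 60 * 60 * 1000); // roll daily
long tokenMaxLifetime = conf.getLong("hbase.auth.token.max.lifetime", 7 * 24 * 60 * 60 * 1000); // tokens live 7 days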
And as in [1], because we had added those two remove lines, the node was inaccessible even though the leader was 4048 itself.
The evidence is also very easy to find.
This is the stat of the first newly written token; it was created around 14:30 on August 24th.
[zk: localhost:2181(CONNECTED) 3] stat /hbase/tokenauth/keys/3
cZxid = 0x1900000097
ctime = Mon Aug 24 14:30:48 CST 2015
mZxid = 0x1c000000e8
mtime = Tue Aug 25 15:35:36 CST 2015
pZxid = 0x1900000097
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 42
numChildren = 0
And this is the log of the leader going down, at about 14:30 the next day; the cluster also collapsed around 14:30, almost exactly 24 hours apart.
2015-08-25 14:33:01,515 FATAL org.apache.hadoop.hbase.security.token.ZKSecretWatcher: Unable to synchronize master key 4 to znode /hbase/tokenauth/keys/4
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase/tokenauth/keys/4
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:421)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:403)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1164)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createSetData(ZKUtil.java:868)
	at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.addKeyToZK(ZKSecretWatcher.java:180)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager.rollCurrentKey(AuthenticationTokenSecretManager.java:250)
	at org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager$LeaderElector.run(AuthenticationTokenSecretManager.java:317)
2015-08-25 14:33:01,516 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server svr4048hw2285.hadoop.xxx.com,60020,1440397852099: Unable to synchronize secret key 4 in zookeeper
The log shows it failed while writing the new token 4; token 3 was the one written the day before. Each new token's ID is one greater than the last, so it died at exactly the point where it tried to write token 4.
AuthenticationTokenSecretManager:
synchronized void rollCurrentKey() {
  if (!leaderElector.isMaster()) {
    LOG.info("Skipping rollCurrentKey() because not running as master.");
    return;
  }
  long now = EnvironmentEdgeManager.currentTimeMillis();
  AuthenticationKey prev = currentKey;
  AuthenticationKey newKey = new AuthenticationKey(++idSeq, // each new token ID is one greater than the last
      Long.MAX_VALUE, // don't allow to expire until it's replaced by a new key
      generateSecret());
  allKeys.put(newKey.getKeyId(), newKey);
  currentKey = newKey;
  zkWatcher.addKeyToZK(newKey); // tries to write the new token to ZK -- this is where it failed
  lastKeyUpdate = now;

  if (prev != null) {
    // make sure previous key is still stored
    prev.setExpiration(now + tokenMaxLifetime);
    allKeys.put(prev.getKeyId(), prev);
    zkWatcher.updateKeyInZK(prev);
  }
}
Third fix (Day 2)
We fixed all the problematic ACLs on ZK (set them to anyone), deleted the expired tokens (which nobody had been able to delete, again because of the missing permissions), turned skipACL off, and restarted ZK.
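For the record, the repair amounted to a zookeeper-client session roughly like the following (run while skipACL was temporarily enabled so the commands were permitted; key 3 stands in for whichever stale keys were present):

[zk: localhost:2181(CONNECTED) 0] setAcl /hbase/tokenauth world:anyone:cdrwa
[zk: localhost:2181(CONNECTED) 1] setAcl /hbase/tokenauth/keys world:anyone:cdrwa
[zk: localhost:2181(CONNECTED) 2] rmr /hbase/tokenauth/keys/3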
Since the remove configuration is now in place, the principals that the different region servers use to access ZooKeeper are identical, and no further permission issues should appear.
Postscript
To guarantee that the different region servers access ZooKeeper with the same principal, we have to add the remove configuration to zoo.cfg, which does not seem like a particularly sound practice.
From HBase's point of view, you cannot guarantee that ZooKeeper will have the remove configuration. If ZooKeeper is maintained by another team, won't they worry that adding such a configuration affects other applications?
After all, HBase here is the client and ZooKeeper is the server; couldn't we simply configure HBase with a single, unified client identity?
That is, a zk-jaas.conf like this:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/zkcli.keytab"
  storeKey=true
  useTicketCache=false
  principal="[email protected]<YOUR-REALM>";
};
Rather than this:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="/etc/hbase.keytab"
  principal="hbase/[email protected]";
};
That way, the principal would not carry the hostname at all.
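For completeness: HBase hands this JAAS file to its embedded ZooKeeper client through the standard JAAS system property, typically exported in hbase-env.sh; the path below is illustrative.

export HBASE_OPTS="$HBASE_OPTS -Djava.security.auth.login.config=/etc/hbase/conf/zk-jaas.conf"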
Appendix
Zookeeper Authentication
HBase as a MapReduce Job Data Source and Data Sink
Kerberos access to zookeeper ACL problem

Copyright notice: this is an original article of mine; please credit the source when reproducing it.