How HiveServer2 proxy execution works

Source: Internet
Author: User
Tags: auth, file permissions, hadoop fs, kinit
Background

For a data platform's SQL query service, Impala offers query performance superior to the batch of SQL-on-MR/Spark engines such as Hive and Spark SQL. Because Impala can directly share the Hive Metastore, it lets us provide users with a "one set of data, multiple engines" service, and we currently plan to integrate the Hive, Spark, and Impala SQL engines into the data platform. As we all know, Hive beats the other two in stability and maturity, and in practice both Spark and Impala always need certain modifications to support Hive's characteristics. Among those characteristics, the one most useful to a platform service is proxy (impersonated) execution.

HiveServer2 proxy access allows the platform side to perform SQL operations as any user, exactly as that user would perform them (just as an ordinary user runs operations directly with the Hive CLI). This article explores how HiveServer2 uses proxying to let different users complete their SQL operations, paving the way for modifying Impala to support the same behavior.

The implementation of HiveServer2

When starting HiveServer2, we usually need a hive.keytab user that is configured as a proxy user on Hadoop. Concretely, add the hadoop.proxyuser.hive.hosts=XXX and hadoop.proxyuser.hive.groups=XXX entries to the NameNode's core-site.xml, which declares that the hive user may impersonate users belonging to the specified groups when connecting from the specified hosts. Besides configuring HiveServer2's principal and keytab, you also need to set the hive.server2.enable.doAs parameter to true (its default value is already true), which means that for a user's operations, HiveServer2 accesses HDFS and submits MR jobs as a proxy of that user.
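For reference, the proxy-user settings described above look roughly like this; the host and group values below are illustrative placeholders, not values taken from this article's cluster:

```xml
<!-- NameNode core-site.xml: allow the hive user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value> <!-- or the specific hosts HiveServer2 runs on -->
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value> <!-- or the specific groups whose members may be impersonated -->
</property>

<!-- hive-site.xml: run each query as the connected user (default is true) -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
```

The NameNode must be told to reload these settings (or be restarted) before the hive user's impersonation takes effect.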

With HiveServer2 configured as a proxy in this way, let us look at what the feature can do. In this setup, hive.keytab has been granted proxy capability, and all the other users are ordinary users. When Beeline connects to HiveServer2, Beeline as the client authenticates with the Kerberos user found in the current user's Kerberos ticket cache; for example, the user on the current machine is nrpt:

> klist
Ticket cache: FILE:/tmp/krb5cc_50997
Default principal: nrpt/dev@HADOOP.HZ.NETEASE.COM

Valid starting     Expires            Service principal
10/02/2017 09:30   11/02/2017 07:30   krbtgt/HADOOP.HZ.NETEASE.COM@HADOOP.HZ.NETEASE.COM
        renew until 11/02/2017 09:30

Then connect to HiveServer2 and execute a query:

> beeline -u "jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM"
Connecting to jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1 by Apache Hive
0: jdbc:hive2://db-53.photo.163.org:10000/def> select count(1) from foodmart.sales_fact_1997;
+--------+--+
|  _c0   |
+--------+--+
| 86837  |
+--------+--+
1 row selected (37.497 seconds)

The connection succeeds. Executing this SQL query requires submitting an MR job, and through the Hadoop task management interface you can see that the job was submitted by the user nrpt, that is, by the impersonated user.

Besides job submission, accessing Hive involves many HDFS operations. To verify whether those operations are also performed under the impersonated account, the following SQL uses CREATE TABLE ... AS SELECT to create a new table and write the query results into it:

0: jdbc:hive2://db-53.photo.163.org:10000/def> create table test_nrpt as select * from foodmart.sales_fact_1997;
No rows affected (28.992 seconds)

Viewing the file permissions on HDFS shows that the table's data was indeed written by the impersonated user (nrpt):

> hadoop fs -ls hdfs://hz-cluster2/user/nrpt/hive-server/test_nrpt
Found 1 items
-rw-r--r--   3 nrpt hdfs    2971680 2017-02-10 10:25 hdfs://hz-cluster2/user/nrpt/hive-server/test_nrpt/000000_0
> hadoop fs -ls hdfs://hz-cluster2/user/nrpt/hive-server/ | grep test_nrpt
drwxr-xr-x   - nrpt hdfs          0 2017-02-10 10:25 hdfs://hz-cluster2/user/nrpt/hive-server/test_nrpt

After all these demonstrations, why is this feature so important? A data platform typically has to let users of different products execute SQL, and different products usually use different Kerberos users (to keep each product's data secure). With HiveServer2's proxy feature, different users can share the same HiveServer2 while their data and permissions stay isolated from one another.

Client-side proxying

Can we stop here? Not quite. A data platform's ambition is rarely limited to providing a HiveServer2 for users to connect to; it usually wraps everything into a query window (such as Mammoth), where the user does not need to care where HiveServer2 runs. In that case the platform side must create the connection to HiveServer2 and then execute the query the user typed in, and most importantly, it must execute the query as that user. As we saw earlier, executing a query as user A requires the currently authenticated Kerberos user to be A, and it is impractical for the platform side to keep every user's keytab and switch identities whenever it performs a different user's operations.

If that were really required, this approach would be too unwieldy to use. But since the hive user can impersonate any user inside HiveServer2 to execute queries, can the client not likewise impersonate any user through the hive identity to execute SQL? The proxy-execution logic we would need looks like this:

UserGroupInformation ugi = UserGroupInformation.createProxyUser(proxyUser, UserGroupInformation.getLoginUser());
System.out.println("Current Kerberos User: " + ugi);
ugi.doAs(new PrivilegedExceptionAction<Void>() {
    public Void run() throws Exception {
        // do something as user proxyUser
        return null;
    }
});

The operations performed inside doAs run as the proxyUser: typically submitting MR jobs, accessing HDFS, and so on; HiveServer2's own implementation presumably works in a similar way. There is of course a precondition: getLoginUser() (usually the user authenticated via kinit, or the user the program authenticated with Kerberos) must have proxy permissions, because if any ordinary user could call createProxyUser, everyone would become a super account. We write the run method to create a Hive connection and execute a query:

public Void run() throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM");
    Statement statement = conn.createStatement();
    statement.execute("select count(1) from foodmart.sales_fact_1997");
    return null;
}

Execution did not go as smoothly as expected; a Kerberos authentication error occurred:

Current Kerberos User: nrpt (auth:PROXY) via hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM (auth:KERBEROS)
17/02/10 10:56:31 INFO jdbc.Utils: Supplied authorities: db-53.photo.163.org:10000
17/02/10 10:56:31 INFO jdbc.Utils: Resolved authority: db-53.photo.163.org:10000
17/02/10 10:56:31 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos TGT)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)

So proxying purely on the client side does not work. At this point a HiveServer2 parameter that is very important to the platform side comes into play: hive.server2.proxy.user. Think about it from HiveServer2's perspective: it needs to know the client's username, and the most direct way is to use the currently authenticated user, i.e. the Kerberos-authenticated user of the connection. For proxy execution, if the Kerberos user of the current connection is a proxy-capable account, the real user to impersonate can be passed through the hive.server2.proxy.user parameter, so no UGI tricks are needed at all. The precondition for using this parameter is still that the current Kerberos user has proxy permissions; if an ordinary user creates a connection with this parameter, the following error occurs:

> beeline -u "jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM;hive.server2.proxy.user=da"
Connecting to jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM;hive.server2.proxy.user=da
Error: Failed to validate proxy privilege of nrpt for da (state=08S01,code=0)
Beeline version 1.2.1 by Apache Hive
0: jdbc:hive2://db-53.photo.163.org:10000/def (closed)>

If the current Kerberos account is a proxy-capable user, the connection succeeds and SQL can be executed as the impersonated user:

> kinit -kt hive.keytab hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM
> beeline -u "jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM;hive.server2.proxy.user=nrpt"
Connecting to jdbc:hive2://db-53.photo.163.org:10000/default;principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM;hive.server2.proxy.user=nrpt
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1 by Apache Hive
0: jdbc:hive2://db-53.photo.163.org:10000/def>

By this point, how the platform side can use HiveServer2 to execute queries as an arbitrary user is fairly clear. In summary: HiveServer2 enables the doAs option, and the platform side executes queries by adding the hive.server2.proxy.user parameter, without holding any user's keytab. One problem remains, however: the connection URL differs per user, so the server must either maintain a connection pool per user or, more crudely, create a new connection for every query and destroy it when the query completes.
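A minimal sketch of the per-user URL construction this implies; the helper class and method names are illustrative, not part of any Hive API. The platform authenticates once as the proxy-capable hive principal, and only the URL varies per end user:

```java
// Sketch: build a per-user HiveServer2 JDBC URL by appending the
// hive.server2.proxy.user session parameter to a shared base URL.
public class ProxyUrls {
    // baseUrl is the ordinary Kerberos JDBC URL (with the server principal);
    // user is the end user the platform wants to impersonate.
    static String proxyUrl(String baseUrl, String user) {
        return baseUrl + ";hive.server2.proxy.user=" + user;
    }

    public static void main(String[] args) {
        String base = "jdbc:hive2://db-53.photo.163.org:10000/default"
            + ";principal=hive/app-20.photo.163.org@HADOOP.HZ.NETEASE.COM";
        // One URL per end user; connections built from these URLs run as that user.
        System.out.println(proxyUrl(base, "nrpt"));
        System.out.println(proxyUrl(base, "da"));
    }
}
```

Whether the platform caches one connection (or a small pool) per such URL, or creates and destroys a connection per query, is the trade-off discussed above.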

The above discussed how HiveServer2 operates in proxy mode. Spark and Impala do not have such complete functionality; a follow-up will examine in detail how Impala works and how to modify it so that it achieves HiveServer2-like behavior.
