Hive as a data source for Mondrian


Our project's reporting system is built on the open-source Mondrian and Saiku, so I need to get familiar with this OLAP stack, and the first mountain to climb is Mondrian. Its previous developers warned me that Mondrian hides a lot of pitfalls, performance problems in particular. I ran into some of them while testing earlier, but I recorded nothing at the time, and after a month or two I had almost forgotten how I solved them.

What did leave a deep impression back then is how slow Mondrian can be. I have not read the source code to see how it executes queries; I only looked at execution results. The data source I used was the official FoodMart database; a previous article described how to generate this data and how to import it into Hive. I then set up a pseudo-distributed cluster on my test machine and used Mondrian to test both MySQL and Hive as data sources: an MDX query that MySQL answers in a few seconds took half an hour to finish on Hive. Watching Hive's job history showed that every SQL statement launches one or more MapReduce jobs. As for the cause: first, Mondrian generates a lot of SQL. From the MySQL query log I counted roughly 18 SQL statements (GROUP BY, JOIN, and so on) for a single MDX query. Since Mondrian (as far as I can tell) does not optimize for the underlying data source, the SQL generated for MySQL and for Hive should in theory be the same (I did not compare them), so these verbose statements slow down the whole query. Second, Hive is not a TPS-oriented database like MySQL; it is built for throughput, so each individual SQL query has a long response time. And since I observed that the generated SQL statements are executed sequentially, no wonder it is slow!

Slow as it is, making the function work at all comes first. I remember that while testing Hive earlier I had to modify the Hive JDBC source, mainly to patch a few JDBC interface methods that Hive does not implement but instead throws an exception from; Mondrian calls these methods, which stops the whole process. The changes themselves were fairly simple. Another issue is that back then Hive ran without any authentication mechanism, and so did Hadoop, while the company's Hadoop cluster that I must run on now requires Kerberos authentication. I am not familiar with this area and only know how to use it, so I also need to fill in some knowledge about Kerberos.
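One practical tip before moving on: Mondrian writes every SQL statement it issues to the log4j category mondrian.sql, so setting that category to DEBUG is an easy way to inspect and count the generated statements (the 18-statement figure above came from the MySQL query log instead).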
The preparation is already done: I have walked through using MySQL as the data source with Mondrian as the MDX query engine, recorded how the FoodMart data is generated by Mondrian and how to import it into Hive, and covered how to deploy HiveServer2 and connect to it over JDBC (see the preparation links at the end). Everything is ready except the east wind. First, since our HiveServer2 is Kerberos-authenticated and acts as a proxy, submitting MapReduce jobs and performing HDFS operations as the proxy user hive, the code must authenticate with Kerberos before anything else. The authentication goes like this:
    Configuration conf = new Configuration();
    conf.setBoolean("hadoop.security.authorization", true);
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    try {
        UserGroupInformation.loginUserFromKeytab(
            "intern/bigdata",
            "C:\\Users\\Administrator\\Desktop\\intern.keytab");
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
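A note on what this snippet assumes: Configuration is org.apache.hadoop.conf.Configuration and UserGroupInformation is org.apache.hadoop.security.UserGroupInformation, and for loginUserFromKeytab to succeed the JVM must also be able to locate a valid krb5.conf describing the realm and KDC (via the java.security.krb5.conf system property or the platform default location). The intern/bigdata principal and the keytab path are of course specific to my environment.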

After Kerberos initialization, the next step is to specify the JDBC driver for Hive, something that never seemed necessary with MySQL. Mondrian has no officially supported Hive version (in fact Mondrian supports any data source that exposes a JDBC interface), so the Hive driver has to be loaded at the beginning of the code:
    try {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
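As an aside, my reading of why MySQL never needed this step: JDBC 4.0 drivers announce themselves through a META-INF/services/java.sql.Driver file in the driver jar, so DriverManager picks them up automatically. Whether a given hive-jdbc build ships that service file varies by version, and the explicit Class.forName is harmless either way.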

Then take the olap4j connect string used for MySQL and replace only the JDBC URL in its Jdbc section with Hive's, which gives the following:
    Connection connection = DriverManager.getConnection(
        "Provider=mondrian;"
        + "Jdbc=jdbc:hive2://bitest0.server.163.org:10000/foodmart;principal=hive/[email protected];"
        + "Catalog=http://db-53.photo.163.org:16161/cube/queryxml/60548;"
        + "DynamicSchemaProcessor=mondrian.i18n.NrptLocalizingDynamicSchemaProcessor;"
        + "Locale=zh_CN",
        null);

With the connection set up, an MDX query can be executed. Note that the Catalog here is a URL rather than a file: in our project this URL serves the XML cube definition for download. As you can see, the Catalog parameter only needs to point at somewhere the XML schema file can be fetched from, whether that is a URL or a local file path. Then execute the following MDX query:
    Query query = connection.parseQuery(
        "SELECT NON EMPTY {Hierarchize({[Time].[Year].Members})} ON COLUMNS, "
        + "NON EMPTY {[Measures].[Total Sales]} ON ROWS FROM [TestFoodMart]");
    Result result = connection.execute(query);
    PrintWriter pw = new PrintWriter(System.out);
    result.print(pw);
    pw.flush();

This should have worked, but instead it threw an error:
    Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: Unsupported mechanism type PLAIN
        at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:288)
        at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
        at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
        ... more

The error says my authentication mechanism is wrong: the server uses Kerberos and the client does not. But I had clearly added the principal parameter to the URL! I then tried connecting to Hive directly over JDBC with a URL that omits the principal parameter, and sure enough got the same error, which means the principal was not taking effect. Why? Looking at the olap4j connect string, it happens to use ";" as its delimiter, and the HiveServer2 URL also uses ";" to separate the principal from the host:port in front of it. The HiveServer2 documentation confirms that Kerberos authentication requires exactly this URL form (host:port;principal=xxx). That is awkward: since Mondrian treats ";" as the delimiter, it takes only the part before the first ";" (the host:port) as the JDBC URL, so of course the connection fails. I also tried passing the settings through a Properties argument when creating the connection, but that failed too, because the Hive URL itself must keep principal and host:port joined by ";". So the only option left was to read the source and see how Mondrian splits the URL. The connection is created in the static getConnection function of the mondrian.olap.DriverManager class, which looks like this:
    static Connection getConnection(
        String connectString,
        CatalogLocator locator)
    {
        Util.PropertyList properties = Util.parseConnectString(connectString);
        return getConnection(properties, locator);
    }

    public static PropertyList parseConnectString(String s) {
        return new ConnectStringParser(s).parse();
    }

The URL is split in the Util.parseConnectString function, and the result is stored as key:value entries in a List<Pair>, where a Pair's first is the key and second is the value. As you would guess, parsing treats "=" as the separator between key and value, and ";" as the separator between key:value pairs. The ConnectStringParser class has a handful of functions: parsePair (parses one key:value pair, split into parsing the key and parsing the value), parseName (parses a pair's key), parseValue (parses a pair's value), and parseQuoted (handles special quoting). Since our whole URL is the value of the jdbc key, I looked at parseValue and found the following code:
    if (c == '"' || c == '\'') {
        String value = parseQuoted(c);
        // skip over trailing white space
        while (i < n && (c = s.charAt(i)) == ' ') {
            i++;
        }
        if (i >= n) {
            return value;
        } else if (s.charAt(i) == ';') {
            i++;
            return value;
        } else {
            throw new RuntimeException(
                "quoted value ended too soon, at position " + i
                + " in '" + s + "'");
        }
    }

So the characters '"' and '\'' get special handling: parseQuoted takes everything between the pair of quotes as the value. That means wrapping the jdbc parameter in quotes solves the delimiter problem. It is also a good reminder that whenever a string contains the delimiter character, it should be quoted into one complete token, otherwise it may break the parsing (if Mondrian had no special handling for quotes, this Hive problem would have been much harder to solve). The URL is modified as follows:
    Connection connection = DriverManager.getConnection(
        "Provider=mondrian;"
        + "Jdbc=\"jdbc:hive2://bitest0.server.163.org:10000/foodmart;principal=hive/[email protected]\";"
        + "Catalog=http://db-53.photo.163.org:16161/cube/queryxml/60548;"
        + "DynamicSchemaProcessor=mondrian.i18n.NrptLocalizingDynamicSchemaProcessor;"
        + "Locale=zh_CN",
        null);
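As a quick sanity check on this reading of the parser, both forms of the connect string can be fed to Util.parseConnectString directly. A minimal sketch (the host and principal are placeholders, and the printed results are what the parsing rules above imply):

    import mondrian.olap.Util;

    public class ConnectStringDemo {
        public static void main(String[] args) {
            // Unquoted: the first ";" inside the Hive URL ends the Jdbc value,
            // and "principal=..." becomes a separate pair that Mondrian ignores.
            Util.PropertyList unquoted = Util.parseConnectString(
                "Provider=mondrian;"
                + "Jdbc=jdbc:hive2://host:10000/foodmart;principal=hive/host@REALM;");
            System.out.println(unquoted.get("Jdbc"));
            // expected: jdbc:hive2://host:10000/foodmart

            // Quoted: parseQuoted consumes everything up to the closing quote,
            // so the principal stays inside the Jdbc value.
            Util.PropertyList quoted = Util.parseConnectString(
                "Provider=mondrian;"
                + "Jdbc=\"jdbc:hive2://host:10000/foodmart;principal=hive/host@REALM\";");
            System.out.println(quoted.get("Jdbc"));
            // expected: jdbc:hive2://host:10000/foodmart;principal=hive/host@REALM
        }
    }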

Trying again, the previous error no longer occurs, but a new one appears:
    Caused by: java.sql.SQLException: Method not supported
        at org.apache.hive.jdbc.HiveDatabaseMetaData.isReadOnly(HiveDatabaseMetaData.java:770)
        at org.apache.commons.dbcp.DelegatingDatabaseMetaData.isReadOnly(DelegatingDatabaseMetaData.java:679)
        at mondrian.spi.impl.JdbcDialectImpl.deduceReadOnly(JdbcDialectImpl.java:196)
        ... more

This shows that Hive's JDBC implementation does not support the isReadOnly function and simply throws an exception, while Mondrian calls it when creating the connection and does not catch the exception, so everything terminates. The fix is to modify the Hive JDBC source so the statement that throws the exception goes away. Download the Hive source, compile just the JDBC module, and change the isReadOnly function in jdbc/src/java/org/apache/hive/jdbc/HiveDatabaseMetaData.java so that instead of throwing an exception it returns false. Run again and the MDX query executes.
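The change amounts to something like the following (a sketch of the patched method as described above, not a verbatim diff of Hive 0.14):

    // jdbc/src/java/org/apache/hive/jdbc/HiveDatabaseMetaData.java
    public boolean isReadOnly() throws SQLException {
        // Originally: throw new SQLException("Method not supported");
        // Mondrian's JdbcDialectImpl.deduceReadOnly only needs an answer,
        // so reporting a writable connection is good enough here.
        return false;
    }

An alternative that avoids rebuilding Hive would be to wrap the DataSource and intercept DatabaseMetaData calls like this one, but replacing the jar was the smaller change here.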
After all this configuration and testing, Hive can finally be used as a Mondrian data source, so the functionality is achievable. Next comes adding this data source to our project, with the main effort going into optimizing MDX queries against it, for example pre-computation and caching.
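(Mondrian itself has hooks for part of this: aggregate tables declared in the schema can serve as pre-computed rollups, and Mondrian already caches segment data in memory across queries. How well either fits a Hive-backed deployment still needs to be evaluated.)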
Three problems came up in this attempt to connect Mondrian to the Hive database:
1. HiveServer2 configuration and Kerberos authentication, for which I had to learn how Kerberos works.
2. Connecting directly to HiveServer2 over JDBC worked, but the same JDBC URL failed inside Mondrian; the shared ";" delimiter blocked me for quite a while, and reading the source finally solved it. The lesson: whenever delimiters collide, make sure quoting is available to turn the value into one complete token.
3. The Hive JDBC source needed modification; that part was easy once the problem was found, just replace the jar package.
As for performance, I ran the same MDX query, "SELECT NON EMPTY {Hierarchize({[Time].[Year].Members})} ON COLUMNS, NON EMPTY {[Measures].[Total Sales]} ON ROWS FROM [TestFoodMart]", against the same data set in MySQL and in Hive, and both returned the same result:

    Axis #0:
    {}
    Axis #1:
    {[Time].[1998]}
    Axis #2:
    {[Measures].[Total Sales]}
    Row #0: 1,079,147

But MySQL took about 300 ms while Hive took about 150,000 ms, roughly a 500-fold difference.
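For reference, timings like these need nothing more elaborate than wall-clock measurement around the execute call. A minimal sketch (this is an assumption about methodology, since the actual measurement code isn't shown; mdx stands for the query string above):

    long start = System.currentTimeMillis();
    Result result = connection.execute(connection.parseQuery(mdx));
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.println("MDX query took " + elapsedMs + " ms");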
Preparation:
- Build the Mondrian test database FoodMart and import it into Hive
- Use MySQL as the Mondrian data source
- Deploy the 0.14.0 version of the HiveServer2 server

