Spark SQL Metadata Dual-Master High Availability Solution Revealed




Source | ZTE Big Data


"

Open-source Spark SQL does not support high availability, but in practical applications high availability is of great significance to users. The DAP big data platform implements high availability for Spark SQL in the corresponding ZDH.

"


Spark SQL high availability works as follows: when the two Spark SQL services start, they register themselves with Zookeeper. The JDBC URL used by clients specifies the Zookeeper list; on connection, the client obtains the Spark SQL node information from the Zookeeper cluster and then connects to that Spark SQL service node.

Spark SQL metadata dual-master is implemented mainly with MySQL. MySQL supports one-way, asynchronous replication, in which one server acts as the master and one or more other servers act as slaves. The master writes updates to its binary log files and maintains an index of those files to track log rotation. Each slave receives the updates recorded since the position it last read, then blocks and waits for the master to notify it of new updates.

In actual projects, MySQL databases are installed on two hosts located in different regions, and the two servers act as master and standby for each other. When one machine fails, the other must be able to take over its applications, which requires the data in the two databases to be consistent in real time. The MySQL replication function is used here to achieve synchronous replication between the two machines.
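As a sketch, dual-master replication between the two hosts can be configured by making each MySQL instance a slave of the other. The replication account, log file name, and position below are placeholders, and the auto-increment settings (to avoid key collisions between the two masters) are an assumption, not taken from the original article:

```sql
-- my.cnf on host A: server-id=1, log-bin=mysql-bin,
--   auto_increment_increment=2, auto_increment_offset=1
-- my.cnf on host B: server-id=2, log-bin=mysql-bin,
--   auto_increment_increment=2, auto_increment_offset=2

-- On host A, replicate from host B (repl account and log coordinates are placeholders):
CHANGE MASTER TO
  MASTER_HOST='10.43.183.122',
  MASTER_USER='repl',
  MASTER_PASSWORD='***',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;

-- On host B, run the symmetric statement with MASTER_HOST='10.43.183.121'.
```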


Implementation Scheme


Currently, SparkSQL nodes have two main solutions for accessing the metadata database:

  • The SparkSQL node directly connects to the MySQL node:

Each SparkSQL node connects directly to a single MySQL node. Changes made to the metadata database by different SparkSQL nodes are synchronized between the MySQL nodes.

  • The SparkSQL node connects to the metabase through the MetaStore node:

Each SparkSQL node connects to multiple MetaStore nodes, and each MetaStore node connects to its corresponding MySQL node. Changes made to the metadata database by different SparkSQL nodes are synchronized between the MySQL nodes.

In both metadata database solutions, the client obtains the SparkSQL service in the same way, mainly through the following methods:

  1. Beeline connection

  2. Program access through JDBC Port

The Beeline method first obtains SparkSQL node information through the Zookeeper cluster and then connects to the SparkSQL service node. If the connected SparkSQL node becomes abnormal, you can retry several times to obtain an available SparkSQL service.
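For example, the Beeline invocation might look like the following sketch. The Zookeeper hosts, port 2181, and the sparkThriftServer namespace are taken from the test steps below; the mr user is illustrative:

```shell
beeline -u "jdbc:hive2://10.43.183.121:2181,10.43.183.122:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkThriftServer" -n mr
```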

A program connects to a specific SparkSQL node through its JDBC port. If an exception occurs on the connected SparkSQL node, the program can re-obtain the SparkSQL service by catching the exception in code.

The following describes the functional feasibility and exceptions of the two solutions.


Test Environment


MySQL: 10.43.183.121 and 10.43.183.122 hosts

SparkSQL: 10.43.183.121 and 10.43.183.122 hosts

Hive MetaStoreServer: 10.43.183.121 and 10.43.183.122


Test scenario


  • Scenario 1: SparkSQL nodes directly connect to MySQL for High Availability Verification

Each SparkSQL node connects directly to a MySQL node. Verify whether the metadata can be synchronized successfully and whether the MySQL node can automatically switch if it fails.

The test procedure is as follows:

1. Modify the configuration.

The SparkSQL configuration is modified as follows:

The JDBC connection of node 10.43.183.121 is configured to point to MySQL on 10.43.183.121.

The JDBC connection of node 10.43.183.122 is configured to point to MySQL on 10.43.183.122.
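Concretely, the per-node hive-site.xml entry might look like the following sketch. The hiveomm database name comes from the test steps below; the MySQL port 3306 and the createDatabaseIfNotExist option are assumptions:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://10.43.183.121:3306/hiveomm?createDatabaseIfNotExist=true</value>
</property>
```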

2. Beeline connects to SparkSQL of node 10.43.183.121.

3. Create the test table, then look for it in the hiveomm database of each of the two MySQL databases. The test record is visible in both, so the metadata is synchronized successfully.

4. Stop the MySQL currently connected to SparkSQL.

5. Run the "show tables" command on the Beeline interface. The query fails with an exception.

6. Disconnect Beeline and try reconnecting to SparkSQL on 10.43.183.121 multiple times. The connection fails.

7. Use a SQL URL to connect to the SparkSQL service: !connect jdbc:hive2://10.43.183.121:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkThriftServer. After several attempts, the connection to SparkSQL succeeds, and the test table can be checked with the "show tables" command.

8. Start the MySQL node again. Beeline can then reconnect to node 10.43.183.121 and connect to the SparkSQL node; the "show tables" command returns the test table information.

Test conclusion:

  1. Metadata between MySQL databases can be synchronized.

  2. If the MySQL node fails, queries through Beeline fail.

  3. After disconnecting, Beeline cannot reconnect to the corresponding SparkSQL node.

  4. Beeline connects to the SparkSQL service through an SQL URL and can connect to available SparkSQL nodes after a certain number of attempts.

  • Scenario 2: SparkSQL node uses HiveMetaStoreServer to connect to MySQL for High Availability Verification

MetaStoreServer nodes are mainly used for fault tolerance when a MySQL node fails. Each MetaStoreServer node corresponds to a MySQL node, and multiple MetaStoreServer nodes are configured for each SparkSQL node. Verify whether the metadata can be synchronized successfully and whether the MySQL node can automatically switch if it fails.

The test procedure is as follows:

1. Modify the configuration.

Modify the MetaStoreServer node configuration as follows:

The MetaStoreServer on node 10.43.183.121 is configured to use MySQL on node 10.43.183.121.

The MetaStoreServer on node 10.43.183.122 is configured to use MySQL on node 10.43.183.122.

Add the corresponding configuration in SparkSQL:
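A plausible sketch of this SparkSQL configuration follows; Hive's default MetaStore port 9083 is an assumption, and the thrift URI format follows standard Hive usage:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://10.43.183.121:9083,thrift://10.43.183.122:9083</value>
</property>
```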

2. Beeline connects to SparkSQL of node 10.43.183.121.

3. Create the test table, then look for it in the hiveomm database of each of the two MySQL databases. The test record is visible in both, indicating that the metadata synchronization is successful.

4. Stop the MySQL instance corresponding to the MetaStoreServer that SparkSQL is currently connected to (as shown in the log).

5. Run the "show tables" command on the Beeline interface. The query fails with an exception.

6. Disconnect Beeline and try reconnecting to SparkSQL on 10.43.183.121 multiple times. The connection fails.

7. Use a SQL URL to connect to the SparkSQL service: !connect jdbc:hive2://10.43.183.121:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkThriftServer. The connection to SparkSQL cannot be established, so the test table cannot be checked with the "show tables" command.

8. Restart the MySQL node. Beeline can then connect to node 10.43.183.121 and reach the SparkSQL node; the "show tables" command returns the test table information.

Test conclusion:

  1. Metadata between MySQL databases can be synchronized.

  2. If the MySQL node fails, queries through Beeline fail.

  3. When the MySQL node fails, Beeline reconnects to the SparkSQL node and cannot connect to the corresponding SparkSQL node.

  4. When the MySQL node fails, Beeline connects to SparkSQL through the SQL URL and cannot connect to the corresponding SparkSQL service.

  • Scenario 3: Connect to the SparkSQL service through the JDBC interface in code

The service configuration is the same as in Scenario 1. Connect to the SparkSQL node through JDBC; the sample code is as follows:


import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SparkSqlJdbcTest
{
    public static void main(String[] args)
    {
        int num = 0;
        Connection conn = null;
        ResultSet rs = null;
        PreparedStatement pstat = null;
        // Try up to three times; on failure, fall back to the standby SparkSQL node.
        while (num < 3)
        {
            num = num + 1;
            try
            {
                Class.forName("org.apache.hive.jdbc.HiveDriver");
                if (conn == null)
                {
                    if (args.length > 0)
                        conn = DriverManager.getConnection(args[0], "mr", "mr");
                    else
                        conn = DriverManager
                                .getConnection("jdbc:hive2://10.43.183.121:18000", "mr", "mr");
                }
                pstat = conn.prepareStatement("select * from test");
                rs = pstat.executeQuery();
                while (rs.next())
                {
                    System.out.println(rs.getString("name"));
                }
                break;
            }
            catch (Exception e)
            {
                e.printStackTrace();
                // The current node failed: reconnect to the other SparkSQL node.
                try
                {
                    conn = DriverManager
                            .getConnection("jdbc:hive2://10.43.183.122:18000", "mr", "mr");
                }
                catch (SQLException e1)
                {
                    e1.printStackTrace();
                }
            }
        }
        // Release resources after the query succeeds or all retries fail.
        try
        {
            if (rs != null) rs.close();
            if (pstat != null) pstat.close();
            if (conn != null) conn.close();
        }
        catch (SQLException e)
        {
            e.printStackTrace();
        }
    }
}


Test conclusion:

  1. You can use JDBC to connect to a SparkSQL node and, by catching exceptions, re-obtain the SparkSQL service after a MySQL node fails.


Test conclusion


Both solutions allow SparkSQL to access the metadata database and keep it synchronized, and they behave in much the same way when an exception occurs in the metadata database.

However, the second solution requires SparkSQL to connect through Hive's MetaStore nodes. Compared with connecting to MySQL directly, this adds extra MetaStore processes that must be maintained, a more complex metadata management path, and potential version compatibility issues.

Therefore, we recommend that you use the MySQL dual-master mode to achieve high availability of SparkSQL. Different SparkSQL nodes are directly connected to different MySQL databases.

You can access the SparkSQL service in two ways:

  1. Beeline connects through the SQL URL. In case of MySQL exceptions, you can reconnect to continue to obtain the SparkSQL service.

  2. Call the JDBC interface to access SparkSQL, capture and handle exceptions, and reconnect to other SparkSQL nodes.


Configuration method


The configuration for directly connecting a SparkSQL node to MySQL is as follows:

On the service configuration page, configure javax.jdo.option.ConnectionURL to use either MySQL node as the default MySQL node.

Each SparkSQL node can then be pointed to its corresponding MySQL node through the custom HIVE_SITE_CUSTOM_CONFIG item. The configuration format is as follows:


<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>${mysql_url}</value>

</property>


${mysql_url} is the URL of the MySQL service.

