Spark pit-stepping notes: databases (HBase + MySQL) [repost]

Source: Internet
Author: User
Tags: connection pooling, MySQL, connection pool, ZooKeeper

Reposted from: http://www.cnblogs.com/xlturing/p/spark.html

Objective

When persisting the results of Spark Streaming computations, we often need to operate on a database to count or update values. In a recent real-time consumer-processing task, I had to write the computed results into both HBase and MySQL while processing a live data stream with Spark Streaming. This article therefore summarizes how Spark operates on HBase and MySQL, and records some of the pits I stepped into along the way.

Spark Streaming persistence design patterns: DStream output operations
    • print(): prints the first 10 elements of every batch of the DStream on the driver node; often used for development and debugging.
    • saveAsTextFiles(prefix, [suffix]): saves the DStream contents as text files; the file name for each batch interval is generated from prefix and suffix as "prefix-TIME_IN_MS[.suffix]".
    • saveAsObjectFiles(prefix, [suffix]): saves the DStream contents as files of serialized Java objects; the file name for each batch interval is generated from prefix and suffix as "prefix-TIME_IN_MS[.suffix]".
    • saveAsHadoopFiles(prefix, [suffix]): saves the DStream contents as Hadoop files; the file name for each batch interval is generated from prefix and suffix as "prefix-TIME_IN_MS[.suffix]".
    • foreachRDD(func): the most general output operation; it applies the function func to every RDD generated from the stream. Typically func writes the data of each RDD to an external system, for example saving the RDD to files or writing it to a database over a network connection. Note that func runs in the driver process of the application and usually contains RDD actions, which force the computation of the streaming RDDs. A short sketch of these operations follows this list.
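The following minimal sketch (not from the original post) illustrates these output operations; the StreamingContext, socket source, and HDFS output prefix are placeholder assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical setup: a 10-second batch interval reading lines from a socket.
val conf = new SparkConf().setAppName("OutputOpsSketch")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

lines.print()                               // debug: first 10 elements of each batch, on the driver
lines.saveAsTextFiles("hdfs:///tmp/lines")  // one "lines-<time_in_ms>" directory per batch interval
lines.foreachRDD { rdd =>
  // rdd is an ordinary RDD; the action below forces the batch to be computed
  println(s"batch size = ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()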
Using the foreachRDD design pattern

dstream.foreachRDD gives developers a lot of flexibility, but it also comes with several common pits to avoid. The usual process for saving data to an external system is: open a remote connection, send the data over it, then close the connection. The first program that comes to mind is something like this:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)               // executed at the worker
  }
}

In an earlier post in this pit-stepping series the roles of Spark's workers and driver were sorted out, so we know that in cluster mode the connection in the code above would have to be serialized on the driver and shipped to the workers. But a connection cannot be passed between machines, i.e. it is not serializable, so this code fails with a serialization error (connection object not serializable). To avoid the error, we might instead create the connection on the worker, as follows:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()  // executed at the worker
    connection.send(record)
    connection.close()
  }
}

That seems to solve the problem, but think again: we now open and close a connection for every single record of every RDD, which causes unnecessarily high overhead and lowers the overall throughput of the system. A better approach is to use rdd.foreachPartition and establish a single connection per RDD partition (note: each partition of an RDD is processed entirely on one worker), as follows:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

This reduces the load caused by frequently creating connections. Just as we normally introduce a connection pool when connecting to a database, we can apply the same idea here and optimize the code further:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

By holding a static connection pool object, we can reuse connections across batches and amortize the cost of establishing them, further reducing the load. Note that, just like a database connection pool, the pool here should hand out connections lazily and on demand, and reclaim connections that have timed out. A minimal sketch of such a pool is given below.
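The ConnectionPool object referenced above is not spelled out in the original post. The following is one minimal sketch of a lazily initialized, static pool; it is written concretely for JDBC (the URL, user, and password are placeholders), whereas the snippets above use an abstract connection.send() to stand in for whatever write call your client exposes:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

object ConnectionPool {
  // Placeholder connection settings; replace with your own configuration.
  private val jdbcUrl = "jdbc:mysql://127.0.0.1:3306/test"
  private val user = "root"
  private val password = ""

  // Lazily initialized; one pool per executor JVM because this is a Scala object.
  private lazy val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn == null || conn.isClosed) DriverManager.getConnection(jdbcUrl, user, password)
    else conn
  }

  def returnConnection(conn: Connection): Unit = {
    if (conn != null && !conn.isClosed) pool.offer(conn)
  }
}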
It is also noteworthy that:

    • If multiple foreachRDD calls are used on the same DStream in Spark Streaming, they are executed in program order.
    • The execution of DStream output operations is lazy, so if we do not perform any RDD action inside foreachRDD, the system simply receives the data and discards it, as the example below illustrates.
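A tiny illustration of the laziness point (not from the original post; the map is just a stand-in transformation):

// Transformation only: map is lazy and nothing forces evaluation, so the batch is effectively discarded.
dstream.foreachRDD { rdd =>
  rdd.map(record => record.toString.toUpperCase)
}

// count() is an action, so this batch actually gets computed (and could then be saved).
dstream.foreachRDD { rdd =>
  val n = rdd.map(record => record.toString.toUpperCase).count()
  println(s"processed $n records")
}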
Spark accesses HBase

Above we described the basic design pattern for writing a Spark Streaming DStream to an external system; here we describe how to write a DStream to an HBase cluster.

A generic HBase connection class

Scala connects to HBase through ZooKeeper, so we need to provide the ZooKeeper information in the configuration, as follows:

import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

object HbaseUtil extends Serializable {
  private val conf = HBaseConfiguration.create()
  private val para = Conf.hbaseConfig  // Conf is our own configuration class holding the HBase settings
  conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, para.get("port").getOrElse("2181"))
  conf.set(HConstants.ZOOKEEPER_QUORUM, para.get("quorum").getOrElse("127-0-0-1"))  // hosts
  private val connection = ConnectionFactory.createConnection(conf)

  def getHbaseConn: Connection = connection
}

According to information found online, because of the particular way HBase manages its connections, we do not use a connection pool here.

HBase output operation

Taking the put operation as an example, we demonstrate how the design pattern above applies to HBase output:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      val connection = HbaseUtil.getHbaseConn  // get the HBase connection
      partitionRecords.foreach(data => {
        val tableName = TableName.valueOf("tableName")
        val t = connection.getTable(tableName)
        try {
          val put = new Put(Bytes.toBytes(_rowKey_))  // row key
          // column family, qualifier, value
          put.addColumn(_column_.getBytes, _qualifier_.getBytes, _value_.getBytes)
          t.put(put)
          // do some log (displayed on the worker)
        } catch {
          case e: Exception =>
            // log the error
            e.printStackTrace()
        } finally {
          t.close()
        }
      })
    })
    // do some log (displayed on the driver)
  }
})

For other operations on HBase, refer to Operating HBase under Spark (1.0.0 new API) in the references below.

Pit records

The main issue is the configuration of HConstants.ZOOKEEPER_QUORUM when connecting to HBase:

    • HBase connections cannot be made directly by IP address, so the hosts file usually needs to be configured. For example, in the code snippet above I used the (arbitrary) hostname 127-0-0-1, which we need to map in the hosts file:

      127.0.0.1 127-0-0-1
    • On a single machine, we only need to configure the hosts entry of the HBase machine running ZooKeeper. But after switching to an HBase cluster, I ran into a strange bug.
      Problem description: when saving the DStream to HBase inside foreachRDD, the job gets stuck with no error message (yes, stuck, not failing).
      Problem analysis: the HBase cluster has multiple machines, but we had only configured the hosts entry of one HBase machine, so when the Spark cluster accessed HBase it kept looking for the other nodes and hung when it could not find them.
      Workaround: configure the hosts file on every worker with the IPs of all HBase nodes; this solves the problem. An example is sketched after this list.
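A hedged example of the workaround, with placeholder IPs and hostnames; the same entries would go into /etc/hosts on every Spark worker, and the corresponding ZOOKEEPER_QUORUM value would then be "hbase-node1,hbase-node2,hbase-node3":

# /etc/hosts on every Spark worker (IPs and hostnames are placeholders)
192.168.0.101  hbase-node1
192.168.0.102  hbase-node2
192.168.0.103  hbase-node3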

Spark accesses MySQL

As with HBase, we need a serializable class to establish the MySQL connection; here we take advantage of MySQL's C3P0 connection pool.

A generic MySQL connection class
import java.sql.Connection

import com.mchange.v2.c3p0.ComboPooledDataSource

class MysqlPool extends Serializable {
  private val cpds: ComboPooledDataSource = new ComboPooledDataSource(true)
  private val conf = Conf.mysqlConfig
  try {
    cpds.setJdbcUrl(conf.get("url").getOrElse("jdbc:mysql://127.0.0.1:3306/test_bee?useUnicode=true&characterEncoding=UTF-8"))
    cpds.setDriverClass("com.mysql.jdbc.Driver")
    cpds.setUser(conf.get("username").getOrElse("root"))
    cpds.setPassword(conf.get("password").getOrElse(""))
    cpds.setMaxPoolSize(200)   // the pool sizes were garbled in the source; tune these to your workload
    cpds.setMinPoolSize(20)
    cpds.setAcquireIncrement(5)
    cpds.setMaxStatements(180)
  } catch {
    case e: Exception => e.printStackTrace()
  }

  def getConnection: Connection = {
    try {
      cpds.getConnection()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        null
    }
  }
}

object MysqlManager {
  var mysqlManager: MysqlPool = _

  def getMysqlManager: MysqlPool = {
    synchronized {
      if (mysqlManager == null) {
        mysqlManager = new MysqlPool
      }
    }
    mysqlManager
  }
}

We use C3P0 to establish a MySQL connection pool; each time we need to write data, we take a connection from the pool.

MySQL output operation

Using the same foreachRDD design pattern as before, the code to write the DStream to MySQL is as follows:

dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = MysqlManager.getMysqlManager.getConnection
      val statement = conn.createStatement
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          val sql = "insert into table ..."  // the SQL statement to execute
          statement.addBatch(sql)
        })
        statement.executeBatch
        conn.commit()
      } catch {
        case e: Exception =>
          // do some log
      } finally {
        statement.close()
        conn.close()
      }
    })
  }
})

It is worth noting that:

    • When we submit the MySQL operations, we do not commit once per record but commit in batches, which is why we call conn.setAutoCommit(false); this further improves MySQL efficiency. A variant using a PreparedStatement batch is sketched after this list.
    • If we update MySQL on indexed fields, the updates become slow. Try to avoid this situation; if it cannot be avoided, then just grind through it (T^T).
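As a variant of the batch commit above (not from the original post), a PreparedStatement avoids building SQL strings by hand; the table t(word, count) and the assumption that each partition yields (String, Int) pairs are hypothetical:

rdd.foreachPartition(partitionRecords => {
  val conn = MysqlManager.getMysqlManager.getConnection
  // Hypothetical table t(word VARCHAR, count INT)
  val stmt = conn.prepareStatement("insert into t (word, count) values (?, ?)")
  try {
    conn.setAutoCommit(false)
    partitionRecords.foreach { case (word: String, count: Int) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    stmt.close()
    conn.close()
  }
})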
Deployment

Finally, here is the Maven configuration for the jar packages Spark needs to connect to MySQL and HBase:

<dependency><!-- HBase -->
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency><!-- MySQL -->
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.31</version>
</dependency>
<dependency>
  <groupId>c3p0</groupId>
  <artifactId>c3p0</artifactId>
  <version>0.9.1.2</version>
</dependency>

Reference documents:

    1. Spark Streaming Programming Guide
    2. HBase Introduction
    3. Operating HBase under Spark (1.0.0 new API)
    4. Quick Start for Spark development
    5. Kafka -> Spark Streaming -> MySQL (Scala) real-time data processing example
    6. Use C3P0 connection pool to operate MySQL database in Spark streaming
