https://cloud.tencent.com/developer/article/1004820
Spark Pitfall Notes: Databases (HBase + MySQL)
Objective
When using Spark Streaming to persist the results of our computations, we often need to operate on a database to count or update values.
In a recent real-time consumer-processing task, I needed to write the computed results to both HBase and MySQL while processing the stream with Spark Streaming, so this article summarizes how Spark operates on HBase and MySQL and records some of the pitfalls stepped into along the way.
Spark Streaming persistence design patterns
DStream output operations
print(): prints the first 10 elements of every batch of the DStream on the driver node; commonly used for development and debugging.
saveAsTextFiles(prefix, [suffix]): saves the DStream contents as text files; the file name for each batch interval is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): saves the DStream contents as SequenceFiles of serialized Java objects; the file name for each batch interval is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): saves the DStream contents as Hadoop files; the file name for each batch interval is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func): the most general output operation, which applies the function func to every RDD generated from the stream. func usually pushes the data in each RDD to an external system, for example saving the RDD to files or writing it to a database over a network connection. Note that func runs in the driver process of the streaming application and usually contains RDD actions, which force the received data to actually be computed. (A short usage sketch follows this list.)
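A minimal usage sketch of the simpler output operations, assuming a DStream[String] named dstream has already been created from a StreamingContext; the names and the HDFS path are purely illustrative:
// illustrative only: `dstream` is assumed to be an existing DStream[String]
dstream.print()  // print the first 10 elements of each batch on the driver
dstream.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt")  // one directory per batch: wordcounts-TIME_IN_MS.txt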
The foreachRDD design pattern
dstream.foreachRDD offers great flexibility for development, but there are many common pitfalls to avoid when using it. The usual flow for saving data to an external system is: establish a remote connection, transfer the data to the remote system over that connection, then close the connection. Following this flow, the first code we tend to write looks like this:
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)  // executed at the worker
  }
}
In the previous article, Spark Pitfall Notes: First Attempt, we went over Spark's workers and driver, so we know that in cluster mode the connection in the code above would have to be serialized on the driver and sent to the workers. A connection, however, cannot be passed between machines; it is not serializable, so this code fails with a serialization error (connection object not serializable). To avoid the error, we create the connection on the worker instead, as follows:
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
That seems to solve the problem, but think about it: we now open and close a connection for every single record in the RDD, which causes unnecessarily high overhead and lowers the overall throughput of the system.
A better approach is to use rdd.foreachPartition and establish a single connection per RDD partition (note: each partition is processed as a whole on one worker), as follows:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
This reduces the overhead of creating connections too frequently. When talking to a database we usually go one step further and introduce a connection pool, so the code can be optimized as follows:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
By holding a static pool of connections we can reuse connections across batches, further amortizing the cost of connection setup and reducing the load. Note that, just like a database connection pool, the pool used here should create connections lazily and on demand, and reclaim connections that have timed out. A minimal sketch of such a pool is shown below.
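The ConnectionPool object referenced above is not shown in the snippets, so here is a minimal sketch of what a lazily initialized, per-JVM pool could look like. Everything in it is an assumption for illustration: the RemoteConnection wrapper, the host remote-host and port 9999 stand in for whatever client your external system actually uses, and records are assumed to be strings.
import java.io.PrintWriter
import java.net.Socket
import java.util.concurrent.ConcurrentLinkedQueue

// Purely illustrative "connection": a TCP socket that can send one line of text per record.
class RemoteConnection(host: String, port: Int) {
  private val socket = new Socket(host, port)
  private val writer = new PrintWriter(socket.getOutputStream, true)
  def send(record: String): Unit = writer.println(record)
  def isOpen: Boolean = !socket.isClosed
  def close(): Unit = socket.close()
}

// Minimal sketch of a lazily initialized, per-JVM connection pool.
object ConnectionPool {
  private lazy val pool = new ConcurrentLinkedQueue[RemoteConnection]()

  def getConnection(): RemoteConnection =
    Option(pool.poll()).getOrElse(new RemoteConnection("remote-host", 9999))

  def returnConnection(conn: RemoteConnection): Unit =
    if (conn.isOpen) pool.offer(conn)
}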
It is also worth noting that:
If multiple foreachRDD calls are used in a Spark Streaming application, they are executed in program order.
The execution of DStream output operations is lazy: if we do not trigger any RDD action inside foreachRDD, the system simply receives the data and then discards it (see the short example below).
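A minimal illustration of the second point, reusing the dstream from the snippets above:
dstream.foreachRDD { rdd =>
  // no RDD action here: the batch is received and then silently discarded
}

dstream.foreachRDD { rdd =>
  rdd.foreach(record => println(record))  // foreach is an action, so the batch is actually computed
}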
Spark accesses HBase
Above we described the basic design pattern for writing a Spark Streaming DStream to an external system; here we look at how to write the DStream to an HBase cluster.
HBase Universal Connection Class
Scala connects to HBase through ZooKeeper, so the ZooKeeper information has to be provided in the configuration, as follows:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.ConnectionFactory

object HBaseUtil extends Serializable {
  private val conf = HBaseConfiguration.create()
  private val para = Conf.hbaseConfig  // Conf is the application's configuration class, from which the HBase settings are read
  conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, para.get("PORT").getOrElse("2181"))
  conf.set(HConstants.ZOOKEEPER_QUORUM, para.get("QUORUM").getOrElse("127-0-0-1"))  // hostname(s), resolved via the hosts file
  private val connection = ConnectionFactory.createConnection(conf)

  def getHbaseConn: Connection = connection
}
According to information found online, because of the special nature of HBase connections (an HBase Connection is heavyweight and thread-safe, and is meant to be shared), we do not use a connection pool for HBase.
HBase output operation
We take the Put operation as an example to show how the design pattern above is applied to HBase output:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      val connection = HBaseUtil.getHbaseConn  // get the shared HBase connection
      partitionRecords.foreach(data => {
        val tableName = TableName.valueOf("tableName")
        val t = connection.getTable(tableName)
        try {
          val put = new Put(Bytes.toBytes(rowKey))  // row key
          // column family, qualifier, value
          put.addColumn(column.getBytes, qualifier.getBytes, value.getBytes)
          t.put(put)
          // do some logging (shown on the worker)
        } catch {
          case e: Exception =>
            // log the error
            e.printStackTrace()
        } finally {
          t.close()
        }
      })
    })
    // do some logging (shown on the driver)
  }
})
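One detail worth flagging in the snippet above: getTable is called once per record. As a variant sketch, the Table handle can be obtained once per partition instead; the data fields rowKey, column, qualifier and value below are illustrative stand-ins for however your records are structured:
dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      val connection = HBaseUtil.getHbaseConn
      val table = connection.getTable(TableName.valueOf("tableName"))  // one Table handle per partition
      try {
        partitionRecords.foreach(data => {
          val put = new Put(Bytes.toBytes(data.rowKey))
          put.addColumn(data.column.getBytes, data.qualifier.getBytes, data.value.getBytes)
          table.put(put)
        })
      } finally {
        table.close()
      }
    })
  }
})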
For other HBase operations, see Spark operating on HBase (the 1.0.0 new API).
Pitfall notes
The main pitfall concerns configuring HConstants.ZOOKEEPER_QUORUM when connecting to HBase:
Because an HBase connection cannot be established directly through an IP address, hostnames usually have to be configured in the hosts file. For example, the (arbitrarily chosen) hostname 127-0-0-1 in the code snippet above needs an entry in the hosts file such as:
127.0.0.1 127-0-0-1
On a single machine we only need a hosts entry for the HBase node where ZooKeeper runs, but after switching to an HBase cluster we ran into a strange bug.
Problem description: when saving the DStream to HBase inside foreachRDD, the job simply got stuck, with no error message at all (yes, it just hung there, not responding).
Problem analysis: the HBase cluster has several machines, but we had only configured the hosts entry for one HBase node, so when the Spark cluster accessed HBase it kept trying to resolve the other nodes and hung when it could not find them.
Workaround: configure the hosts file on every worker with the IPs and hostnames of all HBase nodes; that solved the problem.
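As a related sketch (the host names are purely illustrative), the ZooKeeper quorum in the HBaseUtil configuration above can also be set explicitly to the full, comma-separated list of nodes; each name must then be resolvable, via DNS or the hosts file, on every Spark worker:
// illustrative host names; every one of them must resolve on every worker
conf.set(HConstants.ZOOKEEPER_QUORUM, "hbase-node-1,hbase-node-2,hbase-node-3")
conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")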
Spark access to MySQL
Similar to accessing HBase, we also need a serializable class to establish the MySQL connection; here we take advantage of the c3p0 connection pool for MySQL.
MySQL Universal Connection Class
import java.sql.Connection
import java.util.Properties
import com.mchange.v2.c3p0.ComboPooledDataSource

class MysqlPool extends Serializable {
  private val cpds: ComboPooledDataSource = new ComboPooledDataSource(true)
  private val conf = Conf.mysqlConfig  // Conf is the application's configuration class
  try {
    cpds.setJdbcUrl(conf.get("url").getOrElse("jdbc:mysql://127.0.0.1:3306/test_bee?useUnicode=true&characterEncoding=UTF-8"))
    cpds.setDriverClass("com.mysql.jdbc.Driver")
    cpds.setUser(conf.get("username").getOrElse("root"))
    cpds.setPassword(conf.get("password").getOrElse(""))
    cpds.setMaxPoolSize(200)
    cpds.setMinPoolSize(20)
    cpds.setAcquireIncrement(5)
    cpds.setMaxStatements(180)
  } catch {
    case e: Exception => e.printStackTrace()
  }

  def getConnection: Connection = {
    try {
      cpds.getConnection()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        null
    }
  }
}
object MysqlManager {
  var mysqlManager: MysqlPool = _

  def getMysqlManager: MysqlPool = {
    synchronized {
      if (mysqlManager == null) {
        mysqlManager = new MysqlPool
      }
    }
    mysqlManager
  }
}
We use c3p0 to build a MySQL connection pool, and each time data needs to be written we take a connection out of the pool.
MySQL output operation
Using the same foreachRDD design pattern as before, the code for writing the DStream to MySQL is as follows:
dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = MysqlManager.getMysqlManager.getConnection
      val statement = conn.createStatement
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          val sql = "insert into table ..."  // the SQL statement to execute
          statement.addBatch(sql)
        })
        statement.executeBatch()
        conn.commit()
      } catch {
        case e: Exception =>
          // do some logging
      } finally {
        statement.close()
        conn.close()
      }
    })
  }
})
It is worth noting that:
When writing to MySQL we do not commit every single record individually; instead we submit them as a batch, which is why we call conn.setAutoCommit(false). This further improves MySQL efficiency (a prepared-statement variant is sketched after these notes).
If the MySQL updates touch indexed fields, the updates become slow; try to avoid this situation, and if it cannot be avoided, just grit your teeth and bear it (T^T).
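The batch idea above can also be expressed with a PreparedStatement, which avoids building an SQL string per record. The sketch below is only an illustration: the table name my_table, its columns, and the assumption that each record is a (Long, String) pair are all made up, and it is meant to be called from rdd.foreachPartition in place of the body shown earlier.
import java.sql.Connection

// Sketch: batch insert with a PreparedStatement (table/column names and record type are illustrative).
def writePartition(partitionRecords: Iterator[(Long, String)]): Unit = {
  val conn: Connection = MysqlManager.getMysqlManager.getConnection
  val ps = conn.prepareStatement("insert into my_table (id, value) values (?, ?)")
  try {
    conn.setAutoCommit(false)
    partitionRecords.foreach { case (id, value) =>
      ps.setLong(1, id)
      ps.setString(2, value)
      ps.addBatch()
    }
    ps.executeBatch()
    conn.commit()
  } finally {
    ps.close()
    conn.close()
  }
}

// usage: rdd.foreachPartition(writePartition)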
Deployment
Here is the Maven configuration for the jars Spark needs in order to connect to MySQL and HBase:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-common</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.31</version>
</dependency>
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
Reference documents:
[Repost] Spark Pitfall Notes: Databases (HBase + MySQL)