Scenario: use Spark Streaming to receive real-time data and join it against tables in a relational database.
Technology used: Spark Streaming + Spark SQL JDBC external data sources.
Code prototype:
package com.luogankun.spark.streaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.hive.HiveContext

case class Student(id: Int, name: String, cityId: Int)

object HDFSStreaming {
  def main(args: Array[String]) {
    val location = args(0)  // HDFS file path
    val sparkConf = new SparkConf().setAppName("HDFS JDBC Streaming")
    val sc  = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))
    val sqlContext = new HiveContext(sc)
    import sqlContext.createSchemaRDD
    import com.luogankun.spark.jdbc._

    // Use the external data source to load the city table from MySQL
    val cities = sqlContext.jdbcTable("jdbc:mysql://hadoop000:3306/test",
      "root", "root", "SELECT id, name FROM city")
    // Register the cities RDD as the city_table temporary table
    cities.registerTempTable("city_table")

    val inputs = ssc.textFileStream(location)
    inputs.foreachRDD(rdd => {
      if (rdd.partitions.length > 0) {
        // Register the data received from the stream as the student temporary table
        rdd.map(_.split("\t"))
           .map(x => Student(x(0).toInt, x(1), x(2).toInt))
           .registerTempTable("student")

        // Join the streaming data with the MySQL table
        sqlContext.sql(
          "SELECT s.id, s.name, s.cityId, c.name " +
          "FROM student s JOIN city_table c ON s.cityId = c.id")
          .collect().foreach(println)
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
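The per-record parsing inside foreachRDD (tab-separated id, name, cityId turned into a Student) can be checked without a cluster; a minimal plain-Scala sketch, with made-up sample lines for illustration:

```scala
// Mirrors the map(_.split("\t")).map(x => Student(...)) step from the job above.
case class Student(id: Int, name: String, cityId: Int)

def parseStudent(line: String): Student = {
  val x = line.split("\t")
  Student(x(0).toInt, x(1), x(2).toInt)
}

val lines = Seq("1\tTom\t10", "2\tJack\t20")
val students = lines.map(parseStudent)
students.foreach(println)
```

Note that a malformed line (fewer than three columns, or a non-numeric id) will throw here just as it would inside the streaming job, so production code would typically guard this step.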
Script to submit the job to the Spark cluster:
spark-submit \
  --master spark://hadoop000:7077 \
  --executor-memory 1g \
  /home/spark/lib/streaming.jar \
  hdfs://hadoop000:8020/data/hdfs
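The SQL the job runs is a plain equi-join of student.cityId against city.id. As a mental model of what each micro-batch computes, the same result with in-memory Scala collections (the city rows below are made up, standing in for the MySQL city table):

```scala
case class Student(id: Int, name: String, cityId: Int)

val students = Seq(Student(1, "Tom", 10), Student(2, "Jack", 20))
val cities   = Map(10 -> "Beijing", 20 -> "Shanghai")  // id -> name

// SELECT s.id, s.name, s.cityId, c.name
// FROM student s JOIN city_table c ON s.cityId = c.id
val joined = for {
  s        <- students
  cityName <- cities.get(s.cityId)  // inner join: students with no matching city are dropped
} yield (s.id, s.name, s.cityId, cityName)

joined.foreach(println)
```

This also shows the inner-join semantics: a student whose cityId has no row in city simply produces no output, which matches what the JOIN in the job does.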
A case study of Spark Streaming combined with Spark SQL JDBC external data sources.