Spark Learning Five: Spark SQL
tags (space delimited): Spark
- Spark Learning Five: Spark SQL
- One, Overview
- Two, Spark's Development History
- Three, Spark SQL vs. Hive
- Four, Spark SQL Architecture
- Five, Spark SQL Access to Hive Data
- Six, Catalyst
- Seven, Thriftserver
- Eight, DataFrame
- Nine, Loading External Data Sources
- The Power of Spark SQL
One, Overview
Two, Spark's Development History
Three, Spark SQL vs. Hive
Four, Spark SQL Architecture
Five, Spark SQL Access to Hive Data
hive-site.xml needs to be copied into Spark's conf directory.
Launch method one:
// launch the application
bin/spark-shell --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar --master local[2]
sqlContext.sql("show databases").show()
sqlContext.sql("use default").show()sqlContext.sql("show tables").show()
Launch method two:
// launch the application
bin/spark-sql --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar --master local[2]
show databases;
// cache
cache table emp;
// uncache
uncache table emp;
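Queries against the cached table are then served from memory; for example, assuming emp has a deptno column as in the join example later in these notes:
select deptno, count(*) from emp group by deptno;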
Six, Catalyst
Seven, Thriftserver
Start the service
sbin/start-thriftserver.sh --master local[2] --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar
Start the Beeline client
bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
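Once connected, SQL statements typed at the beeline prompt are executed by the Thrift server; for example, again assuming the emp table exists:
show databases;
select * from default.emp limit 10;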
Eight, DataFrame
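A DataFrame is a distributed collection of rows with a named schema that can be queried either through SQL or through its own API. A minimal sketch, assuming the emp table from the previous section (ename and sal are illustrative column names):
val emp_df = sqlContext.table("default.emp")
emp_df.printSchema()
// ename and sal are illustrative column names
emp_df.select("ename", "sal").filter(emp_df("sal") > 1000).show()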
Nine, Loading External Data Sources
1. Loading JSON data
val json_df = sqlContext.jsonFile("hdfs://study.com.cn:8020/spark/people.json")
json_df.show()
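The DataFrame loaded from JSON can also be registered as a temporary table and queried with SQL; for example, assuming people.json has name and age fields like the people.txt file used below:
json_df.registerTempTable("people_json")
sqlContext.sql("select name, age from people_json where age > 19").show()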
2. Load Hive Data
sqlContext.table("default").show()
3. Loading Parquet data
val parquet_df = sqlContext.parquetFile("hdfs://study.com.cn:8020/spark/users.parquet")
parquet_df.show()
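Going the other direction, a DataFrame can also be written out as Parquet with the Spark 1.x saveAsParquetFile method; a sketch, assuming write access to the same HDFS directory (the output path is illustrative):
json_df.saveAsParquetFile("hdfs://study.com.cn:8020/spark/people_parquet")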
4. Loading data via JDBC
// read from MySQL via JDBC (variable names are illustrative; the original omits them)
val jdbc_df = sqlContext.jdbc("jdbc:mysql://localhost:3306/db_0306?user=root&password=123456", "my_user")
// equivalent: load with the generic jdbc data source
val jdbc_df2 = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/db_0306?user=root&password=123456", "dbtable" -> "my_user"))
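Either call returns a DataFrame backed by the MySQL table, which can then be inspected like any other:
jdbc_df.printSchema()
jdbc_df.show()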
5. Reading a text file
The first way:
case class Person(name: String, age: Int)
val people_rdd = sc.textFile("spark/sql/people.txt")
val rowRdd = people_rdd.map(x => x.split(",")).map(x => Person(x(0), x(1).trim.toInt))
val people_df = rowRdd.toDF()
The second way:
val people_rdd = sc.textFile("spark/sql/people.txt")
import org.apache.spark.sql._
val rowRdd = people_rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).trim.toInt))
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("name", StringType, true), StructField("age", IntegerType, false)))
val rdd2df = sqlContext.createDataFrame(rowRdd, schema)
Test:
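For example, the two resulting DataFrames can be displayed, and the second one queried through a temporary table (the age filter is just an illustration):
people_df.show()
rdd2df.printSchema()
rdd2df.registerTempTable("people")
sqlContext.sql("select * from people where age > 19").show()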
The Power of Spark SQL
Hive table: emp
MySQL table: dept
Join the two tables above:
val hive_emp_df = sqlContext.table("db_0228.emp")
val mysql_dept_df = sqlContext.jdbc("jdbc:mysql://localhost:3306/db_0306?user=root&password=123456", "tb_dept")
val join_df = hive_emp_df.join(mysql_dept_df, hive_emp_df("deptno") === mysql_dept_df("deptno"))
join_df.show
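The joined result is an ordinary DataFrame, so it can be inspected like any other:
join_df.printSchema()
println(join_df.count())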
Case analysis
SQLLogAnalyzer.scala
package com.ibeifeng.bigdata.spark.app

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Xuanyu on 2016/4/17.
 */
object SQLLogAnalyzer {

  def main(args: Array[String]) {
    // create SparkConf instance
    val sparkConf = new SparkConf().setAppName("SQLLogAnalyzer").setMaster("local[2]")
    // create SparkContext instance
    val sc = new SparkContext(sparkConf)
    // create SQLContext instance
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // ==============================================================
    // input file
    val logFile = "hdfs://bigdata-senior01.ibeifeng.com:8020/user/beifeng/apache.access.log"

    // create the RDD, filter and parse the log lines, then convert to a DataFrame
    val accesslogs_df = sc.textFile(logFile)
      // filter out invalid log lines
      .filter(ApacheAccessLog.isValidateLogLine)
      // parse each log line
      .map(log => ApacheAccessLog.parseLogLine(log))
      .toDF()
    accesslogs_df.registerTempTable("accesslogs")
    // cache
    accesslogs_df.cache()

    // ==============================================================
    // compute with SQL
    val avgContentSize = sqlContext.sql("SELECT AVG(contentsize) FROM accesslogs").first().get(0)
    val minContentSize = sqlContext.sql("SELECT MIN(contentsize) FROM accesslogs").first().get(0)
    val maxContentSize = sqlContext.sql("SELECT MAX(contentsize) FROM accesslogs").first().get(0)
    // println
    println("Content Size Avg: %s, Min: %s, Max: %s".format(avgContentSize, minContentSize, maxContentSize))

    // compute with the DataFrame API
    // accesslogs_df.unpersist()
    val avg_df = accesslogs_df.agg("contentsize" -> "avg")
    val min_df = accesslogs_df.agg("contentsize" -> "min")
    val max_df = accesslogs_df.agg("contentsize" -> "max")
    // println
    println("=== Content Size Avg: %s, Min: %s, Max: %s".format(avg_df.first().get(0), min_df.first().get(0), max_df.first().get(0)))

    // ==============================================================
    // stop SparkContext
    sc.stop()
  }
}
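The analyzer refers to an ApacheAccessLog helper that is not shown here. A minimal sketch of what such a helper could look like, assuming the standard Apache common log format and that only the contentsize column is needed by the queries above (the regular expression and field names are illustrative):

// Hypothetical helper, not part of the original notes: a case class plus the
// two methods SQLLogAnalyzer expects (isValidateLogLine, parseLogLine).
case class ApacheAccessLog(ipAddress: String, method: String, endpoint: String,
                           responseCode: Int, contentsize: Long)

object ApacheAccessLog {
  // assumed common log format: ip - - [date] "METHOD endpoint protocol" code size
  val PATTERN = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)""".r

  def isValidateLogLine(line: String): Boolean =
    PATTERN.findFirstIn(line).isDefined

  def parseLogLine(line: String): ApacheAccessLog = {
    val m = PATTERN.findFirstMatchIn(line).get
    ApacheAccessLog(m.group(1), m.group(3), m.group(4), m.group(5).toInt, m.group(6).toLong)
  }
}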