Spark Learning Five: Spark SQL



tags (space delimited): Spark

    • Spark Learning Five: Spark SQL
      • One, Overview
      • Two, Development history of Spark
      • Three, Spark SQL vs. Hive
      • Four, Spark SQL architecture
      • Five, Accessing Hive data from Spark SQL
      • Six, Catalyst
      • Seven, Thriftserver
      • Eight, DataFrame
      • Nine, Loading external data sources
      • Ten, The power of Spark SQL

One, Overview

Two, Development history of Spark

Three, Spark SQL vs. Hive

Four, Spark SQL architecture

Five, Accessing Hive data from Spark SQL

hive-site.xml needs to be copied into Spark's conf directory.
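For example (assuming Hive's config lives under /etc/hive/conf and SPARK_HOME points at the Spark install; both paths are illustrative):

cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/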

Startup method one: spark-shell

// launch the application
bin/spark-shell --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar --master local[2]
sqlContext.sql("show databases").show()

sqlContext.sql("use default").show()sqlContext.sql("show tables").show()

Startup method two: spark-sql CLI

// launch the application
bin/spark-sql --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar --master local[2]
show databases;

-- cache the table
cache table emp;
-- uncache the table
uncache table emp;

Six, Catalyst

Seven, Thriftserver

Start the service

sbin/start-thriftserver.sh --master local[2] --driver-class-path jars/mysql-connector-java-5.1.27-bin.jar

Start the Beeline client

bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
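Once connected, Beeline sends ordinary SQL through the Thriftserver to Spark; for example (emp is the Hive table used earlier in this post, the query itself is just an illustration):

beeline> show databases;
beeline> select * from emp limit 10;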

Eight, DataFrame

Nine, Loading external data sources

1. Loading JSON data

val json_df = sqlContext.jsonFile("hdfs://study.com.cn:8020/spark/people.json")
json_df.show()

2. Loading Hive data

sqlContext.table("default").show()

3. Loading Parquet data

val parquet_df = sqlContext.parquetFile("hdfs://study.com.cn:8020/spark/users.parquet")
parquet_df.show()

4. Loading data over JDBC

// option 1: SQLContext.jdbc(url, table)
val jdbc_df = sqlContext.jdbc("jdbc:mysql://localhost:3306/db_0306?user=root&password=123456", "my_user")
// option 2: generic load with the jdbc data source
val jdbc_df2 = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/db_0306?user=root&password=123456",
  "dbtable" -> "my_user"))

5. Reading a text file
The first way:

case class Person(name: String, age: Int)
val people_rdd = sc.textFile("spark/sql/people.txt")
val rowRdd = people_rdd.map(x => x.split(",")).map(x => Person(x(0), x(1).trim.toInt))
val people_df = rowRdd.toDF()

The second way:

val people_rdd = sc.textFile("spark/sql/people.txt")
import org.apache.spark.sql._
val rowRdd = people_rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).trim.toInt))
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("name", StringType, true), StructField("age", IntegerType, false)))
val rdd2df = sqlContext.createDataFrame(rowRdd, schema)

Test:
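A minimal sketch of such a test, using the DataFrames built above (the temp-table names and the age filter are illustrative, not from the original post):

people_df.registerTempTable("people")
sqlContext.sql("select name, age from people where age > 20").show()
rdd2df.registerTempTable("people2")
sqlContext.sql("select * from people2").show()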

Ten, The power of Spark SQL

Hive table: emp
MySQL table: dept

Join the two tables above:

val hive_emp_df = sqlContext.table("db_0228.emp")
val mysql_dept_df = sqlContext.jdbc("jdbc:mysql://localhost:3306/db_0306?user=root&password=123456", "tb_dept")
val join_df = hive_emp_df.join(mysql_dept_df, hive_emp_df("deptno") === mysql_dept_df("deptno"))
join_df.show

Case analysis

SQLLogAnalyzer.scala

package com.ibeifeng.bigdata.spark.app

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Xuanyu on 2016/4/17.
 */
object SQLLogAnalyzer {

  def main(args: Array[String]) {
    // Create SparkConf instance
    val sparkConf = new SparkConf().setAppName("SQLLogAnalyzer").setMaster("local[2]")
    // Create SparkContext instance
    val sc = new SparkContext(sparkConf)
    // Create SQLContext instance
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // ==============================================================
    // Input file
    val logFile = "hdfs://bigdata-senior01.ibeifeng.com:8020/user/beifeng/apache.access.log"

    // Create RDD: filter invalid lines, parse each line, convert to a DataFrame
    val accesslogs_df = sc.textFile(logFile)
      // filter log data
      .filter(ApacheAccessLog.isValidateLogLine)
      // parse log lines
      .map(log => ApacheAccessLog.parseLogLine(log))
      .toDF()
    accesslogs_df.registerTempTable("accesslogs")
    // cache
    accesslogs_df.cache()

    // ==============================================================
    // Compute with SQL
    val avgContentSize = sqlContext.sql("SELECT AVG(contentsize) FROM accesslogs").first().get(0)
    val minContentSize = sqlContext.sql("SELECT MIN(contentsize) FROM accesslogs").first().get(0)
    val maxContentSize = sqlContext.sql("SELECT MAX(contentsize) FROM accesslogs").first().get(0)
    // println
    println("Content Size Avg: %s, Min: %s, Max: %s".format(avgContentSize, minContentSize, maxContentSize))

    // Compute with the DataFrame API
    val avg_df = accesslogs_df.agg("contentsize" -> "avg")
    val min_df = accesslogs_df.agg("contentsize" -> "min")
    val max_df = accesslogs_df.agg("contentsize" -> "max")
    // println
    println("=== Content Size Avg: %s, Min: %s, Max: %s".format(
      avg_df.first().get(0), min_df.first().get(0), max_df.first().get(0)))

    // accesslogs_df.unpersist()

    // ==============================================================
    // Stop SparkContext
    sc.stop()
  }
}
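The analyzer references an ApacheAccessLog helper that is not shown in this post. A minimal sketch of what it might look like, assuming a simplified space-separated line of the form "<ip> <responseCode> <contentsize>" (the field layout and parsing are assumptions, not the original helper):

// Hypothetical stand-in for the ApacheAccessLog helper used above.
case class ApacheAccessLog(ipAddress: String, responseCode: Int, contentsize: Long)

object ApacheAccessLog {
  // A line is considered valid if it has at least three space-separated fields
  // and the response code and content size are numeric.
  def isValidateLogLine(line: String): Boolean = {
    val fields = line.split(" ")
    fields.length >= 3 && fields(1).matches("\\d+") && fields(2).matches("\\d+")
  }

  // Parse "<ip> <responseCode> <contentsize>" into the case class.
  def parseLogLine(line: String): ApacheAccessLog = {
    val fields = line.split(" ")
    ApacheAccessLog(fields(0), fields(1).toInt, fields(2).toLong)
  }
}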
