Spark (17): SparkSQL simple usage
First, the evolutionary path of SparkSQL
Before 1.0:
Shark
1.1.x:
Spark SQL (alpha, for testing only)
1.3.x:
Spark SQL (official release) + DataFrame
1.5.x:
Spark SQL Tungsten project
1.6.x:
Spark SQL + DataFrame + Dataset (beta)
2.x:
Spark SQL + DataFrame + Dataset (official release)
Spark SQL: further optimizations
Structured Streaming (built on Dataset)
Second, understanding SparkSQL
2.1 What is SparkSQL?
Spark SQL is a Spark module used primarily for processing structured data. Its core programming abstraction is the DataFrame.
2.2 The role of SparkSQL
It provides a programming abstraction (DataFrame) and acts as a distributed SQL query engine.
DataFrame: it can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.
2.3 Operating principle
Spark SQL converts SQL queries into RDD operations and submits them to the cluster for execution.
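To make this concrete, here is a minimal sketch (using the same SQLContext-style API as the examples later in this article; the object name, file path, and query are illustrative) that prints the plans Spark SQL produces before the query runs as RDD operations on the cluster.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestExplain {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestExplain").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, optimized, and physical plans;
    // the physical plan is what ultimately executes as RDD operations on the cluster.
    sqlContext.sql("SELECT name FROM people WHERE age > 20").explain(true)
  }
}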
2.4 Features
(1) Easy integration
(2) Unified approach to data access
(3) Compatible with Hive
(4) Standard data connection
2.5 SparkSession
SparkSession is a new concept in Spark 2.0. It provides users with a single, unified entry point to Spark's various features.
In earlier versions of Spark, SparkContext was the main entry point: since the RDD was the main API, RDDs were created and manipulated through SparkContext, and each of the other APIs needed its own context. For example, streaming required a StreamingContext, SQL required a SQLContext, and Hive required a HiveContext. As the Dataset and DataFrame APIs became the standard APIs, they needed a single access point, so Spark 2.0 introduced SparkSession as the entry point for the Dataset and DataFrame APIs. SparkSession encapsulates SparkConf, SparkContext, and SQLContext. For backwards compatibility, SQLContext and HiveContext are also preserved.
SparkSession is essentially the combination of SQLContext and HiveContext (and may incorporate StreamingContext in the future), so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession wraps a SparkContext internally, so the actual computation is still performed by SparkContext. A minimal creation example is sketched after the list of characteristics below.
Characteristics:
---- provides users with a unified entry point to Spark's features
---- allows users to write programs by invoking the DataFrame- and Dataset-related APIs
---- reduces the number of concepts users need to understand, making it easy to interact with Spark
---- the SparkConf, SparkContext, and SQLContext are encapsulated inside SparkSession, so they do not need to be created explicitly when interacting with Spark
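Here is a minimal sketch of creating and using a SparkSession in Spark 2.x; the application name and file path below are illustrative, and enableHiveSupport is only needed when Hive access is required.

import org.apache.spark.sql.SparkSession

object TestSparkSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSessionDemo")
      .master("local")
      // .enableHiveSupport()  // uncomment only if Hive access (and the spark-hive dependency) is needed
      .getOrCreate()

    // The older entry points are still reachable through the session.
    val sc = spark.sparkContext

    val df = spark.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    spark.stop()
  }
}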
2.7 DataFrames
In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional table represented by a DataFrame has a name and a type. This lets Spark SQL see more of the data's structure, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency significantly. For an RDD, by contrast, Spark Core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the elements being stored.
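A small sketch (the names and ages are made-up sample values) of the schema metadata a DataFrame carries but a plain RDD does not:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SchemaDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The RDD only knows it holds (String, Int) tuples; Spark cannot see inside them.
    val rdd = sc.parallelize(Seq(("Michael", 29), ("Andy", 30)))

    // The DataFrame additionally carries column names and types, which Spark SQL can optimize against.
    val df = rdd.toDF("name", "age")
    df.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)
  }
}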
Third, converting an RDD into a DataFrame
The examples below use the Spark 1.x-style API.
Test data directory: spark/examples/src/main/resources (inside Spark's installation directory)
people.txt
3.1 Way one:
Create DataFrames from a case class (reflection)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Define the case class, which is equivalent to the table structure
case class People(name: String, age: Int)

object TestDataFrame1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDToDataFrame").setMaster("local")
    val sc = new SparkContext(conf)
    val context = new SQLContext(sc)
    // Read the local data into an RDD and associate the RDD with the case class
    val peopleRDD = sc.textFile("E:\\666\\people.txt")
      .map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
    import context.implicits._
    // Convert the RDD to a DataFrame
    val df = peopleRDD.toDF
    // Register the DataFrame as a temporary view
    df.createOrReplaceTempView("people")
    // Query it with an SQL statement
    context.sql("SELECT * FROM people").show()
  }
}
Run results
3.2 Way two:
Create DataFrames with StructType (programmatic interface)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TestDataFrame2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileRDD = sc.textFile("E:\\666\\people.txt")
    // Map the RDD data to Row objects
    val rowRDD: RDD[Row] = fileRDD.map(line => {
      val fields = line.split(",")
      Row(fields(0), fields(1).trim.toInt)
    })
    // Create a StructType to define the structure
    val structType: StructType = StructType(
      // field name, field type, whether it can be null
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, true) :: Nil
    )
    // createDataFrame(rowRDD: RDD[Row], schema: StructType)
    val df: DataFrame = sqlContext.createDataFrame(rowRDD, structType)
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}
Run results
3.3 Way three:
Create DataFrames from a JSON file
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object TestDataFrame3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame3").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df: DataFrame = sqlContext.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}
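Since section 2.5 introduced SparkSession as the unified entry point in Spark 2.x, here is a sketch (not part of the original walkthrough; the object name is made up) of the same conversions done through a SparkSession instead of a SQLContext:

import org.apache.spark.sql.SparkSession

object TestDataFrame2x {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDDToDataFrame2x")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Reflection-based conversion, reusing the People case class from way one
    val peopleDF = spark.sparkContext
      .textFile("E:\\666\\people.txt")
      .map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    // Reading JSON directly, as in way three
    spark.read.json("E:\\666\\people.json").show()

    spark.stop()
  }
}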
Fourth, DataFrame reading/saving and SaveMode
4.1 Reading of data
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestRead").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Way one
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    val df2 = sqlContext.read.parquet("E:\\666\\users.parquet")
    // Way two
    val df3 = sqlContext.read.format("json").load("E:\\666\\people.json")
    val df4 = sqlContext.read.format("parquet").load("E:\\666\\users.parquet")
    // Way three; the default format is parquet
    val df5 = sqlContext.load("E:\\666\\users.parquet")
  }
}
4.2 Saving of data
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestSave").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    // Way one
    df1.write.json("E:\\111")
    df1.write.parquet("E:\\222")
    // Way two
    df1.write.format("json").save("E:\\333")
    df1.write.format("parquet").save("E:\\444")
    // Way three; the default format is parquet
    df1.write.save("E:\\555")
  }
}
4.3 Save modes of data
Specify the mode with the mode() method:
df1.write.format("parquet").mode(SaveMode.Ignore).save("E:\\444")
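For reference, a short sketch of the four SaveMode options (assuming the df1 from the TestSave example above and hypothetical output paths; import org.apache.spark.sql.SaveMode is required):

import org.apache.spark.sql.SaveMode

// ErrorIfExists (the default): fail if the output path already contains data
df1.write.mode(SaveMode.ErrorIfExists).parquet("E:\\out1")
// Append: add the new rows to any existing data
df1.write.mode(SaveMode.Append).parquet("E:\\out2")
// Overwrite: replace any existing data at the path
df1.write.mode(SaveMode.Overwrite).parquet("E:\\out3")
// Ignore: silently skip the write if data already exists
df1.write.mode(SaveMode.Ignore).parquet("E:\\out4")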
Fifth, data sources
5.1 JSON data source
See section 4.1.
5.2 Parquet data source
See section 4.1.
5.3 MySQL data source
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestMysql").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val url = "jdbc:mysql://192.168.123.102:3306/hivedb"
    val table = "DBS"
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "root")
    // Pass in the MySQL URL, the table name, and the connection properties (database username and password)
    val df = sqlContext.read.jdbc(url, table, properties)
    df.createOrReplaceTempView("dbs")
    sqlContext.sql("SELECT * FROM dbs").show()
  }
}
Run results
5.4 Hive data source
(1) Preparation
Add the dependency to the pom.xml file:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
In the development environment, add a hive-site.xml file under the resources folder; in a cluster environment, copy the Hive configuration file to the $SPARK_HOME/conf directory.
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hivedb?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
        <!-- If MySQL and Hive are on the same server node, change hadoop02 to localhost -->
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive/warehouse</value>
        <description>hive default warehouse; change it if necessary</description>
    </property>
</configuration>
(2) Test code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TestHive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SELECT * FROM myhive.student").show()
  }
}
Run results