Spark (17): Simple Use of SparkSQL


The evolutionary path of SparkSQL:

Before 1.0: Shark

1.1.x onwards: SparkSQL (alpha, for testing only)

1.3.x: SparkSQL (official release) + DataFrame

1.5.x: SparkSQL + the Tungsten project

1.6.x: SparkSQL + DataFrame + Dataset (beta)

2.x: SparkSQL + DataFrame + Dataset (official release); further SparkSQL optimizations; Structured Streaming (based on Dataset)

Second, understanding SparkSQL

2.1 What is SparkSQL?

Spark SQL is a Spark module used primarily for processing structured data. The core programming abstraction it provides is the DataFrame.

2.2 The role of SparkSQL

SparkSQL provides a programming abstraction (the DataFrame) and acts as a distributed SQL query engine.

DataFrame: it can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

2.3 Operating principle

Spark SQL queries are converted into RDD operations and submitted to the cluster for execution.
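You can watch this translation happen by asking Spark SQL to print its query plans. The following is a minimal sketch (the object name ExplainDemo and the sample data are made up for illustration); the physical plan printed by explain(true) is what is ultimately scheduled on the cluster as RDD operations.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ExplainDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A tiny in-memory DataFrame, just for demonstration
    val df = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")
    df.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, optimized, and physical plans;
    // the physical plan is what actually runs on the cluster as RDD operations.
    sqlContext.sql("SELECT name FROM people WHERE age > 29").explain(true)
  }
}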

2.4 Features

(1) Easy integration

(2) Unified approach to data access

(3) compatible with Hive

(4) Standard data connection

2.5 SparkSession

SparkSession is a new concept introduced in Spark 2.0. It provides users with a unified entry point to Spark's various features.
In earlier versions of Spark, SparkContext was the main entry point: since the RDD was the main API, RDDs were created and manipulated through the SparkContext. Every other API required its own context. For example, Streaming needed a StreamingContext, SQL needed a SQLContext, and Hive needed a HiveContext. As the Dataset and DataFrame APIs became the standard APIs, a single entry point was needed for them, so Spark 2.0 introduced SparkSession as the entry point for the Dataset and DataFrame APIs. SparkSession encapsulates SparkConf, SparkContext, and SQLContext. For backwards compatibility, SQLContext and HiveContext are still preserved.

SparkSession is essentially a combination of SQLContext and HiveContext (and may incorporate StreamingContext in the future), so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession internally encapsulates a SparkContext, so the computation is actually carried out by the SparkContext. A minimal sketch of creating a SparkSession follows the list of characteristics below.

Characteristics:

---- provides users with a unified entry point for using Spark's features

---- allows users to write programs by invoking the DataFrame and Dataset APIs

---- reduces the number of concepts users need to understand, making it easy to interact with Spark

---- the SparkConf, SparkContext, and SQLContext are all encapsulated in the SparkSession, so they no longer need to be created explicitly when interacting with Spark
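As referenced above, here is a minimal sketch of creating a SparkSession in Spark 2.x (the app name is illustrative, and the enableHiveSupport() call is only needed when Hive access is required):

import org.apache.spark.sql.SparkSession

object SparkSessionDemo {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) a SparkSession; this replaces SQLContext/HiveContext as the entry point
    val spark = SparkSession.builder()
      .appName("SparkSessionDemo")
      .master("local")
      .enableHiveSupport()   // only needed when querying Hive tables
      .getOrCreate()

    // The underlying SparkContext is still available and performs the actual computation
    val sc = spark.sparkContext

    // DataFrame/Dataset APIs and SQL are all reachable from the same entry point
    spark.sql("SELECT 1 AS id").show()

    spark.stop()
  }
}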

2.7 DataFrame

In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional table represented by a DataFrame has a name and a type. This gives Spark SQL insight into more of the data's structure, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency significantly. With an RDD, Spark Core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the elements being stored.
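As a small illustration of that schema metadata (the sample rows and column names are made up), printSchema() shows the per-column names and types that Spark SQL can exploit, whereas the same data held in an RDD of tuples is opaque to the optimizer:

import org.apache.spark.sql.SparkSession

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaDemo").master("local").getOrCreate()
    import spark.implicits._

    // The same data as an RDD of tuples: Spark only sees opaque objects
    val rdd = spark.sparkContext.parallelize(Seq(("Michael", 29), ("Andy", 30)))

    // As a DataFrame, every column has a name and a type
    val df = rdd.toDF("name", "age")
    df.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)
  }
}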

Third, converting an RDD into a DataFrame

The following shows how to do this with the Spark 1.x-style API (SQLContext).

Test data directory: spark/examples/src/main/resources (inside Spark's installation directory)

people.txt

3.1 Way one: create a DataFrame from a case class (reflection)

Define a case class, which is equivalent to the table structure:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class People(var name: String, var age: Int)

object TestDataFrame1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDToDataFrame").setMaster("local")
    val sc = new SparkContext(conf)
    val context = new SQLContext(sc)
    // Read the local data into an RDD and associate the RDD with the case class
    val peopleRDD = sc.textFile("E:\\666\\people.txt")
      .map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
    import context.implicits._
    // Convert the RDD to a DataFrame
    val df = peopleRDD.toDF
    // Register the DataFrame as a temporary view
    df.createOrReplaceTempView("people")
    // Query with an SQL statement
    context.sql("SELECT * FROM people").show()
  }
}

Run results

3.2 Way two: create a DataFrame with StructType (programmatic interface)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TestDataFrame2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileRDD = sc.textFile("E:\\666\\people.txt")
    // Map each line of the RDD to a Row (import org.apache.spark.sql.Row)
    val rowRDD: RDD[Row] = fileRDD.map(line => {
      val fields = line.split(",")
      Row(fields(0), fields(1).trim.toInt)
    })
    // Create a StructType to define the schema
    val structType: StructType = StructType(
      // field name, field type, nullable
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, true) :: Nil
    )
    // createDataFrame takes the RDD of Rows and the schema (StructType)
    val df: DataFrame = sqlContext.createDataFrame(rowRDD, structType)
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}

Run results

3.3 Way three: create a DataFrame from a JSON file

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object TestDataFrame3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df: DataFrame = sqlContext.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}

Fourth, reading and saving DataFrames, and SaveMode

4.1 Reading data

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Way one
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    val df2 = sqlContext.read.parquet("E:\\666\\users.parquet")
    // Way two
    val df3 = sqlContext.read.format("json").load("E:\\666\\people.json")
    val df4 = sqlContext.read.format("parquet").load("E:\\666\\users.parquet")
    // Way three: the default format is parquet
    val df5 = sqlContext.read.load("E:\\666\\users.parquet")
  }
}
4.2 Saving data

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    // Way one
    df1.write.json("E:\\111")
    df1.write.parquet("E:\\222")
    // Way two
    df1.write.format("json").save("E:\\333")
    df1.write.format("parquet").save("E:\\444")
    // Way three
    df1.write.save("E:\\555")
  }
}
4.3 Save modes

Using a save mode (import org.apache.spark.sql.SaveMode):

df1.write.format("parquet").mode(SaveMode.Ignore).save("E:\\444")
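For reference, a minimal sketch of the four SaveMode values in org.apache.spark.sql.SaveMode (the object name and output paths are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object TestSaveMode {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestSaveMode").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df1 = sqlContext.read.json("E:\\666\\people.json")

    // ErrorIfExists (the default): fail if the target path already exists
    df1.write.mode(SaveMode.ErrorIfExists).save("E:\\out1")
    // Append: add the DataFrame's rows to any existing data
    df1.write.mode(SaveMode.Append).save("E:\\out2")
    // Overwrite: replace existing data with the DataFrame's contents
    df1.write.mode(SaveMode.Overwrite).save("E:\\out3")
    // Ignore: silently do nothing if the target already exists
    df1.write.mode(SaveMode.Ignore).save("E:\\out4")
  }
}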

Five, data sources

5.1 JSON data source

See section 4.1.

5.2 Parquet data source

See section 4.1.

5.3 MySQL data source

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestMysql").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val url = "jdbc:mysql://192.168.123.102:3306/hivedb"
    val table = "DBS"
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "root")
    // Pass in the MySQL URL, the table name, and the connection properties (user name and password)
    val df = sqlContext.read.jdbc(url, table, properties)
    df.createOrReplaceTempView("dbs")
    sqlContext.sql("SELECT * FROM dbs").show()
  }
}

Run results
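Writing a DataFrame back to MySQL goes through the same JDBC connector. A minimal sketch, reusing the connection details from the example above (the object name TestMysqlWrite and the target table dbs_copy are made up for illustration):

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object TestMysqlWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestMysqlWrite").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val url = "jdbc:mysql://192.168.123.102:3306/hivedb"
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "root")

    // Read the source table, then write it back to a new table via JDBC.
    val df = sqlContext.read.jdbc(url, "DBS", properties)
    // Overwrite drops and recreates the target table if it already exists.
    df.write.mode(SaveMode.Overwrite).jdbc(url, "dbs_copy", properties)
  }
}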

5.4 Hive data source

(1) Preparation

Add the dependency in the pom.xml file:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

In a development environment, add the hive-site.xml file under the resources folder; in a cluster environment, copy the Hive configuration file to the $SPARK_HOME/conf directory.

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hivedb?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
        <!-- If MySQL and Hive are on the same server node, change hadoop02 to localhost -->
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive/warehouse</value>
        <description>hive default warehouse, if nessecory, change it</description>
    </property>
</configuration>
(2) Test code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TestHive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SELECT * FROM myhive.student").show()
  }
}

Run results
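To go the other way and persist a query result as a Hive table, a minimal sketch (the object name TestHiveWrite and the target table myhive.student_copy are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object TestHiveWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    // Query an existing Hive table and save the result as a new managed table.
    val df = sqlContext.sql("SELECT * FROM myhive.student")
    df.write.mode(SaveMode.Overwrite).saveAsTable("myhive.student_copy")
  }
}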

