Spark (17): SparkSQL simple usage
First, the evolutionary path of SparkSQL
Before 1.0:
Shark
1.1.x:
Spark SQL (alpha, for testing only)
1.3.x:
Spark SQL (official release) + DataFrame
1.5.x:
Spark SQL Tungsten project
1.6.x:
Spark SQL + DataFrame + Dataset (beta)
2.x:
Spark SQL + DataFrame + Dataset (official release)
Spark SQL: further optimizations
Structured Streaming (built on Dataset)
Second, understanding SparkSQL
2.1 What is SparkSQL?
Spark SQL is a Spark module used primarily for processing structured data. Its core programming abstraction is the DataFrame.
2.2 The role of SparkSQL
It provides a programming abstraction (DataFrame) and acts as a distributed SQL query engine.
DataFrame: it can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.
2.3 Operating principle
Spark SQL converts SQL queries into RDD operations and submits them to the cluster for execution.
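To make this concrete, here is a minimal sketch (using the same SQLContext-style API as the examples later in this article; the object name, file path, and query are illustrative) that prints the plans Spark SQL produces before the query runs as RDD operations on the cluster.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestExplain {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestExplain").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, optimized, and physical plans;
    // the physical plan is what ultimately executes as RDD operations on the cluster.
    sqlContext.sql("SELECT name FROM people WHERE age > 20").explain(true)
  }
}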
2.4 Features
(1) Easy integration
(2) Unified approach to data access
(3) Compatible with Hive
(4) Standard data connection
2.5 SparkSession
SparkSession is a new concept in Spark 2.0. It provides users with a single, unified entry point to Spark's various features.
In earlier versions of Spark, SparkContext was the main entry point: since the RDD was the main API, RDDs were created and manipulated through SparkContext, and each of the other APIs needed its own context. For example, streaming required a StreamingContext, SQL required a SQLContext, and Hive required a HiveContext. As the Dataset and DataFrame APIs became the standard APIs, they needed a single access point, so Spark 2.0 introduced SparkSession as the entry point for the Dataset and DataFrame APIs. SparkSession encapsulates SparkConf, SparkContext, and SQLContext. For backwards compatibility, SQLContext and HiveContext are also preserved.
SparkSession is essentially the combination of SQLContext and HiveContext (and may incorporate StreamingContext in the future), so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession wraps a SparkContext internally, so the actual computation is still performed by SparkContext. A minimal creation example is sketched after the list of characteristics below.
Characteristics:
---- provides users with a unified entry point to Spark's features
---- allows users to write programs by invoking the DataFrame- and Dataset-related APIs
---- reduces the number of concepts users need to understand, making it easy to interact with Spark
---- the SparkConf, SparkContext, and SQLContext are encapsulated inside SparkSession, so they do not need to be created explicitly when interacting with Spark
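Here is a minimal sketch of creating and using a SparkSession in Spark 2.x; the application name and file path below are illustrative, and enableHiveSupport is only needed when Hive access is required.

import org.apache.spark.sql.SparkSession

object TestSparkSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSessionDemo")
      .master("local")
      // .enableHiveSupport()  // uncomment only if Hive access (and the spark-hive dependency) is needed
      .getOrCreate()

    // The older entry points are still reachable through the session.
    val sc = spark.sparkContext

    val df = spark.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    spark.stop()
  }
}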
2.7 DataFrames
In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional table represented by a DataFrame has a name and a type. This lets Spark SQL see more of the data's structure, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency significantly. For an RDD, by contrast, Spark Core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the elements being stored.
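A small sketch (the names and ages are made-up sample values) of the schema metadata a DataFrame carries but a plain RDD does not:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SchemaDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The RDD only knows it holds (String, Int) tuples; Spark cannot see inside them.
    val rdd = sc.parallelize(Seq(("Michael", 29), ("Andy", 30)))

    // The DataFrame additionally carries column names and types, which Spark SQL can optimize against.
    val df = rdd.toDF("name", "age")
    df.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)
  }
}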
Third, converting an RDD into a DataFrame
The examples below use the Spark 1.x-style API.
Test data directory: spark/examples/src/main/resources (inside Spark's installation directory)
people.txt
3.1 Way one:
Create DataFrames from a case class (reflection)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Define the case class, which is equivalent to the table structure
case class People(name: String, age: Int)

object TestDataFrame1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDToDataFrame").setMaster("local")
    val sc = new SparkContext(conf)
    val context = new SQLContext(sc)
    // Read the local data into an RDD and associate the RDD with the case class
    val peopleRDD = sc.textFile("E:\\666\\people.txt")
      .map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
    import context.implicits._
    // Convert the RDD to a DataFrame
    val df = peopleRDD.toDF
    // Register the DataFrame as a temporary view
    df.createOrReplaceTempView("people")
    // Query it with an SQL statement
    context.sql("SELECT * FROM people").show()
  }
}
Run results
3.2 Way two:
Create DataFrames with StructType (programmatic interface)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TestDataFrame2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame2").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileRDD = sc.textFile("E:\\666\\people.txt")
    // Map the RDD data to Row objects
    val rowRDD: RDD[Row] = fileRDD.map(line => {
      val fields = line.split(",")
      Row(fields(0), fields(1).trim.toInt)
    })
    // Create a StructType to define the structure
    val structType: StructType = StructType(
      // field name, field type, whether it can be null
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, true) :: Nil
    )
    // createDataFrame(rowRDD: RDD[Row], schema: StructType)
    val df: DataFrame = sqlContext.createDataFrame(rowRDD, structType)
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}
Run results
3.3 Way three:
Create DataFrames from a JSON file
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object TestDataFrame3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame3").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df: DataFrame = sqlContext.read.json("E:\\666\\people.json")
    df.createOrReplaceTempView("people")
    sqlContext.sql("SELECT * FROM people").show()
  }
}
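Since section 2.5 introduced SparkSession as the unified entry point in Spark 2.x, here is a sketch (not part of the original walkthrough; the object name is made up) of the same conversions done through a SparkSession instead of a SQLContext:

import org.apache.spark.sql.SparkSession

object TestDataFrame2x {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDDToDataFrame2x")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Reflection-based conversion, reusing the People case class from way one
    val peopleDF = spark.sparkContext
      .textFile("E:\\666\\people.txt")
      .map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    // Reading JSON directly, as in way three
    spark.read.json("E:\\666\\people.json").show()

    spark.stop()
  }
}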
Fourth, DataFrame reading/saving and SaveMode
4.1 Reading of data
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestRead").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Way one
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    val df2 = sqlContext.read.parquet("E:\\666\\users.parquet")
    // Way two
    val df3 = sqlContext.read.format("json").load("E:\\666\\people.json")
    val df4 = sqlContext.read.format("parquet").load("E:\\666\\users.parquet")
    // Way three; the default format is parquet
    val df5 = sqlContext.load("E:\\666\\users.parquet")
  }
}
4.2 Saving of data
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestSave").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df1 = sqlContext.read.json("E:\\666\\people.json")
    // Way one
    df1.write.json("E:\\111")
    df1.write.parquet("E:\\222")
    // Way two
    df1.write.format("json").save("E:\\333")
    df1.write.format("parquet").save("E:\\444")
    // Way three; the default format is parquet
    df1.write.save("E:\\555")
  }
}
4.3 Save modes of data
Specify the mode with the mode() method:
df1.write.format("parquet").mode(SaveMode.Ignore).save("E:\\444")
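For reference, a short sketch of the four SaveMode options (assuming the df1 from the TestSave example above and hypothetical output paths; import org.apache.spark.sql.SaveMode is required):

import org.apache.spark.sql.SaveMode

// ErrorIfExists (the default): fail if the output path already contains data
df1.write.mode(SaveMode.ErrorIfExists).parquet("E:\\out1")
// Append: add the new rows to any existing data
df1.write.mode(SaveMode.Append).parquet("E:\\out2")
// Overwrite: replace any existing data at the path
df1.write.mode(SaveMode.Overwrite).parquet("E:\\out3")
// Ignore: silently skip the write if data already exists
df1.write.mode(SaveMode.Ignore).parquet("E:\\out4")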
Fifth, data sources
5.1 JSON data source
See section 4.1.
5.2 Parquet data source
See section 4.1.
5.3 MySQL data source
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TestMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestMysql").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val url = "jdbc:mysql://192.168.123.102:3306/hivedb"
    val table = "DBS"
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "root")
    // Pass in the MySQL URL, the table name, and the connection properties (database username and password)
    val df = sqlContext.read.jdbc(url, table, properties)
    df.createOrReplaceTempView("dbs")
    sqlContext.sql("SELECT * FROM dbs").show()
  }
}
Run results
5.4 Hive data source
(1) Preparation
Add the dependency to the pom.xml file:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
In the development environment, add a hive-site.xml file under the resources folder; in a cluster environment, copy the Hive configuration file to the $SPARK_HOME/conf directory.
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hivedb?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
        <!-- If MySQL and Hive are on the same server node, change hadoop02 to localhost -->
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive/warehouse</value>
        <description>hive default warehouse; change it if necessary</description>
    </property>
</configuration>
(2) Test code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TestHive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SELECT * FROM myhive.student").show()
  }
}
Run results