sparkSQL1.1 Introduction VI: The Basic Application of sparkSQL


sparkSQL1.1 splits data querying into 2 branches: sqlContext and hiveContext. In sqlContext, sparkSQL can query a defined table using SQL-92 syntax, and the source data of the table can come from:
    • RDD
    • Parquet file
    • JSON file
In hiveContext, sparkSQL can query Hive data using HQL syntax. sparkSQL1.1 supports the HQL syntax of hive-0.12; if unsupported syntax is encountered, the user can switch to SQL syntax by changing a configuration item. My guess is that, starting with spark1.1, the barrier between sqlContext and hiveContext will gradually be opened up so that tables defined in sqlContext and hiveContext can be mixed; due to time constraints this feature has not been tested here, and it will be discussed again once spark1.1 is officially released. Also, in hiveContext, hql() will be deprecated, and sql() will be used to submit query statements instead of hql().
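For reference, a minimal sketch of that change, assuming the hiveContext built in section 2 below:
// old style, deprecated from 1.1 onwards
hiveContext.hql("show tables")
// preferred from 1.1 onwards: sql() submits the HQL statement instead
hiveContext.sql("show tables")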

To facilitate the demonstration, the following examples are run and explained in spark-shell. First start the Spark cluster, then start spark-shell on the client WY:
bin/spark-shell --master spark://hadoop1:7077 --executor-memory 3g

1: sqlContext Basic Application
sqlContext first converts externally read-in data into a SchemaRDD and registers it as a table, after which it can be operated on as a table. To use sqlContext, first create the sqlContext and import its implicit functions:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

1.1: RDD
To convert an RDD into a table in sqlContext, first define a case class; during the RDD's transformation the data is then implicitly converted into a SchemaRDD, which is registered as a table with registerAsTable(). Once registered, the table can be operated on in sqlContext with SELECT, INSERT, JOIN, and so on. Note that the case class can be nested and can use complex data types such as sequences or arrays (see the small sketch below). The following example defines a case class Person that matches the layout of the data file /sparksql/people.txt, implicitly converts the file into a SchemaRDD after it is read in, registers it in sqlContext as the table rddTable, and finally queries the table for the names of people aged between 13 and 19.
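As an aside on that nested/complex-type point, a minimal sketch; the Address and Contact case classes here are hypothetical and are not used elsewhere in this article:
// case classes can be nested and can hold sequences or arrays
case class Address(city: String, street: String)
case class Contact(name: String, age: Int, address: Address, phones: Seq[String])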
The contents of /sparksql/people.txt are 3 lines:
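Assuming the file follows the people.txt shipped with the Spark examples, one "name, age" pair per line, the 3 lines would be:
Michael, 29
Andy, 30
Justin, 19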

Run the following code:
// RDD demonstration
case class Person(name: String, age: Int)
val rddpeople = sc.textFile("/sparksql/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
rddpeople.registerAsTable("rddTable")
sqlContext.sql("SELECT name FROM rddTable WHERE age >= 13 AND age <= 19").map(t => "Name: " + t(0)).collect().foreach(println)
Operation Result:
1.2: parquet file
Similarly, sqlContext can read parquet files. Because a parquet file preserves schema information, there is no need for a case class to do the implicit conversion: sqlContext reads a parquet file directly into a SchemaRDD, and a SchemaRDD can likewise be saved in parquet file format.
We first save the SchemaRDD rddpeople established above as a parquet file:
rddpeople.saveAsParquetFile("/sparksql/people.parquet")
After running this, a directory called people.parquet is added under the /sparksql/ directory:
Then read people.parquet, register it as the table parquetTable, and query for the names of people aged 25 or older:
// parquet demonstration
val parquetpeople = sqlContext.parquetFile("/sparksql/people.parquet")
parquetpeople.registerAsTable("parquetTable")
sqlContext.sql("SELECT name FROM parquetTable WHERE age >= 25").map(t => "Name: " + t(0)).collect().foreach(println)
Operation Result:
1.3: JSON file
sparkSQL1.1 begins to provide support for the JSON file format, which opens up many more data sources for developers, such as the well-known NoSQL database MongoDB. sqlContext can obtain schema information from jsonFile or jsonRDD to build a SchemaRDD, which can be used as a table once registered:
    • jsonFile - loads data from a directory of JSON files, where each line of the files is a JSON object;
    • jsonRDD - loads data from an existing RDD, where each element of the RDD is a string containing a JSON object (see the sketch after this list).
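The following is a minimal sketch of the jsonRDD path; the in-memory JSON strings and the table name jsonRDDTable are made up here for illustration:
// build a SchemaRDD from an RDD of JSON strings rather than from a file
val jsonStrings = sc.parallelize(Seq("""{"name":"Fred","age":35}""", """{"name":"Wilma","age":33}"""))
val jsonRDDPeople = sqlContext.jsonRDD(jsonStrings)
jsonRDDPeople.registerAsTable("jsonRDDTable")
sqlContext.sql("SELECT name FROM jsonRDDTable WHERE age >= 25").collect().foreach(println)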
The following example reads the JSON file /sparksql/people.json, registers it as the table jsonTable, and queries for people aged 25 or older.
Content of /sparksql/people.json:
Run the following code:
// JSON demonstration
val jsonpeople = sqlContext.jsonFile("/sparksql/people.json")
jsonpeople.registerAsTable("jsonTable")
sqlContext.sql("SELECT name FROM jsonTable WHERE age >= 25").map(t => "Name: " + t(0)).collect().foreach(println)
Operation Result:
2: hiveContext Basic Application
Before using hiveContext, first confirm the following two points:
    • the Spark build in use supports Hive;
    • Hive's configuration file hive-site.xml already exists in the conf directory.
For the former, check whether there are 3 jars starting with datanucleus in the lib directory; for the latter, make sure the uris needed to access the Hive metastore are configured in hive-site.xml.
To use hiveContext, you need to first build a hiveContext:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Hive data can then be manipulated. Here we use the sales data in Hive (the Hive data prepared in part V of this series); first switch the database to saledata and view its tables:
Hivecontext.sql ("Use Saledata") hivecontext.sql ("Show Tables"). Collect (). foreach (println)
You can see the 3 tables defined in part V:

Now query the number of sales orders and the total sales amount per year across all orders:
// number of sales orders and total sales amount per year across all orders
// three tables are joined; count(distinct a.ordernumber) gives the order count, sum(b.amount) the total sales amount
hiveContext.sql("select c.theyear, count(distinct a.ordernumber), sum(b.amount) from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber join tblDate c on a.dateid = c.dateid group by c.theyear order by c.theyear").collect().foreach(println)
Operation Result:

Next, a slightly more complex query: find the sales amount of the largest order in each year across all orders:
/************************
Sales amount of the largest order in each year across all orders:
Step 1: compute the sales amount of each order together with the date it occurred
  select a.dateid, a.ordernumber, sum(b.amount) as sumofamount
  from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber
  group by a.dateid, a.ordernumber
Step 2: use the step-1 query as a subquery, join it with tblDate, and find the largest order amount per year
  select c.theyear, max(d.sumofamount)
  from tblDate c join (the step-1 query) d on c.dateid = d.dateid
  group by c.theyear sort by c.theyear
*************************/
hiveContext.sql("select c.theyear, max(d.sumofamount) from tblDate c join (select a.dateid, a.ordernumber, sum(b.amount) as sumofamount from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber group by a.dateid, a.ordernumber) d on c.dateid = d.dateid group by c.theyear sort by c.theyear").collect().foreach(println)
Operation Result:

Finally, a more complex query: find the best-selling item of each year across all orders:
/************************
Best-selling item of each year across all orders:
Step 1: sales amount of every item in every year
  select c.theyear, b.itemid, sum(b.amount) as sumofamount
  from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber
  join tblDate c on a.dateid = c.dateid
  group by c.theyear, b.itemid
Step 2: the largest single-item sales amount in each year
  select d.theyear, max(d.sumofamount) as maxofamount
  from (the step-1 query) d
  group by d.theyear
Step 3: the items whose yearly sales amount equals that maximum, i.e. the best sellers
  select distinct e.theyear, e.itemid, f.maxofamount
  from (the step-1 query) e join (the step-2 query) f
  on (e.theyear = f.theyear and e.sumofamount = f.maxofamount)
  order by e.theyear
*************************/
hiveContext.sql("select distinct e.theyear, e.itemid, f.maxofamount from (select c.theyear, b.itemid, sum(b.amount) as sumofamount from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber join tblDate c on a.dateid = c.dateid group by c.theyear, b.itemid) e join (select d.theyear, max(d.sumofamount) as maxofamount from (select c.theyear, b.itemid, sum(b.amount) as sumofamount from tblStock a join tblStockDetail b on a.ordernumber = b.ordernumber join tblDate c on a.dateid = c.dateid group by c.theyear, b.itemid) d group by d.theyear) f on (e.theyear = f.theyear and e.sumofamount = f.maxofamount) order by e.theyear").collect().foreach(println)
Operation Result:
3: Mixed Use of Data from Different Data Sources
Tables from different data sources can be mixed within sqlContext, and likewise within hiveContext, but tables from sqlContext and hiveContext cannot yet be mixed with each other.
Mixed use in sqlContext:
// sqlContext mixed use: join the table rddTable (from an RDD) with the table parquetTable (from a parquet file)
sqlContext.sql("select a.name, a.age, b.age from rddTable a join parquetTable b on a.name = b.name").collect().foreach(println)
Operation Result:
Mixed use in hiveContext:
// hiveContext mixed use
// create a hive table hiveTable and load data into it; note that the second column of people.txt contains a space, so age is given string type
hiveContext.sql("CREATE TABLE hiveTable(name string, age string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
hiveContext.sql("LOAD DATA LOCAL INPATH '/home/mmicky/mboo/myclass/doc/sparksql/data/people.txt' INTO TABLE hiveTable")
// create a table parquetTable2 from the parquet file, then mix it with hiveTable
hiveContext.parquetFile("/sparksql/people.parquet").registerAsTable("parquetTable2")
hiveContext.sql("select a.name, a.age, b.age from hiveTable a join parquetTable2 b on a.name = b.name").collect().foreach(println)
Operation Result:
However, tables defined in sqlContext cannot be mixed with tables defined in hiveContext.
4: sparkSQL Cache Usage
sparkSQL's cache can be used in two ways:
    • the cacheTable() method
    • the CACHE TABLE command
Never cache the SchemaRDD first and then call registerAsTable: the RDD's cache() uses the native RDD cache rather than the in-memory columnar store that is optimized for SQL. Looking at the source code of cacheTable():
By default, compression of the in-memory columnar store is turned off; the variable COMPRESS_CACHED needs to be configured to enable compression.
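A minimal sketch of turning compression on, assuming the spark.sql.inMemoryColumnarStorage.compressed key that backs COMPRESS_CACHED in the 1.1 source:
// enable compression for the in-memory columnar store before caching any tables
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")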
In sqlContext, the cache can be used as follows:
// sqlContext cache usage
sqlContext.cacheTable("rddTable")
sqlContext.sql("SELECT name FROM rddTable WHERE age >= 13 AND age <= 19").map(t => "Name: " + t(0)).collect().foreach(println)
sqlContext.sql("CACHE TABLE parquetTable")
sqlContext.sql("SELECT name FROM parquetTable WHERE age >= 13 AND age <= 19").map(t => "Name: " + t(0)).collect().foreach(println)
Observe the webUI and you can see the cache information. (Note that caching is lazy and only takes effect after an action runs; uncaching is eager and takes effect immediately.)
Use the following command to cancel the cache:
sqlContext.uncacheTable("rddTable")
sqlContext.sql("UNCACHE TABLE parquetTable")
Similarly, the methods above can also be used in hiveContext to cache or uncache tables.
5: sparkSQL DSL Usage
Besides supporting HiveQL and SQL-92 syntax, sparkSQL also supports a DSL (Domain Specific Language). In the DSL, columns of the underlying table are represented with Scala symbols, i.e. identifiers prefixed with a single quote such as 'name, which Spark's execution engine implicitly converts into expressions. Many DSL-related methods can be found in the API, such as where(), select(), limit(), and so on; for details see the DSL submodule of the catalyst module, where several commonly used methods are defined:
The DSL is easy to get started with; for example, combining a few DSL methods:
// DSL demonstration
val teenagers_dsl = rddpeople.where('age >= 13).where('age <= 19).select('name)
teenagers_dsl.map(t => "Name: " + t(0)).collect().foreach(println)

6: Tips
The above describes the basic application of sparkSQL. sparkSQL is still developing rapidly and has quite a few shortcomings, such as:
    • scala2.10.4 itself limits case classes to 22 fields, which is inconvenient when using an RDD as the data source;
    • 3 tables cannot be joined in a single statement in sqlContext; you have to join two of them first and then join the result with the third (see the sketch after this list);
    • data cannot be inserted directly with VALUES in sqlContext;
    • ...
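For the 3-table limitation, here is a hedged sketch of the two-at-a-time workaround; it reuses the rddTable, parquetTable and jsonTable tables registered earlier, and the intermediate table name firstJoinTable is made up:
// step 1: join the first two tables and register the intermediate result as a table
val firstJoin = sqlContext.sql("select a.name, a.age from rddTable a join parquetTable b on a.name = b.name")
firstJoin.registerAsTable("firstJoinTable")
// step 2: join the intermediate table with the third table
sqlContext.sql("select f.name, f.age, c.age from firstJoinTable f join jsonTable c on f.name = c.name").collect().foreach(println)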
Overall, hiveContext is quite satisfactory, while sqlContext still leaves something to be desired. In addition, one note in passing: when writing a sqlContext application, the case class must be defined outside of the object.
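A minimal sketch of that layout (the object name SparkSQLApp is made up):
// the case class is defined at the top level, outside of the object
case class Person(name: String, age: Int)

object SparkSQLApp {
  def main(args: Array[String]) {
    // build the SparkContext / SQLContext here and use Person as shown above
  }
}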
This content will be incorporated as new course material into the Spark Big Data Fast Computing Platform (Phase III) course.
