Spark SQL Operations Explained in Detail

One. Spark SQL and SchemaRDD

There is not much to say here about Spark SQL itself; we are only concerned with how to operate it. But the first thing to figure out is: what is a SchemaRDD? From Spark's Scala API you can find org.apache.spark.sql.SchemaRDD, declared as class SchemaRDD extends RDD[Row] with SchemaRDDLike, so the class SchemaRDD inherits from the abstract class RDD. The official documentation defines it as "an RDD of Row objects that has an associated schema. In addition to standard RDD functions, SchemaRDDs can be used in relational queries". Translated directly: a SchemaRDD is made up of Row objects together with a schema that describes the data type of each column in a row. In my view, a SchemaRDD is a special RDD provided by Spark SQL whose main purpose is SQL querying; therefore, before querying, an ordinary RDD needs to be converted into a SchemaRDD. Put more plainly, a SchemaRDD can be compared to a table in a traditional relational database.

Spark SQL can handle data formats such as Hive, JSON, and Parquet (a columnar storage format), which means that a SchemaRDD can be created from any of these sources. We can access Spark SQL through JDBC/ODBC, a Spark application, or the Spark shell, and the data read through Spark SQL can then be passed on to data mining tools or data visualization tools such as Tableau.

Two. Spark SQL Operations on a Text File

The first thing to note is that, starting with Spark 1.3, SchemaRDD was renamed DataFrame. Anyone who has used the pandas library in Python will find DataFrame familiar; intuitively, it is simply a table. A DataFrame is still essentially a SchemaRDD; the API change just means the way we operate Spark SQL changes accordingly. The experiments below use Spark 1.3.0.

1. Create the SQLContext

Create the SQLContext from the SparkContext (sc) as follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

Analysis:

    • The first line: sc refers to org.apache.spark.SparkContext; when the Spark shell starts, the built-in object sc has already been created for us, much like the built-in objects in a Java web container.
    • The second line: the import brings in the implicit conversions that turn an RDD into a DataFrame (i.e., a SchemaRDD).

2. Define a Case Class

We define the case class as shown below:

case class Person(name: String, age: Int)

Analysis: the parameter names of the case class are read via reflection and become the names of the columns. A case class can be nested or contain complex data types such as Sequences or Arrays, as the sketch below illustrates.
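For example, here is a small sketch of a nested case class with a sequence field being converted to a DataFrame; the Address and Contact classes are hypothetical and not part of the walkthrough below:

// A sketch of a nested case class with a sequence field (hypothetical names).
// Assumes sqlContext and its implicits are already in scope, as in step 1.
case class Address(city: String, zip: String)
case class Contact(name: String, age: Int, address: Address, phones: Seq[String])

val contacts = sc.parallelize(Seq(
  Contact("Michael", 29, Address("Beijing", "100000"), Seq("123", "456"))
)).toDF()
contacts.printSchema()   // address appears as a struct column, phones as an array column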

3. Create the DataFrame

Create the DataFrame as follows:

val rddPerson = sc.textFile("/home/essex/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()

Analysis:

    • Through the RDD transformations above, the case class is implicitly converted into a DataFrame (i.e., rddPerson).
    • The file people.txt contains the records Michael, 29; Andy, 30; Justin, 19 (written on one line here for layout; in the actual file each <name, age> pair is on its own line, as shown below).
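For reference, a people.txt matching those records would look like this:

Michael, 29
Andy, 30
Justin, 19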

4. Register as a Table

rddPerson.registerTempTable("rddtable")

Analysis: we register rddPerson as a table named rddtable in the SQLContext. Once the table is registered, we can operate on it with SQL, for example SELECT, INSERT, or JOIN.

5. Query Operations

sqlContext.sql("SELECT name FROM rddtable WHERE age >= 13 AND age <= 19").map(t => "Name: " + t(0)).collect().foreach(println)

Analysis: find the names of the people aged between 13 and 19.

Summary: with the steps above, the basic Spark SQL workflow is to first create an SQLContext and define a case class, then convert the RDD into a DataFrame implicitly through its transformations, and finally register the DataFrame as a table in the SQLContext so that the table can be queried with SQL.
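For convenience, the five steps can be run end to end in the spark-shell as a single block (a sketch against Spark 1.3.0, assuming /home/essex/people.txt contains the records listed above):

// End-to-end sketch of steps 1-5 (Spark 1.3.0, spark-shell).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)

// Read the text file, map each line to a Person, and convert it to a DataFrame.
val rddPerson = sc.textFile("/home/essex/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Register the DataFrame as a table and query it with SQL.
rddPerson.registerTempTable("rddtable")
sqlContext.sql("SELECT name FROM rddtable WHERE age >= 13 AND age <= 19")
  .map(t => "Name: " + t(0))
  .collect()
  .foreach(println)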

Three. Spark SQL Operations on a Parquet File
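A minimal sketch of writing and reading Parquet with the Spark 1.3 API, reusing the rddPerson DataFrame from the previous section; the output path is a hypothetical example:

// Write the DataFrame as a Parquet file, then load it back (Spark 1.3 API).
rddPerson.saveAsParquetFile("/home/essex/people.parquet")
val parquetPerson = sqlContext.parquetFile("/home/essex/people.parquet")
parquetPerson.registerTempTable("parquettable")
sqlContext.sql("SELECT name FROM parquettable WHERE age >= 13 AND age <= 19").collect().foreach(println)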

Four. Spark SQL Operations on a JSON File
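A minimal sketch of loading JSON with the Spark 1.3 API; the file path and its contents (one JSON object per line) are hypothetical examples:

// Load a JSON file; the schema is inferred from the records (Spark 1.3 API).
// Assumes /home/essex/people.json contains one object per line, e.g. {"name":"Michael","age":29}.
val jsonPerson = sqlContext.jsonFile("/home/essex/people.json")
jsonPerson.printSchema()
jsonPerson.registerTempTable("jsontable")
sqlContext.sql("SELECT name FROM jsontable WHERE age >= 13 AND age <= 19").collect().foreach(println)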

Five. Spark SQL Operations via JDBC
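A minimal sketch of loading a database table over JDBC with the Spark 1.3 data source API; the MySQL URL, credentials, table name, and the presence of the JDBC driver on the classpath are all assumptions:

// Load a database table as a DataFrame over JDBC (Spark 1.3 API).
// The URL, credentials, and table name are hypothetical; the MySQL driver jar
// must be on the classpath (e.g. via --driver-class-path when starting the shell).
val jdbcPerson = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/test?user=root&password=root",
  "dbtable" -> "person"))
jdbcPerson.registerTempTable("jdbctable")
sqlContext.sql("SELECT name FROM jdbctable WHERE age >= 13 AND age <= 19").collect().foreach(println)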

Six. HiveContext Explained in Detail
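A minimal sketch of creating and using a HiveContext; it assumes a Spark build with Hive support and an available Hive metastore (or the default local one):

// HiveContext is a superset of SQLContext that also understands HiveQL (Spark 1.3 API).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)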

Seven. Other Advanced Spark SQL Operations


