Spark customization 5: Instructions for use

Background

spark-shell is an interactive Scala programming and execution environment, and you can write programs in it to handle complicated processing logic. However, for simple SQL-style data processing such as grouping and summation, the SQL statement is just "select g, count(1) from sometable group by g", while the program you have to write is:

val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
val rdd = hql("select g, count(1) from sometable group by g")
rdd.collect

This seems tedious. For users who only care about the business data, Spark brings along too many extra tools.

Easy to submit SQL commands

You can use spark-shell's -i parameter to specify a startup script, which saves the first two lines: the hive variable definition and the import.
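For example, the two boilerplate lines can live in a startup script loaded with -i (the script name here is illustrative; the fragment runs inside spark-shell, where sc is already defined):

```scala
// hive-init.scala (illustrative name), loaded with: spark-shell -i hive-init.scala
// sc is the SparkContext that spark-shell predefines.
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
```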

Using an object-oriented style, the last two lines merge into hql("select g, count(1) from sometable group by g").collect.

With a Scala implicit conversion, this can be reduced to "select g, count(1) from sometable group by g".hqlgo.

Using Scala's ability to omit the dot and parentheses, you can write "select g, count(1) from sometable group by g" hqlgo.

The simplified statement: "select g, count(1) from sometable group by g" hqlgo

If the statement spans multiple lines, you can write it as follows:

"""
select g, count(1)
from sometable
group by g
""" hqlgo
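A minimal sketch of how such a hqlgo method can be pinned onto String with an implicit class. Here runHql is a hypothetical stand-in for the real hql(...).collect pipeline, so the sketch compiles and runs without Spark:

```scala
import scala.language.postfixOps

// Sketch: enrich String with a hqlgo method via an implicit class.
// runHql is a stand-in for hql(...).collect; it just normalizes the SQL text.
object HqlDsl {
  def runHql(sql: String): Seq[String] =
    Seq("executed: " + sql.trim.replaceAll("\\s+", " "))

  implicit class HqlString(val sql: String) {
    def hqlgo: Seq[String] = runHql(sql)
  }
}

import HqlDsl._

// Ordinary method-call style:
val r1 = "select g, count(1) from sometable group by g".hqlgo

// Postfix style with dot and parentheses omitted, as in the customized shell:
val r2 = """
select g, count(1)
from sometable
group by g
""" hqlgo

println(r1 == r2)
```

The postfix form needs the scala.language.postfixOps import (or the equivalent compiler flag); the single-line and multi-line statements run identically once whitespace is normalized.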

Easy to save results

The program you would have to write to save query results:

val rdd = hql("select g, count(1) from sometable group by g")
rdd.saveAsTextFile("hdfs:/somedir")

Similar to the preceding statement, the simplified form is: "select g, count(1) from sometable group by g" saveto "hdfs:/somedir"

Multi-line format:

"""
select g, count(1)
from sometable
group by g""" saveto "hdfs:/somedir"

Note:

1) In the multi-line form, saveto cannot be split from the statement before it, nor from the path after it; they must stay on the same line.

2) If the result is saved to a local file, the file name should include an extension suffix.

3) The output format of the original Spark implementation has a problem: Hive cannot parse the written data structure correctly. This has been fixed in the customized version.
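The saveto operator can be sketched the same way. The real version would run hql(...) and call saveAsTextFile against HDFS; in this self-contained sketch, runHql is a stand-in and the rows go to a local file:

```scala
import java.io.{File, PrintWriter}

// Sketch: a saveto infix operator via an implicit class.
// runHql stands in for hql(...); a local file stands in for HDFS output.
object SaveDsl {
  def runHql(sql: String): Seq[String] =
    Seq("executed: " + sql.trim.replaceAll("\\s+", " "))

  implicit class SaveString(val sql: String) {
    def saveto(path: String): Unit = {
      val out = new PrintWriter(new File(path))
      try runHql(sql).foreach(out.println)
      finally out.close()
    }
  }
}

import SaveDsl._

// Infix call: a one-argument method needs no dot or parentheses.
"select g, count(1) from sometable group by g" saveto "result.txt"
```

Because saveto takes exactly one argument, ordinary infix notation applies and no postfixOps import is required.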

Easy to create in-memory tables from files

Here SQL is run against data stored in HDFS. If you want to create a table in hive, you can use "create external table ..." hqlgo. If you only want to create an in-memory table for data processing, the program to write is:

val rdd = sc.textFile("hdfs:/somedir")
case class SomeClass(name: String, age: Int, weight: Double)
val schemaRdd = rdd.map(_.split("\t")).map(t => SomeClass(t(0), t(1).toInt, t(2).toDouble))
hive.registerRDDAsTable(schemaRdd, "sometable")
hql("select g, count(1) from sometable group by g").collect

Simplified statement:

"create table sometable (name string, age int, weight double)" from "hdfs:/somedir"

"select g, count(1) from sometable group by g" hqlgo

Multi-line format:

"""
create table sometable (
name string,
age int,
weight double)
""" from "hdfs:/somedir"

"select g, count(1) from sometable group by g" hqlgo

Note:

1) "create table" must be written exactly as shown: there must be a single space between "create" and "table".

2) The output path must contain 24 or more characters, to avoid accidentally overwriting a large directory.
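One way the from operator could be wired up, again sketched without Spark: the DDL string is parsed for the table name, a mutable Map stands in for the hive table registry, and a local tab-separated file stands in for the HDFS directory (the Ddl regex, the tables map, and the file format are all assumptions of this sketch):

```scala
import scala.io.Source

// Sketch: a from operator that registers an in-memory "table".
object TableDsl {
  // Stand-in for Spark's table registry: table name -> rows.
  val tables = scala.collection.mutable.Map[String, Seq[Array[String]]]()

  // Extract the table name from "create table <name> (<columns>)".
  private val Ddl = """(?s)\s*create table\s+(\w+)\s*\((.*)\)\s*""".r

  implicit class DdlString(val ddl: String) {
    def from(path: String): Unit = ddl match {
      case Ddl(name, _) =>
        // Read tab-separated rows from the file and register them.
        tables(name) = Source.fromFile(path).getLines().map(_.split("\t")).toList
      case _ =>
        sys.error("expected: create table <name> (<columns>)")
    }
  }
}
```

The strict "create table" spelling in note 1 corresponds to the fixed prefix the pattern expects.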

Result check

The calculation result may be a data table, or output to a file.

Data table check: "sometable" isok

File check: "somefile.txt" isok

Directory check: "hdfs:/somedir" isok

The check verifies that a file is not empty (its length is greater than 0), that a directory is not empty and contains files whose length is greater than 0, and that a data table has more than 0 records.

Note:

1) A file name should include an extension suffix. An input string containing "." or "/" is treated as a file or directory, never as a data table.

2) If you want to keep query results in memory, use val data = "select * from testperson" hqlresult. To view results held in memory, use do show data.
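The isok check described above can be sketched with the same implicit-class pattern; local paths stand in for HDFS, and a real version would additionally look up table names in the registry:

```scala
import java.io.File

// Sketch: an isok check for files and directories.
object CheckDsl {
  implicit class CheckString(val path: String) {
    def isok: Boolean = {
      val f = new File(path)
      if (f.isFile) f.length > 0                                // non-empty file
      else if (f.isDirectory)                                   // contains a non-empty file
        f.listFiles.exists(c => c.isFile && c.length > 0)
      else false                                                // path does not exist
    }
  }
}
```

Usage follows the article's style, e.g. "somefile.txt".isok, or with postfix notation enabled, "somefile.txt" isok.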

Customize spark startup

/sysdir/spark-1.0.0/bin/myspark

Enter help to get help.

 
