Background
Spark-shell is a Scala REPL execution environment in which complicated processing logic can be programmed. However, for simple SQL-style data processing such as grouping and summation, where the SQL statement is "select g, count(1) from sometable group by g", the program to be written is:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
val rdd = hql("select g, count(1) from sometable group by g")
rdd.collect
This is tedious. For users who only care about the business data, there is too much Spark plumbing attached.
Easy to submit SQL commands
You can use the spark-shell -i parameter to specify a startup script, which saves the first two lines: the hive variable definition and the import.
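For instance, the startup script could contain just those two lines (the file name startup.scala is hypothetical; the script runs inside spark-shell, where sc is predefined):

```scala
// startup.scala — load with: spark-shell -i startup.scala
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
```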
Method chaining merges the last two lines into hql("select g, count(1) from sometable group by g").collect.
With a Scala implicit conversion, this can then be written simply as "select g, count(1) from sometable group by g".hqlgo.
Since Scala allows the dot and parentheses to be omitted, this can be written as "select g, count(1) from sometable group by g" hqlgo.
Simplified statement: "select g, count(1) from sometable group by g" hqlgo
If the SQL spans multiple lines, it can be written as follows:
"""
select g, count(1)
from sometable
group by g
""" hqlgo
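The hqlgo notation can be enabled by an implicit conversion defined in the startup script. Below is a minimal self-contained sketch of the mechanism; the names HqlString and toHqlString are assumptions, and the method body is a stand-in that returns the trimmed SQL so the sketch runs without a Spark cluster — the real version would call hive.hql(sql).collect:

```scala
import scala.language.implicitConversions
import scala.language.postfixOps

object HqlDsl {
  // Wrapper class that adds an hqlgo method to plain strings.
  class HqlString(val sql: String) {
    // Stand-in body: the real implementation would run hive.hql(sql).collect.
    def hqlgo: String = sql.trim
  }
  // Implicit conversion: any String gains the hqlgo method.
  implicit def toHqlString(s: String): HqlString = new HqlString(s)
}

object Demo extends App {
  import HqlDsl._
  // Dot and parentheses can be omitted, so postfix notation works:
  val r = "select g, count(1) from sometable group by g" hqlgo;
  println(r)
}
```

The postfixOps import is needed in recent Scala versions for the dot-free "… hqlgo" form.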
Easy to save results
The program to be written for saving query results:
val rdd = hql("select g, count(1) from sometable group by g")
rdd.saveAsTextFile("hdfs:/somedir")
Similar to the preceding statement, the simplified form is: "select g, count(1) from sometable group by g" saveto "hdfs:/somedir"
Multi-line format:
"""
select g, count(1)
from sometable
group by g
""" saveto "hdfs:/somedir"
Note:
1) When the SQL spans multiple lines, the saveto keyword must stay on one line together with the closing quote before it and the path after it; neither part may be wrapped onto another line.
2) If the result is saved to a local file, the file name should include an extension suffix.
3) The output format of the original Spark implementation had a problem: Hive could not parse the resulting data structure correctly. This has been fixed in the customized version.
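A saveto method can live on the same implicit string wrapper as hqlgo. The sketch below uses a stand-in body that only describes the action, so it runs anywhere; the real version would call hql(sql).saveAsTextFile(path), and the object and method names other than saveto are assumptions:

```scala
import scala.language.implicitConversions

object SaveDsl {
  class HqlString(val sql: String) {
    // Stand-in for hql(sql).saveAsTextFile(path): returns a description
    // of what would happen instead of writing to HDFS.
    def saveto(path: String): String = s"${sql.trim} -> $path"
  }
  implicit def toHqlString(s: String): HqlString = new HqlString(s)
}

object SaveDemo extends App {
  import SaveDsl._
  // Infix notation: "sql" saveto "path"
  println("select g, count(1) from sometable group by g" saveto "hdfs:/somedir")
}
```

Because saveto takes one argument, infix notation works without any extra language import.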
Easy to create an in-memory table from a file
To run SQL against data in HDFS: if you want to create a table in Hive, use "create external table ..." hqlgo. If you only need an in-memory table for processing, you would otherwise have to write:
val rdd = sc.textFile("hdfs:/somedir")
case class SomeClass(name: String, age: Int, weight: Double)
val schemaRDD = rdd.map(_.split("\t")).map(t => SomeClass(t(0), t(1).toInt, t(2).toDouble))
hive.registerRDDAsTable(schemaRDD, "sometable")
hql("select g, count(1) from sometable group by g").collect
Simplified statement:
"create table sometable (name string, age int, weight double)" from "hdfs:/somedir"
"select g, count(1) from sometable group by g" hqlgo
Multi-line format:
"""
create table sometable (
  name string,
  age int,
  weight double)
""" from "hdfs:/somedir"
"select g, count(1) from sometable group by g" hqlgo
Note:
1) "create table" must be written exactly here: a single space between "create" and "table", because the statement is matched textually.
2) The path must contain 24 or more characters; this is a safeguard against accidentally overwriting a high-level directory.
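The "create table ... from path" form implies parsing the table name and column list out of the statement text. The following is a hypothetical parsing sketch, not the tool's actual code; the real tool would then read the file, split each line on \t, cast each field to its column type, and register the result as an in-memory table:

```scala
object CreateTableParser {
  // Matches e.g. "create table sometable (name string, age int, weight double)".
  // "create table" is matched literally with one space, which explains the
  // strict-spelling note above. (?s) lets the column list span multiple lines.
  private val Stmt = """(?s)\s*create table\s+(\w+)\s*\((.*)\)\s*""".r

  // Returns the table name and the (column, type) pairs, or None if the
  // statement does not have the expected shape.
  def parse(sql: String): Option[(String, Seq[(String, String)])] = sql match {
    case Stmt(table, cols) =>
      val pairs = cols.split(",").toSeq.map { c =>
        val Array(name, tpe) = c.trim.split("\\s+")
        (name, tpe)
      }
      Some((table, pairs))
    case _ => None
  }
}
```

With the parsed schema in hand, the tool can build the equivalent of the case-class-and-registerRDDAsTable program shown earlier.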
Result check
The calculation result may be in a data table or in an output file.
Data table check: "sometable" isok
File check: "somefile.txt" isok
Directory check: "hdfs:/somedir" isok
The check verifies that a file is not empty (length greater than 0), that a directory is not empty and contains files of length greater than 0, and that a data table has more than 0 records.
Note:
1) A file name should include an extension suffix: an input string containing "." or "/" is treated as a file or directory, and anything else as a data table.
2) To keep query results in memory, use val data = "select * from testperson" hqlresult, and view the in-memory result with do show data.
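The classification rule in note 1 can be sketched on its own. This is a hypothetical illustration (the type and method names are assumptions); a real isok would go on to test file length, directory contents, or table record count as described above:

```scala
object IsOkCheck {
  sealed trait Kind
  case object FileOrDir extends Kind
  case object Table extends Kind

  // Rule from the notes: a string containing "." or "/" is treated as a
  // file or directory; anything else is treated as a data table name.
  def classify(target: String): Kind =
    if (target.contains(".") || target.contains("/")) FileOrDir else Table
}
```

This is why a file passed to isok must carry an extension suffix: without a "." or "/" the string would be mistaken for a table name.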
Customized Spark startup
/sysdir/spark-1.0.0/bin/myspark
Enter help to get help.