Background
Spark-shell is a Scala REPL execution environment in which complicated processing logic can be programmed. However, for simple SQL-style data processing such as grouping and summation, where the SQL statement is "select g, count(1) from sometable group by g", the program to be written is:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
val rdd = hql("select g, count(1) from sometable group by g")
rdd.collect
This is tedious. For users who only care about the business data, there is too much Spark plumbing attached.
Easy to submit SQL commands
You can use the spark-shell -i parameter to specify a startup script, which saves the first two lines: the hive variable definition and the import.
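For instance, the startup script could contain just those two lines (the file name startup.scala is hypothetical; the script runs inside spark-shell, where sc is predefined):

```scala
// startup.scala — load with: spark-shell -i startup.scala
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
import hive._
```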
Method chaining merges the last two lines into hql("select g, count(1) from sometable group by g").collect.
With a Scala implicit conversion, this can then be written simply as "select g, count(1) from sometable group by g".hqlgo.
Since Scala allows the dot and parentheses to be omitted, this can be written as "select g, count(1) from sometable group by g" hqlgo.
Simplified statement: "select g, count(1) from sometable group by g" hqlgo
If the SQL spans multiple lines, it can be written as follows:
"""
select g, count(1)
from sometable
group by g
""" hqlgo
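The hqlgo notation can be enabled by an implicit conversion defined in the startup script. Below is a minimal self-contained sketch of the mechanism; the names HqlString and toHqlString are assumptions, and the method body is a stand-in that returns the trimmed SQL so the sketch runs without a Spark cluster — the real version would call hive.hql(sql).collect:

```scala
import scala.language.implicitConversions
import scala.language.postfixOps

object HqlDsl {
  // Wrapper class that adds an hqlgo method to plain strings.
  class HqlString(val sql: String) {
    // Stand-in body: the real implementation would run hive.hql(sql).collect.
    def hqlgo: String = sql.trim
  }
  // Implicit conversion: any String gains the hqlgo method.
  implicit def toHqlString(s: String): HqlString = new HqlString(s)
}

object Demo extends App {
  import HqlDsl._
  // Dot and parentheses can be omitted, so postfix notation works:
  val r = "select g, count(1) from sometable group by g" hqlgo;
  println(r)
}
```

The postfixOps import is needed in recent Scala versions for the dot-free "… hqlgo" form.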
Easy to save results
The program to be written for saving query results:
val rdd = hql("select g, count(1) from sometable group by g")
rdd.saveAsTextFile("hdfs:/somedir")
Similar to the preceding statement, the simplified form is: "select g, count(1) from sometable group by g" saveto "hdfs:/somedir"
Multi-line format:
"""
select g, count(1)
from sometable
group by g
""" saveto "hdfs:/somedir"
Note:
1) When the SQL spans multiple lines, the saveto keyword must stay on one line together with the closing quote before it and the path after it; neither part may be wrapped onto another line.
2) If the result is saved to a local file, the file name should include an extension suffix.
3) The output format of the original Spark implementation had a problem: Hive could not parse the resulting data structure correctly. This has been fixed in the customized version.
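A saveto method can live on the same implicit string wrapper as hqlgo. The sketch below uses a stand-in body that only describes the action, so it runs anywhere; the real version would call hql(sql).saveAsTextFile(path), and the object and method names other than saveto are assumptions:

```scala
import scala.language.implicitConversions

object SaveDsl {
  class HqlString(val sql: String) {
    // Stand-in for hql(sql).saveAsTextFile(path): returns a description
    // of what would happen instead of writing to HDFS.
    def saveto(path: String): String = s"${sql.trim} -> $path"
  }
  implicit def toHqlString(s: String): HqlString = new HqlString(s)
}

object SaveDemo extends App {
  import SaveDsl._
  // Infix notation: "sql" saveto "path"
  println("select g, count(1) from sometable group by g" saveto "hdfs:/somedir")
}
```

Because saveto takes one argument, infix notation works without any extra language import.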
Easy to create an in-memory table from a file
To run SQL against data in HDFS: if you want to create a table in Hive, use "create external table ..." hqlgo. If you only need an in-memory table for processing, you would otherwise have to write:
val rdd = sc.textFile("hdfs:/somedir")
case class SomeClass(name: String, age: Int, weight: Double)
val schemaRDD = rdd.map(_.split("\t")).map(t => SomeClass(t(0), t(1).toInt, t(2).toDouble))
hive.registerRDDAsTable(schemaRDD, "sometable")
hql("select g, count(1) from sometable group by g").collect
Simplified statement:
"create table sometable (name string, age int, weight double)" from "hdfs:/somedir"
"select g, count(1) from sometable group by g" hqlgo
Multi-line format:
"""
create table sometable (
  name string,
  age int,
  weight double)
""" from "hdfs:/somedir"
"select g, count(1) from sometable group by g" hqlgo
Note:
1) "create table" must be written exactly here: a single space between "create" and "table", because the statement is matched textually.
2) The path must contain 24 or more characters; this is a safeguard against accidentally overwriting a high-level directory.
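The "create table ... from path" form implies parsing the table name and column list out of the statement text. The following is a hypothetical parsing sketch, not the tool's actual code; the real tool would then read the file, split each line on \t, cast each field to its column type, and register the result as an in-memory table:

```scala
object CreateTableParser {
  // Matches e.g. "create table sometable (name string, age int, weight double)".
  // "create table" is matched literally with one space, which explains the
  // strict-spelling note above. (?s) lets the column list span multiple lines.
  private val Stmt = """(?s)\s*create table\s+(\w+)\s*\((.*)\)\s*""".r

  // Returns the table name and the (column, type) pairs, or None if the
  // statement does not have the expected shape.
  def parse(sql: String): Option[(String, Seq[(String, String)])] = sql match {
    case Stmt(table, cols) =>
      val pairs = cols.split(",").toSeq.map { c =>
        val Array(name, tpe) = c.trim.split("\\s+")
        (name, tpe)
      }
      Some((table, pairs))
    case _ => None
  }
}
```

With the parsed schema in hand, the tool can build the equivalent of the case-class-and-registerRDDAsTable program shown earlier.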
Result check
The calculation result may be in a data table or in an output file.
Data table check: "sometable" isok
File check: "somefile.txt" isok
Directory check: "hdfs:/somedir" isok
The check verifies that a file is not empty (length greater than 0), that a directory is not empty and contains files of length greater than 0, and that a data table has more than 0 records.
Note:
1) A file name should include an extension suffix: an input string containing "." or "/" is treated as a file or directory, and anything else as a data table.
2) To keep query results in memory, use val data = "select * from testperson" hqlresult, and view the in-memory result with do show data.
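The classification rule in note 1 can be sketched on its own. This is a hypothetical illustration (the type and method names are assumptions); a real isok would go on to test file length, directory contents, or table record count as described above:

```scala
object IsOkCheck {
  sealed trait Kind
  case object FileOrDir extends Kind
  case object Table extends Kind

  // Rule from the notes: a string containing "." or "/" is treated as a
  // file or directory; anything else is treated as a data table name.
  def classify(target: String): Kind =
    if (target.contains(".") || target.contains("/")) FileOrDir else Table
}
```

This is why a file passed to isok must carry an extension suffix: without a "." or "/" the string would be mistaken for a table name.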
Customized Spark startup
/sysdir/spark-1.0.0/bin/myspark
Enter help to get help.