Summary of SPARKSQL related statements

Source: Internet
Author: User
Tags hadoop fs

1.in does not support sub-query eg. SELECT * from SRC where key in (select key from Test);
Number of support queries eg. SELECT * from SRC where key in (1,2,3,4,5);
In 40,000 time 25.766 seconds
In 80,000 time 78.827 seconds

2.union all/union
The top-level union ALL eg is not supported. Select key from src UNION all select key from test;
Support for SELECT * FROM (select Key from src UNION ALL select key from test) AA;
Union not supported
Support for SELECT DISTINCT key from (select Key from src UNION ALL select key from test) AA;

3.intersect not supported

4.minus not supported

5.except not supported

6.inner join/join/left Outer join/right outer join/full outer Join/left semi Join support
Left Outer join/right Outer join/full outer join must have outer in the middle
Join is the simplest associative operation, with the intersection of the two sides only;
The left outer join is driven by the right table, and the key that does not exist on the table is assigned null;
The right outer join is driven by an rvalue, and the left table does not exist with a key assigned to null;
Full OUTER join All table Association, the two tables complete Cartesian product operation, the left and right table can be assigned to null;
The main use scene of the left semi join is to solve exist in;
Hive does not support subqueries in the WHERE clause, and SQL common exist in clauses are not supported in hive
Subqueries are not supported eg. SELECT * from src AA where Aa.key in (select Bb.key from test BB);
It can be replaced in the following two ways:
SELECT * from src AA left outer join test BB on Aa.key=bb.key where bb.key <> null;
SELECT * from src AA left semi join test BB on Aa.key=bb.key;
In most cases, JOIN on and left semi on are equivalent
A, b Two table connection, if the B table exists duplicate data
When a join is used, A/b table will correlate two records, which should be in accordance with the on condition;
Instead of using the left SEMI JOIN, when the record in table A is returned after a match is made on table B, no further lookup of the B table record is done.
Therefore, if the B table is duplicated, it will not produce duplicate multiple records.
The left OUTER join supports subqueries eg. Select aa.* from src AA left outer joins (SELECT * from test111) BB on AA.KEY=BB.A;

7. How data is imported in hive four
1) Import data from the local file system to the Hive table
CREATE TABLE Wyp (ID int,name string) ROW FORMAT delimited fields terminated by ' \ t ' STORED as textfile;
Load data local inpath ' wyp.txt ' into table wyp;
2) Import data from HDFs to hive table
[Email protected]/home/q/hadoop-2.2.0]$ Bin/hadoop fs-cat/home/wyp/add.txt
hive> load Data inpath '/home/wyp/add.txt ' into table wyp;
3) query the corresponding data from other tables and import them into the hive table
Hive> CREATE TABLE Test (
> ID int, name string
>, tel String)
> Partitioned by
> (age int)
> ROW FORMAT Delimited
> Fields TERMINATED by ' \ t '
> STORED as Textfile;

Note: The test table uses age as the partition field, Partition: In hive, the corresponding directory under each partition of the table, the data for all partitions is stored in the corresponding directory.
For example, the WYP table has a DT and city two partitions, the corresponding DT=20131218CITY=BJ table directory is/USER/HIVE/WAREHOUSE/DT=20131218/CITY=BJ,
All data belonging to this partition is stored in this directory.

hive> INSERT INTO table test
> Partition (age= ' 25 ')
> select ID, Name, tel
> from WYP;

You can also specify partitions dynamically in a SELECT statement by using partition values:
Hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO table test
> Partition (age)
> select ID, Name,
> Tel, age
> from WYP;

Hive also supports insert overwrite mode for inserting data
hive> Insert Overwrite table test
> PARTITION (age)
> select ID, Name, tel, age
> from WYP;

Hive also supports multiple table insertions
Hive> from WYP
> INSERT INTO Table Test
> Partition (age)
> select ID, Name, tel, age
> INSERT INTO Table test3
> select ID, Name
> Where age>25;
4) When creating a table, by querying the corresponding record from another table and inserting it into the created table
Hive> CREATE TABLE Test4
> As
> select ID, Name, tel
> from WYP;

8. View the Build Table statement
Hive> Show CREATE TABLE test3;

9. Table renaming
hive> ALTER TABLE Events RENAME to 3KOOBECAF;

10. Table Add columns
Hive> ALTER TABLE pokes ADD COLUMNS (New_col INT);

11. Add a column and add a column field comment
Hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT ' a COMMENT ');

12. Delete a table
Hive> DROP TABLE pokes;

13.top N
Hive> SELECT * FROM Test order by key limit 10;
14. Create a database
Create Database Baseball;


INSERT INTO table ci_cuser_20141117154351522 Select Mainresult.product_no,dw_coclbl_m02_3848.l1_01_02_01,dw_ Coclbl_d01_3845.l2_01_01_04 from (select Product_no from ci_cuser_20141114203632267) Mainresult left join DW_COCLBL_M02 _201407 dw_coclbl_m02_3848 on mainresult.product_no = Dw_coclbl_m02_3848.product_no LEFT join DW_COCLBL_D01_20140515 DW _coclbl_d01_3845 on dw_coclbl_m02_3848.product_no = Dw_coclbl_d01_3845.product_no

Insert into ci_cuser_20141117142123638 (product_no,attr_col_0000,attr_col_0001) Select Mainresult.product_no,dw_ Coclbl_m02_3848.l1_01_02_01,dw_coclbl_m02_3848.l1_01_03_01 from (select Product_no from ci_cuser_20141114203632267) Mainresult left join dw_coclbl_m02_201407 dw_coclbl_m02_3848 on mainresult.product_no = Dw_coclbl_m02_3848.product_no


CREATE TABLE ci_cuser_yymmddhhmisstttttt_tmp (product_no string) row format Serde ' Com.bizo.hive.serde.csv.CSVSerde ';
LOAD DATA LOCAL inpath '/home/ocdc/coc/yuli/test123.csv ' OVERWRITE into TABLE test_yuli2;

Create a testfile file that supports CSV format
CREATE TABLE test_yuli7 row format Serde ' Com.bizo.hive.serde.csv.CSVSerde ' as select * from ci_cuser_20150310162729786;

Create a comma-delimited table of jar packages that do not rely on Csvserde
"CREATE TABLE" +listname+ "ROW FORMAT delimited fields TERMINATED by ', '" +
"AS SELECT * from" + listName1;

CREATE table aaaa ROW FORMAT delimited fields TERMINATED by ', ' LINES TERMINATED by ' \ n ' STORED as textfile as SELECT * From

Thriftserver Open Fair mode
Sparksql Thrift Server on fair scheduling:
1. Modify $spark_home/conf/spark-defaults.conf, add
2.spark.scheduler.mode FAIR
3.spark.scheduler.allocation.file/users/tianyi/github/community/apache-spark/ Conf/fair-scheduler.xml
4. Modify the $spark_home/conf/fair-scheduler.xml (or add the file) to edit the following format:
5.<?xml version= "1.0"
6.<allocations>
7.<pool name= "Production";
8. <schedulingmode>fair</ Schedulingmode>
9. <!--weight indicates that two queues can use the proportion of resources--> minshare the same situation;
. <weight>1</weight
<!--Minshare indicates the number of resources to be prioritized-->
<minshare>2</minshare>
13.</pool>
14. <pool name= "Test";
. <schedulingmode>fifo</schedulingmode>
. <weight>2</ Weight>
<minshare>3</minshare>
18.</pool>
19.</allocations>
20. Restart Thrift Server
21. Execute
22.set spark.sql.thriftserver.scheduler.pool= the specified queue name before executing SQL

When you're done, create table yangsy555 like CI_CUSER_YYMMDDHHMISSTTTTTT and insert into yangsy555 select * from yangsy555

Sparksql parsing and HIVEQL parsing execution flow:

Summary of SPARKSQL related statements

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.