Familiarize yourself with Hive statements through the student-course relationship tables

1. Create the following three tables in Hive.

create table student (Sno int, Sname string, Sex string, Sage int, Sdept string) row format delimited fields terminated by ',' stored as textfile;
create table course (Cno int, Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc (Sno int, Cno int, Grade int) row format delimited fields terminated by ',' stored as textfile;

2. Load data into the three tables.

load data local inpath '/home/student.txt' overwrite into table student;
load data local inpath '/home/sc.txt' overwrite into table sc;
load data local inpath '/home/course.txt' overwrite into table course;

student.txt:

95001, Li Yong, male, 20, CS
95002, Liu Chen, female, 19, IS
95003, Wang Min, female, 22, MA
95004, Zhang Li, male, 19, IS
95005, Liu Gang, male, 18, MA
95006, Sun Qing, male, 23, CS
95007, Yi Siling, female, 19, MA
95008, Li Na, female, 18, CS
95009, Meng Yuan, female, 18, MA
95010, Kong Xiaotao, male, 19, CS
95011, Bao Xiaobai, male, 18, MA
95012, Sun Hua, female, 20, CS
95013, Feng Wei, male, 21, CS
95014, Wang Xiaoli, female, 19, CS
95015, Wang Jun, male, 18, MA
95016, Qian Guo, male, 21, MA
95017, Wang Fengjuan, female, 18, IS
95018, Wang Yi, female, 19, IS
95019, Xiao Lili, female, 19, IS
95020, Zhao Qian, male, 21, IS
95021, Zhou Er, male, 17, MA
95022, Zheng Ming, male, 20, MA

course.txt:

1, Database
2, Mathematics
3, Information System
4, Operating System
5, Data Structure
6, Data Processing

sc.txt:

95001, 1, 81
95001, 2, 85
95001, 3, 88
, 70
95002, 2, 90
95002, 3, 80
95002, 4, 71
95002, 5, 60
95003, 1, 82
95003, 3, 90
100
95004, 1, 80
95004, 2, 92
95004, 4, 91
95004, 5, 70
95005, 1, 70
95005, 2, 92
95005, 3, 99
95005, 6, 87
95006, 1, 72
95006, 2, 62
95006, 3, 100
95006, 4, 59
95006, 5, 60
95006, 6, 98
95007, 3, 68
95007, 4, 91
95007, 5, 94
95007, 6, 78
95008, 1, 98
95008, 3, 89
95008, 6, 91
95009, 2, 81
95009, 4, 89
100
95010, 2, 98
95010, 5, 90
95010, 6, 80
95011, 1, 81
95011, 2, 91
95011, 3, 81
95011, 4, 86
95012, 1, 81
95012, 3, 78
95012, 4, 85
95012, 6, 98
95013, 1, 98
95013, 2, 58
95013, 4, 88
95013, 5, 93
95014, 1, 91
95014, 2, 100
95014, 4, 98
95015, 1, 91
95015, 3, 59
95015, 4, 100
95015, 6, 95
95016, 1, 92
95016, 2, 99
95016, 4, 82
95017, 4, 82
95017, 5, 100
95017, 6, 58
95018, 1, 95
95018, 2, 100
95018, 3, 67
95018, 4, 78
95019, 1, 77
95019, 2, 90
95019, 3, 91
95019, 4, 67
95019, 5, 87
95020, 1, 66
95020, 2, 99
95020, 5, 93
95021, 2, 93
95021, 5, 91
95021, 6, 99
95022, 3, 69
95022, 4, 93
95022, 5, 82
95022, 6, 100

3. Hive SELECT
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number]
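As an illustration of the clause order above, a minimal sketch against the student table:

hive> select Sdept, count(*) cnt from student where Sage >= 19 group by Sdept limit 3;

This filters rows with WHERE, aggregates per department with GROUP BY, and caps the output with LIMIT.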
 
  • Query the student IDs and names of all students
hive> select Sno, Sname from student;
Total MapReduce jobs = 1
 
  • Query the names of students who have selected at least one course
hive> select distinct Sname from student inner join sc on student.Sno = sc.Sno;
Total MapReduce jobs = 2
Hive GROUP BY and aggregate functions: multiple aggregate functions can appear in a single query, but at most one of them may use distinct.
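For example (a sketch against the sc table): the first query is fine, while older Hive versions reject the second because it applies distinct to two different columns:

hive> select count(distinct Sno), max(Grade) from sc;
hive> select count(distinct Sno), count(distinct Cno) from sc;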
hive.map.aggr controls map-side aggregation. Its default value is false; if set to true, the first level of aggregation is performed in the map task, which is usually more efficient but requires more memory.
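A minimal sketch of enabling map-side aggregation before a group-by:

hive> set hive.map.aggr=true;
hive> select Sdept, count(*) from student group by Sdept;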
  • Query the total number of students
hive> select count(distinct Sno) from student;
Total MapReduce jobs = 1
 
  • Calculate the average grade of students in course 1
hive> select avg(distinct Grade) from sc where Cno = 1;
Note that avg(distinct Grade) averages only the distinct grade values; use avg(Grade) to average over all enrolled students.
 
  • Query the highest grade among students enrolled in course 1
hive> select Grade from sc where Cno = 1 sort by Grade desc limit 1;
Total MapReduce jobs = 2
  • For each course, find the course number and the number of enrolled students
hive> select Cno, count(1) from sc group by Cno;
Total MapReduce jobs = 1
  • Query the student IDs of students who have taken more than three courses
hive> select Sno from (select Sno, count(Cno) countCno from sc group by Sno) a where a.countCno > 3;
Total MapReduce jobs = 1
The same query written with having:
hive> select Sno from sc group by Sno having count(Cno) > 3;
Total MapReduce jobs = 1

4. Hive order by / sort by / distribute by / cluster by

order by: in strict mode (hive.mapred.mode=strict), an order by statement must be followed by a limit clause; in nonstrict mode it is optional. The reason is that a single reducer must sort the final result, and if the number of output rows is too large, that one reduce task takes a long time.
hive> set hive.mapred.mode=strict;
hive> select Sno from student order by Sno;
FAILED: Error in semantic analysis: In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'Sno'
sort by: sorts within each reducer. The difference from order by is that order by guarantees totally ordered output, while sort by only guarantees that the output of each reducer is ordered (a partial order when there are multiple reduce tasks). Set mapred.reduce.tasks=<number> to choose the number of reducers before using sort by. With sort by alone, rows are distributed to reducers arbitrarily, since no distribution column is specified.
distribute by: distributes rows to different reducers based on the specified columns. The following example distributes rows by sex, sorts by age within each reducer, and writes the output to separate files:
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/home/hadoop/out' select * from student distribute by Sex sort by Sage;
cluster by: distributes rows by the given columns and sorts by them (ascending) within each reducer; it combines the functions of distribute by and sort by.
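A minimal sketch of that equivalence:

hive> select Sno, Grade from sc cluster by Sno;
is the same as:
hive> select Sno, Grade from sc distribute by Sno sort by Sno;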
5. Join
  • Join supports only equi-joins. For example, to query each student together with their selected courses:
hive> select student.*, sc.* from student join sc on (student.Sno = sc.Sno);
  • If the join clauses do not use the same join key, the query is compiled into multiple map/reduce jobs; if they all use the same key, it becomes a single map/reduce job. Because the map/reduce output of the first join is buffered and then joined with the next table, put the smaller table first to reduce memory usage. For example, querying student scores uses two different join keys:
hive> select student.Sname, course.Cname, sc.Grade from student join sc on student.Sno = sc.Sno join course on sc.Cno = course.Cno;
Total MapReduce jobs = 2
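By contrast, when both join clauses share the same key, Hive compiles a single map/reduce job (a sketch with hypothetical tables a, b, c):

hive> select a.val, b.val, c.val from a join b on (a.key = b.key) join c on (c.key = b.key);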
  • left outer, right outer, and full outer joins
    The outer keyword controls how unmatched rows are handled.
hive> select student.Sname, sc.Cno from student left outer join sc on student.Sno = sc.Sno;
If a student's Sno has no matching row in sc, the student's Sname is still output with sc.Cno as NULL. With a right outer join, rows from the right table are retained and the left side's columns are NULL when there is no match.
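For completeness, a full outer join keeps unmatched rows from both sides, padding the missing side with NULL (a sketch):

hive> select student.Sname, sc.Cno from student full outer join sc on student.Sno = sc.Sno;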

The join is evaluated before the WHERE clause. If you want to limit the join output, write the filter condition in the WHERE clause, or write it in the join clause. A confusing case is partitioned tables:

select a.val, b.val from a
left outer join b on (a.key = b.key)
where a.ds = '2017-07-07' and b.ds = '2017-07-07';

This outer joins a to b and lists a.val and b.val; other columns can be used as filter conditions in the WHERE clause. However, as described above, if a row of a has no matching key in b, all columns of b come back NULL, including the ds column. The condition b.ds = '2017-07-07' therefore filters out every row of a without a match in b, which makes the left outer part of the join irrelevant to the result. The solution is to move the partition filter into the on clause of the outer join:

select a.val, b.val from a left outer join b
on (a.key = b.key and b.ds = '2017-07-07' and a.ds = '2017-07-07');

With this query the rows are filtered in the join stage, before unmatched rows are padded, so the problem above does not occur. The same logic applies to right and full outer joins.

Joins are not commutative: they are evaluated left to right (left-associative), whether they are left or right joins.

select a.val1, a.val2, b.val, c.val
from a join b on (a.key = b.key)
left outer join c on (a.key = c.key);

This joins a to b first, discarding every row whose join key has no match in the other table, and then joins the intermediate result to c. The behavior is unintuitive when a key exists in both a and c but not in b: the whole row (including a.val1, a.val2, and a.key) is dropped in the first join, so nothing remains to match c.key in the second join. If the second join were a right outer join to c instead, such a key would produce the row: NULL, NULL, NULL, c.val.
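For contrast, here is the right outer variant just described (a sketch with the same hypothetical tables):

select a.val1, a.val2, b.val, c.val
from a join b on (a.key = b.key)
right outer join c on (a.key = c.key);

For a key present in a and c but missing from b, this outputs NULL, NULL, NULL, c.val.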
  • left semi join is a more efficient implementation of in/exists subqueries. Hive does not currently implement in/exists subqueries, so you can use left semi join to rewrite them. The restriction is that the right-hand table may only be referenced in the on clause; it cannot be referenced in the WHERE clause, the select clause, or anywhere else.

select a.key, a.val
from a
where a.key in
(select b.key
from b);

can be rewritten as:

select a.key, a.val
from a left semi join b on (a.key = b.key);
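Under this restriction, filtering the right table anywhere but the on clause is invalid (a sketch):

select a.key, a.val from a left semi join b on (a.key = b.key) where b.key > 100;

This fails because b may only appear in the on clause; the filter on b has to move there instead: on (a.key = b.key and b.key > 100).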

  • If one of the tables in the join is small enough to fit in memory, the join can be completed in the mappers alone (a map join).
  • If two tables are bucketed on the join columns and the number of buckets in one table is a multiple of the number in the other, the join can likewise be completed in the mapper stage. E.g.
 SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
In the mapper stage, the values of b are read bucket by bucket, so a bucket of table a joins only the corresponding bucket of table b. This is not the default behavior and must be enabled:
 
set hive.optimize.bucketmapjoin = true
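For the bucket map join to apply, both tables must be declared bucketed on the join key, e.g. (a sketch with hypothetical DDLs):

create table a (key int, value string) clustered by (key) into 4 buckets;
create table b (key int, value string) clustered by (key) into 8 buckets;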
 
  • Exercises:
Query all students who took course 2 and scored above 90:
hive> select student.Sname, sc.Grade from student join sc on student.Sno = sc.Sno where sc.Cno = 2 and sc.Grade > 90;
Query the students in the same department as "Liu Chen":
hive> select s1.Sname from student s1 left semi join student s2 on (s1.Sdept = s2.Sdept and s2.Sname = 'Liu Chen');

6. Hive Optimization
  • A good model design can get twice the result with half the effort.
  • Solve the problem of data skew.
  • Reduce the number of jobs.
  • Setting a reasonable number of map and reduce tasks can effectively improve performance. (For example, launching 160+ reduce tasks for a computation where one reducer is enough is pure waste.)
  • Understanding the data distribution and handling data skew yourself is a good choice. set hive.groupby.skewindata=true is a general-purpose optimization (see the sketch after this list), but a generic algorithm cannot always fit a specific business scenario; developers who understand the business and the data can solve skew problems accurately and effectively through business logic.
  • When the data volume is large, use count(distinct) with caution; count(distinct) is prone to data skew.
  • Merging small files is an effective way to improve scheduling efficiency. If every job is configured with a reasonable number of files, the overall scheduling of the cluster also benefits.
  • Optimizing the workload as a whole is better than optimizing individual jobs in isolation.
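A minimal sketch of the skew-handling setting mentioned above, applied to a group-by over the sc table:

hive> set hive.groupby.skewindata=true;
hive> select Cno, count(distinct Sno) from sc group by Cno;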
