Query operations
groupbyorderbyjoinbybybyunion all
implementation of the underlying
mapreduce
Common aggregation Operations
Count counts
count(*) 所有值不全为NULL时,加1操作count(1) 不管有没有值,只要有这条记录,值就加1count(col) col列里面的值为null,值不会加1,这个列里面的值不为NULL,才加1
Sum sum
sum(可转成数字的值) 返回bigint
AVG Averaging
avg(可转成数字的值)返回double
Distinct number of different values
count(distinct col)
Order by
Sort by some fields
Sample Example
select col1,other...from tablewhere conditioorder by col1,col2 [asc|desc]
Note
Order by can have more than one column sorted by default, sorted by dictionary
Order BY is a global sort
The order by requires the reduce operation, and only one reduce, regardless of configuration. when the amount of data is very large, use caution.
Execution Process
Reads the data from the table, executes the where condition, makes a combination key with the value of the Col1,col2 column, other column values as value, and then passes the data to the same reduce, according to the sort order required.
Group by
Grouped by the values of some fields, with the same values put together.
Sample Example
select col1 [,col2] ,count(1),sel_expr(聚合操作)from tablewhere condition -->Map端执行group by col1 [,col2] -->Reduce端执行[having] -->Reduce端执行
Note
A non-aggregate column after select must appear in GROUP by
In addition to the normal column in the Select is some aggregation operations
Group by can be followed by an expression, such as substr (COL)
features
Use the reduce operation, limited by the number of reduce , set the reduce parameter mapred.reduce.tasks
The output file is the same as the number of reduce, and the file size is related to the amount of data that reduce.
problem
Heavy Network Load
Data skew, the optimization parameter hive.groupby.skewindata is true, an optimizer is launched to avoid data skew.
Execution Process
The data is read from the table, the Where condition is executed, the col1 column is grouped, the content of the Col column is the key, the other column values are uploaded to reduce as value, and the aggregation operation and the having filter are performed on the reduce side.
eg
Setmapred.reduce.tasks=5;Select* fromTabOrderOrder byChASCNumdesc;Setmapred.reduce.tasks=3;SelectCH, COUNT (1) asNum fromTabOrderGroup byChSetHive.groupby.skewindata =true;SelectCH, COUNT (1) asNum fromTabOrderGroup byCh Having count (1) >2;SelectCol fromTableNameGroup byCol <==>Select distinctCol fromTableName
Join table Join
Two table m,n is connected by an on condition, and one record in M and one record in n form a new record.
Join equivalent join (inner join), when only one value exists simultaneously in M and N.
left outer joinLeft outer joins, the values in the left table are output regardless of whether they exist in B or not, and the values in the right table are output only if they exist in the left table.
right outer joinand the left outer join opposite.
left semi joinSimilar exists . That is, find the data in table A, whether it exists in table B, and find out the existing data.
Mapjoin: completes the join operation on the map side, does not need to use reduce, the memory-based join, belongs to the optimization operation.
select M.col as col1, m.col2 as col2, n.col3 as col3 from (select Col1,col2 from , Test where ... (map side Execution)) m (left table) [left outer | Right outer | left semi] join N (right table) on m.col=n.colwhere condition (reduced side execution) Span class= "Hljs-keyword" >set hive.optimize.skewjoin=true ;
Read the data execution where condition, grouped by Col column, the contents of Col column as key, other columns as value, upload to reduce, perform connection operation and where filtering on the reduce side.
eg
Create table m (ch string,num string)row format delimited fields terminated by c6> ' \ t ' lines terminated by ' \ n ' stored as textfile; load data local inpath '/liguodong/hivedata/m ' into table m; Create table N (ch string,num string)row format delimited fields terminated by c5> ' \ t ' lines terminated by ' \ n ' stored as textfile; load data local inpath '/liguodong/hivedata/n ' into table n; Select * from M; Select * from N;Internal connection Select s.ch,s.num,t.num from (select ch,num from M) sjoin(Select ch,num from N) t on s.ch=t.ch;Left outer connection Select s.ch,s.num,t.num from (select ch,num from M) s left outer Join (Select ch,num from N) t on s.ch=t.ch; Right outer connection Select s.ch,s.num,t.num from (select ch,num from M) s Right outer Join(select ch,num from N) t on s.ch=t.ch;
Data Output comparison
select s.ch,s.num from(select ch,num from m)sleft semi join(select ch,num from n)ton s.ch=t.ch;运行结果:A 1C 5C 3
Mapjoin
Mapjoin (map side join)
Load the small table into memory on the map side, then read the large table, and the small table in memory to complete the connection operation. The distributed cache technology is used.
Advantages
Reduce resource consumption of the cluster (reduce is relatively scarce).
Reduces reduce operations and accelerates program execution.
Reduce Network load.
Disadvantages
Takes up part of the memory, so the tables that are loaded into memory cannot be too large because each compute node is loaded once.
Generate more small files.
Execution Process
Reads data from a large table and executes a where condition. The small table is loaded into memory, and every data read in a large table is compared to the small table data in memory.
The first way, the automatic way
Configure the following parameters
hive** automatically * * Based on SQL, choose to use common join or map join
set hive.auto.convert.join=true;hive.mapjoin.smalltable.filesize默认值是25mb
The second way, manually specifying
/*+mapjoin(n)*/ m.col, m.col2, n.col3 from mjoin n on m.col=n.col;
Note: /*+mapjoin(n)*/ You cannot omit it, just replace the table name n value.
Briefly summarize theusage scenarios for the map join :
1, the association operation has a table very small
2, not equivalent to the link operation
select c.city,p.province from(select province,city from city)cjoin(select province from province)pon c.province=p.province;mapjoin手动方式select /*+mapjoin(p)*/ c.city,p.province from(select province,city from city)cjoin(select province from province)pon c.province=p.province;
Compare the comparison time between the two.
Hive Bucket Join
For each table or partition, hive can be further organized into buckets, which means that buckets are more granular data range divisions.
Hive is a bucket for a column.
Hive uses a hash of the column values, divided by the number of buckets, to determine which bucket the record is stored in.
Benefits
Get higher query processing efficiency.
Make sampling (sampling) more efficient.
create table bucketed_user(id int,name string)clustered by (id) sorted by (name) into 4 bucketsrow format delimited fields terminated by ‘\t‘ stored as textfile;set hive.enforce.bucketing=true;
Use of sub-barrels
selectfrom1outof2on id)
Bucket Join
set hive.optimize .bucketmapjoin =true set hive.optimize .bucketmapjoin .sortedmerge =true set hive.input .format =org.apache .hadoop .hive .io
Connect two tables that divide buckets on the same column (including connection columns), and can be implemented efficiently using map-side joins (map side joins). such as join operations.
For join operations, two tables have an identical column if both tables are stabbed. Then the bucket with the same column values will be saved as a JOIN operation, which can greatly reduce the amount of data in the join.
For a map-side connection, two tables divide the buckets in the same way. Processing the mapper of a bucket in the left table knows that the matching rows in the right table are within the corresponding bucket. Therefore, mapper only needs to get the bucket (which is only a small portion of the data stored in the right table) to connect.
This optimization method does not necessarily require that two tables must be the same number of buckets, the number of buckets in two tables is a multiple relationship can also.
distribute by, sort by
Distribute Scatter data
Distribute by col– scattered data to different reduce according to Col column.
Sort sorts
Sort by col– sorting data according to Col column
select col1,col2 from Mdistribute by col1sort by col1 asc,col2 desc
The combination of both ensures that the output of each reduce is orderly.
Distribute by vs group by
Data is divided by key value
Both use the reduce operation
* * The only difference is that **distribute by simply disperses data, and group by brings together data from the same key, and subsequent must be an aggregation operation.
order by vs. sort by
Order BY is a global sort
Sort by simply ensures that the data output on each reduce is orderly. If there is only one reduce, the order by function is the same.
Execution Process
Reads the data from the table and executes the where condition.
Set the reduce number to 3, take the value of the distribute by column as the key, the other column values as value, and then pass the data to different reduce according to the key value, and then sort by field.
Application Scenarios
File size of map output is not uniform
Reduce output file size is uneven
Too many small files
File is super large
Put a large file in some small file Set mapred.reduce.tasks=5;-The following city will be output to five files Insert overwrite table cityselsct time, country,province,city from Infodistribute by province;Put some small files in a large file Set mapred.reduce.tasks=1;--The following province will be output to a large file Insert overwrite table province partition (dt=' 20150719 ') SELSCT time , Country,province from city distribute by country;Note: Province is a partitioned table.
Cluster by
Gather data with the same values and sort them together.
Effect equivalent to distribute by Col order by Col
Cluster by Col <==> distribute by Col order by Col
Union All
Data from multiple tables is combined into a single table, and hive does not support union
select col from(select a as col from t1union allselect b as col from t2)tmp
Execution Process
Reads the data from the table and executes the where condition. Merge into the same table.
UNION ALL must meet the following requirements
Field names are the same
Same as field type
The same number of fields
Child tables cannot have aliases
If you need to query data from a merged table, the merged table must have an alias
select * from (select * from munion allselect * from n)temp;如果两张表的字段名不一样,要将一个表修改别名同另一个表的字段名一样。select * from (select col1,col2 from munion allselect col1,col3 as col2 from n)temp;
Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.
Hive advanced Query (group BY, order BY, join, etc.)