Hive provides a SQL-like query language for large-scale data analysis and is a common tool in data warehousing.
1. Sorting and aggregation
Sorting uses the standard ORDER BY clause, which produces a globally ordered result. If a global order is not required, Hive's nonstandard SORT BY extension can be used instead: it returns a locally ordered result, in which each reducer sorts its own input and writes one sorted file.
Sometimes we want to control which reducer a row goes to so that a later step can rely on it. Hive's DISTRIBUTE BY clause does this. In the following statement, rows for the same year are sent to the same reducer and sorted within that reducer:
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
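To make the combined effect of DISTRIBUTE BY and SORT BY concrete, here is a minimal Python simulation. It is only a sketch: the sample rows are made up, the helper name distribute_and_sort is hypothetical, and Python's built-in hash stands in for whatever partitioning function Hive actually uses.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY year + SORT BY year ASC, temperature DESC."""
    reducers = defaultdict(list)
    for year, temperature in rows:
        # DISTRIBUTE BY year: rows with the same year land on the same reducer.
        reducers[hash(year) % num_reducers].append((year, temperature))
    # SORT BY: each reducer orders only its own rows (no global order).
    return {
        rid: sorted(part, key=lambda r: (r[0], -r[1]))
        for rid, part in reducers.items()
    }

# Made-up (year, temperature) rows.
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
result = distribute_and_sort(rows, num_reducers=2)
for rid in sorted(result):
    print(rid, result[rid])
```

Each reducer's list is sorted on its own, but concatenating the lists does not necessarily give a globally sorted result, which is the difference from ORDER BY.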
A later processing step can then rely on each year's rows being grouped and sorted, or the statement above can be embedded as a subquery. If SORT BY and DISTRIBUTE BY use the same columns, the two clauses can be shortened to CLUSTER BY. Do not confuse this with the CLUSTERED BY clause used when defining bucketed tables.
2. MapReduce scripts
Using an approach similar to Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses let us invoke external programs (scripts) from Hive. For example, the following Python script filters out dirty data:
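#!/usr/bin/env python
import re
import sys

# Each input line holds year, temperature, and quality code, separated by whitespace.
for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    # Keep rows with a real reading (not 9999) and an acceptable quality code.
    if (temp != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (year, temp))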
The script is self-contained, so it is easy to develop and test on its own. It is then used as follows:
ADD FILE /user/root/python/is_good_quality.py;
FROM records
SELECT TRANSFORM(year, temperature, quality)
USING 'is_good_quality.py'
AS year, temperature;
ADD FILE registers the script with Hive, much as ADD JAR does for JAR files, and Hive then ships it to the cluster. The statement streams the year, temperature, and quality columns to the script's standard input, reads the standard output of the Python process, and parses it into the year and temperature columns.
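Because the script only reads lines and writes lines, its filtering rule can be checked locally before it is handed to Hive. The following sketch wraps the same rule in a function (the name is_good_quality and the sample rows are assumptions made for illustration) and applies it to a few tab-separated lines.

```python
import re

def is_good_quality(line):
    """Return 'year<TAB>temp' for a clean record, or None for a dirty one."""
    year, temp, q = line.strip().split()
    if temp != "9999" and re.match("[01459]", q):
        return "%s\t%s" % (year, temp)
    return None

# Hypothetical input rows: year, temperature, quality code.
sample = ["1950\t0\t1", "1950\t9999\t1", "1949\t111\t2", "1949\t78\t4"]
kept = [out for out in map(is_good_quality, sample) if out is not None]
print(kept)
```

Rows with the missing-value marker 9999 or with a quality code outside 0, 1, 4, 5, 9 are dropped, and only year and temperature are emitted for the rest.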
The following statement combines a map script with a reduce script:
FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
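The text does not show max_temperature_reduce.py, so the following is only a guess at what such a reducer might look like. It assumes the input is tab-separated year and temperature lines and tracks the maximum per year in a dictionary, so it does not depend on the lines arriving grouped by year. The function name and the sample input are hypothetical.

```python
import io

def max_temperature_reduce(lines):
    """Yield 'year<TAB>max_temp' for each year seen in the input lines."""
    max_by_year = {}
    for line in lines:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year not in max_by_year or temp > max_by_year[year]:
            max_by_year[year] = temp
    for year in sorted(max_by_year):
        yield "%s\t%d" % (year, max_by_year[year])

# Hypothetical mapper output; a real script would iterate over sys.stdin instead.
fake_stdin = io.StringIO("1950\t0\n1950\t22\n1949\t111\n1949\t78\n")
output = list(max_temperature_reduce(fake_stdin))
print("\n".join(output))
```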
3. Joins
Performing joins in Hive is much easier than writing them in raw MapReduce.
Inner joins
The simplest join is an inner join. Suppose there are two simple tables, sales and things, each with an id column.
Joining them:
SELECT * FROM
sales JOIN things ON (sales.id = things.id);
The Hive version described here supports only equijoins, meaning that only equality comparisons may appear in the join condition. Several conditions can be combined with AND, and more than two tables can be joined in one statement. The statement above is equivalent to:
SELECT sales.*, things.* FROM
sales, things
WHERE sales.id = things.id;
A single join is implemented as one MapReduce job. With multiple joins, the number of jobs can be smaller than the number of joins, because joins whose conditions use the same column can run in the same job. Prefix a join query with EXPLAIN to see how many jobs it uses; EXPLAIN EXTENDED gives more detailed job information.
Outer joins
As in standard SQL, Hive provides left, right, and full outer joins.
Left join:
SELECT * FROM
sales LEFT OUTER JOIN things ON (sales.id = things.id);
Right join:
SELECT * FROM
sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Full join:
SELECT * FROM
sales FULL OUTER JOIN things ON (sales.id = things.id);
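The difference between the join types is easiest to see on a small example. The sketch below implements an equijoin in plain Python over made-up rows; the column layouts (name, id) for sales and (id, name) for things are assumptions, and the helper join is hypothetical. Unmatched rows are padded with None, which plays the role of SQL NULL.

```python
# Hypothetical rows; column layouts (name, id) and (id, name) are assumed.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

def join(left, right, left_key, right_key, keep_left=False, keep_right=False):
    """Equijoin two lists of tuples; unmatched sides are padded with None."""
    out, matched_right = [], set()
    for lrow in left:
        hit = False
        for i, rrow in enumerate(right):
            if lrow[left_key] == rrow[right_key]:
                out.append(lrow + rrow)
                matched_right.add(i)
                hit = True
        if not hit and keep_left:      # left/full outer: keep unmatched left rows
            out.append(lrow + (None,) * len(right[0]))
    if keep_right:                     # right/full outer: keep unmatched right rows
        for i, rrow in enumerate(right):
            if i not in matched_right:
                out.append((None,) * len(left[0]) + rrow)
    return out

inner = join(sales, things, 1, 0)
left_outer = join(sales, things, 1, 0, keep_left=True)
right_outer = join(sales, things, 1, 0, keep_right=True)
full_outer = join(sales, things, 1, 0, keep_left=True, keep_right=True)
print(len(inner), len(left_outer), len(right_outer), len(full_outer))
```

The left outer join keeps the sale with no matching thing, the right outer join keeps the thing with no matching sale, and the full outer join keeps both.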
Semi joins
Consider a query like this:
SELECT * FROM
things
WHERE things.id IN (SELECT id FROM sales);
A semi join achieves the same effect:
SELECT * FROM
things LEFT SEMI JOIN sales ON (sales.id = things.id);
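The equivalence of the IN form and the semi join can be checked on a small example. This sketch uses made-up rows with assumed column layouts (name, id) for sales and (id, name) for things, and expresses both forms as Python list comprehensions.

```python
# Hypothetical rows; column layouts (name, id) and (id, name) are assumed.
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

# WHERE things.id IN (SELECT id FROM sales)
sales_ids = {sale_id for _, sale_id in sales}
in_result = [t for t in things if t[0] in sales_ids]

# LEFT SEMI JOIN: keep a left row as soon as one right row matches,
# so each left row appears at most once and no right columns are emitted.
semi_result = [t for t in things if any(t[0] == s[1] for s in sales)]

print(semi_result)
```

Note that the thing with id 2 appears once even though two sales reference it; an ordinary inner join would return it twice.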
LEFT SEMI JOIN has one restriction: the right-hand table may appear only in the ON clause, not in the SELECT expression list.
Map joins
Consider the following join:
SELECT sales.*, things.* FROM
sales JOIN things ON (sales.id = things.id);
If one table is small enough to fit in memory, such as the things table here, Hive can load it into the memory of each mapper task and perform the join there. This is called a map join.
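The idea behind a map join can be sketched as a hash join in which every mapper holds the whole small table in memory. The code below is an illustration only, not Hive's implementation: the function name map_join, the sample rows, and the way the large table is divided into splits are all assumptions.

```python
def map_join(big_table_splits, small_table, big_key, small_key):
    """Join each split independently against an in-memory copy of small_table."""
    # Build phase: every mapper would load the small table into a hash table.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)
    # Probe phase: each split is handled on its own, with no shuffle or reduce.
    for split in big_table_splits:
        for row in split:
            for match in lookup.get(row[big_key], []):
                yield row + match

# Hypothetical data: things is the small table, sales arrives in two splits.
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
sales_splits = [[("Joe", 2), ("Hank", 4)], [("Ali", 0), ("Eve", 3), ("Hank", 2)]]
joined = list(map_join(sales_splits, things, big_key=1, small_key=0))
print(joined)
```

Because each split needs only its own rows plus the lookup table, no data has to be brought together in a reduce phase for an inner join.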
The MapReduce job that runs a map join has only mappers and no reducers. This does not work for right or full outer joins, because in those cases a missing match can only be detected once all inputs have been brought together in the reduce stage. The job log confirms this: for a left join no reducer is started, whereas for a full join a reducer is started.
Map joins can be optimized further on bucketed tables, because a mapper then only needs to read the specific buckets of the right-hand table to complete the join. To take advantage of this optimization, set hive.optimize.bucketmapjoin=true.
4. Subqueries
A subquery is a SELECT statement embedded in another SQL statement. Hive provides limited support for subqueries, permitting them in the FROM, WHERE, and SELECT clauses. The following statement uses a subquery in the FROM clause:
SELECT station, year, AVG(max_temperature) FROM
(
  SELECT station, year, MAX(temperature) AS max_temperature FROM
  records2
  WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
  GROUP BY station, year
) mt
GROUP BY station, year;
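The query is plain enough SQL that its shape can be tried out locally. The sketch below runs it with Python's sqlite3 module, with SQLite standing in for Hive purely for illustration; the column types and the sample rows are assumptions, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records2 "
    "(station TEXT, year INTEGER, temperature INTEGER, quality INTEGER)"
)
conn.executemany(
    "INSERT INTO records2 VALUES (?, ?, ?, ?)",
    [
        ("A", 1950, 22, 1),
        ("A", 1950, 9999, 1),  # missing reading, filtered out
        ("A", 1950, 50, 2),    # unacceptable quality code, filtered out
        ("B", 1950, -11, 0),
        ("B", 1950, 5, 4),
    ],
)
rows = conn.execute(
    """
    SELECT station, year, AVG(max_temperature)
    FROM (
        SELECT station, year, MAX(temperature) AS max_temperature
        FROM records2
        WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
        GROUP BY station, year
    ) mt
    GROUP BY station, year
    ORDER BY station
    """
).fetchall()
print(rows)
```

The inner query produces one maximum per station and year, and the outer query refers to it through the alias mt and the column name max_temperature.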
The result of a subquery must be given a name; the alias mt above is what the outer query refers to. The columns in the subquery are also named, such as max_temperature, so that the outer query can reference them.
5. Views
A view is a virtual table defined by a SELECT statement. It can present data to users in a form that differs from how it is stored, and it is often used for access control.
Views in Hive are not materialized to disk; a view's query runs only when a statement that uses the view is executed. If a view's data is used often, consider CREATE TABLE ... AS SELECT, which creates a table that stores the view's contents. We can use views to rewrite the previous query:
CREATE VIEW valid_records
AS
SELECT * FROM
records
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
Creating the view saves its definition to the metastore but does not run the query. The SHOW TABLES command lists the view, and DESCRIBE EXTENDED view_name shows further details, such as the SELECT statement that defines it. Next, create a second view on top of the first:
CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
The column names are given explicitly when the view is created, because Hive would otherwise assign the aggregate column a generated name such as _c2. Alternatively, an AS clause in the SELECT can name the aggregate column.
Finally, we query this view to get the final result:
SELECT station, year, AVG(max_temperature) FROM
max_temperatures
GROUP BY station, year;
Note also that Hive views are read-only, so data in the underlying tables cannot be modified through a view.
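Two of the properties described in this section, that a view is evaluated at query time rather than stored and that it cannot be written to, can be observed in a small experiment. The sketch below uses Python's sqlite3 module as a stand-in, so it shows SQLite's behavior, which happens to match what the text says about Hive on these two points. The column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records "
    "(station TEXT, year INTEGER, temperature INTEGER, quality INTEGER)"
)
conn.execute("INSERT INTO records VALUES ('A', 1950, 22, 1)")
conn.execute(
    "CREATE VIEW valid_records AS "
    "SELECT * FROM records "
    "WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)"
)
before = conn.execute("SELECT COUNT(*) FROM valid_records").fetchone()[0]

# Rows added after the view was created are visible through it,
# because the view's SELECT is evaluated each time the view is queried.
conn.execute("INSERT INTO records VALUES ('A', 1951, 30, 0)")
conn.execute("INSERT INTO records VALUES ('A', 1951, 9999, 0)")
after = conn.execute("SELECT COUNT(*) FROM valid_records").fetchone()[0]

# Writing through the view is rejected.
try:
    conn.execute("INSERT INTO valid_records VALUES ('B', 1952, 1, 1)")
    writable = True
except sqlite3.Error:
    writable = False

print(before, after, writable)
```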