Querying Data in Hive

Source: Internet
Author: User
Tags joins python script

Hive provides a SQL-like query language for large-scale data analysis and is a common tool in data warehousing.

1. Sorting and aggregation

Hive sorts data with the standard ORDER BY clause, which produces a globally ordered result. If a global ordering is not required, you can use Hive's nonstandard extension SORT BY instead, which produces a locally ordered result: each reducer sorts its own rows and writes one sorted file.

Sometimes you want to control which reducer a given row goes to so that a later step can process it, which you can do with Hive's DISTRIBUTE BY clause. In the following statement, all rows for the same year go to the same reducer and are sorted within that reducer:

FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
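
To make the per-reducer behaviour concrete, here is a minimal Python sketch that imitates what DISTRIBUTE BY and SORT BY do to a handful of rows. The function name, the sample rows, and the use of a simple modulo in place of Hive's hash partitioning are all assumptions made for illustration; this is not how Hive is implemented.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers):
    """Imitate DISTRIBUTE BY year SORT BY year ASC, temperature DESC."""
    reducers = defaultdict(list)
    for year, temperature in rows:
        # DISTRIBUTE BY year: rows with the same year reach the same reducer.
        # A modulo on the year stands in for Hive's hash partitioning.
        reducers[year % num_reducers].append((year, temperature))
    # SORT BY: each reducer sorts only its own rows, so the order is local.
    return {
        reducer: sorted(part, key=lambda row: (row[0], -row[1]))
        for reducer, part in reducers.items()
    }

# Made-up (year, temperature) rows.
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
result = distribute_and_sort(rows, num_reducers=2)
```

Each entry of result corresponds to one reducer's output file: ordered internally, but with no ordering guaranteed across files.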

Later processing can then rely on the rows for each year being grouped and sorted, and the DISTRIBUTE BY statement above can also be embedded as a subquery. If the SORT BY and DISTRIBUTE BY columns are the same, the two clauses can be shortened to CLUSTER BY. Do not confuse this with CLUSTERED BY, which is used to define bucketed tables.

2. MapReduce Script

Using an approach similar to Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses let you invoke external programs (scripts) from Hive. For example, the following Python script filters out dirty data:

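#!/usr/bin/env python

import re
import sys

# Read whitespace-separated (year, temperature, quality) rows from stdin.
for line in sys.stdin:
  (year, temp, q) = line.strip().split()
  # Keep rows with a real reading and an acceptable quality code.
  if (temp != "9999" and re.match("[01459]", q)):
    print("%s\t%s" % (year, temp))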

The script above is self-contained, so it is easy to develop and test on its own. Use it from Hive as follows:

ADD FILE /user/root/python/is_good_quality.py;

FROM records
SELECT TRANSFORM(year, temperature, quality)
USING 'is_good_quality.py'
AS year, temperature;

ADD FILE registers the script with Hive, much as ADD JAR does for JAR files, and Hive then ships it to the cluster. The statement streams the three columns of each row to the script's standard input and reads the script's standard output, parsing it into the year and temperature columns.
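
The streaming round trip can be tried locally without Hive. The sketch below wraps the same filter logic in a function (the name is_good_quality and the sample rows are made up for this example) and feeds it tab-separated lines, which is an assumption about the text format the script receives on standard input.

```python
import re

def is_good_quality(lines):
    """Yield tab-separated year and temperature for rows that pass the filter."""
    for line in lines:
        year, temp, q = line.strip().split()
        if temp != "9999" and re.match("[01459]", q):
            yield "%s\t%s" % (year, temp)

# Made-up input: one row per line, columns year, temperature, quality.
stdin_rows = [
    "1950\t0\t1\n",
    "1950\t9999\t1\n",   # missing reading, dropped
    "1949\t111\t2\n",    # quality code not accepted, dropped
    "1949\t78\t5\n",
]

# Split each output line on tabs to recover the two AS columns.
parsed = [tuple(out.split("\t")) for out in is_good_quality(stdin_rows)]
```

Keeping the logic in a function like this makes it straightforward to unit-test before the script is shipped to the cluster.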

The following statements combine the map script with the reduce script:

FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
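
The source does not show max_temperature_reduce.py. As a rough idea of what such a reduce script might contain, here is a hypothetical sketch; it assumes the input is tab-separated year and temperature lines and that all lines for the same year arrive consecutively.

```python
def max_temperature_reduce(lines):
    """Yield one tab-separated 'year, max temperature' line per year.

    Assumes the lines for each year arrive consecutively.
    """
    last_year, max_temp = None, None
    for line in lines:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year != last_year:
            # A new year starts: emit the finished group, if any.
            if last_year is not None:
                yield "%s\t%s" % (last_year, max_temp)
            last_year, max_temp = year, temp
        else:
            max_temp = max(max_temp, temp)
    # Emit the final group.
    if last_year is not None:
        yield "%s\t%s" % (last_year, max_temp)

# Made-up input, already grouped by year.
sample = ["1949\t111\n", "1949\t78\n", "1950\t0\n", "1950\t22\n", "1950\t-11\n"]
output = list(max_temperature_reduce(sample))
```

In a real script the function would be driven by sys.stdin and each yielded line printed to standard output.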

3. Joins

Performing joins in Hive is much easier than writing them in native MapReduce.

Inner joins

The simplest kind of join is the inner join. Suppose you have two simple tables, sales and things.

The inner join of the two tables:

SELECT * FROM
sales JOIN things ON (sales.id = things.id);


Hive supports only equijoins, meaning that a join condition may use only equality comparisons. The join condition can combine several comparisons with AND, and more than two tables can be joined. The statement above is equivalent to:

SELECT sales.*, things.* FROM
sales, things
WHERE sales.id = things.id;

A single join is implemented as one MapReduce job, but for multiple joins the number of jobs can be smaller than the number of joins: joins whose conditions use the same column can run in the same job. Prefix a query with EXPLAIN to see how many jobs the join uses.

EXPLAIN EXTENDED gives more detailed job information.

Outer joins

As in SQL, Hive provides left, right, and full outer joins.

Left Join:

SELECT * FROM
sales LEFT OUTER JOIN things ON (sales.id = things.id);

Right Join:

SELECT * FROM
sales RIGHT OUTER JOIN things ON (sales.id = things.id);

Full Join:

SELECT * FROM
sales FULL OUTER JOIN things ON (sales.id = things.id);
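
The difference between the three outer joins is easiest to see on a tiny data set. The following Python sketch is only an illustration of the semantics: the outer_join helper and the sample rows are made up, and None stands in for NULL.

```python
def outer_join(left, right, left_key, right_key, how):
    """Equijoin two lists of tuples; how is 'left', 'right', or 'full'."""
    left_pad = (None,) * len(left[0])
    right_pad = (None,) * len(right[0])
    rows, matched_right = [], set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right)
                if lrow[left_key] == rrow[right_key]]
        for i in hits:
            rows.append(lrow + right[i])
            matched_right.add(i)
        if not hits and how in ("left", "full"):
            # Unmatched left row: keep it, with NULLs for the right columns.
            rows.append(lrow + right_pad)
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                # Unmatched right row: keep it, with NULLs for the left columns.
                rows.append(left_pad + rrow)
    return rows

# Made-up rows: sales is (name, id), things is (id, name).
sales = [("Joe", 2), ("Ali", 0)]
things = [(2, "Tie"), (1, "Scarf")]

left_rows = outer_join(sales, things, 1, 0, "left")
right_rows = outer_join(sales, things, 1, 0, "right")
full_rows = outer_join(sales, things, 1, 0, "full")
```

Here Ali survives only in the left and full joins, and Scarf only in the right and full joins.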

Semi joins

Consider a query like this:

SELECT * FROM
things
WHERE things.id IN (SELECT id FROM sales);

A semi join achieves the same effect:

SELECT * FROM
things LEFT SEMI JOIN sales ON (sales.id = things.id);
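
As a sketch of the semantics, a left semi join keeps each left-hand row at most once whenever a matching key exists on the right, and returns only left-hand columns. The helper name and the sample rows below are made up for illustration.

```python
def left_semi_join(left, right, left_key, right_key):
    """Keep the left rows whose key appears in the right table."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]

# Made-up rows: things is (id, name), sales is (name, id).
things = [(2, "Tie"), (4, "Coat"), (1, "Scarf")]
sales = [("Joe", 2), ("Hank", 4), ("Hank", 2)]

result = left_semi_join(things, sales, left_key=0, right_key=1)
```

Note that the Tie row appears once even though id 2 occurs twice in sales, which is what distinguishes a semi join from an inner join.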

LEFT SEMI JOIN has one restriction: the right-hand table may appear only in the ON clause, not in the SELECT expression list.

Map joins

Consider the following join:

SELECT sales.*, things.* FROM
sales JOIN things ON (sales.id = things.id);

If one table is small enough to fit in memory, such as the things table here, Hive can load it into the memory of each mapper task and perform the join there. This is called a map join.
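
The idea behind a map join can be sketched as a hash join in which the small table becomes an in-memory lookup and the large table is streamed past it. This is a simplified illustration with made-up names and rows, not Hive's actual implementation.

```python
def map_join(big_rows, small_table, big_key, small_key):
    """Inner-join streamed rows against a small table held in memory."""
    # Build the in-memory hash table from the small table (once per mapper).
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)
    # Stream the big table; each row is joined on the spot, with no reduce step.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield row + match

# Made-up rows: sales is (name, id), things is (id, name).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat")]

joined = list(map_join(sales, things, big_key=1, small_key=0))
```

Because every streamed row can be resolved against the lookup immediately, nothing has to be shuffled to a reducer.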

The MapReduce job that executes a map join has only mappers and no reducers. This is why it does not work for right or full outer joins: in those cases, whether a row truly has no match can be determined only after all the inputs are aggregated in the reduce stage. In the left join in the figure below, the log shows that no reducer is started:

In a full join, a reducer is started:

Map joins work even better on bucketed tables, because each mapper then needs to read only the relevant buckets of the right table to perform the join. To take advantage of this optimization, set hive.optimize.bucketmapjoin=true.

4. Subqueries

A subquery is a SELECT statement embedded in another SQL statement. Hive provides limited support for subqueries, allowing them in the FROM, WHERE, and SELECT clauses. The following statement uses a subquery in the FROM clause:

SELECT station, year, AVG(max_temperature) FROM
(
  SELECT station, year, MAX(temperature) AS max_temperature FROM
  records2
  WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
  GROUP BY station, year
) mt
GROUP BY station, year;
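
The two stages of the query above can be mirrored step by step in Python. This is only a sketch of the logic on made-up rows; the variable names follow the query, and the row layout (station, year, temperature, quality) is an assumption.

```python
from collections import defaultdict

# Made-up rows: (station, year, temperature, quality).
records2 = [
    ("A", 1950, 22, 1),
    ("A", 1950, 9999, 1),   # missing reading, filtered out
    ("A", 1950, 30, 2),     # quality code not accepted, filtered out
    ("A", 1949, 111, 1),
    ("B", 1950, 15, 5),
]

# Inner query: filter, then MAX(temperature) per (station, year).
mt = defaultdict(lambda: float("-inf"))
for station, year, temperature, quality in records2:
    if temperature != 9999 and quality in (0, 1, 4, 5, 9):
        key = (station, year)
        mt[key] = max(mt[key], temperature)

# Outer query: AVG(max_temperature) per (station, year) over the rows of mt.
groups = defaultdict(list)
for (station, year), max_temperature in mt.items():
    groups[(station, year)].append(max_temperature)
result = {key: sum(values) / len(values) for key, values in groups.items()}
```

The intermediate mapping mt plays the role of the named subquery result that the outer stage reads from.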

The result of a subquery must be given a name, here mt, which the outer query uses as a table name. The columns in the subquery are also given names, such as max_temperature, so that the outer query can refer to them.

5. Views

A view is a virtual table defined by a SELECT statement. It can present data to users in a form that differs from how it is stored, and it is often used for access control.

A view in Hive is not materialized to disk; the view's query runs only when a query that uses the view runs. If a view's data is used often, consider using CREATE TABLE ... AS SELECT to create a table that stores the view's contents. The following view is used to rewrite the previous query:

CREATE VIEW valid_records
AS
SELECT * FROM
records
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9);

Creating the view saves its definition in the metastore but does not run the query. The SHOW TABLES command lists the view, and DESCRIBE EXTENDED view_name shows further details, such as the SELECT statement that defines it. Next, create a second view on top of the first:

CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;

The column names are specified explicitly when the view is created because Hive would otherwise give the aggregated column an automatic name such as _c2. Alternatively, you can name the aggregated column with an AS clause.
Finally, query this view for the final result:

SELECT station, year, AVG(max_temperature) FROM
max_temperatures
GROUP BY station, year;

Also, views in Hive are read-only, so the data in the underlying tables cannot be updated through a view.
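
The difference between a view, whose query runs each time it is used, and a table created with CREATE TABLE ... AS SELECT, which stores a snapshot, can be illustrated with a loose Python analogy. The function and variable names and the sample rows are made up; a function call stands in for querying the view.

```python
# Made-up underlying table.
records = [
    {"station": "A", "year": 1950, "temperature": 22, "quality": 1},
    {"station": "A", "year": 1950, "temperature": 9999, "quality": 1},
]

def valid_records():
    """Like a view: the filter runs every time the 'view' is queried."""
    return [r for r in records
            if r["temperature"] != 9999 and r["quality"] in (0, 1, 4, 5, 9)]

# Like CREATE TABLE ... AS SELECT: the result is computed once and stored.
valid_records_table = valid_records()

# New data arrives in the underlying table.
records.append({"station": "B", "year": 1951, "temperature": 5, "quality": 0})

view_count = len(valid_records())        # re-evaluated, sees the new row
table_count = len(valid_records_table)   # stored snapshot, unchanged
```

The stored copy is cheaper to read repeatedly but goes stale, which is the trade-off to weigh before materializing a frequently used view.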
