"Programming Hive" Reading notes (two) Hive basics
First read-through: skim only. The goal is to build an index of what exists; some of it may never be needed, but it is good to know it is there. The parts of interest can be studied in more depth.
Look things up in detail when they are actually needed, and combine the book with other materials.
Chapter 3. Data Types and File Formats
Primitive data types and collection data types.
The delimiter between columns in the stored data can be specified.
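As a sketch of how delimiters are declared, the table definition below follows the style of the book's examples. The table name employees and its columns are assumptions for illustration and do not come from these notes. The delimiter values shown are Hive's defaults for text files, so the ROW FORMAT clause here only makes them explicit.

```sql
-- Hypothetical table mixing primitive and collection types.
CREATE TABLE IF NOT EXISTS employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- between columns
  COLLECTION ITEMS TERMINATED BY '\002'  -- between array/struct elements
  MAP KEYS TERMINATED BY '\003'          -- between a map key and its value
  LINES TERMINATED BY '\n'               -- between rows
STORED AS TEXTFILE;
```

For comma-separated input, replacing '\001' with ',' in the FIELDS clause is the usual change.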
Chapter 4. HiveQL: Data Definition
Creating databases, creating and modifying tables, and partition operations.
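The statements below are a minimal sketch of the operations this chapter covers. The database name financials and the column layout of stocks are assumptions made for illustration.

```sql
-- Create and select a database (name is hypothetical).
CREATE DATABASE IF NOT EXISTS financials;
USE financials;

-- Create a table partitioned by symbol (assumed layout).
CREATE TABLE IF NOT EXISTS stocks (
  ymd         STRING,
  price_close FLOAT,
  volume      INT
)
PARTITIONED BY (symbol STRING);

-- Modify the table: add a column.
ALTER TABLE stocks ADD COLUMNS (price_open FLOAT);

-- Partition operations: add and drop a partition.
ALTER TABLE stocks ADD IF NOT EXISTS PARTITION (symbol = 'AAPL');
ALTER TABLE stocks DROP IF EXISTS PARTITION (symbol = 'AAPL');
```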
Chapter 5. HiveQL: Data Manipulation
1. Data can be loaded from, and exported to, both the local filesystem and HDFS.
2. Creating a table from a query, and inserting query results into a table.
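The two points above can be sketched as follows. The file paths and the table aapl_closes are hypothetical, and stocks is assumed to be partitioned by symbol.

```sql
-- Load from the local filesystem; drop LOCAL to load from an HDFS path instead.
LOAD DATA LOCAL INPATH '/tmp/stocks_aapl.csv'
OVERWRITE INTO TABLE stocks
PARTITION (symbol = 'AAPL');

-- Export query results to a local directory; drop LOCAL to write to HDFS.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/aapl_export'
SELECT ymd, price_close FROM stocks WHERE symbol = 'AAPL';

-- Create a table directly from a query.
CREATE TABLE aapl_closes AS
SELECT ymd, price_close FROM stocks WHERE symbol = 'AAPL';

-- Insert query results into an existing table, replacing its contents.
INSERT OVERWRITE TABLE aapl_closes
SELECT ymd, price_close FROM stocks
WHERE symbol = 'AAPL' AND price_close > 100;
```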
Chapter 6. HiveQL: Queries
The various forms of SELECT, plus JOIN, CLUSTER BY, and so on.
WHERE supports pattern matching with LIKE and regular expressions with RLIKE.
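A small illustration of the difference, assuming a stocks table with a symbol column. LIKE uses SQL wildcards (% and _), while RLIKE takes a Java regular expression.

```sql
-- LIKE: symbols starting with 'AA'.
SELECT ymd, symbol FROM stocks WHERE symbol LIKE 'AA%';

-- RLIKE: symbols that are exactly AAPL or IBM.
SELECT ymd, symbol FROM stocks WHERE symbol RLIKE '^(AAPL|IBM)$';
```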
JOIN: ON conditions support only equality comparisons (equi-joins). Inequality comparisons are not supported, and neither is OR between conditions.
1. The JOIN runs first and the WHERE second: the join produces one batch of rows, and WHERE then filters that result.
2. Partition filters in ON clauses are ignored for OUTER JOINs. However, using such filter predicates in ON clauses for inner joins does work!
In an outer join, writing the partition condition in the ON clause has no effect. To use partition conditions to speed up the query, filter inside subqueries and join the subqueries.
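The subquery approach might look like the sketch below. It assumes stocks and dividends are both partitioned by symbol and share a ymd column; the column names are illustrative.

```sql
-- Filter each side on its partition column first, then do the outer join.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM
  (SELECT * FROM stocks    WHERE symbol = 'AAPL') s
LEFT OUTER JOIN
  (SELECT * FROM dividends WHERE symbol = 'AAPL') d
ON s.ymd = d.ymd;
```

Because each subquery carries its own WHERE clause, the partition filter is applied before the join rather than being dropped from the ON clause.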
Join types: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, LEFT SEMI JOIN (this plays the role of IN; newer versions of Hive apparently support IN with subqueries directly).
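A sketch of LEFT SEMI JOIN standing in for an IN subquery, assuming stocks and dividends share ymd and symbol columns. Only columns from the left-hand table may appear in the SELECT and WHERE clauses.

```sql
-- Return stock rows that have a matching dividend row.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
LEFT SEMI JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol;

-- Roughly the same intent written with IN, for Hive versions
-- that accept subqueries in WHERE (single-column match only):
-- SELECT s.ymd, s.symbol, s.price_close
-- FROM stocks s
-- WHERE s.ymd IN (SELECT d.ymd FROM dividends d);
```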
hive> SELECT * FROM stocks JOIN dividends
    > WHERE stocks.symbol = dividends.symbol AND stocks.symbol = 'AAPL';
In Hive, this query first computes the Cartesian product and only then applies the WHERE filter.
In Hive, this query computes the full Cartesian product before applying the WHERE
clause. It could take a very long time to finish. When the property hive.mapred.mode is
set to strict, Hive prevents users from inadvertently issuing a Cartesian product query.
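As a sketch of the safer form, the join condition can be moved into an ON clause so that Hive performs an equi-join instead of a Cartesian product. The table aliases and the partition layout (both tables partitioned by symbol) are assumptions.

```sql
-- Turn on strict mode for this session.
SET hive.mapred.mode=strict;

-- Rejected in strict mode: a JOIN without ON is a Cartesian product.
-- SELECT * FROM stocks JOIN dividends
-- WHERE stocks.symbol = dividends.symbol AND stocks.symbol = 'AAPL';

-- Equi-join form of the same query. Strict mode also requires a partition
-- filter on partitioned tables, hence the filter on both sides.
SELECT *
FROM stocks s JOIN dividends d
  ON s.symbol = d.symbol
WHERE s.symbol = 'AAPL' AND d.symbol = 'AAPL';
```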
When joining two tables where one of them is small, a map-side join can speed up the query.
However, this optimization does not apply to RIGHT OUTER JOIN or FULL OUTER JOIN.
Hive does not support the optimization for right- and full-outer joins.
Optimization: when certain conditions are met, set the corresponding parameters to enable it.
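The parameters in question are, as far as I know, hive.auto.convert.join and hive.mapjoin.smalltable.filesize. The sketch below enables automatic conversion and also shows the older hint style; the threshold value and column names are illustrative assumptions.

```sql
-- Let Hive convert eligible joins to map-side joins automatically.
SET hive.auto.convert.join=true;
-- Size threshold (in bytes) below which a table counts as "small".
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

-- Older style: name the small table in a hint.
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
```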
The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of
the query result set. This means that all the data is passed through a single reducer,
which may take an unacceptably long time to execute for larger data sets.
ORDER BY sorts globally: all the data is sorted by a single reducer.
Because ORDER BY can result in excessively long run times, Hive will require a LIMIT
clause with ORDER BY if the property hive.mapred.mode is set to strict. By default, it is
set to nonstrict. (A safeguard.)
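A minimal sketch of an ORDER BY query that strict mode accepts, assuming the stocks table is partitioned by symbol. The LIMIT value is arbitrary.

```sql
SET hive.mapred.mode=strict;

-- Global sort through a single reducer; strict mode requires the LIMIT.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
WHERE s.symbol = 'AAPL'
ORDER BY s.ymd ASC
LIMIT 100;
```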
SORT BY is a partial (local) ordering: each reducer sorts only its own data.
DISTRIBUTE BY controls how map output is divided among reducers.
DISTRIBUTE BY specifies how the map output is assigned to each reducer.
Typically, rows with the same value of a given field are sent to the same reducer, which is similar in spirit to GROUP BY.
It is often used together with SORT BY (DISTRIBUTE BY comes first), which gives the effect of grouping first and then sorting within each group.
CLUSTER BY has the same effect as DISTRIBUTE BY plus SORT BY on the same columns.
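The relationship between the three clauses can be sketched as follows, assuming a stocks table with ymd, symbol, and price_close columns.

```sql
-- Send all rows for a symbol to the same reducer,
-- then sort within each reducer by symbol and date.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;

-- CLUSTER BY s.symbol is shorthand for
-- DISTRIBUTE BY s.symbol SORT BY s.symbol.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;
```

Neither query produces a total ordering across reducers; only ORDER BY does that.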
Additional information:
http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2843243.html
cast(value AS TYPE)
When casting, check the converted values to make sure the result is what you need.
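A short sketch of why the converted values deserve a check. The table employees_raw, with a salary column stored as STRING, is a hypothetical example.

```sql
-- Compare a string column numerically by casting it first.
SELECT name, salary
FROM employees_raw
WHERE cast(salary AS FLOAT) < 100000.0;

-- A string that cannot be parsed as a number yields NULL, not an error,
-- so bad input silently drops out of comparisons like the one above.
SELECT cast('123' AS INT), cast('abc' AS INT)
FROM employees_raw
LIMIT 1;
```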
Queries that Sample Data
Random sampling of data.
I did not understand part of this; skipping it for now.
Block Sampling
Input Pruning for Bucket Tables
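For later reference, the three sampling topics can be sketched roughly as below. The tables numbers, numbersflat, and numbers_bucketed (each with a single INT column number) are assumptions modeled on the book's examples.

```sql
-- Bucket sampling on a random value: a different sample on each run.
SELECT * FROM numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;

-- Bucket sampling on a column: the same rows on each run.
SELECT * FROM numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;

-- Block sampling: a percentage of the input, at block granularity.
SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;

-- Input pruning: if the table is bucketed on the sampled column,
-- only the files for the requested bucket need to be read.
CREATE TABLE numbers_bucketed (number INT)
CLUSTERED BY (number) INTO 3 BUCKETS;

-- Needed on the older Hive versions the book covers.
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE numbers_bucketed
SELECT number FROM numbers;

SELECT * FROM numbers_bucketed TABLESAMPLE(BUCKET 2 OUT OF 3 ON number) s;
```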
UNION ALL combines the rows of two or more tables (queries); the columns must match in number and type.
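A sketch in the subquery form that older Hive versions require. The tables log1 and log2 and their columns are hypothetical.

```sql
-- Combine two log tables, tagging each row with its origin.
SELECT log.ymd, log.level, log.message, log.source
FROM (
  SELECT l1.ymd, l1.level, l1.message, 'Log1' AS source FROM log1 l1
  UNION ALL
  SELECT l2.ymd, l2.level, l2.message, 'Log2' AS source FROM log2 l2
) log
SORT BY log.ymd ASC;
```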
Author: linger
Link to this article: http://blog.csdn.net/lingerlanlan/article/details/41153799
Copyright notice: This is an original article by the blogger and may not be reproduced without consent.