Hive Principles and Source Code Analysis: Hive Source Code Architecture and Theory (I)



What is Hive?
- Data warehousing: storing, querying, and analyzing large-scale data
- SQL language: an easy-to-use, SQL-like query language (HiveQL)
- Programming model: lets developers plug in custom UDFs, Transform scripts, Mappers, and Reducers, making it easier to do complex work that plain SQL cannot express
- Data formats: process data in any format on Hadoop, or store it in an optimized format such as RCFile, ORCFile, or Parquet
- Data services: HiveServer2 exposes multiple APIs (JDBC, ODBC) for accessing data on Hadoop
- Metadata service: describes what the data is and where it lives; the de facto metadata standard on Hadoop
The relationship between Hive and Hadoop

Hive CLI



The compiler process
Put simply, Hive's job is to parse a SQL statement into MapReduce tasks and hand them to Hadoop to execute, so the core of Hive is how it interprets SQL:

Hive Schema



Where the source code lives



Three important modules:

HiveParser:
This class is the entry point for parsing SQL.
Source location
The input is a SQL string; the output is a tree (an AST, short for abstract syntax tree, also called a syntax tree). ASTNode is the root node of the tree, and it holds an array of children:


  public ASTNode parse(String command, Context ctx)
      throws ParseException {
    return parse(command, ctx, true);
  }
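
As a quick usage sketch (not from the original article; it assumes Hive's ParseDriver, which drives this parser), you can feed a SQL string in and dump the resulting tree:

  // Parse a query string into an AST and pretty-print it.
  ParseDriver pd = new ParseDriver();
  ASTNode tree = pd.parse("SELECT a, b FROM src WHERE e < 10", null);
  System.out.println(tree.dump()); // prints TOK_QUERY, TOK_FROM, TOK_TABREF, ...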


ASTNode:
How ASTNode gets its child nodes:


  
   /*
    * (non-Javadoc)
    * @see org.apache.hadoop.hive.ql.lib.Node#getChildren()
    */
   @Override
   public ArrayList<Node> getChildren() {
     if (super.getChildCount() == 0) {
       return null;
     }
     ArrayList<Node> ret_vec = new ArrayList<Node>();
     for (int i = 0; i < super.getChildCount(); ++i) {
       ret_vec.add((Node) super.getChild(i));
     }
     return ret_vec;
   }
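
With getChildren() available, walking the whole tree is a straightforward depth-first traversal. A minimal sketch (not Hive source; it assumes only the org.apache.hadoop.hive.ql.lib.Node interface) of such a walk:

  // Depth-first walk over an AST, printing each node indented by its depth.
  static void dumpTree(Node node, int depth) {
    StringBuilder indent = new StringBuilder();
    for (int i = 0; i < depth; i++) {
      indent.append("  ");
    }
    System.out.println(indent + node.getName());
    if (node.getChildren() != null) {
      for (Node child : node.getChildren()) {
        dumpTree(child, depth + 1);
      }
    }
  }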


Next, the abstract syntax tree is turned into a QB (query block).
SemanticAnalyzer.java (the semantic analyzer) was nearly 7,000 lines of code in older versions; a Java class with that many lines runs into compilation problems, so it has since been split up.
Starting from the root node, the semantic analyzer walks the entire AST depth-first.


  // ... omitted ...
  @Override
  @SuppressWarnings("nls")
  public void analyzeInternal(ASTNode ast) throws SemanticException {

    reset();
    QB qb = new QB(null, null, false); // the final result container
    this.qb = qb;
    this.ast = ast;
    ASTNode child = ast;
    // ... omitted ...

    // continue the analysis from the child ASTNode
    doPhase1(child, qb, initPhase1Ctx());

    getMetaData(qb); // fetch metadata
    LOG.info("Completed getting MetaData in Semantic Analysis");
    Operator sinkOp = genPlan(qb); // turn the QB into an operator DAG
    // ... omitted ...

    ParseContext pCtx = new ParseContext(conf, qb, child, opToPartPruner,
        opToPartList, topOps, topSelOps, opParseCtx, joinContext, topToTable,
        loadTableWork, loadFileWork, ctx, idToTableNameMap, destTableId, uCtx,
        listMapJoinOpsNoReducer, groupOpToInputTables, prunedPartitions,
        opToSamplePruner);
    Optimizer optm = new Optimizer(); // the logical optimizer
    optm.setPctx(pCtx);
    optm.initialize(conf);
    pCtx = optm.optimize();
    init(pCtx);
    qb = pCtx.getQB();
    // ... omitted ...
    genMapRedTasks(qb);
    // ... omitted ...


Finally, a QB is generated:


public class QB {

  private static final Log LOG = LogFactory.getLog("hive.ql.parse.QB");

  private final int numJoins = 0;
  private final int numGbys = 0;
  private int numSels = 0;
  private int numSelDi = 0;
  private HashMap<String, String> aliasToTabs;
  private HashMap<String, QBExpr> aliasToSubq;
  private List<String> aliases;
  private QBParseInfo qbp;  // parse info: one of the two key fields
  private QBMetaData qbm;   // metadata: the other key field
  private QBJoinTree qbjoin;
  private String id;
  private boolean isQuery;
  private CreateTableDesc tblDesc = null; // table descriptor of the final result
  // ... omitted ...


The two important fields of QB are qbp and qbm; each holds a reference back to its QB, and together they form a tree.
Back in analyzeInternal, Operator sinkOp = genPlan(qb) builds the plan. Let's look at the structure of the Operator class:


public abstract class Operator<T extends Serializable> implements Serializable,
    Node {

  // Bean methods

  private static final long serialVersionUID = 1L;

  protected List<Operator<? extends Serializable>> childOperators;
  protected List<Operator<? extends Serializable>> parentOperators;
  protected String operatorId;
  // ... omitted ...


From the code you can see that an Operator has many children and parents, so the plan is a directed acyclic graph (DAG); genPlan() turns the QB into that DAG. Next comes Optimizer optm = new Optimizer(), the logical optimizer. So how many logical optimizers does Hive have? Open Optimizer:
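
The original article shows the Optimizer source here. As a rough sketch (paraphrased from older Hive versions, not verbatim source), Optimizer.initialize() registers a list of Transform rules, each guarded by its configuration flag:

  public void initialize(HiveConf hiveConf) {
    transformations = new ArrayList<Transform>();
    // column pruning, guarded by hive.optimize.cp
    if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVEOPTCP)) {
      transformations.add(new ColumnPruner());
    }
    // predicate pushdown, guarded by hive.optimize.ppd
    if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVEOPTPPD)) {
      transformations.add(new PredicatePushDown());
    }
    // ... more rules: partition pruning, join reordering, group-by rewrites ...
  }

optimize() then simply applies each registered Transform to the ParseContext in turn.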

After these optimizations, sinkOp is a rewritten operator tree. genMapRedTasks(qb) then slices the optimized logical plan into a DAG of tasks; the splitting into tasks and the physical optimization of each task also happen inside genMapRedTasks().

That is the general outline of the compilation path for a Hive SQL statement.



Hive and the RDBMS



Hive data model
Database
- Consistent with the database concept in an RDBMS
- Each database corresponds to an HDFS directory
For example:
hive> CREATE DATABASE hugo;
The corresponding HDFS directory is:
/user/hugo/hive/hugo.db



Metadata
Metadata describes and maps the data on HDFS; it can be understood as "data about data". The focus of learning Hive is the Language Manual for Hive queries.



Sorting and distribution: the various BYs
The biggest difference from a traditional relational database is the scale of data Hive can handle, and nowhere does that show more clearly than in how sorting and distribution work (see the examples after this list).
- ORDER BY performs a global sort with a single reducer, which is slow on large data volumes
- SORT BY distributes rows randomly across reducers, then sorts within each reducer
- DISTRIBUTE BY routes each record to a reducer according to the DISTRIBUTE BY field
- CLUSTER BY is shorthand for DISTRIBUTE BY + SORT BY on the same column
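
A few illustrative queries, against a hypothetical table logs(userid INT, ts BIGINT) rather than anything from the original article:

  -- Global order: a single reducer sorts everything (slow at scale).
  SELECT * FROM logs ORDER BY ts;

  -- Each reducer's output is sorted, but there is no global order.
  SELECT * FROM logs SORT BY ts;

  -- All rows with the same userid go to the same reducer, sorted within it.
  SELECT * FROM logs DISTRIBUTE BY userid SORT BY userid;

  -- Equivalent shorthand for the previous query.
  SELECT * FROM logs CLUSTER BY userid;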



Viewing query plans
The EXPLAIN command shows the query plan generated for a given query.
For example:
hive> EXPLAIN SELECT * FROM src LIMIT 1;
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME src))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_LIMIT 1)))
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 1



Common optimizations
- Scan-related
  - Predicate pushdown
  - Column pruning
  - Partition pruning
- Join-related
  - Put the small table on the left side of the join
  - Number of jobs a join starts
  - Map join
- GROUP BY related
  - Skew in data
- Merging small files

Column pruning
When reading data, read only the columns you care about and skip the rest; avoid writing SELECT * FROM xxx.
For example, for the query:
SELECT a, b FROM src WHERE e < 10;
where src contains five columns (a, b, c, d, e), columns c and d are ignored; only a, b, and e are read.
The option defaults to true: hive.optimize.cp = true



Partition pruning
Cut unnecessary partitions out of the query.
For example, for the following query:
SELECT * FROM t1 JOIN (SELECT * FROM t2) subq ON (t1.c1 = subq.c2)
WHERE subq.prtn = 100;
The predicate subq.prtn = 100 is pushed into the subquery, reducing the number of partitions that have to be read.
The option defaults to true: hive.optimize.pruner = true


Small table on the left side of the join
The table or subquery with fewer rows should be placed on the left side of the JOIN operator.
The reason is that in the reduce phase of a join, the contents of the tables on the left of the JOIN operator are loaded into memory; keeping the smaller table on the left effectively reduces the chance of an OOM error. When the small table fits in memory entirely, a map join avoids the reduce phase altogether, as sketched below.
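
A sketch of the map join mentioned in the list above (table names are hypothetical):

  -- The /*+ MAPJOIN(s) */ hint loads the small table s into memory on every
  -- mapper, so the join happens map-side and no reduce phase is needed.
  SELECT /*+ MAPJOIN(s) */ s.key, b.value
  FROM small_table s JOIN big_table b ON (s.key = b.key);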
Number of jobs a join starts
If the join key is the same, no matter how many tables are joined, they are merged into a single MapReduce (or Tez) job rather than N of them. The same holds for OUTER JOIN:
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid)
JOIN newuser x ON (u.userid = x.userid);
Joins do not support non-equality conditions
!= or <> is not supported in a JOIN ... ON condition:
SELECT ...
JOIN ...
ON (a.key != b.key);
Why not? Imagine a.key is skewed: out of 100 million rows, only one row has key = 1 and every other row has key = 0. A non-equality join would send almost all of the data to a single node and overwhelm it, and the matching rows that hash to other machines could never be found.
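
If a non-equality condition is genuinely needed, one workaround (not from the original article, and costly; shown for illustration only) is a cross join with the condition moved into WHERE:

  -- Every pair of rows is produced and then filtered, so this explodes to
  -- |a| x |b| intermediate rows; use only when both tables are small.
  SELECT a.key, b.key
  FROM a CROSS JOIN b
  WHERE a.key != b.key;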
GROUP BY: skew in data
The main concern is data skew:
hive.groupby.skewindata = true
When this option is set to true, the generated query plan contains two MR jobs.
In the first MR job, map output is distributed to reducers randomly; each reducer performs a partial aggregation and emits its result, so rows with the same GROUP BY key may land on different reducers, which balances the load.
The second MR job then redistributes the pre-aggregated results by GROUP BY key (this time guaranteeing that all rows with the same key reach the same reducer) and completes the final aggregation.
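
Enabling it is just a session setting; the query below is an illustrative example, not from the original:

  SET hive.groupby.skewindata = true;
  -- With the flag on, this compiles into two MR jobs: a randomized partial
  -- aggregation followed by the final aggregation keyed on `key`.
  SELECT key, count(*) FROM src GROUP BY key;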


Merging small files
The merge feature increases task run time. The cost of a merge depends largely on the size of a single reducer's output file: the larger the reduce-side output, the longer the merge takes. The merge adds an extra MapReduce job to each Hive task.
The more small files there are, the more Metastore lookups are needed and the slower SQL parsing becomes, and small files do a great deal of damage to Hadoop itself. Hadoop is not afraid of large files; it is afraid of files that are small and numerous, because every file must be tracked in the NameNode's namespace and the NameNode becomes overloaded. This is why Hive merges small files.
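
The behavior is controlled by a handful of settings (names from Hive's configuration; defaults vary by version, so treat the values as illustrative):

  SET hive.merge.mapfiles = true;               -- merge outputs of map-only jobs
  SET hive.merge.mapredfiles = true;            -- merge outputs of map-reduce jobs
  SET hive.merge.size.per.task = 256000000;     -- target size (bytes) of merged files
  SET hive.merge.smallfiles.avgsize = 16000000; -- merge when avg output file is smaller than this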



SerDe
SerDe is short for Serializer/Deserializer; its purpose is serialization and deserialization.
Serialization (writing to disk) formats include:
- delimited text (tab, comma, Ctrl-A)
- the Thrift protocol
Deserialization (reading into memory) targets include:
- Java Integer/String/ArrayList/HashMap
- Hadoop Writable classes
- user-defined classes



When to consider adding a new SerDe
- The data has a special serialization format that current Hive does not support, and you do not want to convert the data before loading it into Hive
- You have a more efficient way of serializing data on disk



Example: using RegexSerDe

CREATE TABLE apache_log (
  host STRING, identity STRING, user STRING,
  time STRING, request STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
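
Once the table exists, raw Apache access-log lines can be loaded and queried as ordinary columns (the path below is hypothetical):

  LOAD DATA INPATH '/logs/access_log' INTO TABLE apache_log;

  -- The regex splits each log line into the nine declared columns.
  SELECT host, status, count(*) AS hits
  FROM apache_log
  GROUP BY host, status;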

