HIVE[6] HiveQL Query

Source: Internet
Author: User

6.1 SELECT ... FROM statement

hive> SELECT name, salary FROM employees; -- general query
hive> SELECT e.name, e.salary FROM employees e; -- table aliases are also supported

When a user selects a column that is a collection data type, Hive uses JSON syntax for the output:

hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith", "Todd Jones"]
The subordinates column is an ARRAY.

hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes": 0.2, "State Taxes": 0.05}
The deductions column is a MAP.

hive> SELECT name, address FROM employees;
John Doe {"street": "1 Michigan Ave.", "city": "Chicago", "state": "IL"}
The address column is a STRUCT.

hive> SELECT name, subordinates[0] FROM employees; -- first element of the array; returns NULL if the element does not exist
hive> SELECT name, deductions["State Taxes"] FROM employees; -- look up a MAP element by key
hive> SELECT name, address.city FROM employees; -- reference a STRUCT field with the "." notation

These three kinds of reference can also be used in the WHERE clause.

hive> SELECT symbol, `price.*` FROM stocks;
A regular expression can be used to select columns. This statement selects the symbol column and all columns whose names start with price.

hive> SELECT upper(name), salary, deductions["Federal Taxes"], round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
round() returns the integer nearest to a DOUBLE value.

Operators supported in Hive: A + B, A - B, A * B, A / B, A % B (remainder), A & B (bitwise AND), A | B (bitwise OR), A ^ B (bitwise XOR), ~A (bitwise NOT).
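
As a quick illustration of the operators listed above, the following sketch evaluates them on integer literals. It assumes the employees table from the earlier examples is available purely as a row source; the column aliases are invented for this illustration.

```sql
-- Arithmetic and bitwise operators applied to integer literals.
-- employees is used only to supply a FROM clause; LIMIT 1 keeps one output row.
SELECT 7 + 3 AS sum_val,      -- 10
       7 - 3 AS diff_val,     -- 4
       7 * 3 AS product_val,  -- 21
       7 % 3 AS remainder,    -- 1
       6 & 3 AS bit_and,      -- 2  (110 AND 011 = 010)
       6 | 3 AS bit_or,       -- 7  (110 OR  011 = 111)
       6 ^ 3 AS bit_xor,      -- 5  (110 XOR 011 = 101)
       ~6    AS bit_not       -- -7 (two's complement inverse)
FROM employees
LIMIT 1;
```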

[Picture taken from the Hive Programming Guide, page 81.]

Note: Arithmetic operators accept any numeric types. If the two operand types differ, the value of the type with the narrower range is converted to the wider type. When doing arithmetic, watch for overflow and underflow: Hive follows the rules of the underlying Java types, so a result is not automatically promoted to a wider type when it overflows or underflows. Multiplication and division are the most likely to cause this problem. It is sometimes useful to apply functions that scale values from one range to another, such as dividing by a power of 10 or taking a logarithm.

Mathematical functions:

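To make the overflow note and the scaling advice more concrete, here is a minimal sketch. It assumes the employees table from section 6.1 with a numeric salary column; the aliases are made up for illustration. No output is shown, because the value of the first expression depends on the overflow behavior described in the note.

```sql
SELECT
  -- INT + INT is evaluated as INT; the true sum does not fit in an INT,
  -- and the result is not widened automatically.
  CAST(2147483647 AS INT) + CAST(1 AS INT) AS int_sum,
  -- Widening one operand to BIGINT first makes the whole expression BIGINT.
  CAST(2147483647 AS BIGINT) + 1           AS bigint_sum,
  -- Scaling a value down by dividing by a power of 10 ...
  salary / pow(10, 3)                      AS salary_in_thousands,
  -- ... or by taking a base-10 logarithm.
  log10(salary)                            AS salary_log10
FROM employees
LIMIT 1;
```

Casting before the arithmetic, rather than after it, is the point of the second expression: once an INT result has overflowed, casting it to BIGINT cannot recover the lost value.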

[Picture taken from the Hive Programming Guide, page 82.]

Note that floor, round, and ceil (round up) take a DOUBLE as input and return a BIGINT.

Aggregate functions:

[Picture taken from the Hive Programming Guide, page 85.]

hive> SET hive.map.aggr=true; -- setting this property to true can improve aggregation performance
hive> SELECT count(*), avg(salary) FROM employees;
This setting triggers "top-level" aggregation in the map phase (aggregation that is not top-level is performed after a GROUP BY), but it requires more memory.

hive> SELECT count(DISTINCT symbol) FROM stocks;
Many functions also accept an expression of the form DISTINCT ... to remove duplicate values. Because of a bug, this query returns 0 if symbol is a partition column.

hive> SELECT count(DISTINCT ymd), count(DISTINCT volume) FROM stocks;
The official documentation says that this is not allowed, but in practice it works.

Table-generating functions: the opposite of aggregate functions are the so-called table-generating functions, which expand a single column into multiple columns or multiple rows.

hive> SELECT explode(subordinates) AS sub FROM employees;
This statement turns the subordinates field of each employees row into zero or more new rows. If the subordinates array is empty, no rows are produced; otherwise each array element produces one new row. The AS sub clause defines the column alias sub. Hive requires a column alias when a table-generating function is used. Table-generating functions are described in detail in Chapter 13:
[Picture taken from the Hive Programming Guide, page 85.]

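A table-generating function such as explode() cannot be selected alongside ordinary columns. One common way to pair each exploded element with its source row is LATERAL VIEW. A minimal sketch, assuming the employees table used above (the view alias sub_view is a made-up name):

```sql
-- One output row per (employee, subordinate) pair.
-- Employees whose subordinates array is empty produce no rows.
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) sub_view AS sub;
```
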
Other built-in functions:

[Pictures taken from the Hive Programming Guide, pages 88-92.]

hive> SELECT upper(name), salary, deductions["Federal Taxes"], round(salary * (1 - deductions["Federal Taxes"])) FROM employees LIMIT 2;
The LIMIT clause restricts the number of rows returned.

hive> SELECT upper(name), salary, deductions["Federal Taxes"] AS fed_taxes, round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed FROM employees LIMIT 2;
fed_taxes and salary_minus_fed are column aliases for the two computed results.

Nested query example:
hive> FROM (
    >   SELECT upper(name) AS name, salary, deductions["Federal Taxes"] AS fed_taxes,
    >          round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
    >   FROM employees
    > ) e
    > SELECT e.name, e.salary_minus_fed_taxes
    > WHERE e.salary_minus_fed_taxes > 7000;

CASE ... WHEN ... THEN example:
hive> SELECT name, salary,
    >   CASE
    >     WHEN salary < 50000.0 THEN 'low'
    >     WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
    >     WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    >     ELSE 'very high'
    >   END AS bracket
    > FROM employees;

In most cases a Hive query triggers a MapReduce job, but some queries can run without MapReduce. For example:
hive> SELECT * FROM employees;
hive> SELECT * FROM employees WHERE country = 'US' AND state = 'CA' LIMIT 100;
MapReduce is not needed when the WHERE clause filters only on partition columns, whether or not a LIMIT clause is used.

hive> SET hive.exec.mode.local.auto=true;
If this property is true, Hive also tries to run other operations in local mode; otherwise Hive uses MapReduce for all other queries. It is best to add this setting to the $HOME/.hiverc file.

6.2 WHERE statement

The WHERE clause filters rows by condition and is used as in ordinary SQL.

Predicate operators: these operators can also be used in JOIN ... ON and HAVING clauses.
[Pictures taken from the Hive Programming Guide, pages 88-92.]

LIKE and RLIKE: LIKE is a standard SQL operator that matches a string by its beginning or end, or by a substring that appears anywhere within it. RLIKE is a Hive extension that lets the pattern be written in the more powerful Java regular expression language.
hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.'; -- employees whose street ends with "Ave."
hive> SELECT name, address.street FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
In the string after RLIKE, a period (.) matches any character, an asterisk (*) repeats the item to its left zero or more times, and the expression (x|y) matches either x or y. (PS: if you are unfamiliar with regular expressions, look them up online.)

6.3 GROUP BY statement

GROUP BY is usually used with aggregate functions: it groups the result rows by one or more columns and then aggregates each group (usage is similar to SQL).
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd);

HAVING example:
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)
    > HAVING avg(price_close) > 50.0;

6.4 JOIN statement

INNER JOIN: only rows that match the join condition in both joined tables are kept (usage is similar to SQL).
hive> SELECT a.ymd, a.price_close, b.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

Note: Hive does not support non-equality conditions in the ON clause, and it does not support OR between the predicates in ON; AND is supported. The following query is therefore not valid in Hive:
hive> SELECT a.ymd, a.price_close, b.price_close
    > FROM stocks a JOIN stocks b ON a.ymd <= b.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

Joining more than two tables:
hive> SELECT a.ymd, a.price_close, b.price_close, c.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    > JOIN stocks c ON a.ymd = c.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
In most cases Hive starts one MapReduce job for each pair of joined tables. Here it first starts a job that joins table a with table b, and then a second job that joins the output of the first job with table c.

Tip: Hive always executes joins from left to right. When three or more tables are joined and every ON clause uses the same join key, only one MapReduce job is produced. Hive also assumes that the last table in the query is the largest: while joining each row it tries to buffer the other tables and streams the last table through. The tables in a join query should therefore be ordered so that their sizes increase from left to right.

LEFT OUTER JOIN (usage is similar to SQL):
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
In a left outer join, every record from the table on the left of the JOIN operator that satisfies the WHERE clause is returned. When no record in the right table matches the ON condition, the columns selected from the right table are NULL.
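
Because unmatched rows carry NULL in the right-table columns, a left outer join can also be used to find left-table rows that have no match at all. A hedged sketch on the same stocks and dividends tables, where the NULL test on the join key is deliberate:

```sql
-- AAPL trading days with no dividend record.
-- After the join, d.ymd is NULL exactly for the stocks rows
-- that found no matching row in dividends.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
LEFT OUTER JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL'
  AND d.ymd IS NULL;
```
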
Tip: the WHERE clause is evaluated after the join has been performed, so a WHERE predicate on a column that is NULL for unmatched rows also removes those rows. WHERE should therefore normally filter only on columns that will not be NULL. Partition filters placed in the ON clause have no effect for OUTER JOINs, although they do work for inner joins.

RIGHT OUTER JOIN: returns all records from the right table that satisfy the WHERE clause; fields from the left table that have no match are NULL.

FULL OUTER JOIN: returns all records from all tables that satisfy the WHERE clause.

LEFT SEMI JOIN: returns records from the left table, provided that matching records that satisfy the ON predicates are found in the right table. Hive does not support a right semi-join.
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
The following query is not supported by Hive:
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s
    > WHERE s.ymd, s.symbol IN (SELECT * FROM dividends d);

Cartesian product JOIN: the number of rows in the left table multiplied by the number of rows in the right table gives the size of the result set.
hive> SELECT * FROM stocks JOIN dividends;
MapReduce cannot optimize a query written this way.
hive> SET hive.mapred.mode=strict;
With this setting, Cartesian product queries are prohibited.

Map-side JOIN: if all but one of the tables are small, the small tables can be held in memory while the largest table is passed through the mappers, and Hive can perform the join on the map side (a map-side JOIN). Because Hive matches rows against the small tables in memory and skips the reduce step that a regular join needs, this optimization is noticeably faster than a regular join even for very small data sets. It removes the reduce step and sometimes also reduces the number of map steps.
hive> SET hive.auto.convert.join=true;
Starting with version 0.7, this property must be set for the optimization to take effect.
hive> SET hive.mapjoin.smalltable.filesize=25000000;
This configures the maximum size, in bytes, of a small table that can use the optimization.
RIGHT OUTER JOIN and FULL OUTER JOIN do not support this optimization. For bucketed tables, large tables can also use it under certain conditions: the table data must be bucketed on the keys used in the ON clause, and the number of buckets in one table must be a multiple of the number of buckets in the other, so that the data can be joined bucket by bucket.
hive> SET hive.optimize.bucketmapjoin=true;
This property also needs to be set; it is off by default.

6.5 ORDER BY and SORT BY (page 132)

Hive's ORDER BY has the same meaning as in other SQL dialects: it performs a global sort of the query results. This means that all the data passes through a single reducer, which may take a very long time for a large data set. Hive also offers SORT BY, which sorts the data within each reducer, that is, it performs a local sort. Each reducer's output is then sorted (though not globally sorted), which can make a later global sort more efficient.
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER BY s.ymd ASC, s.symbol DESC; -- ORDER BY example
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s SORT BY s.ymd ASC, s.symbol DESC; -- SORT BY example
Note: because ORDER BY can run for a long time, Hive requires an ORDER BY statement to carry a LIMIT clause when hive.mapred.mode=strict. The default value of this property is nonstrict.

6.6 SORT BY and DISTRIBUTE BY

DISTRIBUTE BY controls how the map output is divided among the reducers.
Suppose we want the data for each stock symbol to be processed together. We can use DISTRIBUTE BY to make sure that records with the same stock symbol are sent to the same reducer, and then use SORT BY to order the data as we want:
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s
    > DISTRIBUTE BY s.symbol
    > SORT BY s.symbol ASC, s.ymd ASC;
DISTRIBUTE BY resembles GROUP BY in that it controls which reducer receives each row, while SORT BY controls how the data inside each reducer is ordered. Note that the DISTRIBUTE BY clause must be written before the SORT BY clause.

6.7 CLUSTER BY

In the example above, the s.symbol column is used in the DISTRIBUTE BY clause, and the s.symbol and s.ymd columns are used in the SORT BY clause. If the two clauses involve exactly the same columns and the sort is ascending (the default), CLUSTER BY is shorthand for the pair:
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s CLUSTER BY s.symbol;
Using DISTRIBUTE BY ... SORT BY, or its shorthand CLUSTER BY, exploits the parallelism of SORT BY while still producing a total ordering across the output files.

6.8 Type conversion

hive> SELECT name, salary FROM employees WHERE cast(salary AS FLOAT) < 10000.0;
The syntax for type conversion is cast(value AS TYPE); it returns NULL if the conversion fails. Note that the recommended way to convert a floating-point number to an integer is round() or floor() rather than cast. The BINARY type can only be cast to STRING.

6.9 Sampling queries

For queries that only need a sample of the data, Hive can sample a table by bucket:
hive> SELECT * FROM numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
The denominator in the BUCKET clause is the number of buckets into which the data is hashed, and the numerator is the number of the bucket that is selected.
hive> SELECT * FROM numberflat TABLESAMPLE(0.1 PERCENT) s;
This samples by percentage. It is an alternative to row-based sampling and works on data blocks.
Note: this kind of sampling does not necessarily work with every file format. The smallest unit of sampling is one HDFS block, so if the table is smaller than the usual block size of 128 MB, all rows are returned.
Percentage-based sampling provides a variable that controls the seed used for block-based sampling:
<property>
  <name>hive.sample.seednumber</name>
  <value>0</value>
</property>

6.10 UNION ALL

UNION ALL merges two or more tables. Every subquery in the union must produce the same columns, and the type of each corresponding field must match. This feature makes it easy to split a long, complex WHERE clause into two or more UNION subqueries. Unless the source table is indexed, however, the query makes multiple passes over the same source data.
FROM (
  FROM src SELECT src.key, src.value WHERE src.key < 100
  UNION ALL
  FROM src SELECT src.key, src.value WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*;
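
For comparison, the two subqueries above select disjoint key ranges from the same src table, so the same rows can also be obtained with a single WHERE clause. A minimal sketch, reusing the src table and output directory from the example:

```sql
-- Single pass over src: the two UNION ALL branches folded into one predicate.
-- The ranges key < 100 and key > 110 do not overlap, so no row is duplicated
-- or lost relative to the UNION ALL form.
INSERT OVERWRITE DIRECTORY '/tmp/union.out'
SELECT src.key, src.value
FROM src
WHERE src.key < 100 OR src.key > 110;
```

Which form is preferable depends on the query: the single-predicate form reads the source once, while the UNION ALL form keeps each branch's condition separate and readable.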
