Hive in Layman's Terms
1. What is Hive
1) What is Hive?
Here is the introduction from the Hive wiki:
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, which enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Hive does not mandate that data be read or written in the "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see the File Format and SerDe sections of the Developer Guide for details.
2) What is Hive not?
Hive is built on Hadoop, and Hadoop is a batch-processing system that does not guarantee low latency, so Hive queries cannot guarantee low latency either.
Hive's working mode is to submit a task and be notified when the task ends, rather than to query in real time. Correspondingly, a system like Oracle responds very quickly when run on a small data set, but can take several hours when the data set being processed is very large. Note that Hive may take several minutes to complete even when running on a very small data set.
In short, low latency is not the primary goal that Hive pursues. Hive's design goals are scalability, extensibility, fault tolerance, and loose coupling with input formats.
Hive Resources
Hive itself provides rich documentation; the following links cover some of the basics:
- Facebook mirror (may be blocked in some regions): [[http://mirror.facebook.com/facebook/hive]]
- Wiki page: [[http://wiki.apache.org/hadoop/Hive]]
- Getting Started: [[http://wiki.apache.org/hadoop/Hive/GettingStarted]]
- Query Language guide: [[http://wiki.apache.org/hadoop/Hive/HiveQL]]
- Presentations: [[http://wiki.apache.org/hadoop/Hive/Presentations]]
- Roadmap: [[http://wiki.apache.org/hadoop/Hive/Roadmap]]
Most questions about the use and features of Hive can be answered from the links above. Of course, as Hive continues to evolve, the documentation is often not updated as fast as Hive itself; if you want to learn what's new in Hive, or you encounter bugs, you can join Hive's mailing lists:
* User: [email protected]
* Developer: [email protected]
2. Why Use Hive
Problems faced:
- The cost for staff to learn MapReduce is too high.
- Project cycles are short and requirements change quickly.
- Sometimes only a simple environment is needed.
- How should complex queries be handled in MapReduce?

How Hive addresses them:
- Operates through an interface with SQL-like syntax, providing rapid development capability.
- Avoids writing MapReduce, reducing developers' learning cost.
- Extending functionality is very convenient.

Features of Hive:
- Scalable: Hive can expand the size of the cluster without restarting services.
- Extensible: Hive supports user-defined functions, so users can implement their own functions according to their requirements.
- Fault-tolerant: Hive has good fault tolerance; SQL can still complete execution when a node has problems.
3. Hive Architecture
The architecture of Hive is mainly divided into the following parts:
- User interfaces: including the command-line interface (CLI), the Client, the web interface (WUI), and the JDBC/ODBC interfaces.
- Middleware: includes the Thrift interface and the server side of JDBC/ODBC, used to integrate Hive with other programs.
- Metadata store (metastore): usually kept in a relational database such as MySQL or Derby.
- Underlying driver: includes the HiveQL interpreter, compiler, optimizer, and executor (engine).
- Hadoop: uses HDFS for storage and MapReduce for computation.
- There are three main user interfaces: CLI, Client, and WUI. The most commonly used is the CLI; starting the CLI also starts a copy of Hive. The Client is Hive's client, with which the user connects to the Hive Server; when starting in Client mode, you need to specify the node where the Hive Server resides and start the Hive Server on that node. The WUI accesses Hive through a browser.
- Hive stores metadata in a database such as MySQL or Derby. The metadata in Hive includes table names, the tables' columns and partitions with their attributes, table properties (whether a table is external, etc.), the directory where each table's data resides, and so on.
- The interpreter, compiler, and optimizer take an HQL query statement through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
- Hive data is stored in HDFS, and most queries are completed by MapReduce (queries containing only *, such as SELECT * FROM tbl, do not generate MapReduce tasks; see the sketch below).
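To make that last point concrete, here is a minimal sketch against the pvs table used later in this article (exact behavior varies by Hive version): a bare projection can be served by reading the files directly, while an aggregation is compiled into a MapReduce job.

```sql
-- A bare projection: Hive can read the files directly, no MapReduce job.
SELECT * FROM pvs;

-- An aggregation: compiled into a MapReduce job before execution.
SELECT ctry, COUNT(*) AS views
FROM pvs
GROUP BY ctry;
```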
(Figure: the relationship between Hive and Hadoop.)
4. Hive Metadata Store
The metastore is something like Hive's catalog. It stores all the information about tables, partitions, columns, types, and rules, and it can be modified and queried through the Thrift interface. Because it must serve the compiler efficiently, it is stored in a traditional RDBMS and managed using the relational model. This information is important, so it needs to be backed up, and queries against it must scale.
Hive stores metadata in an RDBMS, and there are three modes for connecting to the database:
- Single User Mode: connects to an in-memory Derby database; generally used for unit tests.
- Multi User Mode: connects to a database over a network; this is the most frequently used mode.
- Remote Server Mode: used when non-Java clients need to access the metastore database; a MetastoreServer is started on the server side, and the client accesses the metastore database through the MetastoreServer using the Thrift protocol.
Attached is the data dictionary of the metastore database (if you use MySQL, run SHOW TABLES):
- BUCKETING_COLS: bucketing columns.
- COLUMNS: field information for Hive tables (field comment, field name, field type, field ordinal).
- DBS: database information, storing HDFS path information.
- PARTITION_KEYS: the partitioning keys of Hive partitioned tables.
- SDS: the HDFS data directory and data format of every Hive table and table partition.
- SD_PARAMS: serialization/deserialization information, such as row separators, column separators, NULL representation characters, and so on.
- SEQUENCE_TABLE: holds the next available ID for each kind of Hive object. For example, the row ('org.apache.hadoop.hive.metastore.model.MTable', 21) means the next newly created Hive table gets TBL_ID 21, after which the row in SEQUENCE_TABLE is updated to 26 (it is incremented by 5 each time). Similarly, columns, partitions, and so on have corresponding records.
- SERDES, SERDE_PARAMS, SORT_COLS: SerDe and sort-column information.
- TABLE_PARAMS: table-level attributes, such as whether the table is external, table comments, etc.
- TBLS: basic information for all Hive tables.

With the tables above, the whole process of creating a Hive table becomes clear:
- Parse: the user-submitted Hive statement is parsed and decomposed into Hive objects such as tables, fields, and partitions.
- Build: based on the parsed information, construct the corresponding table, field, and partition objects, obtain the latest ID for each object from SEQUENCE_TABLE, write the object information (name, type, etc.) to the metadata tables through DAO methods, and on success advance the latest ID in SEQUENCE_TABLE by 5.
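To make the dictionary concrete, here is a small sketch of a query run directly against a MySQL metastore database (not inside Hive); it assumes the standard metastore schema, in which TBLS carries an SD_ID foreign key into SDS:

```sql
-- List every Hive table with its type and HDFS location,
-- straight from the metastore database.
SELECT t.TBL_ID, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID;
```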
5. Data Storage in Hive
First, Hive has no dedicated data storage format and does not build indexes on the data. Users can organize tables in Hive very freely: simply tell Hive the column separators and row separators in the data when creating the table, and Hive can parse the data, as in the sketch below.
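A minimal sketch of this, using a hypothetical page_views table: only the delimiters are declared, and the files stay in whatever plain-text form they were written in.

```sql
CREATE TABLE page_views (
  userid BIGINT,
  url    STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'   -- column separator in the raw files
  LINES TERMINATED BY '\n'    -- row separator
STORED AS TEXTFILE;
```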
Second, all Hive data is stored in HDFS, and Hive contains the following data models: Table, External Table, Partition, and Bucket.
1) Table: a table corresponds to a directory in HDFS.
2) Partition: a partition of a table is a subdirectory under the table's directory.
3) Bucket: if the table is partitioned, a bucket is a unit under a partition; if the table is not partitioned, a bucket is a unit directly under the table. Buckets generally take the form of files.
- A table in Hive is conceptually similar to a table in a database, and each table has a corresponding directory in Hive that stores its data. For example, a table pvs has the HDFS path /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All table data (excluding External Tables) is stored in this directory.
- Partitions correspond to dense indexes on partition columns in a database, but partitions in Hive are organized differently. In Hive, a partition of a table corresponds to a subdirectory under the table directory, and all of a partition's data is stored in the corresponding directory. For example, if the pvs table contains the two partition columns ds and ctry, then the HDFS subdirectory corresponding to ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory corresponding to ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA. Partitioning a table, and adding partitions, can be done through the Hive QL language. With partitioning, i.e. directory-based storage, Hive can evaluate queries with partition conditions more conveniently.
- Buckets compute a hash over a specified column and slice the data according to the hash value, for the sake of parallelism; each bucket corresponds to one file. For example, to spread the user column across 32 buckets, the hash of each user value is computed first; the HDFS directory for hash value 0 is /wh/pvs/ds=20090801/ctry=US/part-00000, and the HDFS directory for hash value 20 is /wh/pvs/ds=20090801/ctry=US/part-00020. Buckets are the final form in which Hive stores data. When creating a table, users can describe the buckets and the bucketed columns in detail.
- An External Table points to data that already exists in HDFS, and it can also have partitions. It is organized the same way as a Table in the metadata, while the actual data storage differs considerably:
- For a Table, creation and data loading are two processes (though both can be completed in the same statement). During data loading, the actual data is moved into the data warehouse directory, and subsequent data access happens directly in the data warehouse directory. When the table is deleted, the table's data and metadata are deleted together.
- An External Table involves only one process: loading the data and creating the table are completed at the same time (CREATE EXTERNAL TABLE ... LOCATION). The actual data is stored in the HDFS path specified after LOCATION and is not moved to the data warehouse directory. When an External Table is deleted, only the metadata is deleted; the data itself is left in place. A sketch follows below.
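The sketch below pulls these concepts together; the table name pvs_ext, its columns, and the paths are hypothetical, while the ds/ctry partitions and the 32 buckets echo the pvs example above.

```sql
-- External table over pre-existing HDFS data; dropping it removes
-- only the metadata, while the files under LOCATION stay in place.
CREATE EXTERNAL TABLE pvs_ext (
  userid BIGINT,
  url    STRING)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/pvs';

-- Register one partition that already sits in HDFS.
ALTER TABLE pvs_ext ADD PARTITION (ds = '20090801', ctry = 'US')
LOCATION '/data/pvs/ds=20090801/ctry=US';
```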
6. Differences Between Hive and Common Relational Databases
Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database. In fact, beyond a similar query language, Hive and databases have nothing in common structurally. This section explains the differences between Hive and databases from several aspects. Databases can be used for online applications, but Hive is designed for the data warehouse; being clear about this helps in understanding Hive's characteristics from an application perspective.
Similarities and differences with common relational databases:
| | Hive | RDBMS |
|---|---|---|
| Query language | HQL | SQL |
| Data storage | HDFS | Raw device or local FS |
| Index | No | Yes |
| Execution | MapReduce | Executor |
| Execution latency | High | Low |
| Data size | Large | Small |
1. Query language. Because SQL is widely used in data warehouses, the SQL-like query language HQL was designed specifically around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.
2. Data storage location. Hive is built on top of Hadoop, and all Hive data is stored in HDFS. A database, on the other hand, can store data on a block device or a local file system.
3. Data format. Hive defines no specific data format; the data format is specified by the user. A user-defined data format needs to specify three attributes: the column delimiter (typically a space, '\t', or '\x001'), the row delimiter ('\n'), and the method of reading file data (Hive ships with three default file formats: TextFile, SequenceFile, and RCFile). Because no conversion from the user's data format to a Hive-defined format is needed during loading, Hive makes no modifications to the data itself; it simply copies or moves the content into the appropriate HDFS directory. In a database, different products have different storage engines, each defining its own data format. All data is stored according to a certain organization, so the process of loading data into a database can be time-consuming.
4. Data updates. Because Hive is designed for data warehouse applications, whose content is read often and written rarely, rewriting or adding to existing data is not supported in Hive; all data is fixed when it is loaded. The data in a database is often modified, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it. A small sketch of Hive's bulk-loading style follows.
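As a sketch of that bulk-loading style (assuming the partitioned pvs table from earlier, a hypothetical staging path, and a pre-created summary table):

```sql
-- Bulk load a day's files into one partition; no row-level INSERT/UPDATE.
LOAD DATA INPATH '/staging/pvs/20090801'
INTO TABLE pvs PARTITION (ds = '20090801', ctry = 'US');

-- Rewrites happen at table/partition granularity via INSERT OVERWRITE.
INSERT OVERWRITE TABLE pvs_summary
SELECT ctry, COUNT(*) FROM pvs GROUP BY ctry;
```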
5. Index. As mentioned before, Hive does no processing of the data during loading, not even scanning it, and therefore builds no indexes on any keys in the data. When Hive needs to access specific values satisfying a condition, it must brute-force scan all the data, so access latency is high. Thanks to MapReduce, Hive can access the data in parallel, so even without indexes Hive still shows its advantage when accessing large amounts of data. A database usually indexes one or several columns, so it can serve access to small amounts of data under specific conditions with high efficiency and low latency. The high latency of data access makes Hive unsuitable for online data queries.
6. Execution. Most queries in Hive are executed through the MapReduce framework provided by Hadoop (queries like SELECT * FROM tbl do not require MapReduce). A database usually has its own execution engine.
7. Execution latency. As mentioned before, when querying data Hive has to scan the entire table because there is no index, so latency is high. Another factor behind Hive's high execution latency is the MapReduce framework: MapReduce itself has high latency, so executing Hive queries with MapReduce also incurs high latency. By contrast, database execution latency is low. Of course, this low latency is conditional on the data being small; when the data grows large enough to exceed a database's processing capacity, Hive's parallel computing clearly shows its advantage.
8. Scalability. Because Hive is built on top of Hadoop, Hive's scalability matches Hadoop's (the world's largest Hadoop cluster, at Yahoo!, was around 4,000 nodes in 2009). Due to the strict constraints of ACID semantics, however, a database is very limited in how far it can scale out; at present the most advanced parallel database, Oracle, has a theoretical scaling capacity of only about 100 machines.
9. Data size. Because Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large data sets; correspondingly, a database supports smaller data sizes.
7. Basic Concepts of Hive
Reference: [[https://cwiki.apache.org/confluence/display/Hive/Tutorial]]
1) Data Units
Depending on the granularity of the data, Hive data can be organized into:
1) Databases: namespaces that avoid naming conflicts between tables.
2) Tables: collections of homogeneous data that share the same schema.
3) Partitions: a table can have one or more partition keys, which determine how the data is stored.
4) Buckets (or Clusters): data in the same partition can be divided into buckets based on the hash value of a column. Partitions and buckets are not required, but they can greatly speed up data queries, as sketched below.
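For instance, with the partitioned pvs table from earlier (the url column is hypothetical), a query that filters on the partition key only needs to read that partition's directory; a minimal sketch:

```sql
-- Only /wh/pvs/ds=20090801/* is scanned; other partitions are pruned.
SELECT url, COUNT(*) AS views
FROM pvs
WHERE ds = '20090801'
GROUP BY url;
```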
2) Data Types
(1) Simple types:
TINYINT: 1-byte integer
SMALLINT: 2-byte integer
INT: 4-byte integer
BIGINT: 8-byte integer
BOOLEAN: true/false
FLOAT: single precision
DOUBLE: double precision
STRING: character string
(2) Complex types:
Structs: the fields inside a struct are accessed with the dot (.) operator. For example, if column c in a table is of type STRUCT{a INT; b INT}, field a is accessed through c.a.
Maps (key-value pairs): a value is accessed via ['element name']. For example, if map M contains a 'group' -> gid key-value pair, the gid value is obtained through M['group'].
Arrays: the elements of an array are all of the same type. For example, if array A holds the elements ['a', 'b', 'c'], then A[1] yields 'b'. A combined sketch of the three types follows.
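A minimal sketch combining the three complex types in one hypothetical table:

```sql
CREATE TABLE complex_demo (
  c STRUCT<a:INT, b:INT>,
  m MAP<STRING, INT>,
  a ARRAY<STRING>);

SELECT c.a,          -- struct field access via dot
       m['group'],   -- map lookup by key
       a[1]          -- array index (0-based), yields the second element
FROM complex_demo;
```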
3) Built-in Operators and Functions
These include relational operators (A = B, A != B, A < B, and so on), arithmetic operators (A + B, A * B, A & B, A | B, etc.), logical operators (A && B, A || B, etc.), operators on complex types (A[n], M[key], S.x), and various built-in functions such as round, floor, and substr.
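A tiny sketch of a few of these, over a hypothetical products table with name and price columns:

```sql
SELECT name,
       ROUND(price),          -- round to the nearest integer
       FLOOR(price),          -- round down
       SUBSTR(name, 1, 3)     -- first three characters
FROM products;
```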
4) Language Capabilities
The Hive query language provides basic SQL-like operations. These operations work on tables or partitions:
- Filter rows from a table using a WHERE clause.
- Select specified columns from a table using a SELECT clause.
- Do equi-joins between two tables.
- Evaluate aggregations on multiple GROUP BY columns over the data stored in a table.
- Store the results of a query into another table.
- Download the contents of a table to a local (e.g., NFS) directory.
- Store the results of a query in a Hadoop DFS directory.
- Manage tables and partitions (CREATE, DROP, and ALTER).
- Plug in custom scripts in the language of your choice for custom map/reduce jobs.
A sketch exercising several of these follows.
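A small sketch exercising several of the operations above (the users and us_visits tables are hypothetical; pvs is the partitioned table from the storage section):

```sql
-- Equi-join, filter, aggregate, and store the result in another table.
INSERT OVERWRITE TABLE us_visits
SELECT u.name, COUNT(*) AS cnt
FROM pvs p JOIN users u ON (p.userid = u.id)
WHERE p.ctry = 'US'
GROUP BY u.name;

-- Download the result to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/us_visits'
SELECT * FROM us_visits;
```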
8. Practical Application of Hive
1. Apache Weblog Data
The format of Apache weblogs is customizable, but most webmasters use the default.
For the default Apache weblog format, we can create a table with the following command.
More about RegexSerDe can be found in HIVE-662 and HIVE-1719.
```sql
CREATE TABLE apachelog (
  host     STRING,
  identity STRING,
  user     STRING,
  time     STRING,
  request  STRING,
  status   STRING,
  size     STRING,
  referer  STRING,
  agent    STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
```
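Once created, the table can be queried like any other Hive table, for example to break down hits by status code:

```sql
SELECT status, COUNT(*) AS hits
FROM apachelog
GROUP BY status;
```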