Hive has brought a real-time query mechanism to Hadoop

Last Update:2014-12-25 Source: Internet

Author: User

Apache Hive is a Hadoop-based tool for analyzing large, unstructured data sets using a SQL-like syntax, which makes Hadoop content accessible to existing business intelligence and business analytics researchers. Developed by Facebook engineers and contributed to the Apache Foundation as an open source project, Hive has gained a leading position in big data analysis in commercial environments.

Like other components of the Hadoop ecosystem, Hive is evolving quickly. This review covers version 0.13, which addresses several shortcomings of earlier versions. The 0.13 release also significantly speeds up SQL-like queries across large Hadoop clusters and builds on the interactive query features introduced in previous versions.

In essence, Hive is a data store that is well suited to analyzing large, relatively static data sets where query speed is not critical. Hive complements an existing data warehouse well, but it is not a complete replacement. Instead, using Hive to augment a data warehouse makes the most of existing investments while still providing the capacity to accommodate large volumes of data.

A typical data warehouse includes many expensive hardware and software components, such as RAID or SAN storage, optimized ETL (extract, transform, load) procedures for cleaning and inserting data, specialized connectors to ERP or other back-end systems, and schemas designed around common enterprise questions such as sales by location, product, or channel. Such a warehouse is optimized to bring data to the CPU in order to answer the questions its schema was designed for.

By contrast, Hive data stores consolidate large amounts of unstructured data (log files, customer tweets, e-mail messages, geo data, CRM interactions) and keep it in an unstructured format on low-cost commodity hardware. Hive lets analysts project a database-like structure onto this data, with constructs that resemble traditional tables, columns, and rows, and write SQL-like queries against it. This means that different schemas can be projected onto the same data set depending on the nature of the query, so users can find answers to key operational questions in the data already collected.
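The idea of projecting different schemas onto the same stored data can be illustrated with a small, self-contained Python sketch. This is not Hive code and does not use any Hive API. The sample log lines, the `project` helper, and both schemas are hypothetical, chosen only to show that the structure is applied when the data is read, not when it is stored.

```python
# Raw, tab-separated log lines as they might sit in a file (made-up sample data).
RAW_LINES = [
    "2014-12-01\t10.0.0.1\t/index.html\t200",
    "2014-12-01\t10.0.0.2\t/cart\t500",
    "2014-12-02\t10.0.0.1\t/cart\t200",
]


def project(lines, schema):
    """Apply a schema, given as (column name, field position, type) tuples, at read time."""
    rows = []
    for line in lines:
        fields = line.split("\t")
        rows.append({name: cast(fields[pos]) for name, pos, cast in schema})
    return rows


# Two hypothetical schemas over the same bytes; neither changes the stored data.
TRAFFIC_SCHEMA = [("day", 0, str), ("url", 2, str)]
ERROR_SCHEMA = [("client_ip", 1, str), ("status", 3, int)]

traffic = project(RAW_LINES, TRAFFIC_SCHEMA)
errors = [row for row in project(RAW_LINES, ERROR_SCHEMA) if row["status"] >= 500]

print(traffic[0])
print(errors)
```

The same three lines answer a traffic question under one schema and an error question under the other, which is the property the paragraph above describes.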

In the past, Hive queries suffered from high latency. Even small queries over little data took a considerable amount of time, because each query first had to be converted into a MapReduce job. This delay was generally not a problem, because for the very large data sets Hive was designed for, the time spent on query planning and on starting the MapReduce job was small compared with the overall processing time.
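To make the conversion into a MapReduce job more concrete, here is a conceptual Python sketch of how an aggregate query such as SELECT status, COUNT(*) ... GROUP BY status breaks down into map, shuffle, and reduce steps. It is an illustration under simplifying assumptions, not Hive's actual query planner. The sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Made-up sample rows of (url, status).
ROWS = [("/index.html", 200), ("/cart", 500), ("/cart", 200), ("/index.html", 200)]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for _url, status in rows:
        yield status, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; summing the ones implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(ROWS)))
print(counts)
```

On a real cluster each phase is scheduled and distributed across machines, and that fixed setup cost is what makes even a tiny query slow.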

However, users quickly found that such long-running queries were a serious inconvenience in a multiuser environment, where a single job could dominate the entire cluster. In response, the Hive community launched a new effort (known as the Stinger project) to speed up query processing, with the goal of making Hive suitable for real-time, interactive queries and exploratory work. These improvements were delivered across Hive versions 0.11, 0.12, and 0.13.

Finally, although the HiveQL query language is based on SQL-92, it differs from SQL in some significant ways, for the simple reason that it runs on top of Hadoop. For example, DDL (Data Definition Language) commands must account for the fact that tables live in a multiuser file system that supports multiple storage formats. Overall, however, SQL users will find HiveQL familiar and should have no trouble adapting to it.

Hive platform architecture

From the top down, the Hive platform architecture looks much like any other relational database. Users write SQL queries and submit them for processing, using either command-line tools that interact directly with the database engine or third-party tools that communicate with the database through JDBC or ODBC. The architecture of Hive is shown in the following illustration:

Hive platform architecture diagram.

With JDBC and ODBC drivers available for Mac and Windows, data workers can connect their favorite SQL clients to Hive to browse, query, and create tables. For power users, Hive also provides a native fat-client CLI that interacts directly with the Hive driver. This client is the most powerful option but requires direct access to Hadoop, so it is best suited to running on the local network, where firewalls, DNS, and network topology are not an issue.
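A SQL client that connects over JDBC needs a connection URL. As a minimal sketch, assuming the cluster exposes HiveServer2 (whose JDBC URLs take the form jdbc:hive2://host:port/database, with 10000 as the default port), the URL can be assembled as shown here. The `hive_jdbc_url` helper and the host name are hypothetical, and the article itself does not say which server or port a given deployment uses.

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2-style JDBC URL: jdbc:hive2://host:port/database."""
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:hive2://{host}:{port}/{database}"


# hive-gateway.example.com is a hypothetical host name.
url = hive_jdbc_url("hive-gateway.example.com")
print(url)
```

The resulting string is what would be pasted into the connection dialog of a JDBC-capable SQL client, together with the Hive JDBC driver.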

HCatalog, the Hive metastore, was once a stand-alone Hadoop project and is now part of the Hive release. Backed by its own relational database, HCatalog removes the need to define schemas in Hive, simplifies the creation of new queries, and makes those schemas available to other tools in the Hadoop tool chain, such as Pig.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
