Big Data-Hive

Source: Internet
Author: User

Hive is a data warehouse built on top of Hadoop. It computes with MapReduce (MR) and stores data in HDFS. Because the computation runs as MapReduce jobs, Hive is typically used for offline data processing.

Hive defines a SQL-like query language, HQL. It resembles SQL but is not exactly the same. Hive can be thought of as an HQL-to-MR language translator. It is simple and easy to get started with.
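As a rough illustration of the HQL-to-MR translation described above, the sketch below runs a simple aggregation and then asks Hive for its plan with EXPLAIN. The `words` table and its `word` column are hypothetical, assumed only for this example.

```sql
-- Hypothetical table: words(word STRING), one row per word occurrence.
-- A word count that would otherwise need a hand-written MapReduce job.
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word;

-- EXPLAIN prints the stages Hive plans for the query instead of running it.
EXPLAIN
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word;
```

On the MapReduce engine, the plan for a query like this typically shows a map side that emits the grouping key and a reduce side that aggregates it.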
With Hive, do you still need to write your own MR programs? In some cases, yes, for two reasons.

HQL has limited expressive power: iterative algorithms cannot be expressed, and some complex operations are not easy to express in HQL.

Hive is less efficient: the MapReduce jobs that Hive generates automatically are usually not smart enough, HQL is hard to tune, and the granularity of control is coarse.
Hive consists of the following modules:

User interface: CLI, JDBC/ODBC, and WebUI.
Metadata storage (Metastore): by default, metadata is stored in the built-in Derby database; online (production) deployments typically use MySQL.
Driver: interpreter, compiler, optimizer, and executor.
Hadoop: computes with MapReduce and stores with HDFS.
Hive Deployment Architecture-Lab environment

Hive Deployment Architecture-production environment

Data Model
Partition and Buckets
Partition: to reduce unnecessary brute-force data scans, tables can be partitioned. To avoid generating too many small files, it is recommended that you partition only on discrete fields.
Buckets: for fields with a large number of distinct values, divide the data into buckets instead.
Partitions and buckets can be combined.
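A minimal sketch of combining partitions and buckets is shown below. The table name, column names, and the bucket count of 32 are assumptions chosen for illustration, not values from the original article.

```sql
-- Hypothetical table, assumed for illustration only.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)               -- discrete field: one directory per day
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- many distinct values: hashed into 32 buckets

-- Filtering on the partition column lets Hive read only the matching partition
-- instead of scanning the whole table.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2014-01-01'
GROUP BY url;
```

Partitioning on `user_id` instead would create one directory per user, which is the small-files problem the text warns about; bucketing keeps the number of files fixed.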
SELECT statement
In the Hive versions described here, HAVING and IN/EXISTS operations are not supported; IN/EXISTS subqueries can be rewritten as LEFT SEMI JOIN operations.
JOIN supports only equi-joins; non-equi joins are not supported.
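The IN/EXISTS rewrite mentioned above can be sketched as follows. Tables `a` and `b` and their columns are hypothetical.

```sql
-- Intended query, written with an IN subquery:
--   SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);

-- Equivalent LEFT SEMI JOIN form. The join condition is an equality,
-- and only columns from the left table (a) may appear in the SELECT list.
SELECT a.key, a.value
FROM a
LEFT SEMI JOIN b ON (a.key = b.key);
```

A LEFT SEMI JOIN returns each row of `a` at most once, even when several rows in `b` match, which is what makes it a faithful replacement for IN/EXISTS rather than an ordinary inner join.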
Order by and sort by
Order by: starts a single reduce task, so the data is globally ordered, but it may be very slow. In strict mode it must be used together with LIMIT.
Sort by: can run multiple reduce tasks. The data inside each reduce task is ordered, but the overall output is not globally ordered. Usually used together with DISTRIBUTE BY.
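The difference between the two can be sketched as below. The `scores` table is hypothetical, and the two SET properties use the older MapReduce-era names, which newer Hive releases may have renamed or deprecated.

```sql
-- Hypothetical table: scores(user_id BIGINT, score INT).

-- ORDER BY: a total order produced by a single reducer.
-- With strict mode enabled, Hive rejects ORDER BY without LIMIT.
SET hive.mapred.mode=strict;
SELECT user_id, score
FROM scores
ORDER BY score DESC
LIMIT 10;

-- SORT BY: each of the 4 reducers sorts only the rows it receives,
-- so the combined output is not globally ordered.
SET mapred.reduce.tasks=4;
SELECT user_id, score
FROM scores
SORT BY score DESC;
```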
Distribute by and cluster by
Distribute by: equivalent to the Partitioner in MapReduce; the default implementation is hash-based. It works very well together with SORT BY.
Cluster by: when DISTRIBUTE BY and SORT BY (ascending) are applied to the same fields, CLUSTER BY can be used as a shorthand.
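A short sketch of these clauses, assuming a hypothetical table `page_views(user_id, ts, url)`:

```sql
-- All rows for the same user_id go to the same reducer (DISTRIBUTE BY),
-- and each reducer orders its rows by user and then by time (SORT BY).
SELECT user_id, ts, url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, ts;

-- When both clauses name the same column, CLUSTER BY is the shorthand.
-- This is equivalent to: DISTRIBUTE BY user_id SORT BY user_id
SELECT user_id, url
FROM page_views
CLUSTER BY user_id;
```

CLUSTER BY does not accept a sort direction, so a descending sort still needs the explicit DISTRIBUTE BY plus SORT BY form.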
User-defined functions (UDF): one way to extend HQL's capabilities.
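From the HQL side, using a UDF might look like the sketch below. The jar path, the class name `com.example.hive.udf.ToLower`, the function name, and the `users` table are all hypothetical; the class itself would be written in Java against Hive's UDF API and packaged separately.

```sql
-- Make the jar containing the (hypothetical) UDF class available to the session.
ADD JAR /path/to/my_udfs.jar;

-- Bind a function name to the class for the current session only.
CREATE TEMPORARY FUNCTION to_lower AS 'com.example.hive.udf.ToLower';

-- Call it like any built-in function.
SELECT to_lower(name)
FROM users;
```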
Does HQL support indexes? HQL execution is primarily a parallel brute-force scan. Currently, Hive supports only single-table indexes, but it provides an index creation interface and a calling method, which users can implement as needed.

Does HQL support update operations? No. Hive sits on top of HDFS, and HDFS supports only append operations, not random writes.

How is skewed data handled? Specify the skewed column: CREATE TABLE list_bucket_single (key STRING, value STRING) SKEWED BY (key) ON (1,5,6); Assign more resources to the skewed task (TODO). Break the skewed task into multiple tasks and merge the results (TODO).
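For the index question above, a minimal sketch of the index creation interface follows. It reuses the `list_bucket_single` table from the skew example; the index name `idx_key` is an assumption. Index support was removed in Hive 3.0, so this applies only to older releases.

```sql
-- Create a compact index on a single table; DEFERRED REBUILD means the
-- index starts empty and is not populated until it is explicitly rebuilt.
CREATE INDEX idx_key
ON TABLE list_bucket_single (key)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate (or refresh) the index data.
ALTER INDEX idx_key ON list_bucket_single REBUILD;
```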
Hive on HBase
Processing HBase data with HQL is more convenient than accessing the data directly through the HBase API, but performance is lower: it is equivalent to converting online processing into batch processing.
There are still problems. It is not mature enough, and data cannot be fetched by timestamp; the latest data is always fetched by default.
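A sketch of exposing an HBase table to HQL through Hive's HBase storage handler is shown below. The HBase table `users`, its column family `info`, and the Hive table name are hypothetical.

```sql
-- Map an existing (hypothetical) HBase table "users" onto a Hive table.
-- ":key" binds the HBase row key; "info:name" and "info:age" are
-- column-family:qualifier pairs, listed in the same order as the Hive columns.
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING,
  age    INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "users");

-- The query runs as a batch job over HBase, not as a low-latency lookup.
SELECT name, age
FROM hbase_users
WHERE age > 30;
```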
Systems similar to Hive

Stinger: the next generation of Hive is called "Stinger", and its underlying computing engine replaces MapReduce with Tez. Tez has a number of advantages over MapReduce: it provides a variety of operators (such as map and shuffle) for users; it combines multiple jobs into one job to reduce disk read/write I/O; and it makes full use of memory resources.
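In Hive releases that ship with Tez support, switching the engine is a session-level setting, roughly as sketched here. This assumes Tez is installed and configured on the cluster; the `words` table is hypothetical.

```sql
-- Run subsequent queries on Tez instead of MapReduce.
SET hive.execution.engine=tez;

SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word;

-- Switch back to MapReduce for comparison.
SET hive.execution.engine=mr;
```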
Shark: Hive on Spark (http://spark.incubator.apache.org/). Spark is an in-memory computing framework that is more efficient than MapReduce (some tests show a 100x speedup). Shark is fully compatible with Hive, and its underlying computing engine is Spark.

Impala: the underlying computing engine no longer uses MR; instead, it uses a distributed query engine similar to that of a commercial parallel relational database.
Performance Comparison



