Hive Usage Experience

Source: Internet
Author: User
Tags: python, script

I have recently spent about a month using Hive (and have just finished the first phase of a data-analysis migration project), on version 0.8 (the latest release at the moment is 0.8.1), and that month has taught me a lot. From setting up the environment and debugging, through development, understanding the business, technical research, and implementing the business logic, I have been through each step in turn.

Overall, beyond the regular Hive usage and optimization advice already available online, here are my own observations.

Because Hive currently only supports the 0.20 line of Hadoop, our environment is still built on Hadoop 0.20.
Using Hive and Hadoop well is a test of all-round ability; in my experience it touched a great many system-level issues. To bring Hive and Hadoop closer to their best, my personal view is to look at it from the following angles:
First: tuning of Hive's SQL-like statements themselves.
Second: tuning of Hive's parameters.
Third: tuning of the HDFS parameters in Hadoop (storage format, compression format, RPC calls, controlling the number of connections).
Fourth: tuning of map/reduce in Hadoop (data transfer between DataNodes, processing sizes, the JVM settings for each child task, and so on).
Fifth: tuning of network transmission in the Hadoop environment (the hardware environment).
Sixth: tuning of the HDFS storage format (text format, sequence file format, and so on).
Seventh: disk I/O tuning at the operating-system level (multiplexing, and so on).
Eighth: network tuning at the operating-system level (enlarging buffer sizes, raising the number of connections, and so on).
Ninth: memory tuning at the operating-system level (virtual memory settings, memory control, and so on).
Tenth: mastery of Hadoop's fault-tolerance mechanisms; normal operation is nothing to worry about, it is the abnormal cases you need to fear, so you should have corresponding solutions ready (schedulers, queues, and so on).
Eleventh: Hadoop administration (including DataNode failures, NameNode failures, adding or removing DataNodes, load balancing, clustering, and so on).

In addition, most material on the Internet builds the Hadoop environment on Sun's JVM, but in this project I built the Hadoop environment on the JRockit JVM, mainly because the current JRockit 6.0 release performs comparatively well (in network transmission, threading, GC, and so on).

On the other hand, the work involves a variety of programming languages, such as SQL (which still differs from the actual SQL syntax of a database), Java, Python (fortunately I had done Python development before), Shell, and so on.
Now let me talk about Hive itself; after all, it took up the largest share of this development work. As for how some of Hive's parameters are used and combined, I am still testing a few key ones and will explain them in detail later.
Personally, I feel Hive is better suited to aggregation-style processing; its user-defined aggregate functions (UDAFs) embody this characteristic well.
UDFs, the plain user-defined functions, are better suited to converting individual fields, for example turning a text field into an array, or implementing richer decoding, such as URL parameters encoded in UTF-8 or GBK.
Hive also supports custom map/reduce steps (via the related TRANSFORM/MAP/REDUCE keywords).

The /*+ MAPJOIN(...) */ hint (I had described this incorrectly before; mapjoin is a way to speed up a join).
Hive supports a variety of storage options, such as row format and file format, and HDFS itself also supports multiple storage formats.
Hive supports importing and exporting data. A reminder here: when exporting with the Hive export command, the rows of data come out merged together and the fields cannot be distinguished, so the usual approach is a command such as hive -e 'select * from T' >> 123.txt so that the exported fields come out separated. (I later verified that this was one of my own mistakes: the fields are in fact still separated by a \001 character, it is simply not visible when displayed, so both approaches are fine.)
The difference between NULL and an empty value also needs attention in Hive processing, especially when the rows are handled by a script such as Python and the fields are split apart there.
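A minimal sketch of what I mean, assuming Hive's default streaming behaviour (one row per line on stdin, fields separated by tabs, NULL columns serialized as the literal string \N); the column handling here is only illustrative:

    import sys

    # Read rows streamed from Hive one per line; fields are tab-separated
    # and NULL columns arrive as the literal two-character string "\N".
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Distinguish a real NULL from an empty string before using a field.
        cleaned = [None if f == r"\N" else f for f in fields]
        # ... process `cleaned`, then write the output row back to Hive ...
        print("\t".join("" if f is None else f for f in cleaned))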

Now let me summarize what Hive is not good at.
My development task was to migrate some SAS computation jobs onto Hadoop for data analysis. SAS has many powerful features that help with data analysis, but Hive has no equivalent for them. Examples:
SAS has something similar to the concept of a cursor: it can operate on a row of records and reach across to a field in another record (for example a field in the previous record).
SAS has a deduplication feature (that is, removing duplicate data based on several fields).
SAS can easily group data and then take the first 10 records of each group (note: I later learned that neither the SQL statement itself nor UDAF processing supports this).
What SAS is poor at is decoding conversions (for example strings encoded in UTF-8 or GBK; URLs are the main pain point here and are rather troublesome).
These are the common scenarios; there are several other situations that are more complicated to describe.

For the first three cases described above, it is hard to achieve them with a simple SQL statement in Hive alone; what the three have in common is the need to operate on records one row at a time.
Hadoop offers streaming, a way of reading one row at a time and processing it in a script, and Hive happens to support scripts as well, so row-by-row operations can be implemented with a script (I used Python scripts here).
As for deduplication, the way SAS implements it is to sort by a few key fields and then remove the duplicates (after sorting, identical rows are guaranteed to sit next to each other, which makes them easy to handle).
So on the Hive side, the first thing to do is likewise to sort by those key fields and then use a script to drop the duplicate rows, as sketched below.
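A minimal sketch of such a dedup script, assuming the rows arrive on stdin already sorted by the key fields (for example via SORT BY before the TRANSFORM call), tab-separated, with the first two columns acting as the deduplication key (the key width is just an example):

    import sys

    KEY_COLS = 2  # assumption: the first two columns form the dedup key

    prev_key = None
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[:KEY_COLS])
        # The input is pre-sorted by the key, so duplicates arrive back to
        # back; only the first row of each run of equal keys is emitted.
        if key != prev_key:
            print("\t".join(fields))
            prev_key = key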
Although a script runs more slowly than pure Java, it meets the business requirement, so the slower execution can be tolerated; after all, Hive can still be tuned later, and it also has keyword support for map and reduce steps.
As for how to use a script to take the first 10 records of each group (a simple business scenario: take the top 10 salespeople by sales in each region, say Shanghai, Beijing, Shenzhen, Guangzhou, and so on), I leave the details to the reader; a rough sketch follows.
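Purely as an illustration of the shape such a script takes (again assuming Hive has already sorted the rows, here by region and then by sales in descending order, and that the region is the first column; the column positions are assumptions):

    import sys

    TOP_N = 10  # keep the first 10 records of each group

    current_region = None
    emitted = 0
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        region = fields[0]  # assumption: the grouping column comes first
        if region != current_region:
            current_region = region
            emitted = 0
        # Within a region the rows arrive already ordered by sales
        # (descending), so the first TOP_N rows are the top salespeople.
        if emitted < TOP_N:
            print("\t".join(fields))
            emitted += 1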
The fourth scenario, decoding URL parameters, is very simple: write a custom UDF and solve it in Java. The more troublesome part is telling whether a given URL parameter is encoded in UTF-8 or GBK.
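I handled this with a Java UDF, but the detection logic itself is easy to sketch; here is the same idea in Python, as a heuristic only: percent-decode the parameter to raw bytes, try UTF-8 first, and fall back to GBK if that fails.

    from urllib.parse import unquote_to_bytes

    def decode_url_param(value):
        """Percent-decode a URL parameter and guess its character set.

        Heuristic: try UTF-8 first and fall back to GBK, since a byte
        sequence that happens to be valid UTF-8 is rarely meant as GBK.
        """
        raw = unquote_to_bytes(value)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("gbk", errors="replace")

    # "%E4%B8%8A%E6%B5%B7" is UTF-8 for the same text that
    # "%C9%CF%BA%A3" encodes in GBK; both decode to the same city name.
    print(decode_url_param("%E4%B8%8A%E6%B5%B7"))
    print(decode_url_param("%C9%CF%BA%A3"))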

Two more points to add:

First, a Hive SQL script should preferably not contain too many SQL statements, otherwise some unusual situations can appear. It is best to split it into pieces, along the lines of:

1.sql

1_1.sql

1_2.sql

Named this way, the pieces are easy to control from the shell.
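The original approach drives these files from a shell script; purely as an illustration of the same idea in Python (the file names are just the examples above), a small wrapper could run them in order and stop at the first failure:

    import subprocess
    import sys

    # Run the split Hive scripts in order; stop at the first failure so a
    # broken step does not silently feed bad data into the later ones.
    SQL_FILES = ["1.sql", "1_1.sql", "1_2.sql"]

    for sql_file in SQL_FILES:
        result = subprocess.run(["hive", "-f", sql_file])
        if result.returncode != 0:
            sys.exit("%s failed with exit code %d" % (sql_file, result.returncode))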
Second, reading the source code (both Hadoop's and Hive's), I found that some parameter settings exist in the program but are not reflected in the documentation.

Those are my general impressions from using Hive; as I gain new experience I will add to them.
If anything is unclear, feel free to ask.

Here is a second follow-up on the use of Hive in this project:

The first issue is again deduplication. After repeated tests I found that the number of map tasks cannot be more than one; otherwise the last record of one split and the first record of the next split never meet, so they cannot be matched for deduplication.

So when faced with this kind of problem, the advantage of MapReduce cannot show itself; the simplest approach is to set both the number of maps and the number of reduces to 1.

But I do not know whether chaining MapReduce jobs would make this feasible.

To sum up, when there are dependencies between adjacent records, how best to apply the MapReduce model still needs thought and constant testing.
