Hive user manual and map parameter adjustments

Tags: shuffle, apache, access log

A brief introduction to the Map/Reduce principle


The Hadoop Map/Reduce framework generates one map task for each InputSplit, and each InputSplit is produced by the job's InputFormat.

The framework then invokes map(WritableComparable, Writable, OutputCollector, Reporter) for each key-value pair in the task's InputSplit.
Output key-value pairs are collected by calling OutputCollector.collect(WritableComparable, Writable).
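As a concrete illustration of this interface (a minimal sketch, not code from the article), a mapper written against the old org.apache.hadoop.mapred API looks like the following; the word-count-style logic and class name are just placeholders.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: for each key-value pair of the InputSplit (offset, line),
// emit (word, 1) pairs through OutputCollector.collect().
public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, ONE);   // collect an output key-value pair
    }
  }
}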

The reduce side receives data from the different map tasks, and the data from each map is sorted. If the amount of data received by the reduce side is small enough, it is kept directly in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which gives the percentage of heap space used for this purpose). If the amount of data exceeds a certain proportion of that buffer (determined by mapred.job.shuffle.merge.percent), the data is merged and then spilled to disk.
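If you want to experiment with these two shuffle parameters, they can be set per job through the Configuration object; this is only a sketch, and the 0.70 and 0.66 values below are illustrative rather than recommendations from this article.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Fraction of the reduce task's heap used to buffer map outputs during the shuffle.
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
    // When the in-memory shuffle buffer is this full, merge its contents and spill to disk.
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
    return conf;
  }
}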

Hive external table: parsing Nginx logs with a regular expression

The format configuration of the Nginx log:

'$proxy_add_x_forwarded_for - $remote_user [$time_local] "$request" '
'$status $request_body "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" $request_time $upstream_response_time';
The generated log looks roughly like this:
218.202.xxx.xxx - - [2014-08-19 22:17:08.446671] "POST /xxx/xxx-web/xxx HTTP/1.1" 200 stepid=15&tid=u753hifvpe0daon%2f&output=json&language=zh_cn&session=114099628&dip=10920&diu=dbdbf926-3210-4d64-972a7&xxx=056a849c70ae57560440ebe&diu2=2dfdb167-1505-4372-aab5-99d28868dcb5&shell=e3209006950686f6e65352c3205004150504c450000000000000000000000000000&compress=false&channel=&sign=438bd4d701a960cd4b7c1de36aa8a877&wua=0&appkey=0&adcode=150700&t=0 HTTP/1.1" 200 302 "-" "Xxx-iphone" 31.0ms

The Hive CREATE TABLE statement:
CREATE EXTERNAL TABLE xxx_log (
  host STRING,
  log_date STRING,
  method STRING,
  uri STRING,
  version STRING,
  status STRING,
  flux STRING,
  referer STRING,
  user_agent STRING,
  response_time STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*)\\s+-\\s+-\\s+\\[([^\\]]*)\\]\\s+\"([^ ]*)\\s+(.*)\\s+([^ ]*)\"\\s+(-|[0-9]*)\\s+(-|[0-9]*)\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+(.*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s"
)
STORED AS TEXTFILE;
Then import the data into the table.
Method 1: Add a partition that points at an existing path, without moving the files:
ALTER TABLE xxx_log ADD PARTITION (year='2014', month='08', day='19') LOCATION '/user/xxx/xx/year=2014/month=08/day=19';
Or
Method 2: Load the specified files, moving them into the table's storage path:
LOAD DATA INPATH '/user/xxx/xx/2014/08/19' OVERWRITE INTO TABLE xxx_log PARTITION (year='2014', month='08', day='19');
Next, query the table to verify the data:
hive> SELECT * FROM xxx_log LIMIT 100;
You may run into a situation where the regular expression works fine in a regex tool (RegexBuddy is recommended), yet when you query the Hive table every field value is NULL, which indicates that something is wrong with the regular expression the table is actually using. In that case, run the following command in Hive:
hive> DESCRIBE EXTENDED tablename;
This shows the table details, including the actual input.regex value stored for the table, so you can check whether any escape characters in the regular expression were lost when the table was created.
Once the regular expression problem is fixed, DROP TABLE xxx_log, then re-create the table and reload the data. The Nginx log is then split into the Hive table's fields, and you can run all kinds of statistics on it.
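A quick way to debug this outside of Hive is to compile the exact input.regex value reported by DESCRIBE EXTENDED with java.util.regex and match it against a sample log line, since RegexSerDe ultimately relies on Java's regex engine. The pattern and log line below are simplified placeholders, not the real ones from this article.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
  public static void main(String[] args) {
    // Placeholder pattern and record; substitute the actual input.regex value and a real log line.
    String regex = "([^ ]*) - - \\[([^\\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" ([0-9-]*) (.*)";
    String line = "127.0.0.1 - - [2014-08-19 22:17:08] \"POST /xxx/xxx-web/xxx HTTP/1.1\" 200 rest";
    Matcher m = Pattern.compile(regex).matcher(line);
    if (m.matches()) {
      for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println("group " + i + ": " + m.group(i));   // one group per Hive column
      }
    } else {
      System.out.println("No match: every column would come back NULL in Hive.");
    }
  }
}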

Hive User Manual
Usage Examples
Creating tables
MovieLens User Ratings
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Apache Access Log Tables
add jar /build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
Control-separated tables
CREATE TABLE mylog (
  name STRING, language STRING, groups ARRAY<STRING>, entities MAP<INT, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE;
Loading tables
MovieLens User Ratings
Download and extract the data:
wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz
Load it in:
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
Running queries
MovieLens User Ratings
SELECT COUNT(1) FROM u_data;
Running Custom Map/reduce Jobs
MovieLens User Ratings
Create weekday_mapper.py:
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
Use the Mapper script:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
Note: due to a bug in the parser, you must run the INSERT OVERWRITE query on a single line.


Analysis and tuning of the number of Hadoop map tasks
Hadoop calculates the size of a split before calculating the number of maps a job needs. The split size is computed as:
goalSize = totalSize / mapred.map.tasks
minSize = max{mapred.min.split.size, minSplitSize}
splitSize = max(minSize, min(goalSize, dfs.block.size))
totalSize is the total input size of all maps for the job, i.e. the job's map input bytes. The default value of mapred.map.tasks is 2, and we can change it. Once goalSize has been calculated, the upper and lower limits come into play.

The lower bound is max{mapred.min.split.size, minSplitSize}. The default value of mapred.min.split.size is 1 byte, and minSplitSize varies with the file format.

The upper limit is dfs.block.size, and its default value is 64 megabytes.

For example, if the map input bytes is 100 MB and mapred.map.tasks keeps its default value of 2, then the split size is 50 MB; if we change mapred.map.tasks to 1, the split size becomes 64 MB.
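To make the arithmetic concrete, here is a small sketch (illustrative, not the actual Hadoop source) that applies the three formulas above to the 100 MB example; the variable names mirror the configuration properties.

public class SplitSizeDemo {
  // splitSize = max(minSize, min(goalSize, blockSize)), where goalSize = totalSize / mapred.map.tasks.
  static long splitSize(long totalSize, int mapredMapTasks, long minSplitSize, long blockSize) {
    long goalSize = totalSize / mapredMapTasks;
    long minSize = Math.max(1L /* mapred.min.split.size default */, minSplitSize);
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long blockSize = 64 * mb;                                        // dfs.block.size default
    System.out.println(splitSize(100 * mb, 2, 1, blockSize) / mb);   // prints 50: split size 50 MB
    System.out.println(splitSize(100 * mb, 1, 1, blockSize) / mb);   // prints 64: split size 64 MB
  }
}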

After the split size has been calculated, the number of maps is determined. The map count is computed file by file; for each file the following loop runs (see the sketch after this list):

1. While (remaining file size) / splitSize > 1.1, create a split of size splitSize and reduce the remaining file size by splitSize.

2. Once (remaining file size) / splitSize is no longer greater than 1.1, the remaining part becomes the last split.
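The sketch below (again illustrative, not the real FileInputFormat code) implements this per-file loop with the 1.1 slack factor and reproduces the split counts used in the examples that follow.

import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
  // Peel off splitSize-sized chunks while the remainder is more than 1.1 * splitSize,
  // then put whatever is left into one final split.
  static List<Long> splitsForFile(long fileSize, long splitSize) {
    List<Long> splits = new ArrayList<Long>();
    long remaining = fileSize;
    while ((double) remaining / splitSize > 1.1) {
      splits.add(splitSize);
      remaining -= splitSize;
    }
    if (remaining > 0) {
      splits.add(remaining);
    }
    return splits;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    System.out.println(splitsForFile(100 * mb, 64 * mb).size());   // 2 maps (64 MB + 36 MB)
    System.out.println(splitsForFile(65 * mb, 64 * mb).size());    // 1 map, since 65/64 < 1.1
  }
}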

A few examples:

1. The input is a single 100 MB file and splitSize = blockSize. The number of maps is 2; the first map processes a 64 MB split and the second a 36 MB split.

2. The input is a single 65 MB file and splitSize = blockSize. The number of maps is 1, processing a 65 MB split (because 65/64 < 1.1).

3. The input is a single 129 MB file and splitSize = blockSize. The number of maps is 2; the first map processes a 64 MB split and the second a 65 MB split.

4. The input is two files of 100 MB and 20 MB and splitSize = blockSize. The number of maps is 3; the first file is divided between two maps, one processing a 64 MB split and the other a 36 MB split, and the second file goes to a single map processing a 20 MB split.

5. The input is ten files of 10 MB each and splitSize = blockSize. The number of maps is 10, each processing a 10 MB split.

Let's look at two more special examples:

1. The input is two files of 40 MB and 20 MB, dfs.block.size = 64 MB, and mapred.map.tasks keeps its default value of 2. Then splitSize = 30 MB and the number of maps is actually 3: the first file is divided between two maps, one processing a 30 MB split and the other a 10 MB split; the second file goes to a single 20 MB map.

2. The input is two files of 40 MB and 20 MB, dfs.block.size = 64 MB, and mapred.map.tasks is manually set to 1.

Then splitSize = 60 MB and the number of maps is actually 2: the first file goes to a single map processing a 40 MB split; the second file goes to a single 20 MB map.

These two special examples show that the value of mapred.map.tasks has a direct effect on how the input is split, and that simply raising it does not make the job run more efficiently. They also show that Hadoop becomes less efficient when it has to process small files.

From the way split sizes and map counts are calculated, it follows that the split handled by one map is never larger than dfs.block.size * 1.1, which is 70.4 MB by default. There are two exceptions:

1. Hive's map-only job for merging small files; such a job has only one or a few maps.

2. The input files are compressed text. Because compressed text cannot be split, each such file can only be processed by a single map.

Consider compressing the output with a suitable codec (trading compression speed against compression ratio) to improve HDFS write performance.
Do not have each reduce write multiple output files, to avoid generating side files. Side files are generally used to record statistics; if there is not much such information, a counter can be used instead.
Choose an appropriate format for the output files. For downstream consumers, compressing large amounts of text data with codecs such as zlib/gzip/lzo often backfires, because zlib/gzip/lzo files cannot be split and must be processed as a whole; this leads to poor load balancing and recovery problems. As an improvement, use the SequenceFile or TFile formats, which are not only compressible but also splittable.
If each output file is large (several gigabytes), consider using a larger output block size (dfs.block.size).
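As one way to follow this advice in a MapReduce driver (a sketch assuming the new mapreduce API and that the gzip codec suits your data), block-compressed SequenceFile output can be configured like this:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputConfig {
  public static void configure(Job job) {
    // Write block-compressed SequenceFiles, which remain splittable, instead of plain gzip'd text.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
  }
}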
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

Map Number Adjustment

public static void setMinInputSplitSize(Job job, long size) {
  job.getConfiguration().setLong("mapred.min.split.size", size);
}
public static long getMinSplitSize(JobContext job) {
  return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}

public static void setMaxInputSplitSize(Job job, long size) {
  job.getConfiguration().setLong("mapred.max.split.size", size);
}
public static long getMaxSplitSize(JobContext context) {
  return context.getConfiguration().getLong("mapred.max.split.size", Long.MAX_VALUE);
}
As we can see, this is where Hadoop defines mapred.min.split.size and mapred.max.split.size, with default values of 1 and Long.MAX_VALUE respectively. We can therefore control the InputSplit size simply by assigning new values to these two parameters in our program.
If we want to set the split size to 10 MB, we can add the following code to the driver section of the MapReduce program:

TextInputFormat.setMinInputSplitSize(job, 1024L);              // set the minimum split size
TextInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 10L);  // set the maximum split size

Making a UDF permanently available

Hive selects the appropriate evaluate() method of a UDF according to the argument types.
A UDF is the simple case: one row in, one row out, computed row by row.
A UDAF (aggregation function) is more involved; it has merge() and related methods, and its work is split between the map tasks and the reduce phase.
A UDTF generates multiple rows or columns from a single row; it has initialize(), process(), and close() methods and can be combined with SerDes and regular expressions when building tables. (A minimal UDF sketch follows below.)
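For reference, a minimal UDF of the simple row-by-row kind described above might look like the following sketch; the class name and behavior are hypothetical, and only the classic org.apache.hadoop.hive.ql.exec.UDF base class with overloaded evaluate() methods is assumed.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: Hive picks one of the evaluate() overloads
// depending on whether the function is called with a string or an int.
public class ToUpperOrDouble extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text(input.toString().toUpperCase());   // one row in, one row out
  }

  public IntWritable evaluate(IntWritable input) {
    if (input == null) return null;
    return new IntWritable(input.get() * 2);
  }
}

After packaging a class like this into a jar, it is registered with the same add jar / create temporary function statements shown in the .hiverc example below.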

To achieve the same effect see http://www.uroot.com/archives/1059


At present, making a UDF permanently active requires modifying the Hive source code, but there is a simple workaround.
Create a .hiverc file in the bin directory of HIVE_HOME and put the UDF registration statements in it; the UDF can then be used just like a built-in Hive function.
This works because, when you run ./hive, both $HIVE_HOME/bin/.hiverc and $HOME/.hiverc are loaded as initialization files.
Include the following in the .hiverc file:
add jar /run/jar/avg_test.jar;
create temporary function avg_test as 'hive.udaf.Avg';
