Source code analysis of Hive RegexSerDe


I recently used RegexSerDe to create a business table for parsing nginx logs, but without really understanding how it works. This time I dug into its implementation.

Table creation statement:

CREATE EXTERNAL TABLE ods_cart_log (
  time_local   STRING,
  request_json STRING,
  trace_id_num STRING
)
PARTITIONED BY (dt STRING, hour STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\\\[(.*?)\\\] .*\\\|(.*?) (.*?) \\\[(.*?)\\\]",
  "output.format.string" = "%1$s %2$s  %4$s"
)
STORED AS TEXTFILE;

Test data:

[2014-07-24 15:54:54] [6] OperationData.php: :89|{"action":"add","redis_key_hash":9,"time":"1406188494.73745500","source":"web","mars_cid":"","session_id":"","info":{"cart_id":26885,"user_id":4,"size_id":"2784145","num":"1","warehouse":"VIP_NH","brand_id":"7379","cart_record_id":26885,"channel":"te"}} trace_id [40618849399972881308]

From the output.format.string ("%1$s %2$s  %4$s") one might guess that trace_id_num would hold the 4th regex group (i.e. 40618849399972881308), but what is actually output is the 3rd group (the literal string trace_id).
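We can replay the match outside Hive to see why. This is a standalone sketch: the pattern is the table's input.regex written as a plain Java regex literal, and the sample row has its JSON payload trimmed for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupDemo {
    public static void main(String[] args) {
        // The table's input.regex, written as a plain Java regex literal.
        Pattern p = Pattern.compile("\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]");
        // The sample log line, with the JSON payload trimmed for brevity.
        String row = "[2014-07-24 15:54:54] [6] OperationData.php: :89|"
                + "{\"action\":\"add\",\"cart_id\":26885}"
                + " trace_id [40618849399972881308]";
        Matcher m = p.matcher(row);
        if (m.matches()) {
            // RegexSerDe fills column c from group(c + 1), strictly in order:
            for (int c = 0; c < 3; c++) {
                System.out.println("column " + c + " = " + m.group(c + 1));
            }
            // The table has only 3 columns, so group 4 is never consumed:
            // column 2 (trace_id_num) receives "trace_id", not the number.
        }
    }
}
```

Running this prints "trace_id" for column 2, confirming that group 4 is simply never reached when the table has three columns.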

Let's look at the implementation.

RegexSerDe takes three parameters:

1) input.regex — the regular expression

2) output.format.string — the output format

3) input.regex.case.insensitive — whether matching is case-insensitive

input.regex is used in the deserialize method, i.e. when reading data (Hive reading HDFS files), while output.format.string is used in the serialize method, i.e. when writing data (Hive writing to HDFS files).

The deserialize method contains the following loop, which fills the row from the matched groups:

    for (int c = 0; c < numColumns; c++) {
      // numColumns is the number of columns in the table; e.g. for
      // columnNames = [time_local, request_json, trace_id_num],
      // numColumns = columnNames.size() = 3.
      try {
        // Column c is filled from regex group c + 1: groups are consumed in
        // order with no skipping, so the trace_id_num column gets the 3rd
        // group of the regex; output.format.string plays no part here.
        row.set(c, m.group(c + 1));
      } catch (RuntimeException e) {
        partialMatchedRows++;
        if (partialMatchedRows >= nextPartialMatchedRows) {
          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);
          // Report the row
          LOG.warn("" + partialMatchedRows
              + " partially unmatched rows are found, cannot find group "
              + c + ": " + rowText);
        }
        row.set(c, null);
      }
    }

The output.format.string setting appears to be useless in practice. RegexSerDe only takes effect with TEXTFILE storage, where data is normally brought in with LOAD; but LOAD is an HDFS-level file operation that involves no serialization. To trigger serialization you would have to write with INSERT ... SELECT, yet the data written that way is determined by the SELECT, and output.format.string has nothing to do with it.
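For reference, output.format.string uses java.util.Formatter positional syntax, which is what the contrib serialize() would feed the row's fields into. A minimal sketch, with made-up field values standing in for a real row:

```java
public class OutputFormatDemo {
    public static void main(String[] args) {
        // %1$s, %2$s, %4$s are positional references into the argument list;
        // %3$s is simply skipped. Field values are made up for illustration.
        String line = String.format("%1$s %2$s  %4$s",
                "2014-07-24 15:54:54",
                "{\"action\":\"add\"}",
                "trace_id",
                "40618849399972881308");
        System.out.println(line);
    }
}
```

Note that a positional index may be skipped entirely (%3$s here), which is why the format string in the DDL can reference group-like positions out of order.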

In fact there are two RegexSerDe classes, located at:

./serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java and
./contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

Both extend the abstract class AbstractSerDe. The code shows that the contrib class implements both serialize and deserialize, while the serde2 one implements only deserialize. It seems the serialize method of RegexSerDe may simply go unused.

Note the following points:

1. If a row does not match the regex, every field of that row is output as null.

    if (!m.matches()) {
      unmatchedRows++;
      if (unmatchedRows >= nextUnmatchedRows) {
        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);
        // Report the row
        LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);
      }
      return null;
    }

2. All table columns must be of type STRING; otherwise an error is thrown. If other types are needed, use CAST in the SELECT to convert.

    for (int c = 0; c < numColumns; c++) {
      if (!columnTypes.get(c).equals(TypeInfoFactory.stringTypeInfo)) {
        throw new SerDeException(getClass().getName()
            + " only accepts string columns, but column[" + c + "] named "
            + columnNames.get(c) + " has type " + columnTypes.get(c));
      }
    }
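So the workaround is to declare every column as STRING in the table and convert at query time. A sketch using the example table above (the BIGINT target type is just an illustrative choice):

```sql
-- trace_id_num is declared STRING in the table; convert it in the query.
SELECT time_local,
       CAST(trace_id_num AS BIGINT) AS trace_id_num
FROM ods_cart_log
WHERE dt = '20140724';
```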

This article is from the "Food light blog"; please keep the source: http://caiguangguang.blog.51cto.com/1652935/1532987
