Recently I used RegexSerDe to create a business table. I had used it before to parse nginx logs, but without really understanding it, so this time I looked at how it is implemented.
Table creation statement:
CREATE EXTERNAL TABLE ods_cart_log (
  time_local STRING,
  request_json STRING,
  trace_id_num STRING
)
PARTITIONED BY (dt STRING, hour STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]",
  "output.format.string" = "%1$s %2$s %4$s"
)
STORED AS TEXTFILE;
Test data:
[2014-07-24 15:54:54] [6] OperationData.php: :89|{"action":"add","redis_key_hash":9,"time":"1406188494.73745500","source":"web","mars_cid":"","session_id":"","info":{"cart_id":26885,"user_id":4,"size_id":"2784145","num":"1","warehouse":"VIP_NH","brand_id":"7379","cart_record_id":26885,"channel":"te"}} trace_id [40618849399972881308]
Here I assumed that trace_id_num would be the 4th group (that is, 40618849399972881308), but the 3rd group (trace_id) is what is actually output.
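The group numbering can be checked outside Hive with plain java.util.regex. The following is a minimal sketch, not Hive code: the class name RegexGroupDemo is hypothetical, the pattern is the table's input.regex written with Java string escaping, and the sample line is the test data with its JSON payload shortened for readability.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupDemo {
    // Same pattern as input.regex in the table definition (Java string escaping).
    static final Pattern INPUT_REGEX =
        Pattern.compile("\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]");

    // Abbreviated test line: the JSON payload is shortened for readability.
    static final String SAMPLE =
        "[2014-07-24 15:54:54] [6] OperationData.php: :89|"
        + "{\"action\":\"add\",\"source\":\"web\"} "
        + "trace_id [40618849399972881308]";

    // Returns capture group n of the sample line, or null if the line does not match.
    static String group(int n) {
        Matcher m = INPUT_REGEX.matcher(SAMPLE);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) {
            System.out.println("group " + i + " = " + group(i));
        }
    }
}
```

Running it shows four groups, with trace_id as group 3 and the numeric ID as group 4.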
Looking at the implementation, RegexSerDe takes the following three parameters:
1) input.regex: the regular expression
2) output.format.string: the output format
3) input.regex.case.insensitive: whether matching is case-insensitive
input.regex is used in the deserialize method, that is, when reading data (Hive reading files from HDFS). output.format.string, in contrast, is used in the serialize method, that is, when writing data (Hive writing files to HDFS).
The deserialize method contains the following code, which fills each column from a matched group:
for (int c = 0; c < numColumns; c++) {
  // numColumns is derived from the table's columns, e.g. columnNames is
  // [time_local, request_json, trace_id_num] and numColumns = columnNames.size()
  try {
    // Column c (counted from 0) is filled from group c + 1, with no skipping.
    // So trace_id_num is the 3rd group of the regular expression,
    // and output.format.string plays no part here.
    row.set(c, m.group(c + 1));
  } catch (RuntimeException e) {
    partialMatchedRows++;
    if (partialMatchedRows >= nextPartialMatchedRows) {
      nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);
      // Report the row
      LOG.warn("" + partialMatchedRows + " partially unmatched rows are found, "
          + " cannot find group " + c + ": " + rowText);
    }
    row.set(c, null);
  }
}
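Because columns are filled from groups 1..n in order, one way to get the numeric ID into the third column is to stop capturing the trace_id token. The sketch below mimics the loop above with plain java.util.regex; it is not Hive code, the class, field, and method names are hypothetical, and the sample line uses a shortened JSON payload. The second pattern differs from the first only in making the third group non-capturing with (?:...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnMappingSketch {
    static final List<String> COLUMN_NAMES =
        Arrays.asList("time_local", "request_json", "trace_id_num");

    // The table's input.regex (four capturing groups).
    static final String ORIGINAL = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]";
    // Hypothetical variant: the trace_id token is matched but not captured.
    static final String SKIP_THIRD = "\\[(.*?)\\] .*\\|(.*?) (?:.*?) \\[(.*?)\\]";

    // Abbreviated test line: the JSON payload is shortened for readability.
    static final String SAMPLE =
        "[2014-07-24 15:54:54] [6] OperationData.php: :89|"
        + "{\"action\":\"add\"} trace_id [40618849399972881308]";

    // Column c receives capture group c + 1; returns null if the line does not match.
    static List<String> toRow(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> row = new ArrayList<>();
        for (int c = 0; c < COLUMN_NAMES.size(); c++) {
            row.add(m.group(c + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println("original:   " + toRow(ORIGINAL, SAMPLE));
        System.out.println("skip third: " + toRow(SKIP_THIRD, SAMPLE));
    }
}
```

With the original pattern the third column holds trace_id; with the non-capturing variant it holds 40618849399972881308.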
The output.format.string setting appears to have no effect. RegexSerDe only works with TEXTFILE tables. Data can be imported into the Hive table with LOAD, but LOAD is an HDFS-level file operation that involves no serialization. To exercise serialization you have to write data with INSERT INTO ... SELECT, but the data written that way depends on what the SELECT returns and seems to have nothing to do with output.format.string.
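To see what a positional format string such as the one in output.format.string would mean on the write path, here is a sketch using String.format. It assumes the serializer passes one argument per table column to the format string, which is an assumption about the implementation rather than something shown in the code above; the class and method names are hypothetical. With three columns, %4$s refers to a fourth argument that does not exist, and java.util.Formatter rejects that.

```java
import java.util.MissingFormatArgumentException;

class OutputFormatSketch {
    // Formats column values with a positional format string. Returns null when
    // the format refers to an argument index that was not supplied.
    static String format(String outputFormat, Object... columns) {
        try {
            return String.format(outputFormat, columns);
        } catch (MissingFormatArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Placeholder values standing in for the three table columns.
        Object[] cols = {"2014-07-24 15:54:54", "{}", "trace_id"};
        System.out.println(format("%1$s %2$s %3$s", cols));
        System.out.println(format("%1$s %2$s %4$s", cols));
    }
}
```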
In fact, there are two RegexSerDe classes, located in
./serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java and
./contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
Both extend the abstract class AbstractSerDe. The code shows that the class under contrib implements both the serialize and deserialize methods, while the first one only implements deserialize. So the serialize method in RegexSerDe seems to be of little practical use.
Note the following points:
1. If a row does not match the regular expression, every field of that row is output as NULL.
if (!m.matches()) {
  unmatchedRows++;
  if (unmatchedRows >= nextUnmatchedRows) {
    nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);
    // Report the row
    LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);
  }
  return null;
}
2. All table columns must be of type string; otherwise an error is reported. If other types are needed, use cast in the select to convert them.
for (int c = 0; c < numColumns; c++) {
  if (!columnTypes.get(c).equals(TypeInfoFactory.stringTypeInfo)) {
    throw new SerDeException(getClass().getName()
        + " only accepts string columns, but column[" + c + "] named "
        + columnNames.get(c) + " has type " + columnTypes.get(c));
  }
}
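Note 1 hinges on Matcher.matches(), which requires the whole line to match, not just part of it. A minimal sketch with plain java.util.regex (not Hive code; the class name, method name, and the short sample lines are hypothetical) shows that extra text before or after an otherwise valid line is enough to lose the row:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmatchedRowSketch {
    // The table's input.regex, written with Java string escaping.
    static final Pattern INPUT_REGEX =
        Pattern.compile("\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]");

    // Returns the first capture group, or null when the whole line does not match.
    static String firstColumn(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null; // the whole row would come back as NULL
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // Short made-up lines in the same shape as the test data.
        System.out.println(firstColumn("[t] x|{} trace_id [1]"));
        System.out.println(firstColumn("[t] x|{} trace_id [1] extra"));
        System.out.println(firstColumn("INFO [t] x|{} trace_id [1]"));
    }
}
```

If partial lines should still be parsed, the pattern itself has to account for the surrounding text, for example with a leading or trailing .* in input.regex.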
This article is from the "Food light blog" blog. Please be sure to keep this source: http://caiguangguang.blog.51cto.com/1652935/1532987