A probe into the data processing of Hive JSON

Source: Internet
Author: User

backgroundJSON is a lightweight data format with a flexible structure, supports nesting, is easy to read and write, and the mainstream programming language provides a framework or class library to support interaction with JSON data, so a large number of systems use JSON as a log storage format. Before using hive to parse data (meaning text), you first need to create a table of data for the data you want to analyze before you can use hive SQL to parse the data in this datasheet. This involves how we map a line of text data to columns in a data table, usually in two ways: (1) delimiter (2) regular expression but hive itself does not provide native support for parsing JSON data, providing only two built-in functions: Get_json_ Object and Json_tuple, which are used to parse a column of JSON data. The main reason is that the data in JSON format is too flexible, especially in scenarios where common data is used in conjunction with JSON data, multi-layered nesting, JSON objects, and JSON array objects, the general data parsing method becomes stretched. This is also the focus of this article discussion. Programme  1. Common data is used in conjunction with JSON data, where JSON data does not exist in combination with multiple layers of nesting, JSON objects, and JSON array objects;   can be considered as "&_", data part data format JSON, for this case , we use the regular way for it to set up a data table, the following:   verify that the data table parsing data is correct,   visible three rows of data are correctly parsed, but we have no way to directly map "data" slice_id, status to the column , we can only analyze the data indirectly through Get_json_object or json_tuple.   Because Get_json_object has performance problems when parsing multiple columns of data, refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ for details Udf#languagemanualudf-json_tuple, so here we only discuss json_tuple.   is actually what we want to do is to map the data section to columns, and notice that the data section is all Key/value in a simple, non-nested form, so we can do this:   by lateral View (Https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView) Way we get all the data columns with Json_tuple. Obviously, it's not friendly to let our users face such a complex SQL statement every time they analyze the data, which can be hidden by creating a view.    View Mytable_view helps us hide the process of mapping JSON data to data columns, and users can only manipulate a standard columnar view of data when analyzing data.  2. JSON array objects;  get_json_object and json_tuple can only handle JSON objects, and there is no way to manipulate JSON array objects, if we need to parse the following data:   Hive built-in functionality is no way to support such data parsing, so we need to expand it ourselves.  json_tuple can handle normal JSON objects, so all we have to do is extend a json_array to convert the JSON array object to a JSOAn array of n objects.   NOTE: The JSON object, JSON array object, and JSON object array elements discussed here are JSON strings.  2.1 creating a json_array  extension UDF requires an inheritance of genericudf, which generally requires rewriting two methods: Initialize, evaluate.  initialize core logic is as follows:  (1) whether the number of parameters is 1, this parameter is the JSON array object string, (2) define the parameters of the converter, for the late acquisition of parameter values, (3) to define the UDF return result type: string array;   evaluate core logic is as follows:  (1) determines whether the number of arguments is 1, whether the parameter value is NULL, and if the argument is valid, gets the passed JSON array object string jsonarraystr;   (2) If Jsonarraystr is an empty string, return null, otherwise proceed to the next step, (3) parse jsonarraystr with Gson, return NULL if parsing fails, and if parsing succeeds, make the following decision:     a. If it is a JSON array object, continue to the next;     b. If it is not a JSON array object, NULL is returned, (4) The individual "elements" of the JSON array object string form into result and return (note that different "element" types get strings differently, here we ignore null).    we packaged the class file and Gson as a separate jar, stored in HDFs, and then created the permanent Function via hive, as follows:  so we can start in hive The function json_array is used in SQL.  2.2 using json_array  (1) To create a data table;  because the log data is a JSON array string, we can create a data table that is only one column, as follows:    (2) using Json_ Array Map column;  each JSON array object contains two JSON objects (which can be a different number), each JSON object contains the following properties: TS, ID, log, IP, we first map these columns, as follows:   first lateral View converts a JSON array object (string) to a JSON object array (string) and converts it to a JSON object by explode (wordThe second lateral view maps a JSON object (string) to a data column.   We can also take a step further, using the same method to decompose the "IP" column, as follows:   we can also hide the above-mentioned mapping process through the previous method of creating the view, which we will not repeat here.   SummaryWith the hive built-in function json_tuple and our own extended json_array, the combination of the two makes it very flexible to "map" the JSON data, and the mapping process can be hidden by creating a view. We also need to note that JSON itself is a very flexible data format, but the actual application can not be abused, such as: avoid multi-layer nesting, data structure is not uniform, or use hive to analyze JSON log data will be more cumbersome.

A probe into the data processing of Hive JSON

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.