[Hive-languagemanual] Xpathudf

Source: Internet
Author: User

documentation for built-in user-defined Functions related to XPath

Udfsxpath, Xpath_short, Xpath_int, Xpath_long, Xpath_float, xpath_double, Xpath_number, xpath_string
    • Functions for parsing XML data using XPath expressions.
    • Since Version:0.6.0overview

The XPath family of UDFs is wrappers around the Java XPath library javax.xml.xpath provided by the JDK. The library is based on the XPath 1.0 specification. Refer to Http://java.sun.com/javase/6/docs/api/javax/xml/xpath/package-summary.html for detailed information On the Java XPath library.

All functions follow the form: xpath_*(xml_string, xpath_expression_string) . The XPath expression string is compiled and cached. It is reused if the expression in the next input row matches the previous. Otherwise, it is recompiled. So, the XML string was always parsed to every input row, but the XPath expression was precompiled and reused for the vast m Ajority of Use cases.

Backward axes is supported. For example:

> select xpath (‘<a><b id="1"><c/></b><b id="2"><c/></b></a>‘,‘/descendant::c/ancestor::b/@id‘) from t1 limit 1;[1","2]

Each function returns a specific Hive type given the XPath expression:

    • xpathReturns a Hive array of strings.
    • xpath_stringReturns a string.
    • xpath_booleanReturns a Boolean.
    • xpath_shortReturns a short integer.
    • xpath_intReturns an integer.
    • xpath_longReturns a long integer.
    • xpath_floatReturns a floating point number.
    • xpath_double,xpath_numberReturns a double-precision floating point number (was an xpath_number alias for xpath_double ).

The UDFs is schema agnostic-no XML validation is performed. However, malformed XML (e.g., <a><b>1</b></aa> ) would result in a runtime exception being thrown.

Following is specifics on each XPath UDF variant.

Xpath

The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g., another XML node) The function would return an empty array. There is 2 primary uses for this function:to get a list of node text values or to get a list of attribute values.

Examples:

non-matching XPath expression:

> select xpath(‘<a><b>b1</b><b>b2</b></a>‘,‘a/*‘) from src limit 1;[]

Get a list of node text values:

> select xpath(‘<a><b>b1</b><b>b2</b></a>‘,‘a/*/text()‘) from src limit 1;[b1","b2]

Get a list of values for attribute ' ID ':

> select xpath(‘<a><b id="foo">b1</b><b id="bar">b2</b></a>‘,‘//@id‘) from src limit 1;[foo","bar]

Get a list of node texts for nodes where the ' class ' attribute equals ' BB ':

> SELECT xpath (‘<a><b class="bb">b1</b><b>b2</b><b>b3</b><c class="bb">c1</c><c>c2</c></a>‘‘a/*[@class="bb"]/text()‘) FROM src LIMIT 1;[b1","c1]
Xpath_string

The xpath_string() function returns the text of the first matching node.

Get the text for node ' A/b ':

> SELECT xpath_string (‘<a><b>bb</b><c>cc</c></a>‘‘a/b‘) FROM src LIMIT 1;bb

Get the text for node ' a '. Because ' A ' has children nodes with text, and the result is a composite of text from the children.

> SELECT xpath_string (‘<a><b>bb</b><c>cc</c></a>‘‘a‘) FROM src LIMIT 1;bbcc

Non-matching expression returns an empty string:

> SELECT xpath_string (‘<a><b>bb</b><c>cc</c></a>‘‘a/d‘) FROM src LIMIT 1;

Gets the text of the first node that matches '//b ':

> SELECT xpath_string (‘<a><b>b1</b><b>b2</b></a>‘‘//b‘) FROM src LIMIT 1;b1

Gets the second matching node:

> SELECT xpath_string (‘<a><b>b1</b><b>b2</b></a>‘‘a/b[2]‘) FROM src LIMIT 1;b2

Gets the text from the first node, which has a attribute ' id ' with value ' b_2 ':

> SELECT xpath_string (‘<a><b>b1</b><b id="b_2">b2</b></a>‘‘a/b[@id="b_2"]‘) FROM src LIMIT 1;b2
Xpath_boolean

Returns true if the XPath expression evaluates to True, or if a matching node is found.

Match found:

> SELECT xpath_boolean (‘<a><b>b</b></a>‘‘a/b‘) FROM src LIMIT 1;true

No Match found:

> SELECT xpath_boolean (‘<a><b>b</b></a>‘‘a/c‘) FROM src LIMIT 1;false

Match found:

> SELECT xpath_boolean (‘<a><b>b</b></a>‘‘a/b = "b"‘) FROM src LIMIT 1;true

No Match found:

> SELECT xpath_boolean (‘<a><b>10</b></a>‘‘a/b < 10‘) FROM src LIMIT 1;false
Xpath_short, Xpath_int, Xpath_long

These functions return an integer numeric value, or the value zero if no match is found, or a match was found but the value Is Non-numeric.
Mathematical operations is supported. In cases where the value overflows the return type and then the maximum value for the type is returned.

No match:

> SELECT xpath_int (‘<a>b</a>‘‘a = 10‘) FROM src LIMIT 1;0

Non-numeric match:

> SELECT xpath_int (‘<a>this is not a number</a>‘‘a‘) FROM src LIMIT 1;0> SELECT xpath_int (‘<a>this 2 is not a number</a>‘‘a‘) FROM src LIMIT 1;0

Adding values:

> SELECT xpath_int (‘<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>‘‘sum(a/*)‘) FROM src LIMIT 1;15> SELECT xpath_int (‘<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>‘‘sum(a/b)‘) FROM src LIMIT 1 ;7> SELECT xpath_int (‘<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>‘‘sum(a/b[@class="odd"])‘) FROM src LIMIT 1;5

Overflow:

> SELECT xpath_int (‘<a><b>2000000000</b><c>40000000000</c></a>‘‘a/b * a/c‘) FROM src LIMIT 1;2147483647
Xpath_float, Xpath_double, Xpath_number

Similar to Xpath_short, Xpath_int and Xpath_long but with floating point semantics. Non-matches result in zero. However,
Non-numeric matches result in NaN. Note that's an xpath_number() alias for xpath_double() .

No match:

> SELECT xpath_double (‘<a>b</a>‘‘a = 10‘) FROM src LIMIT 1;0.0

Non-numeric match:

> SELECT xpath_double (‘<a>this is not a number</a>‘‘a‘) FROM src LIMIT 1;NaN

A Very large number:

SELECT xpath_double (‘<a><b>2000000000</b><c>40000000000</c></a>‘‘a/b * a/c‘) FROM src LIMIT 1;8.0E19

[Hive-languagemanual] Xpathudf

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.