Introduction and usage of Hive SymlinkTextInputFormat

1. Introduction
First, the official description: "Symlink file is a text file which contains a list of filename/dirname. This input method reads symlink files from specified job input paths and takes the files/directories specified in those symlink files as actual map-reduce input. The target input data should be in TextInputFormat."
A symlink file is an ordinary text file. What is stored in the Hive table directory is not the data to be read but the addresses where that data lives, similar to a soft link on Linux. The addresses in a symlink file may contain wildcard (glob) patterns, which makes the format very flexible to use.
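The indirection can be imitated on a local filesystem. The sketch below is only an analogy, not Hive's implementation: it assumes a scratch directory and two made-up data files, writes a link file that lists them, and then reads the data by following each listed path, much as the input format treats the listed paths as the real input.

```shell
# Local analogy only: no Hive or HDFS involved; all file names are hypothetical.
work=$(mktemp -d)

# Two "data" files that live outside the table directory.
printf 'row from file 1\n' > "$work/data1.txt"
printf 'row from file 2\n' > "$work/data2.txt"

# The link file holds paths, one per line, not the data itself.
printf '%s\n' "$work/data1.txt" "$work/data2.txt" > "$work/link"

# Follow every path in the link file and read the real data.
rows=$(while IFS= read -r target; do cat "$target"; done < "$work/link")
printf '%s\n' "$rows"
```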

2. Examples and explanations
2.1 Create the table
CREATE TABLE symlink_text_input_format (key STRING, value STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
2.2 Upload two text files to the /tmp directory on HDFS:
hadoop fs -put symlink1.txt /tmp/symlink1.txt
hadoop fs -put symlink2.txt /tmp/symlink2.txt
2.3 Create a link file whose content is the addresses of the real files, one per line:
/tmp/symlink1.txt
/tmp/symlink2.txt
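One way to produce such a link file is a here-document that writes one target path per line. This is a sketch that assumes the two uploaded files are /tmp/symlink1.txt and /tmp/symlink2.txt and that the local file is called link, the name used in the upload step.

```shell
# Write the link file locally: one HDFS path per line, no blank lines.
cat > link <<'EOF'
/tmp/symlink1.txt
/tmp/symlink2.txt
EOF

# Show what will be uploaded to the table directory.
cat link
```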
2.4 Upload the link file to the directory of the Hive table symlink_text_input_format:
hadoop fs -put link /user/hive/warehouse/test.db/symlink_text_input_format/link
2.5 Query symlink_text_input_format
SELECT * FROM symlink_text_input_format;
When the table is queried, Hive first reads the addresses listed in the link file and then uses the files at those addresses as the input of the table.
2.6 Wildcard (glob) patterns are supported in link files. With the link file below, every file in the /tmp directory whose name starts with symlink is treated as input to the Hive table:
/tmp/symlink*
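To see which names a pattern such as symlink* would select, the same wildcard can be tried on a local directory. This is an analogy under the assumption that the simple * wildcard behaves alike in the shell and in Hadoop path matching; the scratch directory and the file other.txt are hypothetical.

```shell
# Local analogy: create three files, two of which match symlink*.
dir=$(mktemp -d)
touch "$dir/symlink1.txt" "$dir/symlink2.txt" "$dir/other.txt"

# The unquoted * is expanded by the shell to the matching names only.
matched=$(cd "$dir" && printf '%s\n' symlink*)
printf '%s\n' "$matched"
```

Here other.txt is left out because its name does not start with symlink.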

3. Application Scenario
There are 3 directories on the cluster, and the files in each directory are as follows:
/logdata/uigs/web/cnc/201401/20140414
----/logdata/uigs/web/cnc/201401/20140414/ip1.201404141220.log
----/logdata/uigs/web/cnc/201401/20140414/ip2.201404141220.log
----/logdata/uigs/web/cnc/201401/20140414/ip3.201404141220.log
----/logdata/uigs/web/cnc/201401/20140414/ip4.201404141220.log
/logdata/uigs/web/tc/201401/20140414
----/logdata/uigs/web/tc/201401/20140414/ip5.201404141220.log
----/logdata/uigs/web/tc/201401/20140414/ip6.201404141220.log
----/logdata/uigs/web/tc/201401/20140414/ip7.201404141220.log
----/logdata/uigs/web/tc/201401/20140414/ip81.201404141220.log
/logdata/uigs/web/sjs/201401/20140414
----/logdata/uigs/web/sjs/201401/20140414/ip9.201404141220.log
----/logdata/uigs/web/sjs/201401/20140414/ip10.201404141220.log
----/logdata/uigs/web/sjs/201401/20140414/ip11.201404141220.log
----/logdata/uigs/web/sjs/201401/20140414/ip12.201404141220.log
There is a table uigs in Hive that is partitioned by minute. To make the data above available in the logdata=201404141220 partition, create a link file with the following content:
/logdata/uigs/web/sjs/201401/20140414/*201404141220*
/logdata/uigs/web/cnc/201401/20140414/*201404141220*
/logdata/uigs/web/tc/201401/20140414/*201404141220*
Then put the link file into the directory of the logdata=201404141220 partition.
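Writing such a link file by hand for every minute is tedious, so it can be generated. The sketch below assumes the directory layout listed above, including the 201401 path component exactly as it appears there; the variable names and the local file name link.201404141220 are illustrative.

```shell
# Minute to load, in the yyyyMMddHHmm form used by the log file names.
minute=201404141220
day=${minute%????}   # strip the trailing HHmm, leaving 20140414

link_file="link.$minute"
for src in sjs cnc tc; do
  # One wildcard path per source directory; 201401 is copied from the listing above.
  printf '/logdata/uigs/web/%s/201401/%s/*%s*\n' "$src" "$day" "$minute"
done > "$link_file"

cat "$link_file"
```

The generated file can then be uploaded into the partition directory with hadoop fs -put, as in step 2.4. If the partition is not yet registered with Hive, it presumably also has to be added to the table before it can be queried.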
