Introduction to the ODPS Data Import Function

Source: Internet
Author: User

Before you can use ODPS's powerful data processing capabilities, your first concern is how to get your data into ODPS. This article introduces Fluentd, a tool for importing data into ODPS.

Fluentd is an open-source log collector that gathers logs from a variety of sources (application logs, syslog, access logs, and so on), lets users filter the log data with plugins, and stores it in different backends, including MySQL, Oracle, MongoDB, Hadoop, Treasure Data, AWS services, Google services, and ODPS. Fluentd is known for being small and flexible: users can write custom plugins for data sources, filtering, and output targets, and more than 300 open-source plugins currently run on the Fluentd architecture. ODPS has also open-sourced its data import plugin for Fluentd.

Environment preparation

Importing data into ODPS with Fluentd requires the following environment:

Ruby 2.1.0 or later

Gem 2.4.5 or later

Fluentd 0.10.49 or later (the Fluentd official website provides builds for different operating systems)

protobuf 3.5.1 or later (Ruby protobuf)

Installing the Import Plugin

Next, you can install the ODPS Fluentd import plugin in either of the following two ways.

Method one: install through Ruby gem:

$ gem install fluent-plugin-neitui-odps

ODPS has published this plugin to the gem repository under the name fluent-plugin-neitui-odps, so a single gem install command is enough. (If the gem repository is unreachable from your network, you can switch to a gem mirror; instructions for changing the gem source are easy to find online.)

Method two: install from the plugin source code:

$ gem install protobuf

$ gem install fluentd --no-ri --no-rdoc

$ git clone https://github.com/neitui/neitui-odps-fluentd-plugin.git

$ cp -R neitui-odps-fluentd-plugin/lib/fluent/plugin/* {your_fluentd_directory}/lib/fluent/plugin/

The second command installs Fluentd and can be omitted if Fluentd is already installed. The ODPS Fluentd plugin source code is hosted on GitHub; clone it and copy the plugin files directly into Fluentd's plugin directory.

Using the Plugin

When importing data with Fluentd, the key step is writing the Fluentd conf file. For more on conf files, see: http://docs.fluentd.org/articles/config-file

Example one: importing Nginx logs. The source section of the conf file is as follows:

<source>
  type tail
  path /opt/log/in/in.log
  pos_file /opt/log/in/in.log.pos
  refresh_interval 5s
  tag in.log
  format /^(?<remote>[^ ]*) - - \[(?<datetime>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*) "-" "(?<agent>[^"]*)"$/
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

Fluentd uses tail mode to monitor the specified file for changes. For more tail configuration options, see: http://docs.fluentd.org/articles/in_tail
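As a quick sanity check, the format regex above can be exercised in plain Ruby against a sample access-log line (the log line and its field values here are illustrative, not from the original article):

```ruby
# The same named-capture regex used in the <source> section above.
FORMAT = /^(?<remote>[^ ]*) - - \[(?<datetime>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*) "-" "(?<agent>[^"]*)"$/

# A made-up log line in the expected Nginx access-log layout.
line = '127.0.0.1 - - [29/Aug/2015:11:10:16 +0800] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.43.0"'

m = line.match(FORMAT)
puts m[:remote]   # 127.0.0.1
puts m[:method]   # GET
puts m[:path]     # /index.html
puts m[:code]     # 200
```

If the regex does not match your real log layout, adjust the captures here first before wiring it into the conf file.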

The match configuration is as follows:

<match in.**>
  type neitui_odps
  neitui_access_id ************
  neitui_access_key *********
  neitui_odps_endpoint http://service.odps.neitui.com/api
  neitui_odps_hub_endpoint http://dh.odps.neitui.com
  buffer_chunk_limit 2m
  buffer_queue_limit 128
  flush_interval 5s
  project projectforlog
  <table in.log>
    table nginx_log
    fields remote,method,path,code,size,agent
    partition ctime=${datetime.strftime('%Y%m%d')}
    time_format %d/%b/%Y:%H:%M:%S %z
  </table>
</match>

The data is imported into the nginx_log table in the projectforlog project, partitioned by the datetime field from the source; when the plugin encounters a new partition value, the partition is created automatically.

Example two: importing data from MySQL. To import data from MySQL, install the fluent-plugin-sql plugin as the source:

$ gem install fluent-plugin-sql

To configure source in conf:

<source>
  type sql
  host 127.0.0.1
  database test
  adapter mysql
  username xxxx
  password xxxx
  select_interval 10s
  select_limit 100
  state_file /path/sql_state
  <table>
    table test_table
    tag in.sql
    update_column id
  </table>
</source>

This example selects data from test_table, reading up to 100 rows every 10 seconds and using the id column as the tracking key (the id field is auto-incrementing). For more on fluent-plugin-sql, see: https://github.com/fluent/fluent-plugin-sql
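Conceptually, each polling cycle issues an incremental query keyed on update_column. The sketch below shows the kind of query this produces; the exact SQL text is an assumption based on the plugin's documented behavior, and last_id is a hypothetical value restored from state_file:

```ruby
# Sketch of the incremental read fluent-plugin-sql performs (assumed behavior):
# select rows whose tracking column (update_column, here id) exceeds the last
# value recorded in state_file, bounded by select_limit.
last_id      = 42    # hypothetical offset restored from /path/sql_state
select_limit = 100

query = "SELECT * FROM test_table WHERE id > #{last_id} " \
        "ORDER BY id ASC LIMIT #{select_limit}"
puts query
```

Because only rows with id greater than the saved offset are fetched, rows inserted between polls are picked up on the next cycle without re-reading the whole table.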

The match configuration is as follows:

<match in.**>
  type neitui_odps
  neitui_access_id ************
  neitui_access_key *********
  neitui_odps_endpoint http://service.odps.neitui.com/api
  neitui_odps_hub_endpoint http://dh.odps.neitui.com
  buffer_chunk_limit 2m
  buffer_queue_limit 128
  flush_interval 5s
  project your_projectforlog
  <table in.sql>
    table mysql_data
    fields id,field1,field2,field3
  </table>
</match>

The data is imported into the mysql_data table in the your_projectforlog project; the imported fields are id, field1, field2, and field3.

Notes on the Import Table

Importing data through Fluentd goes over DataHub, ODPS's real-time data channel. This requires a special ODPS table: the table must be designated as a hub table when it is created. You can use the following syntax when creating the table:

CREATE TABLE <table_name> (field_name type, ...) PARTITIONED BY (pt_name type) INTO <n1> SHARDS HUBLIFECYCLE <n2>;

Here n1 is the number of shards, with valid values 1-20; when importing data, each shard can ingest up to 10 MB/second. n2 is the retention period (in days) of the data on DataHub, with valid values 1-7; it is mainly used in stream-computing scenarios that replay historical data. For example:

CREATE TABLE access_log (f1 string, f2 string, f3 string, f4 string, f5 string, f6 string, f7 string) PARTITIONED BY (ctime string) INTO 5 SHARDS HUBLIFECYCLE 7;
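Given the per-shard limit, the aggregate ingest ceiling of the 5-shard example table works out as follows (assuming the 10 MB/second-per-shard figure above):

```ruby
shards             = 5    # INTO 5 SHARDS in the DDL above
per_shard_mb_per_s = 10   # per-shard inflow limit from the text above

aggregate = shards * per_shard_mb_per_s
puts aggregate            # total MB/second the table can absorb
```

When sizing n1, pick a shard count whose aggregate comfortably exceeds your expected peak write rate.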

If you are importing data into a table that already exists, you must first convert it into a hub table with the following command:

ALTER TABLE table_name ENABLE HUBTABLE WITH <n1> SHARDS HUBLIFECYCLE <n2>;

Plugin Parameter Description

Importing data into ODPS requires configuring the ODPS plugin in the match section of the conf file. The parameters supported by the plugin are described below:

type: fixed value neitui_odps.

neitui_access_id (required): the access_id of your cloud account.

neitui_access_key (required): the access_key of your cloud account.

neitui_odps_hub_endpoint (required): if your service is deployed on ECS, set this to http://dh-ext.odps.neitui-inc.com; otherwise set it to http://dh.odps.neitui.com.

neitui_odps_endpoint (required): if your service is deployed on ECS, set this to http://odps-ext.aiyun-inc.com/api; otherwise set it to http://service.odps.neitui.com/api.

buffer_chunk_limit (optional): chunk size, supporting "k" (KB), "m" (MB), and "g" (GB) units; default 8 MB, recommended value 2 MB.

buffer_queue_limit (optional): length of the chunk queue; together with buffer_chunk_limit, this determines the total buffer size.

flush_interval (optional): forced flush interval; if a chunk has not filled up after this time, it is sent anyway. Default 60s.

project (required): project name.

table (required): table name.

fields (required): must correspond to the source; each field name must exist in the source.

partition (optional): set this if the table is partitioned.

Supported ways of setting the partition value:

Fixed value: partition ctime=20150804

Keyword: partition ctime=${remote} (where remote is a field in the source)

Time format keyword: partition ctime=${datetime.strftime('%Y%m%d')} (where datetime is a time-formatted field in the source; its value rendered in %Y%m%d format becomes the partition value)

time_format (optional): set this parameter if you use the time format keyword for partition. For example, if source[datetime] = "29/Aug/2015:11:10:16 +0800", set time_format to "%d/%b/%Y:%H:%M:%S %z".
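The time_format / strftime pairing can be checked in plain Ruby with the stdlib date library, using the example values above:

```ruby
require 'date'

# Parse the source timestamp with the time_format pattern, then render the
# partition value with the strftime pattern from the partition setting.
raw = '29/Aug/2015:11:10:16 +0800'
dt  = DateTime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')
puts dt.strftime('%Y%m%d')   # partition value: 20150829
```

If the two patterns are mismatched, parsing fails at import time, so verifying them against a real sample timestamp up front is cheap insurance.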
