Introduction to the ODPS data import feature
Before you can use ODPS's powerful data processing capabilities, the first question is how to get your data into ODPS. This article introduces Fluentd, a tool for importing data into ODPS.
Fluentd is an open-source log collector that gathers logs from a variety of sources (application logs, syslog, access logs, and so on), lets users filter the data through plugins of their choice, and stores it in different backends (MySQL, Oracle, MongoDB, Hadoop, Treasure Data, AWS services, Google services, ODPS, and more). Fluentd is known for being small and flexible: users can write their own plugins for data sources, filters, and output targets, and more than 300 such plugins, all open source, currently run on the Fluentd architecture. ODPS has likewise open-sourced its data import plugin for Fluentd.
Environment preparation
To import data into ODPS with this software, the following environment is required:
Ruby 2.1.0 or later
Gem 2.4.5 or later
Fluentd 0.10.49, or the latest version from the Fluentd official website (Fluentd provides different builds for different operating systems)
Protobuf 3.5.1 or later (the Ruby protobuf gem)
Installing the Import Plugin
Next, install the ODPS Fluentd import plugin in either of the following two ways.
Option one: install via RubyGems:
$ gem install fluent-plugin-neitui-odps
ODPS has published this plugin to the Gem repository under the name fluent-plugin-neitui-odps, so a single gem install command is enough. (If the Gem repository is unreachable from your network, search online for instructions on switching to a Gem mirror source.)
Option two: install from the plugin source code:
$ gem install protobuf
$ gem install fluentd --no-ri --no-rdoc
$ git clone https://github.com/neitui/neitui-odps-fluentd-plugin.git
$ cp -r neitui-odps-fluentd-plugin/lib/fluent/plugin/* {your_fluentd_directory}/lib/fluent/plugin/
The second command installs Fluentd itself and can be omitted if Fluentd is already installed. The source code of the ODPS Fluentd plugin lives on GitHub; clone it and copy the plugin files straight into the Fluentd plugin directory.
Using the plugin
When importing data with Fluentd, the key step is writing the Fluentd conf file. For more on conf files, see: http://docs.fluentd.org/articles/config-file
Example one: importing Nginx logs. The source section of the conf file is configured as follows:
<source>
  type tail
  path /opt/log/in/in.log
  pos_file /opt/log/in/in.log.pos
  refresh_interval 5s
  tag in.log
  format /^(?<remote>[^ ]*) - - \[(?<datetime>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*) "-" "(?<agent>[^"]*)"$/
  time_format %d/%b/%Y:%H:%M:%S %z
</source>
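As a sanity check outside Fluentd, the format regexp above can be exercised directly in Ruby against a sample access-log line (the line below is fabricated for illustration):

```ruby
# Standalone check of the tail-format regexp used in the source section.
# The named captures become the record fields Fluentd emits.
PATTERN = /^(?<remote>[^ ]*) - - \[(?<datetime>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*) "-" "(?<agent>[^"]*)"$/

line = '192.168.0.1 - - [29/Aug/2015:11:10:16 +0800] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.43.0"'
m = PATTERN.match(line)
puts m[:remote]    # 192.168.0.1
puts m[:datetime]  # 29/Aug/2015:11:10:16 +0800
puts m[:code]      # 200
```

Any log line that does not match this pattern would be skipped by the tail input, so it is worth testing the regexp against real lines from your own log format before deploying.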
Fluentd uses tail mode to watch the specified file for newly appended content. For more tail configuration options, see: http://docs.fluentd.org/articles/in_tail
The match section is configured as follows:
<match in.**>
  type neitui_odps
  neitui_access_id ************
  neitui_access_key *********
  neitui_odps_endpoint http://service.odps.neitui.com/api
  neitui_odps_hub_endpoint http://dh.odps.neitui.com
  buffer_chunk_limit 2m
  buffer_queue_limit 128
  flush_interval 5s
  project projectforlog
  <table in.log>
    table nginx_log
    fields remote,method,path,code,size,agent
    partition ctime=${datetime.strftime('%Y%m%d')}
    time_format %d/%b/%Y:%H:%M:%S %z
  </table>
</match>
The data is imported into the nginx_log table of the project projectforlog, partitioned by the value of the datetime field from the source; whenever the plugin encounters a new value, the corresponding partition is created automatically.
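To see what the ${datetime.strftime('%Y%m%d')} partition expression produces, the same conversion can be reproduced in plain Ruby (the sample timestamp is made up; parsing uses the table's time_format):

```ruby
require 'date'

# Parse a sample timestamp with the table's time_format, then render it
# the way the partition expression does to obtain the partition value.
t = DateTime.strptime('29/Aug/2015:11:10:16 +0800', '%d/%b/%Y:%H:%M:%S %z')
puts "ctime=#{t.strftime('%Y%m%d')}"  # ctime=20150829
```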
Example two: importing data from MySQL. Importing from MySQL requires the fluent-plugin-sql plugin as the source:
$ gem install fluent-plugin-sql
To configure source in conf:
<source>
  type sql
  host 127.0.0.1
  database test
  adapter mysql
  username xxxx
  password xxxx
  select_interval 10s
  select_limit 100
  state_file /path/sql_state
  <table>
    table test_table
    tag in.sql
    update_column id
  </table>
</source>
This example selects data from test_table, reading up to 100 rows every 10 seconds and using the id column (an auto-increment field) as the key for incremental selects. For more on fluent-plugin-sql, see: https://github.com/fluent/fluent-plugin-sql
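The incremental-read behaviour described above can be sketched in plain Ruby. This is a simplified model of what fluent-plugin-sql does, not its actual code, and the row data is fabricated:

```ruby
# Simplified model of the polling loop: each poll selects up to
# select_limit rows whose update_column value exceeds the last one seen,
# then remembers the new high-water mark (persisted in state_file by the
# real plugin).
def poll(rows, last_id, select_limit)
  batch = rows.select { |r| r[:id] > last_id }.first(select_limit)
  new_last = batch.empty? ? last_id : batch.last[:id]
  [batch, new_last]
end

rows = (1..250).map { |id| { id: id, value: "row#{id}" } }
batch, last_id = poll(rows, 0, 100)
puts batch.size  # 100
puts last_id     # 100
batch, last_id = poll(rows, last_id, 100)
puts last_id     # 200
```

Because the high-water mark only moves forward, rows are read once each even though the table keeps growing, which is why update_column should be monotonic (such as an auto-increment id).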
The match configuration is as follows:
<match in.**>
  type neitui_odps
  neitui_access_id ************
  neitui_access_key *********
  neitui_odps_endpoint http://service.odps.neitui.com/api
  neitui_odps_hub_endpoint http://dh.odps.neitui.com
  buffer_chunk_limit 2m
  buffer_queue_limit 128
  flush_interval 5s
  project your_projectforlog
  <table in.sql>
    table mysql_data
    fields id,field1,field2,field3
  </table>
</match>
The data is imported into the mysql_data table of the ODPS project your_projectforlog; the imported fields are id, field1, field2, and field3.
Notes on the destination table
Importing data through Fluentd goes over DataHub, ODPS's channel for real-time data ingestion, which requires a special kind of ODPS table: the table must be declared as a hub table when it is created. Use the following syntax to create one:
CREATE TABLE <table_name> (field_name type, ...) PARTITIONED BY (pt_name type) INTO <n1> SHARDS HUBLIFECYCLE <n2>;
Here n1 is the number of shards, with valid values 1-20; during import, each shard accepts an inflow of up to 10 MB/s. n2 is the number of days the data is retained on DataHub, with valid values 1-7, mainly so that stream-computing scenarios can replay historical data. For example:
CREATE TABLE access_log (f1 string, f2 string, f3 string, f4 string, f5 string, f6 string, f7 string) PARTITIONED BY (ctime string) INTO 5 SHARDS HUBLIFECYCLE 7;
If you import data into a table that already exists, you must first convert it into a hub table with the following command:
ALTER TABLE <table_name> ENABLE HUBTABLE WITH <n1> SHARDS HUBLIFECYCLE <n2>;
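For example, converting the access_log table created above would look like this (an instantiation of the syntax just given; check the ODPS SQL reference for the exact form in your version):

```sql
ALTER TABLE access_log ENABLE HUBTABLE WITH 5 SHARDS HUBLIFECYCLE 7;
```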
Plugin parameter description
Importing data into ODPS requires configuring the ODPS plugin in the match section of the conf file. The parameters supported by the plugin are described below:
type: fixed value neitui_odps.
neitui_access_id (required): the Access ID of your cloud account.
neitui_access_key (required): the Access Key of your cloud account.
neitui_odps_hub_endpoint (required): if your service is deployed on ECS, set this to http://dh-ext.odps.neitui-inc.com; otherwise set it to http://dh.odps.neitui.com.
neitui_odps_endpoint (required): if your service is deployed on ECS, set this to http://odps-ext.aiyun-inc.com/api; otherwise set it to http://service.odps.neitui.com/api.
buffer_chunk_limit (optional): chunk size, supporting the units "k" (KB), "m" (MB), and "g" (GB); default 8 MB, recommended value 2 MB.
buffer_queue_limit (optional): length of the chunk queue; together with buffer_chunk_limit this determines the total buffer size.
flush_interval (optional): forced flush interval; once this much time has passed, a chunk is sent even if it is not full; default 60s.
project (required): project name.
table (required): table name.
fields (required): must correspond to the source; each field name must exist in the source.
partition (optional): set this when the table is a partitioned table.
The partition name can be set in the following ways:
Fixed value: partition ctime=20150804
Keyword: partition ctime=${remote} (where remote is a field in the source)
Time-format keyword: partition ctime=${datetime.strftime('%Y%m%d')} (where datetime is a time-formatted field in the source; its value rendered in %Y%m%d format becomes the partition name)
time_format (optional): set this parameter if you use a time-format keyword in partition. For example, if source[datetime] = "29/Aug/2015:11:10:16 +0800", set time_format to "%d/%b/%Y:%H:%M:%S %z".
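A quick back-of-the-envelope check of how buffer_chunk_limit and buffer_queue_limit interact, using the values from the example configurations above:

```ruby
# Maximum buffered data = chunk size * queue length.
buffer_chunk_limit_mb = 2    # "2m" in the example conf
buffer_queue_limit    = 128
puts buffer_chunk_limit_mb * buffer_queue_limit  # 256 (MB held in the buffer at most)
```

If the destination is temporarily unreachable, Fluentd keeps filling this buffer, so size it to cover the longest outage you want to ride out at your log ingest rate.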