Description of HDFS sink configuration parameters in Flume.
Official configuration URL: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
type: hdfs
hdfs.path: HDFS path; it should include the filesystem URI, e.g. hdfs://namenode/flume/flumedata/
hdfs.filePrefix: Default: FlumeData. Filename prefix for files written to HDFS.
hdfs.fileSuffix: Filename suffix for files written to HDFS, e.g. .lzo, .log, etc.
hdfs.inUsePrefix: Filename prefix for temporary files. The HDFS sink first writes a temporary file into the target directory, then renames it to the final target file according to the roll rules.
hdfs.inUseSuffix: Default: .tmp. Filename suffix for temporary files.
hdfs.rollInterval: Default: 30. Interval, in seconds, after which the HDFS sink rolls the temporary file into the final target file.
If set to 0, files are not rolled based on time.
Note: rolling (roll) means that the HDFS sink renames the temporary file to the final target file and opens a new temporary file for writing data.
hdfs.rollSize: Default: 1024. File size, in bytes, at which the temporary file is rolled into the target file. If set to 0, files are not rolled based on size.
hdfs.rollCount: Default: 10. Number of events at which the temporary file is rolled into the target file. If set to 0, files are not rolled based on event count.
hdfs.idleTimeout: Default: 0. If the currently open temporary file receives no writes within this many seconds, it is closed and renamed to the target file.
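The three roll triggers can be combined, and any trigger left at 0 is disabled. A minimal sketch (the agent/sink names a1 and k1 are assumed, as in the example further below) that rolls purely by size:

```properties
# Roll a new HDFS file every 128 MB, regardless of time or event count
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728
# Additionally close any file that receives no events for 60 seconds
a1.sinks.k1.hdfs.idleTimeout = 60
```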
hdfs.batchSize: Default: 100. Number of events flushed to HDFS per batch.
hdfs.codeC: File compression codec: gzip, bzip2, lzo, lzop, or snappy.
hdfs.fileType: Default: SequenceFile. File format: SequenceFile, DataStream, or CompressedStream.
With DataStream, files are not compressed and hdfs.codeC need not be set.
With CompressedStream, a valid hdfs.codeC value must be set.
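For instance, pairing CompressedStream with a codec (sink name k1 assumed):

```properties
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# With DataStream instead, hdfs.codeC is left unset:
# a1.sinks.k1.hdfs.fileType = DataStream
```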
hdfs.maxOpenFiles: Default: 5000. Maximum number of HDFS files allowed to be open; when this limit is reached, the file opened earliest is closed.
hdfs.minBlockReplicas: Default: the HDFS replication factor. Minimum number of replicas per block of the written HDFS file.
This parameter affects file rolling; it is usually set to 1 so that files roll correctly as configured.
hdfs.writeFormat: Format for writing sequence files: Text or Writable (default).
hdfs.callTimeout: Default: 10000. Timeout, in milliseconds, for HDFS operations.
hdfs.threadsPoolSize: Default: 10. Number of threads the HDFS sink uses for HDFS I/O operations.
hdfs.rollTimerPoolSize: Default: 1. Number of threads the HDFS sink uses for time-based file rolling.
hdfs.kerberosPrincipal: Kerberos principal for secure HDFS authentication.
hdfs.kerberosKeytab: Kerberos keytab for secure HDFS authentication.
hdfs.proxyUser: proxy user.
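On a secured cluster the two Kerberos settings go together; a sketch, where the principal and keytab path are placeholders to adapt to the local environment:

```properties
# Placeholder principal and keytab path -- substitute site-specific values
a1.sinks.k1.hdfs.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
a1.sinks.k1.hdfs.kerberosKeytab = /etc/security/keytabs/flume.keytab
```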
hdfs.round: Default: false. Whether to round down the event timestamp used in path escape sequences.
hdfs.roundValue: Default: 1. The value to round the timestamp down to a multiple of.
hdfs.roundUnit: Default: second. Unit of the rounding value: second, minute, or hour.
Example:
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
When the event time is 2015-10-16 17:38:59, hdfs.path resolves to /flume/events/15-10-16/1730/00,
because the timestamp is rounded down to the nearest multiple of 10 minutes, so a new directory is generated every 10 minutes.
hdfs.timeZone: Default: Local Time. Time zone name used when resolving the directory path.
hdfs.useLocalTimeStamp: Default: false. Whether to use the local time (instead of the timestamp from the event header) when replacing escape sequences.
hdfs.closeTries: Default: 0. Number of times the HDFS sink attempts to close a file.
If set to 1, the sink makes no further attempts after a failed close, and the file may remain open with the in-use extension.
If set to 0, after a failed close the sink keeps retrying until it succeeds.
hdfs.retryInterval: Default: 180 (seconds). Interval between consecutive attempts to close a file. If set to 0, no retries are attempted, which is effectively the same as setting hdfs.closeTries to 1.
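For instance, to give up on closing a file after five attempts spaced one minute apart (sink name k1 assumed):

```properties
a1.sinks.k1.hdfs.closeTries = 5
a1.sinks.k1.hdfs.retryInterval = 60
```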
serializer: Default: TEXT. Serializer type.
For example:
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://cdh5/tmp/data/%Y%m%d
agent1.sinks.sink1.hdfs.filePrefix = log_%Y%m%d_%H
agent1.sinks.sink1.hdfs.fileSuffix = .lzo
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.fileType = CompressedStream
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.rollInterval = 600
agent1.sinks.sink1.hdfs.codeC = lzop
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.threadsPoolSize = 10
agent1.sinks.sink1.hdfs.idleTimeout = 0
agent1.sinks.sink1.hdfs.minBlockReplicas = 1
With the above configuration, a directory named like 20151016 is created daily under /tmp/data/ on HDFS, a new target file is generated every 10 minutes, and the target files are named like log_20151016_13.1444973768543.lzo and are lzop-compressed.
Attached is the telnet.py script used for testing:
import time
import telnetlib

if __name__ == '__main__':
    host = 'localhost'
    port = 44444
    tn = telnetlib.Telnet(host=host, port=port)
    for i in range(10000):
        print(i)
        # Telnet.write expects bytes in Python 3
        tn.write((str(i) + '\n').encode('ascii'))
        time.sleep(1)
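Since telnetlib was deprecated in Python 3.11 and removed in 3.13, a plain-socket variant of the same test sender may be useful. This is a sketch, not part of the original article; the host and port are the same assumptions as in telnet.py above:

```python
import socket
import time


def send_lines(host, port, lines, delay=0.0):
    """Open a TCP connection (as telnet does) and send each item
    as a newline-terminated ASCII line, pausing `delay` seconds
    between sends."""
    with socket.create_connection((host, port)) as sock:
        for line in lines:
            sock.sendall((str(line) + "\n").encode("ascii"))
            if delay:
                time.sleep(delay)


# Usage against the netcat-style source assumed above:
#   send_lines("localhost", 44444, range(10000), delay=1.0)
```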