Recently, after listening to Liaoliang's 2016 Big Data Spark "mushroom cloud" action, I needed to integrate Flume, Kafka, and Spark Streaming. It felt hard to get started at first, so I began with something simple. My idea: Flume produces the data and then outputs it to Spark Streaming; the Flume source is netcat (address: localhost, port 22222) and the output is Avro (address ...
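A minimal sketch of what such a Flume agent configuration might look like, assuming an agent named a1 and an Avro target on localhost:33333 (the agent name and the Avro host/port are illustrative assumptions; only the netcat address and port come from the text above):

  # assumed agent name: a1
  a1.sources = netcatSrc
  a1.channels = memCh
  a1.sinks = avroSink

  # netcat source listening on localhost:22222, as described above
  a1.sources.netcatSrc.type = netcat
  a1.sources.netcatSrc.bind = localhost
  a1.sources.netcatSrc.port = 22222
  a1.sources.netcatSrc.channels = memCh

  # simple in-memory channel (assumed)
  a1.channels.memCh.type = memory
  a1.channels.memCh.capacity = 1000

  # Avro sink; hostname/port are placeholders for wherever Spark Streaming listens
  a1.sinks.avroSink.type = avro
  a1.sinks.avroSink.hostname = localhost
  a1.sinks.avroSink.port = 33333
  a1.sinks.avroSink.channel = memCh

On the Spark Streaming side, a Flume receiver (for example FlumeUtils.createStream) would then listen on the same Avro host and port.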
First, Flume. Flume is a distributed, reliable, available, and very efficient service for collecting, aggregating, and moving large volumes of log data. 1. How to structure it: 1) all applications use one Flume server; 2) all applications share a Flume cluster; 3) each application uses its own Flume, which then uses a Flume ...
First, a basic introduction to Flume.
Component name and function:
Agent: runs Flume in a JVM. Each machine runs one agent, but a single agent can contain multiple sources and sinks.
Client: produces the data; runs in a separate thread.
Source: collects data from the client and passes it to the channel.
Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera. Flume supports customizing the various data senders in the log system in order to collect data, and it also provides the ability to do simple processing of the data and write it to various (customizable) data receivers.
Using an IP-based implementation. The test configuration is pasted below; the configuration is otherwise the same, just comment or uncomment the sinkgroup lines as needed. This is the configuration of the collection node:
  # Flume configuration file
  agent1.sources = execSource
  agent1.sinks = avroSink1 avroSink2
  agent1.channels = fileChannel
  # sink groups affect performance very much
  # agent1.sinkgroups = avroGroup
  # agent1.sinkgroups.avroGroup.sinks = avroSink1 avroSink2
  # sink scheduling mode: load_balance or failover
  # agent1.sinkgroups ...
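For reference, a hedged sketch of what the sink-group section might look like when the comments are removed; the processor settings are assumptions based on Flume's standard load_balance and failover sink processors, not values from the original file:

  # load-balancing sink group across the two Avro sinks (assumed settings)
  agent1.sinkgroups = avroGroup
  agent1.sinkgroups.avroGroup.sinks = avroSink1 avroSink2
  agent1.sinkgroups.avroGroup.processor.type = load_balance
  agent1.sinkgroups.avroGroup.processor.backoff = true
  agent1.sinkgroups.avroGroup.processor.selector = round_robin

  # alternative: failover processor with per-sink priorities
  # agent1.sinkgroups.avroGroup.processor.type = failover
  # agent1.sinkgroups.avroGroup.processor.priority.avroSink1 = 10
  # agent1.sinkgroups.avroGroup.processor.priority.avroSink2 = 5

With load_balance, events are spread across both sinks; with failover, the higher-priority sink is used until it fails.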
The previous article described how to produce data with a Thrift source; today we describe how to consume data with a Kafka sink. In fact, the Kafka sink has already been set up in the Flume configuration file:
  agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
  agent1.sinks.kafkaSink.topic = TRAFFIC_LOG
  agent1.sinks.kafkaSink.brokerList = 10.208.129.3:9092,10.208.129.4:9092,10.208.129.5:9092
  agent1.sinks.kafkaSink.metadata.broker.list = 10...
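For completeness, a hedged sketch of a fuller Kafka sink definition; the channel binding and batch size below are illustrative assumptions, and only the type, topic, and brokerList values come from the snippet above:

  agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
  agent1.sinks.kafkaSink.topic = TRAFFIC_LOG
  agent1.sinks.kafkaSink.brokerList = 10.208.129.3:9092,10.208.129.4:9092,10.208.129.5:9092
  # assumed: bind the sink to an existing channel and batch events per Kafka request
  agent1.sinks.kafkaSink.channel = fileChannel
  agent1.sinks.kafkaSink.batchSize = 100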
Flume mainly provides the following types of monitoring:
JMX monitoring
JMX monitoring can be enabled by modifying the JAVA_OPTS environment variable in the flume-env.sh file, as follows:
export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
Ganglia monitoring
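Ganglia reporting is normally switched on with the flume.monitoring.* system properties when the agent is started; a hedged example, where the Ganglia host/port and the agent/config names are placeholders:

  bin/flume-ng agent --conf conf --conf-file example.conf --name a1 \
    -Dflume.monitoring.type=ganglia \
    -Dflume.monitoring.hosts=ganglia-host:8649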
Document location: http://flume.apache.org/FlumeUserGuide.html#system-requirements
Java Runtime Environment - Java 1.8 or later (the Java version must be 1.8 or higher)
Memory - sufficient memory for the configured sources, channels, and sinks (there must be enough RAM for the channels and sources in use)
Disk space - sufficient disk space for the configured channels and sinks (enough disk space is required if the channel is of the file type)
Directory permissions - read/write permissions for the directories used by the agent
I. Overview
1. There are now three machines, hadoop1, hadoop2, and hadoop3; hadoop1 is used for log aggregation.
2. hadoop1 aggregates the logs and outputs them to multiple targets at the same time.
3. One Flume data source corresponds to multiple channels and multiple sinks, configured in the consolidation-accepter.conf file.
II. Deploy Flume to collect and aggregate the logs
1. Run on hadoop1: flume-ng agent --conf ./ -f consolidation ...
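A hedged sketch of how a single source can fan out to multiple channels and sinks; the agent, component, and target names below are illustrative and not taken from consolidation-accepter.conf:

  # one source replicated into two channels, each drained by its own sink
  collector.sources = avroIn
  collector.channels = chHdfs chKafka
  collector.sinks = hdfsOut kafkaOut

  collector.sources.avroIn.type = avro
  collector.sources.avroIn.bind = 0.0.0.0
  collector.sources.avroIn.port = 4545
  # the replicating selector copies every event to both channels
  collector.sources.avroIn.selector.type = replicating
  collector.sources.avroIn.channels = chHdfs chKafka

  collector.channels.chHdfs.type = file
  collector.channels.chKafka.type = memory

  collector.sinks.hdfsOut.type = hdfs
  collector.sinks.hdfsOut.hdfs.path = hdfs://hadoop1:9000/flume/logs
  collector.sinks.hdfsOut.channel = chHdfs

  collector.sinks.kafkaOut.type = org.apache.flume.sink.kafka.KafkaSink
  collector.sinks.kafkaOut.topic = logs
  collector.sinks.kafkaOut.brokerList = hadoop2:9092
  collector.sinks.kafkaOut.channel = chKafka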
Log collection exception; production reported the error log:
(org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run:280) - FATAL: Spool Directory source spool_source: { spoolDir: /apps/logs/libra }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.lang.IllegalStateException: File has been modified since being read: /apps/logs/libra/financial-webapp/spool/libra.2018-03-09_09-10-16.tmp
The hint ...
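The error usually means the producer kept writing to the file after the spooling directory source had started reading it. A hedged mitigation sketch: have the producer write files elsewhere (or under a temporary suffix) and only move completed files into the spool directory, and tell the source to skip in-progress files; the agent/source names and the suffix pattern below are assumptions:

  # ignore files that are still being written under a .tmp suffix
  agent1.sources.spool_source.type = spooldir
  agent1.sources.spool_source.spoolDir = /apps/logs/libra
  agent1.sources.spool_source.ignorePattern = ^.*\.tmp$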
Collect from different sources, aggregate logs, and transfer them to the storage system.
The source reads the data, which can come from a variety of clients or from another agent; it is deposited into the channel for the sink to consume, and the entire process is asynchronous.
The event is only deleted when it is successfully deposited into the channel of the next agent (multiple agents) or the final destination (a single agent), ensuring reliability.
There are two kinds of channels: file and memory.
Multiple instances to ...
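For the two channel types mentioned above, a hedged configuration sketch; the agent and channel names, capacities, and directories are illustrative assumptions:

  # memory channel: fast, but events are lost if the agent process dies
  agent1.channels.memCh.type = memory
  agent1.channels.memCh.capacity = 10000
  agent1.channels.memCh.transactionCapacity = 1000

  # file channel: slower, but events survive an agent restart
  agent1.channels.fileCh.type = file
  agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
  agent1.channels.fileCh.dataDirs = /var/flume/data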
Flume Official website: http://flume.apache.org/FlumeUserGuide.html
First, a simple metaphor to help understand Flume:
There is a pool: water flows in at one end and out at the other. The inlet can be fitted with various kinds of pipes, and so can the outlet; there can be multiple inlets and multiple outlets.
The water is called an event, and the inlet is called a source,
Flume: Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large volumes of data. Flume uses a simple and extensible architecture based on streaming data flows. Flume is robust and fault-tolerant thanks to its tunable reliability mechanisms and its many failover and recovery mechanisms. Flume uses a simple, extensible data model that can be used for online analytic applications.
The previous section built a simple operating environment for Flume and provided a netcat-based demonstration. This section continues with a further explanation of the whole Flume flow. First, the basic structure diagram of Flume: the diagram basically illustrates the role of Flume and the basic components in Flume: source, channel, and sink. Source: completes t ...
Based on the three components ThriftSource, MemoryChannel, and HDFSSink, this article analyzes the transactions of Flume data transfer; if you are using other components, the Flume transaction will be handled differently. Under normal circumstances MemoryChannel is fine (that is what our company uses); FileChannel is slow, and although it provides log-level data recovery, in general, as long as the power is not cut, MemoryChannel ...
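A hedged sketch of wiring those three components together; the agent name, port, and HDFS path are illustrative assumptions:

  a1.sources = thriftSrc
  a1.channels = memCh
  a1.sinks = hdfsSink

  a1.sources.thriftSrc.type = thrift
  a1.sources.thriftSrc.bind = 0.0.0.0
  a1.sources.thriftSrc.port = 9090
  a1.sources.thriftSrc.channels = memCh

  a1.channels.memCh.type = memory
  a1.channels.memCh.capacity = 10000

  a1.sinks.hdfsSink.type = hdfs
  a1.sinks.hdfsSink.hdfs.path = hdfs://namenode:9000/flume/events/%Y-%m-%d
  a1.sinks.hdfsSink.hdfs.fileType = DataStream
  a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
  a1.sinks.hdfsSink.channel = memCh

The transaction boundaries discussed in the article sit between source and channel (the put transaction) and between channel and sink (the take transaction).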
• Ability to master HBase enterprise-level development and management
• Ability to master Pig enterprise-level development and management
• Ability to master Hive enterprise-level development and management
• Ability to use Sqoop to freely move data between traditional relational databases and HDFS
• Ability to collect and manage distributed logs using Flume
• Ability to master the entire process of analysis, development, and deployment of a complete Hadoop project