Is Flume a good fit for your problem?
If you need to ingest textual log data into Hadoop/HDFS, then Flume is the right fit for your problem, full stop. For other use cases, here are some guidelines:
Flume is designed to transport and ingest regularly-generated event data over relatively stable, potentially complex topologies. The notion of "event data" is very broadly defined. To Flume, an event is just a generic blob of bytes. There are some limitations on how large an event can be - for instance, it cannot be larger than what you can store in memory or on disk on a single machine - but in practice, Flume events can be everything from textual log entries to image files. The key property of an event is that they are generated in a continuous, streaming fashion. If your data is not regularly generated (i.e. you are trying to do a single bulk load of data into a Hadoop cluster) then Flume will still work, but it is probably overkill for your situation. Flume likes relatively stable topologies. Your topologies do not need to be immutable, because Flume can deal with changes in topology without losing data and can also tolerate periodic reconfiguration due to fail-over or provisioning. It probably won't work well if you plan to change topologies every day, because reconfiguration takes some thought and overhead.
The above is the explanation from the official Flume website, which translates as follows:
Is Flume suitable for your problem?
If you want to ingest textual log data into HDFS, then Flume is a good fit. For other scenarios, there are some things to consider:
Flume is designed to transmit and ingest regularly generated event data over a relatively stable, possibly complex topology. The notion of "event data" is very broad: to Flume, an event is just a blob of bytes. There is a limit on how large an event can be - for example, it cannot be larger than what a single machine can store in memory or on disk - but in practice a Flume event can be anything from a line of log text to an image file. The key property of events is that they are generated continuously, in a streaming fashion. If your data is not generated regularly (for example, importing data into a Hadoop cluster in one go), Flume will still work, but it is a bit of overkill. Flume prefers a relatively stable topology. The topology does not have to be immutable, because Flume can handle changes to the topology without losing data and can tolerate periodic reconfiguration due to failover. But if you change the topology every day, Flume will not work well, because reconfiguration incurs overhead.
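To make the log-to-HDFS case concrete, here is a minimal sketch of a Flume agent configuration that tails a continuously written application log and delivers it to HDFS. It is not from the original article; the agent name, file paths, namenode address, and roll settings are placeholder assumptions.

```properties
# Name the components of this agent (agent name "a1" is arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a continuously written application log (placeholder path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: write events to HDFS, rolling files every 5 minutes (placeholder namenode and path)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

An agent like this would typically be started with flume-ng agent --conf conf --conf-file logs-to-hdfs.conf --name a1. The exec source and memory channel are the simplest options; production setups often use the spooling-directory source and a file channel for stronger delivery guarantees.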
In short, there are two points:
1. Data: the data is generated on a regular, continuous basis.
2. Topology: the network topology is relatively stable.
Both Kafka and Flume can transmit data, but their focus is different.
Kafka pursues high throughput and high load (a topic can have multiple partitions).
Flume pursues diversity of data: diverse data sources and diverse data flows (destinations).
Use Kafka if the data source is uniform and you want high throughput.
Use Flume if you have many kinds of data sources and many places the data needs to flow.
Kafka and Flume can also be used together.
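One common way to combine them (a sketch, assuming Flume 1.7+ which ships a Kafka source; the broker addresses, topic name, and directories below are placeholders) is to let producers write to a high-throughput Kafka topic and have a Flume agent consume that topic and land the events in HDFS:

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source: consume events published to a Kafka topic (placeholder brokers/topic)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sources.r1.kafka.topics = app-logs
a1.sources.r1.kafka.consumer.group.id = flume-hdfs
a1.sources.r1.channels = c1

# File channel for durability between Kafka and HDFS (placeholder directories)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# HDFS sink, as in the earlier sketch
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/kafka/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Here Kafka absorbs the high-load, many-producer side, while Flume contributes the ready-made HDFS sink; the reverse direction (Flume's Kafka sink, or the Kafka channel) is also common.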