Any complete big data platform typically includes the following processes:
Data acquisition
Data storage
Data processing
Data presentation (visualization, reporting and monitoring)
Among these, data acquisition is required by every system, and as big data grows in importance, so do the challenges of data acquisition. These challenges include:
Diverse data sources
Large data volume, fast change
How to guarantee the reliability and performance of data acquisition
How to avoid duplicate data
How to ensure the quality of data
Let's take a look at some of the currently available data collection products, focusing on how they achieve high reliability, high performance, and high scalability.
Apache Flume
Flume is an Apache open source project: a highly reliable, highly scalable, easy-to-manage data acquisition system that supports custom extensions. Flume is built in Java and therefore depends on the Java runtime environment.
Flume was originally designed by Cloudera's engineers to aggregate log data, and later evolved to handle streaming data events.
Flume is designed as a distributed pipeline architecture that can be viewed as a network of agents between the data source and destination, supporting data routing.
Each agent is made up of a source, a channel, and a sink.
Source
The source is responsible for receiving input data and writing it into the pipeline. Flume sources include HTTP, JMS, RPC, NetCat, exec, and spooling directory, where the spooling directory source monitors a directory or file and parses newly generated events.
Channel
The channel is the storage layer of the pipeline, buffering intermediate data between the source and the sink. Channels can be configured with different backends, such as memory, file, or JDBC. A memory channel offers high performance but is not durable, so data may be lost; a file channel is more reliable but slower than memory.
Sink
The sink is responsible for reading data from the pipeline and sending it to the next agent or to the final destination. Destinations supported by sinks include HDFS, HBase, Solr, Elasticsearch, file, logger, and other Flume agents.
Flume uses a transaction mechanism on both the source and sink sides to ensure that no data is lost during transfer.
Data from a source can be replicated to multiple channels, and each channel can be connected to any number of sinks. By connecting agents with different configurations, a complex data collection and routing network can be formed.
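As a rough sketch (the names agent1, src1, ch1, and sink1, along with the paths, are made up for illustration), a single-agent pipeline that reads a spooling directory, buffers through a file channel, and writes to HDFS could be configured roughly like this:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: watch a directory for newly completed log files
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
agent1.sources.src1.channels = ch1

# Channel: durable, file-backed buffer between source and sink
agent1.channels.ch1.type = file

# Sink: deliver events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1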
Within an agent, Flume also supports configuring sink failover and load balancing, so the system as a whole can continue to collect data normally even if one downstream agent fails.
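As a sketch of the failover case (names again illustrative), two sinks pointing at different downstream agents can be grouped so that the higher-priority sink is used until it fails:

# Hypothetical failover group: k1 is preferred, k2 takes over if k1 fails
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5

Replacing the processor type with load_balance distributes events across the sinks instead.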
The unit of data transferred in Flume is called an event. An event consists of headers (containing metadata) and a payload.
Flume provides an SDK to support user-specific development:
A Flume client is responsible for sending events to a Flume agent from the place where the events originate. The client typically runs in the same process space as the application that produces the data. Common Flume clients include Avro, log4j, syslog, and HTTP POST. In addition, ExecSource supports taking the output of a local process as input to Flume. If none of these clients meets the requirements, the user can build a custom client that talks to an existing Flume source, or implement a new source type.
Users can also use the Flume SDK to implement custom sources and sinks; custom channels do not appear to be supported.
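As a minimal, hedged sketch of the client side (the host name, port, header values, and event body are placeholders, and this assumes an Avro source listening on the target agent), sending a single event through the Flume RPC client looks roughly like this in Java:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class SimpleFlumeClient {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on this host/port (illustrative values)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // An event is headers (metadata) plus a payload
            Map<String, String> headers = new HashMap<>();
            headers.put("host", "web01");
            Event event = EventBuilder.withBody("hello flume".getBytes(StandardCharsets.UTF_8), headers);
            client.append(event); // throws EventDeliveryException if delivery fails
        } finally {
            client.close();
        }
    }
}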
Fluentd
Fluentd (see its GitHub repository) is another open source data collection framework. Fluentd is developed in C and Ruby and uses JSON to unify log data. Its pluggable architecture supports a wide variety of data source types and formats as well as data outputs, and it also provides high reliability and good extensibility. Treasure Data, Inc. provides support and maintenance for the product.
Fluentd's deployment is very similar to Flume's, and its architectural design is much the same: Fluentd's input/buffer/output corresponds closely to Flume's source/channel/sink.
Input
Input is responsible for receiving data or actively fetching it, with support for syslog, HTTP, file tail, and so on.
Buffer
Buffer is responsible for the performance and reliability of data collection; different buffer types, such as file or memory, can be configured.
Output
Output is responsible for sending data to destinations such as files, AWS S3, or other Fluentd instances.
The configuration of Fluentd is very convenient.
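As a rough example (the paths, tag, and buffer locations are made up for illustration), a configuration that tails an Apache access log and writes it to local files, buffering on disk in between, could look something like this:

<source>
  @type tail
  path /var/log/apache2/access.log
  pos_file /var/log/td-agent/apache_access.pos
  tag apache.access
  format apache2
</source>

<match apache.access>
  @type file
  path /var/log/fluent/apache_access
  buffer_type file
  buffer_path /var/log/fluent/buffer/apache_access
</match>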
Fluentd's technology stack is as follows:
Fluentd and its plugins are all developed in Ruby; MessagePack provides JSON serialization and an asynchronous, parallel RPC communication mechanism.
Cool.io is a Libev-based event-driven framework.
Fluentd is highly extensible, and users can write custom input/buffer/output plugins in Ruby.
Fluentd is very much like Flume overall. The difference is that, being developed in Ruby, its footprint is smaller, but this also brings cross-platform issues: it does not support the Windows platform. Its use of JSON as a unified data/log format is another distinctive feature. Compared with Flume, its configuration is also relatively simple.
Logstash
Logstash is the L in the famous open source data stack ELK (Elasticsearch, Logstash, Kibana).
Logstash is developed in JRuby, so its runtime depends on the JVM.
Logstash can be deployed in a variety of architectures, and any particular layout is just one deployment option.
A typical Logstash configuration is as follows, including input, filter, and output sections.
input {
  file {
    type => "apache-access"
    path => "/var/log/apache2/other_vhosts_access.log"
  }
  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  stdout { }
  redis {
    host => "192.168.1.200"
    data_type => "list"
    key => "logstash"
  }
}
In most cases ELK is used together as a stack, so if your data system already uses Elasticsearch, Logstash is the preferred choice.
Chukwa
Apache Chukwa (GitHub) is another open source data collection platform from Apache, though a far less famous one. Chukwa is built on Hadoop's HDFS and MapReduce (and, naturally, is implemented in Java), which provide its scalability and reliability. Chukwa also provides presentation, analysis, and monitoring of the data. Oddly, its last GitHub update was seven years ago, which makes it clear that the project is inactive.
In Chukwa's deployment architecture, the main components are the agent, collector, data sink, archive builder, demux, and so on, which looks rather complicated.
Since the project is not active, we will not examine it further.
Scribe
Scribe is a data (log) collection system developed by Facebook. It has not been maintained for many years, so likewise there is not much to say about it.
Splunk Forwarder
All of the systems above are open source. Among commercial big data platform products, Splunk provides complete data collection, data storage, data analysis and processing, and data presentation capabilities.
Splunk is a distributed machine data platform with three main roles:
The Search Head is responsible for searching and processing data, performing information extraction at search time.
The Indexer is responsible for storing and indexing data.
The Forwarder is responsible for collecting, cleaning, and transforming data and sending it to the Indexer.
Splunk has built-in support for syslog, TCP/UDP, and file monitoring (spooling), and users can ingest specific data by developing scripted inputs and modular inputs. The app repository provided by Splunk offers many mature data acquisition apps, such as AWS and Database (DB Connect), which make it easy to bring data from the cloud or from databases into the Splunk platform.
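As a sketch of what a Forwarder-side inputs.conf might contain (the paths, port, index, and sourcetype values are illustrative), monitoring a log file and listening for syslog over TCP could look like this:

# Monitor a local log file (spooling-style file input)
[monitor:///var/log/apache2/access.log]
sourcetype = access_combined
index = web

# Receive syslog traffic on a TCP port
[tcp://5140]
sourcetype = syslog
index = network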
It is important to note that both the Search Head and the Indexer support clustered configurations and are therefore highly available and highly scalable, but Splunk currently has no clustering for the Forwarder. This means that if a Forwarder machine fails, data collection is interrupted and the running data acquisition task cannot fail over to another Forwarder.
Summary
We have briefly discussed several popular data collection platforms, most of which provide highly reliable and highly scalable data collection. Most of them abstract the pipeline into input, output, and an intermediate buffer, and connect distributed nodes over the network to achieve a degree of scalability and high reliability. Among them, Flume and Fluentd are the two most widely used products. If you use Elasticsearch, Logstash may be preferred because the ELK stack provides good integration. Chukwa and Scribe are not recommended because of the inactivity of those projects.
Splunk is a good commercial product, but its data collection still has some limitations; I believe Splunk will soon develop better data collection solutions.