Building a Web Log Collection System with Flume + Solr + log4j
Preface
Many web applications use ELK as their log collection system. Flume is used here because we are already familiar with the Hadoop ecosystem and Flume fits well into it.
For details about the Apache Hadoop ecosystem, see the official documentation.
The official Cloudera tutorial, get-started-with-hadoop-tutorial, walks through a similar example.
This article assumes basic familiarity with Flume (agents, sources, channels, sinks), Morphline (ETL), and Solr (full-text search). If any of these are unfamiliar, look them up first.
Scenario (requirement)
First, we have multiple web applications, each of which continuously generates logs every day. These logs are currently stored as files on the servers; we need to collect them centrally and make them searchable.
The overall pipeline is therefore: collect the logs with a Flume agent -> transform them with Morphline -> index the results in Solr, where they can be searched.
Flume collects logs
1. Use Spooling Directory Source
The Spooling Directory Source watches a specified directory for new files and turns each new file into a stream of events. Files must not be written to after they are moved into the directory, and file names in the directory must not repeat. With this approach you have to move log files into the spool directory periodically, so the logs are not read in real time.
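For reference, a minimal Spooling Directory Source configuration might look like the sketch below (the spool directory path is only an example):
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log/myapp/spool
agent.sources.spoolSrc.fileHeader = true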
2. Use Exec Source
The output of the command line below is used as the source. Data may be lost if the agent dies or the machine is restarted.
agent.sources.execSrc.type = exec
agent.sources.execSrc.shell = /bin/bash -c
agent.sources.execSrc.command = tail -F /var/log/flume/flume.log | grep "error:"
3. Use message middleware (JMS or Kafka)
Please refer to: Log collection architecture solution based on Flume + Log4j + Kafka
The client sends log events directly to a Kafka queue using the log4j KafkaAppender.
4. Use the Flume Appender
For a Java web application, we simply adopt this approach and use the log4j2 Flume Appender directly. For background on the logging framework itself, see another blog post on using log4j with Spring Boot. Regarding the configuration of the Flume Appender:
The Flume Appender supports three modes of operation.
1. It can act as a remote Flume client which sends Flume events via Avro to a Flume Agent configured with an Avro Source. (synchronous, Avro protocol)
2. It can act as an embedded Flume Agent where Flume events pass directly into Flume for processing. (asynchronous; the application maintains an embedded Flume agent)
3. It can persist events to a local BerkeleyDB data store and then asynchronously send the events to Flume, similar to the embedded Flume Agent but without most of the Flume dependencies. (writes to a local database first, then sends asynchronously)
Usage as an embedded agent will cause the messages to be directly passed to the Flume Channel and then control will be immediately returned to the application. All interaction with remote agents will occur asynchronously. Setting the "type" attribute to "Embedded" will force the use of the embedded agent. In addition, configuring agent properties in the appender configuration will also cause the embedded agent to be used.
Below, we simply use the first mode.
Client configuration
log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn" name="MyApp" packages="">
  <Appenders>
    <Flume name="eventLogger" compress="true">
      <Agent host="192.168.10.101" port="8800"/>
      <Agent host="192.168.10.102" port="8800"/>
      <RFC5424Layout enterpriseNumber="18060" includeMDC="true" appName="MyApp"/>
    </Flume>
  </Appenders>
  <Loggers>
    <Root level="error">
      <AppenderRef ref="eventLogger"/>
    </Root>
  </Loggers>
</Configuration>
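The Flume Appender ships in a separate log4j2 module, so the application also needs that jar on its classpath. Assuming a Maven build, a dependency along these lines is required (the version shown is only an example; align it with the log4j2 version you already use):
<!-- log4j2 Flume Appender module; version is an example, match your log4j2 version -->
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-flume-ng</artifactId>
  <version>2.17.1</version>
</dependency>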
Server configuration
Reference: flume log4j appender config
Download Flume and configure example.conf in its conf directory:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# The log4j2 Flume Appender sends Avro events, so the agent needs an Avro source;
# the port must match the port configured on the <Agent> elements in log4j2.xml
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8800
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Check the console output to verify that the agent started successfully.
Solr configuration
About Solr
Here, Solr's index data is also stored in HDFS, and Solr is coordinated by ZooKeeper.
The configuration below was generated automatically because this cluster was installed with Cloudera Manager, but it should still be verified. For a manual installation, the corresponding documentation covers the same steps. Solr authentication is omitted here.
Configure the ZooKeeper service
$ cat /etc/solr/conf/solr-env.sh
export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr
Configure Solr to use HDFS
$ cat /etc/default/solr
# nn01.example.com:8020 is the address of the HDFS NameNode
SOLR_HDFS_HOME=hdfs://nn01.example.com:8020/solr
# Create the /solr directory in HDFS and make the solr user its owner:
$ sudo -u hdfs hdfs dfs -mkdir /solr
$ sudo -u hdfs hdfs dfs -chown solr /solr
Initialize the ZooKeeper namespace
$ solrctl init
Start Solr
$ sudo service solr-server restart
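To check that Solr is actually up before creating collections, you can hit the core admin API on any Solr node; the host and port below are the CDH defaults and are assumptions:
$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"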
Solr collection configuration
Solr organizes its data logically into collections, so we need to create one. Each collection has its own configuration; the official documentation explains this clearly, and there are only a few steps.
Generating Collection Configuration
The following collection is used to store the logs collected above:
# Generate an instancedir from the default template
$ solrctl instancedir --generate $HOME/weblogs_config
# Upload the instancedir (the configuration) to ZooKeeper
$ solrctl instancedir --create weblogs_config $HOME/weblogs_config
# Verify the instancedir
$ solrctl instancedir --list
# Create the collection (-s is the shard count, -c associates it with the config)
$ solrctl collection --create weblogs_collection -s 2 -c weblogs_config
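If everything worked, the new collection shows up in the list:
$ solrctl collection --list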
The following is an overview of the relationship between collections and instance directories: A SolrCloud collection is the top-level object for indexing documents and providing a query interface. Each collection must be associated with an instance directory. Different collections can use the same instance directory. Each collection is typically replicated among several SolrCloud instances. Each replica is called a core and is assigned to an individual Solr service. The assignment process is managed automatically, although you can apply fine-grained control over each individual core using the solrctl core command.
For how to modify or extend the collection after it has been created, refer to the solrctl usage documentation.
Morphline (ETL)
After creating the collection, we need to parse the logs and store the results in Solr so they can be retrieved. Morphline is the ETL tool (extract, transform, load) for this intermediate step. Flume provides the MorphlineSolrSink, which reads events from the Flume source, runs them through the Morphline, and loads the results into Solr.
Configure flume
Continuing from the Flume setup above, update example.conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
# keep the same port that the log4j2 Flume Appender sends to
a1.sources.r1.port = 8800
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = morphlines.conf
a1.sinks.k1.morphlineId = morphline_log4j2
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Configure Morphline
Our log lines have the following format:
[INFO] 2017-07-14 11:40:51.556 [main] RequestMappingHandlerAdapter - Detected ResponseBodyAdvice bean in apiResponseAdvice
Needs to be parsed into:
level: INFO
create_time: 2017-07-14 11:40:51.556
thread: main
class: RequestMappingHandlerAdapter
(the "-" in the log line is just a literal separator between the class and the message)
message: Detected ResponseBodyAdvice bean in apiResponseAdvice
So we use grok to parse the line (an online grok debugger can help when building the expression):
# grok: extract structured fields from the unstructured log line
{
  grok {
    dictionaryFiles : [grok-dictionary.conf]
    expressions : {
      message : """\[%{LOGLEVEL:level}\] %{SC_LOGDATETIME:create_time} \[%{DATA:thread}\] %{WORD:class} [-] %{GREEDYDATA:message}"""
    }
  }
}

# Consume the output record of the previous command and pipe another
# record downstream.
#
# convert the create_time field to the native Solr timestamp format,
# e.g. 2017-07-14 11:40:52.512 to 2017-07-14T18:40:52.512Z
{
  convertTimestamp {
    field : create_time
    inputFormats : ["yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd"]
    inputTimezone : America/Los_Angeles
    outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    outputTimezone : UTC
  }
}
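These two commands are only fragments. The complete morphlines.conf loaded by the MorphlineSolrSink also needs the enclosing morphlines wrapper, a Solr locator, and a loadSolr command at the end, otherwise nothing is written to Solr. The skeleton below is a sketch of that structure: the collection name, ZooKeeper ensemble, and morphline id are taken from the earlier sections, while readLine and sanitizeUnknownSolrFields are commonly used commands added here as assumptions.
# Solr location; values reuse the collection and ZooKeeper ensemble configured earlier
SOLR_LOCATOR : {
  collection : weblogs_collection
  zkHost : "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr"
}

morphlines : [
  {
    # must match a1.sinks.k1.morphlineId in example.conf
    id : morphline_log4j2
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # read each Flume event body as a line of text
      { readLine { charset : UTF-8 } }

      # ... the grok and convertTimestamp commands shown above go here ...

      # drop fields not defined in schema.xml, then index the record into Solr
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]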
Configure schema.xml
In the previous section, when configuring Solr, we generated a default configuration template. Now we modify schema.xml, under $HOME/weblogs_config/conf, according to our actual needs.
The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
Add the following fields to schema.xml (see the Solr documentation on schema.xml and solrconfig.xml for details):
<field name="level" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="create_time" type="date" indexed="true" stored="true"/>
<field name="thread" type="text_general" indexed="true" stored="true"/>
<field name="class" type="text_general" indexed="true" stored="true"/>
<field name="message" type="text_general" indexed="true" stored="true"/>
Re-upload the configuration to ZooKeeper and reload the collection:
$ solrctl instancedir --update weblogs_config $HOME/weblogs_config
$ solrctl collection --reload weblogs_collection
Summary
So far, we have completed the collection, parsing, and indexing of the logs. You can search them through Hue, or build your own UI on top of Solr. This tutorial is fairly basic, but it meets the basic requirements and walks through the whole log processing flow; the rest you can customize as needed.
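Besides Hue, a quick way to check the indexed logs is to query the collection directly through Solr's HTTP API; the host name below is an assumption, any Solr node in the cluster will do:
$ curl "http://solr01.example.com:8983/solr/weblogs_collection/select?q=level:ERROR&sort=create_time+desc&rows=10&wt=json"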
Article source: https://my.oschina.net/tigerlene/blog/1475239