Build a Web Log Collection System with Flume + Solr + log4j



Preface

Many web applications use ELK as their log collection system. Flume is used here because we are already familiar with the Hadoop ecosystem and Flume has a number of advantages in that environment.

For details about the Apache Hadoop ecosystem, see the linked overview.

 

The official Cloudera tutorial is based on this kind of example: get-started-with-hadoop-tutorial.

This post assumes you already know Flume (agent, source, channel, sink), Morphline (ETL), and Solr (full-text search). If not, look them up first.

 

Scenario (requirement)

We have multiple web applications, each of which generates logs continuously every day. These logs are currently stored as files on the servers; we need to collect them and make them searchable.

Therefore, the overall pipeline is: a Flume agent collects the logs -> Morphline filters and transforms them -> the results are indexed and searched in Solr.

 

Flume collects logs

1. Use Spooling Directory Source
This source monitors a specified directory for new files and reads events from any file that appears there. Once a file has been moved into the directory it must not be written to again, and file names must not repeat. In practice this means files have to be moved into the directory periodically, so logs are not read in real time.
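A minimal sketch of such a source (the agent name, source name, and spool directory are placeholder assumptions; the channel binding is omitted):

a1.sources = spoolSrc
a1.sources.spoolSrc.type = spooldir
# Directory that finished log files are moved into; files must not change after arriving here
a1.sources.spoolSrc.spoolDir = /var/log/webapp/spool
# Record the original file name in an event header
a1.sources.spoolSrc.fileHeader = true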

2. Use Exec Source
The output of a command line is used as the source, as in the configuration below. Data may be lost if the agent dies or the machine is restarted.

agent.sources.execSrc.type = exec
agent.sources.execSrc.shell = /bin/bash -c
agent.sources.execSrc.command = tail -F /var/log/flume/flume.log | grep "error:"
 

 

3. Use message middleware (JMS or Kafka)

Please refer to: Log collection architecture solution based on Flume + Log4j + Kafka.
The client sends log events directly to the Kafka queue using the log4j KafkaAppender.
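As an illustration only (this post does not take the Kafka route), a log4j2 KafkaAppender can be declared roughly as follows; the topic name web-logs and the broker address are placeholder assumptions:

<Kafka name="kafkaAppender" topic="web-logs">
  <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
  <Property name="bootstrap.servers">kafka01.example.com:9092</Property>
</Kafka>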

 

4. Use the Flume Appender
For Java web applications, we simply adopt this approach directly and use the log4j2 Flume Appender. For notes on the logging framework itself, see the separate blog post on using log4j with Spring Boot. Regarding the configuration of the Flume Appender:


The Flume Appender supports three modes of operation.

1. It can act as a remote Flume client which sends Flume events via Avro to a Flume Agent configured with an Avro Source. (Synchronous, using the Avro protocol.)

2. It can act as an embedded Flume Agent where Flume events pass directly into Flume for processing. (Asynchronous; a Flume agent embedded in the client must be maintained.)

3. It can persist events to a local BerkeleyDB data store and then asynchronously send the events to Flume, similar to the embedded Flume Agent but without most of the Flume dependencies. (Events are written to a local data store first, then sent asynchronously.)

 

Usage as an embedded agent will cause the messages to be directly passed to the Flume Channel and then control will be immediately returned to the application. All interaction with remote agents will occur asynchronously. Setting the "type" attribute to "Embedded" will force the use of the embedded agent. In addition, configuring agent properties in the appender configuration will also cause the embedded agent to be used.
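For reference, a sketch of the embedded variant, following the same shape as the Avro client configuration used below but with the type attribute set to Embedded (hosts and ports are the same placeholders):

<Flume name="eventLogger" compress="true" type="Embedded">
  <Agent host="192.168.10.101" port="8800"/>
  <Agent host="192.168.10.102" port="8800"/>
  <RFC5424Layout enterpriseNumber="18060" includeMDC="true" appName="MyApp"/>
</Flume>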

 

Let's simply use the first method below.

Client configuration

log4j2.xml

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn" name="MyApp" packages="">
  <Appenders>
    <Flume name="eventLogger" compress="true">
      <Agent host="192.168.10.101" port="8800"/>
      <Agent host="192.168.10.102" port="8800"/>
      <RFC5424Layout enterpriseNumber="18060" includeMDC="true" appName="MyApp"/>
    </Flume>
  </Appenders>
  <Loggers>
    <Root level="error">
      <AppenderRef ref="eventLogger"/>
    </Root>
  </Loggers>
</Configuration>
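Assuming a Maven build, the Flume Appender ships in the log4j-flume-ng artifact, which needs to be on the application's classpath alongside log4j-core (the version below is a placeholder):

<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-flume-ng</artifactId>
  <version><!-- match your log4j2 version --></version>
</dependency>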
 

Server configuration

Reference: flume log4j appender config

Download flume, configure example.conf in the conf directory:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source.
# The log4j2 Flume Appender sends Avro events, so use an Avro source;
# the port must match the <Agent> port configured in log4j2.xml.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8800

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
 

Start flume

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
 

Check the agent's console log to verify that events arrive successfully.
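As a quick smoke test (a hypothetical class, not part of the original setup), an application with the log4j2 configuration above on its classpath can emit an error-level event, which the logger sink should then print on the agent side:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class FlumeAppenderSmokeTest {

    private static final Logger LOGGER = LogManager.getLogger(FlumeAppenderSmokeTest.class);

    public static void main(String[] args) {
        // The root logger is set to "error", so only error-level events are shipped.
        LOGGER.error("flume appender smoke test");
    }
}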

Solr configuration

About Solr

Here, Solr data also needs to be stored in HDFS; in addition, Solr is coordinated by ZooKeeper.

The configuration below was applied automatically because the installation used Cloudera Manager, but it still needs to be verified. For a manual installation there is corresponding documentation that can be followed directly. Solr authentication is omitted here.

Configure the ZooKeeper service

$ cat /etc/solr/conf/solr-env.sh
export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr
 

Configure Solr to use HDFS

$ cat /etc/default/solr
# nn01.example.com:8020 is the address of the HDFS NameNode
SOLR_HDFS_HOME=hdfs://nn01.example.com:8020/solr

# Create the /solr directory in HDFS and make the solr user its owner:
$ sudo -u hdfs hdfs dfs -mkdir /solr
$ sudo -u hdfs hdfs dfs -chown solr /solr
 

Initialize the ZooKeeper namespace

$ solrctl init
 

Start solr

$ sudo service solr-server restart
 

Solr collection configuration

Solr organizes its logical data into collections, so you need to create a collection. Each collection has its own configuration. The documentation already explains this clearly, and there is not much to it.

Generating Collection Configuration

The following collection is used to store the logs collected above:

# Use the default template to generate an instancedir
$ solrctl instancedir --generate $HOME/weblogs_config

# Upload the instancedir to ZooKeeper (upload the configuration)
$ solrctl instancedir --create weblogs_config $HOME/weblogs_config

# Verify the instancedir
$ solrctl instancedir --list

# Create the collection (-s shard_count); the collection is associated with the config
$ solrctl collection --create weblogs_collection -s 2 -c weblogs_config
 

A SolrCloud collection is the top-level object for indexing documents and providing a query interface. Each collection must be associated with an instance directory. Different collections can use the same instance directory. Each collection is typically replicated among several SolrCloud instances. Each replica is called a core and is assigned to an individual Solr service. The assignment process is managed automatically, although you can apply fine-grained control over each individual core using the solrctl core command. This is an introduction to the relationship between collections and instance directories.

For how to modify and scale the collection after it has been created, refer to the solrctl usage documentation.

Morphline (ETL)

After creating the collection, we need to parse the logs and store them in Solr so they can be retrieved. Morphline is the ETL tool (extracting, transforming and loading data) for this intermediate step. Flume provides the MorphlineSolrSink, which reads events from the Flume source, runs them through the morphline ETL, and loads them into Solr.


 

Configure flume

Continuing with the Flume agent above, update the example.conf configuration:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source (the port must match the <Agent> port in log4j2.xml)
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8800

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = morphlines.conf
a1.sinks.k1.morphlineId = morphline_log4j2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
 

Configure Morphline

Our log lines have the following format:

[INFO] 2017-07-14 11:40:51.556 [main] RequestMappingHandlerAdapter - Detected ResponseBodyAdvice bean in apiResponseAdvice
 

Needs to be parsed into:

level: INFO
create_time: 2017-07-14 11:40:51.556
thread: main
class: RequestMappingHandlerAdapter
- (the literal hyphen separator, which is not captured as a field)
message: Detected ResponseBodyAdvice bean in apiResponseAdvice
 

So we use grok (there are online tools for testing grok expressions).

      # grok: extract structured fields from the unstructured log line
      {
        grok {
          dictionaryFiles : [grok-dictionary.conf]
          expressions : {
            message : """\[%{LOGLEVEL:level}\] %{SC_LOGDATETIME:create_time} \[%{DATA:thread}\] %{WORD:class} [-] %{GREEDYDATA:message}"""
          }
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # convertTimestamp: convert the timestamp field to the native Solr timestamp format,
      # e.g. 2017-07-14 11:40:52.512 to 2012-09-06T07:14:34.000Z
      {
        convertTimestamp {
          field : create_time
          inputFormats : ["yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }
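For context, here is a sketch of how these commands could sit inside a complete morphlines.conf (the collection name and ZooKeeper ensemble are assumptions matching the earlier configuration; adjust them to your environment):

SOLR_LOCATOR : {
  # Name of the Solr collection created above
  collection : weblogs_collection
  # ZooKeeper ensemble that manages Solr
  zkHost : "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr"
}

morphlines : [
  {
    # Must match a1.sinks.k1.morphlineId in the Flume configuration
    id : morphline_log4j2
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      # Read the Flume event body as a log line
      { readLine { charset : UTF-8 } }

      # ... the grok and convertTimestamp commands shown above go here ...

      # Drop fields that are not defined in the Solr schema
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Load the record into Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]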
 

Configure schema.xml

In the previous section, when configuring Solr, we generated a default template. Now we need to modify schema.xml according to our actual needs, under $HOME/weblogs_config/conf.

The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

schema.xml
solrconfig.xml

   <field name = "level" type = "text_general" indexed = "true" stored = "true" multiValued = "true" />
   <field name = "create_time" type = "date" indexed = "true" stored = "true" />
   <field name = "thread" type = "text_general" indexed = "true" stored = "true" />
   <field name = "class" type = "text_general" indexed = "true" stored = "true" />
   <field name = "message" type = "text_general" indexed = "true" stored = "true" />
 

Re-upload the configuration to zookeeper

$ solrctl instancedir --update weblogs_config $HOME/weblogs_config
$ solrctl collection --reload weblogs_collection
 

Summary

At this point we have completed the collection, parsing, and indexing of the logs. You can search and query them through Hue, or build your own UI. This tutorial covers only the basics, but it meets the core requirements and walks through the whole log processing pipeline; the rest can be customized as needed.
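As a quick check (a hypothetical query; the host name is an assumption and Solr listens on port 8983 by default), the collection can also be queried directly through the standard Solr select API:

$ curl 'http://solr01.example.com:8983/solr/weblogs_collection/select?q=level:ERROR&sort=create_time+desc&rows=10'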

Article source: https://my.oschina.net/tigerlene/blog/1475239

 
