Hangout with Clickhouse

After a period of exploratory testing, our department found Clickhouse to be an excellent data warehouse and migrated most of our online data-analysis workloads to it. This article shows how we feed Nginx logs into Clickhouse through Kafka; other application logs can be fed into Clickhouse by following the same logic.

We initially wrote to Clickhouse with a Python script that cleaned the logs, but developing and maintaining it carried a real cost. We later adopted Hangout as our data-cleaning tool. Hangout is a general-purpose log-processing tool, similar in function to Logstash, that can process various kinds of logs and write the results to destinations such as Kafka, Elasticsearch, and Clickhouse.

Prerequisites

We assume that the Nginx logs have already been pushed to Kafka; if they have not, a minimal way to ship them is sketched below.
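As a minimal sketch, one way to get the access log into Kafka is to tail it into a producer such as kafkacat; the broker address is an assumption and should match your environment:

# Hypothetical example: continuously tail the Nginx access log and
# produce each line to the Kafka topic used later in this article.
tail -F /var/log/nginx/access.log | kafkacat -P -b localhost:9092 -t comos-proxy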

Hangout provides a large number of plugins to support log processing. To complete the configuration below, one additional plugin must be downloaded: hangout-output-clickhouse, a Hangout output plugin developed specifically for Clickhouse (https://github.com/rickyhuo/hangout-output-clickhouse).

Here are the steps to install Hangout and the hangout-output-clickhouse plugin:

mkdir hangout
cd hangout
wget https://github.com/childe/hangout/releases/download/0.3.0/hangout-dist-0.3.0-release-bin.zip
unzip hangout-dist-0.3.0-release-bin.zip
cd modules
wget https://github.com/rickyhuo/hangout-output-clickhouse/releases/download/0.0.2/hangout-output-plugins-clickhouse-0.0.2-jar-with-dependencies.jar
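The distribution is then started with its launcher script and a configuration file. The exact script name, flag, and config filename below are assumptions and vary by release; consult the Hangout README for the precise invocation:

# Assumed launch command (script name, flag, and config filename are
# hypothetical); check the Hangout README for the exact invocation.
cd ..
bin/hangout -f nginx_to_clickhouse.yml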
Configuration Example

Nginx Log Sample

001.cms.msina.sinanode.com`[27/Dec/2017:16:01:03 +0800]`-`"GET /n/front/w636h3606893220.jpg/w720q75apl.webp HTTP/1.1"`"SinaNews/201706071542.1 CFNetwork/758.1.6 Darwin/15.0.0"`[127.0.0.1]`-`"-"`0.021`10640`-`127.0.0.1`l.sinaimg.cn`-

The Hangout configuration consists of three sections: inputs, filters, and outputs.

Input

The configuration shown below reads the data stream from Kafka:

inputs:
    - Kafka:
        codec: plain
        encoding: UTF8 # default UTF8
        topic:
            comos-proxy: 10
        consumer_settings:
            group.id: hangout_bip_cms
            zookeeper.connect: localhost:2181
            auto.commit.interval.ms: "60000"
            socket.receive.buffer.bytes: "1048576"
            fetch.message.max.bytes: "1048576"
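Before starting Hangout, it is worth confirming that log lines are actually arriving on the topic. A quick check with the legacy ZooKeeper-based console consumer (matching the zookeeper.connect setting above; addresses are assumptions) might look like this:

# Consume raw log lines from the topic to verify Hangout's input
# (legacy ZooKeeper-based consumer, matching zookeeper.connect above).
kafka-console-consumer.sh --zookeeper localhost:2181 --topic comos-proxy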
Filters

The filters section defines a series of transformation steps, including regex parsing (Grok), time conversion, and type conversion; a concrete example of the resulting event follows the configuration below.

filters:
    - Grok:
        match:
            - '%{NOTSPACE:_hostname}`\[%{HTTPDATE:timestamp}\]`%{NOTSPACE:upstream}`"%{NOTSPACE:_method}\s%{NOTSPACE:_uri}\s%{NOTSPACE:httpversion}"`%{QS:_ua}`%{NUMBER:_http_code}`\[%{IP:_remote_addr}\]`%{NOTSPACE:unknow1}`%{QS:_reference}`%{NUMBER:_request_time}`%{NUMBER:_data_size}`%{NOTSPACE:unknow3}`%{IP:_http_x_forwarded_for}`%{NOTSPACE:_domain}`%{DATA:unknow4}$'
        remove_fields: ['message']
    - Date:
        src: timestamp
        formats:
            - 'dd/MMM/yyyy:HH:mm:ss Z'
        remove_fields: ['timestamp']
        target: utc_date
    - Convert:
        fields:
            _request_time:
                to: float
    - Add:
        fields:
            date: "${(utc_date)?substring(0, 10)}"
            datetime: "${(utc_date)?substring(0, 10) + ' ' + (utc_date)?substring(11, 19)}"
            hour: "${(utc_date)?substring(11, 13)}"
    - Convert:
        fields:
            hour:
                to: integer
            minute:
                to: integer
            _data_size:
                to: integer
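To make the effect of these filters concrete, here is roughly what the sample log line above becomes after the Grok, Date, Add, and Convert steps have run. This is an illustration, not actual plugin output; unknown and pass-through fields are omitted, and the timestamp has been converted to UTC (16:01:03 +0800 becomes 08:01:03):

{
    "_hostname": "001.cms.msina.sinanode.com",
    "upstream": "-",
    "_method": "GET",
    "_uri": "/n/front/w636h3606893220.jpg/w720q75apl.webp",
    "_ua": "SinaNews/201706071542.1 CFNetwork/758.1.6 Darwin/15.0.0",
    "_remote_addr": "127.0.0.1",
    "_request_time": 0.021,
    "_data_size": 10640,
    "_http_x_forwarded_for": "127.0.0.1",
    "_domain": "l.sinaimg.cn",
    "date": "2017-12-27",
    "datetime": "2017-12-27 08:01:03",
    "hour": 8
}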
Outputs

Finally, we write the processed, structured data to Clickhouse:

outputs:
    - com.sina.bip.hangout.outputs.Clickhouse:
        host: localhost:8123
        database: cms
        table: cms_msg_all
        fields: ['date', 'datetime', 'hour', '_hostname', '_domain', '_data_size', '_uri', '_request_time', '_ua', '_http_code', '_remote_addr', '_method', '_reference', '_url']
        replace_include_fields: ['_uri', '_url']
        bulk_size: 300
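For reference, what the plugin does is roughly equivalent to batching bulk_size rows into INSERT statements against Clickhouse's HTTP interface on port 8123, the host configured above. A hand-rolled single-row version, for illustration only:

# Illustration only: a single-row INSERT over the Clickhouse HTTP
# interface; the plugin batches bulk_size (here 300) events per request.
echo "INSERT INTO cms.cms_msg_all (date, datetime, hour, _domain, _uri) VALUES ('2017-12-27', '2017-12-27 08:01:03', 8, 'l.sinaimg.cn', '/n/front/w636h3606893220.jpg/w720q75apl.webp')" | curl 'http://localhost:8123/' --data-binary @-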
Clickhouse Schema

Of course, Clickhouse can only store this data if the corresponding tables have been created first. The table definitions are as follows; in the legacy MergeTree syntax used here, the engine parameters are the date column, the primary key, and the index granularity:

CREATE TABLE cms.cms_msg
(
    date Date, 
    datetime DateTime, 
    hour Int8, 
    _uri String, 
    _url String, 
    _request_time Float32, 
    _http_code String, 
    _hostname String, 
    _domain String, 
    _http_x_forwarded_for String, 
    _remote_addr String, 
    _reference String, 
    _data_size Int32, 
    _method String, 
    _rs String, 
    _rs_time Float32, 
    _ua String
) ENGINE = MergeTree(date, (hour, date), 8192)


CREATE TABLE cms.cms_msg_all
(
    date Date, 
    datetime DateTime, 
    hour Int8, 
    _uri String, 
    _url String, 
    _request_time Float32, 
    _http_code String, 
    _hostname String, 
    _domain String, 
    _http_x_forwarded_for String, 
    _remote_addr String, 
    _reference String, 
    _data_size Int32, 
    _method String, 
    _ua String
) ENGINE = Distributed(bip_ck_cluster, 'cms', 'cms_msg', rand())
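Once data is flowing, the distributed table can be queried directly. As a sketch, a query for today's top domains by request volume, with average request time, might look like this:

-- Top 10 domains by request count for today, with average request time
SELECT
    _domain,
    count() AS requests,
    avg(_request_time) AS avg_request_time
FROM cms.cms_msg_all
WHERE date = today()
GROUP BY _domain
ORDER BY requests DESC
LIMIT 10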
Conclusion

In this article, we described how to use Hangout to write Nginx logs to Clickhouse. Hangout reads the raw logs from Kafka and transforms them into structured data, which our hangout-output-clickhouse plugin then writes to Clickhouse. There is plenty of room for customization and extension throughout the process; for more functionality, please refer to the Hangout README and the hangout-output-clickhouse README. In addition, on top of the data in Clickhouse, we use Superset and Grafana as our data-visualization and monitoring tools.
