DockOne Technology Share (12): How does Sina analyze and process 3.2 billion real-time logs?

"Editor's words" I started in early 2014 after Sina began to contact real-time log analysis related technology, mainly elk (Elasticsearch, Logstash, Kibana), was learning +elk optimization, followed some logs, a little bit. Since 2015, we have formally provided real-time log analytics as a service to other departments of the company. What we want to share with you today is on the road to service, our ideas, solutions and questions.

Service Introduction

As real-time analysis technology has matured and its cost has fallen, users are no longer satisfied with offline analysis alone. Our service currently supports log search and analysis for Weibo, Weipan (micro disk), cloud storage, the elastic computing platform, and more than ten other departments and products, processing about 3.2 billion logs (2 TB) per day.

Technology architecture

A brief introduction to the technical architecture of the service:

This is a fairly common architecture (a minimal sketch of the data flow follows the component list):
(1) Kafka: the message queue that receives user logs.
(2) Logstash: parses logs, normalizes them into JSON, and outputs them to Elasticsearch.
(3) Elasticsearch: the core of the real-time log analysis service; a schemaless, near-real-time data store that organizes data into indices and provides powerful search and statistics.
(4) Kibana: a data visualization component built on Elasticsearch; its strong visualization capability is an important reason many companies choose the ELK stack.
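To make the flow concrete, here is a minimal, illustrative Python sketch of what the pipeline does end to end, assuming the kafka-python and elasticsearch client libraries; in the real service the parsing and indexing step is performed by Logstash, and all topic, host, and index names below are placeholders.

```python
# Minimal sketch of the Kafka -> parse -> Elasticsearch flow described above.
# Illustrative only; in production Logstash filters do the parsing.
import json

from kafka import KafkaConsumer          # kafka-python client
from elasticsearch import Elasticsearch  # official Python client

consumer = KafkaConsumer('user-log-topic', bootstrap_servers=['kafka01:9092'])
es = Elasticsearch(['http://es01:9200'])

for message in consumer:
    doc = json.loads(message.value)      # assume the log line is already JSON
    es.index(index='logstash-2015.07.14', doc_type='logs', body=doc)
```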

Strive to provide better service

The focus of this share is not the pros and cons of this architecture or why we chose it, but how to deliver the value of real-time log analysis better within it. Serving users well is not just a matter of editing a few configuration files and tuning a few runtime parameters. To provide a better service, we have worked in three directions:

First, improve the quality of service

We first optimized Elasticsearch. At the hardware level we had no say in which machines we got, so all we could do was enable hyper-threading. At the software level we chose the Java runtime version, set es_heap_size, increased the bulk index queue size, and installed a default index template; the template changes the default number of shards and replicas, maps strings as not_analyzed, and enables doc_values to cope with Elasticsearch process OOMs. See the "Elasticsearch Optimization Checklist" for the detailed optimization items.
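As an illustration, a default index template of the kind described above could be installed through the Python client as follows; the template name, index pattern, and shard/replica counts are placeholders, not our production values.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

# A default template that maps dynamic string fields as not_analyzed with
# doc_values enabled, and overrides the default shard/replica counts.
template = {
    "template": "logstash-*",
    "settings": {
        "number_of_shards": 10,        # placeholder, not the production value
        "number_of_replicas": 1
    },
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True
                    }
                }
            }]
        }
    }
}
es.indices.put_template(name='logstash-default', body=template)
```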

As user data keeps growing, index management becomes a big problem. We need to periodically create, optimize, close, delete, and snapshot different indices according to each user's configuration. Manually maintaining crontab entries on a single server does not scale, and cron is a single point of failure. So we built a standalone Elasticsearch index management system responsible for scheduling and executing these tasks. The technology behind it is Celery, a Python task queue and execution system that offers crontab-like scheduling syntax with a more distributed, more available architecture.
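A minimal sketch of what such a periodic task could look like with a recent Celery release; the broker URL, task name, index name, and schedule are assumptions, not our actual configuration.

```python
from celery import Celery
from celery.schedules import crontab
from elasticsearch import Elasticsearch

app = Celery('index_manager', broker='redis://localhost:6379/0')  # broker is an assumption
es = Elasticsearch(['http://es01:9200'])                           # placeholder host

@app.task
def close_old_index(index_name):
    """Close an index the user no longer queries, to free heap."""
    es.indices.close(index=index_name)

# crontab-like schedule: run every day at 02:00.
app.conf.beat_schedule = {
    'close-old-weibo-index': {
        'task': 'index_manager.close_old_index',
        'schedule': crontab(hour=2, minute=0),
        'args': ('logstash-weibo-2015.07.07',),   # illustrative index name
    },
}
```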

With a recent service upgrade we also installed the HDFS snapshot plugin for Elasticsearch, so indices can be backed up to HDFS periodically. At the moment this is used mainly to back up the Kibana configuration index, so we can restore it when a user breaks the visualization interface through mis-operation or mis-configuration.
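For illustration, registering an HDFS repository and snapshotting the Kibana configuration index through the Python client could look like the following; the repository name, HDFS URI, and path are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

# Register a snapshot repository backed by the HDFS plugin (URI and path are placeholders).
es.snapshot.create_repository(repository='hdfs_backup', body={
    "type": "hdfs",
    "settings": {"uri": "hdfs://namenode:8020", "path": "/backup/elasticsearch"}
})

# Snapshot the Kibana 3 configuration index (kibana-int is the Kibana 3 default name).
es.snapshot.create(repository='hdfs_backup', snapshot='kibana-int-20150714',
                   body={"indices": "kibana-int"})
```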

Monitoring and alerting. For system-level monitoring and alerts (disk full or damaged, server down) we directly use Sinawatch, which has served Sina for many years. For application-level metrics (Elasticsearch JVM heap usage, whether Kibana is reachable, Kafka topic consumer offset lag) we wrote our own monitoring and alerting scripts. For user-level metrics (such as the number of log parsing failures) we mainly run queries through the Elasticsearch Python client. The most common alert: when logstash-filter-grok or logstash-filter-json fails to parse a log line, it adds _grokparsefailure or _jsonparsefailure to the output JSON, and we run a query to count these parsing errors.
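A sketch of that user-level check, counting parse failures with the Elasticsearch Python client; the index name, threshold, and alert action are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host
THRESHOLD = 1000                          # illustrative alert threshold

# Logstash adds _grokparsefailure / _jsonparsefailure to the "tags" field
# when a filter cannot parse a line; count such documents in today's index.
body = {
    "query": {
        "filtered": {
            "filter": {
                "terms": {"tags": ["_grokparsefailure", "_jsonparsefailure"]}
            }
        }
    }
}
result = es.count(index='logstash-2015.07.14', body=body)
if result['count'] > THRESHOLD:
    print('ALERT: %d log lines failed to parse' % result['count'])
```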

To be clear, Marvel is a good monitoring tool and plugin for Elasticsearch, but it is commercial software, so we have not adopted it. Marvel is based on Kibana and is valuable for showing important indicators such as the index bulk reject count.

Second, enhance ease of use

Enhancing the ease of use of the service means giving users a better experience and reducing complaints. ELK performance optimization is one part of it, but far from enough; in practice most complaints came from other areas, as follows:

1. Users first complained that IPs resolved to inaccurate region and ISP information, which made the results useless as a reference.

For example, for the CDN service, if we resolve user IPs inaccurately, edge nodes are located incorrectly and problems cannot be diagnosed, so the data does not help at all. Cause: the IP library that ships with Logstash is the free version from MaxMind, a foreign company, and its data for China is particularly inaccurate. Solution: use Sina's own, fuller and more up-to-date IP library to generate a binary IP database in the MaxMind DB format (mmdb) compatible with the MaxMind GeoIP2 API, then develop logstash-filter-geoip2 to resolve IPs with it. In our tests the accuracy matches the in-house IP library, and resolution speed also improved.
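The lookup itself is straightforward; here is a sketch using the MaxMind GeoIP2 Python API against a City-type mmdb file, where the file name and IP address are placeholders.

```python
import geoip2.database

# Open a MaxMind-DB-format file; here, a City database built from the
# in-house IP library (file name is a placeholder).
reader = geoip2.database.Reader('sina_ip_library.mmdb')
response = reader.city('202.108.33.60')   # illustrative IP address
print(response.country.name,
      response.subdivisions.most_specific.name,
      response.city.name)
reader.close()
```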

2. Then both we and our users found the log onboarding process complex and communication difficult.

People are not as precise as machines, which do exactly what they are told. When onboarding user logs, for example, users often describe their log format incompletely or ambiguously, parsing then fails, and the person doing the integration has to rewrite the configuration again and again. From a user's request to the point where they can see visualizations or search their logs takes anywhere from hours to days. Nobody enjoys this; both the users and we get worn down, so we had to change. To that end we are gradually automating log onboarding to cut integration time and communication cost. This requires three key pieces: A. an interface where users configure their log format, as simple as possible; B. automatic generation of the Logstash config and the index-management configuration from the user's input (a sketch of this step follows below); C. automatic deployment of the configuration (Logstash config, etc.) to complete the log pipeline.
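A hypothetical sketch of step B, rendering a Logstash config from the fields collected by the configuration interface; the template, field names, and plugin options are assumptions, not our production generator.

```python
# Hypothetical template; real Logstash configs are usually much richer.
LOGSTASH_CONF_TEMPLATE = """
input  {{ kafka {{ topic_id => "{topic}" }} }}
filter {{ grok {{ match => {{ "message" => "{grok_pattern}" }} }} }}
output {{ elasticsearch {{ index => "{index_prefix}-%{{+YYYY.MM.dd}}" }} }}
"""

def generate_logstash_conf(user_config):
    """Render a Logstash config from the fields collected by the web interface."""
    return LOGSTASH_CONF_TEMPLATE.format(
        topic=user_config['kafka_topic'],
        grok_pattern=user_config['grok_pattern'],
        index_prefix=user_config['index_prefix'],
    )

print(generate_logstash_conf({
    'kafka_topic': 'weibo-nginx',
    'grok_pattern': '%{COMBINEDAPACHELOG}',
    'index_prefix': 'weibo-nginx',
}))
```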

We then built a simple interface for negotiating the log format with users:

So far we have completed part of A, the user log-format configuration interface; all of B, the Python API that auto-generates the Logstash config; and C is about to start, where we are considering whether Docker can make things easier for us.

3. Some data visualization needs were not met, and Kibana was hard to configure.

We started with the official Kibana v3. Common user requests such as multiple group-bys (in SQL terms), drawing percentages, or computing ratios over a specified interval could not be satisfied. Later we used a customized Kibana 3 from @argv (on Weibo), which met some of those needs. Then Kibana 4 appeared, almost a complete rewrite of Kibana 3; built on Elasticsearch aggregations, it gains powerful statistics and flexible configuration, a liberation from Kibana 3. We will migrate to Kibana 4 in the near future.
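For example, the "multiple group by" style of question maps to a nested terms aggregation in Elasticsearch, which Kibana 4 can configure directly; the index and field names below are illustrative.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

# Roughly equivalent to SQL "GROUP BY status, domain": count logs per status
# code, and within each status, per domain (field names are illustrative).
body = {
    "size": 0,
    "aggs": {
        "by_status": {
            "terms": {"field": "status"},
            "aggs": {
                "by_domain": {"terms": {"field": "domain", "size": 5}}
            }
        }
    }
}
result = es.search(index='logstash-2015.07.14', body=body)
for status_bucket in result['aggregations']['by_status']['buckets']:
    for domain_bucket in status_bucket['by_domain']['buckets']:
        print(status_bucket['key'], domain_bucket['key'], domain_bucket['doc_count'])
```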

Third, provide new features

We installed elasticsearch-analysis-ik, the IK Chinese word segmentation plugin for Elasticsearch developed by Medcl, a well-known figure in the Chinese community. Previously "中国" (China) would be indexed as the separate characters "中" and "国"; now it can be treated as one complete term, whereas otherwise a search for "中国" would also surface "美国" (America). Part of Weipan's offline search runs on our service and relies on Chinese word segmentation; only Elasticsearch's search capability could meet their needs and reduce their pain.
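For illustration, an index that analyzes a field with the IK plugin could be created like this; the plugin must already be installed, and the index, type, and analyzer names are common choices that may differ by plugin version.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

# Map the message field with the "ik" analyzer registered by
# elasticsearch-analysis-ik (the analyzer name depends on the plugin version).
es.indices.create(index='weipan-search', body={
    "mappings": {
        "logs": {
            "properties": {
                "message": {"type": "string", "analyzer": "ik"}
            }
        }
    }
})
```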

Pits and snags we have been through:

1. High JVM heap usage (> 90%) in the Elasticsearch process.

For a long time we were fighting high JVM heap usage. It makes old-generation GCs take many times longer, ES nodes drop out of the cluster frequently, and the whole cluster can almost stop responding. Our main strategies now are: enable doc_values; limit the JVM heap size; use analyzed strings only in queries and forbid facets or aggs on them; and periodically close indices that users do not need.

2. The learning curve of the Elasticsearch Query DSL, facets, and aggs.

Some people have developed plugins that run ES queries with SQL, which lowers the entry barrier to some extent. The advice we give users is to observe the request body that Kibana sends, or to try Marvel's Sense plugin, which auto-completes queries, facets, and aggs. Beyond that, the most commonly used query is the query_string query, and the most commonly used aggs are terms and date_histogram; together they cover most needs.
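A sketch combining exactly those pieces: a query_string query plus terms and date_histogram aggregations; the index and field names are illustrative.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

body = {
    "size": 0,
    # query_string query: the same Lucene syntax typed into the Kibana search box.
    "query": {"query_string": {"query": "status:500 AND host:api.*"}},
    "aggs": {
        # date_histogram: matching documents bucketed per hour.
        "errors_over_time": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"}
        },
        # terms: the ten URIs that produce the most matches.
        "top_uris": {"terms": {"field": "uri", "size": 10}}
    }
}
result = es.search(index='logstash-2015.07.14', body=body)
print(result['aggregations']['errors_over_time']['buckets'])
```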

3. Logstash stops working.

Problematic unofficial plugins, and unhandled exceptions in logstash-filter-ruby, cause Logstash worker threads to exit abnormally and turn Logstash into a zombie. Our advice is to stick to official plugins in the config and avoid logstash-filter-ruby as much as possible. Even so, we have hit logs complex enough that after a 250+ line config we still ended up using the ruby filter. We have not found a good, mature monitoring solution for Logstash, and its internal state is not exposed. We currently check whether indexing is working indirectly, by monitoring whether the Kafka topic consumer is lagging and by watching the Elasticsearch indexing rate of the Logstash output; a sketch of the latter follows below.
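A sketch of that indirect check on the Elasticsearch side, sampling the indexing counter twice to estimate a documents-per-second rate; the host, index pattern, and interval are placeholders, not the production script.

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es01:9200'])  # placeholder host

def indexing_rate(index_pattern, interval=60):
    """Estimate docs/second indexed into the given indices by sampling
    the index_total counter twice, `interval` seconds apart."""
    def total():
        stats = es.indices.stats(index=index_pattern, metric='indexing')
        return stats['_all']['total']['indexing']['index_total']
    before = total()
    time.sleep(interval)
    return (total() - before) / float(interval)

# If the rate drops to zero while the Kafka topic still has new messages,
# Logstash is probably stuck.
print(indexing_rate('logstash-*', interval=60))
```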

4. Kibana has no concept of users, so different users' data cannot be isolated.

Multiple users share Kibana dashboards; mis-operations or accidental deletions often affect other users, and once too many dashboards are saved it becomes hard to find a specific one. So far the official project has made no improvement in this area. There are many unofficial improvements; we have used the customized Kibana 3 mentioned above, and we also snapshot the Kibana index to HDFS.

5. High cost of communicating with users.

When negotiating log formats and data visualization configuration with users, human imprecision easily leads to multiple rounds of back-and-forth confirmation and revision, which is inefficient. After all, we provide a log analysis service; we do not operate the logs for our users. So recently we have been exploring log onboarding automation, recommending that users send us data already in JSON, and holding regular Kibana training for users, all to reduce communication costs.

Q & A

Q: Logstash connections to ES sometimes time out. How do you solve that?
A: Timeouts are common when the ES JVM heap usage is high. If the servers have little memory, upgrade to more memory. Also, do not run aggs or facets on analyzed strings, and enable doc_values.

Q: How do you alert on anomalies in logs? Keyword filtering?
A: For parsing failures, Logstash commonly adds _grokparsefailure and _jsonparsefailure; after the data is written to ES, you can run a query counting these two keywords. As for alerting solutions, the official Watcher has only just come out; there are earlier implementations, such as Yelp's ElastAlert.

Q: Is there a big difference between a big data analytics platform (based on HDFS) and what Kibana presents? What are the biggest differences?
A: The difference you describe, as I understand it, is the difference between Hadoop and Elasticsearch: one is offline analysis in units of jobs, the other is real-time search and statistics in units of queries. There are three keywords: real-time, search, statistics. Hadoop is offline, ES is real-time; ES is essentially a search engine and can be used for full-text retrieval, which Hadoop obviously is not; statistics are something both Hadoop and ES can do, though I am not aware of Hadoop having a data visualization component like Kibana.

Q: Are the data nodes and query nodes in your ES cluster separated? Does Logstash write directly to query nodes or to data nodes? Also, do you use node mode or transport mode?
A: (1) They are not separated. (2) We are still using the HTTP protocol.

The slides (PPT) have been uploaded to Weipan.

===========================
The above content is organized from the group share on the evening of July 14, 2015. The speaker, Gowing, works at Sina and is mainly responsible for the technical architecture and implementation of the DIP real-time log analysis service, which provides real-time log statistics and search for Weibo, Weipan, video, CDN, and other departments; he is keen on turning open-source technology into services and products. Weibo: @gary. DockOne Weekly organizes targeted technology shares; interested readers are welcome to add LIYINGJIESX to join the group.