Preface:
This article is a translation of "Log Everything All The Time"; a small amount of content has been omitted, so read the original article if you need the full text. Translated by Jing Han on the blog garden (cnblogs): http://gpcuster.cnblogs.com
Translation:
A Q&A thread on joelonsoftware raised an old question: what should be logged, and how? The common trace/error/warning/info approach is not very useful in a large-scale distributed system; you need to record everything in order to solve problems.
To see why the usual approach falls short, imagine that your website has been running smoothly for weeks when, at 2 a.m. one morning, a problem appears: users sometimes cannot submit comments. You need to fix it now.
So how do you find and fix the problem? The monitoring system shows nothing abnormal; you submit a test comment yourself and it goes through fine. The problem is clearly not going to be easy, because submitting a comment touches many components: the load balancer, the spam filter, the web servers, the database servers, the cache servers, the file servers, plus the switches and routers in between. Where is the fault?
At this point, logs are all you have. You cannot shut the system down because users are on it; you cannot deploy a new build because you have no way to test whether it introduces new problems; and attaching a debugger will not help either.
What you can do is read the logs and see what was recorded in them. You do not need function-level tracing; you need a record of every interesting event in the system. Knowing that a function named func1 was called is useless; you need to know what arguments were passed in and what value came back.
So forget log levels: record everything, so the information is there when you need it. What you really want is a time machine. That is unrealistic, but if the logs are detailed enough they amount to one: they let you reconstruct everything that happened during the incident. Did a network interface drop a packet and cause a timeout? Was a mutex used incorrectly? And so on.
Most systems evolve slowly toward recording everything. At first they log very little, or nothing at all, and content is added whenever a problem occurs. But such logs are usually poorly classified and organized, which leads to both poor problem coverage and reduced program performance.
Logs are usually searched for anomalies: unexpected things such as failed operations, out-of-order processing, or unusually long computations. These anomalies are valuable in their own right; they tell you how to make your program more robust and how it actually behaves in a production environment.
So think about what you will need when debugging, and do not be afraid to add logs that help you understand how the system really works.
For example, assign a globally unique ID to each request, so that log entries from different requests can be told apart; this improves both the efficiency and the accuracy of debugging.
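As a minimal sketch of this idea in Python (the helper names and log format here are illustrative, not from the original article): each request gets a unique ID at entry, and a logging filter stamps that ID onto every record emitted while handling it.

```python
import logging
import uuid
from contextvars import ContextVar

# Current request's ID; "-" means "not inside a request".
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Attach the current request ID to every log record.
        record.request_id = request_id.get()
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(request_id)s %(levelname)s %(message)s"))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(payload: str) -> str:
    rid = uuid.uuid4().hex          # globally unique ID for this request
    request_id.set(rid)
    logger.info("received payload=%r", payload)
    logger.info("comment stored")
    return rid
```

With this in place, grepping the logs for one request ID pulls out the full story of that request across every component that logged it.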
Generally, there are two levels of logging: system level and development level.
System-level logs record all the logs you need to debug the system. They will always exist and will not be disabled.
Development-level logs add more detailed information and can be enabled or disabled on a per-module basis.
I usually use a configuration file that defines the default output levels, but I also give every process an interface through which its output level can be changed at run time. This makes development very convenient.
I often hear the objection that recording everything is inefficient and produces too much data. I disagree. I have worked on projects, some of them real-time embedded systems, that logged everything, drivers included.
The following are some logging tips (the translator felt the original English puts them best and left them untranslated):
Make logging efficient from the start so you aren't afraid to use it.
Create a dead simple to use Log library that makes logging trivial for developers. Document it. Provide example code. Check for it during code reviews.
Log to a separate task and let the task push out log data when it can.
Use a preallocated buffer pool for log messages so memory allocation is just pop and push.
Log integer values for very time sensitive code.
For less time sensitive code sprintf'ing into a preallocated buffer is usually quite fast. When it's not, you can use reference counted data structures and do the formatting in the logging thread.
Triggering a log message should take exactly one table lookup. Then the performance hit is minimal.
Don't do any formatting before it is determined the log is needed. This removes constant overhead for each log message.
Allow fancy stream based formatting so developers feel free to dump all the data they wish in any format they wish.
In an ISR context do not take locks or you'll introduce unbounded variable latency into the system.
Directly format data into fixed size buffers in the log message. This way there is no unavoidable overhead.
Make the log message directly queueable to the log task so queuing doesn't take more memory allocations. Memory allocation is a primary source of arbitrary latency and deadlock because of the locking. Avoid memory allocation in the log path.
Make the logging thread a lower priority so it won't starve the main application thread.
Store log messages in a circular queue to limit resource usage.
Write log messages to disk in big sequential blocks for efficiency.
Every object in your system should be dumpable to a log message. This makes logging trivial for developers.
Tie your logging system into your monitoring system so all the logging data from every process on every host winds its way to your centralized monitoring system. At the same time you can send all your SLA-related metrics and other stats. This can all be collected in the background so it doesn't impact performance.
Add meta data throughout the request handling process that makes it easy to diagnose problems and alert on future potential problems.
Map software components to subsystems that are individually controllable; cross-application trace levels aren't useful.
Add a command port to processes that makes it easy to set program behavior at run time and view important statistics and logging information.
Log information like task switch counts and times, queue depths and high and low watermarks, free memory, drop counts, mutex wait times, CPU usage, disk and network I/O, and anything else that may give a full picture of how your software is behaving in the real world.
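Several of the tips above — a separate logging task, a bounded buffer, no blocking or formatting in the caller's path, counting drops instead of stalling — can be sketched together in Python. This is only a conceptual sketch: Python's queue.Queue stands in for the preallocated, lock-careful buffer pool a C implementation would use, and the level gate here plays the role of the single table lookup.

```python
import logging
import queue
import threading

LOG_LEVEL = logging.INFO                             # single cheap gate check
log_queue: queue.Queue = queue.Queue(maxsize=1024)   # bounded buffer limits resource usage
dropped = 0                                          # drop count instead of blocking

def log(level: int, fmt: str, *args) -> None:
    """Hot-path side: no formatting, no blocking."""
    global dropped
    if level < LOG_LEVEL:          # early-out before any work is done
        return
    try:
        log_queue.put_nowait((level, fmt, args))  # enqueue raw message + args
    except queue.Full:
        dropped += 1               # never stall the application thread

def logger_thread(out: list) -> None:
    """Logging task: formats and writes out when it can."""
    while True:
        item = log_queue.get()
        if item is None:           # sentinel: shut down
            break
        level, fmt, args = item
        out.append(fmt % args)     # formatting deferred to this thread

lines: list = []
t = threading.Thread(target=logger_thread, args=(lines,), daemon=True)
t.start()
log(logging.DEBUG, "below the gate, never formatted %d", 1)
log(logging.INFO, "request handled in %d ms", 7)
log_queue.put(None)                # flush and stop the logging task
t.join()
```

In a real system the logging task would also batch messages into big sequential disk writes and run at a lower priority than the application threads, as the tips describe.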
Log data is the basis for debugging the vast majority of large-scale distributed systems.
So start recording everything now. The next time that 2 a.m. problem shows up, you will know how to track it down and fix it :)