I recently read Mr. Chen's article "The company will disappear in the fun". I have read it more than ten times and still have not exhausted its meaning; it struck me deeply. I have been at Xiaomi for three years. Three years ago I would tell people how I led dozens of people, how I built a near-perfect quality assurance system, how I used that system to make everyone follow an orderly software development process, and so on. I would add that such-and-such big company does it this way and such-and-such foreign company does it that way, all to prove how right and how great my ideas were. Looking back at that version of me now: what a fool.
I spent 2014 quietly, working on the push service for a little over a year. Writing my year-end summary coincided with reading the article, so I asked myself: of all the things I did this year, which one could I actually hold up, like that white shirt? I listed everything I had done and went through it item by item: this one is too rough, this one is too messy, this one, well, is very ordinary. After crossing out dozens of items my heart went ice cold. Mr. Chen laid out all of his products across half a floor and found not one that he could show to Mr. Lei; maybe this is what that felt like. After thinking for a few days I finally came up with one small thing. Perhaps it is so ordinary that it had become a habit I no longer noticed. Then yesterday I was talking with server engineers from a major internet company, and when I finished bragging about it, they looked at me with a strange light in their eyes.
We spent more than a year building just one thing, a set of service monitoring: more than ten major revisions, countless rounds of refactoring and rewriting, all kinds of experiments, all kinds of pits to fall into, bottlenecks to remove, dense logs to read, situations to analyze, and exceptions to handle. The principle of the monitoring is actually very simple: run a set of cases on a test machine, call the online service's API, and check whether the service returns the right result. If it is right, the case passes; if it is wrong, raise an alarm. That is all; nothing difficult. But we have been at it for over a year, and the changes are still going on today.
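The principle fits in a few lines of Python. This is only a minimal sketch, not the code we actually run: `Case`, `run_cases`, and the `call_api` and `send_alarm` callables are hypothetical names chosen for illustration, and the real cases talk to the push service rather than to a plain function.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Case:
    """One monitoring case: a request plus the result we expect back (hypothetical shape)."""
    name: str
    request: dict
    expected: Any


def run_cases(cases: List[Case],
              call_api: Callable[[dict], Any],
              send_alarm: Callable[[str], None]) -> List[str]:
    """Run every case once; alarm on each wrong result and return the failed names."""
    failed = []
    for case in cases:
        try:
            actual = call_api(case.request)
        except Exception as exc:  # a call that blows up counts as a failure too
            actual = f"exception: {exc!r}"
        if actual != case.expected:
            failed.append(case.name)
            send_alarm(f"{case.name}: expected {case.expected!r}, got {actual!r}")
    return failed
```

Everything that took us more than a year lives outside this loop: how `call_api` reaches the service, and how much a single failure can be trusted.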
1. The initial plan was an e2e test: connect a phone to the test machine, use a script to drive an Android program on a timer, install a push client SDK demo on the phone, and then run the Android cases, which send messages to the client and check whether the client receives them. If a message arrives, the case reads it from the phone and checks that it is correct. The code was written quickly and the tests passed, so we started running it. After one night we were dumbfounded: the phone was on WiFi, and the false alarms caused by network instability kept the alarm going off constantly, which made the alarm meaningless. What to do? Change the environment.
2. Since the WiFi environment was bad, we plugged a router directly into a network port and let it serve only this one phone. Surely that would be stable. After two nights the false alarms were fewer, but they still made up a large share of all alarms, and the alarms kept everyone on edge. What to do? Change the plan.
3. Use an Android virtual machine (emulator) running directly on the test machine, with no WiFi involved at all. Surely that would work? After a few days of practice we wanted to cry. The emulator's memory consumption was frightening, especially after running for several days. Performance was also a problem: reading the server's return value was slow and needed long sleeps. And in the end it was unstable in all sorts of ways.
4. Take the intersection. With a phone the network is the bottleneck; with the emulator stability is the bottleneck. Could we run both and alarm only when the same case fails on both, staying quiet when only one side fails? The plan looked good, but writing the code was another matter: one monitor ran fast and the other ran slow, so we needed a control program to analyze the logs from both sides on a schedule, and it turned into a mess.
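The intersection rule from step 4 is tiny once the two result sets are lined up. The Python sketch below assumes that alignment has already been done, which was exactly the messy part for us; `cases_to_alarm` and the dictionary layout are hypothetical and only for illustration.

```python
from typing import Dict, List


def cases_to_alarm(phone_results: Dict[str, bool],
                   emulator_results: Dict[str, bool]) -> List[str]:
    """Return the cases that failed on BOTH monitors (True = pass, False = fail).

    A case missing from the emulator side is treated as unconfirmed,
    so it does not alarm.
    """
    return sorted(
        name
        for name, passed in phone_results.items()
        if not passed and emulator_results.get(name, True) is False
    )
```

For example, with phone results `{"send": False, "topic": False}` and emulator results `{"send": False, "topic": True}`, only `send` would alarm. The rule itself is cheap; pairing a fast run with a slow run so that the two dictionaries describe the same moment is where the control program got complicated.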
Abandon Android. We are monitoring the service, so why drag the client along? e2e monitoring feels more reliable to people, but it is not the most fundamental thing.
5. Connect directly to the server. On the client side, the push demo calls the push client SDK, and the client SDK ultimately calls the smack package to carry the XMPP messages. The server-side engineers, for their own e2e tests, had a middle layer that calls smack directly to connect to the server. So there was a solution: adapt the server-side engineers' middle layer, plug in the client's smack, simulate the client's requests, and put the test cases on top. It should work. Soon this patched-together thing was running, and the effect was good: false alarms dropped noticeably. We killed off all the unreliable monitors from before, deployed this one, hung up the decorations, and got ready to drink the victory wine. But after a few weeks the problems came. A long connection is not like a short connection, where you simply retry on failure. Once a long connection breaks, re-binding is problematic, and how do you even detect that the connection has broken?
6. The first suspect was smack, because the call stacks pointed at it. smack is a third-party library for XMPP connections, but the XMPP we use is not the standard protocol; we have modified XMPP, so smack had been modified as well. We checked the Maven repository, found several smack versions, tried the others, and asked colleagues in the messaging group. After various attempts the conclusion was that this was not a version problem, and the other versions did not support the push service. What to do? Modify smack.
7. We found the smack source code, read it, debugged it, and summarized its behavior:
a. If the network is down when smack is about to send a message, it detects that the network is down and then sends a <present> to force the connection closed.
b. If the client is waiting to receive a message from the server, the wait simply times out, and smack cannot tell that the network is disconnected.
The middle layer could not handle re-bind, so the only option was to change smack: before sending or receiving a message, check the network state and re-bind if the connection is broken. I changed the code and asked our Android expert, brother New Jun, to take a look. In less than 5 minutes it was rejected: the change would affect the connection state, it was too low-level, and it was better not to touch it. The attempt failed again. I quietly added some re-bind handling to the middle layer instead, and the effect was not very good.
8. Back to square one, analyzing the logs one by one again. It slowly became clear that network instability was only one problem; the instability of the middle layer began to surface. The middle layer had about 20 packages, with all kinds of inheritance and overrides and many static variables; without debugging you could not tell where they were changed. Sometimes a small change would stop the upper-layer cases from running. What to do? Forget it, rewrite it to fit our own needs. That took a few weeks. Late one night all the cases finally ran on the rewritten middle layer. There was no excitement at that moment; I just wanted to go home and sleep. After the rewrite only 2 packages and about 7-8 files were left. Yes, that was all the upper-layer cases actually used. After a few weeks of trial runs the world was quiet and the monitoring was noticeably more stable, but new bottlenecks appeared.
9. With the middle layer stable, we found that the upper-layer cases were badly written and unreliable in many ways. Rewrite them. A few more weeks.
10. Another problem: once, while the monitoring looked normal, a developer told us that their app was not receiving pushes. The investigation concluded that the server had a problem with that app's configuration.
That incident made us stop running the cases with a single test account. We now run them with each app's own account: we built an appinfo list and run the cases app by app.
11. Then another problem. When we had about 100 XMQ instances, a flaw in the upgrade procedure left two XMQ instances broken after one upgrade. But the monitoring traffic was landing on the XMQ instances at random, and for a very long time no alarm was raised. That was an incident. How could we keep it from happening again? We built partition monitoring, in which each case lands on exactly one XMQ and all the XMQ instances are polled in turn. It went online, ran for several months, and did find some problems.
12. As the number of XMQ instances grew, the polling took longer and longer, which raised a new question: how do you find out quickly that a particular XMQ has a problem? We considered multithreading, in two forms: writing multithreaded code ourselves, or using a tool that runs cases on multiple threads. We worked on both for a few weeks and ran into many problems. First, these are long connections: a sleep affects all threads, and if one message takes too long it can hang other threads too, which is unlike short connections. Second, log handling: the multithreaded tool mixes all the logs together, call stacks included, which makes them hard to analyze. Last, stability: multithreaded runs are more sensitive to the network. In the end we did not go multithreaded.
13. Case prioritization. Since multithreading was out, we had to make the cases run faster so that all the XMQ instances could be traversed quickly. The new scheme splits the cases into levels P0, P1, and P2. P0 runs during the day and covers the functions users use most; the first priority is that these never go down. P1 runs at night and covers features users rarely use; we later found that these have relatively few problems. P2 is everything else; because a full run takes too long, it runs only once every few hours. The other dimension is routing: partition monitoring traverses the XMQ instances in order, while the per-app cases land on XMQ instances at random, so the two are kept distinct.
14. Another problem. Users looking at the push statistics report told us: we did not send any topic messages yesterday, so why does the report show topic messages? We checked, and those were the messages sent by the monitoring accounts. This time we had gone too far. We asked the server developers for help, and they provided an extra field for monitoring: if it is set to the test value, the statistics report filters out the test message data, which avoids user complaints.
15. XMQ used to be written in Erlang, and later the plan was to replace it with a Java implementation. Because the impact was large, everyone was very cautious. We tried checking every field in the packets XMQ sends, and that exposed a problem in the monitoring: previously we had checked only some key fields, not all of them. So all the cases passed, yet the packets XMQ sent still caused problems for users, because you cannot know which fields users will rely on. The gap in the tests was found during the gray-release upgrade. After a long struggle, we took the packets XMQ sends apart layer by layer, which took several days. In the end the Java XMQ went fully online, the cases gained all the additional checks, and the structure of the packets XMQ sends was finally clear.
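The lesson of item 15, checking every field instead of a hand-picked few, can be illustrated with a recursive comparison. This Python sketch assumes the packets have already been decoded into nested dicts and lists; `diff_fields` and the field names in the usage note are hypothetical and are not the real XMQ packet layout.

```python
from typing import Any, List


def diff_fields(expected: Any, actual: Any, path: str = "") -> List[str]:
    """Recursively compare two decoded packets and list every differing field."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        # Visit the union of keys so extra and missing fields are both reported.
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}" if path else str(key)
            if key not in actual:
                diffs.append(f"{child}: missing")
            elif key not in expected:
                diffs.append(f"{child}: unexpected")
            else:
                diffs.extend(diff_fields(expected[key], actual[key], child))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [f"{path}: length {len(expected)} != {len(actual)}"]
        diffs = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs.extend(diff_fields(e, a, f"{path}[{i}]"))
        return diffs
    return [] if expected == actual else [f"{path}: {expected!r} != {actual!r}"]
```

Comparing `{"id": "m1", "payload": {"extra": {"a": 1}}}` against `{"id": "m1", "payload": {"extra": {"a": 2}}}` reports the single path `payload.extra.a`. A check that looked only at `id` would have passed, which is how a case can be green while a user's app is broken.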
16. Log processing. When an alarm mail goes out, the problem has to be located quickly: the log structure must be clear, the key content shown, and the unimportant content removed. Easy to say, not simple to do. Three of us defined a format standard for the logs, then divided up the cases and changed them. That took a while.
17. The SD machine room was full, with no space for new machines, and the service had no multi-data-center plan. Messages were delayed in all sorts of ways and the monitoring alarmed constantly. The developers were anxious: with no hardware to add, they could only change the software to make the service perform better, which introduced some bugs, and the monitoring alarmed again. All we could do was increase the sleep times, turn off some validations, verify only the important fields, and skip checks for APIs that do not return a packet. Later the multi-data-center solution arrived and the service migrated to LG. The monitoring connected to the LG service, alarmed, we fixed things, and this cycle repeated for a few weeks before it settled.
18. After the move to LG the network had changed, and new problems appeared: a large number of failures caused by the network. This time we gave up the monitoring machine in the Multicolored City machine room and applied for a new monitoring machine in ZC. The operations engineers did some work at the network layer so that the ZC-to-LG link was close to a direct connection. After running for a while, it was fine.
19. With more cases and more kinds of monitoring, starting all of them on schedule became a problem. At first we used crontab, which is simple and easy, but if a job dies you never find out. Later we used Jenkins; the trouble there was deployment, and it gave us no way to send alarm mails or to analyze historical data, both of which we needed. Our monitoring system needed a control service that runs the different monitors on a timer, collects logs, stores them in a database, sends alarms by mail and MiTalk, analyzes the data regularly, and raises an alarm when a monitoring process itself behaves abnormally.
Li Yuan spent nearly 2 months, on and off, in gaps at work and in spare time, writing the control service. Now several analysis reports on the previous day's data go out every morning, and these statistics do reveal some problems in the service.
20. After all these changes the monitoring has gradually stabilized, and functional problems in the service are now reported very quickly. The biggest problem at the moment is the network. The latest investigation points to 2 suspects: a. the network from ZC to LG may have problems; b. the FE-GW front end may have some problems. We have since added a bind monitor that traverses all the XMQ instances. The work on the monitoring goes on ...
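The idea behind partition monitoring (item 11) and the bind monitor (item 20) is the same: visit every XMQ instance on purpose instead of hoping that random routing reaches it. With about 100 instances and 2 of them broken, a case routed uniformly at random would hit a broken one only about 2% of the time. The Python sketch below is an illustration under assumptions: `sweep_partitions`, `failing_instances`, and the `run_case_on` callable are hypothetical names, and how a case is pinned to one instance is not shown.

```python
from typing import Callable, Dict, Iterable, List


def sweep_partitions(xmq_ids: Iterable[str],
                     run_case_on: Callable[[str], bool]) -> Dict[str, bool]:
    """Run one pinned case against every XMQ instance, in order.

    run_case_on(xmq_id) is assumed to route its traffic to that instance only
    and to return True on pass. Exceptions are recorded as failures so that
    one broken instance cannot stop the sweep.
    """
    results = {}
    for xmq_id in xmq_ids:
        try:
            results[xmq_id] = bool(run_case_on(xmq_id))
        except Exception:
            results[xmq_id] = False
    return results


def failing_instances(results: Dict[str, bool]) -> List[str]:
    """List the instances whose pinned case did not pass."""
    return sorted(xmq_id for xmq_id, ok in results.items() if not ok)
```

The cost is the one described in item 12: the sweep is sequential, so it takes longer as instances are added, which is why we ended up shortening the cases rather than adding threads.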
These are only some examples; in reality there were more problems than these. Sensitivity and false alarm rate are a built-in contradiction in monitoring: the higher the sensitivity, the higher the false alarm rate, and the lower the sensitivity, the lower the false alarm rate. We have to find the balance, and that is the hardest part. If you ask me how good the push monitoring is now, I cannot say, but I can sleep soundly at night.
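One common way to turn this trade-off into a single knob is to alarm only after several consecutive failures. This is an illustration of the contradiction, not a description of our actual alarm rule; `count_alarms` and its `threshold` parameter are hypothetical.

```python
from typing import Iterable


def count_alarms(outcomes: Iterable[bool], threshold: int) -> int:
    """Count alarms when one is raised after `threshold` consecutive failures.

    outcomes: True = pass, False = fail, in run order. The failure streak
    resets after each pass and after each alarm.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    alarms = 0
    streak = 0
    for passed in outcomes:
        streak = 0 if passed else streak + 1
        if streak == threshold:
            alarms += 1
            streak = 0
    return alarms
```

Take a history with two isolated network blips and one real three-run outage: `[True, False, True, True, False, False, False, True, False]`. A threshold of 1 raises 5 alarms, most of them noise. A threshold of 3 raises 1 alarm, the real outage, but only on its third failed run. A threshold of 4 raises none and misses the outage entirely. Lower sensitivity buys fewer false alarms at the price of slower or missed detection.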
Finally, I want to share how doing this monitoring has felt.
1. Monitoring is not one person's job. This project involves many teams: it can take client developers, server developers, operations engineers, test engineers, and even other teams working together to find root causes and solve problems. Each team is like a set of gears. Everyone polishes their own part, the parts are put together so that they turn in step, and in the end they are assembled into a precise Swiss watch.
2. There is an unwritten rule in the scientific community that basic research is not done by clever people, but by plodders who keep at it every day for 20 or 30 years and in the end prove something. Monitoring is a very basic thing. It cannot be done overnight; you cannot sound the charge tonight and reach communism tomorrow morning. It takes people who settle down and work at it seriously for a couple of years before something may come of it.
3. Doing monitoring is a small thing; doing monitoring well is a big thing. How big a thing is depends on the granularity at which you look at it: the finer you look, the more there is to do; the coarser you look, the smaller it seems and the less you do. Nothing in this world is truly lofty. Others have simply stepped into several orders of magnitude more pits than you have, and that is how they formed a technical barrier. You cannot copy it; only by walking the road yourself, climbing out of pit after pit, do you come to feel it deeply.
4. In interviews we often hear candidates say how many things they did in a short time, but we seldom hear anyone say "I have done only one thing all these years." I think of Ono, the 90-year-old Michelin three-star chef who is said to have been making sushi for nearly 60 years and is known as the god of sushi. The Michelin Guide's verdict on his sushi is that it is a delicacy worth waiting a lifetime in line for.
He has spent a lifetime doing one thing. Is it good or bad for us to spend a year doing one thing? Tell an interviewer the details of this story and see how they react. It should be interesting.
It is now 2015, and I want to do one small thing this year as well. If you want to do a very small thing too, we can talk. If you want to do big things, brother, my IQ is low, and anything higher than the sky over my head is beyond me. I just want to make a white shirt.
I just want to make a white shirt: Xiaomi push service monitoring notes