Nowadays, any system of moderate scale, especially in the financial industry where business rules are complex, is usually split into smaller sub-modules. Each developer builds one or several modules and packages each finished module as a jar for other modules to call; once all modules are complete, they are integrated. A top-up (recharge) system is more complex still: besides being split into sub-modules, it has to interact with many peripheral systems, such as receiving service providers, recharge centers, and banks. Each programmer is the developer of one or several modules.
The main point of this article is: how to protect yourself well when the system has problems.
As software developers, we are generally the weak party in a company. When a system problem causes an accident, the operations staff will usually point the finger at the developers first; they do this to clear themselves of responsibility before the fire spreads. When the accident causes the company an actual loss, the boss gets involved, and someone always has to take the blame. The responsibility is pressed down from the top, level by level, until it lands on the programmer who coded the specific module. At that point the only weapon we developers have to fight back with is the log. If the log does not clearly record what code was called at what time, when a call failed, and the details of why it failed, then the programmer responsible will be unable to defend himself no matter what he says. All the blame gets dumped on his head, and the end result is that he takes the fall and leaves.
This happened many times at my previous company. While I worked there, the team leader often reminded us to be careful, and careful again, when writing code: "Do not make mistakes. If there is an accident, I get scolded by the boss and you get dismissed." A folder on the department's SVN server held summary reports dedicated to incidents, one Word document per accident. There were about 11 documents, and about 3 of them covered A-level accidents that directly caused economic loss. The colleagues involved must have been extremely depressed when they were dismissed. Some time ago I heard from a colleague at that company about a technical expert there who had written a program that exchanged business messages with another company's system. A top-up message was tampered with and its amount changed to 2 million. His code was neither rigorous nor thorough: on receiving the response message, it performed the transfer without verifying the amount. I heard the company lost 2 million in that one incident, and the boss slammed a laptop down on the desk in the office. In the end he was dismissed. This can happen however skilled or careful a person is. As the saying goes: sail with caution and the ship lasts ten thousand years.
Now I will use a top-up system to illustrate my point: how to call black hole code. (So-called black hole code is a method you are about to call that is a black box to you: you do not know what it does, and you do not know what errors it may raise.)
This is a UnionPay card recharge system. Its function is this: as long as you bind a bank card that supports online payment (from any bank) to your own mobile phone number, you can at any time dial the automated voice recharge number from your phone to top up your own number or someone else's. The participants in this system are the IVR voice service provider, the receiving service provider, and the telecom recharge center. The whole system can be divided into an IVR voice module, a receiving module, a telecom instruction interaction module, a message encryption and decryption module, and a payment module.
Module Development Schedule:
| Module name | Developer | Calls module number(s) | Serial number |
| --- | --- | --- | --- |
| IVR voice interaction module | Tom | 2 | 1 |
| Receiving module | John Doe | 5, 3 | 2 |
| Telecom instruction interaction module | Harry | 4 | 3 |
| Message encryption and decryption module | That six | | 4 |
| Payment module | Xiao Qi | | 5 |
After going live, the system was very unstable and the top-up failure rate was high: on average, about one recharge in five failed. The problem was traced to modules 2 and 3, and the two developers each stuck to their own story, arguing until they were red in the face. Let's look at John Doe's code, which calls Harry's code.
John Doe's Code:
The code circled in red is Harry's code. Harry's code is packaged as a jar for John Doe to call. John Doe injects Harry's business implementation class with Spring, calls its doCharge method directly, and returns the result. It seems there is no problem, but it only looks fine on the surface.
Would you dare to use the marked code this way?
Risk one:
To John Doe, the doCharge method is black hole code: he does not know what it does or what errors may occur. Because doCharge does not declare that it may throw an exception, John Doe assumed it was safe and added no try{}catch() to catch exceptions. But it is the one-in-ten-thousand case you have to fear: what if doCharge does throw?
Risk two:
What if doCharge deadlocks for some reason? Then your call is stuck inside it and never returns a result.
Countermeasure for risk one:
For any key method you invoke, whether or not it declares an exception, stay skeptical: wrap the call in try{}catch, log the exception you catch, then wrap it and rethrow it to the upper caller. This lets the upper caller know that an error occurred. By throwing the exception you have done your duty to notify, and the problem in the system is no longer on you. Otherwise someone will ask you: why didn't you catch the exception? Why didn't you throw it? You may have plenty of reasons to give the questioner, but it is better to avoid the trouble in the first place, so don't be lazy: add a try{}catch() to catch possible exceptions.
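A minimal sketch of this countermeasure follows. The original listing is not available, so every name here is an assumption made for illustration: `ChargeService` stands in for the interface exposed by Harry's jar, `doCharge` is given a guessed signature, and `ChargeException` and `SafeChargeCaller` are hypothetical helpers. Logging uses the JDK's own `java.util.logging`.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the interface exposed by Harry's jar. */
interface ChargeService {
    String doCharge(String phoneNumber, int amountYuan);
}

/** Hypothetical wrapper exception that carries the call context to the upper caller. */
class ChargeException extends RuntimeException {
    ChargeException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class SafeChargeCaller {
    private static final Logger LOG = Logger.getLogger(SafeChargeCaller.class.getName());

    private final ChargeService chargeService;

    SafeChargeCaller(ChargeService chargeService) {
        this.chargeService = chargeService;
    }

    String charge(String phoneNumber, int amountYuan) {
        String context = "doCharge(phone=" + phoneNumber + ", amount=" + amountYuan + ")";
        // Log before the call, so the log shows what was called and when.
        LOG.info("calling " + context);
        try {
            String result = chargeService.doCharge(phoneNumber, amountYuan);
            LOG.info(context + " returned " + result);
            return result;
        } catch (RuntimeException e) {
            // doCharge declares no exception, but it may still throw an unchecked one.
            // Record the failure with its stack trace, then notify the upper caller.
            LOG.log(Level.SEVERE, context + " failed", e);
            throw new ChargeException(context + " failed", e);
        }
    }
}
```

The wrapper keeps the original exception as the cause, so the stack trace of the failure inside the called module is preserved in both the log and the rethrown exception.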
Countermeasure for risk two:
An analogy: you are a hunter who wants to take a wolf cub from a cave, but you are not sure what dangers the cave holds. A smart hunter sends a hound in to catch the cub. If the hound has not come out after a while, there is danger inside, and the hunter thinks of another way. If the hunter goes in himself, he is the one at risk; who knows whether the cave holds a wolf or a tiger. The point of the analogy is this: if you want to invoke a method you think is not entirely safe, do not call it from the main thread (the hunter); create a calling thread (a hound) to invoke it. The benefit is that you can monitor whether the call succeeds, and you can also set a timeout on the call.
Following this analogy, we create a Hound utility class. It spawns a calling thread that invokes the black hole method automatically, and it throws TimeoutException if the call times out.
The modified code using Hound mode:
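What follows is a minimal sketch of what such a Hound class might look like, not the author's original code. It assumes the black hole call can be expressed as a `Callable` and uses the `java.util.concurrent` executor API; the class name `Hound` comes from the text, while the method name `call` and its parameters are illustrative choices.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Hound {
    private Hound() {}

    /** Daemon threads, so a stuck hound cannot keep the JVM alive on its own. */
    private static final ThreadFactory DAEMON_FACTORY = new ThreadFactory() {
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "hound");
            t.setDaemon(true);
            return t;
        }
    };

    /**
     * Runs the black hole call on a separate thread (the hound) and waits
     * at most the given timeout for its result.
     *
     * @throws TimeoutException   if the call did not return in time
     * @throws ExecutionException if the call itself threw; the cause is attached
     */
    static <T> T call(Callable<T> blackHole, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(DAEMON_FACTORY);
        try {
            Future<T> future = executor.submit(blackHole);
            return future.get(timeout, unit);
        } finally {
            // Interrupts the hound if it is still running. This is only a request:
            // code that ignores interruption keeps running in the background.
            executor.shutdownNow();
        }
    }
}
```

A caller such as John Doe's module would wrap the `doCharge` invocation in a `Callable`, pass it to `Hound.call` with a timeout of a few seconds, and log both the `TimeoutException` and the `ExecutionException` cases before reporting the failure upward.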
The bug was finally found. The problem was in Harry's telecom instruction interaction module. The length of a message sent to the telecom side must comply with the protocol; otherwise the telecom end treats it as an illegal packet and disconnects the socket. The protocol stipulates that the recharge amount must be 4 digits, left-padded with 0 when shorter: a recharge of 10 is sent as 0010, and a recharge of 100 as 0100. A recharge of less than 10 yuan should be left-padded with three 0s. The bug was in Harry's handling of single-digit amounts, which padded one 0 too few. As a result, recharges with a 2-digit amount succeeded, while a single-digit amount caused the socket connection to be dropped, and John Doe's calling thread stayed blocked there.
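The padding rule described above can be sketched as follows. The class and method names are hypothetical, and the accepted range of 0 to 9999 is an assumption derived only from the 4-digit width stated in the text; the real protocol may impose other limits.

```java
import java.util.Locale;

final class AmountField {
    private AmountField() {}

    /**
     * Encodes a recharge amount as exactly 4 digits, left-padded with zeros.
     * Amounts that cannot fit in 4 digits are rejected instead of being sent
     * as a malformed packet.
     */
    static String encode(int amountYuan) {
        if (amountYuan < 0 || amountYuan > 9999) {
            throw new IllegalArgumentException("amount does not fit in 4 digits: " + amountYuan);
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%04d", amountYuan);
    }
}
```

Using one format string for every amount avoids separate branches for 1-, 2-, and 3-digit values, which is where a bug like the one described could hide: `encode(5)` yields `0005`, `encode(10)` yields `0010`, and `encode(100)` yields `0100`.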
If John Doe had used Hound mode from the start, a glance at the log would have shown where the problem was. He could have pointed it out with solid evidence and would not have had to take the blame.
The above code works on JDK 1.5 and above; if you want to use it on JDK 1.4, please adapt it yourself.
PS: At readers' request, here is the Hound code; please compare the advantages and disadvantages yourself.
Normal invocation:
Hound invocation:
- Concurrenttest.rar (25.9 KB)
- Description: Test code for Hounds
http://www.iteye.com/topic/1116449
The thinking here is very thorough. If you were a team leader, I believe most developers would be willing to follow you, because when there is a system accident you would not pin the problem directly on your subordinates. Ideas and suggestions like this have been raised before. I remember a new employee who had just joined and thought the code designed by the architecture group was poor, so he raised it with the architecture group in a mass email. Somehow the exchange soured and finally turned into a war of words. His suggestion was not adopted, and he offended the people in the architecture group.
We are powerless to change this chaotic state of R&D management; we can only adapt.
What we can do is:
1. Write code carefully, and prepare countermeasures for every problem you can think of.
2. Handle exceptions properly: throw what should be thrown, catch what should be caught, and log in detail. Then responsibility for an accident will not be pushed onto our heads; we only care that the monthly salary arrives on the payroll card.
3. When the right opportunity comes, leave this company full of intrigue.
I learned a lot. My earlier projects were small, done by one or two people, with almost no logging. Now the projects are large, with more people and modules to integrate, so logging is necessary, not to find the person responsible but to work more efficiently. And I think it is very necessary to keep some skepticism about black hole methods.
Personally, from a purely technical point of view, the original poster's approach treats the symptoms rather than the cause: truly ugly architecture and management force the use of ugly practices.
Take the countermeasure for risk one. The code indiscriminately catches every exception, adds a description with no business semantics ("an error occurred during XXX"), and then wraps it in another unchecked exception and throws it. If the method name is well chosen, the same information can be read from the exception's call stack. If you find that message useful, there are three possibilities: 1. you are going to let the exception travel all the way to the user interface and display its message; 2. your team is not in the habit of reading exception call stacks closely; or 3. your class and method names are meaningless. Normally, if a large try/catch in the upper level of the architecture already does this, there is no need for it here.
Generic exception handling belongs at the top of the hierarchy. Usually lower-level code does not need to handle a RuntimeException, unless:
1. You have a specific way of dealing with the exception (that is, at your level it is not an exception at all, but part of the business).
2. You think the caller of this code is able to handle the exception, in which case you should wrap it as a checked exception and throw it again.
3. You can throw a message with more business semantics for the exception.
From the point of view of finding errors, the two log statements before and after the call already solve everything, and the exception handling here is purely superfluous. The exception is when the upper-level code swallows the exception (neither logging it nor rethrowing it), in which case the architecture is simply nonsense and does real harm (if you can get their code, find out where the exception is swallowed, and bring it up when responsibility is being assigned).
However, even if the upper-level code is outside your control, the original poster should still move this generic exception handling to the top of the scope he does control, rather than having so many layers of try/catch at the bottom.
The countermeasure for risk two is not the right way either. You open a new thread to call someone else's code; when it times out, your thread terminates abnormally, but the other thread is still running (even if you forcibly kill the new thread, you do not know whether the black box has started further threads of its own, or is doing other things asynchronously, such as reading and writing a database). How it goes on running, and whether it has side effects, you are no longer dealing with at all. After this abnormal termination the whole system is in an unknown state. What should later operations do? It would be safer to simply call Runtime.exit() and restart.
Since the other party's package expects to be used synchronously, and you do not know its internals, forcing it into asynchronous use is a bad idea. If you are afraid of a deadlock in someone else's package, the two logs before and after the call are generally enough to find the problem. It feels like you are being a bit greedy here: besides locating errors, you also want some error recovery, but asynchronous calls cannot be handled that simply, and handling them properly costs too much.
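The concern above can be demonstrated with a small sketch. Everything in it is hypothetical: the `Runnable` stands in for a black box that swallows interruption, and setting the `written` flag stands in for a side effect such as a database write. It shows that after the caller times out and cancels, the background work can still finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

final class OrphanedCall {
    private OrphanedCall() {}

    /** Returns true if the side effect happened after the caller timed out and cancelled. */
    static boolean sideEffectSurvivesTimeout() throws Exception {
        final CountDownLatch started = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(1);
        final AtomicBoolean written = new AtomicBoolean(false);

        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(new Runnable() {
            public void run() {
                started.countDown();
                boolean released = false;
                while (!released) {
                    try {
                        release.await();      // simulates slow work inside the black box
                        released = true;
                    } catch (InterruptedException swallowed) {
                        // The black box ignores interruption and keeps going.
                    }
                }
                written.set(true);            // stands in for a database write
                finished.countDown();
            }
        });

        started.await();
        boolean callerTimedOut = false;
        try {
            future.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            callerTimedOut = true;
            future.cancel(true);              // only requests interruption; it cannot force a stop
        }
        release.countDown();                  // the slow work now completes on its own
        boolean ran = finished.await(5, TimeUnit.SECONDS);
        executor.shutdown();
        return callerTimedOut && ran && written.get();
    }
}
```

In this sketch the caller has already given up and reported a timeout by the time the write happens, which is the unknown state the comment warns about. A timeout wrapper therefore helps with detection and logging, but it does not undo or stop what the called code is doing.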
From a management point of view, there are also problems:
1. Why would your own team's code be a black box? I generally require programmers who call an interface they did not write to read at least three layers of the implementation code, whether it is someone else's code or open source code. If you want to modify code, you need to look up all the call relationships and read the code, making sure you do not fix a bug on one call path and break another call path. If the code really is a black box (my projects basically have none), then unless it comes with sound technical support or a good reputation, every call to it must be logged before and after. Violators, once found, get verbally shamed.
2. It is not that nobody should be held responsible, but the boss should not go straight to a programmer to make trouble. Someone who does not know the technology (or knows the technology but not the specific context) and criticizes a programmer directly will often start scolding as soon as he sees the surface symptom (such as the John Doe in the original poster's example). When the real cause comes out he cannot bring himself to lose face, and that only makes things worse for you. In the original poster's example, I imagine the scene went like this. The boss asks Zhang San: what is wrong with your voice module, why is it so slow? Zhang San says: I am not sure exactly, but I call John Doe's package, so it may have something to do with that (typical shirking of responsibility). Then the boss asks John Doe: what is going on with you? John Doe says: I am not sure exactly; can you tell me how to reproduce the error so I can debug it (a typically conscientious attitude)? The boss slaps a notebook in John Doe's face: go to hell, you are the black sheep, you pig-brained ... (100 words omitted). How can someone who points fingers like this manage a technical team?
Fortunately, the original poster said this was his previous company, so presumably he has escaped the misery. Congratulations.
=======================================
One more point ...
I originally thought that logs around method calls like this should really be at debug or trace level. At info level, wouldn't you be forced to turn off info-level logging in production? (If I call a black-box method 100 times in a loop I will see 200 duplicate log lines, and then with 50 concurrent users ...) But on reflection there is a reason for it: the log here is not for troubleshooting but for establishing responsibility at the first moment, and turning logging on only after an error is discovered is too late. So ugly management leads to ugly code ...
I like this kind of post. Here is my view.
From the technical point of view:
1. Internal calls can rely on conventions; the code is a white box to everyone, so avoid useless try/catch. For external calls, agree on business exceptions in advance as far as possible rather than guessing. For system-level exceptions, decide case by case whether to catch them and whether they need to be rethrown to the upper layer. Log information for external calls should be complete, so that problems in production are easy to locate.
2. Hound mode is a good way to make remote calls, and for service providers that frequently have problems, this kind of defensive coding makes sense. But writing every call that way would hurt too much.
From the management point of view:
1. After a problem, you should address the issue, not the person. Shouting and driving people out is the worst possible course. It leaves everybody trembling, and sooner or later morale falls apart. After a problem, find the deep-seated causes through analysis, and let the people involved share the case so that colleagues learn from it and the same mistake is not repeated. Kept up over time, this builds a body of "lessons learned" that everyone can review regularly to reduce the likelihood of similar errors.
2. At work, encourage everyone to write code boldly and not to fear mistakes; unreasonable code can be refactored in time.
While you are in a position to do so, try your best to change things. If you cannot change them, get out early.
This should be used together with an interceptor mechanism. Otherwise, setting a trap at every call site for the other side's exceptions, deadlocks, or subtler hidden problems costs too much.
1. To be truly safe you should use try/catch (Throwable ex); catching only Exception still leaves a gap.
2. Even if someone has to be fired, it should be the tester rather than the developer.
3. A company like this is evidently one with only a few people, where development and testing are very likely combined.
4. Software like this is not banking software at all. It is the kind of small fee-collection application used by mobile, telecom, and broadcasting operators. The usual names are BSS, BOSS, payment platform, call center, bank collection, bank withholding, and so on. Such software does not count as financial software.
5. It is unlikely that a small-amount payment channel could lose 2 million in one go. ATMs can only dispense tens of thousands a day, and more than 100,000 requires an appointment with the bank. Online banking and telephone banking transfer limits are nowhere near 2 million. Amounts that high must be handled at the counter.
6. I do not understand what "the message was modified" means. If the documentation states that the amount returned by the other party prevails, then you simply recharge by that amount. In terms of process, this 2 million recharge still goes to some customer's ledger. The money is only debited to the telecom operator, and the operator just modifies the customer's ledger amount (the credit balance). In other words: the bank deducts 2 million and pays it to the operator, and the operator merely increases the customer's credit. The fix is for the operator to return the 2 million to the bank and adjust the customer's balance back. The 2 million is not a loss; the loss is a carton of cigarettes and a bottle of Maotai.
8. I think the real situation was that the customer was clearly only debited 200 yuan by the bank, and someone changed the ledger amount to 2 million, so that the customer appeared to have been charged 2 million. In that case a single SQL statement can change the data back.
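Point 1 in the list above can be illustrated with a short sketch. The class and method names are hypothetical; the only fact relied on is that in Java, `Error` subclasses such as `NoClassDefFoundError` extend `Throwable` but not `Exception`.

```java
final class CatchScope {
    private CatchScope() {}

    /** Guards a black hole call with catch (Exception): an Error escapes this guard. */
    static String guardWithException(Runnable blackHole) {
        try {
            blackHole.run();
            return "ok";
        } catch (Exception e) {
            return "caught " + e.getClass().getSimpleName();
        }
    }

    /** Guards a black hole call with catch (Throwable): an Error is caught as well. */
    static String guardWithThrowable(Runnable blackHole) {
        try {
            blackHole.run();
            return "ok";
        } catch (Throwable t) {
            return "caught " + t.getClass().getSimpleName();
        }
    }
}
```

If the called jar fails with something like a `NoClassDefFoundError` because a dependency is missing, only the second guard gets the chance to log it. The trade-off is that `Throwable` also covers errors such as `OutOfMemoryError`, after which continuing normally may not be safe, so logging and rethrowing is usually the most such a handler should do.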
Finally, third-party integrators really are to be pitied.
Out in the Jianghu: How to Protect Yourself with Code (repost)