In the article "Architect Portrait," I mentioned some of the mistakes I have made in system design. I think they are worth a closer look, so this article reviews the system designs I have done over nearly eight years and picks out the bigger, more painful mistakes (many of which ended with the design being thrown out and redone). In those eight years I have mainly worked on three foundational technology products and three large technical projects spanning three years, two of which are still in progress. Looking back, the big mistakes are largely concentrated in the earlier years, so I can say with some pride that my control over system design has matured a lot in the last few years.
1st mistake.
When designing the service framework, I wanted it to be completely non-intrusive to users, so I designed an external XML file that described which Spring beans should be published as services. After the release, the first guinea-pig users used it reluctantly and found it very awkward, but put up with it. When it came to deployment, two problems appeared: developers did not know where the XML file should be put, and at release time nobody knew where to fetch it from.
The key error in this design was that it did not consider the impact on the development phase and the operations phase. The fix was to remove the XML file and write a Spring FactoryBean instead, which users configure in Spring's bean configuration file.
Therefore, an architect must consider a design comprehensively, across every phase in which it will be used.
2nd mistake.
When the service framework went live in a core application, a front-end web application came under high load and ran out of processing threads. The failure was handled by rolling back the service framework release. Troubleshooting took a relatively long time. The cause turned out to be that JBoss Remoting, which the service framework used, had a default communication timeout of 60s, so slow requests tied up the processing thread pool of the front-end web application.
On the surface, the cause was simply that the timeout for distributed calls was too long. The deeper problem was in the technology selection: I chose JBoss Remoting for the service framework without fully understanding how it behaves at runtime. This mistake led to the decision to drop JBoss Remoting and rewrite the communication layer of the service framework on top of Mina, which delayed the release of a usable version of the service framework by two months.
Therefore, an architect must have a firm grasp of the technical details of whatever technology is selected.
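To make the timeout lesson concrete, here is a minimal sketch of a caller that sets its own short deadline on a remote call instead of inheriting a library default such as 60s. The class name BoundedCall, the fallback value, and the use of a plain ExecutorService are assumptions for illustration; this is not how the service framework or JBoss Remoting actually implements timeouts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a remote call with an explicit, caller-chosen timeout.
final class BoundedCall {
    private final ExecutorService pool;

    BoundedCall(ExecutorService pool) {
        this.pool = pool;
    }

    // Returns the call's result, or the fallback if it does not finish in time.
    <T> T call(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread is released
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("remote call failed", e.getCause());
        }
    }
}
```

With a deadline of a few hundred milliseconds, a slow downstream service costs the web application one short wait per request rather than a thread held for a full minute.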
3rd mistake.
Around the 4th version of the service framework, the communication protocol needed some changes, and it suddenly turned out that the protocol carried no version number. The sad result was a very ugly hack in the code to determine whether a message used the new version or the old one.
This design error is obvious, and it could have been avoided simply by looking at a few of the many existing communication protocols when first designing ours. The fix was equally simple: redesign the protocol with reference to some classic protocols.
Therefore, breadth of knowledge is very important for an architect, and so is giving some thought to the future at design time.
On the subject of the protocol: when designing the communication protocol and choosing the serialization/deserialization scheme, I also did not fully consider future multi-language support, which left us in a very passive position once multi-language scenarios arrived. This too came from a lack of foresight. Being forward-looking does not mean solving every future problem in advance. It means leaving a way to solve it later without reworking the whole system, and that is also important for an architect.
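A minimal sketch of what a version field buys, assuming a hypothetical frame layout of a 1-byte magic number, a 1-byte version, a 4-byte body length, and then the body. The layout, the magic value, and the class name Frame are illustrative assumptions, not the framework's real wire format.

```java
import java.nio.ByteBuffer;

// Hypothetical frame layout: 1-byte magic, 1-byte version, 4-byte body length, body.
final class Frame {
    static final byte MAGIC = (byte) 0xAB;

    final int version;
    final byte[] body;

    Frame(int version, byte[] body) {
        this.version = version;
        this.body = body;
    }

    // Writes the header first so a reader can dispatch on the version.
    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(6 + body.length);
        buf.put(MAGIC).put((byte) version).putInt(body.length).put(body);
        return buf.array();
    }

    // Reads the version before touching the body and rejects foreign frames.
    static Frame decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.get() != MAGIC) {
            throw new IllegalArgumentException("not a framework frame");
        }
        int version = buf.get() & 0xFF;
        byte[] body = new byte[buf.getInt()];
        buf.get(body);
        return new Frame(version, body);
    }
}
```

Because the version is read before the body, a later protocol revision can change the body encoding and the reader can choose the right decoder from the header alone.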
4th mistake.
After the service framework switched to Mina, a problem appeared whenever the applications publishing services were restarted: after the restart, load across the machines in the cluster was severely uneven. The cause was that service callers connected to service publishers through a hardware load balancer, over a single long-lived connection. Because the connection went through the hardware load balancer, the caller saw only one address. When a publisher restarted, callers reconnected to the machines that were still alive and then stayed on those long-lived connections, which left the load unbalanced.
This design error came mainly from not considering how a hardware load balancer in production interacts with a single long-lived connection. It was not easy to correct. The temporary fix was for the service caller to disconnect and rebuild its connection automatically every 10,000 requests. The final solution was to remove the load balancer from the middle.
So for an architect, thinking through the whole picture at design time is very important. What I generally do now is mentally walk through how the system will behave once it is live. Running through it once in my head makes these issues much easier to spot.
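The temporary fix can be sketched as a small client-side wrapper that hands out a connection and rebuilds it once a request budget is spent. The class name RecyclingConnection and the Supplier-based connector are assumptions for illustration, and a real client would also close the old connection before replacing it.

```java
import java.util.function.Supplier;

// Hypothetical wrapper: rebuilds the connection after a fixed number of requests
// so that the load balancer gets a chance to pick a different machine.
final class RecyclingConnection<C> {
    private final Supplier<C> connector; // opens a new connection through the load balancer
    private final int maxRequests;       // e.g. 10000, as in the temporary fix
    private C current;
    private int used;

    RecyclingConnection(Supplier<C> connector, int maxRequests) {
        this.connector = connector;
        this.maxRequests = maxRequests;
    }

    // Returns the connection for the next request, reconnecting when the budget is spent.
    synchronized C acquire() {
        if (current == null || used >= maxRequests) {
            current = connector.get(); // a real client would close the old connection here
            used = 0;
        }
        used++;
        return current;
    }
}
```

Periodic reconnection only spreads load gradually, which is consistent with it being a stopgap until the load balancer was removed.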
5th mistake.
After the service framework had been live for more than a year, one version turned out to have a serious bug, and we wanted to tell the applications using that version to upgrade urgently. At that point we sadly discovered that we did not know which applications and machines in production were running that version, and we had to resort to a temporary scan of every machine on the network.
This was later corrected by having service publishers and callers report the version number of the service framework they use when they connect to us, so it became very simple to know which versions of the service framework are running across the whole network.
Therefore, comprehensiveness at design time is very important for an architect, and mentally walking through the live system helps significantly.
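As a rough sketch of the correction, assuming a hypothetical in-memory registry on the side that publishers and callers connect to, each connection reports its framework version and the registry can then answer which hosts run a given version. The class and method names are illustrative, not the framework's API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of which host runs which service framework version.
final class VersionRegistry {
    private final Map<String, String> versionByHost = new ConcurrentHashMap<>();

    // Called when a publisher or caller connects and reports its framework version.
    void onConnect(String host, String frameworkVersion) {
        versionByHost.put(host, frameworkVersion);
    }

    // Lists every host currently known to run the given version, in sorted order.
    Set<String> hostsRunning(String frameworkVersion) {
        Set<String> hosts = new TreeSet<>();
        for (Map.Entry<String, String> e : versionByHost.entrySet()) {
            if (e.getValue().equals(frameworkVersion)) {
                hosts.add(e.getKey());
            }
        }
        return hosts;
    }
}
```

With this in place, finding the machines affected by a bad release becomes a lookup rather than a scan of the whole network.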
6th mistake.
A foundational product like the service framework has a big problem at release time: every user has to be notified to redeploy, so the whole release cycle becomes quite long. We decided to invest resources in fully dynamic releases, meaning upgrades that need no restart. Only once we were well into it did we find that this was a bottomless pit. Two people worked on it for nearly six months before we finally decided to give up, and in the end the upgrade problem turned out not to be that big.
The biggest mistakes here were insufficient control over the details and slow decision-making.
So for an architect, control of the technical details is important, and so is decisiveness.
7th mistake.
Service publishers often run into the following problem: some methods in a service are resource-intensive, while others are lightweight but very important to the business. In some scenarios the resource-intensive methods receive so many requests that the lightweight ones are affected. Splitting the service into several services would make the development phase painful, so the service framework decided to provide method-level, layer-7 routing. The publisher of a service writes a rule file that divides the production machines into groups by method, so that calls to different methods are routed to different machines.
This feature works well in some scenarios, but as time passed and people moved on, fewer and fewer people could maintain that rule file, and it caused more and more problems.
To this day I still consider this feature debatable. I do not know whether it is good or bad...
Therefore, comprehensiveness at design time is very important for an architect.
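The method-level routing described above can be pictured with a minimal sketch in which each rule maps a method name to a group of machines and unlisted methods fall back to a default group. The class name MethodRouter and the in-memory rules are assumptions. The original rule file format is not described here, so this only illustrates the idea.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical method-level router: each rule pins one method to a machine group.
final class MethodRouter {
    private final Map<String, List<String>> machinesByMethod = new HashMap<>();
    private final List<String> defaultMachines;

    MethodRouter(List<String> defaultMachines) {
        this.defaultMachines = defaultMachines;
    }

    // One rule: calls to this method go only to the given machine group.
    void addRule(String method, List<String> machines) {
        machinesByMethod.put(method, machines);
    }

    // Machines a caller may pick from when invoking this method.
    List<String> candidates(String method) {
        List<String> machines = machinesByMethod.get(method);
        return Collections.unmodifiableList(machines != null ? machines : defaultMachines);
    }
}
```

Even in this toy form the maintenance cost is visible: the rules name concrete machines and methods, so they go stale whenever either changes.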
8th mistake.
As the service framework was used more and more widely, a prominent problem emerged: the versions of the jars the service framework depends on conflicted with the versions the applications depend on. As a general-purpose technology product, the service framework basically cannot change its own jar versions to suit one application. I thought for a long time about how to solve this.
Probably because of my earlier OSGi background, I decided to introduce OSGi: put the service framework's jars in a separate classloader, isolated from the application's own, so that jar conflicts could be avoided. After I made that decision, a senior colleague on the team took on the work and spent nearly two months on getting the Maven development environment to work with OSGi. I then decided to step in myself. Even though I was more familiar with OSGi, it still took me almost a month to migrate the development environment, the project structure, and the existing code to the OSGi structure. It went live, the effect looked good, and it met expectations.
Later, however, as new developers joined the service framework team, most of the newcomers had to invest a lot of time in learning the OSGi development model and found it hard to adapt to. So when other teams asked whether they should introduce OSGi, I basically recommended against it. The main reason is OSGi's impact on the familiar development model and on troubleshooting. It is only worth it when classloader isolation and dynamic behavior are both clearly needed.
If I could make the decision again, I would not introduce OSGi. I would write a simple classloader isolation strategy to solve the jar version conflicts and keep the familiar development model.
Therefore, comprehensiveness at design time is very important for an architect.
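One common form of "simple classloader isolation" is a child-first classloader, sketched below under the assumption that the framework's jars are passed in as URLs. The class name FrameworkClassLoader is hypothetical, and this is one possible strategy rather than what the team actually built.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical isolating loader: looks in the framework's own jars first and
// only then delegates to the parent (application) classloader.
final class FrameworkClassLoader extends URLClassLoader {
    FrameworkClassLoader(URL[] frameworkJars, ClassLoader parent) {
        super(frameworkJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            // Core JDK classes must always come from the parent chain.
            if (c == null && !name.startsWith("java.")) {
                try {
                    c = findClass(name); // the framework's jar wins over the application's copy
                } catch (ClassNotFoundException ignored) {
                    // not in the framework jars, fall through to the parent
                }
            }
            if (c == null) {
                c = super.loadClass(name, resolve); // standard parent-first lookup
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The framework would be bootstrapped through this loader while application code stays on its normal classpath, so each side sees its own version of a shared library. Types that cross the boundary between the two still have to come from a single loader.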
9th mistake.
Once the service framework was in very wide use, the team was regularly plagued by one problem: business teams often hit errors or timeouts when calling services and asked the service framework developers to help troubleshoot. This is complicated, because a service call is usually multi-layered. It is not a simple A->B call. Often it is A->B->C->D or deeper, and the timeout or error may be in any one of the links, so troubleshooting is very troublesome.
As this problem grew, I remembered that around 2009 a colleague on the team had read Google's paper on Dapper and built something similar. After it went live we never quite understood what to use it for. Only as the troubleshooting problem became more and more serious did we gradually realize that it could be a great help.
At this stage the main difficulty was not technical but how to get everyone upgraded to the new version, which took a long time. Once it was live, we found a new problem: even though the service framework could trace calls, services also call out to databases, caches, and so on, and problems in those places remained invisible, so troubleshooting was still hard. For tracing to be really effective it has to run through all systems, and achieving that cost many teams several years of effort.
Therefore, comprehensiveness and foresight at design time are very important for an architect. Had the importance of tracing been considered at the start, hooks could have been left in the right places, and completing it later would not have been so complicated.
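The core of such tracing is a small context passed along with every call. The sketch below assumes a hypothetical TraceContext with one trace ID per request and hierarchical span IDs such as "0.1.2". The ID scheme and names are illustrative and are not taken from Dapper or from the team's implementation.

```java
import java.util.UUID;

// Hypothetical trace context carried with every call of one request.
final class TraceContext {
    final String traceId; // identical on every hop of the same request
    final String spanId;  // position of this hop in the call tree, e.g. "0.1.2"
    private int children; // number of downstream calls made from this hop

    private TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    // Starts a new trace at the entry point of a request.
    static TraceContext start() {
        return new TraceContext(UUID.randomUUID().toString(), "0");
    }

    // Creates the context to send with the next downstream call
    // (another service, a database, a cache, and so on).
    synchronized TraceContext child() {
        children++;
        return new TraceContext(traceId, spanId + "." + children);
    }
}
```

For a chain A->B->C, A starts the trace, passes `child()` to B, and B passes its own `child()` to C. Any hop that does not forward the context breaks the chain, which is why coverage has to extend to databases and caches as well.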
10th mistake.
Service publishers sometimes see a service being called before it is fully ready. A second situation is that a publisher has a problem and wants to preserve the scene for troubleshooting, but the service keeps being called, so there is no way to keep the scene intact and investigate slowly.
Both situations arise because the service framework was designed to connect to a central hub after startup: once the heartbeat succeeds, callers can call the service, and once the heartbeat fails, calls stop. This looks nicely automated, but it means the ability to bring a machine online or take it offline through external control is weak.
This design error came mainly from a lack of comprehensive consideration at design time.
Therefore, comprehensiveness at design time is important for an architect.
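One way to picture stronger external control is to keep liveness and routability as separate flags, as in the minimal sketch below. The class name ServiceInstance and the explicit enabled switch are assumptions for illustration, not the framework's actual design.

```java
// Hypothetical registry entry: a heartbeat proves the process is alive,
// while a separate switch decides whether callers may be routed to it.
final class ServiceInstance {
    private boolean heartbeatOk;
    private boolean enabled; // set by the application when ready, or by an operator

    // Records the result of the latest heartbeat.
    synchronized void heartbeat(boolean ok) {
        heartbeatOk = ok;
    }

    // Brings the instance online or takes it offline without stopping the process.
    synchronized void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    // Callers may use this instance only when it is both alive and enabled.
    synchronized boolean routable() {
        return heartbeatOk && enabled;
    }
}
```

Under this split, a freshly started instance receives no traffic until it declares itself ready, and a misbehaving one can be taken out of rotation while the process stays up for investigation.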
11th mistake.
One year, a few colleagues and I decided to replace the Xen-based approach we were using at the time with a lightweight "virtual machine" approach, in order to increase the density of applications running on a single machine. We decided to build the lightweight virtual machine ourselves, and the decision was to run the processes directly on the machine. We then ran into a lot of problems: the operations systems expected to be able to ssh into each "machine", give it an independent IP, see its own system metrics, and so on. To solve these we used a great many hacks, which was painful. Sadder still, we thought the remaining problems were few, so we put some machines into service in this mode, and only then found that far too many problems needed hacks. Then a colleague suggested that we try LXC, and we found that many of the problems we had hacked around simply disappeared. So we decided to switch to LXC, and that batch of machines had to be brought online all over again.
The main mistake in this design was that our knowledge was not broad enough, so we made a wrong decision and reinvented the wheel.
Therefore, breadth of knowledge is very important for an architect, and this shows most clearly in technology selection.
12th mistake.
The same product had a requirement to limit disk space while supporting a certain degree of overselling. The approach at the time was to use an image file to enforce the disk space limit. It ran for a while without apparent problems, so it was rolled out more widely. After it had been running at scale for a long time, a problem appeared: physical machines frequently raised alarms for insufficient disk space, and deleting files inside the LXC containers did not help, because once an image has occupied space it keeps that size. It only expands and never shrinks.
This was a real headache. The only option was to delete files and rebuild the image, which requires enough free space on the physical machine. Even with enough space the operation is frustrating, because you have to stop the container and cp its files into a newly created one, which takes quite a while when there is a lot of data.
We eventually concluded that this approach was unworkable and looked for a new solution that met both requirements, limiting disk space and allowing overselling. After a long struggle we finally found a more reliable one.
The main error was again that the technical solution was chosen without thinking it through: not enough attention to detail and not enough breadth of consideration. Replacing the image scheme afterwards came at a great price. As I recall, a group of people worked through many nights to get it done.
So for an architect, the breadth of knowledge, the mastery of technical detail, and the comprehensiveness of design are all important.
13th mistake.
Still with the same product: in production we suddenly hit a case where one virtual machine created too many threads and other virtual machines could no longer create threads either (not because of insufficient physical resources). The cause was that although LXC lets each container run an account with the same name, those same-named accounts share the same UID, and max processes is limited per UID. So when one virtual machine creates too many threads, it also affects the other containers that use the same account.
To some extent I consider this a design problem too: at design time the details were not well enough understood and the consideration was not thorough, so this point was overlooked.
Therefore, it is important for an architect to master the technical details and the comprehensiveness of the design.
14th mistake.
Three years ago I worked on a very large project. Just as it was about to go live, we suddenly found that a key point had been missed. We had to hurriedly discuss what to do, the change was very large, and the launch had to be postponed. I remember emergency weekend overtime to deal with it, and in the end we went live with higher risk.
The main reason was that this key point was missed in the overall design. It was not ignored entirely. Rather, a mistake in the technical details led us to think that the required change was small.
So for an architect, control of the technical details is important. That does not mean the architect must know every detail personally, but the architect should know who the right person is for each point.
Those are the 14 mistakes I have made in system design.