We are often asked what the most common problems are and how to solve them. As you might imagine, we (Cloud Network) are the world's largest outsourced operations provider for large Internet systems, so we have run into almost every problem there is.
We run games, e-commerce, mobile communications, advertising, finance, social networking, travel, and many other kinds of websites, so we have encountered all sorts of problems on all sorts of platforms. With systems in more than 10 countries, millions of users, and thousands of transactions per second, system problems and crashes are hard to avoid.
However, some problems come up, and get dealt with, again and again. They fall into four broad categories: reliability, system performance and scalability, system security, and cost.
Reliability issues
Reliability problems have many causes: overload, code defects, server crashes, database problems, bandwidth, hardware, cloud issues, CDNs, and data center problems. We have also seen many problems caused by system updates that were applied without adequate testing. These are man-made, introduced by programmers, content editors, game developers, and even our own staff.
Over the long run, our biggest and most frequent problem is disk space. The number of clients has soared and so has the volume of logs. No matter how much disk space you provide, data and transaction processing will eventually use it up. So, like other system administrators, we do everything we can to add disks and increase storage. Fortunately, today's 3 TB disks are really big; unfortunately, data files are also very large and cloud storage is expensive. So we regularly receive low-disk-space notices and, depending on the customer's needs, clean up storage manually or automatically.
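As a rough illustration of the kind of check that produces such notices, here is a minimal Python sketch. It is not our production tooling: the function names and the 85% threshold are hypothetical, chosen only for the example.

```python
import shutil


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of a volume that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def check_disk(path: str = ".", warn_at: float = 0.85):
    """Return (needs_cleanup, fraction_used) for the volume holding `path`.

    `warn_at` is a hypothetical alert threshold, not a recommended value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction = used_fraction(usage.total, usage.used)
    return fraction >= warn_at, fraction


if __name__ == "__main__":
    needs_cleanup, fraction = check_disk(".")
    print(f"{fraction:.0%} used, cleanup needed: {needs_cleanup}")
```

A check like this only raises the flag. Whether the cleanup that follows is manual or automated still depends on what the customer allows.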
Databases are another common source of problems, ranging from overload to routine replication failures. Customers often misunderstand replication, along with its requirements and its effects, and so create problems for themselves. We address these issues continually, including with new detection, monitoring, and management tools that keep systems running normally and data accurate. This work is becoming more and more important, because data is increasingly critical and tied to money in the e-commerce and advertising industries.
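To make the replication monitoring concrete, here is a small Python sketch of a health check. It assumes a MySQL-style replica whose status has already been fetched into a dictionary (for example from `SHOW SLAVE STATUS`, whose fields include `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master`). The function name and the 60-second lag limit are hypothetical.

```python
def replication_problems(status: dict, max_lag_seconds: int = 60) -> list:
    """Return a list of human-readable problems; an empty list means healthy.

    `status` is assumed to be one row of MySQL's SHOW SLAVE STATUS as a dict.
    """
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when replication is not running
        problems.append("lag unknown (replication may be stopped)")
    elif lag > max_lag_seconds:
        problems.append(f"replica is {lag}s behind")
    return problems


if __name__ == "__main__":
    sample = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "No",
        "Seconds_Behind_Master": None,
    }
    for problem in replication_problems(sample):
        print(problem)
```

Note that a replica can report both threads running and still be far behind, which is why the lag is checked separately.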
Other reliability issues involve PHP, Java, Django, and similar stacks, along with the system crashes and other problems that we monitor, manage, and resolve day to day. In China in particular, the main problem we deal with every day is bandwidth: sometimes good, sometimes bad, then back to normal. In some regions of China the bandwidth problem never goes away: the connection is fine one second and gone the next. Restoring a connection today usually involves at least the data center and the telecom carrier, and requires establishing exactly which link, between which two points, has failed.
System performance and scalability issues
System performance problems include overload, where CPU, RAM, and I/O are all heavily used, and large numbers of users logging on to a customer's site at the same time on the same day, which causes all kinds of problems. On the open Internet, everything is hard to predict.
Frequently encountered problems are:
Poorly written PHP code that suddenly drives up the load until the system runs out of CPU; programs that consume large amounts of memory until RAM runs out; and poor SQL with missing indexes, which leaves the database crashing, unable to handle concurrent requests, locked up, or stuck waiting on I/O.
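The missing-index case is easy to demonstrate. The Python sketch below uses an in-memory SQLite database purely as a stand-in for a production database, and the `orders` table and index name are invented for the example. It asks the database for its query plan before and after an index is added.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[-1] for row in rows)  # last column is the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

lookup = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index, the database has to read every row of the table.
plan_before = query_plan(conn, lookup, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index, it can jump straight to the matching rows.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On a table of a thousand rows the difference is invisible, but on a table with millions of rows and many concurrent users, the full scan is what overloads the database.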
Scaling problems differ from the others: to cope with transaction growth over the coming days, weeks, or months, the system has to be built out or extended quickly. Because the system architecture usually does not take this into account, load balancing is done badly or not at all, or there are no portable PHP/Java sessions, so the load cannot be spread evenly across servers.
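What a "portable" session means can be shown with a toy Python sketch. Everything here is hypothetical: `SharedSessionStore` is an in-process stand-in for whatever external store a real deployment would use, and `AppServer` stands for one web server behind the load balancer. The point is only that session data lives outside any single server.

```python
import copy


class SharedSessionStore:
    """Stand-in for an external session store that every server can reach."""

    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        # Deep copy mimics reading serialized data from a remote store
        return copy.deepcopy(self._sessions.get(session_id, {}))

    def save(self, session_id, data):
        self._sessions[session_id] = copy.deepcopy(data)


class AppServer:
    """One web server; it keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item=None):
        session = self.store.load(session_id)
        cart = session.setdefault("cart", [])
        if item is not None:
            cart.append(item)
        self.store.save(session_id, session)
        return cart


store = SharedSessionStore()
servers = [AppServer("web1", store), AppServer("web2", store)]

# Round-robin: consecutive requests from one user land on different servers.
for i, item in enumerate(["book", "pen", "lamp"]):
    servers[i % len(servers)].handle("user-123", item)

cart_seen_by_web2 = servers[1].handle("user-123")
print(cart_seen_by_web2)
```

If each server kept sessions in its own memory instead, the load balancer would have to pin every user to one machine, which is exactly what prevents the load from being balanced.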
Customers often come to us saying that their system has hit a "bottleneck": it was running well one minute and then, one day, suddenly collapsed under load. In theory this should not happen, but if the monitoring software is poor and cannot show how close the system is to its limits, it will happen frequently. Unfortunately, the user experience at 95% CPU usage is very different from the experience at 100%: at 95% the system may run a bit slowly, while at 100% it does not work at all.
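One way to see why the last few percent matter so much is a textbook queueing model. The Python sketch below uses the M/M/1 formula (mean response time = service time / (1 - utilization)), which is our choice of simplification and not a measurement of any real system. The 10 ms service time is an arbitrary assumption.

```python
import math


def mean_response_time(service_time: float, utilization: float) -> float:
    """Mean response time in an M/M/1 queue: service_time / (1 - utilization)."""
    if utilization < 0.0:
        raise ValueError("utilization must not be negative")
    if utilization >= 1.0:
        return math.inf  # saturated: the queue grows without bound
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Assume each request needs 10 ms of work on an idle system.
    for pct in (50, 80, 90, 95, 99, 100):
        t = mean_response_time(0.010, pct / 100)
        print(f"{pct:3d}% busy -> {t * 1000:.0f} ms per request")
```

Under this model the response time climbs slowly at first and then very steeply as utilization approaches 100%, which is why monitoring needs to show headroom, not just whether the system is up.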
System security issues
System security has always been a challenge. Our own systems are generally secure, but our customers run code that is not, and add-on tools such as Chanel and management interfaces such as phpMyAdmin are not secure either. So the moment we let our guard down, attackers get their chance. As the proverb says, "A dike a thousand li long can be breached by a single ant hole."
Fortunately, our security is multi-layered and permissions are assigned on a least-privilege basis, so serious damage to system security is rare. Occasionally a system is compromised, though, and we then have to clean it, change credentials, add customer logging and security monitoring, and so on. Sometimes we run audits to see whether intruders are present and where they are hiding.
Cost issues
Finally, a problem we often meet is how to save money. This is not a technical problem as such, but we often find that customers spend a great deal, and sometimes far too much, on systems and servers. They buy a lot of servers because the system is slow and they do not know how to tune or debug it, or how to virtualize it and move it to a private cloud.
Here we can scale the system by tuning it rather than buying new hardware, or expand it more economically by building a private cloud. Either approach can save our customers a lot of money.