Cloud computing is more than fast self-service access to virtual infrastructure. Developers and system administrators are looking for ways to monitor and manage cloud computing at scale. This article is part of an InfoQ series on tools and ideas for automating dynamic pools of computing resources.
Early cloud computing deployments typically involved only a handful of servers, used by one or two employees for a specific purpose. Today, however, public cloud adoption is increasingly widespread, and different employees across the enterprise use a vast array of functions spanning all cloud service models (IaaS, PaaS, SaaS).
From early-stage start-ups to the world's largest corporations and government departments, more and more organizations are expanding their use of public cloud services, and the problems of running cloud computing at scale are beginning to show.
Potential problems with the public cloud at scale
Although public cloud adoption has undoubtedly delivered remarkable results for enterprises of all shapes and sizes, it has also brought many new challenges and risks. The most important are the following:
Cost
In the beginning, when only a few people have limited access, costs are relatively easy to track. However, as more and more individuals from different departments gain access, the expected savings are likely to be eroded by overlapping functions, over-provisioning, unauthorized purchases, unused zombie instances, excess bandwidth and storage costs, and other unnecessary spending.
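To make the "zombie instance" cost concrete, here is a minimal sketch of how idle instances might be flagged and their monthly cost estimated. The `Instance` record, its field names, the 30-day idle threshold, and the 730-hours-per-month figure are all assumptions made for illustration; a real inventory would come from the cloud provider's own billing and monitoring data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    # Hypothetical inventory record; field names are illustrative only.
    instance_id: str
    owner: str
    hourly_cost: float
    last_activity: datetime


def find_zombies(instances, now, idle_after=timedelta(days=30)):
    """Return instances with no recorded activity for longer than idle_after."""
    return [i for i in instances if now - i.last_activity > idle_after]


def monthly_waste(zombies, hours_per_month=730):
    """Estimate the monthly cost of leaving the given instances running."""
    return sum(i.hourly_cost * hours_per_month for i in zombies)


# Made-up sample data, not taken from any real account.
now = datetime(2024, 6, 1)
inventory = [
    Instance("vm-001", "marketing", 0.10, datetime(2024, 5, 30)),
    Instance("vm-002", "research", 0.50, datetime(2024, 2, 1)),
    Instance("vm-003", "research", 0.25, datetime(2024, 4, 15)),
]

zombies = find_zombies(inventory, now)
print([z.instance_id for z in zombies])
print(monthly_waste(zombies))
```

Running a check like this on a schedule, and grouping the result by `owner`, is one way to keep cost tracking manageable as the number of departments with access grows.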
Unauthorized access
It's easy to manage access to public cloud services on a small scale, but as the number of users grows, things quickly get out of hand. Departing employees may still have access after they leave, permissions are not updated when employees change roles, and new employees struggle to gain access to the resources they need. Because many cloud vendors fail to provide enterprise-class security, we can quickly become victims of unauthorized access as use of the public cloud grows.
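The three access failures described above (leavers who keep access, permissions that lag behind role changes, and newcomers who lack access) can be sketched as a simple comparison between what is granted and what each person's current role should allow. The function name, the data shapes, and the permission strings below are hypothetical and exist only for illustration; they do not correspond to any particular vendor's access model.

```python
def review_access(grants, roster, role_permissions):
    """Compare granted permissions with what each user's current role allows.

    grants:           user -> set of permissions currently granted
    roster:           user -> current role (people who have left are absent)
    role_permissions: role -> set of permissions that role should have
    """
    findings = {"orphaned": {}, "excess": {}, "missing": {}}

    for user, granted in grants.items():
        if user not in roster:
            # The user has left but still holds permissions.
            findings["orphaned"][user] = set(granted)
            continue
        excess = granted - role_permissions[roster[user]]
        if excess:
            # Permissions left over from a previous role.
            findings["excess"][user] = excess

    for user, role in roster.items():
        missing = role_permissions[role] - grants.get(user, set())
        if missing:
            # Permissions the current role needs but the user lacks.
            findings["missing"][user] = missing

    return findings


# Made-up sample data: alice has left, bob changed roles, carol is new.
grants = {
    "alice": {"billing:read"},
    "bob": {"vm:start", "vm:delete"},
}
roster = {"bob": "developer", "carol": "developer"}
role_permissions = {"developer": {"vm:start", "logs:read"}}

report = review_access(grants, roster, role_permissions)
for kind, users in report.items():
    print(kind, users)
```

The point of the sketch is that the check is mechanical once the roster and the role definitions exist in one place; the hard part at scale is keeping those two inputs current.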
Infiltration
Infiltration of cloud services by malicious outsiders is an even more serious problem than controlling employee permissions. Lost passwords, shared user IDs, data leaks, weak passwords, social engineering, phishing, and malware all expose public cloud services to data loss, data manipulation, attacks, denial of service, and other malicious infiltration.
Human error
When cloud services are relatively small, they can be easily managed by individuals, but as they scale up, we cannot always keep pace by adding new employees. This means fewer people have to do more work, and by the law of averages, someone will eventually make a mistake. Although this problem is not limited to cloud computing, in the cloud a single mistake can in turn cause large-scale failures.
Visibility
When you have only a few carefully managed services, one or two people can keep track of where they are deployed, how they are configured, how they are paid for, how they are used, who uses them, what they do, how to fix them, when to turn them off, how to recover them, and so on. But at scale, cloud usage becomes increasingly opaque as public cloud deployments spread and access is opened up to more use cases.
Troubleshooting
One consequence of poor visibility is that locating problems becomes significantly harder. For example, if you cannot see where a system is running or which other services it connects to, it is almost impossible to find where a transaction flow is slowing down. W. Edwards Deming, a leader in systems thinking, is often credited with saying "we cannot manage things that cannot be measured"; by the same token, we cannot manage transactions that we cannot see.
Auditability
Poor visibility has another side effect: as more systems and services are abstracted into the cloud, tracking who accessed what, when, and how, and then auditing that record, becomes a serious problem. Tracking, recording, and reviewing access, modifications, failures, exposure, usage, and so on in a large-scale environment can be extremely difficult without tools that automate the process.
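As a rough illustration of what "who accessed what, when, and how" means as data, the sketch below records audit events and filters them. The `AuditEvent` type, its fields, and the `query` helper are assumptions made for this example rather than any real product's schema; a production system would also need tamper-resistant storage and retention rules, which are out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEvent:
    # Hypothetical event shape: who did what to which resource, and when.
    timestamp: datetime
    user: str
    action: str      # e.g. "access", "modify", "fail"
    resource: str


def query(events, user=None, action=None, resource=None, since=None):
    """Return events matching every filter that is not None, oldest first."""
    matches = [
        e for e in events
        if (user is None or e.user == user)
        and (action is None or e.action == action)
        and (resource is None or e.resource == resource)
        and (since is None or e.timestamp >= since)
    ]
    return sorted(matches, key=lambda e: e.timestamp)


# Made-up sample events for illustration.
log = [
    AuditEvent(datetime(2024, 6, 1, 9, 0), "bob", "access", "db-prod"),
    AuditEvent(datetime(2024, 6, 1, 9, 5), "bob", "modify", "db-prod"),
    AuditEvent(datetime(2024, 6, 1, 8, 0), "carol", "access", "db-prod"),
    AuditEvent(datetime(2024, 6, 2, 7, 0), "carol", "fail", "vm-002"),
]

# Everyone who touched db-prod, in chronological order.
for event in query(log, resource="db-prod"):
    print(event.timestamp.isoformat(), event.user, event.action)
```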
Recoverability
Although serious downtime is not unique to cloud platforms, it seems that every week we hear dramatic stories of public cloud failures. Many cloud vendors, especially commodity services, do not build in recoverability, and even some more robust services may not provide real-time recovery or prioritize the business needs of their users. Without standby systems for backup, failover, and recovery, downtime can be a real disaster.