The concept of cloud computing originated at very large-scale internet companies such as Google and Amazon, and with the success of those companies, cloud computing, as their supporting technology, gained wide recognition and dissemination in the industry. Today cloud computing is generally regarded as the next phase of IT industry development and has been loaded with meaning at both the industry and product levels. Because the term carries so many meanings, the concepts around it are tangled, and many companies and practitioners each have a "cloud" of their own, much as Xu Zhimo wrote in his poem "Chance": "I am a cloud in the sky, by chance casting a shadow on the waves of your heart."
Traditional system design mainly assumed a stand-alone environment, whereas cloud computing's primary environment is the data center. In the move from the single machine to the data center, many design principles have changed radically; put starkly, the system design principles that held throughout the PC era no longer apply today.
Given the many meanings attached to cloud computing, datacenter computing may be a more precise term from a technical standpoint. This article discusses the technical fields of datacenter computing and the changes in its design principles. It is one view, offered for reference only.
Introduction to Cloud Computing
Since personal computers emerged in the 1980s, the computing power of the PC has grown steadily; on a single PC you can store all the data you need and process it yourself, for example writing documents or handling email. In the internet age, however, when an internet company provides a service, it uses far more data than any individual does, and storing and processing that data requires thousands of machines working in concert. Servers at this scale are not something an individual can provide; only large companies or organizations can own them, which seems to take us back to the mainframe era. From mainframe to PC to cloud: such is the spiral of computer technology's development.
In short, cloud computing uses system architecture technology to integrate thousands of servers and provide users with flexible resource allocation and task scheduling. A few keywords: first, very large scale, in the number of machines, users, and concurrent tasks; second, resource integration, whereby the resources of thousands of servers can be combined to do one thing, such as storing huge amounts of data or handling a single large task; third, flexible and fast delivery, whereby large-scale server resources can be flexibly provisioned, divided into virtual resource pools according to application demand, and quickly brought to bear on a large number of concurrent requests or jobs.
The advent of cloud computing technology has made the ability to collect and process data more powerful than ever. That ability can help us uncover the laws behind many seemingly unrelated events and use them to predict future developments, and combined with mobile and IoT technology it can better serve society and people's daily lives, in areas such as disaster warning, smart cities, and intelligent transportation. Data processing capability developed on the basis of massive data, the system architecture technology that supports it developed alongside, and the two gradually merged into the cloud computing technology we see today.
Combining system architecture and data processing technology, cloud computing can be divided, from the bottom up, into three levels: hardware infrastructure, software infrastructure, and data intelligence, as shown in Figure 1.
Figure 1 Cloud computing can be divided into three levels
Hardware infrastructure covers the design and implementation of servers, networks, and data centers; software infrastructure focuses on storage, computation, and large-scale distributed systems; data intelligence focuses on data warehousing, machine learning, and data analysis and visualization. It is worth noting that this three-level division takes technical fields as its starting point, whereas the commonly cited three levels of cloud computing, SaaS/PaaS/IaaS, are divided according to the resources and interfaces provided; the two are not the same dimension.
The current concept of big data can be seen as covering massive-data analysis techniques and the software architecture that supports them, that is, the technologies of software infrastructure and data intelligence. Both concern data, but with a difference: software infrastructure cares mainly about data format, volume, and access patterns, while data intelligence cares more about the semantics of the data.
Datacenter computing is the design of hardware and software systems from the perspective of architecture. Its technical areas and design principles are discussed below.
Datacenter Computing
Technical Areas and Challenges
As shown in Figure 2, datacenter computing includes technical areas such as storage, computation, real-time storage and computation, very-large-scale systems, hardware architecture, and data centers. The requirements on storage systems come from two dimensions: first, large volumes of unstructured data call for storage structures such as tables, objects, and files; second, different access patterns (read-only, write-rarely, balanced read/write, and so on) greatly affect how a storage system is designed and optimized.
Figure 2 Technical areas included in datacenter computing
The requirements and technical characteristics of computing systems are closely tied to the types of computing tasks. The representative of data-intensive computing is MapReduce, which places relatively balanced demands on CPU and I/O. Compute-intensive and communication-intensive tasks are both CPU-heavy but differ in the amount of data they access. If only a small amount of data is needed, the task is compute-intensive. If large amounts of data must be accessed, as in large matrix iterations, and memory limits force the data to reside on multiple machines, then the system bottleneck often shifts to communication latency, much as in traditional high-performance computing.
Ordinary storage and computing systems can support only a certain level of latency and concurrency; for more demanding requirements, real-time storage and computing systems built on memory are needed. Given the characteristics of memory, it is well suited to offering rich, semantically meaningful data structures on top of storage. On the basis of distributed data structures, models for streaming data processing and triggered event processing can be added to support applications such as real-time retrieval, OLAP, and pub/sub.
Very-large-scale systems mainly use distributed techniques to ensure availability and manageability, spanning system modeling, design, development, and operations. The hardware architecture area covers virtual machines, server design, and so on. The data center area, which includes rack design, network planning and design, and data center design and construction, focuses mainly on energy efficiency (PUE).
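The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. The function names here are illustrative, not from any framework; a real MapReduce system distributes the map, shuffle, and reduce phases across thousands of machines, which is precisely what makes it data-intensive rather than compute-intensive.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    """Classic word count expressed as map -> shuffle -> reduce."""
    return reduce_phase(shuffle(map_phase(documents)))
```

In a distributed setting each phase is I/O-bound as much as CPU-bound: map tasks read input splits, and the shuffle moves intermediate data over the network, which is why the model balances CPU and I/O demands.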
System Design Principles
Traditional hardware and software systems were designed mainly for single machines and individual users in a desktop environment; we may call this desktop computing. From the desktop to the data center, application characteristics and load models have changed dramatically.
On a single machine serving one user, several tasks may be running, divided into foreground and background tasks. Users are sensitive to the system's responsiveness, so foreground tasks usually take precedence over background tasks, while tasks of the same kind expect to be scheduled fairly. This is why preemptive scheduling ultimately won out over cooperative scheduling.
In the data center there are likewise two classes of application: online and offline. Online systems face users directly, while offline systems are used for data processing. An online system is usually a large-scale application serving a massive user base, and those users remain very sensitive to responsiveness. However, because the user base is huge and internet services are usually free, cost pressure is severe, so the system must fully exploit users' tolerance for response time. In general, people perceive an event response at around 500 ms, and this margin can be used to optimize scheduling and save resources. Under extreme load, when resources cannot satisfy every request, many systems first stretch their response times and then, as pressure persists, stop responding altogether and finally crash. In that situation, serving the requests within capacity normally and quickly rejecting those beyond it gives users a better experience and increases the system's availability. In the end, an online service system should take stable sustained throughput as its primary design goal, with a guaranteed latency threshold as the precondition.
Offline systems mainly serve data processing jobs. These jobs involve massive data, and users' expectations of responsiveness are not especially high, so processing efficiency matters more. Typically such jobs are merged and run in batches to raise the overall throughput of the system; that is, resource utilization becomes the primary scheduling goal.
In system design there are perennial tensions that must be traded off, such as latency versus throughput and fairness versus efficiency. In the desktop environment we chose low latency and fairness; in the data center environment we choose high throughput (or stable sustained throughput) and efficiency. Implementations differ accordingly: synchronous versus asynchronous models, threads versus event-driven designs, thread pools versus queues.
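The fast-rejection idea described above, often called load shedding, can be sketched as a bounded admission queue. The class and method names are illustrative; a production system would also track latency and shed load adaptively.

```python
import queue

class LoadSheddingServer:
    """Admit requests up to a fixed backlog capacity; reject the rest
    immediately instead of letting queueing delay grow without bound."""

    def __init__(self, capacity):
        self.backlog = queue.Queue(maxsize=capacity)

    def submit(self, request):
        try:
            self.backlog.put_nowait(request)
            return "accepted"
        except queue.Full:
            # Fast rejection: the client gets an immediate answer rather
            # than an ever-growing response time followed by a crash.
            return "rejected"
```

Bounding the backlog is what turns "maximum throughput until collapse" into stable sustained throughput: the server keeps serving admitted requests within the latency threshold while excess load is turned away cheaply.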
From the desktop to the data center, the development model has changed in the same way. The PC is an open system in which each vendor, whether of software or hardware, is responsible for only part of the system and must interoperate with components from others. Because users are numerous and their requirements vary, the only option was to organize the system in layers (a layered architecture) and to adopt standardized specifications. Although this ensured the generality of the system and the effective division of labor among components from different sources, it also brought problems: a single function may have to traverse many layers to complete, and since the layers do not trust one another, each must perform strict parameter checking.
More seriously, functionality may be duplicated across layers. Take storage: one write passes through the libc file stream's buffer, the file system's buffer, the driver's buffer, and the cache on the disk itself before persistence is complete, a long call chain. Each layer is reasonable viewed on its own, but from the perspective of the whole system, performance is wasted. Moreover, because each layer is transparent, hiding its internals from the others, data persistence has to be ensured through an additional fsync operation, making the system's reliability mechanisms more complex.
We also often speak of separating mechanism from policy in architectural design: fixed, explicit functionality is implemented as mechanism, configured by flexible, variable policy, so that the system stays extensible. In practice, however, the layers are independent and opaque to one another yet often follow the same design philosophy, which does not guarantee an effective separation of mechanism and policy; the resulting system rarely achieves a good balance between extensibility and performance.
We can see that layering drives each layer to become smarter and more complex, yet the overall result is poorer. In today's data center environment, as noted earlier, we are often really building one very large application, and its characteristics should be considered fully. Moreover, the system usually comes from a single vendor and can be designed, or at least integrated, vertically. In that case policy is supplied by the application or platform layer above, and the lower layers need only provide mechanism. This makes the system simpler, yielding better performance, while extensibility remains well assured.
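The write path and the extra fsync step described above can be made concrete with a small sketch. The helper name is illustrative; it uses the standard POSIX-style calls exposed by Python's os module.

```python
import os

def durable_write(path, data):
    """Write data and force it down through the OS buffers.

    A successful os.write() only guarantees the bytes reached the
    kernel's page cache; they can still be lost on power failure.
    os.fsync() pushes them through the remaining layers (and asks
    the drive to flush its own cache) before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # the extra call each layer's buffering makes necessary
    finally:
        os.close(fd)
```

Note that using buffered stdio (a Python file object or libc `FILE*`) would add yet another buffer on top, requiring a flush before the fsync, which is exactly the duplicated-buffering cost the text describes.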
For example, SSDs today are generally designed on the assumption that a file system will use them. Because of flash memory's erase-before-write characteristics, a write buffer is needed, and because the buffer needs free space, complex placement algorithms and garbage-collection mechanisms follow, all with great impact on performance and cost (including development cost). In the data center environment, however, we usually have a fully designed storage system whose data organization and read/write paths are already thoroughly optimized; all it needs from the storage device are basic fixed-length blocks. In that case the SSD's logic can be kept very simple, directly exposing internal state (such as access statistics and physical blocks) to the layer above, improving performance and reducing cost. More importantly, this speeds up delivery, which is critical for reconciling the long implementation cycles of hardware such as servers, networks, and IDCs with ever-faster business growth.
The lower layers are asked to be logically simple and functionally solid while exposing more detail upward, with the most complex logic handled by the top-level application; this is another way of layering. Furthermore, there is no need to keep a physical boundary between layers (as between today's applications and the kernel); flexible layer divisions can be achieved with plain function calls. Interested readers can refer to the design ideas of library operating systems [Note: Exokernel] or in-kernel web servers [Note: khttpd].
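The "basic fixed-length blocks" interface asked of the device can be sketched as below. The class is a hypothetical in-memory stand-in, not a real driver; its point is how narrow the mechanism can be when allocation, buffering, and data layout live in the storage system above.

```python
class FixedBlockDevice:
    """Minimal block-device mechanism: read and write fixed-length
    blocks by index. All policy (placement, caching, garbage
    collection) is left to the storage system layered on top."""

    BLOCK_SIZE = 4096  # bytes per block, an illustrative choice

    def __init__(self, num_blocks):
        # Backing store simulated in memory for the sketch.
        self.blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

    def read_block(self, index):
        return self.blocks[index]

    def write_block(self, index, data):
        # Only whole, aligned blocks are accepted; the upper layer
        # is responsible for packing its data into this shape.
        assert len(data) == self.BLOCK_SIZE, "only whole blocks accepted"
        self.blocks[index] = bytes(data)
```

Compared with a device that hides flash management behind a smart translation layer, an interface this thin lets the upper storage system place and recycle data with full knowledge of its own workload.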
The third change from the desktop to the data center is the evaluation system. A medium-sized data center typically contains tens of thousands of servers, and at that scale hardware failures are commonplace. In general, we deal with hardware failure through redundant replication or re-execution. Once we are used to hardware failures, our attitude toward software bugs changes as well. Bugs that appear only occasionally [Note: also called heisenbugs, after the Heisenberg uncertainty principle] are the hardest to detect and the hardest to debug, and eliminating them costs an enormous amount. But given that the probability of such a bug striking is comparable to that of a hardware failure, we can actually handle it in the same way.
As scale grows, system complexity rises to the point where it exceeds what any one person can directly grasp, and understanding the system's state well enough to ensure its correct operation becomes very difficult. Here we can exploit the system's redundancy and periodically restart (reboot) some components, reducing the probability of triggering bugs by resetting state [Note: Recovery-Oriented Computing]. The same applies to performance problems, where data-mining techniques are sometimes needed to optimize or debug the system [Note: M.K. Aguilera, J.C. Mogul, J.L. Wiener, et al., "Performance Debugging for Distributed Systems of Black Boxes", SOSP '03].
Massive data and data processing applications also have a great influence. Because of the size of the data and the characteristics of the processing algorithms, many systems need only provide results that are correct in a probabilistic sense; they need not guarantee absolute data reliability, nor strict repeatability of computed results.
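The periodic-restart idea can be sketched as a supervisor that recreates a component after a fixed amount of work, discarding whatever state has accumulated. Both class names are illustrative; real systems restart on timers or health signals and rely on redundancy to mask the restart.

```python
class Worker:
    """A long-running component whose internal state drifts over time
    (leaks, latent-bug state). Recreating it resets that state."""

    def __init__(self):
        self.requests_served = 0

    def handle(self, request):
        self.requests_served += 1
        return "ok:%s" % request

class RejuvenatingSupervisor:
    """Restart the worker after a fixed number of requests, trading a
    brief reset for a lower chance of hitting state-dependent bugs."""

    def __init__(self, restart_every):
        self.restart_every = restart_every
        self.worker = Worker()

    def handle(self, request):
        if self.worker.requests_served >= self.restart_every:
            self.worker = Worker()  # reboot: discard accumulated state
        return self.worker.handle(request)
```

Because replicas elsewhere absorb the load during a reset, the restart is invisible to users; the redundancy already paid for to survive hardware failures also pays for rejuvenation.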
All in all, internet services, unlike PC applications, are enormous in scale, cost-sensitive, and subject to unusually frequent changes in business requirements. The prevailing system design principles grew out of more than thirty years of development in the desktop environment, but they no longer suit today's data center environment; we need to rethink and restate the applicable principles. This is reflected in three aspects. First, the change of environment from a single user running multiple tasks to massive numbers of users and tasks leads us to revisit the trade-offs between latency and throughput, and between fairness and efficiency. Second, developing the full system stack ourselves has become possible; transparency is no longer a virtue, architecture evolves from layered toward vertically integrated silos, and systems are driven by demand and customization. Third, as scale and complexity grow, we no longer pursue zero defects but learn to dance with faults and bugs; at the same time, data becomes part of the system, turning formerly deterministic systems nondeterministic, and the evaluation metric shifts from correctness to precision.
It should be stressed that these changes in design principles do not mean we must overthrow the general-purpose systems of the desktop environment in favor of proprietary ones. General-purpose systems were designed entirely for the desktop environment; now there is a new environment, with new application patterns and new business requirements, and thus a need for a different kind of general-purpose system. This is like today's NoSQL systems: specialized, yet becoming ever more general.
What most distinguishes internet services from traditional industries is large-scale data and a rapidly iterating mode of development. Data makes it possible to analyze user behavior, and fast iteration lets the results of that analysis take effect quickly, optimizing operations or adapting to changes in user needs. It can be said that the scale of its data and the speed of its iteration determine an internet company's pace of innovation as well as its technical level, and at the heart of both lies cloud computing technology.
Cloud computing can be decomposed into big data and datacenter computing. Big data covers massive-data analysis technology and the system architecture that supports it, including software infrastructure and data intelligence, while datacenter computing approaches hardware and software systems from the perspective of architecture. Traditional hardware and software systems were designed for the desktop environment, and today's data center environment differs in many ways: application characteristics and load models, development model, evaluation system, and so on. As a result, the inherited design principles no longer apply.
This article has discussed the characteristics of datacenter computing from a macroscopic point of view, aiming to clarify concepts, open up new lines of thought, and prompt the industry to rethink the principles of system design. Specific technical directions such as storage, computation, and large-scale distributed systems are not described in detail here and are left for future discussion.
(Responsible editor: Lu Guang)