Cloud computing and data center computing


Reprinted: http://www.programmer.com.cn/9767/

The concept of cloud computing originated with ultra-large Internet companies such as Google and Amazon. With the success of these companies, cloud computing, as their supporting technology, gained broad recognition and spread throughout the industry. Today cloud computing is widely regarded as a new stage in the development of the IT industry and has been loaded with many industry and product meanings. Because of these multiple meanings and tangled concepts, many companies and practitioners each see their own cloud. As Xu Zhimo wrote in his poem "Chance": "I am a cloud in the sky, by chance casting my shadow on your heart."

Traditional system design focuses on the standalone environment, while cloud computing focuses on the data center. From a single machine to a data center, many design principles change fundamentally. Put more strongly, the system design principles that have held throughout the 30 years of the PC era no longer apply today.

Considering the many connotations of cloud computing, "data center computing" may be a more precise term from a technical point of view. This article discusses the technical fields and design principles of data center computing, for reference.

Introduction to cloud computing

Personal computers have developed since the 1980s, and PC computing power has kept growing. With a single PC, an individual can store all the data they need and complete their own processing, such as writing documents and handling email. In the Internet era, however, an Internet company must serve its users with data volumes far beyond what any individual holds, and storing and processing that data requires the collaboration of thousands of machines. Servers on this scale are not something individuals provide; only large companies or organizations can own them, which looks like a return to the earlier mainframe era. From mainframe to PC to cloud: this is the development path of computer technology.

Simply put, cloud computing uses system architecture technology to integrate thousands of servers and provide users with flexible resource allocation and task scheduling capabilities. There are several keywords here. The first is ultra-large scale, in the number of machines, the number of users, and the number of concurrent tasks. The second is resource integration: thousands of servers' resources can be combined to do one thing, such as storing a large amount of data or processing a single large task. The third is flexible, fast delivery: large-scale server resources can be flexibly allocated and partitioned into virtual resource pools according to application needs, quickly supporting large numbers of concurrent requests or jobs.

The emergence of cloud computing technology makes our ability to organize and process data unprecedentedly powerful. This capability can help us identify the patterns behind many seemingly unrelated events and use them to predict future developments. Combined with mobile and IoT technologies, it can also better serve society and people's daily lives, for example in disaster warning, smart cities, and smart transportation. This data-processing capability grew out of massive data volumes and gradually merged with system architecture technology, which serves as its basic support, to form the cloud computing technology we see today.

In terms of system architecture and data processing technology, cloud computing technology can be divided into three layers: hardware infrastructure, software infrastructure, and data intelligence, as shown in Figure 1.

Figure 1: The three layers of cloud computing technology

The hardware infrastructure covers the design and implementation of servers, networks, and data centers. The software infrastructure focuses on storage, computing, large-scale distributed systems, and related technical fields, while data intelligence focuses on data warehousing, machine learning, data analysis, visualization, and related fields. It is worth noting that this three-layer division is made from the perspective of technical fields, whereas the familiar SaaS, PaaS, and IaaS layering of cloud computing is usually made in terms of how resources are provided and through what interfaces; the two are not in the same dimension.

The popular big data concept can be seen as data analysis technology plus the software architecture that supports it, viewed from the perspective of massive data; it covers the software infrastructure and data intelligence layers. Both layers deal with data, but the software infrastructure is mainly concerned with data formats, volumes, and access patterns, while data intelligence cares more about the semantics of the data.

Data center computing looks at software and hardware system design from an architectural perspective. The relevant technical fields and design principles are discussed below.

Data center computing

Technical fields and challenges

As shown in Figure 2, data center computing includes storage, computing, real-time storage and computing, ultra-large-scale systems, architecture, data centers, and other technical fields. The requirements on the storage system come from two dimensions. First, large amounts of unstructured data call for multiple storage structures, such as tables, objects, and files. Second, different access modes (such as read-only versus read/write) greatly affect how the storage system is designed and optimized.

Figure 2: Technical fields included in data center computing
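To make the access-mode point concrete, here is a minimal sketch (in Python, purely illustrative and not taken from any particular system) of an append-only store: because records are only ever appended and read back in order, the design needs no in-place update path or overwrite logic, which is exactly the kind of simplification a known access mode allows.

```python
# Minimal illustrative sketch: an append-only store whose design is simplified
# by its access mode -- records are appended and read back in order, never
# updated in place, so no overwrite or in-place locking logic is needed.
import json


class AppendOnlyStore:
    def __init__(self, path):
        self.path = path

    def append(self, record: dict) -> None:
        # One JSON object per line; appending never touches existing bytes.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def scan(self):
        # Read-only access: stream every record back in write order.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


if __name__ == "__main__":
    store = AppendOnlyStore("events.log")
    store.append({"user": "alice", "action": "login"})
    store.append({"user": "bob", "action": "upload"})
    print(list(store.scan()))
```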

The requirements and technical characteristics of the computing system are closely tied to the types of computing tasks. Data-intensive tasks, typified by MapReduce, must balance CPU and I/O demands. Compute-intensive and communication-intensive tasks are both CPU-bound, but they differ in the scale of data they access. A task that touches only a small amount of data is compute-intensive; a task that must access a large amount of data, such as large matrix iterations whose working set exceeds a single machine's memory and must be spread across multiple machines, usually shifts the system bottleneck to communication latency, much as in traditional high-performance computing.
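To make the data-intensive category concrete, the following is a minimal single-machine sketch of the MapReduce programming model (word count). Real systems distribute the map, shuffle, and reduce phases across many machines; only the structure of the computation is shown here, and the function names are illustrative.

```python
# Minimal single-process sketch of the MapReduce model (word count).
# Real MapReduce systems run map, shuffle, and reduce across many machines;
# only the structure of the computation is shown here.
from collections import defaultdict


def map_phase(documents):
    # map: emit (word, 1) for every word in every document
    for doc in documents:
        for word in doc.split():
            yield word, 1


def shuffle(pairs):
    # shuffle: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle(map_phase(docs))))
```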

Generally, storage and computing systems can support only a certain level of latency and concurrency. For higher requirements, a real-time storage and computing system must be built on memory. Given the characteristics of memory, such a system is better suited to offering semantically rich data structures on top of storage. Built on these distributed data structures, stream processing and triggered event processing models can better support real-time retrieval, OLAP, pub/sub, and similar applications.
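As a toy illustration of the triggered event processing and pub/sub style mentioned above, the sketch below keeps all data in memory and invokes subscriber callbacks as soon as an event is published. A production system would add persistence, partitioning, and network transport; all class and method names here are invented for illustration.

```python
# Toy in-memory publish/subscribe sketch: subscribers register callbacks per
# topic and are triggered synchronously when an event is published.
from collections import defaultdict
from typing import Callable, Dict, List


class PubSub:
    def __init__(self):
        self.subscribers: Dict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable) -> None:
        self.subscribers[topic].append(callback)

    def publish(self, topic: str, event: dict) -> None:
        # Triggered processing: every subscriber of the topic runs immediately.
        for callback in self.subscribers[topic]:
            callback(event)


if __name__ == "__main__":
    bus = PubSub()
    bus.subscribe("clicks", lambda e: print("realtime counter got", e))
    bus.publish("clicks", {"user": "alice", "page": "/home"})
```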

Ultra-large-scale systems rely on distributed technologies to keep the system available and manageable, covering system modeling, design, development, operations and maintenance (O&M), and related aspects. The architecture field covers virtual machines and server design. The data center field covers cabinet design, network planning and design, and data center design and construction, with a focus on power usage effectiveness (PUE).

System design principles

Traditional software and hardware systems are designed mainly for a single machine and an individual user, which can be called desktop computing. From the desktop to the data center, application characteristics and load models have changed greatly.

On a single machine, the system mainly serves one user, who may run multiple tasks that fall into foreground and background tasks. The user is very sensitive to system responsiveness, so foreground tasks usually take precedence over background tasks, while background tasks should be scheduled fairly. This is also why preemptive scheduling won out over cooperative scheduling.

There are likewise two types of applications in the data center: online and offline. Online systems face users directly, while offline systems are mostly used for data processing. An online system is usually one large application serving a huge number of users, and those users remain very sensitive to responsiveness. However, because the user base is so large and Internet services are free, costs are very high, so the system needs to fully exploit users' tolerance for response time; human perception of event response is on the order of milliseconds, and this margin can be used to optimize scheduling and save resources. Under extreme load, many systems first see response times stretch, then stop responding under sustained pressure, and finally crash. In such situations, serving requests within capacity normally while quickly rejecting requests beyond capacity gives a better user experience and improves system availability. In the end, an online service system should be designed for a stable maximum throughput (sustained throughput) while keeping latency below a guaranteed threshold.
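The "serve what is within capacity, quickly reject the rest" idea can be sketched as a simple admission-control check in front of the request handler. The concurrency budget and all names below are illustrative, not taken from any real system.

```python
# Sketch of admission control / load shedding: requests within the capacity
# budget are served normally, the rest get an immediate denial so latency for
# admitted requests stays bounded. The budget values are arbitrary examples.
import threading


class AdmissionController:
    def __init__(self, max_in_flight: int = 100):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False          # fast denial beyond capacity
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1


def handle_request(controller, request):
    if not controller.try_admit():
        return {"status": 503, "body": "overloaded, try again later"}
    try:
        return {"status": 200, "body": f"processed {request}"}
    finally:
        controller.release()


if __name__ == "__main__":
    ctrl = AdmissionController(max_in_flight=1)
    print(ctrl.try_admit())   # True: within budget
    print(ctrl.try_admit())   # False: budget exhausted, request is rejected
    ctrl.release()
```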

The offline system mainly runs data-processing jobs. These jobs involve massive data volumes, and users' expectations of response time are much looser; processing efficiency matters more. Such jobs are generally executed in batches to raise the total throughput of the system, that is, resource utilization becomes the primary scheduling target.

In system design there are perennial trade-offs to weigh, such as latency versus throughput and fairness versus efficiency. In the desktop environment we chose low latency and fairness; in the data center environment we choose high throughput (or rather, a stable maximum throughput) and high efficiency. These choices lead to different implementations, such as synchronous versus asynchronous models, thread-driven versus event-driven designs, thread pools, and queues.
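One concrete way the latency-versus-throughput choice shows up in implementation is batching work behind a queue: each item waits slightly longer, but the worker amortizes fixed costs over a whole batch. The sketch below is illustrative; the batch size and timeout are arbitrary.

```python
# Sketch of the queue-plus-batching pattern: trading a little per-item latency
# for higher total throughput by amortizing fixed per-call costs over a batch.
import queue
import threading
import time


def process_batch(items):
    # Placeholder for work whose fixed cost (disk write, RPC, lock) is
    # amortized across the whole batch.
    print(f"processed {len(items)} items together")


def batch_worker(q: queue.Queue, batch_size: int = 32, timeout: float = 0.01):
    while True:
        batch = [q.get()]                      # block for the first item
        while len(batch) < batch_size:
            try:
                batch.append(q.get(timeout=timeout))
            except queue.Empty:
                break                          # flush a partial batch
        process_batch(batch)                   # one call handles many items


if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=batch_worker, args=(q,), daemon=True).start()
    for i in range(100):
        q.put(i)
    time.sleep(0.5)                            # let the worker drain the queue
```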

From the desktop to the data center, the development model also changes. A PC is an open system: for both software and hardware, each vendor is responsible for only part of the system and must consider how to work with components from others. Because users are many and their requirements differ, the only option is a layered architecture with standardized specifications. This ensures the generality of the system and an effective division of labor among components from different sources, but it also brings problems. For example, a single function can be completed only by passing through multiple layers, yet the layers do not trust one another, so strict parameter checks are required at each boundary.

More seriously, functions may be duplicated at every layer of the system. Take storage as an example: a write operation passes through the libc file stream buffer, the file system's buffer cache, the driver's buffer, and the on-disk cache before the data is actually persistent. Each step is reasonable from the perspective of its own layer, but from the perspective of the whole system there is wasted performance. Moreover, because of the transparency that layering brings, data persistence must be guaranteed by an additional fsync operation, which makes the system's reliability mechanisms more complex.
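The fsync point can be seen in a few lines of standard POSIX-style I/O: a successful write only guarantees that the data reached some buffer along the path just described, and an explicit fsync is needed before the application may treat the data as durable.

```python
# A write() returning success only means the data reached a buffer somewhere
# along the layered path (file-system cache, driver buffer, on-disk cache);
# an explicit fsync() is required before treating the data as durable.
import os

fd = os.open("journal.dat", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"commit record\n")   # may still sit in volatile caches
os.fsync(fd)                       # force the data down to stable storage
os.close(fd)
```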

In architecture design we often speak of separating mechanism from policy: a fixed, well-defined function is implemented as a mechanism and combined with flexible, variable policies so that the system is extensible. In practice, however, each layer is independent and transparent and usually follows the same design philosophy, which does not guarantee an effective separation of mechanism and policy; the resulting system often struggles to strike a good balance between extensibility and performance.

We find that each layer tends to become smart and complex, yet the overall effect is unsatisfactory. In today's data center environment, as noted above, we are in effect building one very large application, and the characteristics of that application must be fully taken into account. Moreover, such a system usually has a single producer, so it can be designed or integrated vertically. Policies are then provided by the application or the upper platform layer, while the lower layers implement only mechanisms. This makes the system simpler and faster, while extensibility can still be well preserved.

Take SSDs as an example. Current SSDs are usually designed on the assumption that they will be used through a file system. Because flash memory must be erased before it is rewritten, the device needs a write buffer, and because that buffer requires reserved space, it also needs complex replacement algorithms and recovery mechanisms, all of which have a large impact on performance and cost (including development cost). In the data center environment, however, we usually have a fully designed storage system in which data organization and the read/write paths are already thoroughly optimized, so all that is required of the storage device is the most basic fixed-length block interface. In that case the SSD logic can be very simple, and its internal state (such as channels and physical blocks) can be exposed directly to the upper layer to improve performance and reduce cost. More importantly, this effectively shortens delivery time, which is crucial for easing the conflict between the long implementation cycles of hardware systems such as servers, networks, and IDCs and the demand for fast business growth.

Keeping the lower layers logically simple and single-purpose, exposing more details upward, and letting the upper-layer application handle the most complex logic is another way of dividing the layers. Moreover, there is no need to keep a physical boundary between layers (such as the one between the application and the kernel); flexible layering can be achieved through ordinary function calls. Readers who are interested can refer to the design ideas of library operating systems [Note: exokernel] or in-kernel web servers [Note: khttpd].

The third change from the desktop to the data center is the evaluation system. A medium-sized data center usually contains tens of thousands of servers, so hardware failures become the norm. We generally cope with hardware failures through redundant replication or re-execution. Once we are used to hardware failures, our attitude toward software bugs changes as well. The occasional, hard-to-reproduce bug [Note: often called a heisenbug, a pun on the Heisenberg uncertainty principle] is the hardest to detect and debug, and eliminating such bugs is enormously expensive. But considering that the probability of such a bug firing is comparable to that of a hardware fault, we can handle it in the same way.
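A small sketch of this "re-execute rather than eliminate" attitude: a rarely failing task is simply retried, much as a task lost to a machine crash would be rescheduled. The failure probability and function names are invented for illustration.

```python
# Sketch of treating rare software failures like hardware faults: instead of
# trying to eliminate every heisenbug, re-execute the failed task, just as a
# task on a crashed machine would be rescheduled. Names are illustrative.
import random


def flaky_task(x: int) -> int:
    # Stand-in for a task that very occasionally fails for obscure reasons.
    if random.random() < 0.01:
        raise RuntimeError("rare, hard-to-reproduce failure")
    return x * x


def run_with_retries(task, arg, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise               # give up only after several re-executions


if __name__ == "__main__":
    print(run_with_retries(flaky_task, 7))
```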

As the scale grows, system complexity keeps rising, to the point where the system often exceeds what any one person can directly grasp, and it becomes very difficult to understand its running state well enough to guarantee correct operation. Here we can exploit the system's redundancy to periodically restart (reboot) components and reset their state, reducing the probability that bugs are triggered [Note: see "recovery-oriented computing"]. The same holds for performance problems: sometimes we need data-mining methods to optimize or debug the system [Note: M. K. Aguilera, J. C. Mogul, J. L. Wiener, et al., "Performance Debugging for Distributed Systems of Black Boxes", SOSP '03].

Massive data volumes and data-processing applications also have a large impact. Because of the scale of the data and the characteristics of the processing algorithms, in many cases the system only needs to produce results that are correct in a probabilistic sense; it does not need to guarantee absolute data reliability, nor strictly guarantee that repeated runs produce identical results.

All in all, Internet services are huge in scale, cost-sensitive, and subject to frequently changing business needs, which differs from the characteristics of PC applications. The prevailing system design principles were shaped by 30 years of the desktop environment and no longer fit the data center environment. We need to rethink and restate the applicable design principles, which is reflected in the following three aspects.

  • The environment has shifted from single-user multitasking to multi-user single-task, so we must re-examine the trade-offs between latency and throughput and between fairness and efficiency in system design.
  • It becomes possible to develop the full system stack in-house, and transparency is no longer a virtue: the architecture evolves from layered to vertically integrated, and systems are customized to demand.
  • As scale and complexity grow, we no longer pursue zero defects but learn to dance with faults and bugs. At the same time, data has become part of the system, which turns a previously deterministic system into a non-deterministic one, and the evaluation metric shifts from correctness to precision.

It should be emphasized that these changes in design principles do not mean we must overthrow the general-purpose systems of the desktop environment and switch everything to dedicated systems. General-purpose system design used to be grounded entirely in the desktop environment; now there is a new environment, new application forms, and new business needs, which call for another kind of general-purpose system. This is like today's NoSQL systems, which were special-purpose when first proposed but are gradually becoming general-purpose.

Summary

The most significant differences between Internet services and traditional industries are ultra-large-scale data and a rapidly iterative development model. Data can be used to analyze user behavior, and rapid iteration lets the results of that analysis take effect sooner, optimizing operations or adapting to changes in user needs. It is fair to say that data scale and iteration speed determine how fast an Internet company can innovate and are a mark of its technical level, and the most critical enabler behind both is cloud computing technology.

Cloud computing technology can be divided into big data and data center computing. Big data covers data analysis technology and the system architecture that supports it, viewed from the perspective of massive data, and includes the software infrastructure and data intelligence layers; data center computing looks at hardware and software systems from an architectural perspective. Traditional software and hardware systems were designed for the desktop environment, while today's data center environment has changed greatly in application characteristics, load models, development models, and evaluation criteria, so the design principles inherited from the desktop era no longer apply.

This article has discussed the characteristics of data center computing mainly from a macro perspective, aiming to clarify the concepts, offer some inspiration, and prompt the industry to rethink system design principles. Specific technical directions such as storage, computing, and large-scale distributed systems are not covered in detail here and will be discussed in later articles.
