Reading Notes - Modern Operating Systems - Chapter 8 Multiprocessor Systems - 8.2 Multicomputers

Source: Internet
Author: User

8.2 Multicomputers

Multicomputers have become an easier-to-build alternative that avoids the construction difficulty and high cost of multiprocessors: their basic components are just ordinary PCs equipped with high-performance network interface cards. The key to getting high performance lies in the design of the interconnection network and the interface cards; this plays the same role that the shared memory does in a multiprocessor.

8.2.1 Multicomputer Hardware

1. Interconnection Technology
Nodes can be wired together in many network topologies.

Two kinds of switching schemes:
Store-and-forward packet switching: each message is first broken into blocks with some maximum length, called packets. Each switch stores an entire incoming packet before forwarding it to the next switch along the path.
Circuit switching: the first switch first establishes a path through all the intermediate switches to the destination switch. Once the path is set up, the bits are pumped from source to destination through the entire path as fast as possible.
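As a toy illustration of store-and-forward packet switching, the Python sketch below (with an assumed 4-byte maximum packet size and invented function names) splits a message into packets and forwards each one hop by hop, each switch holding the whole packet before passing it on:

```python
MAX_PACKET = 4  # assumed maximum packet payload length


def packetize(message: bytes, max_len: int = MAX_PACKET):
    """Break a message into packets no longer than max_len."""
    return [message[i:i + max_len] for i in range(0, len(message), max_len)]


def store_and_forward(message: bytes, switches: list):
    """Send each packet across a chain of switch buffers; a switch must
    hold the entire packet before forwarding it to the next hop."""
    delivered = []
    for pkt in packetize(message):
        for buffer in switches:      # each switch stores the whole packet...
            buffer.append(pkt)
            buffer.pop()             # ...then forwards it onward
        delivered.append(pkt)        # the destination reassembles the packets
    return b"".join(delivered)


# Two intermediate switches, each with its own (empty) packet buffer:
print(store_and_forward(b"hello, world", [[], []]))  # b'hello, world'
```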

2. Network Interface
In all multicomputers, the interface board carries some RAM for holding outgoing and incoming packets. The reason is that the network is usually synchronous: data must be sent at a constant bit rate, which cannot be guaranteed if packets are fetched from main RAM, so dedicated RAM on the board is needed. The receiving side likewise needs dedicated board RAM to accept packets.
The interface board may have one or more DMA channels. By requesting block transfers on the system bus, a DMA channel can copy packets between the interface board and main RAM at a very high rate, moving several words per bus request instead of requesting the bus separately for each byte.
Many interface boards also contain a full CPU, possibly in addition to the DMA channels; such boards are called network processors and are becoming increasingly powerful.

8.2.2 Low-Level Communication Software

The biggest obstacle to high-performance communication in multicomputers is excessive copying of packets.
In the best case there are three copies: from the sender's RAM to the sender's interface board, from the sender's interface board across the interconnect to the receiver's interface board, and from there to the receiver's RAM.
However, if the interface board is mapped into kernel virtual address space, the user process must additionally make a system call so the kernel can copy the packet between user memory and the kernel's buffer, once on the sending side and once again on the receiving side: five copies in total.

To avoid this, many multicomputers map the interface board directly into user virtual address space, but doing so raises two problems:

    1. If several processes are running on the node and all need network access to send packets, which process gets the interface board?
      One solution is to map the board into every process that needs it, but then some synchronization mechanism is required so that the processes do not race for the board.
    2. The kernel itself often needs access to the interconnection network as well. The simplest design is to use two network interface boards: one mapped into kernel space and one mapped into user space.

Node-to-network-interface communication
The fastest way to move packets to the interface board is to have the board's DMA chip copy them directly from RAM. The problem is that DMA uses physical rather than virtual addresses and runs independently of the CPU. (Personal note: the original book does not explain here how this problem is solved.)

8.2.3 User-Level Communication Software

The minimal scheme: expose message passing directly to user processes. The operating system provides primitives for sending and receiving messages, and library procedures make these underlying calls available to user processes.
The more elaborate scheme: hide the actual message passing from the user by making remote communication look like an ordinary procedure call.

1. Sending and Receiving

The call to send a message: send(dest, &mptr) sends the message pointed to by the mptr parameter to the process identified by the dest parameter, blocking the caller until the message has been sent.
The call to receive a message: receive(addr, &mptr) blocks the caller until a message arrives. When one arrives, it is copied into the buffer pointed to by the mptr parameter and the caller is unblocked. The addr parameter specifies the address the receiver is listening on.

Addressing: a two-part encoding is used, one part being the CPU number and the other part identifying a process or port number on the addressed CPU.
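A minimal sketch of blocking send and receive, assuming one mailbox queue per (CPU, port) address pair; the Router class and all its names are invented for illustration:

```python
import queue
import threading


class Router:
    """Toy message layer: one mailbox per (cpu, port) address."""

    def __init__(self):
        self.mailboxes = {}

    def mailbox(self, addr):
        return self.mailboxes.setdefault(addr, queue.Queue(maxsize=1))

    def send(self, dest, msg):
        # With maxsize=1, send blocks while a previous message is still
        # queued, loosely modeling "block until the message is transmitted".
        self.mailbox(dest).put(msg)

    def receive(self, addr):
        # Blocks the caller until a message addressed to addr arrives,
        # then hands the message back (the copy into *mptr).
        return self.mailbox(addr).get()


router = Router()
sender = threading.Thread(target=lambda: router.send((1, 80), b"ping"))
sender.start()
print(router.receive((1, 80)))  # b'ping'
sender.join()
```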

2. Blocking calls and non-blocking calls

The calls above are blocking calls (synchronous calls).

Non-blocking calls (asynchronous calls):

Sending side
A non-blocking send returns control to the caller immediately, before the message is transmitted.
The advantage is that the sending process can continue computing in parallel with the message transmission instead of leaving the CPU idle.
The disadvantage is that the sender must not modify the message buffer until the message has actually been sent; worse, the sender has no way of knowing when the transmission is finished.

Solutions:

    1. Have the kernel copy the message into an internal kernel buffer and then let the process continue. The extra copy, however, degrades performance.
    2. Interrupt the sender once the message has been sent, to signal that the buffer is available again. This makes programming more cumbersome.
    3. Make the buffer copy-on-write: mark it read-only until the message has been sent, and copy it if it is reused before then. The copy again degrades performance.

Under normal circumstances, blocking sends remain the best choice, especially when multiple threads are available.

Receiving side
A blocking receive suspends the caller until a message arrives; with multiple threads available this is simple and efficient.
A non-blocking receive merely tells the kernel where the buffer is located and returns control almost immediately.

There are several ways to notify the receiver that a message has arrived:

    • Interrupts; but interrupt-driven programming is complex.
    • Polling: a poll procedure is used to check for incoming messages.
    • Have the arrival of a message spontaneously create a new thread (a pop-up thread) that runs a predefined procedure whose parameter is a pointer to the incoming message. The thread is destroyed automatically when it finishes processing the message.
    • Like the previous option, but the receiver's code runs directly in the interrupt handler: the message itself carries the address of its handler, so the handler can be invoked within a few instructions of the message's arrival. The great benefit is that no copying is needed. This scheme is called active messages; because the handler address carried in the message must be trusted, it places high demands on security.
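The pop-up thread option above can be sketched like this (the handler and function names are invented): each arriving message spawns a thread that runs a predefined handler on the message and then disappears.

```python
import threading

results = []


def handler(msg):
    """Predefined procedure run by the pop-up thread; its argument is
    (a reference to) the incoming message."""
    results.append(msg.upper())


def on_message_arrival(msg):
    t = threading.Thread(target=handler, args=(msg,))
    t.start()                     # the pop-up thread is created on arrival
    return t


t = on_message_arrival("hello")
t.join()                          # the thread vanishes once it is done
print(results)                    # ['HELLO']
```
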
8.2.4 Remote Procedure Call

Everything above uses the message-passing model. Its drawback is that all communication is forced into the paradigm of input/output: send and receive are essentially doing I/O.

A completely different technique, the remote procedure call (RPC), allows programs to call procedures that run on other CPUs; it has become the basis of a large amount of multicomputer software.
The basic idea is to make a remote procedure call look as much as possible like a local one.
In the simplest form, the client program is bound to a small library procedure called the client stub, which represents the server procedure in the client's address space. Similarly, the server is bound to a procedure called the server stub.

Basic steps:

    1. The client calls the client stub.
    2. The client stub packs the parameters into a message and issues a system call to send it; packing the parameters is called marshaling.
    3. The kernel sends the message from the client machine to the server machine.
    4. The (server's) kernel passes the incoming message to the server stub.
    5. The server stub calls the server procedure.
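The five steps above can be sketched in Python, with pickle standing in for marshaling and a direct function call standing in for the kernel-to-kernel message; all stub and procedure names are invented:

```python
import pickle


def server_procedure(a, b):                  # the real procedure on the server
    return a + b


PROCEDURES = {"add": server_procedure}


def server_stub(wire_msg: bytes) -> bytes:
    name, args = pickle.loads(wire_msg)      # unmarshal the message (step 4)
    result = PROCEDURES[name](*args)         # call the server procedure (step 5)
    return pickle.dumps(result)              # marshal the reply for the trip back


def client_stub(name, *args):
    wire_msg = pickle.dumps((name, args))    # marshal the parameters (step 2)
    reply = server_stub(wire_msg)            # "the kernel sends the message" (step 3)
    return pickle.loads(reply)


print(client_stub("add", 2, 3))              # the client's call (step 1) -> 5
```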

Implementation issues:

    1. Pointer parameters are a problem: a pointer is meaningless in the server's address space.
    2. In weakly typed languages there may be no way to marshal some parameters, such as arrays, because their size cannot be determined.
    3. Parameter types cannot always be deduced (consider printf).
    4. Global variables no longer work, since the calling and called procedures do not share an address space.

Although RPC has these complications, it is still widely used.

8.2.5 Distributed Shared Memory

Distributed shared memory (DSM) preserves the illusion of shared memory even though this shared memory does not physically exist.

Sharing can be implemented at different levels:

    • By the operating system: page-based shared memory, sitting between the hardware and the runtime system.
    • By upper-level software: the runtime system itself implements the sharing, sitting between the operating system and the application.

Physically shared memory, by contrast, is implemented in hardware, below the operating system.

In a DSM system, the address space is divided into pages, which are spread over all the nodes in the system. When a CPU references a non-local address, a trap occurs; the DSM software fetches the page containing that address, and the trapped instruction is then restarted and can now complete.
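A sketch of that page-fault path (the page size and the Node and DSM classes are all invented for illustration): touching a non-local address "traps", the DSM layer migrates the page in, and the access is retried.

```python
PAGE_SIZE = 4                       # assumed toy page size


class Node:
    def __init__(self, name):
        self.name = name
        self.pages = {}             # page number -> page contents

    def read(self, addr, dsm):
        page_no, offset = divmod(addr, PAGE_SIZE)
        if page_no not in self.pages:              # "trap": page is not local
            self.pages[page_no] = dsm.fetch(page_no)  # DSM fetches the page
        return self.pages[page_no][offset]         # restart the faulting access


class DSM:
    def __init__(self, nodes):
        self.nodes = nodes

    def fetch(self, page_no):
        for node in self.nodes:                    # find the node holding the page
            if page_no in node.pages:
                return node.pages.pop(page_no)     # migrate it to the requester
        raise KeyError(page_no)


a, b = Node("A"), Node("B")
b.pages[1] = bytearray(b"wxyz")     # page 1 initially lives on node B
dsm = DSM([a, b])
print(chr(a.read(5, dsm)))          # address 5 = page 1, offset 1 -> 'x'
print(1 in a.pages, 1 in b.pages)   # True False: the page has migrated
```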

1. Replication
One basic improvement is to replicate pages that are read-only.
A further improvement is to replicate all pages, not just the read-only ones. If a replicated page is modified, however, an additional mechanism is needed to keep the copies consistent.

2. False Sharing
One difference between DSM and a multiprocessor is that the multiprocessor's hardware cache works in blocks of typically 32 or 64 bytes, sized to avoid tying up the bus for too long per transfer, whereas a DSM block must be an integer multiple of the page size and is therefore much larger.

The advantage of large pages: since the start-up time of a network transfer is long, transmitting a long page takes roughly as long as transmitting a short one.
The disadvantages:
A large page transfer occupies the network for a long time, blocking other transfers.
It can also cause false sharing: two variables used by processes on different CPUs happen to lie in the same block, so the block is continually shuttled back and forth between the two CPUs even though the processes never touch each other's variable.

3. Achieving Sequential Consistency
How to keep replicated blocks consistent when one copy is modified:
before writing to a shared page, the DSM software sends a message to every CPU holding a copy of the page, telling it to unmap and discard its copy. Once all the acknowledgements are in, the writing CPU proceeds with the write.
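That invalidation protocol can be sketched as follows (the Page structure and names are invented): before writing, every other copy holder discards its copy, and only then does the write proceed.

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.holders = set()         # CPUs currently holding a copy


def write_page(page, cpu, new_data):
    # Tell every other holder to unmap and discard its copy...
    page.holders = {cpu}
    # ...and only after all copies are gone, perform the write.
    page.data = new_data


p = Page(b"old")
p.holders = {"cpu1", "cpu2", "cpu3"}
write_page(p, "cpu1", b"new")
print(sorted(p.holders), p.data)     # ['cpu1'] b'new'
```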

With more careful bookkeeping, multiple copies of a writable page can be allowed.

    • One method lets a process first acquire a lock on a portion of the virtual address space and then perform multiple read and write operations in the locked region. When the lock is released, the modifications are propagated to the other copies of the pages. Only one CPU can lock a given page at any moment.
    • Another method makes a clean copy of a potentially writable page when it is first actually written, keeping the copy on the CPU that issued the write. The page can then be locked, updated, and unlocked. Later, when a process on a remote machine tries to acquire a lock on the page, the CPU that wrote it compares the current state of the page against the clean copy and builds a list of all the words that have changed. This list is then sent to the CPU acquiring the lock, which can update its copy of the page instead of discarding it.

8.2.6 Multi-Computer scheduling

In a multicomputer, each node has its own memory and its own set of processes, so how work is assigned to nodes matters all the more.
Multicomputer scheduling resembles multiprocessor scheduling, but not all of the latter's algorithms carry over; for example, a central list of ready processes is useless, because each process can only run on the CPU that currently holds it.
Gang scheduling can still be used, but it needs an additional protocol to keep the per-node schedulers synchronized.

8.2.7 Load Balancing

Processor allocation algorithms usually assume the following.
Known process properties: CPU requirements, memory usage, and the amount of communication between each process and every other process.
Possible goals: minimizing CPU cycles wasted for lack of local work, minimizing the total communication bandwidth, and ensuring fairness among users and processes.

1. A graph-theoretic deterministic algorithm:
Complete the allocation with minimal network traffic. Each vertex represents a process, and each weighted arc represents the traffic between two processes. The problem is then to find a way to partition (cut) the graph into k disjoint subgraphs so that the total weight of the cut arcs is minimized.
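Under this graph model, the network traffic of a given assignment is just the total weight of the arcs that are cut, i.e. whose endpoints land on different nodes (the edge weights and assignments below are invented):

```python
def cut_traffic(edges, assignment):
    """edges: {(p, q): traffic between processes p and q};
    assignment: process -> node it is placed on."""
    return sum(w for (p, q), w in edges.items()
               if assignment[p] != assignment[q])


edges = {("A", "B"): 3, ("B", "C"): 2, ("A", "C"): 1}
all_on_one_node = {"A": 0, "B": 0, "C": 0}
c_moved_away = {"A": 0, "B": 0, "C": 1}
print(cut_traffic(edges, all_on_one_node))  # 0: no arc is cut
print(cut_traffic(edges, c_moved_away))     # 3: arcs B-C and A-C are cut
```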

2. A sender-initiated distributed heuristic algorithm
When a process is created, it runs on the node that created it unless that node is overloaded; the measure of overload may be too many processes, too much total work, or something else. If the node is overloaded, it picks another node at random and asks about its load. If the probed node's load is below a threshold, the new process is shipped there to run. If not, another machine is chosen, for up to N probes in all; if no suitable host is found, the algorithm terminates and the process runs on the node that created it.
The disadvantage is that under heavy load, all the machines constantly send out probes, wasting capacity exactly when it is scarce.
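The sender-initiated probing can be sketched like this (the threshold, probe limit, and load numbers are all invented):

```python
import random

THRESHOLD = 3        # assumed: a node with load below this accepts work
N_PROBES = 3         # assumed: give up after this many probes


def place_process(loads, here, rng):
    """loads: node -> current load; returns the node that runs the new process."""
    if loads[here] < THRESHOLD:          # creating node is not overloaded
        return here
    others = [n for n in loads if n != here]
    for _ in range(N_PROBES):            # probe randomly chosen nodes
        candidate = rng.choice(others)
        if loads[candidate] < THRESHOLD:  # probed node is lightly loaded
            return candidate             # ship the new process there
    return here                          # no suitable host found: keep it


rng = random.Random(0)
print(place_process({"n0": 1, "n1": 9}, "n0", rng))  # n0: runs locally
print(place_process({"n0": 5, "n1": 1}, "n0", rng))  # n1: the only probe target
```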

3. A receiver-initiated distributed heuristic algorithm
The essential difference is that here it is the idle node that makes the request, asking other nodes for work.
In practice, algorithms 2 and 3 can be combined.

