CSCI.4210 Operating Systems Fall, 2008, Class 20

Multiprocessors, Multicomputers

Multiprocessor Operating Systems

The text discusses three possible configurations of operating systems for a shared memory multiprocessor system.

Each CPU has its own operating system. In such a system, a portion of the memory is allocated for each CPU, and there is one instance of the OS code in memory. While this is simple to implement, it is not generally used for several reasons.
- Each OS does its own scheduling, so there can be no sharing of processes. In particular, if one process has multiple threads, they all run on the same CPU so there is no speedup due to parallelism.
- I/O buffering and caching is largely impossible because maintaining cache consistency is problematic.
- Page sharing is impossible because each CPU has its own memory.
- Load balancing is difficult or impossible. One CPU may be overworked but there is no simple way for some of this work to be offloaded to another CPU.
A Master-slave arrangement, in which one CPU is the master and controls the other CPUs. In this implementation, one CPU runs the operating system and farms processes out to the other CPUs. Note that all system calls are handled by the CPU running the OS. This solves some of the problems of the above system, but the OS becomes a bottleneck.
Symmetric Multiprocessors (SMP). In this model there is only one copy of the OS in memory, but any CPU can run it. Thus, when a process makes a system call, the CPU on which it is running traps the call and processes it. The major problem with this model is that, as we saw in week 4, concurrent processes can produce incorrect results if not handled carefully. If two different CPUs are updating the Process Table at the same time, the result can be an inconsistent process table, followed by disaster.
The solution is to divide the kernel up into critical regions that can be run independently of one another, and surround each with a mutex so that only one CPU at a time is running in a particular critical region. Since CPUs may be accessing several of these critical regions at the same time, deadlock is a possibility, so the system needs a mechanism to prevent deadlock.
In spite of its complexity, this is how most multiprocessor systems are designed now.
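One common way to prevent deadlock among several kernel mutexes is to give every lock a fixed rank and always acquire locks in ascending rank. The sketch below illustrates that idea with Python threads standing in for CPUs. The lock names, the ranks, and the helper functions are all hypothetical; the notes do not say which deadlock-prevention mechanism a particular kernel uses.

```python
import threading

# Hypothetical kernel locks; the names and ranks are illustrative only.
LOCK_RANK = {"process_table": 0, "file_table": 1, "buffer_cache": 2}
LOCKS = {name: threading.Lock() for name in LOCK_RANK}

def acquire_in_order(names):
    """Acquire the named locks in ascending rank, whatever order was requested."""
    ordered = sorted(set(names), key=LOCK_RANK.__getitem__)
    for name in ordered:
        LOCKS[name].acquire()
    return ordered

def release_all(ordered):
    """Release the locks in the reverse of the order they were acquired."""
    for name in reversed(ordered):
        LOCKS[name].release()

def cpu_work(wanted, rounds, counter):
    for _ in range(rounds):
        held = acquire_in_order(wanted)
        counter["updates"] += 1   # safe: every worker holds process_table here
        release_all(held)

def run_demo(rounds=1000):
    counter = {"updates": 0}
    # The two "CPUs" ask for the same locks in opposite orders. Without the
    # ranking, each could grab one lock and wait forever for the other.
    a = threading.Thread(target=cpu_work,
                         args=(["process_table", "file_table"], rounds, counter))
    b = threading.Thread(target=cpu_work,
                         args=(["file_table", "process_table"], rounds, counter))
    a.start()
    b.start()
    a.join()
    b.join()
    return counter["updates"]
```

Because both workers end up acquiring process_table before file_table, neither can hold one lock while waiting for a lock that the other holds.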

Even implementing synchronization primitives on a multiprocessor system is more difficult. On a single processor system, the kernel could ensure that any code which locked a mutex was performed atomically without the possibility of interruption. On a multiprocessor system, this is not so simple. There needs to be a way to make sure that two different CPUs don't both lock a mutex at the same time.

Since all of the CPUs share a common bus, one solution is to allow the bus to enforce mutual exclusion with a special flag. In other words, when a CPU wants to perform an operation on a mutex, it requests the bus in the usual way and also requests that this flag be set. Thus, no other CPU can use the bus (meaning that no other CPU can access memory), until the CPU finishes locking the mutex and releases the bus. This works, but it requires special bus hardware, and it can also incur a substantial performance penalty because other CPUs are prevented from using the bus and accessing memory.
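The bus-lock idea can be imitated in software to see why it works. In the sketch below, a Python lock plays the role of the special bus flag, and a test-and-set operation reads and writes the mutex word while that flag is held. This is only a simulation under those assumptions: the class SimulatedBus and its methods are made up for illustration, and real hardware does this in the bus logic, not in code.

```python
import threading
import time

class SimulatedBus:
    """Stands in for a shared bus with a lock flag (purely illustrative)."""

    def __init__(self):
        self._bus_lock = threading.Lock()        # models the special bus flag
        self.memory = {"mutex": 0, "shared": 0}  # simulated shared memory

    def test_and_set(self, addr):
        # While the bus is locked no other "CPU" can reach memory, so the
        # read and the write below behave as one indivisible operation.
        with self._bus_lock:
            old = self.memory[addr]
            self.memory[addr] = 1
            return old

    def clear(self, addr):
        with self._bus_lock:
            self.memory[addr] = 0

def cpu(bus, rounds):
    for _ in range(rounds):
        while bus.test_and_set("mutex") == 1:  # spin until the old value was 0
            time.sleep(0)                      # give other threads a chance to run
        # Critical region: this read-modify-write is not atomic by itself.
        value = bus.memory["shared"]
        bus.memory["shared"] = value + 1
        bus.clear("mutex")                     # unlock the mutex

def run_cpus(n_cpus=4, rounds=200):
    bus = SimulatedBus()
    threads = [threading.Thread(target=cpu, args=(bus, rounds))
               for _ in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus.memory["shared"]
```

The spinning CPUs keep requesting the bus while they wait, which is the performance penalty mentioned above.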

Process scheduling on a multiprocessor system

Process scheduling on a single CPU machine is far from trivial, and scheduling on a multiprocessor system is even more complex. If all processes are independent, then we can use a scheduling algorithm similar to that used for a single processor. Each CPU is running a process, and the system maintains a single list of runnable jobs. When a process on a particular CPU becomes blocked or uses up its time quantum or terminates, that CPU just gets the next runnable job from the list and continues.
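The single shared list of runnable jobs can be sketched as a small simulation. As a simplifying assumption, all CPUs are rescheduled in lockstep once per quantum and blocking is ignored; the function name simulate and its parameters are hypothetical.

```python
from collections import deque

def simulate(bursts, n_cpus, quantum):
    """Return job ids in the order they finish.

    bursts[i] is the CPU time job i still needs. All CPUs share one ready
    queue; in each round every CPU takes the job at the head of the queue.
    """
    ready = deque(enumerate(bursts))      # (job_id, remaining_time) pairs
    finished = []
    while ready:
        # Each CPU pulls the next runnable job from the one shared list.
        running = [ready.popleft() for _ in range(min(n_cpus, len(ready)))]
        for job_id, remaining in running:
            remaining -= quantum
            if remaining > 0:
                ready.append((job_id, remaining))  # quantum used up: requeue
            else:
                finished.append(job_id)            # job terminated
    return finished
```

For example, simulate([3, 1, 2, 5], n_cpus=2, quantum=2) returns [1, 2, 0, 3]: job 1 finishes in the first round, while job 0 goes back to the end of the list behind jobs 2 and 3.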

But there are some complications. The first is that since mutexes are much more common on symmetric multiprocessors, it is possible that a process will use up its time quantum while it holds a mutex. It can't easily yield the mutex because it is in the process of updating a global data structure. On the other hand, if it is simply returned to the ready queue, any other processes which are waiting for that mutex will be forced to wait even longer. One solution is to give such processes extra time; if a process uses up its time quantum while it is in a critical region, it is permitted to finish its critical region and unlock the mutex before being replaced by another process.

Another issue is that it might make sense to restart a process on the same CPU when it returns from a blocked state. It is possible that much of its data is still in cache, and that some of its pages are still in the TLB. This leads to a two level scheduling algorithm. Once a process is assigned to a CPU, it always (or almost always) runs on that CPU, even if it is blocked for an extended period. So process scheduling on a CPU is one level. Assigning new processes to a CPU is the second level of process scheduling.
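A minimal sketch of the two-level idea follows. It assumes that the top level places a new process on the CPU that currently has the fewest processes assigned to it; the notes do not specify the placement rule, and the class and method names are hypothetical.

```python
from collections import deque

class TwoLevelScheduler:
    def __init__(self, n_cpus):
        self.queues = [deque() for _ in range(n_cpus)]  # one ready queue per CPU
        self.home = {}                                  # pid -> index of its CPU

    def admit(self, pid):
        """Top level: assign a new process to the least populated CPU."""
        counts = [0] * len(self.queues)
        for cpu in self.home.values():
            counts[cpu] += 1
        cpu = counts.index(min(counts))
        self.home[pid] = cpu
        self.queues[cpu].append(pid)
        return cpu

    def wake(self, pid):
        """A blocked process becomes runnable again, always on its home CPU."""
        cpu = self.home[pid]
        self.queues[cpu].append(pid)
        return cpu

    def pick_next(self, cpu):
        """Bottom level: each CPU schedules only from its own queue."""
        return self.queues[cpu].popleft() if self.queues[cpu] else None
```

Keeping a process on its home CPU gives it a chance of finding its data still in that CPU's cache and TLB, at the cost of possibly leaving another CPU idle.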

Multicomputers

The second computer system architecture is the multicomputer, in which each CPU has its own memory, but the CPUs are tightly coupled. Processes can no longer use shared memory for communication, so interprocess communication is done with Message Passing. Because the messages need to go over a network, message passing communication is several orders of magnitude slower than communication through shared memory. A group of tightly coupled computers is called a cluster.

Clusters provide a number of attractive features.

They are highly scalable. It is possible to create clusters with hundreds or even thousands of nodes. Note that this was not generally feasible with shared memory architectures because of bus contention.
They can provide high availability. Because each node in a cluster is a standalone computer, failure of a node does not mean loss of service.
Clusters can provide very good price/performance because they use commercial off-the-shelf (COTS) components.

There are several methods of connecting the nodes together. One obvious method is using Ethernet, which is a broadcast protocol. However, other systems use various point-to-point protocols. For example, the nodes can be connected using a grid or mesh.

One interesting example of this is a hypercube. The number of nodes (CPUs) on a hypercube should be a power of 2. Picture a cube, with the corners as nodes and lines between them as connections. There are eight nodes, and no node is more than three hops from any other node. It takes three hops to get from the top left front node to the bottom right back node.

A four dimensional cube would have 16 nodes, and a maximum of four hops to get from one node to another.

A five dimensional hypercube would have 32 nodes and a maximum of five hops to get from one node to another, and so on.
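The hop counts follow from a common way of labelling hypercube nodes: give each node a binary number with one bit per dimension, and connect two nodes exactly when their labels differ in one bit. The number of hops between two nodes is then the number of bits in which their labels differ. The sketch below assumes that labelling; the function names are illustrative.

```python
def neighbors(node, dim):
    """Nodes directly connected to `node` in a hypercube of dimension `dim`."""
    return [node ^ (1 << bit) for bit in range(dim)]

def hops(a, b):
    """Minimum number of hops: the count of differing bits in the labels."""
    return bin(a ^ b).count("1")

def route(a, b):
    """One shortest path from a to b, fixing differing bits from lowest to highest."""
    path = [a]
    diff = a ^ b
    bit = 0
    while diff:
        if diff & 1:
            a ^= 1 << bit      # cross the link along this dimension
            path.append(a)
        diff >>= 1
        bit += 1
    return path
```

In the three dimensional cube, hops(0b000, 0b111) is 3 and route(0, 7) gives [0, 1, 3, 7], which matches the corner-to-opposite-corner case described above.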

Clusters often include middleware to provide a single image to all of the users; that is, the fact that there are numerous nodes is transparent to the typical user. A user logs into the cluster, not onto a particular machine in the cluster, and sees a single file system, such as NFS, a common user interface, and a common view of resources and devices.

Even though each node in the cluster has its own memory, it is possible for a cluster to provide a single memory image for multiple CPUs; this is called Distributed Shared Memory. Each page is in the memory of one of the nodes, but all nodes have access to it. This is made possible because of virtual memory. When a CPU tries to access a page that it does not have in its page table, the OS locates the page in the network (perhaps on some other computer). The network sends the page to the requesting node, where it is mapped into that node's page table.

This can result in cache inconsistencies if a CPU writes to its own instance of a page. As a result, whenever a process is about to execute a WRITE to a virtual memory page, a message is sent to all the other CPUs which hold a copy of that page telling them to unmap and discard the page.
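The invalidate-on-write rule can be sketched with a small bookkeeping structure that records which nodes hold a copy of each page. This is a simplified model: the class DsmDirectory is hypothetical, network messages are just appended to a list, and a real system would also have to handle ownership and concurrent writers.

```python
class DsmDirectory:
    """Tracks which nodes hold a mapped copy of each page (illustrative only)."""

    def __init__(self, initial_owner):
        # initial_owner: dict mapping page -> node that holds it at the start
        self.copies = {page: {node} for page, node in initial_owner.items()}
        self.messages = []   # log of simulated network messages

    def read(self, node, page):
        holders = self.copies[page]
        if node not in holders:
            # Page fault on the reader: fetch a copy over the network.
            self.messages.append(("fetch", page, node))
            holders.add(node)

    def write(self, node, page):
        self.read(node, page)   # make sure the writer has the page mapped
        # Before the write happens, every other holder must discard its copy.
        for other in sorted(self.copies[page] - {node}):
            self.messages.append(("invalidate", page, other))
        self.copies[page] = {node}
```

After a write, only the writing node holds the page, so any later access from another node faults and fetches the up-to-date copy.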

Load balancing is a major issue in such systems. Once a process has been assigned to a node, it is difficult to move it, and so if the system is busy, it is important to keep all of the processors occupied. It is easy for a situation to develop where one node is overworked and another is idle.

There are a number of algorithms for this, depending on how much information about the processes and the load on each node is available to the scheduler. Here are three:

A graph theoretic algorithm. If the system knows in advance how much communication there will be between various processes, it can minimize the amount of network traffic by assigning clusters of processes which will be sending many messages to each other to the same node. The algorithm which does this is a graph partitioning algorithm. Each process is a vertex of the graph, and the weight of each arc is the message volume between two processes. If there are more processes (p) than there are nodes (n), the algorithm will partition the graph of p vertices into n partitions such that the amount of communication between any two nodes is minimized.
A sender initiated distributed algorithm. When a new process is created, it runs on the node on which it was created unless the workload on that node exceeds a certain level, in which case the node chooses another node at random and asks what its load is. If the load of the new node is below a certain threshold, then the process is sent to that node to be run. If the load on the randomly selected node is too high, then another node is selected at random until a node is found with a low load.
Note that unlike the previous algorithm, this makes no attempt to minimize network traffic, and in fact if the system is heavily loaded, this algorithm generates even more network traffic.
A receiver initiated distributed algorithm. This is the opposite algorithm. Whenever a process terminates on a node, if the workload on that node is below a threshold, it queries other nodes. If it finds a node which is heavily loaded, that node will send a process to the lightly loaded node.
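The sender initiated algorithm can be sketched as follows. One assumption is added that is not in the description above: the number of random probes is capped by max_probes, so that a heavily loaded system does not probe forever, and the process stays where it was created if every probe fails. The function name and parameters are hypothetical.

```python
import random

def place_process(origin, loads, threshold, max_probes=3, rng=random):
    """Return the node that should run a new process created on `origin`.

    loads[i] is the current load of node i. The probe cap (max_probes) is an
    assumption of this sketch, not part of the algorithm as described.
    """
    if loads[origin] <= threshold:
        return origin                      # local node is not overloaded
    others = [n for n in range(len(loads)) if n != origin]
    if not others:
        return origin
    for _ in range(max_probes):
        candidate = rng.choice(others)     # pick another node at random
        if loads[candidate] < threshold:   # "ask what its load is"
            return candidate
    return origin                          # every probe found a busy node
```

Each probe stands for a message exchange over the network, which is why the algorithm generates the most traffic exactly when the system is already busy.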

In order to implement load balancing, there must be a mechanism to migrate a process from one node to another. There are several ways to do this.

The simplest way is to simply copy the entire process address space, but this can be quite time consuming, and it often results in a lot of unneeded copying.
Precopying. The process continues to execute on the source node while the address space is copied to the target node. Pages which are modified during the precopy operation may need to be copied a second time.
Transfer only those pages which are in main memory and which have been modified. Any additional pages will be transferred on demand only. While this is very efficient, it means that the sending node has to continue to be involved in the process, by maintaining page tables and transferring pages as needed.
Flushing. All dirty pages are simply copied to disk, and the process starts on the new node by loading pages from disk rather than copying directly from the memory of the source to the memory of the destination. This is simple to implement, but it may require extra disk accesses.
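The precopy approach can be sketched as a loop that keeps resending pages dirtied while the previous round was being copied. The sketch assumes a cap on the number of rounds and a final step in which the process is stopped and the remaining dirty pages are sent; neither detail is given above, and the function name and parameters are hypothetical.

```python
def precopy_migrate(pages, writes_per_round, max_rounds=5):
    """Simulate precopy migration and return (target_pages, pages_sent).

    pages: dict page number -> contents on the source node.
    writes_per_round: list of dicts; entry i holds the pages the still-running
    process modifies while copy round i is in progress.
    """
    source = dict(pages)
    target = {}
    to_send = set(source)          # first round: the whole address space
    sent = 0
    for round_no in range(max_rounds):
        for page in to_send:
            target[page] = source[page]
            sent += 1
        # The process kept running during the copy and dirtied these pages.
        dirtied = writes_per_round[round_no] if round_no < len(writes_per_round) else {}
        source.update(dirtied)
        to_send = set(dirtied)     # only these need to be copied again
        if not to_send:
            break
    # Assumed final step: stop the process and send whatever is still dirty.
    for page in to_send:
        target[page] = source[page]
        sent += 1
    return target, sent
```

With three pages and one page modified during the first round, four page transfers are needed in total: the modified page is copied twice.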

A final issue which has to be resolved in process migration is the status of outstanding signals and messages.

The standard cluster software for a bunch of Linux machines is Beowulf.

Virtualization

Virtualization refers to running several different OSs (or multiple instances of the same OS) on a single machine. This concept has been around for a long time; in the 1970s, IBM had a mainframe operating system called VM, in which each user was given their own instance of the OS. The company VMware has made virtualization popular. Virtualization has become so important that Intel has added features to its processors to assist virtualization. The technology, called VT by Intel, creates containers in which virtual machines can be run.

The underlying software that controls this is called a hypervisor, and this is the true Operating System. The other operating systems (the ones that the users see) are called guest operating systems. Whenever a process on a guest OS tries to execute a privileged instruction, this is intercepted by the hypervisor.

There are a number of problems that virtualization has to solve. The first is that the guest operating system will have to execute privileged instructions (recall from the first class that there are some processor instructions which should only be executed in kernel mode). In all types of virtualization, these privileged instructions are trapped by the hypervisor rather than run directly, and the hypervisor then executes them.

Another problem is memory virtualization. Each guest OS thinks that it can see the entire memory, and of course each has its own page table. What happens if two different guests try to map to the same physical page? The hypervisor solves this problem by creating a shadow page table that maps the virtual pages used by the virtual machine onto real pages. This can result in a performance penalty, because every change to a guest's virtual page table results in an update to the shadow page table as well.
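A minimal sketch of the shadow page table idea is given below. It assumes the hypervisor hands out real page frames on first use and keys its tables by guest name; the class Hypervisor and its methods are hypothetical and ignore protection bits, eviction, and the TLB.

```python
class Hypervisor:
    def __init__(self):
        self.next_free = 0       # next unused real (host) page frame
        self.guest_to_host = {}  # (guest, guest "physical" frame) -> real frame
        self.shadow = {}         # (guest, virtual page) -> real frame
        self.shadow_updates = 0  # how often the shadow table had to change

    def _host_frame(self, guest, gframe):
        """Give each guest frame its own real frame, allocated on first use."""
        key = (guest, gframe)
        if key not in self.guest_to_host:
            self.guest_to_host[key] = self.next_free
            self.next_free += 1
        return self.guest_to_host[key]

    def guest_maps(self, guest, vpage, gframe):
        """Called when a guest changes its own page table (trapped by the hypervisor)."""
        self.shadow[(guest, vpage)] = self._host_frame(guest, gframe)
        self.shadow_updates += 1   # every guest update costs a shadow update
```

If guests A and B both map a virtual page to what each believes is physical frame 7, the shadow table sends them to two different real frames, and the counter shows the extra work done on every guest page table change.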
