The text discusses three possible configurations of operating systems for a shared memory multiprocessor system.
The solution is to divide the kernel up into critical regions that can be run independently of one another, and surround each with a mutex so that only one CPU at a time is running in a particular critical region. Since CPUs may be accessing several of these critical regions at the same time, deadlock is a possibility, so the system needs a mechanism to prevent deadlock.
In spite of its complexity, this is how most multiprocessor systems are designed now.
Since all of the CPUs share a common bus, one solution is to allow the bus to enforce mutual exclusion with a special flag. In other words, when a CPU wants to perform an operation on a mutex, it requests the bus in the usual way and also requests that this flag be set. Thus, no other CPU can use the bus (meaning that no other CPU can access memory), until the CPU finishes locking the mutex and releases the bus. This works, but it requires special bus hardware, and it can also incur a substantial performance penalty because other CPUs are prevented from using the bus and accessing memory.
Process scheduling on a multiprocessor system
Process scheduling on a single CPU machine is far from trivial, and processing on a multiprocessor system is even more complex. If all processes are independent, then we can use a scheduling algorithm similar to that used for a single processor. Each CPU is running a process, and the system maintains a single list of runnable jobs. When a process on a particular CPU becomes blocked or uses up its time quantum or terminates, that CPU just gets the next runnable job from the list and continues.
But there are some complications. The first is that since mutexes are much more common on symmetric multiprocessors, it is possible that a process will use up its time quantum while it holds a mutex. It can't easily yield the mutex because it is in the process of updating a global data structure. On the other hand, if it is simply returned to the ready queue, any other processes which are waiting for that mutex will be forced to wait even longer. One solution is to give such processes extra time; if a process uses up its time quantum while it is in a critical region, it is permitted to finish its critical region and unlock the mutex before being replaced by another process.
Another issue is that it might make sense to restart a process on the same CPU when it returns from a blocked state. It is possible that much of its data is still in cache, and that some of its pages are still in the TLB. This leads to a two level scheduling algorithm. Once a process is assigned to a CPU, it always (or almost always) runs on that CPU, even if it is blocked for an extended period. So process scheduling on a CPU is one level. Assigning new processes to a CPU is the second level of process scheduling.
Multicomputers
The second computer system architecture is the multicomputer, in which each CPU has its own memory, but the CPU are tightly coupled. Processes can no longer use shared memory for communication, so interprocess communication is done with Message Passing. Because the messages need to go over a network, message passing communication is several orders of magnitude slower than communication through shared memory. A group of tightly clustered computers is called a cluster
Clusters provide a number of attractive features.
One interesting example of this is a hypercube. The number of nodes (CPUs) on a hypercube should be a power of 2. Picture a cube, with the corners as nodes and lines between them as connections. There are eight nodes, and no node is more than three hops from any other node. It takes three hops to get from the top left front node to the bottom right back node.
A four dimensional cube would have 16 nodes, and a maximum of four hops to get from one node to another.
A five dimensional hypercube would have 32 nodes and a maximum of five hops to get from one node to another, and so on.
Clusters often include middleware to provide a single image to all of the users; that is, the fact that there are numerous nodes is transparent to the typical user. A user logs into the cluster, not onto a particular machine in the cluster, and sees a single file system, such as NFS, a common user interface, and a common view of resources and devices.
Even though each node in the cluster has its own memory, it is possible for a cluster to provide a single memory image for multiple CPUs; this is called Distributed Shared Memory. Each page is in the memory of one of the nodes, but all nodes have access to it. This is made possible because of virtual memory. When a CPU tries to access a page that it does not have in its page table, the OS locates the page in the network (perhaps on some other computer). The network sends to page to the requesting node where it is mapped in its page table.
This can result in cache inconsistencies if a CPU writes to its own instance of a page. As a result, whenever a process is about to execute a WRITE to a virtual memory page, a message is sent to all the other CPUs which hold a copy of that page telling them unmap and discard the page.
Load balancing is a major issue in such systems. Once a process has been assigned to a node, it is difficult to move it, and so if the system is busy, it is important to keep all of the processors occupied. It is easy for a situation to develop where one node is overworked and another is idle.
There are a number of algorithms for this, depending on how much information is available and how much information the scheduler has. Here are three:
Note that unlike the previous algorithm, this makes no attempt to minimize network traffic, and in fact if the system is heavily loaded, this algorithm generates even more network traffic.
In order to implement load balancing, it there must be a mechanism to migrate a process from one node to another. There are several ways to do this.
A final issue which has to be resolved in process migration is the status of outstanding signals and messages.
The standard cluster software for a bunch of Linux machines is Beowulf.
Virtualization refers to running several different OSs (or multiple instances of the same OS) on a single machine. This concept has been around for a long time; in the 1970s, IBM had a mainframe operating system called VM, in which each user was given their own instance of the OS. The company VMWare has made virtualization popular. Virtualization has become so important that Intel has added features to assist virtualization onto their Processors. The technology, called VT by Intel creates containers in which virtual machines can be run.
The underlying software that controls this is called a hypervisor, and this is the true Operating System. The other operating systems (the ones that the users see) are called guest operating systems. Whenever a process on a guest OS tries to execute a privileged instruction, this is intercepted by the hypervisor.
There are a number of problems that virualization has to solve. The first is that the guest operating system will have to execute privileged instructions (recall from the first class that there are some processor instructions which should only be executed in kernel mode). In all types of virtualization, these privileged instructions are trapped by the hypervisor rather than run directly, and the hypervisor then executes them.
Another problem is memory virtualization. Each guest OS thinks that it can see the entire memory, and of course each has its own page table. What happens if two different guests try to map to the same physical page. The hypervisor solves this problem by creating a shadow page table that maps the virtual pages used by the virtual machine onto real pages. This can result in a performance penalty, because every change to a guest's virtual page table results in an update to the shadow page table as well.
Return to the course home page