CSCI.4210 Operating Systems Fall, 2008, Class 13

CSCI.4210 Operating Systems Fall, 2008 Class 13
Input/Output Systems

When people want a powerful computer, they look at the clock speed of the CPU or the design of the CPU, but what determines the performance of most systems is the speed and design of the various I/O devices. Most processes will read from a hard disk, and often from other devices as well, and these tend to take many orders of magnitude longer than simple processing with the data in main memory.

There is an enormous range of devices that can be attached to computers these days. Here is a far from complete list.

Keyboard
mouse
monitor
hard drive
CD-ROM
floppy drive
printer
network interface
modem
scanner
joystick
digital camera

Mechanisms for performing I/O

The operating system provides an interface between applications and the I/O devices. Most operating systems make the I/O operations as general as possible. For example, Unix systems have only two I/O system calls, read and write, regardless of whether the device is a hard drive, a floppy drive, a keyboard or a network interface. This makes programming I/O very simple for an applications programmer. The operating system has the job of determining the nature of the device that the program is accessing, handling the low level operations, and scheduling the operations.

There are three ways that a device can get data into memory or from memory.

Programmed I/O - the processor issues an I/O command on behalf of a process to a device, and the process busy waits until the I/O is completed.
Interrupt-driven I/O - the processor issues an I/O command on behalf of a process but then continues to execute intstructions (either in the same process, or, more typically, in some other process). When the operation is completed, the device issues an interrupt through the device controller, and the processor stops what it is doing and processes the operation. For an input operation, this typically involves transferring the data from the device buffer to the memory space of the process.
Direct-Memory Access (DMA) - a DMA module controls the exchange of data between main memory and the I/O device. The processor sends a request for the transfer of a block of data to the DMA module, which transfers the data directly from or to memory. The DMA controller issues an interrupt to the processor only after the transfer is complete, and the data transfer itself takes place without using the CPU or the kernel.

Programmed I/O is rarely used today, and interrupt driven I/O is being replaced by DMA as the preferred method for I/O. This is part of a general trend to do more I/O off line, that is, with a separate I/O processor rather than involving the CPU in I/O.

DMA is far more complicated than the other two methods. The DMA controller copies data directly to or from main memory without involving the processor. It uses the main system bus to do this. Meanwhile the CPU is doing other things, and will be accessing the bus. The bus controller hardware needs a mechanism for multiplexing these two independent modules and dealing with possible contention when both the CPU and the DMA controller want to use the bus at the same time. However, with modern cache memory, the CPU does not need to access main memory on each instruction as it used to, and so it is possible for the DMA controller to access the bus when the CPU doesn't need it. This is known as cycle stealing.

Another problem that DMA has to address is that the kernel has to make sure that the process that requested the data is not swapped out. Since the DMA controller copies the data directly into the memory space of the process, that memory, or more technically, that page or pages, has to be pinned in memory. Recall that if the system is busy, the kernel may choose to copy some processes entirely to disk for a while to improve performance. If the process that requested the read is swapped out, a deadlock situation could occur, in which the process is swapped out while waiting for data to be transferred but the DMA is blocked waiting for the process memory to be reloaded.

One solution which many operating systems use is to pin pages of memory where DMA will be performing I/O so that there is no danger that they will be replaced, even if the process has been sleeping.

Design Issues

Devices are divided into two broad categories, block oriented and character oriented, sometimes called stream oriented.. Block oriented devices store data in fixed size chunks (blocks), and transfers are done a block at a time. Disk drives and tapes are examples of block devices. Character oriented devices transfer data either as single characters or as a stream of characters with no block structure. Terminals, printers, and network communication ports are examples of character oriented devices.

Block oriented devices tend to be random access, which means that they support a seek operation; i.e. data can be read from anywhere on the device. Both Unix and Win32 allow reads to start at an arbitrary point; it is possible to say "read the ten bytes starting at byte 1000 in the file" without first reading the preceding 1000 bytes. Character oriented devices to not generally support seek operations; the data has to be accessed in the order that it is presented on the device. When the user enters a string of keystrokes at the keyboard, there is no way for a process to read the fifth keystroke without reading the first four keystrokes first.

The job of the I/O systems is to read data from whatever device and get the data into the memory space of the calling process or to write data from the process memory space to the device. The kernel provides a number of services to accomplish this. One of these, the file system, will be discussed in a different lesson.

The kernel typically provides a number of buffers for I/O operations, particularly for read operations from block devices like a disk drive. Whenever it processes a request for an I/O operation, it assigns one or more buffers. A block of data is read in from the device to the kernel buffer, and then the kernel copies the data from the kernel buffer to the user space memory.

Input performance can be improved if the system assigns multiple buffers to a read operation. As soon as one buffer is filled, the system not only starts transferring the contents of this buffer to the memory in the calling process, but it also starts reading the next block of data, even if the process has not requested this data yet. This makes sense for two reasons. First, most processes read files sequentially, so if they have requested a particular block of data from a file, it is likely that the process will soon be requesting the next block of data. Second, reading sequential blocks of data from a disk is far faster than reading random blocks, so the additional time and overhead needed to read the next block is pretty small. This makes it worthwhile to read this block even though there is a chance that the data will not be needed.

Assigning two buffers to an input process when only one is apparently needed is called double buffering and this is used in most modern operating systems.

If the operating system reads ahead as described above, this means that when a process makes a read system call to get data from a file on a disk, it is likely that in fact the requested data has already been read and is in a kernel buffer. The obvious exception is the first read from a file or other device.

Buffering can also speed up output. When a process makes a call to write, the system first copies the data to be written from user space to a kernel buffer. Copying from one memory location to another takes several orders of magnitude less time than copying from memory to disk. After this copy is finished, the calling process can continue execution while the I/O system copies the data from the kernel buffer to the disk.

Each device has a device controller, a hardware/software device that does the physical read or write and converts the data to or from software to hardware. For example, a disk drive controller reads a string of bits off the disk and converts these into bytes. Typically, in addition to the actual data bits, it may read some preamble bits and a check sum. The controller calculates the check sum from the bits that it read and confirms that the data that it read are correct. It then transfers these bits into main memory, either in the kernel, or in the case of DMA, to the DMA controller, which in turn copies the data to the memory space of the process.

Each device also has a device driver. A device driver is a piece of software, usually residing in the kernel, which communicates requests between the kernel I/O system and the device controller. Linux and other newer operating systems have a mechanism in which device controllers can be installed on the fly, i.e. without having to reboot the operating system.

Device drivers communicate with the device controllers with registers. The controller has several registers. Often there are four, command (read vs write), status, in, and out. Most operating systems have a standard interface which the kernel uses to communicate with the device driver.

Hard drives

Hard disks are probably the most important I/O device, and so we will start off with a discussion of how they operate. There are lots of good web sites which describe the basics of hard drives, and many students probably know all this stuff anyway, but you are responsible for reading these two web sites, which together could be called hard drives for dummies.

How stuff works; how hard disks work By the way, Marshall Brain, the author of the How Stuff Works series, is a Rensselaer graduate.

The PC Guide: hard Disk Drives

This is a computer science course, not a computer engineering course, so we are not particularly concerned with the hardware of the disk drive per se, but rather with the software/hardware interface. Recall that a disk drive is a block oriented device, which means that data are read and written in fixed sized blocks. A typical block size is 512 bytes. A sector contains one block, along with some alignment information.

The software interface to the disk controller will tell the disk controller which track and sector to read from. To do this, the controller has to move the read/write head in or out to the appropriate track. The time that it takes to do this is called the seek time. Once the head is correctly positioned, the controller has to wait for the desired sector to spin around so it is under the head. This time is called the rotational delay

Calculating seek time is hard to pin down because it involves startup time and some settling time to stop the arm when it is in the correct position. Typical seek times today are 5 to 10 msec. Typical hard disks rotate at 10,000 rpm, which is equivalent to 1 revolution every 6 msec. Thus, the average rotational delay will be 3 msec.

The total transfer time can be expressed as

T = b / rN

where T is transfer time, b is the number of bytes to be transferred, N is the number of bytes on the track, and r is the rotation speed in revolutions per second. Thus total access time can be expressed as

T_a = T_s + 1/2r + b/rN

where T_s is the average seek time.

If there are 320 sectors per track, here are some typical values.

T_s 10 msec

1/2r 3 msec

b/rN .02 msec

Essentially all of the access time is in the seek time and rotational delay. There is an obvious point to be made here. Once the disk controller has moved the read/write arm to the appropriate track and waited for the sector to come under the arm, reading additional blocks on the same track takes very little time. This is why reading additional data in the same file is essentially free if subsequent sectors on the same track have sequential data in the same file.

Disk scheduling policies

In a typical system, the disk drive will be receiving many requests for reads and writes, and these will have to be queued up. We can look at several strategies for scheduling I/O events on a disk.

First come first served: The obvious default scheduling algorithm is first come, first served; in other words, process the requests in the order that they are received.

Suppose a platter has 128 tracks. To do a simulation of this we will make some simplifying assumptions. First let's assume that requests for data come in random order. In practice this is not the case, typically files are read sequentially, and as we saw above, operating systems take advantage of this to read ahead. Second, we will ignore any times related to starting and stopping the read/write head. We will assume that it moves at a constant speed, taking 12.8 msec to get from the outermost track to the innermost track or vice versa, in other words, it takes .1 msec to move across one track. Third, we will assume that rotational delay is always 3 msec.

Here is a burst of requests, the start of the burst is time 0. Initially the head is at track 0.

time track

1 0 34

2 2 110

3 5 77

4 7 12

5 12 62

6 18 31

7 20 108

8 22 43

	time	track
1	0	34
2	2	110
3	5	77
4	7	12
5	12	62
6	18	31
7	20	108
8	22	43

When the disk controller receives the first request, it starts moving toward track 34. It takes 3.4 msec to position itself over track 34, and an additional 3 msec for rotational delay. So the first operation takes 6.4 msec. By this time, two more requests have come in, one for track 110 and one for track 77. Since we process these in order, the controller moves the head to track 110 from track 34. This takes 7.6 msec. and an additional 3 msec of rotational delay, so operation two takes 10.6 msec and is completed at time 17. But by this time three more requests have come in. The controller processes the third request, moving the head from track 110 to track 77. This takes 3.3 msec, so the entire operation takes 6.3 msec, and completes at time 23.3.

The following table shows how long it would take to complete these eight operations and the end time of each.

1 6.4 6.4

2 10.6 17.0

3 6.3 23.3

4 9.5 32.8

5 8.0 40.8

6 6.1 46.8

7 10.7 57.6

8 9.5 67.1

It would take a total of 67.1 msec. to complete these eight I/O operations. Can we do better?

Closest Track Next It might make sense to always do the closest job next. This would obviously result in shorter seek times. However, even though the seek times might, on average, be shorter, there are two problems. The first is a fairness issue; data on tracks in the middle tracks would be accessed far faster than data in the very low numbered or very high numbered tracks. It should be obvious that on a platter with 128 tracks, a sector on or near track 64 is more likely to be near a randomly selected track than is a sector on track 1 or track 127. It might be possible for an operating system to use this unfairness to its advantage by placing often used files on the middle tracks and rarely used files on the outermost and innermost tracks. However, when files are created, the system does not generally know whether or not the file will be heavily used, and so no operating system uses this method.

The second, more serious problem with the closest track next algorithm is the potential for starvation. Recall that the definition of starvation is that the time to honor a particular request is unbounded. A request for an operation on a high numbered track could be blocked indefinitely if requests for low numbered tracks continuously arrived.

Because of these two problems and because the performance of the next algorithm is almost as good and does not have the same problems with fairness or starvation, closest track next is not used.

The elevator algorithm

There are some similarities between a read/write head seeking tracks and requests for an elevator in a skyscraper. In both cases, requests come in to go to a particular floor or track on a more or less random basis and the elevator has to get its passengers to the requested floors as efficiently as possible. Elevators solve this problem by going in one direction as long as there are more requests in that direction, and then going in the other direction as long as there are more requests.

It turns out that this algorithm works well for disk seek operations. At any given instant, the read/write head is either moving in toward the center of the disk or out toward the outside. If it can satisfy a new request by moving in the same direction, it does so, but if it would have to switch directions and there are additional requests that it could satisfy by going in its current direction, it will not satisfy the new request until it turns around.

If the head is moving in the direction of higher track numbers and it is at track 40 and if there is a request for data on track 100, it would not attempt to satisfy a request for data on track 38 until it had satisfied all of the requests which were higher, even though it only has to move two tracks. When the read/write head gets to track 100, it would turn around and go to track 38 if there were no requests for tracks greater than 100.

The elevator analogy clarifies this. If you are on an elevator going down to the first floor, and the elevator is on the 20th floor, and a request comes in to go to the 22nd floor, you would be pretty mad if the elevator went up two flights without first going to the first floor.

Let's do a simulation of the elevator algorithm using the same data. The disk arm starts at position zero and the first request comes in for track 34, so this takes 3.4 msec of seek time and 3.0 msec of rotational delay for a total of 6.4 msec, the same as FIFO. However, now two more requests have arrived, one for track 110 and one for track 77. The read-write head is going up, and so it honors request 3 (track 77) before request 2. This takes 4.3 msec of seek time and 3 msec of rotational delay for a total of 7.3 msec. But by now two more requests have come in. Request 4 is for track 12 and request5 is for track 62. Although track 62 is closer, the elevator algorithm first goes to track 110. Here is the complete simulation.

Request Cumulative

Number track time time

1 34 6.4 6.4

3 77 7.3 13.7

2 110 6.3 20.0

7 108 3.2 23.2

5 62 7.6 30.8

8 43 4.9 35.7

6 31 4.2 39.9

4 12 4.9 44.8

Request			Cumulative
Number	track	time	time
1	34	6.4	6.4
3	77	7.3	13.7
2	110	6.3	20.0
7	108	3.2	23.2
5	62	7.6	30.8
8	43	4.9	35.7
6	31	4.2	39.9
4	12	4.9	44.8

Note that it took a total of only 44.8 msec to process the eight operations with elevator algorithm in contrast to 67 msec with the First In First Served algorithm.

There is no problem with starvation with the elevator algorithm because the time to get to a particular track is always bounded. If the head is at track 40, going up, and a request comes in for track 38, the upper bound would be the number of tracks greater than 40 as the head moved upward plus the number of tracks between the maximum track and 38 as it moved back down. This may be a large number, but it is bounded, so it satisfies our definition of no starvation.

There is still a problem with fairness, but it is much smaller. Some tracks will, on average, receive slower service than other tracks. If this is a matter for concern, a small variant of the elevator algorithm is called C-SCAN. In this algorithm, the arm always goes in the same direction, from lowest track request to highest track request for example. After processing the request with the highest track number, the arm moves back to the first track or to the track with the lowest numbered request. It then starts moving upward again, processing the requests in track order.

RAID

Performance (i.e. speed) of disk drives has not improved nearly as rapidly as the performance of memory or CPUs, but disk drives have become very inexpensive. This has lead to the use of multiple disk drives to improve performance. With multiple disk drives, separate operations can be done in parallel as long as the operations are on separate disks. Even a single operation can be speeded up if the data to be retrieved or written can be distributed properly across several disks.

The computer industry has agreed on a standard for transparently distributing a file system across multiple disks. It is known by its acronym RAID which stands for Redundant Array of Independent Disks. There are seven different RAID architectures. They all share some common characteristics.

A RAID installation has multiple disks but they are viewed by the operating system as a single logical drive
Data are distributed across the disks in an organized fashion.
Redundancy is build into the design so that if a single disk fails, no data is lost.

Here are the seven levels:

RAID Level 0

Level 0 is not technically a RAID scheme because it does not include redundancy to improve performance. Data are striped, that is, divided up into strips which are distributed in an orderly fashion across all of the disks. Strips can be blocks or sectors or some other unit. Consecutive strips are mapped to consecutive disks. If there are four disks, strip 1 of a file would be on disk 0, strip 2 of the file would be on disk 1, strip 3 of the file would be on disk 2, strip 4 of the file would be on disk 3, strip 5 of the file would be on disk 0, and so on.

A set of logically consecutive strips that maps exactly one member to each disk is referred to as a stripe. Thus, logical strips 1, 2, 3, and 4 of the file constitute a stripe. This is important, because when the file is read (or written), note that strips 1, 2, 3, and 4 can be read in parallel. This can dramatically improve reads and writes of large files to and from disk.

This performance improvement can only happen if the entire data pathway is completely parallel; that is, not only do the reads happen in parallel, but each disk has to have its own data line to the controller and from the controller to the system bus and eventually to memory.

Note also that this improvement in performance can only happen if the system makes large data requests. However, as we saw, modern I/O systems generally do make large requests, larger than the initial read. If the initial read is for a few bytes, all of which reside on strip 1, most systems now would read the entire file into memory buffers on the assumption that the process is likely to read other strips shortly.

Even without striping, a system with data distributed across multiple disks can show improved performance. Consider a transaction processing system in which there are a large number of requests for small records in a database. If the records are distributed randomly across the disks, several different records can be read at the same time, thus enhancing performance.

RAID level 1

All of the levels of RAID other than 0 incorporate redundancy to preserve data in the event of the loss of data on a disk. One of the most common types of hardware failure in a computer is the head crash of a disk. The read/write head moves very close to the platter, and the platter is spinning very rapidly (typically 3,600 or 7,200 rpm). This requires very fine tolerances. If the head is jarred ever so slightly or if a piece of dust gets between the head and the platter, the platter can get scratched, causing some or all of the data to be lost.

All of the levels of RAID other than 0 incorporate redundancy to preserve data in the event of the loss of data on a disk. Level 1 is the simplest, but is not used much. In level 1, data striping is used as in RAID 0, but each logical disk is mapped to two different physical disks so that every disk has a mirror disk with the same data. With our example, there would be eight disks in the array, and strip 1 would be on both disk 0 and disk 4.

This has two advantages. First, a read request can be made from either of the two disks, whichever involves the minimum seek time plus rotational latency. Second, if a disk crashes and all of its data is lost, recovery is trivial because all of the data is on the mirror disk.

Write operations have to be done twice, but this does not usually impose much of a performance penalty because the two writes can be done in parallel. However, the write time is the time of the slower of the two operations.

The principle disadvantage of RAID level 1 is the cost, since it requires twice as much disk space as the logical disk space.

RAID level 1 is often called mirroring for obvious reasons.

RAID level 3

I will discuss RAID level 3 before RAID level 2 because it is simpler to understand and more widely used.

Both RAID level 2 and level 3 use a parallel access technique in which all member disks participate in every I/O operation. Typically the spindles and the read/write heads are synchronized so that all of the heads are at the same place on each platter at all times. Strips consist of very small units, on the order of a single byte or word. Parity information is also stored to provide error detection and/or error correction. If there is a single bit error, the RAID controller can detect and correct the error immediately without other components of the system being aware of the error.

RAID level 3 has a single parity disk. In the event of a disk failure, the data can be reconstructed. Note that every read and write involves all of the data disks and the parity disk, and that the disks are so synchronized that we can have corresponding bits (or bytes or words) on all of the disks. Typically strips are very small, on the order of a single byte. Thus successive bytes of a file are stored on successive disks.

The parity value is calculated by performing an exclusive or operation on the data bits. One way to think about this is that the value on the parity disk is set to either zero or one in order to make the total number of one bits be an even number.

Here is an example. Suppose that a system has four disks for data and a fifth disk for parity. Strips are one byte wide. The table below shows the bits for four, 8-bit bytes and the value of the parity disk. The columns represent different disks, so there are five columns.

 0  1  0  1  0
 1  1  1  1  0
 0  1  1  1  1
 0  0  0  0  0
 1  0  0  0  1
 0  0  1  1  0
 0  1  0  0  1
 1  0  0  1  0

Thus, the first disk has the byte 01001001, the second disk has the byte 11100010, the third disk has the byte 01100100, and the fourth disk has the value 11100101. These would typically be four consecutive bytes of a file. Note that the value on the parity disk is the result of performing an exclusive or operation on the four bit values.

The parity disk allows for data to be reconstructed in the event of a disk failure. Suppose that the leftmost disk crashes and is completely unreadable. Because of the parity bits, we can reconstruct all of the data on this disk. If you look at the top row, we still have the three data bits 1, 0 and 1, and we know that the parity bit is zero, so we know that the missing value must be zero.

Notice that reads and writes can still be done in parallel. Because there are four disks, four bytes can be read at the same time, thus speeding up the time of reads by a factor of four. In fact, the parity disk is read as well, but is also done in parallel, so there is no time penalty.

There is one parity disk no matter how many data disks there are. If there are 16 data disks, meaning that 16 bytes can be read simultaneously, there is still a single parity disk and if any single disk fails, all of the data can be reconstructed.

RAID level 2

RAID 2 uses Hamming Code. This is a method of storing data redundantly so that errors can be not only detected, but often corrected on the fly. Hamming code is often used for memory error checking.

Here is a web site which explains Hamming code better than I possibly could (this may be on the test). Here is a cached version

Thus, with RAID level 2, if there are four data disks, there would need to be three additional parity disks for a total of seven. This is not widely used, because most disk drives now include error correcting code.

RAID level 4

The remaining levels of RAID use striping and parity, but the strips are relatively large. With RAID level 4, bit-by-bit parity is calculated across corresponding strips on each data drive, and the parity bits are stored on the corresponding strip on the parity drive.

This level incurs a performance penalty with small write operations because each write operation has to write not only the data, but also the corresponding parity bits, and this means that it has to read the old parity bit and recalculate it. Consider a system with four data disks and a parity disk. In order to write a single byte which happens to be on disk 1, the system first has to read the old data on disk 1 and the corresponding parity byte on disk 5. It then recalculates the value of the parity bits, and rewrites both the data and the parity byte. Thus one logical write operation involves two read operations and two write operations. This only makes sense on systems where most of the write operations involve writing large amounts of data. Even then, since every write has to involve the parity disk, this can be a bottleneck.

Here is a diagram for RAID level 4.

RAID level 5

RAID level 5 is similar to RAID level 4 except that the parity bits are distributed across all of the disks instead of being concentrated on one disk. This overcomes the problem of the bottleneck associated with the fact that every write operation involves the parity disk.

This is a very widely used system. It is very efficient for operations involving large files. For example, suppose that a process has to read a file involving 20 sectors. Rather than reading each sector sequentially, it can read the sectors four at a time in parallel (if there are four data disks), thus achieving a 400 percent speedup. When it later writes the file, it can write four sectors plus the parity sector at the same time, so again there is a 400 percent speed up compared to a single disk (minus the relatively small overhead of calculating the parity bits). Unlike levels 2 and 3, it does not involve synchronizing the drives, which is difficult to achieve.

It also has redundancy built in so that if any one disk fails completely, all of its contents can be reconstructed automatically with no loss of data. Of course if two disks fail simultaneously, all of the data is lost.

RAID level 6

RAID level 6 was introduced later. It is for systems which require very high levels of redundancy and availability. It requires the computation of two different parity calculations for each each stripe, and thus it requires two additional disks. A system which requires four data disks would have six disks in all. The advantage of RAID level 6 is that if two disks fail, no data is lost. The disadvantage is that there is a more substantial penalty incurred for small write operations since every write operation also requires calculating and writing two separate parity values.

Return to the course home page

1	6.4	6.4
2	10.6	17.0
3	6.3	23.3
4	9.5	32.8
5	8.0	40.8
6	6.1	46.8
7	10.7	57.6
8	9.5	67.1