CSCI.4210 Operating Systems Fall, 2008, Class 12

CSCI.4210 Operating Systems Fall, 2008 Class 12
More on File Systems

Implementing Directories

A directory is just a file with an array of pairs, file names and pointers to inodes.

Unix supports shared files, i.e. one file listed in two or more directories. Also, it is possible for a file to have two (or more) names in the same directory, as shown in the above diagram.

The shell command
link existingfile newfile
will do this.

There is a also a system call
#include <unistd.h> int link(char *existingfile, char *newfile);
so that you can do the same thing from inside a program.

You can create a symbolic link with the system call
int symlink(const char *name1, const char *name2);
One important difference between hard links and symbolic links is that hard links must be in the same file system (on the same server) while symbolic links can cross file servers (more on this shortly).

Windows supports the same concept. The windows term for a symbolic link is a shortcut. These have a .lnk suffix.

The fact that one file (inode) can have multiple pointers to it from different directories complicates the problem of deleting files. Note that one of the fields of the Inode is the number of links. Whenever a new link is created, this is incremented, and whenever a link is deleted, this is decremented. If this value gets to zero, then the system can overwrite the contents of the file and delete the Inode.

The name of a file is an entry in a directory, and a file can have more than one name. The Inode does not have a field for the file name.

Also note that a dump (i.e. a backup) may copy the file two ore more times, once for each link to it; and a search (the Unix search command is grep (don't ask)) may search the same file two or more times.

The connection between the name of a file in a directory and the pointer to its Inode is called a hard link. Unix also allows the user to create Symbolic Links (these are called shortcuts in Windows). A symbolic link differs in a subtle way from a hard link in that the symbolic link contains the address (pathname) of a hard link.

When I do an ls -l / on Solaris, asking for a long listing of the root directory, one of the lines looks like this.
bin -> /usr/bin
This means that the entry /bin is really a symbolic link to the directory /usr/bin.

Here are the steps involved in creating a new file in Unix

create a new inode
find a free block on disk for the inode
find a free block on disk for the first block of data
find the directory on the disk and read it into memory
add an entry to the directory
write the directory
write the inode
write the data

Note that to traverse a pathname such as
/usr/include/sys/types.h the system first has to read the inode for the root directory, then find the data for the directory and read it. This will provide a pointer to the inode for the directory usr, so this inode has to be loaded into memory and read so that the contents of the actual directory can be read. This will provide a pointer to the inode for include, and so on. Thus, a total of ten reads have to take place in order to find the location of the file and read it into memory.

Keeping track of free space on the disk

The file system has to keep track of free blocks of memory. There are two widely used systems for this. Some systems maintain a linked list of free blocks, while others keep track of free blocks with a bitmap. With a linked list, when memory becomes freed because a file is deleted, a pointer to that block of memory is simply added to the list, and when a new block of memory is needed because a new file is created or because an existing file grows, the system just has to find any available free block and remove it from the list.

If the system uses bitmaps, a bitmap is maintained in memory, with one bit corresponding to each block. When a block becomes freed, its corresponding bit is set to 0, and when a new block is needed, its bit is set to 1.

Improving File System Performance

Processing speed has improved much faster than disk access speed, which means that the bottleneck in many processes is disk access. Anything that might reduce the number of disk accesses can dramatically improve performance. There are a number of methods that file systems use to accomplish this.

block caching storing a collection of blocks that logically belong in the file system, but are kept in memory The file system uses a hashing system to determine if a block is in the cache.
Note that as with any caching system, you need a page replacement algorithm.
If an inode is read into the cache, modified, but not written back, an inconsistent file system can result. As a result, inodes (and other blocks which maintain file system consistency) should always be written to disk immediately.
Unix has a command sync which writes all modified blocks in the cache to disk. This is run every 30 seconds or so.
Block Read ahead When a request to read block n is read, also read block n+1, n+2, .... Modern file systems with large memories can carry this to the extreme of reading a whole file into memory when it is first opened.
Reducing Disk Arm Motion - If an Inode is stored near the actual data, disk arm movement can be reduced because when the file is opened, the Inode is read into memory, and if the system is using block read ahead, much of the file is then read into memory, often before the first read operation.

Block Size

The block size is an important design decision for file systems. Obvious candidates are sector size, or page size. A larger block size results in fewer seeks and rotational delays, but more internal fragmentation. The median file size in UNIX is about 1K, and so if the block size is 4K, the majority of files can fit in a single block. However, if a file is, say, 500 bytes, there is 3.5 K of wasted disk space.

Disk is now inexpensive, and disks are very large, so the wasted space of fragmentation is less of a problem than it used to be.

Log Structured File Systems

If a System has a large cache, most reads and writes are to cache rather than to disk. This means that the vast majority of disk accesses are writes, not reads. Writes are often done in small chunks. To create a new file, a directory entry has to be written, an inode has to be written, and the actual data has to be written. (For example, if the user enters the shell command
ls -l > temp
the system has to create a new file). These cannot be buffered.

One solution to this problem is a Log Structured File System in which the entire disk is a log. All pending writes are collected into a single segment and written as a single segment at the end of the log. This is done as one continuous write, and so it is very efficient because there is little or no seek time. At the start of each segment is a segment summary listing the contents of the segment. Finding the contents of a file from the inode is no harder, but finding the inodes is much harder (because Inodes are simply a part of the log, and therefore distributed more or less randomly on the disk) so the system maintains an inode map.

The log is the only structure on the disk. This makes writes very much more efficient compared to a traditional file system. With a traditional file system write, the system has to read the Inode to determine the sector number (although the Inode is presumably cached, so this may not be a costly operation), then it has to perform a seek operation on the disk to move the read/write head over the appropriate cylinder. This is a very costly operation.

With a log structured file system, all of the writing operations are done consecutively. At first all of the data to be written is kept in a buffer, but this buffer is written periodically. Since it is written consecutively, there is no seek time and so writes can be done much more quickly. The initial paper describing the Log Structured File System found that it could use 70% of the bandwidth for writing, as compared to only 5% to 10% of the bandwidth for a traditional file system.

For example, to write a new single block file, a traditional file system requires four non-sequential write operations. The inode for the directory has to be written, the directory contents have to be written, the inode for the new file has to be written, and the actual contents of the file have to be written. Each of these operations will require a separate disk seek operation. In contrast, the LSFS will still perform the same writes, but it will be done in one long write of all of the information.

If the disk were infinitely large, this would be the end of the story, but over time, many blocks in segments are defunct, so LFS have a cleaner program that runs as a thread of the kernel and cleans things up. All it has to do is to ignore defunct blocks and put active blocks back in the cache for rewriting. This ensures that there are always large chunks of free space available on the disk for writing the log.

Journaling File Systems

It is important to maintain file system consistency. It would not do to have blocks which were neither assigned to a file nor on the free list, or i-node pointers which pointed to the wrong blocks. This could happen if the system crashed during the middle of a file system operation.

Here are the steps involved in deleting a file

Remove the file from the directory
Release the inode to the pool of free inodes
return all the blocks to the pool of free blocks

The order that these are executed does not matter as long as the system does not crash, but if the system crashed between steps 1 and 2, the inode and disk blocks would remain, but there would be no way that a user could access them.

A Journaling File System, such as ReiserFS or ext3, both of which are used on various flavors of Linux, keeps track of this. All of the steps for an operation are written to a journal before any of them are executed, and this journal is written to disk. Whenever the system is rebooted, the system checks the journal and exeutes the steps in the journal.

Virtual File Systems

It is common now for one operating system to have multiple file systems. Microsoft systems do this by designating different file systems with different letters A: B: C: These do not all need to be running the same type of file system. When you load a CD, it is running a different file system than the file system on your hard drive, and if you are running Samba, (which maps a Unix file system to your PC), this is running yet another type of file system.

A Unix systems can also be running different file systems, but Unix likes to unite them all in a single directory tree with a single root. Many unix systems have a networked file system in which files are on a separate system, although they appear as all one local system to the user.

The solution to this problem is a Virtual File System (VFS). In a VFS, all of the system calls (read, open, mkdir etc) are identical across all file systems. These are sent to the VFS layer in the kernel. This has a separate interface to each different file system.

Each file system has a VFS interface which consists of several dozen function calls. Generally each file system has a vector of all of these functions so the VFS layer simply has to index into this vector.

In the unix world, the process of establishing an interface to a new file system is called a mount

File System Backup

There are two ways to back up an entire file system, logical and physical. A logical backup starts at the root and makes a copy of each file, directory, etc. on the backup medium (often still tape).

A physical backup simply copies the contents of an entire disk to the tape.

The advantage of the logical backup is that only sectors contains valid data are copied and so the backup is smaller. The disadvantage is that it involves a lot of jumping around on the disk. The advantage of the physical backup is that it can be done very efficiently, there is essentially no jumping around. However, empty or invalid sectors are copied.

There is also the issue of what to back up. Once a complete backup has been made, it is usually only necessary to copy those files that have changed since the last backup (an incremental dump). This has the disadvantage that restoring a particular file is more difficult.

If the file system is active, backing up is more difficult, and can result in inconsistencies.

Logical dumps are more common.

Maintaining file system consistency

Operating systems generally have a utility to detect file system inconsistencies. On unix, this is call fsck, and on Windows, it is called scandisk

These systems build two tables. In both tables there is a counter for each block. The first keeps track of how often each block is used in a file, the second table keeps track of how often each block is on the free list.

The program then goes through a list of all the inodes, incrementing the blocks in each file.

Once these two tables have been completed, the utility checks for inconsistencies. In a completely consistent file system, each block will be a file exactly once, or it will be on the free list exactly once.

Here is a list of possible inconsistencies

A block is neither used or on the free list
A block is in two different files (i.e. pointed to be two different inodes)
A block is in both the free list and in a file
A block is on the free list twice

It is also necessary to traverse each directory to make sure that all files have an entry in at least one directory and to make sure that each entry in a directory is a valid file. This also checks to make sure that the number of links to each file as specified in the inode is correct.

Example File Systems

CD-ROM File System (ISO-9660
These are very simple because files are not allowed to expand. This means that space can be allocated contiguously. A CD does not have concentric cylinders, the sectors are in a concentric circle, so writing is very easy. Each sector (logical block) contains 2352 bytes, but each includes some metadata (preamble, error correction, etc) so the actual data is 2048 bytes per block.

There is also space at the front of the CD for information such as (creation date, volume id. This is followed by the root directory

Unix
I mentioned the inode structure Many implementations of Unix use NFS, the network file system, or a similar system in which the files are distributed over several different computers (file servers). There will be a detailed discussion of this later in the course.

Windows File system implementation

FAT

Modern Microsoft operating systems support several different file systems. The simplest of these is the FAT file system. Recall that FAT stands for File Allocation Table, and the idea behind this is that the system maintains a table for each file which contains the disk addresses of each block.

FAT comes in several flavors, including FAT12, FAT16 and FAT32. A pure file allocation table would have an entry in the table for each sector (512 bytes), but for performance reasons, the disk is broken up into larger units called clusters or allocation units, which contain a number of contiguous sectors, usually in the range of 4 to 64 (2,048 to 32,767 bytes). Each file is allocated a number of clusters; one cluster cannot contain data for more than one file.

A portion of each disk is allocated for the FAT, but this is usually kept in memory for faster access.

FAT12 was a component of early DOS computers. The FAT12 file system used a 12 bit field to hold the cluster number. This meant that a volume using FAT12 could only have 4,086 clusters (12K minus a few for housekeeping). This is hardly ever used today.

FAT16 is still used on many older systems; you might have been able to guess that it uses a 16 bit field for cluster numbers. A FAT16 file system could have 65,526 clusters.

FAT32 is used on the more modern Windows operating systems. It uses a 28 bit field to store cluster numbers (that's not a typo, apparently they thought that the name FAT32 was more aethetically pleasing than FAT28). Since the maximum cluster size that it can support is 32KB, the FAT32 file system can support disks of up to 2^41 bytes, which is more than enough.

There is one FAT for the file system. Directory entries contain that FAT entry of the first cluster of the file. This entry contains the cluster number of the first cluster and the FAT entry for the second cluster of the file. The FAT entry for the second cluster contains the FAT entry for the third cluster, and so on. This allows the operating system to obtain the cluster numbers of all of the clusters for a file without any time consuming disk reads (assuming that the FAT is in memory).

Note that deleting a file on a FAT file system (or on most other file systems) does not mean that the contents are erased. It simply means that the entry in the FAT is deleted. In fact, often the file system does not even delete entries in the FAT table; it simply marks them as deleted. This makes it possible to recover a deleted file (often third party software will have such a utility). Of course over time, the clusters get overwritten and so recovery is no longer possible.

The clusters of a file can be distributed anywhere on the disk. As we saw last week, it is more efficient if the clusters of a file are stored contiguously. The defrag utility goes over the entire disk and stores logically contiguous clusters of each file on physically contiguous clusters. This permits much faster file access.

NTFS

The New Technology File System (NTFS) was introduced as a part of the Windows NT operating system. It incorporated a number of improvements over the various FAT systems, particularly FAT16, which was the standard at the time that NTFS was first developed. These improvements included:

Improved reliability, the ability to recover from crashes and other problems with minimal data loss.
Security and access control
Long File names (FAT16 only allowed file names of up to eight chars with a three char suffix.)
Improved networking. Modern file systems are distributed, and NTFS makes it easier to support this.
Uses Unicode, not ascii

Files are not just a string of bytes, there are a number of fields such as name. But one file can have multiple streams.

Each volume (partition) has a Master File Table (MFT) - this has metadata and a record for each file. (1KB per record). If a file is short enough, the data can be kept here too. Big files may have more than one record in the MFT.

Storage Allocation is based on runs. A run is a series of consecutive blocks. One file can have many runs. But note that unlike an inode, where there was a pointer to every block, the number of pointers may be much less. All you need is a pointer to the starting point of each run, and its size.

NTFS also supports file compression, journaling, and file encryption.

Return to the course home page