CSCI.4210 Operating Systems Fall, 2008, Class 18

Server Design, Networked File Systems

Miscellaneous socket stuff

Sockets on Windows

Socket programming on Windows uses nearly the same system calls as Unix, with just a few annoying quirks (but you could have guessed that!). The socket system is called Winsock.

In order to use sockets on Windows, the program has to initialize the Winsock dynamic link library WS2_32.DLL. This is done with a call to WSAStartup, which takes two arguments: WORD wVersionRequested and a pointer to a WSADATA structure. Rather than try to explain this, I'll just give you the code. This has to appear in any program which uses Winsock, prior to any calls to the socket APIs.

#include <winsock2.h>
int retval;
WORD version;
WSADATA stWSAData;
...
version = MAKEWORD(2,2);
retval = WSAStartup(version, &stWSAData);
if (retval != 0) error(...)

Here is a link to the online help for WSAStartup.

Windows has no concept of a file descriptor, so the socket system call returns a value of type SOCKET. The accept call also returns a type SOCKET.

Finally, you need to link to the winsock library during the compile. Here is how you do this in Visual Studio 2005.

Here is the server code for a stream socket using winsock.

The select system call

The basic socket program that we saw in the previous class has some limitations. If the receiver executes a read or recv, and the sender is not sending any data, the receiver will be blocked forever. It is easy to imagine a deadlock situation in which each end of a connection is waiting for data from the other.

Also, there are times when a server has several sockets open and would like to do other things during those periods when there is no data on any of the lines, but would like to be signaled when data is available on any of them.

The select system call solves both of these problems, although it is somewhat awkward to use. It allows the process to sleep, but be awakened when any one of multiple I/O events occurs. This is done by passing in as an argument a set of file descriptors on which the server can listen. It also allows the process to set a timer and to be awakened if none of the events occur during that time period.

Here is a link to the FreeBSD man page for select.

Using the Alarm to break out of communication deadlock

Another way to break out of the communication deadlock situation described above is to use the alarm system call. Here is the prototype:

unsigned int alarm(unsigned int timer);

This causes a SIGALRM signal to be sent to the calling process after timer seconds. The default action for SIGALRM is to terminate the program, but this can be overridden using the signal() function that we discussed several weeks ago.

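When you are reading from a socket or pipe, and there is some danger that communication deadlock will occur, you can set the signal handler and call alarm before the read or recv call. If nothing has happened by the time the alarm expires, the program will jump into the signal handler function, and your program can take appropriate action. Whether the interrupted read or recv then fails with the error EINTR or is silently restarted depends on how the handler was installed, so check this on your system.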

Design of Servers

Possible Server Designs

Iterative server - each client request is completely processed before moving on to the next client
Concurrent server with fork - one process per client
Concurrent server with threads - one thread per client
Concurrent server using select
Preforking server - create a pool of child processes at startup, one process from the pool handles each client
Prethreading server - create a pool of threads at startup, one thread from the pool handles each client

When would you want to use each?

Iterative server. This is best if the response from the server is quick, because there is minimal overhead.

Concurrent server with fork. This is best if there will be few connections but each connection will do extensive reading and writing over an extended period. The parent has to reap its terminated children so that they do not linger as zombies.

Concurrent server with threads. This is generally better than a concurrent server with fork, because the overhead of creating a new thread is much less than that of creating a new process.

Server with select. This design is best for a server which wants to listen on many sockets simultaneously but where connections are relatively rare. We will see a good example of this with the Unix internet daemon inetd, described below.

Preforking server. This is a new concept, but it is worth studying because it is probably the best design for servers which have to handle many requests and for which response time is important. For example, file servers or web servers usually use a preforking model. These receive many requests and have to respond very quickly. The overhead associated with creating a new process or even a new thread for each request would be prohibitive if the server gets heavy use.

To do this, the server calls socket, bind and listen exactly as we have seen, and then calls fork several times to create a number of identical processes. Each of these processes then enters its infinite loop and each calls accept.

Recall that when a call to fork creates a new child process, all of the file descriptor information is duplicated. This means that the child process is listening on the same port as the parent.

Each process goes to sleep. What happens when a connection occurs is somewhat system dependent; here is how it works on Berkeley Unix. When a connection arrives, all N processes are awakened. This is because all have been put to sleep on the same wait channel. Exactly one of these will accept the connection (accept will return). The others will go back to sleep.

The code should be written such that each connection is handled concurrently. The process that accepted the connection will read the request and supply the response. Meanwhile, if other connections arrive, other processes will accept and handle them. When a particular process completes a request, it closes the socket on which it received and sent the data, and goes back to the accept statement again.

One issue is how many processes to create. If there are too few processes for the number of connections, clients may still be forced to wait if all of the processes are busy handling other connections. However, there is some minimal overhead associated with waking up many processes, and so it is inefficient to create too many processes.

Having many processes block in accept on the same socket may not work on other Unix implementations. One solution is to put a lock or some other mutual exclusion primitive around the accept statement so that only one process will be able to accept at any given instant.

The Apache web server uses preforking with an additional twist. It can change the number of processes based on load. It periodically checks to see how many processes are busy. If most are busy, it creates more processes; if most are idle, it can kill some of the processes. The system administrator can set a minimum and maximum number of child processes.

Prethreaded server. This works in much the same way as a preforking server, and has more or less the same advantages and disadvantages. One potential problem with this is that if a fatal exception occurs, such as a segmentation fault, it will kill the entire process including all the threads, while if it happens in a preforked server, it will kill that process, but the other processes can continue. Of course, if the code is well written, this should never happen.

Daemon Processes and the inetd superserver

A daemon is a process that runs in the background and is not associated with a controlling terminal. Typical Unix systems have 20 to 50 daemons running in the background doing various administrative tasks. The Windows equivalent is a service.

Most daemons are started at system initialization. There is a system initialization script that starts them. They generally have superuser privileges.

One of these is the cron daemon, which keeps a table of events in a file such as /etc/crontab. It wakes up once a minute and sees if anything needs to be run.

If a daemon has to output a message, it can't do it directly because it has closed stdin, stdout and stderr. Therefore, messages that would normally be written to standard output or standard error are written to the system log. There is a syslogd daemon which daemons can use for this.

Here is some skeleton code for creating a daemon (modified from Unix Network Programming: The Sockets Networking API, Vol. 1, third edition, by W. R. Stevens, B. Fenner, and A. M. Rudoff, Addison-Wesley, 2004).


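#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

#define MAXFD 64   // assumed upper bound on the number of open descriptors

// error-reporting helper, assumed here to print the message and quit
static void error(const char *msg) { perror(msg); exit(1); }

int daemon_init(const char *pname, int facility)
{
   int i;
   pid_t pid;

   pid = fork();
   if (pid < 0) error("forking");
   if (pid > 0) exit(0);  // parent process terminates

   // become the leader of a new session, which has no controlling terminal
   if (setsid() < 0) error("setsid");

   // the second child is sent SIGHUP when the session leader exits; ignore it
   signal(SIGHUP, SIG_IGN);

   pid = fork();
   if (pid < 0) error("forking");
   if (pid > 0) exit(0);  // first child (the session leader) terminates

   // this guarantees that the child is not a session leader
   // and so it cannot obtain a controlling terminal

   if (chdir("/") < 0) error("chdir");  // change to the root directory

   for (i = 0; i < MAXFD; i++) close(i);

   // descriptors are handed out lowest first, so these become 0, 1 and 2
   open("/dev/null", O_RDONLY);  // stdin
   open("/dev/null", O_RDWR);    // stdout
   open("/dev/null", O_RDWR);    // stderr

   // This guarantees that anything read from stdin or written to stdout
   // or stderr goes harmlessly to /dev/null instead of failing.

   openlog(pname, LOG_PID, facility);

   return 0;
}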

The inetd Daemon

On a typical Unix system, there could be many servers in existence, waiting for a request. Before 4.3BSD, each had a process associated with it. Each daemon took a slot in the process table, but was asleep most of the time.

Examples include ftp, telnet, rlogin, and finger.

These all do pretty much the same thing.

The solution is inetd, the internet superserver.

inetd starts, makes itself a daemon, reads /etc/inetd.conf and creates a socket for all services specified in the file.

Each socket is bound appropriately. The port is determined by calling getservbyname with the service-name and the protocol fields.

It listens on each socket.

It calls select.

Whenever a connection is received on any of the listening sockets, it wakes up, forks off a child and execs the appropriate process to handle the connection.

IPv6

IP version 4 has been in use since 1983, and it has worked well, but there are plans to upgrade it. The next version is IP version 6 (IPv6, or IPng for "next generation"). The major reason for this upgrade is that we are running out of IPv4 addresses, but there are other motivations as well. IPv6 will provide better mechanisms for authentication and other security issues, and it will make routing easier.

The basic concept of packet switching is unchanged. IP is still connectionless and unreliable. Here are some of the changes:

The most obvious is that the size of an IP address is expanded from the current 32 bits to 128 bits, which gives about 3.4 x 10^38 addresses. This is obviously overkill if the goal is simply to get more IP addresses. Every man, woman, and child on the planet can be assigned their own block of 4 billion IP addresses, allowing them to assign an IP address to their refrigerator, their toaster and other household appliances, with plenty left over for other purposes.
There is a more flexible header. Every IPv6 packet has a 40 byte fixed size header. This is followed by a linked list of additional headers for Route Information, Security, etc.

There has been extensive discussion of how to implement IPv6. Clearly there cannot be a simple switch-over day like there was when TCP/IP was introduced in 1983. There are two possibilities: The first is that most routers would implement a dual stack for a while; that is, they would be able to process both IPv4 and IPv6 packets. The second possibility is tunneling. If an IPv6 packet arrives at a router which only uses IPv4 (or the reverse), the router can treat the entire packet, including the IPv6 header(s) as the payload and construct one or more IPv4 packets, which it forwards. When these packets get to the end of the IPv4 network (often just a single hop), the IPv4 header is stripped off and the IPv6 packet is sent on its way.

Remote Procedure Calls (RPCs)

The idea behind a remote procedure call is that it looks like a regular function call to the calling program, but it is executed on a different machine.

There are several reasons why a programmer might want to use RPCs:

The procedure might use a specialized architecture.
The procedure might access a distributed database.
The procedure might access data on a different computer (a distributed file system).
In some cases, performance can be improved by using many CPUs.

There are some obvious disadvantages to using RPCs.

There is a performance penalty because the procedure call has to be sent over the network.
Pointers are not permitted because pointers always refer to an address on the local machine, not the remote machine.
Local resources, such as file descriptors, cannot be accessed in a remote procedure call.

Sun RPC is the standard. It includes:

A message format
A standard method of representing data, XDR (External Data Representation)
A compiler system to generate code

There are RPC servers and clients. The server is the machine on which the remote procedure runs, and the RPC client is the process, generally on a different machine, which makes the RPC call. Like a thread start function, a remote procedure can only take one argument, and it can only return one value, so, as with threads, if you want to pass multiple arguments to an RPC, you have to create a struct and pass the struct as an argument.

Each remote program is identified by a unique 32 bit number, and each procedure within the program is identified by a unique 32 bit number. It also supports versioning, so you can assign a version number as well. One host can run multiple versions of code simultaneously. So a specific remote procedure can be identified by a triple (prog, vers, proc).

The RPC mechanism on the remote machine enforces mutual exclusion to make sure that only one instance of each procedure is running at a time. This is particularly important for RPCs that update databases, because it would be easy to corrupt a database if several procedures were permitted to update the same records simultaneously.

RPCs can use either UDP or TCP. UDP is substantially faster, but it does not assure reliability, and this can present problems. Suppose a procedure is called and no response is received, and so it is called again. If it was the reply rather than the request that was lost, the same procedure is executed twice. The Sun RPC libraries have a simple timeout and retransmission strategy, but it is not reliable in the strict sense. In practice, most RPCs are done on local area networks, and these tend to be highly reliable, and so this is not a serious issue.

A particular remote program is not at a well-known port. Rather, the client first connects with a port mapper, which is at a well-known port (111) on the server, and this tells it which port to use for that program. Each remote program has to register itself with the port mapper when it starts, reporting the port on which it is listening. The Sun RPC mechanism hides this from users, who do not need to worry about the port mapper.

The Sun RPC system provides broad support for writing RPCs, but I will not go into detail. Here is a high level overview of the services that it provides.

XDR library routines convert individual data items from internal form to XDR standard form (similar to ASN.1).
XDR library routines format complex data aggregates such as structs to a standard XDR form.
RPC run time library routines allow a program to call a remote procedure, register a service with the port mapper, or dispatch an incoming call to the correct remote procedure inside a remote program.
A program generator tool (rpcgen) produces the C source files.

Authentication

There are several levels of authentication. The purpose of these is to prevent unauthorized users from accessing remote procedure calls.

The default is none, which means that anyone can access a remote procedure if they know its address. There is also Unix authentication, which checks for Unix style permissions, but as we shall see next week, this can be easily subverted. There are higher levels of authentication as well.

Java has a similar mechanism called Remote Method Invocation (Java RMI) in which a method in a Java virtual machine can be invoked by processes in different virtual machines, possibly on different machines.

Distributed File Systems

The file systems of many modern computer systems are distributed; this means that the files themselves are on file servers, powerful computers with huge disk farms attached. This allows users to sit down at any computer on the network and get access to their files, with the illusion that the files reside on their local machine.

Most distributed file systems use RPCs as described in the previous section. It should be noted that the structure of the file system is independent of whether it is remote or local; a typical user cannot easily tell whether the file system is distributed or local, and usually doesn't care (until the file server crashes). All of the file system calls are identical, but the implementation of the file system calls in the kernel has to use RPCs for a distributed file system. On Unix systems, this is done through the vnode. In the discussion of the Unix file system, I discussed the inode in some detail, and mentioned the vnode, which is a more general implementation of the inode. On a system with a local file system, the vnode is just a pointer to the inode. On a system with a distributed file system, the vnode contains the information about the location of the file as well as a pointer to the inode.

There are two widely used distributed file system protocols, both of which are used on our campus. The Network File System (NFS), developed by Sun Microsystems, was the first widely used distributed file system, and is used by the Computer Science Department system. AFS (Andrew File System) is a more robust distributed file system and is used by RCS.

NFS

NFS was developed by Sun Microsystems, originally for diskless clients, but now a standard. It is a protocol, not a product. This means that anyone can implement it if they wish. Many computer manufacturers use code licensed from Sun, but people are free to write their own implementation of NFS.

A key component of NFS is the concept of a mount. The mount protocol allows the file server to hand out remote access privileges to clients. The file server runs a mount daemon mountd. When a new client is booted, it calls the mount system call, which attaches a specific directory tree to a mount point, which is a node on the client's local file directory tree.

An example will clarify this. Suppose the client has a directory that looks like this:

and the server has a directory that looks like this:

When the client is booted, it makes a call to mount, requesting the server to mount file system D to /C. After this is done, the file system on the client would look like this to a user.

Note that some of the files are local and others are remote, but this is transparent to the user; it looks like a single file system. The user does not need to know where the file actually resides.

One potential source of confusion with NFS is that different clients can mount the same remote file system in different places. With our example, another client could choose to mount D onto B. This means that the same files are available to users on different machines, but in different places.

NFS servers are dumb and NFS clients are smart. It is the clients that do the work required to convert the generalized file access that servers provide into a file access method that is useful to applications and users.

The server is stateless. A stateless protocol means that each call is independent of every other call. A server should not need to maintain any protocol state information about any of its clients in order to function correctly. Stateless servers have a distinct advantage over stateful servers in the event of a failure. With stateless servers, a client need only retry a request until the server responds; it does not even need to know that the server has crashed, or that the network temporarily went down. The client of a stateful server, on the other hand, needs to either detect a server failure and rebuild the server's state when it comes back up, or cause client operations to fail.

When a user calls open, the call has to figure out if the file is local or remote, and if it is remote, the NFS client on that machine has to contact the appropriate server.

Since NFS has to accommodate heterogeneous file systems (e.g., DOS), the client is the only one that interprets full path names. This may mean multiple NFS queries to resolve a single request. For example, the file /a/b/c/d might take four requests to resolve, one for each component. But this means that the server doesn't need to know anything about the client's naming system or directory structure.

Once a client has identified and opened a file, the server gives the client a handle for subsequent reads and writes. This is an opaque data structure that the client uses for future reads and writes to that file.

Note that since the server is stateless, the client has to store file offset info. This means that a call to lseek is completely local.

Performance is an extremely important concern. For this reason, communication between clients and servers uses UDP. Also, when the server is started, it forks off a number of processes so that it does not need to call fork or create a thread for each request.

NFS has function calls to do almost anything that the user might want to do with a file. Here are some examples.

Get file attributes (similar to stat)
Set file attributes
Read a number of bytes from the file
Write bytes to the file
Read a symbolic link
Create a file
Remove a file
Rename a file
Create a link to a file
Create a directory
Remove a directory

AFS

AFS is a distributed filesystem that enables co-operating hosts (clients and servers) to efficiently share filesystem resources across both local area and wide area networks. It is far more robust than NFS.

AFS was developed at Carnegie Mellon University. It was called the Andrew File System, named after both Andrew Carnegie and Andrew Mellon. There are several implementations of AFS. A proprietary version from Transarc Corporation (now owned by IBM) used to be the most widely used, but is losing favor. There is an open source version called OpenAFS.

Recall that with NFS, different clients could mount the same file system in different places. AFS has gone to the opposite extreme; there is one AFS file system for the planet. If you are on an AFS system such as RCS, the root is /afs. This provides access to every (or at least most) systems running AFS. (Hint: Don't type ls -l /afs, because it will need to contact each of the sites in the world to get the information, and this takes a while. You might want to go to lunch while you wait for this. But you should try the command ls /afs.)

AFS files are grouped together in cells. An AFS cell is a collection of servers grouped together administratively and presenting a single, cohesive filesystem. Typically, an AFS cell is a set of hosts that use the same Internet domain name. For example, all of the files in the rpi.edu domain constitute a cell. AFS cells can range from the small (1 server/client) to the massive (with tens of servers and thousands of clients).

AFS client machines run a very efficient Cache Manager process. The Cache Manager maintains information about the identities of the users logged into the machine, finds and requests data on their behalf, and keeps chunks of retrieved files on local disk.

The effect of this is that as soon as a remote file is accessed, a chunk of that file (often the whole file) gets copied to local disk, and so subsequent accesses are almost as fast as to local disk and considerably faster than a read across the network. Local caching also significantly reduces the amount of network traffic.

Unlike NFS, which makes use of /etc/filesystems (on a client) to map (mount) between a local directory name and a remote filesystem, AFS does its mapping (filename to location) at the server. This has the tremendous advantage of making the served file space location independent.

Location independence means that a user does not need to know which file server holds the file; the user only needs to know the pathname of the file.

To understand why such location independence is useful, consider having 20 clients and two servers. Let's say you had to move a filesystem "/home" from server a to server b.

Using NFS, you would have to change the /etc/filesystems file on 20 clients and take "/home" off-line while you moved it between servers.

With AFS, you simply move the AFS volume(s) which constitute "/home" between the servers. You do this "on-line" while users are actively using files in "/home" with no disruption to their work.

With location independence comes scalability. An architectural goal of the AFS designers was client/server ratios of 200:1, which has been successfully exceeded at some sites.

AFS files are stored in structures called Volumes. These volumes reside on the disks of the AFS file server machines. Volumes containing frequently accessed data can be read-only replicated on several servers. For example, if there are many users using the C compiler gcc, there can be several instances of it on different servers. Note that a given user does not know anything about this. He or she just types gcc, and the AFS server finds an instance.

AFS (and thus RCS) does not use the standard Unix permission system, and this is a source of confusion, because the Unix file permission bits are settable and visible, but ignored. On a typical Unix system, permissions are set on a per-file basis, but on AFS, permissions are set on a per-directory basis.

The AFS permission system allows the owner of a directory to set four types of permission for that directory: lookup, insert, delete, and administer. Each file has three types of permissions: read, write, and lock. Unlike normal Unix, these can be set for specific users; you can give Suzy read privileges for files in a directory, for example. You can even give read privileges to everyone except Suzy.

AFS is far more secure than NFS. It uses the Kerberos authentication system, which will be discussed in detail in a later lesson.

Samba

Samba is a free software system that allows Unix and Microsoft Windows machines to share files. Samba is a suite of Unix applications that speak the SMB (Server Message Block) protocol, which is the protocol used by Windows to perform client-server networking. By supporting this protocol, Samba allows Unix servers to communicate with the same networking protocol as Microsoft Windows products. Thus, a Samba-enabled Unix machine can masquerade as a server on a Microsoft network and offer the following services:

Share one or more file systems
Share printers installed on both the server and its clients
Assist clients with Network Neighborhood browsing
Authenticate clients logging onto a Windows domain
Provide or assist with WINS (Windows Internet Naming Services) resolution

The Samba suite revolves around a pair of Unix daemons that provide shared resources -- or shares -- to SMB clients on the network. (Shares are sometimes called services as well.) These daemons are:

smbd A daemon that allows file and printer sharing on an SMB network and provides authentication and authorization for SMB clients.
nmbd A daemon that looks after the Windows Internet Name Service (WINS), and assists with browsing.

Here is a link to more information about Samba. In fact, most of the Samba material in this class was taken from this web site. If all you want to do is pass the test, you can skip this.
