CSCI.4220 Network Programming
Fall, 2006
The Internet Transport Layer, TCP and UDP

There are two widely used transport layer protocols, TCP and UDP.

User Datagram Protocol (UDP)

UDP transmits segments consisting of an 8-byte header followed by the payload. Here are the contents of the header. All of the fields should be self-explanatory; the only thing worth adding is that the UDP Length field counts the header as well as the payload.

Field              Size
Source Port        2 bytes
Destination Port   2 bytes
UDP Length         2 bytes
UDP Checksum       2 bytes
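As a rough sketch of how these four fields sit on the wire, the header can be packed and unpacked with Python's struct module. The function names are made up for this example, and the checksum is simply left at zero (which in UDP over IPv4 means "not computed") rather than calculated.

```python
import struct

# Four unsigned 16-bit fields in network (big-endian) byte order.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_header(src_port, dst_port, payload, checksum=0):
    """Return the 8-byte UDP header for the given payload.

    The length field counts the header plus the payload. Computing a
    real checksum is omitted in this sketch.
    """
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)

def parse_udp_header(segment):
    """Split a UDP segment into (src_port, dst_port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(segment)
    return src_port, dst_port, length, checksum, segment[UDP_HEADER.size:length]

header = build_udp_header(5000, 53, b"hello")
parsed = parse_udp_header(header + b"hello")
print(len(header), parsed)
```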

Transmission Control Protocol (TCP)

TCP is connection oriented and reliable. Before any data is transferred, the client and server must establish a connection, so that each knows that it can send data to and receive data from the other and has allocated appropriate buffer space and other system resources. How can two entities communicate reliably over an unreliable network?

The transport layer receives a message from the application layer. If it is a long message, TCP breaks it down into segments to pass on to the network layer, which may in turn break it down into smaller packets. The other end reassembles the segments into a continuous byte stream and passes this on up to the application.

Services provided by TCP

The connection between the application layer and the transport layer is the socket. The system calls that send and receive data are write and read (or send and recv). Note that message boundaries are not preserved end-to-end. There is no way that the receiver can determine the chunk size that the sender used to send the data. The data is seen as a continuous stream of bytes.
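Because the receiver sees only a byte stream, an application that needs message boundaries has to add them itself. One common approach, sketched here under the assumption of a 4-byte length prefix (the helper names are invented for this example), is to send the length of each message before the message:

```python
import socket
import struct

def send_message(sock, payload):
    """Send one message preceded by a 4-byte big-endian length."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock, count):
    """Read exactly count bytes, looping because recv may return fewer."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)

def recv_message(sock):
    """Read the length prefix, then the message it describes."""
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)

# socketpair() gives two connected stream sockets in one process.
a, b = socket.socketpair()
send_message(a, b"first")
send_message(a, b"second message")
received = [recv_message(b), recv_message(b)]
print(received)
a.close()
b.close()
```

Without the length prefix, the two writes could arrive at the reader as one chunk, two chunks, or any other split.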

The TCP Header

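TCP prepends a header of at least 20 bytes to its payload. This header has the following fields:

Field                                             Size
Source Port                                       2 bytes
Destination Port                                  2 bytes
Sequence Number                                   4 bytes
Acknowledgment Number                             4 bytes
Header Length and Flags (URG ACK PSH RST SYN FIN) 2 bytes
Window Size                                       2 bytes
Checksum                                          2 bytes
Urgent Data Pointer                               2 bytes
Options                                           variable, 0 to 40 bytes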
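As a rough sketch, the fixed 20-byte part of the header can be unpacked like this. The layout follows the standard TCP header; the dictionary keys and function name are just illustrative, and options are not decoded.

```python
import struct

# Ports, sequence number, acknowledgment number, header length + flags,
# window size, checksum, urgent pointer: 20 bytes in network byte order.
TCP_FIXED_HEADER = struct.Struct("!HHIIHHHH")

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(segment):
    """Return the fixed header fields of a TCP segment as a dict."""
    (src_port, dst_port, seq, ack, offset_and_flags,
     window, checksum, urgent_ptr) = TCP_FIXED_HEADER.unpack_from(segment)
    # The top 4 bits give the header length in 32-bit words.
    header_len = (offset_and_flags >> 12) * 4
    flags = sorted(name for name, bit in FLAG_BITS.items()
                   if offset_and_flags & bit)
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "window": window, "checksum": checksum,
        "urgent_ptr": urgent_ptr,
        "payload": segment[header_len:],
    }

# A hand-built SYN segment with no options and no payload.
syn = TCP_FIXED_HEADER.pack(40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
fields = parse_tcp_header(syn)
print(fields["flags"], fields["header_len"], fields["window"])
```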

Establishing a connection

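Before any data is transmitted, the client and server establish a connection with a three way handshake, so called because three segments are transmitted. The client sends a segment with the SYN flag set, the server replies with a segment that has both the SYN and ACK flags set, and the client acknowledges the server's SYN with a segment that has the ACK flag set.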
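The sequence and acknowledgment numbers carried by the three segments can be sketched as follows. The function is a toy model, not a real TCP implementation: it takes the two initial sequence numbers (ISNs) as arguments, and the tuple layout is an assumption of this example.

```python
SEQ_MODULUS = 2 ** 32   # sequence numbers are 32 bits and wrap around

def three_way_handshake(client_isn, server_isn):
    """Return the three segments as (sender, flags, seq, ack) tuples.

    A SYN consumes one sequence number, so each side acknowledges the
    other's initial sequence number plus one.
    """
    return [
        ("client", "SYN", client_isn, None),
        ("server", "SYN+ACK", server_isn, (client_isn + 1) % SEQ_MODULUS),
        ("client", "ACK", (client_isn + 1) % SEQ_MODULUS,
         (server_isn + 1) % SEQ_MODULUS),
    ]

for segment in three_way_handshake(100, 300):
    print(segment)
```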

While establishing a connection, the sender and receiver may agree on the Maximum Segment Size (MSS). This is done in the option field. A small system such as a PDA or a cell phone may have very limited buffer space, and so would advertise a small MSS to the other end.

A well-known denial of service attack is called a SYN attack (or SYN flood), because the malicious sender sends numerous SYN segments to a server. For each one, the server has to allocate buffers and otherwise set up a new connection. The attacker never follows up with the final acknowledgment, so the server's resources are tied up in half-open connections and legitimate users cannot get through to the server.

Connection Termination

Terminating a connection uses a four way handshake. Each end of the connection sends a segment with the FIN flag set, and the other end acknowledges this. Note that it is possible for one end of a connection to send a FIN, meaning "I am not going to send any more data to you", but the other end continues to send data.
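The half-closed state can be seen from the sockets API with shutdown. This sketch uses socket.socketpair() so that it runs in a single process; that pair is a local stream socket rather than a real TCP connection on most systems, but on a TCP socket the same shutdown(SHUT_WR) call is what causes the FIN to be sent.

```python
import socket

a, b = socket.socketpair()

a.sendall(b"request")
a.shutdown(socket.SHUT_WR)        # a will not send any more data

request = b.recv(1024)
end_of_stream = b.recv(1024)      # empty bytes: a has finished sending

b.sendall(b"reply after your FIN")  # b is still free to send
reply = a.recv(1024)

print(request, end_of_stream, reply)
a.close()
b.close()
```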

Ensuring Reliable Transport

Once the connection has been established with the three way handshake, both sides can exchange data. TCP ensures reliable delivery. Conceptually this is easy. The sender transmits a segment, setting the sequence number in the header to the offset of the first byte. It also sets a timer. When the receiver receives the segment, it transmits an acknowledgment. The Acknowledgment field of the TCP header is set to the sequence number of the last byte received plus one (i.e., the next byte that it is expecting). If the sender does not receive an acknowledgment by the time that the timer goes off, it sends the segment again.

If the sender is transmitting a long message, it is inefficient to send a single segment, wait for an acknowledgment and then send the next one. To overcome this, TCP uses a Sliding Window Protocol. The window size is the number of unacknowledged segments that the sender is allowed to have outstanding. Here is an example.

Suppose the window size is three segments. The sender transmits segments n, n+1 and n+2, then stops. When it receives an acknowledgment for a segment, it moves the window along. When segment n is acknowledged, the sender transmits segment n+3. When segment n+1 is acknowledged, the sender transmits segment n+4, etc.
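The example can be traced with a few lines of code. This is a toy model that assumes no loss and exactly one acknowledgment per segment, arriving in order; the function name is made up for this sketch.

```python
def sliding_window_trace(total_segments, window_size):
    """Return the sequence of send and ack events for a loss-free transfer."""
    events = []
    next_to_send = 0
    base = 0                      # oldest unacknowledged segment
    while base < total_segments:
        # Send as long as the window has room.
        while (next_to_send < total_segments
               and next_to_send < base + window_size):
            events.append(("send", next_to_send))
            next_to_send += 1
        # The oldest outstanding segment is acknowledged; slide the window.
        events.append(("ack", base))
        base += 1
    return events

trace = sliding_window_trace(5, 3)
print(trace)
```

With five segments and a window of three, segments 0, 1 and 2 go out back to back, and each later segment is released by one acknowledgment.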

The receiver has to acknowledge segments in order. If it receives segments n and n+2 but not segment n+1, it will only acknowledge segment n. Note that it is technically not acknowledging segments, but actual bytes; it sends the sequence number of the last byte received plus one in the acknowledgment field.

The receiver does not need to acknowledge each segment individually; if it receives segments n, n+1, and n+2, it just has to acknowledge segment n+2, and the receipt of the other two is implicit.
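A small sketch of how a cumulative acknowledgment number follows from the segments received so far. Segments are represented as hypothetical (sequence number, length) pairs; the function returns the number of the next byte the receiver expects.

```python
def cumulative_ack(first_expected, received):
    """Return the acknowledgment number after the given segments arrive.

    received is an iterable of (sequence_number, length) pairs in any
    order. The acknowledgment advances only across contiguous data.
    """
    next_expected = first_expected
    for seq, length in sorted(received):
        if seq > next_expected:
            break                 # gap: a segment is missing here
        next_expected = max(next_expected, seq + length)
    return next_expected

# Three 1000-byte segments starting at byte 0.
all_three = cumulative_ack(0, [(0, 1000), (1000, 1000), (2000, 1000)])
missing_middle = cumulative_ack(0, [(0, 1000), (2000, 1000)])
print(all_three, missing_middle)
```

When all three segments arrive, one acknowledgment covers them all; when the middle one is missing, the acknowledgment stays at the end of the first segment.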

You might ask what a receiver does when it receives an out-of-order segment. It could either store it, on the assumption that the missing segment will soon arrive, or it could simply drop it, on the assumption that it will be retransmitted. The original TCP specification does not strictly require either behavior. A simple implementation can drop the out-of-order segment, because it is more work to store it and retrieve it appropriately and because, with plain cumulative acknowledgments, the receiver has no way of reporting it and the sender is likely to resend it anyway. Most current implementations instead buffer out-of-order segments that fall within the window, and the selective acknowledgment (SACK) option lets the receiver tell the sender which out-of-order segments have arrived.

If both ends of a connection are transmitting simultaneously, TCP allows piggy-backing. This means that one segment can have both data and an acknowledgment. This also increases efficiency.

Preventing a sender from overloading a receiver

Recall that the TCP header has a field called Window size. This is the amount of remaining buffer space that the receiver has. It is possible that the sender is sending data faster than the receiving application is executing calls to read (or recv). In this case, the receiver's buffers fill up with data waiting to be read by the application. The sender is not allowed to send more data than the advertised window size of the receiver. If the receiver says that it only has 512 bytes of buffer space free, the sender cannot send more than 512 bytes of data. It is even possible for all of the receiver's buffer space to be full, in which case it advertises a window size of zero, and the sender would not be allowed to transmit more segments.
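The relationship between the receive buffer and the advertised window can be modeled in a few lines. The class below is a toy with invented names; it only counts bytes and ignores sequence numbers.

```python
class ReceiveBuffer:
    """Toy model of a receiver's buffer and the window it advertises."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffered = 0         # bytes received but not yet read

    def advertised_window(self):
        return self.capacity - self.buffered

    def segment_arrives(self, nbytes):
        if nbytes > self.advertised_window():
            raise ValueError("sender exceeded the advertised window")
        self.buffered += nbytes

    def application_reads(self, nbytes):
        nbytes = min(nbytes, self.buffered)
        self.buffered -= nbytes
        return nbytes

buf = ReceiveBuffer(capacity=2048)
buf.segment_arrives(1536)
window_after_first = buf.advertised_window()    # 512 bytes left
buf.segment_arrives(512)
window_when_full = buf.advertised_window()      # 0: sender must stop
buf.application_reads(1024)
window_after_read = buf.advertised_window()     # space opens up again
print(window_after_first, window_when_full, window_after_read)
```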

The advertised window size of the receiver limits the size of the sender's sliding window. Note that I have described the window size in terms of segments, but the window size is actually defined in terms of bytes.

The window size field in the TCP header is 16 bits, so a receiver cannot advertise a window larger than 65,535 bytes (64K). On a path with high bandwidth and a long round trip time, sometimes called a long fat pipe, this may not be enough to keep the link busy, and so the sender and receiver can negotiate a window scaling option when the connection is set up to accommodate a larger window size.
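To get a feel for the numbers, the sketch below computes how many bytes must be in flight to fill a path and the smallest scale factor that covers it. It assumes the usual form of the window scaling option, in which the 16-bit field is shifted left by a negotiated count of at most 14; the function names and the example link speed are made up.

```python
MAX_UNSCALED_WINDOW = 65535       # largest value of a 16-bit field

def bandwidth_delay_product(bits_per_second, rtt_ms):
    """Bytes that must be unacknowledged to keep the path full."""
    return bits_per_second * rtt_ms // 8000

def window_scale_shift(window_bytes, max_shift=14):
    """Smallest shift s such that 65535 << s covers window_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < window_bytes and shift < max_shift:
        shift += 1
    return shift

bdp = bandwidth_delay_product(100_000_000, 100)   # 100 Mbit/s, 100 ms RTT
shift = window_scale_shift(bdp)
print(bdp, shift, MAX_UNSCALED_WINDOW << shift)
```

For this example path, about 1.25 million bytes have to be in flight, far more than an unscaled window allows.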

Setting the Timer

When the sender transmits a segment, it sets a timer, and if it does not receive an acknowledgment by the time that the timer goes off, it retransmits the segment. Clearly, the timer setting is important. If it is set too short, then many duplicate packets will be sent unnecessarily. If it is set too long, transmission will be slowed down because the sender is waiting too long between retransmissions.

Note that it is possible for a segment to be successfully received but for the acknowledgment to be lost somewhere in the Internet. In this case the sender retransmits and so the receiver receives a duplicate. It can detect this by looking at the sequence number, and in such a case it discards the duplicate data and acknowledges it again.

TCP sets the timer dynamically based on the average round trip time (RTT). It can calculate the RTT for a particular segment by looking at the transmission time and the time of the receipt of the acknowledgment.

TCP computes the average round trip time by using a decay function.

ARTT_n = α · ARTT_{n-1} + (1 - α) · RTT_n

When the acknowledgment of packet n is received, its RTT is calculated. TCP then calculates a new Average Round Trip Time (ARTT) by multiplying the old ARTT times a parameter alpha (0 < α < 1) and adding (1 - α) times the RTT of the latest segment.

A good value of alpha is 7/8.

The actual timer is set to twice the ARTT.

Note that this allows the ARTT, and thus the timer setting, to vary dynamically based on conditions on the Internet and the receiver. If RTT values suddenly slow down because of congestion (or speed up because the congestion has cleared up), ARTT will adjust accordingly.
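A short sketch of the decay function in action, using alpha = 7/8 and a timer of twice the ARTT as described above. The RTT samples (in milliseconds) are invented to show how the average follows a sudden slowdown.

```python
def update_artt(artt, rtt_sample, alpha=7 / 8):
    """Blend the newest RTT sample into the running average."""
    return alpha * artt + (1 - alpha) * rtt_sample

artt = 100.0
history = []
for sample in [100.0, 100.0, 180.0, 180.0, 180.0]:
    artt = update_artt(artt, sample)
    history.append((artt, 2 * artt))   # (ARTT, timer setting)

for average, timer in history:
    print(average, timer)
```

The average moves only one eighth of the way toward each new sample, so a single unusually slow segment does not change the timer much, while a sustained change gradually pulls it up.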

Newer implementations of TCP also take into account the variance in RTT. If all of the RTTs are about the same, TCP can set the timer to a lower value than if RTT values are highly variable, even when the average RTT is the same.

This is done by keeping track of an additional variable D, the deviation in RTT values. Whenever an acknowledgment arrives, the absolute difference between the expected RTT (the current ARTT) and the observed RTT is noted.

The value of the average D is calculated with a decay function similar to the above function, and the timer is set according to this formula:

Timer = ARTT + 4 * D
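A sketch of this calculation follows. The weight beta = 3/4 used for the deviation is a conventional choice and an assumption of this example, not something stated above; the function name is also made up.

```python
def update_estimates(artt, dev, rtt_sample, alpha=7 / 8, beta=3 / 4):
    """Return the updated (ARTT, D, timer) after one RTT sample.

    The deviation is updated from the difference between the expected
    RTT (the old ARTT) and the observed sample.
    """
    error = abs(rtt_sample - artt)
    dev = beta * dev + (1 - beta) * error
    artt = alpha * artt + (1 - alpha) * rtt_sample
    return artt, dev, artt + 4 * dev

steady = update_estimates(100.0, 0.0, 100.0)
jittery = update_estimates(100.0, 0.0, 140.0)
print(steady)
print(jittery)
```

When the sample matches the average, D stays at zero and the timer equals the ARTT; a sample that is off by 40 ms raises D and pushes the timer well above the new average.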

Note that if no acknowledgment is received for a segment and so it is resent, and later an acknowledgment is received, there is no way to know which of the two segments is being acknowledged. Thus, most implementations of TCP will not include this data in the RTT calculations. This is called Karn's Algorithm.

Congestion Control

Congestion is an occasional problem on the Internet, and intermediate routers do not generally deal with it. As a result, most of the heavy lifting in trying to control congestion falls on TCP. Each side has a variable CongWin, the congestion window, which limits the rate at which a TCP sender can send traffic into the network. Specifically, the amount of unacknowledged data cannot exceed the minimum of CongWin and the receiver's advertised window.

The best available indication of congestion on the Internet is packet loss. The sender reduces its send rate when loss events occur: it uses a multiplicative decrease, halving CongWin after each loss event. When no loss occurs, it increases CongWin additively, by about one MSS per round trip time (that is, by a fraction of an MSS for each acknowledgment received).

This is called an additive-increase, multiplicative-decrease (AIMD) algorithm.

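The value of CongWin is typically initialized to one MSS. During the initial period, known as slow start, the sender does not use the additive increase; instead it increases CongWin by one MSS for each acknowledgment received, so its sending rate grows exponentially, doubling every round trip time, until the first loss event occurs or a threshold is reached.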
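The shape of the resulting window can be sketched with a simplified model that advances one round trip at a time, with CongWin measured in units of MSS. The model assumes exponential growth up to a threshold, growth of one MSS per round trip after that, and halving on loss; the threshold variable and function names are assumptions of this example, and real implementations differ in the details.

```python
def next_cong_win(cong_win, threshold, loss):
    """Advance CongWin by one round trip; return (cong_win, threshold)."""
    if loss:
        cong_win = max(cong_win // 2, 1)        # multiplicative decrease
        return cong_win, cong_win
    if cong_win < threshold:
        return min(cong_win * 2, threshold), threshold   # exponential phase
    return cong_win + 1, threshold               # additive increase

def cong_win_trace(rounds, loss_rounds, threshold=8):
    """Return CongWin after each round, starting from one MSS."""
    cong_win = 1
    history = [cong_win]
    for r in range(rounds):
        cong_win, threshold = next_cong_win(cong_win, threshold,
                                            r in loss_rounds)
        history.append(cong_win)
    return history

history = cong_win_trace(8, {5})
print(history)
```

Plotted over time, the window has the familiar sawtooth shape: a fast initial ramp, a slow linear climb, and a sharp drop at each loss.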

Traditionally when routers became overloaded, they simply dropped any packets for which they did not have space in their buffers. This is called tail drop because the last packets in were dropped. This had the effect of dropping segments for many TCP connections simultaneously, causing them all to drop their Congestion Window size at the same time, a process known as global synchronization.

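Many modern routers now use an algorithm called random early detection (RED) to drop packets. Rather than waiting for the queue to fill up and then dropping all incoming packets at once, a router has two threshold values, Tmin and Tmax. If the queue length is below Tmin, an arriving packet is always queued; if it is above Tmax, the packet is always dropped. If the queue length lies between the two and a new packet arrives, the router drops it with probability p, where p increases as the queue grows. Because different connections lose packets at different times, they do not all back off at once, and this has proved to be more efficient than tail drop.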
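A minimal sketch of the drop decision. It assumes that p rises linearly from zero at Tmin to a configurable maximum (max_p, a made-up parameter name) at Tmax, and it uses the instantaneous queue length, whereas real routers typically work from a smoothed average.

```python
import random

def red_drop_probability(queue_len, t_min, t_max, max_p=0.1):
    """Probability of dropping a packet that arrives at this queue length."""
    if queue_len < t_min:
        return 0.0                # short queue: never drop
    if queue_len >= t_max:
        return 1.0                # long queue: always drop
    return max_p * (queue_len - t_min) / (t_max - t_min)

def should_drop(queue_len, t_min, t_max, rng=random.random):
    """Make the random drop decision for one arriving packet."""
    return rng() < red_drop_probability(queue_len, t_min, t_max)

for length in (5, 20, 29, 30):
    print(length, red_drop_probability(length, t_min=10, t_max=30))
```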

The Nagle Algorithm

In interactive applications such as telnet or ssh, each keystroke may be sent as a separate segment carrying a single byte of data, and each one is acknowledged by a separate segment. This is obviously extremely inefficient, since every byte of data is accompanied by at least 40 bytes of TCP and IP headers. As a result, TCP uses an algorithm called Nagle's Algorithm to address this problem: when data comes into the sender one byte at a time from the application, send the first byte and buffer the rest until the first byte is acknowledged, then send all of the buffered data in a single segment.

For example, suppose the user is typing hello world. The user hits the h key. Echoing is actually done by the remote server, so the single character is sent down the protocol stack to TCP. Since it is the first char, it is immediately sent to the remote server. However, suppose the user is able to type ello before the h is acknowledged. These four characters are buffered and sent as a single segment once the first acknowledgment is received.
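The example can be reproduced with a small simulation. This is a simplified model of the rule as described here: it ignores the case where enough data accumulates to fill a whole segment, and the event format is invented for the sketch.

```python
def nagle_segments(events):
    """Return the data segments sent for a sequence of events.

    events is a list of ("type", char) and ("ack",) tuples in time
    order. Data is sent at once only if nothing is unacknowledged;
    otherwise it is buffered until the next acknowledgment.
    """
    segments = []
    buffered = ""
    waiting_for_ack = False
    for event in events:
        if event[0] == "type":
            if waiting_for_ack:
                buffered += event[1]
            else:
                segments.append(event[1])
                waiting_for_ack = True
        else:                     # an acknowledgment arrived
            waiting_for_ack = False
            if buffered:
                segments.append(buffered)
                buffered = ""
                waiting_for_ack = True
    return segments

events = [("type", c) for c in "hello"] + [("ack",)]
segments = nagle_segments(events)
print(segments)
```

Five keystrokes produce two segments instead of five.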

However, there are times when you want to disable Nagle's Algorithm. For example, in the X windowing system, mouse moves have to be sent to a remote host, and if these are bundled and sent all at once, the pointer moves in a jerky and annoying fashion.
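With the sockets API, Nagle's Algorithm is turned off per socket by setting the TCP_NODELAY option. A minimal sketch, on a socket that has not yet been connected:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A nonzero value disables Nagle's Algorithm on this socket, so small
# writes are sent immediately instead of being buffered.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```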

A related problem is the silly window syndrome. Suppose data are passed to the sending TCP in big chunks, but the receiving application reads one byte at a time. The receiver sends a window update to the sender saying that it now has a 1 byte window. The sender sends one byte, the receiver acknowledges it, and so on. This is very inefficient. As a result, TCP implementations do not advertise a small window size; the receiver waits until the available buffer space is at least the maximum segment size or half the total buffer size, whichever is smaller.
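The receiver's rule can be sketched as a single function. It assumes the threshold is the smaller of one MSS and half the total buffer, and that the receiver advertises zero until that much space is free; the function and parameter names are made up.

```python
def window_to_advertise(free_space, buffer_size, mss):
    """Return the window the receiver should advertise, in bytes.

    Small amounts of free space are hidden from the sender by
    advertising a window of zero.
    """
    threshold = min(mss, buffer_size // 2)
    return free_space if free_space >= threshold else 0

print(window_to_advertise(1, 8192, 1460))      # one byte free
print(window_to_advertise(1460, 8192, 1460))   # a full segment fits
print(window_to_advertise(600, 1000, 1460))    # small buffer, half is free
```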

Out of Band Data

In an application such as telnet, it may be important for the sender to transmit urgent data, also known as out of band data. An example would be a keyboard sequence that interrupts or aborts the program at the other end. To do this, the sender sets the URG flag. This tells the receiving TCP to notify the application that urgent data is present, so that it can be handled ahead of any data still waiting in the buffer. The urgent data pointer in the header indicates where the urgent data ends.

The KeepAlive Timer

TCP connections can stay open indefinitely, even when no data is being exchanged. Suppose a client establishes a connection with a server and they exchange some data, but then the client crashes without sending a FIN segment. The server has no way of knowing that this happened and so would keep the connection open indefinitely. To prevent this, TCP has a KeepAlive Timer. If one end of a connection has not heard anything from the other end for a period of time (two hours is typical), it sends a keepalive message. If the other end replies, both sides keep the connection open. If there is no reply after several attempts, the end that sent the keepalive shuts down the connection.
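Keepalive is off by default on most systems and is enabled per socket with the SO_KEEPALIVE option. The sketch below also sets the idle time to two hours where the platform exposes TCP_KEEPIDLE, which is available on Linux but not everywhere, hence the guard.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask TCP to probe the other end when the connection has been idle.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

# Optional, platform-specific: seconds of idle time before the first probe.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 7200)

print(keepalive_on)
sock.close()
```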

Required Reading

Here is a link to a web site that covers a lot of the same material that I covered in class.

And here is another one.