CSCI 2300 Data Structures and Algorithms Spring 2000

Lab 12
Dynamic Programming



Introduction
This lab will investigate the implementation of a dynamic programming algorithm for solving the longest common subsequence problem.

You must copy main.cpp and make the appropriate additions.  There are three places where you need to add code.  Essentially, you must add the code to implement the ComputeLCS function.

After the modifications, the main.cpp can be compiled and run.  It should read two input files, compute the longest common subsequence of the two input files, print out the two input files and finally print out the longest common subsequence. (You will need to copy input1.txt and input2.txt to the working directory). You should verify that your program produces the same output as the sample output.


The Longest Common Subseqence (LCS) Problem

The LCS problem is defined as follows:

Input: Strings A and B

Output: The longest string C such that C is a subsequence of A and B

Example:

Input
A = actgaccg
B = tatgtcga

Output
C = atgcg

Note: Unlike a substring, a subsequence does not have to be contiguous (i.e. a subsequence can have gaps).

Applications:

Dynamic Programming
Dynamic programming is an algorithmic strategy that attempts to minimize the number of computations by storing intermediate values rather than recomputing the values.

Advantage: Dynamic programming can be applied to a variety of different problems to greatly reduce the computation.  For instance, the dynamic programming algorithm for solving the LCS problem for 2 inputs of size n, requires O(n^2) computations, whereas traditional algorithmic approaches require O(2^n).

Trade-off: Dynamic programming requires the storage of intermediate solutions.  This storage can be expensive.  For instance, the dynamic programming algorithm for solving the LCS problem for 2 inputs of size n, requires storing O(n^2) intermediate values, whereas traditional algorithmic approaches can solve the problem using O(n) total memory



How to implement the LCS Algorithm?

1. First, the main program provides you with the two structures for storing the inputs.

std::vector<char> A; (size = n)

example:
Index
0
1
2
3
4
5
6
7
Character
a
c
t
g
a
c
c
g

std::vector<char> B; (size = m)

example:
Index
0
1
2
3
4
5
6
7
Character
t
a
t
g
t
c
g
a

2. Next, the main program provides you with the structure for storing the intermediate values.

std::vector<vector<int> > M;

The matix should have (n+1)*(m+1) entries, where n is the size of the first input and m is the
size of the second input.

The matrix should be initialized with zeros along the first row and down the first column
M[0][i] = 0 (for 0 <= i <= n)
M[j][0] = 0 (for 0 <= j <= m)

However, its convinient to intialize the whole matrix with zeros.

example: Although the input is not actually part of the matrix, it is useful to see how the matrix entries correspond to the input strings.  M[1][1] corresponds to comparing A[0] and B[0].  In general, M[j+1][i+1] corresponds to comparing A[i] and B[j].
   
a
c
t
g
a
c
c
g
 
0
0
0
0
0
0
0
0
0
t
0
 
 
 
 
 
 
 
 
a
0
 
 
 
 
       
t
0
 
 
   
 
 
 
 
g
0
 
 
 
 
 
 
 
 
t
0
 
 
 
 
 
 
 
 
c
0
 
 
 
 
 
 
 
 
g
0
 
 
 
 
 
 
 
 
a
0
 
 
 
 
 
 
 
 

3.  Next, you need to calculate the length of the LCS for every position in the matrix.

The computation is quite simple:

if (A[i] == B[j]) {  // If the two symbols match
   // The LCS length is increased
   M[j+1][i+1] = M[j][i] + 1;
}
else {
   // Otherwise, the LCS length is equal to the Maximum
   // of the left or right neighbor
   M[j+1][i+1] = max(M[j+1][i], M[j][i+1]);
}

example:
   
a
c
t
g
a
c
c
g
 
0
0
0
0
0
0
0
0
0
t
0
0
0
1
1
1
1
1
a
0
1
1
1
1
2
2
2
2
t
0
1
1
2
2
2
2
2
2
g
0
1
1
2
3
3
3
3
3
t
0
1
1
2
3
3
3
3
3
c
0
1
2
2
3
3
4
4
4
g
0
1
2
2
3
3
4
4
5
a
0
1
2
2
3
4
4
4
5

Notes:

4.  Next, you need to construct the actual LCS by backtracking over the intermediate solutions.

We know that the length of the LCS is stored in M[m][n], so we start with this cell.

j = m
i = n

To backtrack, we first compare the symbols to see if there is a match at position (j, i).

If (A[i-1] == B[j-1]) then add A[i-1]  to the LCS and backtrack
to (j -1, i -1).

Otherwise, we compare M[j-1][i] and M[j][i-1].
If (M[j-1][i] > M[j][i-1]) then we backtrack to (j -1, i).
Otherwise, we backtrack to (j, i - 1).

We continue backtracking until either i or j equals zero.
 

5.  Finally, since we computed the LCS by backtracking, we must reverse the LCS.



Questions:
  1. Why is it necessary to store all the intermediate steps?
  2. If you only needed to compute the size of the LCS, would you need to backtrack over the intermediate steps?
  3. How much memory would be necessary if you only needed to compute the size of the LCS?