DSA Lab 10

CSCI 2300

Data Structures and Algorithms

Spring 2000

Lab 12
Dynamic Programming

Introduction

This lab will investigate the implementation of a dynamic programming algorithm for solving the longest common subsequence problem.
You must copy main.cpp and make the appropriate additions. There are three places where you need to add code. Essentially, you must add the code to implement the ComputeLCS function.
After the modifications, the main.cpp can be compiled and run. It should read two input files, compute the longest common subsequence of the two input files, print out the two input files and finally print out the longest common subsequence. (You will need to copy input1.txt and input2.txt to the working directory). You should verify that your program produces the same output as the sample output.

The Longest Common Subseqence (LCS) Problem

The LCS problem is defined as follows:
Input: Strings A and B
Output: The longest string C such that C is a subsequence of A and B
Example:
Input
A = actgaccg
B = tatgtcga
Output
C = atgcg
Note: Unlike a substring, a subsequence does not have to be contiguous (i.e. a subsequence can have gaps).
Applications:

Solving variations of the LCS problem is an integral part of bioinformatics. Given two DNA strings, the LCS is a metric for determining similarity (i.e. the DNA evolved from a similar ancestor).

Variations of the LCS algorithm are used in file comparison tools. The same tools used to detect cheating on projects.

Dynamic Programming

Dynamic programming is an algorithmic strategy that attempts to minimize the number of computations by storing intermediate values rather than recomputing the values.
Advantage: Dynamic programming can be applied to a variety of different problems to greatly reduce the computation. For instance, the dynamic programming algorithm for solving the LCS problem for 2 inputs of size n, requires O(n^2) computations, whereas traditional algorithmic approaches require O(2^n).
Trade-off: Dynamic programming requires the storage of intermediate solutions. This storage can be expensive. For instance, the dynamic programming algorithm for solving the LCS problem for 2 inputs of size n, requires storing O(n^2) intermediate values, whereas traditional algorithmic approaches can solve the problem using O(n) total memory

How to implement the LCS Algorithm?

1. First, the main program provides you with the two structures for storing the inputs.

std::vector<char> A; (size = n)

example:

Index 0 1 2 3 4 5 6 7

Character a c t g a c c g

std::vector<char> B; (size = m)

example:

Index 0 1 2 3 4 5 6 7

Character t a t g t c g a

2. Next, the main program provides you with the structure for storing the intermediate values.

std::vector<vector<int> > M;

The matix should have (n+1)*(m+1) entries, where n is the size of the first input and m is the
size of the second input.

The matrix should be initialized with zeros along the first row and down the first column
M[0][i] = 0 (for 0 <= i <= n)
M[j][0] = 0 (for 0 <= j <= m)

However, its convinient to intialize the whole matrix with zeros.

example: Although the input is not actually part of the matrix, it is useful to see how the matrix entries correspond to the input strings. M[1][1] corresponds to comparing A[0] and B[0]. In general, M[j+1][i+1] corresponds to comparing A[i] and B[j].

a c t g a c c g

0 0 0 0 0 0 0 0 0

t 0

a 0

t 0

g 0

t 0

c 0

g 0

a 0

3. Next, you need to calculate the length of the LCS for every position in the matrix.

The computation is quite simple:

if (A[i] == B[j]) { // If the two symbols match
   // The LCS length is increased
   M[j+1][i+1] = M[j][i] + 1;
}
else {
   // Otherwise, the LCS length is equal to the Maximum
   // of the left or right neighbor
   M[j+1][i+1] = max(M[j+1][i], M[j][i+1]);
}

example:

a c t g a c c g

0 0 0 0 0 0 0 0 0

t 0 0 0 1 1 1 1 1 1

a 0 1 1 1 1 2 2 2 2

t 0 1 1 2 2 2 2 2 2

g 0 1 1 2 3 3 3 3 3

t 0 1 1 2 3 3 3 3 3

c 0 1 2 2 3 3 4 4 4

g 0 1 2 2 3 3 4 4 5

a 0 1 2 2 3 4 4 4 5

Notes:

M[j+1][i+1] is the length of the LCS for the substring A[0 ... i] and B[0 ... j].
In order to compute M[j+1][i+1], we must first compute M[j][i], M[j+1][i], and M[j][i+1]. So it is necessary to perform the computations in the correct order.
The length of the final LCS for the input strings will be stored in M[m][n].

4. Next, you need to construct the actual LCS by backtracking over the intermediate solutions.

We know that the length of the LCS is stored in M[m][n], so we start with this cell.

j = m
i = n

To backtrack, we first compare the symbols to see if there is a match at position (j, i).

If (A[i-1] == B[j-1]) then add A[i-1] to the LCS and backtrack
to (j -1, i -1).

Otherwise, we compare M[j-1][i] and M[j][i-1].
If (M[j-1][i] > M[j][i-1]) then we backtrack to (j -1, i).
Otherwise, we backtrack to (j, i - 1).

We continue backtracking until either i or j equals zero.

5. Finally, since we computed the LCS by backtracking, we must reverse the LCS.

Questions:

Why is it necessary to store all the intermediate steps?
If you only needed to compute the size of the LCS, would you need to backtrack over the intermediate steps?
How much memory would be necessary if you only needed to compute the size of the LCS?

		a	c	t	g	a	c	c	g
	0	0	0	0	0	0	0	0	0
t	0	0	0	1	1	1	1	1	1
a	0	1	1	1	1	2	2	2	2
t	0	1	1	2	2	2	2	2	2
g	0	1	1	2	3	3	3	3	3
t	0	1	1	2	3	3	3	3	3
c	0	1	2	2	3	3	4	4	4
g	0	1	2	2	3	3	4	4	5
a	0	1	2	2	3	4	4	4	5

		a	c	t	g	a	c	c	g
	0	0	0	0	0	0	0	0	0
t	0	0	0	1	1	1	1	1	1
a	0	1	1	1	1	2	2	2	2
t	0	1	1	2	2	2	2	2	2
g	0	1	1	2	3	3	3	3	3
t	0	1	1	2	3	3	3	3	3
c	0	1	2	2	3	3	4	4	4
g	0	1	2	2	3	3	4	4	5
a	0	1	2	2	3	4	4	4	5

		a	c	t	g	a	c	c	g
	0	0	0	0	0	0	0	0	0
t	0	0	0	1	1	1	1	1	1
a	0	1	1	1	1	2	2	2	2
t	0	1	1	2	2	2	2	2	2
g	0	1	1	2	3	3	3	3	3
t	0	1	1	2	3	3	3	3	3
c	0	1	2	2	3	3	4	4	4
g	0	1	2	2	3	3	4	4	5
a	0	1	2	2	3	4	4	4	5