BLAST: A real-world example of algorithmics

"In bioinformatics, BLAST (basic local alignment search tool)[2] Links to an external site. is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences." 

BLAST (1990) is an improvement upon FASTA (1985), which is a heuristic solution of the Smith-Waterman algorithm for local sequence alignment (1981), which is a variation of the Needleman-Wunsch algorithm for global sequence alignment (1970). 

All of these are examples of dynamic programming algorithms.

Needleman-Wunsch

In Needleman-Wunsch, the basic idea for aligning two sequences A and B, of length m and n, is to first populate an mxn matrix F where entry Fij is the incremental "score" of aligning character Ai with character Bj, starting at the top left. The score is given by a similarity matrix S and a gap penalty p. Then, the optimal alignment is found by following the highest score from the bottom right back to the start.

Needleman-Wunsch pseudocode for calculating F:

for i in length(A):
F(i,0) = p * i
for j in length(B):
F(0,j) = p * j
for i in length(A):
for j in length(B):
match = F(i-1, j-1) + S(A(i), B(j))
delete = F(i-1, j) + p
insert = F(i, j-1) + p
F(i,j) = max(match, delete, insert)

Needleman-Wunsch pseudocode for calculating the alignment:

i = length(A)
j = length(B)
while ( i>0 and j>0 ):
 
if (i>0 and j>0 and F(i, j) == F(i−1, j−1) + S(A(i), B(j)))
// match! Or alignA = A(i) + alignA alignB = B(j) + alignB i = i − 1 j = j − 1 else if (i > 0 and F(i, j) == F(i−1, j) + d)
// insertion in A or deletion in B        
alignA ← A(i) + alignA
       alignB ← "−" + alignB
       i = i − 1 else
// deletion in A or insertion in B
alignA ← "−" + alignA alignB ← B(j) + alignB j = j − 1 } }

Exercise: what is the time complexity of both of the Needleman-Wunsch steps?

Smith-Waterman

The basic steps of this algorithm are the same as in Needleman-Wunsch, with the main difference being how negative scores (and thus unaligned regions) are handled and where the alignment construction step starts. In Smith-Waterman, alignment starts at the highest score(s), which allows the discovery of local alignments.

Reflection: Given that Smith-Waterman and Needleman-Wunsch are optimal algorithms, how can performance be improved?

FASTA

This algorithm uses a heuristic to quickly find the most interesting regions in the sequences and then do local Smith-Waterman alignment in those regions. Interesting regions are found by searching for perfect matches of short words, then selecting regions with many such matches.

Suppose you're aligning a query sequence A of length n against a database B of length m. The algorithm has 5 main steps:

  1. Find hot spots — For a chosen k, enumerate all possible k-tuples of the sequence (e.g. the word ABCDE has 3-tuples ABC, BCD, and CDE) and input their locations in A into a hash table. Then look up all k-tuples of the database in the hash table, marking their location in a dot matrix so it contains all points of local identity of length k
  2. Identify the best diagonals — Score diagonals of the dot matrix according to the length of diagonal runs (corresponding to gapless local alignments) and select the ten best diagonal runs.
  3. Identify best ungapped regions — Within each selected diagonal, extend the matching regions and score with a scoring matrix. Again, keep the best.
  4. Join the matched regions — Score every combination of diagonals based on individual score plus a joining penalty. 
  5. Create the alignment — Define a diagonal band around the best diagonal and perform a Smith-Waterman within that band to produce an optimal local alignment.

BLAST

A full description of BLAST was beyond the means and time of this teacher. There are so many variations, it turned out to be a bigger job than expected. However, the general process in all BLAST is the same as in FASTA.

The first main difference to FASTA is that there is a preprocessing step that removes low-entropy regions (e.g. sections with just "AAAAAAAAA"). 

The biggest difference is in finding the hotspots (called "seeds" in BLAST). Instead of simply mapping all database k-tuples to the query k-tuples and marking all perfect matches, BLAST first scores the query k-tuples against all possible k-tuples, storing hits above a threshold score in a search tree. This is called the neighbourhood of a match. This makes looking up the database k-tuples more well-targeted, resulting in more meaningful hits.