"In bioinformatics, BLAST (basic local alignment search tool)^[2] is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences."

BLAST (1990) is an improvement upon FASTA (1985), which is a heuristic solution of the Smith-Waterman algorithm for local sequence alignment (1981), which is a variation of the Needleman-Wunsch algorithm for global sequence alignment (1970).

All of these are examples of dynamic programming algorithms.

Needleman-Wunsch

In Needleman-Wunsch, the basic idea for aligning two sequences A and B, of length m and n, is to first populate an mxn matrix F where entry F_ij is the incremental "score" of aligning character A_i with character B_j, starting at the top left. The score is given by a similarity matrix S and a gap penalty p. Then, the optimal alignment is found by following the highest score from the bottom right back to the start.

Needleman-Wunsch pseudocode for calculating F:

for i in length(A):
   F(i,0) = p * i
for j in length(B):
   F(0,j) = p * j
for i in length(A):
   for j in length(B):
      match = F(i-1, j-1) + S(A(i), B(j))
      delete = F(i-1, j) + p
      insert = F(i, j-1) + p
      F(i,j) = max(match, delete, insert)

Needleman-Wunsch pseudocode for calculating the alignment:

i = length(A)
j = length(B)
while ( i>0 and j>0 ):
  if (i>0 and j>0 and F(i, j) == F(i−1, j−1) + S(A(i), B(j))) 
        // match! Or 
        alignA = A(i) + alignA
        alignB = B(j) + alignB
        i = i − 1
        j = j − 1
    else if (i > 0 and F(i, j) == F(i−1, j) + d)
        // insertion in A or deletion in B         
        alignA ← A(i) + alignA
        alignB ← "−" + alignB
        i = i − 1
    else
        // deletion in A or insertion in B
        alignA ← "−" + alignA
        alignB ← B(j) + alignB
        j = j − 1
    }
}

Exercise: what is the time complexity of both of the Needleman-Wunsch steps?

Smith-Waterman

The basic steps of this algorithm are the same as in Needleman-Wunsch, with the main difference being how negative scores (and thus unaligned regions) are handled and where the alignment construction step starts. In Smith-Waterman, alignment starts at the highest score(s), which allows the discovery of local alignments.

Reflection: Given that Smith-Waterman and Needleman-Wunsch are optimal algorithms, how can performance be improved?

FASTA

This algorithm uses a heuristic to quickly find the most interesting regions in the sequences and then do local Smith-Waterman alignment in those regions. Interesting regions are found by searching for perfect matches of short words, then selecting regions with many such matches.

Suppose you're aligning a query sequence A of length n against a database B of length m. The algorithm has 5 main steps:

Find hot spots — For a chosen k, enumerate all possible k-tuples of the sequence (e.g. the word ABCDE has 3-tuples ABC, BCD, and CDE) and input their locations in A into a hash table. Then look up all k-tuples of the database in the hash table, marking their location in a dot matrix so it contains all points of local identity of length k.
Identify the best diagonals — Score diagonals of the dot matrix according to the length of diagonal runs (corresponding to gapless local alignments) and select the ten best diagonal runs.
Identify best ungapped regions — Within each selected diagonal, extend the matching regions and score with a scoring matrix. Again, keep the best.
Join the matched regions — Score every combination of diagonals based on individual score plus a joining penalty.
Create the alignment — Define a diagonal band around the best diagonal and perform a Smith-Waterman within that band to produce an optimal local alignment.

BLAST

A full description of BLAST was beyond the means and time of this teacher. There are so many variations, it turned out to be a bigger job than expected. However, the general process in all BLAST is the same as in FASTA.

The first main difference to FASTA is that there is a preprocessing step that removes low-entropy regions (e.g. sections with just "AAAAAAAAA").

The biggest difference is in finding the hotspots (called "seeds" in BLAST). Instead of simply mapping all database k-tuples to the query k-tuples and marking all perfect matches, BLAST first scores the query k-tuples against all possible k-tuples, storing hits above a threshold score in a search tree. This is called the neighbourhood of a match. This makes looking up the database k-tuples more well-targeted, resulting in more meaningful hits.