BLAST: A real-world example of algorithmics
"In bioinformatics, BLAST (basic local alignment search tool)[2] Links to an external site. is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences."
BLAST (1990) is an improvement upon FASTA (1985), which is a heuristic solution of the Smith-Waterman algorithm for local sequence alignment (1981), which is a variation of the Needleman-Wunsch algorithm for global sequence alignment (1970).
All of these are examples of dynamic programming algorithms.
Needleman-Wunsch
In Needleman-Wunsch, the basic idea for aligning two sequences A and B, of length m and n, is to first populate an mxn matrix F where entry Fij is the incremental "score" of aligning character Ai with character Bj, starting at the top left. The score is given by a similarity matrix S and a gap penalty p. Then, the optimal alignment is found by following the highest score from the bottom right back to the start.
Needleman-Wunsch pseudocode for calculating F:
for i in length(A):
F(i,0) = p * i
for j in length(B):
F(0,j) = p * j
for i in length(A):
for j in length(B):
match = F(i-1, j-1) + S(A(i), B(j))
delete = F(i-1, j) + p
insert = F(i, j-1) + p
F(i,j) = max(match, delete, insert)
Needleman-Wunsch pseudocode for calculating the alignment:
i = length(A)
j = length(B)
while ( i>0 and j>0 ):
if (i>0 and j>0 and F(i, j) == F(i−1, j−1) + S(A(i), B(j)))
// match! Or alignA = A(i) + alignA alignB = B(j) + alignB i = i − 1 j = j − 1 else if (i > 0 and F(i, j) == F(i−1, j) + d)
// insertion in A or deletion in B
alignA ← A(i) + alignA
alignB ← "−" + alignB
i = i − 1 else
// deletion in A or insertion in B
alignA ← "−" + alignA alignB ← B(j) + alignB j = j − 1 } }
Exercise: what is the time complexity of both of the Needleman-Wunsch steps?
Smith-Waterman
The basic steps of this algorithm are the same as in Needleman-Wunsch, with the main difference being how negative scores (and thus unaligned regions) are handled and where the alignment construction step starts. In Smith-Waterman, alignment starts at the highest score(s), which allows the discovery of local alignments.
Reflection: Given that Smith-Waterman and Needleman-Wunsch are optimal algorithms, how can performance be improved?
FASTA
This algorithm uses a heuristic to quickly find the most interesting regions in the sequences and then do local Smith-Waterman alignment in those regions. Interesting regions are found by searching for perfect matches of short words, then selecting regions with many such matches.
Suppose you're aligning a query sequence A of length n against a database B of length m. The algorithm has 5 main steps:
- Find hot spots — For a chosen k, enumerate all possible k-tuples of the sequence (e.g. the word ABCDE has 3-tuples ABC, BCD, and CDE) and input their locations in A into a hash table. Then look up all k-tuples of the database in the hash table, marking their location in a dot matrix so it contains all points of local identity of length k.
- Identify the best diagonals — Score diagonals of the dot matrix according to the length of diagonal runs (corresponding to gapless local alignments) and select the ten best diagonal runs.
- Identify best ungapped regions — Within each selected diagonal, extend the matching regions and score with a scoring matrix. Again, keep the best.
- Join the matched regions — Score every combination of diagonals based on individual score plus a joining penalty.
- Create the alignment — Define a diagonal band around the best diagonal and perform a Smith-Waterman within that band to produce an optimal local alignment.
BLAST
A full description of BLAST was beyond the means and time of this teacher. There are so many variations, it turned out to be a bigger job than expected. However, the general process in all BLAST is the same as in FASTA.
The first main difference to FASTA is that there is a preprocessing step that removes low-entropy regions (e.g. sections with just "AAAAAAAAA").
The biggest difference is in finding the hotspots (called "seeds" in BLAST). Instead of simply mapping all database k-tuples to the query k-tuples and marking all perfect matches, BLAST first scores the query k-tuples against all possible k-tuples, storing hits above a threshold score in a search tree. This is called the neighbourhood of a match. This makes looking up the database k-tuples more well-targeted, resulting in more meaningful hits.