BLAST Overview

Aadit Mathur
Jun 30
4 min read

BLAST is one of the most important tools in molecular biology today. Also known as the Basic Local Alignment Search Tool, BLAST allows users to efficiently compare shorter query sequences (of nucleotides or amino acids) against a large database of sequences. Whether it be to help with phylogeny or the identification of organisms, BLAST's varied applications have made it essential today.

The NCBI has an implementation of BLAST on its website that can be used very easily (albeit with a dizzying array of parameters―more on them soon!). Let's explore how this tool actually works.

A screenshot of the NCBI BLAST homepage.

The Problem With Alignment

A naive approach to this problem would be to simply conduct a large-scale local alignment between the query string and the database to identify high-scoring regions. However, in practice, this is utter insanity.

If n is the length of the database sequence and m is the length of the query sequence, then local alignment uses an n by m 2D array. With a small database, this might be slow and memory-intensive yet still feasible. But with real-world databases, alignment quickly requires absurd amounts of memory and time, making such an algorithm completely unfeasible.

Instead, we make some guesses.

Optional Aside: P-values vs E-values

BLAST results are typically reported using the e-value, which estimates how many hits with a similar score one might expect to see by chance in a database of that size. While technically related to the p-value—the probability of observing a score that good or better by random chance—the e-value incorporates database size, making it more practical for filtering in large searches. For simplicity, this article will refer to p-values when discussing statistical thresholds, though e-values are what BLAST actually outputs.

Guessing Good Comparisons

BLAST is a heuristic: it can't guarantee that it made the perfect alignments. Instead, it makes a series of guesses to find parts of the database that it thinks will produce significant alignments against our query. That way, rather than comparing our query against the whole database, we only have to compare it against smaller parts of it.

Different versions of BLAST have modified the steps used to make these guesses. This article will cover BLAST as described in the original 1990 paper that proposed it since all other BLAST implementations build off of the same idea: you will be able to understand other versions of BLAST once the basic concepts are down.

Overall Steps

First, we want to find seeds, which are places in the database where we'll start off looking in.

Split the query string into overlapping w-mers, which are substrings of length w
1. The length of these w-mers is a parameter w that we pass into the algorithm
2. For example, if w is 3, the string ACTG would be split into ACT and CTG
PROTEIN BLAST ONLY: Use a scoring matrix like BLOSUM62 to assign scores to possible substitutions of each w-mer, and filter out w-mers with low scores
1. If you're confused, don't worry! This will be explored more in a future article.
2. This is only necessary with protein BLAST because some amino acids can be substituted for each other with minimal chemical changes while others can cause major changes. These chemical differences don't really exist the same way for the four nucleotides.
Search the database string for matching w-mers

Now that we have our seeds, we can create high-scoring segment pairs (HSPs). These are the portions of the database that we will align our query against.

Extend our w-mers to the left and right of the database string
1. Points are added for matches and subtracted for mismatches
2. If the score falls by a parameter, X, without increasing, then we will stop extending the HSP
Filter out HSPs with low scores
1. The minimum score required is another parameter S
Filter out HSPs that are not statistically significant matches
1. The p-value associated with the HSP must be less than a threshold parameter, often called E or Q
Combine nearby HSPs together (optional)
1. This step is not described in the original paper but is common practice today. The threshold p-value for merging two HSPs is yet another tunable parameter.

Now, we can finally do a local alignment.

Iterate over each HSP and conduct a local alignment
Return significant alignments to the user
1. This may involve using e-values, p-values, raw alignment scores, or other metrics

Next Steps

Now that we've seen the rough picture of how BLAST works, we can finally begin understanding how to implement it.

The next article in this series will discuss the BLAST parameters in more detail and their effect on the outputs of the program. Then, we'll walk through each step outlined above to build a basic implementation of the algorithm and finally begin coding!

Citations

Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology

215.3 (1990): 403-410.

Madden, Tom. "The BLAST sequence analysis tool." The NCBI handbook 2.5 (2013): 425

436.