9.4 TBLASTN Protocols

TBLASTN and BLASTX are similar in that each compares a protein sequence with a nucleotide sequence, but they are used differently. TBLASTN is commonly used to map a protein to a genome or to search EST databases for related proteins that are not yet in the protein databases.

9.4.1 Mapping a Protein to a Genome

Many avenues of investigation focus on a specific protein—for example, medical research on a genetic disease. For many proteins, there exist several closely related homologs, and understanding the role of a particular protein often means studying the near neighbors because they sometimes have interesting properties. The genomic environment of an encoded protein is often of great interest because the genomic sequence contains regulatory elements that determine where and when proteins are expressed. So, in addition to the typical BLASTP search for homologous proteins, it is also useful to do a TBLASTN search against your favorite genomes.

9.4.1.1 Approach

Even though this is conceptually a mapping experiment, we don't choose extremely insensitive parameters because we also want to identify closely related proteins that may be of interest. The seeding parameters, which require two matching words in a 40 aa window, capture a surprising amount of variability. We'll provide an additional WU-BLAST command line that uses a single, large neighborhood. It's both faster and more sensitive than the two-hit version but takes substantially more memory. We set E to a low value to cut down on the number of low scoring hits that may be prevalent in a large search space, but if your query is especially small, this value should be increased.
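To make the two-hit seeding requirement concrete, here is a minimal sketch in Python. It is an illustration only, not the code either BLAST implementation uses. It assumes the word hits have already been found and are supplied as (query position, subject position) pairs; the function name two_hit_seeds and its arguments are hypothetical. Two hits qualify when they lie on the same diagonal, do not overlap, and start no more than 40 residues apart.

```python
from collections import defaultdict


def two_hit_seeds(word_hits, word_size=3, window=40):
    """Return pairs of word hits that could seed an extension.

    word_hits: iterable of (query_pos, subject_pos) word-hit starts.
    A pair qualifies when both hits are on the same diagonal
    (subject_pos - query_pos), do not overlap, and start within
    `window` residues of each other.
    """
    by_diagonal = defaultdict(list)
    for q, s in word_hits:
        by_diagonal[s - q].append(q)

    seeds = []
    for diag, q_positions in by_diagonal.items():
        last = None  # most recent non-overlapping hit on this diagonal
        for q in sorted(q_positions):
            if last is not None:
                distance = q - last
                if distance < word_size:
                    continue  # overlaps the previous hit, so ignore it
                if distance <= window:
                    seeds.append(((last, last + diag), (q, q + diag)))
            last = q
    return seeds


hits = [(10, 110), (25, 125), (200, 300), (50, 90)]
print(two_hit_seeds(hits))
```

In this example only the first two hits form a seed: the third is on the same diagonal but too far away, and the fourth is alone on its diagonal.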

9.4.1.2 NCBI-BLAST parameters

blastall -p tblastn -d <genome> -i <protein> -f 999 -e 1e-5

9.4.1.3 WU-BLAST parameters

tblastn <genome> <protein> filter=seg T=999 E=1e-5
tblastn <genome> <protein> filter=seg W=5 T=25 E=1e-5

9.4.1.4 Expected results

If all goes well, you'll find your gene in the genome. You may also find several related proteins. If the genome is small, you may not find more than one, but if your source is larger and more complex, you may find several copies. Genomic sequences in BLAST databases are sometimes not masked, so if your search takes a long time to complete, or if you find hundreds of similar genes, you may be hitting a repeat.

Some of the hits may be to pseudogenes. High stop codon penalties with ungapped extension will not remove all pseudogenes, so in addition to inspecting alignments for the presence of stop codons, also look for overlapping HSPs (from frame shifts) and single HSPs (when multiple exons are expected). Nearby repeats and poly-A tails in the genomic sequence are other useful indicators.
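The checks described above can be sketched as a small Python helper. This is a rough screening aid under stated assumptions, not a pseudogene detector: it assumes you have already parsed the report into records, and the HSP class, its field names, and the pseudogene_flags function are all hypothetical. It also assumes that stop codons appear as * in the translated subject line of the alignment.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class HSP:
    s_start: int    # genomic start, with s_start <= s_end
    s_end: int      # genomic end
    frame: int      # translation frame: +1..+3 or -1..-3
    sbjct_aln: str  # translated subject line of the alignment


def pseudogene_flags(hsps, expected_exons=1):
    """Return a list of reasons to suspect a hit is a pseudogene."""
    flags = []

    # Stop codons inside the translated subject sequence.
    if any("*" in h.sbjct_aln for h in hsps):
        flags.append("stop codon in alignment")

    # HSPs on the same strand, in different frames, whose genomic
    # coordinates overlap: a typical signature of a frame shift.
    for a, b in combinations(hsps, 2):
        same_strand = (a.frame > 0) == (b.frame > 0)
        overlap = a.s_start <= b.s_end and b.s_start <= a.s_end
        if same_strand and a.frame != b.frame and overlap:
            flags.append("overlapping HSPs in different frames")
            break

    # One continuous HSP where a multi-exon gene is expected.
    if expected_exons > 1 and len(hsps) == 1:
        flags.append("single HSP where multiple exons are expected")

    return flags


hit = [HSP(100, 400, 1, "MKT*LV"), HSP(390, 700, 2, "GGA")]
print(pseudogene_flags(hit))
```

Nearby repeats and poly-A tails are not covered here because they require looking at the genomic sequence itself rather than at the alignments.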

9.4.1.5 Optimizations and variations

We recommend using the serial search strategy described in Chapter 12 for all translating BLAST searches that employ long sequences. If you can't do this automatically, you can follow up each of the hits found here with bl2seq.

For more sensitivity, reduce the value of T to allow neighborhood words. For less sensitivity and a lot more speed, W=5 T=999 is a useful WU-BLAST setting. It also has the added benefit of using much less memory than W=5 T=25. If you need to do a quick lookup and are only interested in identical matches, you can adapt the protocol found in Section 9.3.3 to TBLASTN.

One way to speed up this procedure is to start with a BLASTP search to identify similar proteins and then follow up each hit in its own genome with near-identity parameters. One disadvantage to this strategy is that you will have more searches to perform and a lot of sequence handling. The assumption that the protein database you're using for your BLASTP search contains all of the genome's genes is also problematic. It's much safer to assume that not all genes have been found and use TBLASTN for your search.

9.4.2 Mining ESTs (and Shotgun DNA) for Protein Similarities

Since ESTs contain fragmentary information and are often unannotated, proteins encoded in ESTs may not appear in protein databases for a while. Therefore, if you're looking for relatives of your favorite protein, search a comprehensive EST database with TBLASTN, in addition to a typical BLASTP search. You can also use this protocol to search shotgun genomic sequence.

9.4.2.1 Approach

In choosing our alignment parameters, we need to balance sensitivity and speed. We want to be able to identify a range of similar sequences, so we use the default scoring matrix and gap penalties. At the same time, EST databases (and especially shotgun genomic databases) can be quite large, so we use slightly insensitive seeding parameters.

For WU-BLAST, we include four command lines. Number 1 is approximately the same as the NCBI-BLAST parameters. Relative to the first, number 2 is slightly faster and more sensitive, number 3 is about the same speed but much more sensitive, and number 4 has the same sensitivity but is much faster.

We use the default value for E (10) because some of the EST/shotgun matches may contain only a small portion of coding sequence. We set the output parameters high so that no alignments are lost to report truncation.
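A back-of-the-envelope calculation shows why a permissive E is needed here. The sketch below uses the standard relationship between bit score and expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the database length. It is a simplification: it ignores effective-length corrections, and the function names and the example sizes are assumptions chosen for illustration.

```python
import math


def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value for a bit score in a search space of
    query_len * db_len (no effective-length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(e_threshold, query_len, db_len):
    """Smallest bit score that still passes the E threshold."""
    return math.log2(query_len * db_len / e_threshold)


query_len = 300            # hypothetical protein query, in residues
db_len = 1_000_000_000     # hypothetical database size

for e in (10, 1e-5):
    print(e, round(min_bit_score(e, query_len, db_len), 1))
```

Under these assumptions, moving from E=1e-5 to E=10 lowers the required bit score by about 20 bits, which is roughly the margin that lets an EST with only a short stretch of coding sequence stay in the report.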

9.4.2.2 NCBI-BLAST parameters

blastall -p tblastn -d <est_db> -i <protein> -F "m S" -f 15 -b 10000 -v 10000

9.4.2.3 WU-BLAST parameters

tblastn <est_db> <protein> wordmask=seg W=3 T=15 hitdist=40 B=10000 V=10000
tblastn <est_db> <protein> wordmask=seg W=4 T=16 hitdist=40 B=10000 V=10000
tblastn <est_db> <protein> wordmask=seg W=4 T=20 B=10000 V=10000
tblastn <est_db> <protein> wordmask=seg W=4 T=99 B=10000 V=10000

9.4.2.4 Expected results

Our seeding parameters are on the insensitive side, so if you don't find what you're looking for, the first parameter to change is T (-f in NCBI-BLAST). Drop it by one or two points at a time because the search takes longer with each decrement.
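If you want to script this stepwise reduction, the following Python sketch builds the series of NCBI-BLAST command lines from this protocol with -f lowered one point at a time. It only constructs the argument lists and does not run them; the function name tblastn_commands, its arguments, and the file names in the example are hypothetical.

```python
def tblastn_commands(est_db, protein, start_f=15, steps=3, decrement=1):
    """Build blastall argument lists with a decreasing -f threshold.

    The options mirror the NCBI-BLAST command line in this protocol;
    only the value of -f changes between searches.
    """
    commands = []
    for i in range(steps):
        threshold = start_f - i * decrement
        commands.append([
            "blastall", "-p", "tblastn",
            "-d", est_db, "-i", protein,
            "-F", "m S",
            "-f", str(threshold),
            "-b", "10000", "-v", "10000",
        ])
    return commands


for cmd in tblastn_commands("est_db", "protein.fa"):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run as is. Run the searches in order and stop as soon as you find what you're looking for, because each decrement makes the search slower. Note that the printed form drops the quotes around "m S", so add them back if you paste a line into a shell.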

Sequencing errors, especially insertions and deletions, may terminate extension. This can lead to multiple HSPs or possibly the loss of smaller HSPs. Check the coordinates of the alignments, and if large regions are missing, they may correspond to out-of-frame coding sequences. They may also be UTRs in a transcript or introns in genomic sequence. Reducing the search space to increase sensitivity enables you to recover shorter HSPs; bl2seq is convenient for this task.
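Checking the coordinates by eye gets tedious when there are many hits, so here is a minimal Python sketch of the idea. It assumes you have already extracted the query coordinates of the HSPs for one database sequence as 1-based, inclusive (start, end) pairs; the function name uncovered_query_regions and the min_gap cutoff are hypothetical choices, not part of either BLAST package.

```python
def uncovered_query_regions(query_len, hsps, min_gap=20):
    """Return query intervals of at least min_gap residues that no
    HSP covers.

    hsps: list of (q_start, q_end) pairs, 1-based and inclusive.
    """
    gaps = []
    next_uncovered = 1  # first query position not yet covered
    for start, end in sorted(hsps):
        if start - next_uncovered >= min_gap:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if query_len - next_uncovered + 1 >= min_gap:
        gaps.append((next_uncovered, query_len))
    return gaps


print(uncovered_query_regions(300, [(1, 80), (150, 220), (60, 100)]))
```

The regions it reports are the ones worth following up with a smaller search space, for example with bl2seq against the individual database sequence.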

9.4.2.5 Optimizations and variations

If you're looking for near identities, you can make this search much faster. See Section 9.3.3 for parameters. Because the query and database sequences are all short, you can't optimize this search with a serial strategy.
