BLAST-BLAST

10.3 Command-Line Tutorial

Now that you've installed the NCBI-BLAST and/or WU-BLAST software, it's time to try it out. To do this, you will need sequences for both queries and databases. It's generally a good idea to start with something small and make sure it works before attempting to analyze large databases. To begin, download the book's example files from http://examples.oreilly.com/BLAST. Copy the Sequence directory to a local hard disk. This directory contains several database and sequences in FASTA format. The six files you will need for testing are described in Table 10-3.

Table 10-3. Contents of the Sequence directory

Name

Description

ESTs

2000 nucleotide sequences from the nematode Caenorhabditis elegans. These sequences originated from http://www.wormbase.org.

EST

A single EST from the previous collection.

globins

1203 protein sequences corresponding to the globin family (Pfam Version 7.6). These sequences and other protein families are available from http://www.sanger.ac.uk/Software/Pfam.

globin

A single protein from the previous collection.

AF287139

Latimeria chalumnae (Coelacanth) Hoxa-11 nucleotide sequence.

AAG39070

Latimeria chalumnae (Coelacanth) HoxA-11 protein sequence.

fugu_genomic

Takifugu rubripes (Pufferfish) genomic sequence containing globin genes.

chicken_genomic

Gallus gallus (Chicken) genomic sequence containing globin genes.

HoxDB

Nine homeobox protein sequences from different organisms.

p53

The Drosophila melanogaster p53 protein sequence.

p53DB

Twelve p53 homologous protein sequences from different organisms.

hit_file.p53

p53 pattern file for PHI-BLAST search.

10.3.1 NCBI-BLAST

In the following subsections, you'll run the various components of the BLAST software you just installed. You'll see brief descriptions of each program and its arguments, but the programs all have more options those displayed. See Chapter 13 for more information about each program. To begin, change your directory to the sequence directory.

10.3.1.1 formatdb

We'll start by formatting the ESTs database with the following command:

formatdb -i ESTs -p F -l ESTs.logfile -o

The -i ESTs indicates the name of the input file. The -p F indicates this isn't a protein database. The -l ESTs.logfile creates a log file. The -o creates indexing files so that sequences may be retrieved from the database. This command creates six files: ESTs.nhr, ESTs.nin, ESTs.nsd, ESTs.nsi, ESTs.nsq, and ESTs.logfile. With the exception of the log file, the files are in binary format and not human-readable. The log file looks like this (the version number will differ depending on your installation):

========================[ Oct 9, 2002 11:40 AM ]========================
Version 2.2.2 [Jan-08-2002]
Started database file "ESTs"
Formatted 2000 sequences
(END)

Now, format the globins database with the following command:

formatdb -i globins -p T -l globins.logfile -o

The parameters here are similar to those previous, with the exception of -p T because this is a protein database. Because this is the default condition, you can omit -p T. This command also creates six files: globins.phr, globins.pin, globins.psd, globins.psi, globins.psq, and globins.logfile.

10.3.1.2 blastn

Let's query the ESTs database with a single EST from that database and see if you can find it:

blastall -p blastn -d ESTs -i EST > ncbi-blastn_test

After the standard header, you should see output identical to this:

Query= AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 :
5' end, single read. 
         (322 letters)

Database: ESTs 
           2000 sequences; 673,456 total letters

Searching....done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5...   632  0.0
AU109042 AU109042  Caenorhabditis elegans cDNA clone:yk702b3 : 5...   567  e-163
AU109359 AU109359  Caenorhabditis elegans cDNA clone:yk705g2 : 5...   436  e-124
AU110478 AU110478  Caenorhabditis elegans cDNA clone:yk718h12 : ...   383  e-107
AU109122 AU109122  Caenorhabditis elegans cDNA clone:yk703d7 : 5...    32  0.040
AU110305 AU110305  Caenorhabditis elegans cDNA clone:yk718a9 : 5...    30  0.16

The summary of the report shows that the top sequence has accession number AU108953, which is the same as the query used in the file named EST.

If you would like to benchmark the performance of your system as a way to compare different machines, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. You'll see this same benchmark in Chapter 12. Read that chapter to understand why this is or isn't a good benchmark for your system. Here is the Unix command line for the benchmarking procedure:

time blastall -p blastn -d ESTs -i ESTs > /dev/null

10.3.1.3 megablast

megablast is designed for high-performance blastn searches. Do the same two searches as in previous sections.

megablast -d ESTs -i EST > megablast-test

The default output is tabular. For more information on this format, see Appendix A.

'AU108953'=='+AU108953' (1 1 322 322) 0
'AU109042'=='+AU108953' (18 14 324 319) 5
'AU109359'=='+AU108953' (12 58 282 322) 13
'AU109359'=='+AU108953' (12 58 282 322) 13
'AU110478'=='+AU108953' (25 21 240 236) 8

If you want to see how much faster megablast is than standard blastn, time them both. To compare them on equal footing, set the blastn word size higher (-W 30) and use tabular output (-m 8).

time blastall -p blastn -d ESTs -i ESTs -W 30 -m 8 > /dev/null
time megablast -d ESTs -i ESTs > /dev/null

(megablast ran about 12 times faster on the Ian's laptop in these tests.)

10.3.1.4 blastp

Run blastp by searching a single protein against a database of proteins:

blastall -p blastp -d globins -i globin > ncbi-blastp_test

This BLAST report is fairly long, and the following is only part of the summary. The top of your report should look like this:

Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.
         (145 letters)

Database: globins 
           1203 sequences; 211,084 total letters

Searching...done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.                                          293  6e-82
HBG_ATEGE P06891 Hemoglobin gamma chain.                              277  4e-77
HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain.                           276  8e-77
HBG_ALOBE P56284 Hemoglobin gamma chain.                              275  1e-76
Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN).                          274  2e-76
Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN.                            273  4e-76

You can also benchmark blastp in the same way as blastn, by doing all-versus-all protein searches. This will take about twice as long to complete as the blastn benchmark.

time blastall -p blastp -d globins -i globin > /dev/null

10.3.1.5 blastx

It is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database:

blastall -p blastx -d globins -i fugu_genomic > ncbi-blastx_test

The genomic sequence contains a cluster of alpha globin genes, so you can expect to find quite a few matches to the globins database. Your report should look like this:

Query= gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene
cluster, complete sequence
         (35,793 letters)

Database: globins 
           1203 sequences; 211,084 total letters

Searching...done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN.                            137  3e-46
HBA_THUTH P11748 Hemoglobin alpha chain.                              119  3e-40
HBA2_NOTCO P16308 Hemoglobin alpha-2 chain.                           130  5e-40
HBA2_TRENE P45719 Hemoglobin alpha-2 chain.                           126  6e-40
Q98974 Q98974 ALPHA-GLOBIN IV.                                        119  7e-39
Q98SE6 Q98SE6 ALPHANCP2 (ALPHANCP1).                                  140  9e-39
HBA_ELEEL P14520 Hemoglobin alpha chain.                              119  9e-39

10.3.1.6 tblastn

Run tblastn in a typical application that searches an EST databases for protein similarities.

blastall -p tblastn -d ESTs -i globin  > ncbi-tblastn_test

Here's the first alignment of the report in addition to the summary. We'll leave the question of biological and statistical significance as an exercise for you.

Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.
         (145 letters)

Database: ESTs 
           2000 sequences; 673,456 total letters

Searching....done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5...    28  0.043
AU109149 AU109149  Caenorhabditis elegans cDNA clone:yk703g6 : 5...    27  0.096
AU109370 AU109370  Caenorhabditis elegans cDNA clone:yk705h2 : 5...    25  0.37
AU110425 AU110425  Caenorhabditis elegans cDNA clone:yk716a3 : 5...    24  0.62
AU109448 AU109448  Caenorhabditis elegans cDNA clone:yk706h11 : ...    20  6.9
AU109900 AU109900  Caenorhabditis elegans cDNA clone:yk712d7 : 5...    20  6.9
AU110608 AU110608  Caenorhabditis elegans cDNA clone:yk720c9 : 5...    20  9.0
AU109391 AU109391  Caenorhabditis elegans cDNA clone:yk706b11 : ...    20  9.0

>AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5' end,
           single read.
          Length = 327

 Score = 27.7 bits (60), Expect = 0.043
 Identities = 12/34 (35%), Positives = 20/34 (58%)
 Frame = -1

Query: 21  VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54
           + + G   + R+  V P +QRF  S  ++CSP+A
Sbjct: 144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43

10.3.1.7 tblastx

tblastx is commonly used as a gene-finding tool to identify potential coding regions between genomes. You can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first and then run the search:

formatdb -i chicken_genomic -p F
blastall -p tblastx -d chicken_genomic -i fugu_genomic> ncbi-blastx_test

You should see the following error message while your search is running. This warns you that there are more than 200 significant alignments between the two sequences. You can't increase this number with a command-line switch, and the only workaround is to cut your query sequence into smaller pieces. This limitation applies only in ungapped searches (TBLASTX or any search with the setting -g F).

[blastall] WARNING:  [000.000] gi|18463974|gb|AY016024.1|: Reached max 200 HSPs in 
BlastSaveCurrentHsp, continuing with this limit

Since there's only one sequence in the database, look at the first alignment instead of the summary:

>gi|17104478|gb|AY016020.1| Gallus gallus alpha globin gene cluster,
             complete sequence
          Length = 103190

 Score =  141 bits (302), Expect = 1e-34
 Identities = 61/73 (83%), Positives = 66/73 (89%)
 Frame = -2 / -3

Query: 25076 ASSDDMTLTSPSMDNSSAELLPGGDSPLNKRITETLLASLSEHERQVILSVPAAQNPEDL 24897
             ASSDDMTLTSPSMDNSSAEL+PGGDSPLNKR+TE LLASL EHER+ IL+VPAAQNPEDL
Sbjct: 5289  ASSDDMTLTSPSMDNSSAELIPGGDSPLNKRMTENLLASLLEHEREAILNVPAAQNPEDL 5110

                          
Query: 24896 RMFAR*NHLSTKC 24858
             RMFAR*   S +C
Sbjct: 5109  RMFAR*EIGSAEC 5071

Note the stop codon (*) near the end of the alignment. The scoring matrices distributed with BLAST set the score of all stop codon matches to +1. If you want to terminate alignments at stop codons, you have to edit the scoring matrix. This procedure is described at the end of this chapter.

10.3.1.8 bl2seq

The previous tblastx test used only two sequences. Whenever you want to align just two sequences, you can use the bl2seq program to save the effort of having to format one sequence as a database and remove the database files later. Here's the command:

bl2seq -p tblastx -j chicken_genomic -i fugu_genomic

You won't see a summary, but the rest of the report will be nearly identical to the one in the previous section.

10.3.1.9 fastacmd

The fastacmd program can retrieve one or more FASTA-formatted sequences from a BLAST database. Try this by retrieving a single sequence from the ESTs database:

fastacmd -d ESTs -s AU108953 -l 60

The -d ESTs designates the database from which you retrieve the sequence. The database must have been formatted with the -o option of formatdb for fastacmd to work. The -s AU108953 is the string for which to search the database. The -l 60 specifies the sequence line lengths; the default line length is 80. The output from this test should be the following:

>lcl|AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5' end, sing
le read
TGGCCTACTGGANAAAACAACAATGCGTGCTTTACTATTCACCTCTGTTGTTCTTTTGGC
TTTGGCTTTTGTTGAGGCAAAGAAGCAGACTATCACTGTCAAGGGTACAACTATTTGTAA
TAAGAAGAGAATTCAGGCGGAGGTTACCCTTTGGGAGAAAGATACTCTCGACCCCGATGA
CAAGCTCGCCTCAATGCAATCGAACAAAGAAGGAGAGTTCTCACTTACCGGATCCGACGA
CGAGATCACCTCAATCTCTCCATACCTCATAATCACCCACAACAGCAACGTGAAGAAGGC
CGGATGAAGCCGTGTTTCAGAG

The beginning of the definition has a prepended lcl|. The lcl| isn't in the original FASTA definition line and doesn't show up in BLAST reports; it shows up here because the database identifier wasn't in NCBI format (see Chapter 11).

fastacmd has several other useful features. It can report a summary of a BLAST database, dump a BLAST database to FASTA format, report a subsequence of a particular sequence, and even display taxonomic information if the databases are downloaded from ftp://ftp.ncbi.nih.gov/blast/db/. See Chapter 13 for more information on fastacmd.

10.3.1.10 PSI-BLAST

PSI-BLAST identifies weak amino acid similarities and is executed using the blastpgp program. Test it by iteratively searching the Drosophila melanogaster p53 protein against a database of p53 homologs. First, format the p53 database, just as you would for any protein database:

formatdb -i p53DB -p T -l p53DB.logfile -o

Now you can search with your p53 protein{

blastpgp -d p53DB -i p53 -j 5 > ncbi-psiblast_test

You'll get a report with four iterations. All four rounds of automated searching are concatenated into one report. The following represents the summary section for these four rounds.

In round 1, there are 11 matches out of 12 total database sequences. Notice that there is an insignificant hit to a hypothetical protein at the bottom of the report with a score of 20 and an E-value near 1.

Query= gi|7211767|gb|AAF40427.1|AF224713_1 transcription factor p53
[Drosophila melanogaster]
         (385 letters)

Database: p53DB 
           12 sequences; 3745 total letters

Results from round 1

                                                                       Score    E
Sequences producing significant alignments:                           (bits) Value

gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]                  98   3e-024
gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...    75   2e-017
gb|AAA98563.1| p53 tumor suppressor homolog                            70   9e-016
gb|AAC24830.1| p53 homolog [Homo sapiens]                              63   1e-013
gb|AAC37335.1| p53 [Canis familiaris]                                  59   1e-012
gb|AAC05704.1| tumor suppressor p53 [Mus musculus]                     58   4e-012
gb|AAC60746.1| p53 [Xenopus laevis]                                    57   9e-012
gb|AAD12203.1| tumor suppressor protein [Canis familiaris]             43   1e-007
gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]     42   2e-007
gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...    35   4e-005
gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]             20   0.98

In round 2, all 12 sequences from the p53DB have hits better than the E-value threshold of 10. All scores are higher and an annotated p53 peptide fragment from mouse (gb|AAC71764.1|) is now found with a significant score.

Results from round 2

                                                                       Score    E
Sequences producing significant alignments:                           (bits) Value
Sequences used in model and found again:

gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   293   4e-083
gb|AAC24830.1| p53 homolog [Homo sapiens]                             291   3e-082
gb|AAC05704.1| tumor suppressor p53 [Mus musculus]                    279   7e-079
gb|AAC37335.1| p53 [Canis familiaris]                                 279   9e-079
gb|AAA98563.1| p53 tumor suppressor homolog                           278   1e-078
gb|AAC60746.1| p53 [Xenopus laevis]                                   270   5e-076
gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]                 251   3e-070
gb|AAD12203.1| tumor suppressor protein [Canis familiaris]            227   4e-063
gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   221   2e-061
gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    127   4e-033

Sequences not found previously or not previously below threshold:

gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]              40   8e-007
gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]             25   0.028

In round 3, the scores of all hits are even better as the position-specific scoring matrix is increasingly tuned to the p53 profile.

Results from round 3

                                                                       Score    E
Sequences producing significant alignments:                            (bits) Value
Sequences used in model and found again:

gb|AAC60746.1| p53 [Xenopus laevis]                                   323   6e-092
gb|AAC05704.1| tumor suppressor p53 [Mus musculus]                    309   7e-088
gb|AAA98563.1| p53 tumor suppressor homolog                           306   5e-087
gb|AAC24830.1| p53 homolog [Homo sapiens]                             304   2e-086
gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   295   1e-083
gb|AAC37335.1| p53 [Canis familiaris]                                 285   2e-080
gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]                 271   2e-076
gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   241   2e-067
gb|AAD12203.1| tumor suppressor protein [Canis familiaris]            231   3e-064
gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    124   3e-032
gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]              43   1e-007

Sequences not found previously or not previously below threshold:

gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]             32   3e-004

In round 4, no new sequences are found below the level of significance (0.005). Therefore the search has converged, and no new iterations are performed.

Results from round 4

                                                                       Score    E
Sequences producing significant alignments:                           (bits) Value
Sequences used in model and found again:

gb|AAC60746.1| p53 [Xenopus laevis]                                   333   3e-095
gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   319   6e-091
gb|AAC24830.1| p53 homolog [Homo sapiens]                             319   6e-091
gb|AAC05704.1| tumor suppressor p53 [Mus musculus]                    313   5e-089
gb|AAA98563.1| p53 tumor suppressor homolog                           311   3e-088
gb|AAC37335.1| p53 [Canis familiaris]                                 289   8e-082
gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]                 272   6e-077
gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   241   2e-067
gb|AAD12203.1| tumor suppressor protein [Canis familiaris]            230   3e-064
gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    125   2e-032
gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]             70   1e-015
gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]              44   8e-008

Sequences not found previously or not previously below threshold:

CONVERGED!

10.3.1.11 PHI-BLAST

PHI-BLAST seeds extensions from important regions of a protein—for example, an enzyme active site or a conserved domain. It is also executed with the blastpgp program. To specify a PHI-BLAST search, a hit_file must be available (-k) and the program (-p) must be specified as either patseedp or seedp, with seedp specifying a special hit_file. You'll run the program using a conserved part of the Drosophila melanogaster p53 protein as a pattern seed to search the p53DB. The pattern is designated in hit_file.p53:

ID  p53 Pattern
PA  [YF]-[ST]-X-X-L-N-K-L-[YF]

The following command uses this pattern as a seed in a search of the p53 database:

blastpgp -d p53DB -i p53 -k hit_file.p53 -p patseedp > ncbi-phiblast_test

In the results, the position of the pattern in the query and the probability of finding the pattern are shown, followed by the summary lines and alignments. In the first alignment, asterisks show the position of the pattern in the query and subject.

1 occurrence(s) of pattern in query
 p53 Pattern
 pattern [YF]-[ST]-X-X-L-N-K-L-[YF]
 at position 107 of query sequence
effective database length=3.6e+003
 pattern probability=1.4e-008
lengthXprobability=5.0e-005

Number of occurrences of pattern in the database is 4
WARNING: There may be more matching sequences with e-values below the threshold of 
10.000000

                                                                      Score     E
                                                                     (bits)  Value

Significant matches for pattern occurrence 1 at position 107

gb|AAC37335.1| p53 [Canis familiaris]                                  47   2e-014
gb|AAC60746.1| p53 [Xenopus laevis]                                    41   2e-012
gb|AAC05704.1| tumor suppressor p53 [Mus musculus]                     38   2e-011
gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...    16   7e-005

Significant alignments for pattern occurrence 1 at position 107

>gb|AAC37335.1| p53 [Canis familiaris]
          Length = 281

 Score = 47.2 bits (132), Expect = 2e-014
 Identities = 56/220 (25%), Positives = 94/220 (42%), Gaps = 27/220 (12%)

Query:  105 WMYSIPLNKLYIRMNKAFNVDVQFKSKMPIQPLNLRVFLCF--SNDVSAPVVRCQNHLSV 162
pattern 107   *********
            W YS  LNKL+ ++ K   V +   S  P     +R    +  S  V+  V RC +H   
Sbjct:  16  WTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTC-VRAMAIYKKSEFVTEVVRRCPHHERC 74

10.3.1.12 Environment variables and .ncbirc

If everything works, make sure the environment variables and data directory function correctly. Now, move to any random directory on your system and attempt one of the previous searches, but substitute the correct path to the query file.

If the executable path is incorrect, you will have an error message such as:

blastall: command not found

Databases that can't be found produce error messages like the following:

[blastall] WARNING: <query> Could not find index files for database <db>

If the data directory can't be found, blastall reports that it is unable to open the scoring matrix:

Searching[blastall] WARNING: <query> [000.000] Unable to open <matrix>

10.3.2 WU-BLAST

As in the previous section, you'll run the various components of the WU-BLAST software in typical sequence analysis settings.

10.3.2.1 xdformat

Start by formatting the ESTs database with the following command. The -n indicates that this is a nucleotide database. This command creates three files: ESTs.xnd, ESTs.xns, and ESTs.xnt.

xdformat -n ESTs

xdformat also prints some interesting statistics to stdout. You can always retrieve this information later with xdformat -n -r ESTs.

XDFORMAT-WashU 1.0 [02-Apr-2002] [macosx-ppc-ILP32F64 2002-04-02T01:26:43]
Start:  2002-09-29T12:10:05
Input: "ESTs"
XDF Output Database:  ESTs
 Alphabet:  NCBI2na(1)
 No. of sequences (letters) written:  2000  (673,456)
 Longest sequence written (in database):  933  (933)
 Edit Alphabet:  WUStLna(1)
 Sequences edited for ambiguity codes:  961
 No. of edits (total length) written:  1901  (2102)
 Index entries written (in database):  2000  (2000)
Total cpu time:  0.12u 0.00s 0.12t  Elapsed: 00:00:00
End:  2002-09-29T12:10:05

If you use the free version of WU-BLAST, use pressdb instead of xdformat. The output is quite brief (so it's not shown here). The files are named differently, too: ESTs.csq, ESTs.nhd, and ESTs.ntb.

pressdb ESTs

Format the globins database with the following command. The -p indicates that this is a file or protein sequences. This command also creates three files: globins.xpd, globins.xps, and globins.xpt.

xdformat -p globins

If you use the free version of WU-BLAST, use setdb and name the files globins.ahd, globins.atd, and globins.bsq:

setdb globins

10.3.2.2 blastn

Let's query the ESTs database with a single EST from that database and see if you can find it:

blastn ESTs EST > wu-blastn_test

The output of your search contains header information and a summary that should look like this:

Query=  AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5' end,
    single read. 
        (322 letters)

Database:  ESTs
           2000 sequences; 673,456 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk70...  1603  3.3e-69   1
AU109042 AU109042  Caenorhabditis elegans cDNA clone:yk70...  1507  7.1e-65   1
AU109359 AU109359  Caenorhabditis elegans cDNA clone:yk70...  1203  3.8e-51   1
AU110478 AU110478  Caenorhabditis elegans cDNA clone:yk71...  1052  3.4e-44   1
AU109925 AU109925  Caenorhabditis elegans cDNA clone:yk71...   310  7.9e-11   1
AU110873 AU110873  Caenorhabditis elegans cDNA clone:yk72...   121  0.998     1

The query sequence, which was in a file named EST, has the accession number AU108953, which is the highest scoring match in the report summary. There were several other hits, and you may want to browse the rest of the file for fun.

If you wish to benchmark your system for comparisons to other systems, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. We use this same benchmark in Chapter 12 for NCBI-BLAST; read that chapter to understand why this is or isn't a good benchmark for your system. Don't compare this WU-BLAST benchmark to that produced by NCBI-BLAST because the default parameters produce different results. A proper cross-comparison requires that the results be the same. Here is the command line for the benchmark (note the addition of the -warnings flag to suppress warning messages).

time blastn ESTs ESTs -warnings > /dev/null

10.3.2.3 blastp

You can run blastp as you did blastn, by searching a single protein against a database of proteins:

blastp globins globin > wu-blastp_test

The output of this search is quite a bit longer than the previous EST example. Only part of the summary is shown here:

Query=  Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.
        (145 letters)

Database:  globins
           1203 sequences; 211,084 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.                                   749  1.0e-76   1
HBG_ATEGE P06891 Hemoglobin gamma chain.                       710  1.4e-72   1
HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain.                    707  2.9e-72   1
HBG_ALOBE P56284 Hemoglobin gamma chain.                       705  4.7e-72   1
Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN).                   703  7.7e-72   1
Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN.                     701  1.3e-71   1

You can benchmark blastp in the same way as blastn, however, it takes about twice as long to complete as the blastn benchmark.

time blastp globins globins -warnings > /dev/null

10.3.2.4 blastx

It is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database:

blastx globins fugu_genomic filter=seg  > wu-blastx_test

While the search is running, you'll see the following warning message that indicates that some of the default parameters were overrun:

WARNING:  Descriptions of 333 database sequences were not reported due to the
          limiting value of parameter V = 500.
WARNING:  HSPs involving 583 database sequences were not reported due to the
          limiting value of parameter B = 250.

The top of the report should look like this:

Query=  gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene cluster,
    complete sequence
        (35,793 letters)

  Translating both strands of query sequence in all 6 reading frames

Database:  globins
           1203 sequences; 211,084 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN.                     277  7.3e-44   3
HBAA_SERQU Q9PVM4 Hemoglobin alpha-A chain.                    275  2.6e-40   3
O13136 O13136 ALPHA-GLOBIN.                                    278  8.6e-40   3
Q90487 Q90487 AA1 ALPHA GLOBIN.                                279  1.4e-39   3
HBA_CYPCA P02016 Hemoglobin alpha chain.                       278  1.8e-39   3

10.3.2.5 tblastn

It is common to use tblastn to search EST databases for protein similarities. Here's the command:

tblastn ESTs globin filter=seg  > wu-tblastn_test

The top of the report and the first alignment are shown here. You might consider the significance of the alignment as an experiment.

Query=  Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.
        (145 letters)

Database:  ESTs
           2000 sequences; 673,456 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                     Smallest
                                                                       Sum
                                                     Reading  High  Probability
Sequences producing High-scoring Segment Pairs:        Frame Score  P(N)      N

AU109565 AU109565  Caenorhabditis elegans cDNA clone:y... -1    60  0.42      1
AU109149 AU109149  Caenorhabditis elegans cDNA clone:y... -1    57  0.75      1
AU109885 AU109885  Caenorhabditis elegans cDNA clone:y... +1    53  0.996     1

>AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5' end, single
          read.
        Length = 327

  Minus Strand HSPs:

 Score = 60 (26.2 bits), Expect = 0.55, P = 0.42
 Identities = 12/34 (35%), Positives = 20/34 (58%), Frame = -1

Query:    21 VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54
             + + G   + R+  V P +QRF  S  ++CSP+A
Sbjct:   144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43

10.3.2.6 tblastx

It is common to use tblastx as a gene-finding tool to identify potential coding regions between genomes, and you can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first. If you're using the free version, substitute pressdb for xdformat and don't include the -n parameter.

xdformat -n chicken_genomic
tblastx chicken_genomic fugu_genomic filter=seg  > wu-blastx_test

The following warning is reported during the search, indicating that the number of alignments has passed a default threshold. This can be avoided with hspmax=0.

WARNING:  hspmax=1000 was exceeded with 1 of the database sequences.

The first alignment is shown here:

>gb|AY016020.1| Gallus gallus alpha globin gene cluster, complete sequence
        Length = 103,190

  Plus Strand HSPs:

 Score = 183 (69.5 bits), Expect = 9.2e-67, Sum P(14) = 9.2e-67
 Identities = 40/53 (75%), Positives = 44/53 (83%), Frame = +2 / +1

Query: 24479 DWKIEITGSS*LVTTRTLRNLSNSVLRCERLMFSLYMISSRWWCPRK**SSLK 24637
             D K E+TGSS LVTT TLRNLSNS+  CER MFSLYMISS+W  PRK**SSL+
Sbjct:   496 D*KTEMTGSSWLVTTSTLRNLSNSISSCERRMFSLYMISSKWCRPRK**SSLR 654

Note that the alignment contains stop codons (*). If you wish to terminate alignments at stop codons, you will have to increase the gap penalties and align without gaps. You will learn more about this topic at the end of this chapter.

10.3.2.7 xdget

The xdget program retrieves sequences from a BLAST database. You can use this only if you have the licensed software, and you must index the databases with xdformat before you can use xdget. The ESTs database can be indexed by the following command:

xdformat -n -X ESTs

You'll see some diagnostic output, and there will be an additional file called ESTs.xni. You can index at the time of formatting if you include the -I flag. The globins database is indexed similarly:

xdformat -p -X globins

You'll also see some diagnostic output that includes 33 lines similar to the following:

 *Duplicate ID:  Q17154

This indicates that you have duplicate database identifiers in the FASTA database. The duplicates are intentional, and you can remove them with the nrdb and patdb tools described later. Now retrieve some sequences:

xdget -n ESTs AU109017

You should see the following output:

>AU109017 AU109017  Caenorhabditis elegans cDNA clone:yk701g7 : 5' end, single read. 
TGGCCTACTGGGGTTTAATTACCCAAGTTTGAGATGGCTGCTGCTTCAGTGAAAGGCTTT
TTCCAGCGGACCGGAATCAGCATCAAAGAATATTTTAAACGAATGGGAAATGATTATGCT
ACTGTAGCTAGGGAAACTGTCCAAGGATGTAAAGATAGACCTGTTAAAGCTGGAGTTGTT
TTCTCTGGGCTCGGTTTTTTAACCTATGCATATCAGACAAATCCAACAGAGCTGGAAATG
TATGATTATTTATGCGAGAGACGACAAAAGTTAGTTTTGGTCCCGAATTCTGAGCATAAT
CCGGCTACAACTAAAGAATTAACTGCTCGCGA

And for proteins, you can rely on a similar action:

xdget -p globins HBP_CANLI

You should see the following output:

>HBP_CANLI P42511 Leghemoglobin.
MGAFSEKQESLVKSSWEAFKQNVPHHSAVFYTLILEKAPAAQNMFSFLSNGVDPNNPKLK
AHAEKVFKMTVDSAVQLRAKGEVVLADPTLGSVHVQKGVLDPHFLVVKEALLKTFKEAVG
DKWNDELGNAWEVAYDELAAAIKKAMGSA

10.3.2.8 nrdb and patdb

The nrdb and patdb programs are useful for removing the redundant sequences you saw earlier. Use nrdb first:

nrdb globins > globins_nr

The additional output from the program is as follows:

           --------- Records ---------  -------------- Residues -----------
Database      Read  Duplicate  Written         Read   Duplicate     Written
globins       1203       39       1164      211,084      37,819     173,265

Totals:       1203       39       1164      211,084      37,819     173,265

No. of base word hits:  53 (53 total)
No. of 32-bit hash hits:  39
Total memory allocated:  0.500 MB
Longest comment line read:  184
Longest comment line written:  423
Longest sequence read:  1431

The report shows 39 duplicate records. When indexing the globins database, you'll find that there were 33 duplicate identifiers. However, nrdb looks for duplicate sequences not identifiers. When nrdb finds duplicate sequences, it concatenates the identifiers, separating them with a Control-A character. This is normally a whitespace character in a terminal window but is visible in some text editors and pagers.

patdb is even more aggressive than nrdb for removing redundancies; it removes identical substrings so the sequences MVLQ and MVLQKP are merged into the same entry.

10.3.2.9 Environment variables

Now it's time to make sure the environment variables function correctly. To start, move to any random directory on your system and attempt one of the previous searches, substituting the correct path to the query file:

If the executable path is incorrect, you will see an error message such as:

blastp: command not found

If the databases can't be found, WU-BLAST reports:

FATAL:  Could not open the database:  <database name>

WU-BLAST offers informative messages when it fails to find scoring matrices and complexity filters:

FATAL:  "No such file or directory" error encountered (errno=2), when attempting to 
FIND the requested sequence filter program "nseg". Please check the setting of the 
BLASTFILTER or WUBLASTFILTER environment variables and the permissions on the 
program.
EXIT CODE 32

FATAL:  Could not find or open a substitution matrix file named: BLOSUM62. Check 
file access permissions or the setting of the WUBLASTMAT
(BLASTMAT) environment variable.
EXIT CODE 8

For a complete list of errors and messages, see Chapter 14.

[ Team LiB ]