[ Team LiB ] |
10.3 Command-Line TutorialNow that you've installed the NCBI-BLAST and/or WU-BLAST software, it's time to try it out. To do this, you will need sequences for both queries and databases. It's generally a good idea to start with something small and make sure it works before attempting to analyze large databases. To begin, download the book's example files from http://examples.oreilly.com/BLAST. Copy the Sequence directory to a local hard disk. This directory contains several database and sequences in FASTA format. The six files you will need for testing are described in Table 10-3.
10.3.1 NCBI-BLASTIn the following subsections, you'll run the various components of the BLAST software you just installed. You'll see brief descriptions of each program and its arguments, but the programs all have more options those displayed. See Chapter 13 for more information about each program. To begin, change your directory to the sequence directory. 10.3.1.1 formatdbWe'll start by formatting the ESTs database with the following command: formatdb -i ESTs -p F -l ESTs.logfile -o The -i ESTs indicates the name of the input file. The -p F indicates this isn't a protein database. The -l ESTs.logfile creates a log file. The -o creates indexing files so that sequences may be retrieved from the database. This command creates six files: ESTs.nhr, ESTs.nin, ESTs.nsd, ESTs.nsi, ESTs.nsq, and ESTs.logfile. With the exception of the log file, the files are in binary format and not human-readable. The log file looks like this (the version number will differ depending on your installation): ========================[ Oct 9, 2002 11:40 AM ]======================== Version 2.2.2 [Jan-08-2002] Started database file "ESTs" Formatted 2000 sequences (END) Now, format the globins database with the following command: formatdb -i globins -p T -l globins.logfile -o The parameters here are similar to those previous, with the exception of -p T because this is a protein database. Because this is the default condition, you can omit -p T. This command also creates six files: globins.phr, globins.pin, globins.psd, globins.psi, globins.psq, and globins.logfile. 10.3.1.2 blastnLet's query the ESTs database with a single EST from that database and see if you can find it: blastall -p blastn -d ESTs -i EST > ncbi-blastn_test After the standard header, you should see output identical to this: Query= AU108953 AU108953 Caenorhabditis elegans cDNA clone:yk701a6 : 5' end, single read. (322 letters) Database: ESTs 2000 sequences; 673,456 total letters Searching....done Score E Sequences producing significant alignments: (bits) Value AU108953 AU108953 Caenorhabditis elegans cDNA clone:yk701a6 : 5... 632 0.0 AU109042 AU109042 Caenorhabditis elegans cDNA clone:yk702b3 : 5... 567 e-163 AU109359 AU109359 Caenorhabditis elegans cDNA clone:yk705g2 : 5... 436 e-124 AU110478 AU110478 Caenorhabditis elegans cDNA clone:yk718h12 : ... 383 e-107 AU109122 AU109122 Caenorhabditis elegans cDNA clone:yk703d7 : 5... 32 0.040 AU110305 AU110305 Caenorhabditis elegans cDNA clone:yk718a9 : 5... 30 0.16 The summary of the report shows that the top sequence has accession number AU108953, which is the same as the query used in the file named EST. If you would like to benchmark the performance of your system as a way to compare different machines, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. You'll see this same benchmark in Chapter 12. Read that chapter to understand why this is or isn't a good benchmark for your system. Here is the Unix command line for the benchmarking procedure: time blastall -p blastn -d ESTs -i ESTs > /dev/null 10.3.1.3 megablastmegablast is designed for high-performance blastn searches. Do the same two searches as in previous sections. megablast -d ESTs -i EST > megablast-test The default output is tabular. For more information on this format, see Appendix A. 'AU108953'=='+AU108953' (1 1 322 322) 0 'AU109042'=='+AU108953' (18 14 324 319) 5 'AU109359'=='+AU108953' (12 58 282 322) 13 'AU109359'=='+AU108953' (12 58 282 322) 13 'AU110478'=='+AU108953' (25 21 240 236) 8 If you want to see how much faster megablast is than standard blastn, time them both. To compare them on equal footing, set the blastn word size higher (-W 30) and use tabular output (-m 8). time blastall -p blastn -d ESTs -i ESTs -W 30 -m 8 > /dev/null time megablast -d ESTs -i ESTs > /dev/null (megablast ran about 12 times faster on the Ian's laptop in these tests.) 10.3.1.4 blastpRun blastp by searching a single protein against a database of proteins: blastall -p blastp -d globins -i globin > ncbi-blastp_test This BLAST report is fairly long, and the following is only part of the summary. The top of your report should look like this: Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. (145 letters) Database: globins 1203 sequences; 211,084 total letters Searching...done Score E Sequences producing significant alignments: (bits) Value Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. 293 6e-82 HBG_ATEGE P06891 Hemoglobin gamma chain. 277 4e-77 HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain. 276 8e-77 HBG_ALOBE P56284 Hemoglobin gamma chain. 275 1e-76 Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN). 274 2e-76 Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN. 273 4e-76 You can also benchmark blastp in the same way as blastn, by doing all-versus-all protein searches. This will take about twice as long to complete as the blastn benchmark. time blastall -p blastp -d globins -i globin > /dev/null 10.3.1.5 blastxIt is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database: blastall -p blastx -d globins -i fugu_genomic > ncbi-blastx_test The genomic sequence contains a cluster of alpha globin genes, so you can expect to find quite a few matches to the globins database. Your report should look like this: Query= gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene cluster, complete sequence (35,793 letters) Database: globins 1203 sequences; 211,084 total letters Searching...done Score E Sequences producing significant alignments: (bits) Value Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN. 137 3e-46 HBA_THUTH P11748 Hemoglobin alpha chain. 119 3e-40 HBA2_NOTCO P16308 Hemoglobin alpha-2 chain. 130 5e-40 HBA2_TRENE P45719 Hemoglobin alpha-2 chain. 126 6e-40 Q98974 Q98974 ALPHA-GLOBIN IV. 119 7e-39 Q98SE6 Q98SE6 ALPHANCP2 (ALPHANCP1). 140 9e-39 HBA_ELEEL P14520 Hemoglobin alpha chain. 119 9e-39 10.3.1.6 tblastnRun tblastn in a typical application that searches an EST databases for protein similarities. blastall -p tblastn -d ESTs -i globin > ncbi-tblastn_test Here's the first alignment of the report in addition to the summary. We'll leave the question of biological and statistical significance as an exercise for you. Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. (145 letters) Database: ESTs 2000 sequences; 673,456 total letters Searching....done Score E Sequences producing significant alignments: (bits) Value AU109565 AU109565 Caenorhabditis elegans cDNA clone:yk708f8 : 5... 28 0.043 AU109149 AU109149 Caenorhabditis elegans cDNA clone:yk703g6 : 5... 27 0.096 AU109370 AU109370 Caenorhabditis elegans cDNA clone:yk705h2 : 5... 25 0.37 AU110425 AU110425 Caenorhabditis elegans cDNA clone:yk716a3 : 5... 24 0.62 AU109448 AU109448 Caenorhabditis elegans cDNA clone:yk706h11 : ... 20 6.9 AU109900 AU109900 Caenorhabditis elegans cDNA clone:yk712d7 : 5... 20 6.9 AU110608 AU110608 Caenorhabditis elegans cDNA clone:yk720c9 : 5... 20 9.0 AU109391 AU109391 Caenorhabditis elegans cDNA clone:yk706b11 : ... 20 9.0 >AU109565 AU109565 Caenorhabditis elegans cDNA clone:yk708f8 : 5' end, single read. Length = 327 Score = 27.7 bits (60), Expect = 0.043 Identities = 12/34 (35%), Positives = 20/34 (58%) Frame = -1 Query: 21 VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54 + + G + R+ V P +QRF S ++CSP+A Sbjct: 144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43 10.3.1.7 tblastxtblastx is commonly used as a gene-finding tool to identify potential coding regions between genomes. You can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first and then run the search: formatdb -i chicken_genomic -p F blastall -p tblastx -d chicken_genomic -i fugu_genomic> ncbi-blastx_test You should see the following error message while your search is running. This warns you that there are more than 200 significant alignments between the two sequences. You can't increase this number with a command-line switch, and the only workaround is to cut your query sequence into smaller pieces. This limitation applies only in ungapped searches (TBLASTX or any search with the setting -g F). [blastall] WARNING: [000.000] gi|18463974|gb|AY016024.1|: Reached max 200 HSPs in BlastSaveCurrentHsp, continuing with this limit Since there's only one sequence in the database, look at the first alignment instead of the summary: >gi|17104478|gb|AY016020.1| Gallus gallus alpha globin gene cluster, complete sequence Length = 103190 Score = 141 bits (302), Expect = 1e-34 Identities = 61/73 (83%), Positives = 66/73 (89%) Frame = -2 / -3 Query: 25076 ASSDDMTLTSPSMDNSSAELLPGGDSPLNKRITETLLASLSEHERQVILSVPAAQNPEDL 24897 ASSDDMTLTSPSMDNSSAEL+PGGDSPLNKR+TE LLASL EHER+ IL+VPAAQNPEDL Sbjct: 5289 ASSDDMTLTSPSMDNSSAELIPGGDSPLNKRMTENLLASLLEHEREAILNVPAAQNPEDL 5110 Query: 24896 RMFAR*NHLSTKC 24858 RMFAR* S +C Sbjct: 5109 RMFAR*EIGSAEC 5071 Note the stop codon (*) near the end of the alignment. The scoring matrices distributed with BLAST set the score of all stop codon matches to +1. If you want to terminate alignments at stop codons, you have to edit the scoring matrix. This procedure is described at the end of this chapter. 10.3.1.8 bl2seqThe previous tblastx test used only two sequences. Whenever you want to align just two sequences, you can use the bl2seq program to save the effort of having to format one sequence as a database and remove the database files later. Here's the command: bl2seq -p tblastx -j chicken_genomic -i fugu_genomic You won't see a summary, but the rest of the report will be nearly identical to the one in the previous section. 10.3.1.9 fastacmdThe fastacmd program can retrieve one or more FASTA-formatted sequences from a BLAST database. Try this by retrieving a single sequence from the ESTs database: fastacmd -d ESTs -s AU108953 -l 60 The -d ESTs designates the database from which you retrieve the sequence. The database must have been formatted with the -o option of formatdb for fastacmd to work. The -s AU108953 is the string for which to search the database. The -l 60 specifies the sequence line lengths; the default line length is 80. The output from this test should be the following: >lcl|AU108953 AU108953 Caenorhabditis elegans cDNA clone:yk701a6 : 5' end, sing le read TGGCCTACTGGANAAAACAACAATGCGTGCTTTACTATTCACCTCTGTTGTTCTTTTGGC TTTGGCTTTTGTTGAGGCAAAGAAGCAGACTATCACTGTCAAGGGTACAACTATTTGTAA TAAGAAGAGAATTCAGGCGGAGGTTACCCTTTGGGAGAAAGATACTCTCGACCCCGATGA CAAGCTCGCCTCAATGCAATCGAACAAAGAAGGAGAGTTCTCACTTACCGGATCCGACGA CGAGATCACCTCAATCTCTCCATACCTCATAATCACCCACAACAGCAACGTGAAGAAGGC CGGATGAAGCCGTGTTTCAGAG The beginning of the definition has a prepended lcl|. The lcl| isn't in the original FASTA definition line and doesn't show up in BLAST reports; it shows up here because the database identifier wasn't in NCBI format (see Chapter 11). fastacmd has several other useful features. It can report a summary of a BLAST database, dump a BLAST database to FASTA format, report a subsequence of a particular sequence, and even display taxonomic information if the databases are downloaded from ftp://ftp.ncbi.nih.gov/blast/db/. See Chapter 13 for more information on fastacmd. 10.3.1.10 PSI-BLASTPSI-BLAST identifies weak amino acid similarities and is executed using the blastpgp program. Test it by iteratively searching the Drosophila melanogaster p53 protein against a database of p53 homologs. First, format the p53 database, just as you would for any protein database: formatdb -i p53DB -p T -l p53DB.logfile -o Now you can search with your p53 protein{ blastpgp -d p53DB -i p53 -j 5 > ncbi-psiblast_test You'll get a report with four iterations. All four rounds of automated searching are concatenated into one report. The following represents the summary section for these four rounds. In round 1, there are 11 matches out of 12 total database sequences. Notice that there is an insignificant hit to a hypothetical protein at the bottom of the report with a score of 20 and an E-value near 1. Query= gi|7211767|gb|AAF40427.1|AF224713_1 transcription factor p53 [Drosophila melanogaster] (385 letters) Database: p53DB 12 sequences; 3745 total letters Results from round 1 Score E Sequences producing significant alignments: (bits) Value gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST] 98 3e-024 gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi... 75 2e-017 gb|AAA98563.1| p53 tumor suppressor homolog 70 9e-016 gb|AAC24830.1| p53 homolog [Homo sapiens] 63 1e-013 gb|AAC37335.1| p53 [Canis familiaris] 59 1e-012 gb|AAC05704.1| tumor suppressor p53 [Mus musculus] 58 4e-012 gb|AAC60746.1| p53 [Xenopus laevis] 57 9e-012 gb|AAD12203.1| tumor suppressor protein [Canis familiaris] 43 1e-007 gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss] 42 2e-007 gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ... 35 4e-005 gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana] 20 0.98 In round 2, all 12 sequences from the p53DB have hits better than the E-value threshold of 10. All scores are higher and an annotated p53 peptide fragment from mouse (gb|AAC71764.1|) is now found with a significant score. Results from round 2 Score E Sequences producing significant alignments: (bits) Value Sequences used in model and found again: gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi... 293 4e-083 gb|AAC24830.1| p53 homolog [Homo sapiens] 291 3e-082 gb|AAC05704.1| tumor suppressor p53 [Mus musculus] 279 7e-079 gb|AAC37335.1| p53 [Canis familiaris] 279 9e-079 gb|AAA98563.1| p53 tumor suppressor homolog 278 1e-078 gb|AAC60746.1| p53 [Xenopus laevis] 270 5e-076 gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST] 251 3e-070 gb|AAD12203.1| tumor suppressor protein [Canis familiaris] 227 4e-063 gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ... 221 2e-061 gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss] 127 4e-033 Sequences not found previously or not previously below threshold: gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus] 40 8e-007 gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana] 25 0.028 In round 3, the scores of all hits are even better as the position-specific scoring matrix is increasingly tuned to the p53 profile. Results from round 3 Score E Sequences producing significant alignments: (bits) Value Sequences used in model and found again: gb|AAC60746.1| p53 [Xenopus laevis] 323 6e-092 gb|AAC05704.1| tumor suppressor p53 [Mus musculus] 309 7e-088 gb|AAA98563.1| p53 tumor suppressor homolog 306 5e-087 gb|AAC24830.1| p53 homolog [Homo sapiens] 304 2e-086 gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi... 295 1e-083 gb|AAC37335.1| p53 [Canis familiaris] 285 2e-080 gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST] 271 2e-076 gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ... 241 2e-067 gb|AAD12203.1| tumor suppressor protein [Canis familiaris] 231 3e-064 gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss] 124 3e-032 gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus] 43 1e-007 Sequences not found previously or not previously below threshold: gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana] 32 3e-004 In round 4, no new sequences are found below the level of significance (0.005). Therefore the search has converged, and no new iterations are performed. Results from round 4 Score E Sequences producing significant alignments: (bits) Value Sequences used in model and found again: gb|AAC60746.1| p53 [Xenopus laevis] 333 3e-095 gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi... 319 6e-091 gb|AAC24830.1| p53 homolog [Homo sapiens] 319 6e-091 gb|AAC05704.1| tumor suppressor p53 [Mus musculus] 313 5e-089 gb|AAA98563.1| p53 tumor suppressor homolog 311 3e-088 gb|AAC37335.1| p53 [Canis familiaris] 289 8e-082 gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST] 272 6e-077 gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ... 241 2e-067 gb|AAD12203.1| tumor suppressor protein [Canis familiaris] 230 3e-064 gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss] 125 2e-032 gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana] 70 1e-015 gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus] 44 8e-008 Sequences not found previously or not previously below threshold: CONVERGED! 10.3.1.11 PHI-BLASTPHI-BLAST seeds extensions from important regions of a protein—for example, an enzyme active site or a conserved domain. It is also executed with the blastpgp program. To specify a PHI-BLAST search, a hit_file must be available (-k) and the program (-p) must be specified as either patseedp or seedp, with seedp specifying a special hit_file. You'll run the program using a conserved part of the Drosophila melanogaster p53 protein as a pattern seed to search the p53DB. The pattern is designated in hit_file.p53: ID p53 Pattern PA [YF]-[ST]-X-X-L-N-K-L-[YF] The following command uses this pattern as a seed in a search of the p53 database: blastpgp -d p53DB -i p53 -k hit_file.p53 -p patseedp > ncbi-phiblast_test In the results, the position of the pattern in the query and the probability of finding the pattern are shown, followed by the summary lines and alignments. In the first alignment, asterisks show the position of the pattern in the query and subject. 1 occurrence(s) of pattern in query p53 Pattern pattern [YF]-[ST]-X-X-L-N-K-L-[YF] at position 107 of query sequence effective database length=3.6e+003 pattern probability=1.4e-008 lengthXprobability=5.0e-005 Number of occurrences of pattern in the database is 4 WARNING: There may be more matching sequences with e-values below the threshold of 10.000000 Score E (bits) Value Significant matches for pattern occurrence 1 at position 107 gb|AAC37335.1| p53 [Canis familiaris] 47 2e-014 gb|AAC60746.1| p53 [Xenopus laevis] 41 2e-012 gb|AAC05704.1| tumor suppressor p53 [Mus musculus] 38 2e-011 gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ... 16 7e-005 Significant alignments for pattern occurrence 1 at position 107 >gb|AAC37335.1| p53 [Canis familiaris] Length = 281 Score = 47.2 bits (132), Expect = 2e-014 Identities = 56/220 (25%), Positives = 94/220 (42%), Gaps = 27/220 (12%) Query: 105 WMYSIPLNKLYIRMNKAFNVDVQFKSKMPIQPLNLRVFLCF--SNDVSAPVVRCQNHLSV 162 pattern 107 ********* W YS LNKL+ ++ K V + S P +R + S V+ V RC +H Sbjct: 16 WTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTC-VRAMAIYKKSEFVTEVVRRCPHHERC 74 10.3.1.12 Environment variables and .ncbircIf everything works, make sure the environment variables and data directory function correctly. Now, move to any random directory on your system and attempt one of the previous searches, but substitute the correct path to the query file. If the executable path is incorrect, you will have an error message such as: blastall: command not found Databases that can't be found produce error messages like the following: [blastall] WARNING: <query> Could not find index files for database <db> If the data directory can't be found, blastall reports that it is unable to open the scoring matrix: Searching[blastall] WARNING: <query> [000.000] Unable to open <matrix> 10.3.2 WU-BLASTAs in the previous section, you'll run the various components of the WU-BLAST software in typical sequence analysis settings. 10.3.2.1 xdformatStart by formatting the ESTs database with the following command. The -n indicates that this is a nucleotide database. This command creates three files: ESTs.xnd, ESTs.xns, and ESTs.xnt. xdformat -n ESTs xdformat also prints some interesting statistics to stdout. You can always retrieve this information later with xdformat -n -r ESTs. XDFORMAT-WashU 1.0 [02-Apr-2002] [macosx-ppc-ILP32F64 2002-04-02T01:26:43] Start: 2002-09-29T12:10:05 Input: "ESTs" XDF Output Database: ESTs Alphabet: NCBI2na(1) No. of sequences (letters) written: 2000 (673,456) Longest sequence written (in database): 933 (933) Edit Alphabet: WUStLna(1) Sequences edited for ambiguity codes: 961 No. of edits (total length) written: 1901 (2102) Index entries written (in database): 2000 (2000) Total cpu time: 0.12u 0.00s 0.12t Elapsed: 00:00:00 End: 2002-09-29T12:10:05 If you use the free version of WU-BLAST, use pressdb instead of xdformat. The output is quite brief (so it's not shown here). The files are named differently, too: ESTs.csq, ESTs.nhd, and ESTs.ntb. pressdb ESTs Format the globins database with the following command. The -p indicates that this is a file or protein sequences. This command also creates three files: globins.xpd, globins.xps, and globins.xpt. xdformat -p globins If you use the free version of WU-BLAST, use setdb and name the files globins.ahd, globins.atd, and globins.bsq: setdb globins 10.3.2.2 blastnLet's query the ESTs database with a single EST from that database and see if you can find it: blastn ESTs EST > wu-blastn_test The output of your search contains header information and a summary that should look like this: Query= AU108953 AU108953 Caenorhabditis elegans cDNA clone:yk701a6 : 5' end, single read. (322 letters) Database: ESTs 2000 sequences; 673,456 total letters. Searching....10....20....30....40....50....60....70....80....90....100% done Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N AU108953 AU108953 Caenorhabditis elegans cDNA clone:yk70... 1603 3.3e-69 1 AU109042 AU109042 Caenorhabditis elegans cDNA clone:yk70... 1507 7.1e-65 1 AU109359 AU109359 Caenorhabditis elegans cDNA clone:yk70... 1203 3.8e-51 1 AU110478 AU110478 Caenorhabditis elegans cDNA clone:yk71... 1052 3.4e-44 1 AU109925 AU109925 Caenorhabditis elegans cDNA clone:yk71... 310 7.9e-11 1 AU110873 AU110873 Caenorhabditis elegans cDNA clone:yk72... 121 0.998 1 The query sequence, which was in a file named EST, has the accession number AU108953, which is the highest scoring match in the report summary. There were several other hits, and you may want to browse the rest of the file for fun. If you wish to benchmark your system for comparisons to other systems, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. We use this same benchmark in Chapter 12 for NCBI-BLAST; read that chapter to understand why this is or isn't a good benchmark for your system. Don't compare this WU-BLAST benchmark to that produced by NCBI-BLAST because the default parameters produce different results. A proper cross-comparison requires that the results be the same. Here is the command line for the benchmark (note the addition of the -warnings flag to suppress warning messages). time blastn ESTs ESTs -warnings > /dev/null 10.3.2.3 blastpYou can run blastp as you did blastn, by searching a single protein against a database of proteins: blastp globins globin > wu-blastp_test The output of this search is quite a bit longer than the previous EST example. Only part of the summary is shown here: Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. (145 letters) Database: globins 1203 sequences; 211,084 total letters. Searching....10....20....30....40....50....60....70....80....90....100% done Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. 749 1.0e-76 1 HBG_ATEGE P06891 Hemoglobin gamma chain. 710 1.4e-72 1 HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain. 707 2.9e-72 1 HBG_ALOBE P56284 Hemoglobin gamma chain. 705 4.7e-72 1 Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN). 703 7.7e-72 1 Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN. 701 1.3e-71 1 You can benchmark blastp in the same way as blastn, however, it takes about twice as long to complete as the blastn benchmark. time blastp globins globins -warnings > /dev/null 10.3.2.4 blastxIt is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database: blastx globins fugu_genomic filter=seg > wu-blastx_test While the search is running, you'll see the following warning message that indicates that some of the default parameters were overrun: WARNING: Descriptions of 333 database sequences were not reported due to the limiting value of parameter V = 500. WARNING: HSPs involving 583 database sequences were not reported due to the limiting value of parameter B = 250. The top of the report should look like this: Query= gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene cluster, complete sequence (35,793 letters) Translating both strands of query sequence in all 6 reading frames Database: globins 1203 sequences; 211,084 total letters. Searching....10....20....30....40....50....60....70....80....90....100% done Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN. 277 7.3e-44 3 HBAA_SERQU Q9PVM4 Hemoglobin alpha-A chain. 275 2.6e-40 3 O13136 O13136 ALPHA-GLOBIN. 278 8.6e-40 3 Q90487 Q90487 AA1 ALPHA GLOBIN. 279 1.4e-39 3 HBA_CYPCA P02016 Hemoglobin alpha chain. 278 1.8e-39 3 10.3.2.5 tblastnIt is common to use tblastn to search EST databases for protein similarities. Here's the command: tblastn ESTs globin filter=seg > wu-tblastn_test The top of the report and the first alignment are shown here. You might consider the significance of the alignment as an experiment. Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN. (145 letters) Database: ESTs 2000 sequences; 673,456 total letters. Searching....10....20....30....40....50....60....70....80....90....100% done Smallest Sum Reading High Probability Sequences producing High-scoring Segment Pairs: Frame Score P(N) N AU109565 AU109565 Caenorhabditis elegans cDNA clone:y... -1 60 0.42 1 AU109149 AU109149 Caenorhabditis elegans cDNA clone:y... -1 57 0.75 1 AU109885 AU109885 Caenorhabditis elegans cDNA clone:y... +1 53 0.996 1 >AU109565 AU109565 Caenorhabditis elegans cDNA clone:yk708f8 : 5' end, single read. Length = 327 Minus Strand HSPs: Score = 60 (26.2 bits), Expect = 0.55, P = 0.42 Identities = 12/34 (35%), Positives = 20/34 (58%), Frame = -1 Query: 21 VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54 + + G + R+ V P +QRF S ++CSP+A Sbjct: 144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43 10.3.2.6 tblastxIt is common to use tblastx as a gene-finding tool to identify potential coding regions between genomes, and you can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first. If you're using the free version, substitute pressdb for xdformat and don't include the -n parameter. xdformat -n chicken_genomic tblastx chicken_genomic fugu_genomic filter=seg > wu-blastx_test The following warning is reported during the search, indicating that the number of alignments has passed a default threshold. This can be avoided with hspmax=0. WARNING: hspmax=1000 was exceeded with 1 of the database sequences. The first alignment is shown here: >gb|AY016020.1| Gallus gallus alpha globin gene cluster, complete sequence Length = 103,190 Plus Strand HSPs: Score = 183 (69.5 bits), Expect = 9.2e-67, Sum P(14) = 9.2e-67 Identities = 40/53 (75%), Positives = 44/53 (83%), Frame = +2 / +1 Query: 24479 DWKIEITGSS*LVTTRTLRNLSNSVLRCERLMFSLYMISSRWWCPRK**SSLK 24637 D K E+TGSS LVTT TLRNLSNS+ CER MFSLYMISS+W PRK**SSL+ Sbjct: 496 D*KTEMTGSSWLVTTSTLRNLSNSISSCERRMFSLYMISSKWCRPRK**SSLR 654 Note that the alignment contains stop codons (*). If you wish to terminate alignments at stop codons, you will have to increase the gap penalties and align without gaps. You will learn more about this topic at the end of this chapter. 10.3.2.7 xdgetThe xdget program retrieves sequences from a BLAST database. You can use this only if you have the licensed software, and you must index the databases with xdformat before you can use xdget. The ESTs database can be indexed by the following command: xdformat -n -X ESTs You'll see some diagnostic output, and there will be an additional file called ESTs.xni. You can index at the time of formatting if you include the -I flag. The globins database is indexed similarly: xdformat -p -X globins You'll also see some diagnostic output that includes 33 lines similar to the following: *Duplicate ID: Q17154 This indicates that you have duplicate database identifiers in the FASTA database. The duplicates are intentional, and you can remove them with the nrdb and patdb tools described later. Now retrieve some sequences: xdget -n ESTs AU109017 You should see the following output: >AU109017 AU109017 Caenorhabditis elegans cDNA clone:yk701g7 : 5' end, single read. TGGCCTACTGGGGTTTAATTACCCAAGTTTGAGATGGCTGCTGCTTCAGTGAAAGGCTTT TTCCAGCGGACCGGAATCAGCATCAAAGAATATTTTAAACGAATGGGAAATGATTATGCT ACTGTAGCTAGGGAAACTGTCCAAGGATGTAAAGATAGACCTGTTAAAGCTGGAGTTGTT TTCTCTGGGCTCGGTTTTTTAACCTATGCATATCAGACAAATCCAACAGAGCTGGAAATG TATGATTATTTATGCGAGAGACGACAAAAGTTAGTTTTGGTCCCGAATTCTGAGCATAAT CCGGCTACAACTAAAGAATTAACTGCTCGCGA And for proteins, you can rely on a similar action: xdget -p globins HBP_CANLI You should see the following output: >HBP_CANLI P42511 Leghemoglobin. MGAFSEKQESLVKSSWEAFKQNVPHHSAVFYTLILEKAPAAQNMFSFLSNGVDPNNPKLK AHAEKVFKMTVDSAVQLRAKGEVVLADPTLGSVHVQKGVLDPHFLVVKEALLKTFKEAVG DKWNDELGNAWEVAYDELAAAIKKAMGSA 10.3.2.8 nrdb and patdbThe nrdb and patdb programs are useful for removing the redundant sequences you saw earlier. Use nrdb first: nrdb globins > globins_nr The additional output from the program is as follows: --------- Records --------- -------------- Residues ----------- Database Read Duplicate Written Read Duplicate Written globins 1203 39 1164 211,084 37,819 173,265 Totals: 1203 39 1164 211,084 37,819 173,265 No. of base word hits: 53 (53 total) No. of 32-bit hash hits: 39 Total memory allocated: 0.500 MB Longest comment line read: 184 Longest comment line written: 423 Longest sequence read: 1431 The report shows 39 duplicate records. When indexing the globins database, you'll find that there were 33 duplicate identifiers. However, nrdb looks for duplicate sequences not identifiers. When nrdb finds duplicate sequences, it concatenates the identifiers, separating them with a Control-A character. This is normally a whitespace character in a terminal window but is visible in some text editors and pagers. patdb is even more aggressive than nrdb for removing redundancies; it removes identical substrings so the sequences MVLQ and MVLQKP are merged into the same entry. 10.3.2.9 Environment variablesNow it's time to make sure the environment variables function correctly. To start, move to any random directory on your system and attempt one of the previous searches, substituting the correct path to the query file: If the executable path is incorrect, you will see an error message such as: blastp: command not found If the databases can't be found, WU-BLAST reports: FATAL: Could not open the database: <database name> WU-BLAST offers informative messages when it fails to find scoring matrices and complexity filters: FATAL: "No such file or directory" error encountered (errno=2), when attempting to FIND the requested sequence filter program "nseg". Please check the setting of the BLASTFILTER or WUBLASTFILTER environment variables and the permissions on the program. EXIT CODE 32 FATAL: Could not find or open a substitution matrix file named: BLOSUM62. Check file access permissions or the setting of the WUBLASTMAT (BLASTMAT) environment variable. EXIT CODE 8 For a complete list of errors and messages, see Chapter 14. |
[ Team LiB ] |