11.4 Sequence Database Management Strategies

There are many useful public sequence databases, and you may have access to some private ones as well. Because this is a book about BLAST, we assume you want to use these collections of sequences in BLAST searches. Some sequences may be used as queries, and others in databases. How are you going to manage them all in a rational way? Several possible strategies exist, and the right one for you depends on your needs and resources.

To demonstrate some of the issues, let's review a typical sequence analysis scenario. Suppose a colleague of yours has just found the gene that makes cats go crazy for catnip. She wants to learn more about this gene and comes to you for help because you are a BLAST expert. The first thing she wants to do is a BLAST search to find out which vertebrate proteins are similar to the one it encodes. Where are you going to get such a database of proteins? Once you perform the BLAST search, you find several interesting similarities. Your colleague tells you that these are probably all part of a family of proteins, and she would like to build a phylogenetic tree to determine their relationships to one another. How are you going to get the individual sequences? Finally, she decides she wants more information about the human sequences, and to do that, she would like references to the scientific literature like the ones she would find in a DDBJ/EMBL/GenBank report. How are you going to retrieve such information?

You could just refuse to help her because these aren't really BLAST problems, but these are the kinds of tasks many BLAST users must face. Let's take a look at how they can be solved. Each question in this example has basically two solutions: use tools available on the Internet, or build the tools yourself. In general, it is much easier to use the Internet, but for high-speed or high-throughput operations you'll want a local solution.
After you read this chapter, you may decide that you want some services to be provided locally, while others are Internet-only operations. This section begins with a brief review of databases.

11.4.1 Queries, Indexes, and Reports

The most common database operation is a query. One person may want to retrieve a particular sequence. Another may want all human sequences. As you have seen, sequence records have quite a bit of useful information, and a user may request nonsequence information such as all the MEDLINE references for all sequences with the word "disease" in the description.

The efficiency with which a query is executed depends a lot on how the database is indexed. If there is no indexing, a query must operate on every record of the database. So, for example, if you want to find all the coelacanth sequences, you would have to look through millions of records to find the handful whose sequences originate from the coelacanth. Clearly, this isn't going to be efficient, so databases usually have indexes that, for example, keep lists of species and all the sequences for each species.

The most straightforward kind of indexing occurs when there is a unique relationship between a property and a sequence. This is called a one-to-one mapping, and an example would be an accession number. A more complex indexing occurs when a property points to many sequences. This is called a one-to-many mapping, and an example is a species name that is shared by millions of records.

Once a query is executed, the data must be reported in some format. For sequences, this is usually the FASTA format. For other kinds of data, there are other appropriate formats, such as lists, tables, and graphs.

11.4.2 Local Database Considerations

Having a local sequence database has some real advantages. First, local databases are faster and more reliable because they don't rely on an Internet connection. If you're involved in high-throughput research, these reasons are sufficient.
Another compelling reason is that you can combine several databases, and even include your own sequences that aren't in the public databases. The downside to creating a local sequence database is the amount of work it takes. Depending on the scale of the operation, it can be a full-time job. Here are six important issues to address when building a local sequence database:
As you can see, building a local database isn't trivial. But it doesn't have to be a full-time job if you only want a subset of the information. For example, if all you want is to retrieve records by accession number, you don't need to invest more than a couple of hours of work. The following section explores the common techniques for managing sequence data.

11.4.3 Retrieving FASTA Files by Accession

The task of retrieving FASTA files by accession number is so common and has such an easy solution that it should be a local resource. If you're using NCBI-BLAST, the fastacmd program retrieves sequences from BLAST databases singly or in batches. If you're using WU-BLAST, the xdget program does the same thing. To use these features, you must index the databases when you format them, which is as simple as including the -o option (formatdb) or the -I option (xdformat). See the command-line tutorial in Chapter 10, the reference sections for formatdb and fastacmd in Chapter 13, and xdformat and xdget in Chapter 14.

One limitation of this approach is that the sequences are stored in a case-insensitive format in the database. If you use lowercase to denote regions containing repeats, for example, that information will be lost. If this is a serious problem for you, use one of the flat-file indexing schemes described later.

NCBI-BLAST users take note that unless you use the NCBI FASTA definition line format discussed earlier in this chapter, your definition lines may not look exactly the same when they come out of the database. For example, if you have a definition line such as this:

    >FOO

When you retrieve it with fastacmd, it looks like:

    >lcl|FOO no definition found

You can easily avoid such inconsistencies by using the recommended identifier format and by including descriptions on the definition line.

WU-BLAST users take note: xdget doesn't support virtual databases.
You can work around this limitation with a simple script, such as this one:

    #!/usr/bin/perl -w
    use strict;

    # Find the first argument that contains whitespace; it is taken to be
    # a virtual database, that is, a quoted list of database names.
    my (@DB, $i);
    for ($i = 0; $i < @ARGV; $i++) {
        if ($ARGV[$i] =~ /\s/) {
            @DB = split(/\s+/, $ARGV[$i]);
            last;
        }
    }

    # No virtual database on the command line: hand everything to xdget.
    exec("xdget @ARGV") unless @DB;

    # Otherwise run xdget once per database, keeping the arguments that
    # came before and after the database list.
    my @pre  = splice(@ARGV, 0, $i);
    my @post = splice(@ARGV, 1);
    foreach my $db (@DB) {
        system("xdget @pre $db @post");
    }

11.4.4 Flat File Indexing

One of the most common procedures used to manage sequence data is called flat file indexing. In this approach, you keep concatenated sequence reports in their native format and store the starting position of each record in a separate file. One advantage of this approach is that you don't have to do any work when you want to reproduce the data in flat file format. Another reason why flat file indexing is so common is that it is simple to implement, at least for one-to-one mappings. To illustrate the process, we'll show you how to index identifiers in FASTA files. Here is an example of a very short FASTA file:

    >FOO
    GAATTC
    >BAR
    ATAGCGAAT

This file has two records with identifiers FOO and BAR, and they begin at bytes 0 and 12, respectively (count the letters and don't forget to add one for the end of line; in Windows, the end of line is actually two characters, which changes the positions to 0 and 14). You can now create an index file that tells where each record begins in the file:

    BAR 12
    FOO 0

To use this index file, simply find the identifier of interest in the index and seek to the appropriate position in the FASTA file. Note that the index file is sorted alphabetically by identifier. This makes it much more efficient to find the record because you can use a binary search to find the identifier. If you have an index file containing 1 million records, a linear search looks through 500,000 records on average, but a binary search looks at only about 20.

You can make a couple of improvements to this simplistic indexing scheme.
The first is to allow the index file to support more than one FASTA file. This is a trivial modification because you can just add a filename to your index file:

    BAR file-A 12
    FOO file-A 0
    XYZ file-B 0

Another easy improvement is to use a persistent indexed data structure such as a Perl tied hash. The Bioperl project uses this strategy in its Bio::Index classes.

A slightly more complicated approach manages the indexes with one of the many free or commercial database applications, such as MySQL, PostgreSQL, FileMaker, Microsoft Access, or whatever you happen to be familiar with. If you're going to do this, you might as well store a bit more data. For illustrative purposes, imagine you create a schema like that in Table 11-3. In addition to the accession number, file, and offset, this schema provides for a species and a molecule type (moltype). The schema does not store the actual sequence because some applications can't handle data this large. If you wish to store sequences as well, test the performance of the system with realistic data to see if the system scales well.
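The plain-text index described earlier in this section (identifier, filename, byte offset) can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not the method used by any particular package: the subroutine names build_index, lookup, and fetch_record are hypothetical, the two demo files reuse the file-A and file-B names from the example index, and the XYZ sequence is invented. The index is held in memory as a sorted array rather than written to disk or stored in a tied hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical demo data: two tiny FASTA files in a scratch directory.
my $dir  = tempdir(CLEANUP => 1);
my %demo = (
    "$dir/file-A" => ">FOO\nGAATTC\n>BAR\nATAGCGAAT\n",
    "$dir/file-B" => ">XYZ\nCCGG\n",
);
for my $file (sort keys %demo) {
    open(my $out, '>:raw', $file) or die "Cannot write $file: $!";
    print $out $demo{$file};
    close($out);
}

# Record [identifier, file, byte offset] for every definition line,
# then sort by identifier so the index can be binary-searched.
sub build_index {
    my @files = @_;
    my @index;
    for my $file (@files) {
        open(my $in, '<:raw', $file) or die "Cannot read $file: $!";
        my $offset = tell($in);
        while (my $line = <$in>) {
            push @index, [$1, $file, $offset] if $line =~ /^>(\S+)/;
            $offset = tell($in);    # start of the next line
        }
        close($in);
    }
    my @sorted = sort { $a->[0] cmp $b->[0] } @index;
    return @sorted;
}

# Binary search over the sorted index; returns the entry or undef.
sub lookup {
    my ($index, $id) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $id cmp $index->[$mid][0];
        return $index->[$mid] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

# Seek to the stored offset and read up to the next definition line.
sub fetch_record {
    my ($entry) = @_;
    my (undef, $file, $offset) = @$entry;
    open(my $in, '<:raw', $file) or die "Cannot read $file: $!";
    seek($in, $offset, 0) or die "Cannot seek in $file: $!";
    my $record = <$in>;             # the definition line itself
    while (defined(my $line = <$in>)) {
        last if $line =~ /^>/;      # the next record starts here
        $record .= $line;
    }
    close($in);
    return $record;
}

my @index = build_index(sort keys %demo);
printf "%s\t%s\t%d\n", @$_ for @index;      # identifier, file, offset
my $hit = lookup(\@index, 'BAR') or die "BAR is not in the index";
print fetch_record($hit);
```

With the demo data, BAR is indexed at offset 12 of file-A, matching the hand count given earlier. The files are opened in raw mode so that offsets are counted in bytes on every platform. A database-backed index replaces the sorted array and the binary search with a query, but the final step of seeking into the flat file and reading the record stays the same.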
Using such a database, you can perform not only simple accession number retrievals, but also one-to-many queries such as all human sequences or all DNA sequences. All you have to do is query the database and seek to the appropriate place in the appropriate file for every record. Organizing the data this way has a number of advantages over just downloading DDBJ/EMBL/GenBank by division. For example, if you want to make a database of all human transcripts, you need to identify the human ESTs from the EST division, as well as all the mRNAs from the PRI (primate) division. But if you've designated all ESTs and mRNAs as the cDNA moltype, getting all human transcripts is as easy as retrieving all records in which the species is Homo sapiens and the moltype is cDNA. You can add several more fields to the database, such as date created, division, and keywords, and get quite a bit of functionality without much more complexity.

Overall, flat file indexing is a very good strategy for sequence management because it is simple, fast, and retains the data in its original format. You don't even have to write any software, as both free and commercial software packages are designed specifically for managing flat file data. Check out the Bioperl project at http://bioperl.org, MyGenBank at http://sourceforge.net/projects/mgb, and SRS (see Table 11-4).

11.4.5 Commercial Sequence Management Software

Several commercial software packages are designed for managing biological sequence data. The database software is generally part of a much larger software suite that includes sequence analysis tools such as BLAST and visualization tools to make interpretation easier. The companies that develop these packages expend a great deal of effort to make the various sequence analysis tasks interoperable and user friendly. Table 11-4 gives a brief description of the software.
As you can see from the descriptions of the personnel and hardware requirements, using these comprehensive sequence analysis systems requires a serious commitment. For this reason, these packages aren't recommended for small research groups. For larger groups, though, these products can save a lot of time and money. It's easy to underestimate the effort required to develop your own sequence management system, so be cautious before embarking on such a task, and give the professionals a chance to show you their wares.

11.4.6 Tools on the Internet

There are good reasons to use web-based tools for sequence management rather than building a local database. First, you don't have to download more data than you need. Mirroring the entire public database isn't efficient if you need only a slice of it. Second, database providers take care of the most time-consuming and expensive tasks, namely processing, storing, and indexing the data. Third, the databases are self-updating, which means that you can always get the latest and most accurate information. Best of all, the service is completely free. Well, maybe not completely free since the databases are supported by taxes, but let's all thank the various governments and funding agencies for putting our hard-earned money toward a worthy cause, and let's especially recognize all the people who actually make it happen.

The downside to using web-based tools is that you have to spend time learning how to query the database efficiently and accurately, but that's going to be true of any sequence management system, even your own. A more serious issue is that you will depend on the computers and network between you and the database provider, but this will improve over time. Still, even if you have to put up with a few glitches here and there, the total cost in time and money is probably lower than that of building your own local mirror.