
11.4 Sequence Database Management Strategies

There are many useful public sequence databases, and you may have access to some private ones as well. Because this is a book about BLAST, we assume you want to use these collections of sequences in BLAST searches. Some sequences may be used as queries, and others in databases. How are you going to manage them all in a rational way? Several possible strategies exist, and the correct one for you depends on your needs and resources. To demonstrate some of the issues, let's review a typical sequence analysis scenario.

Suppose a colleague of yours has just found the gene that makes cats go crazy for catnip. She wants to learn more about this gene and comes to you for help because you are a BLAST expert. The first thing she wants to do is a BLAST search to find out what vertebrate proteins are similar to this one. Where are you going to get such a database of proteins? Once you perform the BLAST search, you find several interesting similarities. Your colleague tells you that these are probably all part of a family of proteins, and she would like to build a phylogenetic tree to determine their relationships to one another. How are you going to get the individual sequences? Finally, she decides she wants more information about the human sequences, and to do that, she would like references to the scientific literature like the ones she would find in a DDBJ/EMBL/GenBank report. How are you going to retrieve such information? You could just refuse to help her because these aren't really BLAST problems, but these are the kinds of tasks many BLAST users must face. Let's take a look at how they can be solved.

Each of these questions has basically two solutions: use tools available on the Internet, or build the tools yourself. In general, it is much easier to use the Internet, but for high-speed or high-throughput operations you'll want a local solution. After you read this chapter, you may decide that you want some services to be provided locally, while others are Internet-only operations. This section begins with a brief review of databases.

11.4.1 Queries, Indexes, and Reports

The most common database operation is a query. One person may want to retrieve a particular sequence. Another may want all human sequences. As you have seen, sequence records have quite a bit of useful information, and a user may request nonsequence information such as all the MEDLINE references for all sequences with the word disease in the description.

The efficiency with which a query is executed depends a lot on how the database is indexed. If there is no indexing, a query must operate on every record of the database. So, for example, if you want to find all the coelacanth sequences, you would have to look through millions of records to find the handful whose sequences originate from the coelacanth. Clearly, this isn't going to be efficient, so databases usually have indexes that, for example, keep lists of species and all the sequences for each species.

The most straightforward kind of indexing occurs when there is a unique relationship between a property and a sequence. This is called a one-to-one mapping, and an example would be an accession number. A more complex indexing occurs when a property points to many sequences. This is called a one-to-many mapping, and an example is a species name that is shared by millions of records.
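The two kinds of mapping can be sketched with ordinary in-memory dictionaries. This is only a minimal illustration in Python, not a description of how any particular sequence database is built. The names `records`, `by_accession`, and `by_species` are invented for the example, and every accession except AF287139 is made up.

```python
from collections import defaultdict

# Hypothetical records: (accession, species). Only AF287139 appears in the
# text; the other accessions are made up for illustration.
records = [
    ("AF287139", "Latimeria chalumnae"),
    ("HYPO0001", "Homo sapiens"),
    ("HYPO0002", "Homo sapiens"),
    ("HYPO0003", "Felis catus"),
]

# One-to-one: each accession identifies exactly one record.
by_accession = {acc: (acc, species) for acc, species in records}

# One-to-many: each species name points to a list of accessions.
by_species = defaultdict(list)
for acc, species in records:
    by_species[species].append(acc)

# Without an index, every record must be examined (a full scan).
scan_hits = [acc for acc, species in records if species == "Latimeria chalumnae"]

# With the index, the same answer is a single dictionary lookup.
index_hits = by_species["Latimeria chalumnae"]
```

The scan and the index lookup return the same accessions. The difference is that the scan touches every record, while the lookup cost doesn't grow with the size of the collection.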

Once a query is executed, the data must be reported in some format. For sequences, this is usually the FASTA format. For other kinds of data, there are other appropriate formats, such as lists, tables, and graphs.

11.4.2 Local Database Considerations

Having a local sequence database has some real advantages. First, local databases are faster and more reliable because they don't rely on an Internet connection. If you're involved in high-throughput research, these reasons are sufficient. Another compelling reason is that you can combine several databases, and even include your own sequences that aren't in the public databases. The downside to creating a local sequence database is the amount of work it takes. Depending on the scale of the operation, it can be a full-time job. Here are six important issues to address when building a local sequence database:

Downloading: Each database you support must be downloaded from time to time to keep the data current. For example, GenBank has five to six major releases each year, as well as daily updates. Other databases have their own update schedule. Managing updates can be a chore if you download a lot of databases, so automating the procedure is a good idea. In addition, you may want to take measures to ensure that during updates, which can take some time, the database that's presented to users isn't actually changing. This may require keeping a mirror of some data. Notice that having a local database doesn't mean you can completely insulate yourself from the Internet.
Processing records: Each database you support must have a parser to read the various fields of each record. This may be as simple as pulling out the accession number for a sequence, or it may be much more complicated, such as when you record specific keywords. You can build your own parsers, but it takes less time to use one already created, such as a parser from the Bioperl project.
Storing data: Your database schema will determine how each record is stored and what kinds of relationships exist between various pieces of data. Designing an appropriate schema is a difficult problem because it takes people who understand the data (biologists), the data models (software engineers), and the storage/backup of the data (systems administrators).
Indexing: The efficiency of queries will largely depend on what data is indexed. You may choose to index everything, but your indexes could grow much larger than your data. So you may have to make compromises. This is another place where users and engineers must interact to determine the appropriate solution.
Querying: Not all databases are queried in the same way. Relational databases usually employ SQL as the query language, but many popular databases have their own querying mechanisms. The details of how you interact with the database may depend on what kind of database you use. Regardless of the underlying architecture, you may decide to present a different interface to users, such as a form in a web browser or a script/program interface that connects directly to the database.
Formatting: You'll definitely want to create FASTA files, but what other report formats will you want to support? The DDBJ/EMBL/GenBank flat file formats are sometimes used to exchange data, so this would be useful, as would tabular format and some kind of HTML that looks good in browsers. For each output format, you may need some specialized code to generate the report.
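The Downloading item above mentions keeping the database that users see stable while an update is in progress. One common way to do this, sketched here in Python, is to unpack each release into its own directory and then switch a symbolic link in a single step. The sketch assumes a POSIX filesystem with symbolic links. The function `publish_release` and the names `current`, `release-1`, and `release-2` are hypothetical.

```python
import os
import tempfile

def publish_release(base_dir, release_name, link_name="current"):
    """Point base_dir/link_name at base_dir/release_name in one atomic step."""
    link_path = os.path.join(base_dir, link_name)
    tmp_link = link_path + ".tmp"
    # Clean up a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Create the new link under a temporary name, then rename it over the
    # old one. On POSIX systems the rename replaces the old link atomically.
    os.symlink(release_name, tmp_link)
    os.replace(tmp_link, link_path)
    return link_path

# Usage: two fake "releases", each a directory holding one FASTA file.
base = tempfile.mkdtemp()
for name, seq in (("release-1", "GAATTC"), ("release-2", "ATAGCGAAT")):
    os.mkdir(os.path.join(base, name))
    with open(os.path.join(base, name, "db.fasta"), "w") as fh:
        fh.write(">FOO\n" + seq + "\n")

current = publish_release(base, "release-1")
with open(os.path.join(current, "db.fasta")) as fh:
    first = fh.read()

current = publish_release(base, "release-2")
with open(os.path.join(current, "db.fasta")) as fh:
    second = fh.read()
```

Users always open files through the `current` link, so they see either the old release or the new one, never a half-updated mixture. The old release directory can be deleted once no running job is still reading it.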

As you can see, building a local database isn't trivial. But it doesn't have to be a full-time job if you only want a subset of the information. For example, if all you want is to retrieve records by accession number, you don't need to invest more than a couple of hours of work. The following sections explore the common techniques for managing sequence data.

11.4.3 Retrieving FASTA Files by Accession

The task of retrieving FASTA files by accession number is so common and has such an easy solution that it should be a local resource. If you're using NCBI-BLAST, the fastacmd program retrieves sequences from BLAST databases singly or in batches. If you're using WU-BLAST, the xdget program does the same thing. To use these features, you must index the databases when you format them, which is as simple as including the -o option (formatdb) or the -I option (xdformat). See the command-line tutorial in Chapter 10, the reference sections for formatdb and fastacmd in Chapter 13, and xdformat and xdget in Chapter 14. One limitation of this approach is that the sequences are stored in a case-insensitive format in the database. If you use lowercase to denote regions containing repeats, for example, that information will be lost. If this is a serious problem for you, use one of the flat-file indexing schemes described later.

NCBI-BLAST users take note that unless you use the NCBI FASTA definition line format discussed earlier in this chapter, your definition lines may not look exactly the same when they come out of the database. For example, if you have a definition line such as this:

>FOO

When you retrieve it with fastacmd, it looks like:

>lcl|FOO no definition found

You can easily avoid such inconsistencies by using the recommended identifier format and by including descriptions on the definition line.

WU-BLAST users take note: xdget doesn't support virtual databases. You can work around this limitation with a simple script, such as this one:

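#!/usr/bin/perl -w
use strict;

# Find the first argument that contains whitespace; treat it as a
# virtual database, that is, a space-separated list of database names.
my (@DB, $i);
for ($i = 0; $i < @ARGV; $i++) {
    if ($ARGV[$i] =~ /\s/) {
        @DB = split(' ', $ARGV[$i]);   # ' ' also skips leading whitespace
        last;
    }
}

# No virtual database given: hand the arguments to xdget unchanged.
exec("xdget @ARGV") unless @DB;

# Otherwise run xdget once per database, keeping the arguments that came
# before (@pre) and after (@post) the database list.
my @pre = splice(@ARGV, 0, $i);
my @post = splice(@ARGV, 1);
foreach my $db (@DB) {
    system("xdget @pre $db @post");
}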

11.4.4 Flat File Indexing

One of the most common procedures used to manage sequence data is called flat file indexing. In this approach, you keep concatenated sequence reports in their native format and store the starting position of each record in a separate file. One advantage of this approach is that you don't have to do any work when you want to reproduce the data in flat file format. Another reason why flat file indexing is so common is that it is simple to implement, at least for one-to-one mappings. To illustrate the process, we'll show you how to index identifiers in FASTA files. Here is an example of a very short FASTA file:

>FOO
GAATTC
>BAR
ATAGCGAAT

This file has two records with identifiers FOO and BAR, and they begin at bytes 0 and 12, respectively (count the letters and don't forget to add one for the end of line—in Windows, the end of line is actually two characters, and this will change the positions to 0 and 14). You can now create an index file that tells where each record begins in the file:

BAR 12
FOO 0

To use this index file, simply find the identifier of interest in the index and seek to the appropriate position in the FASTA file. Note that the index file is sorted alphabetically by identifier. This makes it much more efficient to find a record because you can use a binary search. If you have an index file containing 1 million records, a linear search looks through 500,000 records on average, but a binary search looks at only about 20 (log2 of 1,000,000 is just under 20).
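The indexing and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration that keeps the index in memory as a sorted list rather than in a separate index file. The function names `build_index` and `fetch` and the file name `example.fasta` are invented for the example.

```python
import bisect
import os
import tempfile

def build_index(fasta_path):
    """Return a sorted list of (identifier, byte_offset) pairs."""
    index = []
    # Binary mode so that offsets are true byte positions on any platform.
    with open(fasta_path, "rb") as fh:
        offset = 0
        for line in fh:
            if line.startswith(b">"):
                fields = line[1:].split()
                identifier = fields[0].decode("ascii") if fields else ""
                index.append((identifier, offset))
            offset += len(line)
    index.sort()
    return index

def fetch(fasta_path, index, identifier):
    """Binary-search the index, seek to the record, and return its text."""
    pos = bisect.bisect_left(index, (identifier, 0))
    if pos == len(index) or index[pos][0] != identifier:
        return None
    with open(fasta_path, "rb") as fh:
        fh.seek(index[pos][1])
        lines = [fh.readline()]          # the definition line
        for line in fh:
            if line.startswith(b">"):    # next record begins
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")

# Usage: the two-record FASTA file from the text, with Unix line endings.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "wb") as fh:
    fh.write(b">FOO\nGAATTC\n>BAR\nATAGCGAAT\n")

index = build_index(path)
record = fetch(path, index, "BAR")
```

For this file, `build_index` yields BAR at byte 12 and FOO at byte 0, matching the index shown earlier. Opening the file in binary mode matters: the offsets are counted in bytes, so the same code also gives correct positions for files with Windows line endings.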

You can make a couple of improvements to this simplistic indexing scheme. The first is to allow the index file to support more than one FASTA file. This is a trivial modification because you can just add a filename to your index file:

BAR file-A 12
FOO file-A 0
XYZ file-B 0

Another easy improvement is to use a persistent indexed data structure such as a Perl tied-hash. The Bioperl project uses this strategy in its Bio::Index classes.
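Python's standard `shelve` module plays a role loosely similar to a Perl tied hash: it looks like a dictionary but lives on disk. The sketch below combines both improvements, storing a filename and an offset for each identifier in a persistent index. It is an illustration only and doesn't reflect how Bio::Index is implemented. The names `index_files`, `lookup`, and `seq.index` are hypothetical, and the XYZ sequence is made up.

```python
import os
import shelve
import tempfile

def index_files(index_path, fasta_paths):
    """Store identifier -> (filename, byte_offset) in an on-disk shelf."""
    with shelve.open(index_path, flag="n") as db:   # "n": always start fresh
        for path in fasta_paths:
            with open(path, "rb") as fh:
                offset = 0
                for line in fh:
                    if line.startswith(b">"):
                        fields = line[1:].split()
                        if fields:
                            db[fields[0].decode("ascii")] = (path, offset)
                    offset += len(line)

def lookup(index_path, identifier):
    """Return (filename, byte_offset) for identifier, or None if absent."""
    with shelve.open(index_path, flag="r") as db:
        return db.get(identifier)

# Usage: two small FASTA files named as in the index above.
workdir = tempfile.mkdtemp()
file_a = os.path.join(workdir, "file-A")
file_b = os.path.join(workdir, "file-B")
with open(file_a, "wb") as fh:
    fh.write(b">FOO\nGAATTC\n>BAR\nATAGCGAAT\n")
with open(file_b, "wb") as fh:
    fh.write(b">XYZ\nGATTACA\n")

index_path = os.path.join(workdir, "seq.index")
index_files(index_path, [file_a, file_b])
bar_location = lookup(index_path, "BAR")
xyz_location = lookup(index_path, "XYZ")
```

Because the index persists on disk, it is built once per database update and then reused by every later lookup, with no need to load or sort the whole index first.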

A slightly more complicated approach manages the indices with one of the many free or commercial database applications, such as MySQL, PostgreSQL, FileMaker, Microsoft Access, or whatever you happen to be familiar with. If you're going to do this, you might as well store a bit more data. For illustrative purposes, imagine you create a schema like that in Table 11-3. In addition to the accession number, file, and offset, this schema provides for a species and a molecule type (moltype). The schema deliberately omits the actual sequence because some database applications can't handle data this large. If you wish to store sequences as well, test the performance of the system with realistic data to see if the system scales well.

Table 11-3. Sequence database example

Accession   Species               Moltype   File     Offset
A           Homo sapiens          AA        file-1   12024
B           Homo sapiens          AA        file-1   250
C           Homo sapiens          DNA       file-2   28223
AF287139    Latimeria chalumnae   cDNA      file-3   0

Using such a database, you can perform not only simple accession number retrievals but also one-to-many queries, such as all human sequences or all DNA sequences. All you have to do is query the database and seek to the appropriate place in the appropriate file for every record. Organizing the data this way has a number of advantages over just downloading DDBJ/EMBL/GenBank by division. For example, if you want to make a database of all human transcripts, you need to identify the human ESTs from the EST division, as well as all the mRNAs from the PRI (primate) division. But if you've designated all ESTs and mRNAs as the cDNA moltype, getting all human transcripts is as easy as retrieving all records in which the species is Homo sapiens and the moltype is cDNA. You can add several more fields to the database, like date created, division, keywords, etc., and get quite a bit of functionality without much more complexity.
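As a rough sketch of such a query, the following Python example loads the rows of Table 11-3 into an in-memory SQLite database and asks for records by species and moltype. SQLite stands in here for whichever database application you prefer. The table name `seq_index`, the column name `byte_offset`, and the function `locations` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE seq_index (
           accession   TEXT PRIMARY KEY,  -- one-to-one lookup
           species     TEXT NOT NULL,
           moltype     TEXT NOT NULL,
           file        TEXT NOT NULL,
           byte_offset INTEGER NOT NULL
       )"""
)
# A second index makes the one-to-many species/moltype queries efficient.
conn.execute("CREATE INDEX idx_species_moltype ON seq_index (species, moltype)")

rows = [  # the rows of Table 11-3
    ("A", "Homo sapiens", "AA", "file-1", 12024),
    ("B", "Homo sapiens", "AA", "file-1", 250),
    ("C", "Homo sapiens", "DNA", "file-2", 28223),
    ("AF287139", "Latimeria chalumnae", "cDNA", "file-3", 0),
]
conn.executemany("INSERT INTO seq_index VALUES (?, ?, ?, ?, ?)", rows)

def locations(species, moltype):
    """Return (accession, file, byte_offset) rows, ordered by file then offset."""
    cur = conn.execute(
        "SELECT accession, file, byte_offset FROM seq_index "
        "WHERE species = ? AND moltype = ? ORDER BY file, byte_offset",
        (species, moltype),
    )
    return cur.fetchall()

human_proteins = locations("Homo sapiens", "AA")
human_transcripts = locations("Homo sapiens", "cDNA")
```

Each returned row says which file to open and where to seek. Ordering by file and offset means each flat file can be read in a single forward pass. With only the four rows of Table 11-3 loaded, the human cDNA query returns nothing, since the only cDNA record in the table is the coelacanth one.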

Overall, flat file indexing is a very good strategy for sequence management because it is simple, fast, and retains the data in its original format. You don't even have to write any software, as both free and commercial software packages are designed specifically for managing flat file data. Check out the Bioperl project at http://bioperl.org, MyGenBank at http://sourceforge.net/projects/mgb, and SRS (see Table 11-4).

11.4.5 Commercial Sequence Management Software

Several commercial software packages are designed for managing biological sequence data. The database software is generally part of a much larger software suite that includes sequence analysis tools such as BLAST and visualization tools to make interpretation easier. The companies that develop these packages expend a great deal of effort to make the various sequence analysis tasks interoperable and user friendly. Table 11-4 gives a brief description of the software.

Table 11-4. Commercial sequence management software

Company: Product and description

Accelrys: The popular Wisconsin GCG package is now owned by Accelrys, which provides the SeqStore software for managing sequence data. The system uses an Oracle database and allows daily/weekly updates. To install and maintain the system, you must have personnel with experience in Unix systems administration and Oracle database administration. Accelrys recommends a computer with at least 4 CPUs, 4 GB RAM, and 40 GB disk space. http://www.accelrys.com

Informax: The Genomax software suite provides sequence management along with a comprehensive set of interoperable tools. Informax recommends a project manager, a Unix systems administrator, and an Oracle database administrator to manage and maintain the system, as well as a life sciences expert to respond to users' questions. Informax uses a three-tiered architecture and recommends that the three computers be configured with 4 CPUs and 4-8 GB RAM, and that the database server have 400 GB disk space. http://www.informaxinc.com

LION Biosciences: LION Biosciences offers the Sequence Retrieval System (SRS). SRS is probably the most popular sequence management software in use today and is used by both DDBJ and EMBL. SRS is free for academic users. LION produces a separate, related product, PRISMA2, which is an automatic databank-updating and maintenance tool. SRS requires a person with competent Unix skills to install and maintain it, and a server with enough storage for the various databases and indexes. http://lionbioscience.com

As you can see from the descriptions of the personnel and hardware requirements, using these comprehensive sequence analysis systems requires a serious commitment. For this reason, these packages aren't recommended for small research groups. For larger groups, though, these products can save a lot of time and money. It's easy to underestimate the effort required to develop your own sequence management system, so be cautious before embarking on such a task, and give the professionals a chance to show you their wares.

11.4.6 Tools on the Internet

There are good reasons to use web-based tools for sequence management rather than building a local database. First, you don't have to download more data than you need. Mirroring the entire public database isn't efficient if you need only a slice of it. Second, database providers take care of the most time-consuming and expensive tasks, namely processing, storing, and indexing the data. Third, the databases are self-updating, which means that you can always get the latest and most accurate information. Best of all, the service is completely free. Well, maybe not completely free since the databases are supported from taxes, but let's all thank the various governments and funding agencies for putting our hard-earned money toward a worthy cause, and let's especially recognize all the people that make it actually happen.

The downside to using web-based tools is that you have to spend time learning how to query the database efficiently and accurately, but that's going to be true of any sequence management system, even your own. A more serious issue is that you will depend on the computers and network between you and the database provider, but this will improve over time. Still, even if you have to put up with a few glitches here and there, the total cost in time and money is probably lower than that of building your own local mirror.
