8.11 Expect Contaminants in EST Databases
A simple view is that ESTs are sequencing reads from cDNAs, cDNAs are
derived from mRNAs, and mRNAs are derived from genes. In theory this is
true, but in practice ESTs frequently don't correspond to genes (e.g.,
rather than matching an exon or UTR, they overlap part of a repeat on
the wrong strand within an intron). The fraction of nontranscript
sequence depends on how the library was created. Some libraries are
nearly devoid of extragenic material, while others are essentially
random shotgun sequence. How can you tell the difference? It's
difficult to determine directly from the EST sequences.

Before the human genome was completed, the number of genes was
estimated at 100,000 to 200,000. Current estimates are 25,000 to
30,000. One of the reasons for the initial high figure was that EST
clustering experiments found many clusters, and people believed each
cluster was a gene. One of the best ways to sort out real transcripts
from contaminants is to align ESTs back to the genome they came from.
See Section 9.1.5 for more details.
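
Once ESTs have been aligned to the genome, each alignment can be compared with known gene annotation to see which ESTs land where a transcript is expected. The following is only a minimal sketch of that idea, not a method taken from this book: the `Gene` record, the `classify_est` function, the four labels, and the coordinates in the usage example are all hypothetical. It also assumes that the strand of each EST is known, which is often not the case for single-pass reads.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    """Hypothetical annotation record; coordinates are 1-based and inclusive."""
    start: int
    end: int
    strand: str                    # "+" or "-"
    exons: List[Tuple[int, int]]   # (start, end) of each exon, UTRs included


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


# Higher rank wins when an EST overlaps more than one gene.
RANK = {"intergenic": 0, "antisense": 1, "intronic": 2, "exonic": 3}


def classify_est(est_start: int, est_end: int, est_strand: str,
                 genes: List[Gene]) -> str:
    """Label one EST alignment relative to annotated genes (illustrative only)."""
    label = "intergenic"
    for gene in genes:
        if not overlaps(est_start, est_end, gene.start, gene.end):
            continue
        if est_strand != gene.strand:
            candidate = "antisense"   # inside a gene, but on the wrong strand
        elif any(overlaps(est_start, est_end, s, e) for s, e in gene.exons):
            candidate = "exonic"      # touches an annotated exon or UTR
        else:
            candidate = "intronic"    # same strand, but between exons
        if RANK[candidate] > RANK[label]:
            label = candidate
    return label


# Usage with made-up coordinates: one two-exon gene on the plus strand.
genes = [Gene(100, 1000, "+", [(100, 200), (800, 1000)])]
for est in [(150, 250, "+"), (300, 400, "+"), (300, 400, "-"), (2000, 2100, "+")]:
    print(est, classify_est(*est, genes))
```

An EST labeled exonic is consistent with a real transcript, while the other labels mark candidates for closer inspection. They are not proof of contamination, because annotation is incomplete and unannotated genes do exist.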