8.12 Use Caution When Searching Raw Sequencing Reads
Most raw sequencing reads come from the early stages of genome
projects and from EST sequencing. Most reads have an error rate of
about 1 percent, but this rate isn't uniform along a read: there is a
spike near the beginning and a gradual increase towards the end. In
addition, some regions have intrinsically high error rates
due to compositional properties such as high GC content. DNA
sequencing involves several steps, and there are abundant
opportunities for mechanical and human error. Because a single
miscalled base inside a word keeps that word from seeding an
alignment, be careful when using large word sizes. For redundant
sequence collections, such as 3x shotgun coverage of a genome, large
word sizes are fine because other reads cover the same region. But if
the absence of a single alignment is troublesome, scale down the word
size to keep sequencing errors from preventing seeding.
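The effect of word size on seeding can be illustrated with a minimal sketch. It assumes exact-match words, independent errors at a uniform rate (which, as noted above, real reads do not have), and a made-up reference and read; the function names are hypothetical and not part of any BLAST program.

```python
def error_free_word_probability(word_size, error_rate=0.01):
    """Chance that a single word of word_size bases has no miscalled base.

    Assumes errors are independent and uniform, which real reads are not.
    """
    return (1.0 - error_rate) ** word_size


def shared_words(read, reference, word_size):
    """Return the exact words of length word_size present in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(read) & words(reference)


# Made-up 30-base reference and a read of it with two miscalled bases
# (0-based positions 9 and 19), so no error-free stretch exceeds 10 bases.
reference = "ACGTTGCAAGCTTGACCGTAGGCTAACGTT"
read = "ACGTTGCAATCTTGACCGTCGGCTAACGTT"

for w in (11, 7):
    # Word size, per-word error-free probability, number of shared seed words
    print(w, error_free_word_probability(w), len(shared_words(read, reference, w)))
```

With a word size of 11, the two errors leave no intact word in this read, so nothing would seed; with a word size of 7, several words survive.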
Raw sequencing reads may be contaminated from a variety of sources.
Cloning vectors
are one expected source. Depending on the sequencing center, the
vectors may or may not have been clipped from the sequence. Other
kinds of contamination are also possible. Nuclear DNA is sometimes
contaminated with mitochondrial or viral DNA, and any collection of
sequences can be contaminated with DNA from another organism (genome
centers usually sequence more than one entity at a time, and sometimes
there's a mix-up over who did what and when). ESTs
sometimes have their poly-A tail intact, and whether or not this is
contamination is a matter of perspective. Taken together, there are
many opportunities for contamination, and it's a
good idea to be cautious when using raw sequencing reads.
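As a rough illustration of that caution, the sketch below trims a trailing poly-A run from an EST and flags reads that contain a known contaminant fragment. Everything here is an assumption made for illustration: the function names, the minimum run length of 10, and the vector fragment are invented, and exact substring matching will miss contaminants that themselves carry sequencing errors.

```python
import re


def trim_poly_a(read, min_run=10):
    """Strip a trailing run of at least min_run A's; shorter runs are kept."""
    match = re.search(r"[Aa]{%d,}$" % min_run, read)
    return read[:match.start()] if match else read


def contains_any(read, fragments):
    """Return True if the read contains any of the given fragments exactly."""
    upper = read.upper()
    return any(fragment.upper() in upper for fragment in fragments)


# Hypothetical contaminant fragment, invented for illustration only.
vector_fragments = ["ACGTACGTTTGACCAGT"]

est = "ACGTCCGATTGC" + "A" * 15
print(trim_poly_a(est))
print(contains_any("TT" + vector_fragments[0] + "GG", vector_fragments))
```

Note that trimming also removes any genuine A's that happen to sit directly before the tail, so whether to trim at all depends on how the reads will be used.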