4.5 Sequence Similarity
Sequence similarity is a simple
extension of amino acid or nucleotide similarity. To determine it,
sum up the individual pair-wise scores in an alignment. For example,
the raw score of the following BLAST alignment under the BLOSUM62
matrix is 72. Converting 72 to a normalized score is as simple as
multiplying by lambda (λ). (Note that for BLAST statistical calculations,
the normalized score is λS - ln K, where S is the raw score.)
Query: 885 QCPVCHKKYSNALVLQQHIRLHTGE 909
           +C VC K ++    L++H RLHTGE
Sbjct: 267 ECDVCSKSFTTKYFLKKHKRLHTGE 291
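The arithmetic for this alignment can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's own code: the dictionary holds only the BLOSUM62 entries that this one ungapped alignment needs, the function names are hypothetical, and the lambda and K values passed in at the end (0.318 and 0.134, figures commonly quoted for ungapped BLOSUM62) are assumptions rather than values stated in this section.

```python
import math

# Only the BLOSUM62 entries needed for this one alignment. The matrix is
# symmetric, so each unordered pair is stored once, in alphabetical order.
BLOSUM62_SUBSET = {
    ("A", "K"): -1, ("C", "C"): 9, ("D", "P"): -1, ("E", "E"): 5,
    ("E", "Q"): 2, ("F", "V"): -1, ("F", "Y"): 3, ("G", "G"): 6,
    ("H", "H"): 8, ("H", "S"): -1, ("I", "K"): -3, ("K", "K"): 5,
    ("K", "Q"): 1, ("K", "S"): 0, ("L", "L"): 4, ("L", "Y"): -1,
    ("N", "T"): 0, ("R", "R"): 5, ("S", "T"): 1, ("T", "T"): 5,
    ("V", "V"): 4,
}


def pair_score(a, b):
    """Look up the score of one aligned letter pair, in either order."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def raw_score(query, subject):
    """Sum the pair-wise scores of an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


def normalized_score(raw, lam, k):
    """Normalized score in the BLAST sense: lambda * S - ln K."""
    return lam * raw - math.log(k)


QUERY = "QCPVCHKKYSNALVLQQHIRLHTGE"
SBJCT = "ECDVCSKSFTTKYFLKKHKRLHTGE"

score = raw_score(QUERY, SBJCT)
print(score)  # 72

# lam and k below are assumed illustrative values, not taken from the text.
print(normalized_score(score, lam=0.318, k=0.134))
```

Each pair is looked up without reference to its neighbors, so the total is just a sum over alignment columns.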
Recall from Chapter 3 that the score of each pair of letters is
considered independently of the rest of the alignment. Sequence
similarity rests on the same idea. There is a convenient synergy
between alignment algorithms and alignment scores. However, when you
treat the letters independently of one another, you lose contextual
information. Can you assume that the probability of A followed by G is
the same as the probability of G followed by A? In a natural language
such as English, you know that this doesn't make sense: Q is almost
always followed by U, and if you treat the letters independently, you
lose this restriction. The context rules for biological sequences
aren't as strict as those for English, but there are tendencies. For
example, low-entropy sequences occur much more frequently than expected
by chance. To avoid becoming sidetracked by the details, accept that
you're using an approximation, and note that in practice, it works
well.
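One way to see what the independence assumption ignores is to compare how often each adjacent letter pair actually occurs with how often it would occur if letters were drawn independently. The Python sketch below does this for a made-up toy sequence. The function name and the sequence are hypothetical, chosen only for illustration; a real analysis would use actual sequence data.

```python
from collections import Counter


def adjacent_pair_ratios(seq):
    """Return observed/expected frequency for each adjacent letter pair.

    The expected frequency assumes letters are independent, so it is the
    product of the two single-letter frequencies. A ratio far from 1 means
    that context matters for that pair.
    """
    n = len(seq)
    singles = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in pairs.items():
        observed = count / (n - 1)
        expected = (singles[pair[0]] / n) * (singles[pair[1]] / n)
        ratios[pair] = observed / expected
    return ratios


# Toy repetitive sequence (made up for illustration).
toy = "AGAGAGAG"
for pair, ratio in sorted(adjacent_pair_ratios(toy).items()):
    print(pair, round(ratio, 2))
```

In this toy sequence, AG and GA both occur more often than independence predicts, while AA and GG never occur at all. Pair-wise scoring that treats letters independently cannot see patterns of this kind.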