Statistics of optimal sequence alignments

Grossmann, Steffen

Item

ITEM ACTIONSEXPORT

Add to Basket

Local TagsRelease HistoryDetailsSummary

Released

Thesis

Statistics of optimal sequence alignments

MPS-Authors

Grossmann, Steffen
Max Planck Society;

External Resource

No external resources are shared

Fulltext (restricted access)

There are currently no full texts shared for your IP range.

Fulltext (public)

There are no public fulltexts stored in PuRe

Supplementary Material (public)

There is no public supplementary material available

Citation

Grossmann, S. (2003). Statistics of optimal sequence alignments. PhD Thesis, Johann Wolfgang Goethe-Universität, Frankfurt am Main.

Cite as: https://hdl.handle.net/11858/00-001M-0000-0010-8A49-C

Abstract

This thesis has been motivated by the problem of assessing the statistical significance of the outcomes from sequence alignment algorithms. Finding similarities between sequences with entries from a finite alphabet has become an important task in recent times-especially in biology, where a fast growing number of sequences from the 4-letter DNA-alphabet or the 20-letter amino acid-alphabet waits to be analyzed and understood. In this thesis we are concerned with a class of similarity criteria, that is widely used in biology. It expresses the similarity of two sequences by giving an optimized local alignment. A local alignment basically gives the information which parts of the sequences are similar and what the similarity looks like in detail. This local alignment is optimized in the following sense. First a score is assigned to each possible local alignment of the two sequences. Here, the scoring scheme should be designed in such a way that scores increase with the degree of similarity in the alignment. Then the maximizing alignment is determined. This alignment, together with its score, measures the degree of similarity between the two sequences. We will see shortly that the scoring schemes are chosen in such a way that determining the optimal alignment can be done in reasonable time. However there remains the problem to understand the similarity scale that is given by this procedure: How high does a score have to be in order to indicate that the two sequences share some extraordinary similarity? A natural way to answer this question would be to determine which scores one would observe when comparing two typical unrelated sequences. If a score observed from real data is very high compared to those typical scores, then this would indicate a strong similarity. Put more mathematically, the term typical unrelated has to be read as random independent. Thus, the statistical way to answer the above question, is to take two sequences of independently chosen random letters and determine the distribution of the optimal local alignment score. Although this sounds simple, this is by no means the case. The optimal local alignment score is a complicated quantity, since in its calculation a maximization over the set of all possible local alignments of the two sequences is involved. This makes it difficult to calculate its distribution. In fact, from the point of view of probability theory, the resulting model is closely related to so-called percolation models-a class of models that has been actively researched during the past two decades, but for which many challenging problems still wait for their solution.