Statistics of optimal sequence alignments

Grossmann, Steffen

Local TagsRelease HistoryDetailsSummary

Statistics of optimal sequence alignments

Grossmann, S. (2003). Statistics of optimal sequence alignments. PhD Thesis, Johann Wolfgang Goethe-Universität, Frankfurt am Main.

Item is Released

show all hide all

Basic

show hide

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0010-8A49-C Version Permalink: https://hdl.handle.net/11858/00-001M-0000-0010-8A4A-A

Genre: Thesis

Files

show Files

hide Files

:

PhD-Grossmann.pdf (Any fulltext), 650KB

File Permalink:
-

Name:
PhD-Grossmann.pdf

Description:
-

OA-Status:

Visibility:
Restricted (Max Planck Institute for Molecular Genetics, MBMG; )

MIME-Type / Checksum:
application/pdf

Technical Metadata:

Copyright Date:
-

Copyright Info:
eDoc_access: MPG

License:
-

Locators

show

Creators

show

hide

Creators:
Grossmann, Steffen¹, Author

Affiliations:
1Max Planck Society, ou_persistent13

Content

show

hide

Free keywords: -

Abstract: This thesis has been motivated by the problem of assessing the statistical significance of the outcomes from sequence alignment algorithms. Finding similarities between sequences with entries from a finite alphabet has become an important task in recent times-especially in biology, where a fast growing number of sequences from the 4-letter DNA-alphabet or the 20-letter amino acid-alphabet waits to be analyzed and understood. In this thesis we are concerned with a class of similarity criteria, that is widely used in biology. It expresses the similarity of two sequences by giving an optimized local alignment. A local alignment basically gives the information which parts of the sequences are similar and what the similarity looks like in detail. This local alignment is optimized in the following sense. First a score is assigned to each possible local alignment of the two sequences. Here, the scoring scheme should be designed in such a way that scores increase with the degree of similarity in the alignment. Then the maximizing alignment is determined. This alignment, together with its score, measures the degree of similarity between the two sequences. We will see shortly that the scoring schemes are chosen in such a way that determining the optimal alignment can be done in reasonable time. However there remains the problem to understand the similarity scale that is given by this procedure: How high does a score have to be in order to indicate that the two sequences share some extraordinary similarity? A natural way to answer this question would be to determine which scores one would observe when comparing two typical unrelated sequences. If a score observed from real data is very high compared to those typical scores, then this would indicate a strong similarity. Put more mathematically, the term typical unrelated has to be read as random independent. Thus, the statistical way to answer the above question, is to take two sequences of independently chosen random letters and determine the distribution of the optimal local alignment score. Although this sounds simple, this is by no means the case. The optimal local alignment score is a complicated quantity, since in its calculation a maximization over the set of all possible local alignments of the two sequences is involved. This makes it difficult to calculate its distribution. In fact, from the point of view of probability theory, the resulting model is closely related to so-called percolation models-a class of models that has been actively researched during the past two decades, but for which many challenging problems still wait for their solution.

Details

show

hide

Language(s): eng - English

Dates: Accepted: 2003-05-07

Publication Status: Accepted / In Press

Pages: 89 pp

Publishing info: Frankfurt am Main : Johann Wolfgang Goethe-Universität

Table of Contents: 1 Introduction 1
1.1 Alignments and biology . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Alignments and genealogies . . . . . . . . . . . . . . . 1
1.1.2 The stochastic process of mutations . . . . . . . . . . 2
1.1.3 Getting rid of the ancestral sequence . . . . . . . . . . 4
1.1.4 Local similarities . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Alignments in biological practice . . . . . . . . . . . . 6
1.1.6 The main problem: statistical signi cance . . . . . . . 7
1.1.7 Many sequences vs. two sequences . . . . . . . . . . . 8
1.2 Optimal alignments and percolation theory . . . . . . . . . . 9
1.2.1 Classical rst-passage percolation . . . . . . . . . . . . 9
1.2.2 Optimal alignments as a percolation model . . . . . . 10
1.2.3 Questions arising from the model of optimal scores . . 12
2 Basic ideas and concepts 15
2.1 The level of large deviations . . . . . . . . . . . . . . . . . . . 15
2.1.1 The gapless case . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 The gapped case . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Rate functions and concentration inequalities . . . . . 22
2.2 Convergence rates . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Superadditive global maxima 27
3.1 Independent superadditive processes . . . . . . . . . . . . . . 27
3.2 Large deviations . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Large deviations for the global maximum . . . . . . . . . . . 36
4 Alignment speci c results 43
4.1 Alignments and combinatorial optimization . . . . . . . . . . 43
4.2 Introducing randomness . . . . . . . . . . . . . . . . . . . . . 47
4.3 Concentration inequalities . . . . . . . . . . . . . . . . . . . . 49
4.3.1 Martingale technique . . . . . . . . . . . . . . . . . . . 49
4.3.2 Talagrand's concentration inequality . . . . . . . . . . 51
4.4 Large deviations for optimal global scores . . . . . . . . . . . 53
4.5 Maximizing with a xed starting point . . . . . . . . . . . . . 54
4.6 Large deviations for optimal local scores . . . . . . . . . . . . 55
4.7 Phase transition . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7.1 The logarithmic phase . . . . . . . . . . . . . . . . . . 56
4.7.2 The linear phase . . . . . . . . . . . . . . . . . . . . . 58
4.8 Rate of convergence to the growth constant . . . . . . . . . . 60
4.9 Rate of convergence to . . . . . . . . . . . . . . . . . . . . 63
4.10 Multiple splitting . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.10.1 A relation between LMGFs . . . . . . . . . . . . . . . 67
4.10.2 Exploiting the upper bound . . . . . . . . . . . . . . . 68
5 Conclusions and outlook 71
5.1 Finer tail asymptotics . . . . . . . . . . . . . . . . . . . . . . 71
5.1.1 The gapless case . . . . . . . . . . . . . . . . . . . . . 71
5.1.2 An intermediate case . . . . . . . . . . . . . . . . . . . 72
5.1.3 The gapped case . . . . . . . . . . . . . . . . . . . . . 73
5.2 Assessing the statistical signi cance in practice . . . . . . . . 74
5.2.1 Calculating the parameters . . . . . . . . . . . . . . . 75
A Appendix 77
A.1 Asymptotic relations . . . . . . . . . . . . . . . . . . . . . . . 77
A.2 Convex conjugation . . . . . . . . . . . . . . . . . . . . . . . . 78
A.3 Some calculations . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.4 A lemma for almost additive sequences . . . . . . . . . . . . . 81
A.5 More calculations . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.5.1 Proof of Lemma 16 . . . . . . . . . . . . . . . . . . . . 81
A.5.2 Calculation of ~ f (x) . . . . . . . . . . . . . . . . . . . 83
Bibliography 85
List of Figures
1.1 An alignment of two DNA sequences . . . . . . . . . . . . . . 2
1.2 Time-reversibility . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Path-representation of alignments . . . . . . . . . . . . . . . . 11
2.1 A random walk with negative drift . . . . . . . . . . . . . . . 17
2.2 The two convex conjugate functions r and
. . . . . . . . . . 23
2.3 Splitting paths . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 The problem in Kingman's proof . . . . . . . . . . . . . . . . 33
3.2 The problem with Gartner-Ellis . . . . . . . . . . . . . . . . . 35
3.3 Illustrating Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Illustrating Theorem 2 . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Calculating alignment scores . . . . . . . . . . . . . . . . . . 45
4.2 More splitting paths . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Even more splitting paths . . . . . . . . . . . . . . . . . . . . 61
A.1 Convex conjugation . . . . . . . . . . . . . . . . . . . . . . . . 78

Rev. Type: -

Identifiers: eDoc: 194865

Degree: PhD

Event

show

Legal Case

show

Project information

show

Source

show