Statistics of optimal sequence alignments

Grossmann, Steffen

Statistics of optimal sequence alignments

Grossmann, S. (2003). Statistics of optimal sequence alignments. PhD Thesis, Johann Wolfgang Goethe-Universität, Frankfurt am Main.

Item is 公開

表示: 全項目非表示: 全項目

基本情報

表示: 非表示:

アイテムのパーマリンク: https://hdl.handle.net/11858/00-001M-0000-0010-8A49-C 版のパーマリンク: https://hdl.handle.net/11858/00-001M-0000-0010-8A4A-A

資料種別: 学位論文

ファイル

表示: ファイル

非表示: ファイル

:

PhD-Grossmann.pdf (全文テキスト（全般）), 650KB

ファイルのパーマリンク:
-

ファイル名:
PhD-Grossmann.pdf

説明:
-

OA-Status:

閲覧制限:
制限付き (Max Planck Institute for Molecular Genetics, MBMG; )

MIMEタイプ / チェックサム:
application/pdf

技術的なメタデータ:

著作権日付:
-

著作権情報:
eDoc_access: MPG

CCライセンス:
-

作成者

表示:

非表示:

作成者:
Grossmann, Steffen¹, 著者

所属:
1Max Planck Society, ou_persistent13

内容説明

表示:

非表示:

キーワード: -

要旨: This thesis has been motivated by the problem of assessing the statistical significance of the outcomes from sequence alignment algorithms. Finding similarities between sequences with entries from a finite alphabet has become an important task in recent times-especially in biology, where a fast growing number of sequences from the 4-letter DNA-alphabet or the 20-letter amino acid-alphabet waits to be analyzed and understood. In this thesis we are concerned with a class of similarity criteria, that is widely used in biology. It expresses the similarity of two sequences by giving an optimized local alignment. A local alignment basically gives the information which parts of the sequences are similar and what the similarity looks like in detail. This local alignment is optimized in the following sense. First a score is assigned to each possible local alignment of the two sequences. Here, the scoring scheme should be designed in such a way that scores increase with the degree of similarity in the alignment. Then the maximizing alignment is determined. This alignment, together with its score, measures the degree of similarity between the two sequences. We will see shortly that the scoring schemes are chosen in such a way that determining the optimal alignment can be done in reasonable time. However there remains the problem to understand the similarity scale that is given by this procedure: How high does a score have to be in order to indicate that the two sequences share some extraordinary similarity? A natural way to answer this question would be to determine which scores one would observe when comparing two typical unrelated sequences. If a score observed from real data is very high compared to those typical scores, then this would indicate a strong similarity. Put more mathematically, the term typical unrelated has to be read as random independent. Thus, the statistical way to answer the above question, is to take two sequences of independently chosen random letters and determine the distribution of the optimal local alignment score. Although this sounds simple, this is by no means the case. The optimal local alignment score is a complicated quantity, since in its calculation a maximization over the set of all possible local alignments of the two sequences is involved. This makes it difficult to calculate its distribution. In fact, from the point of view of probability theory, the resulting model is closely related to so-called percolation models-a class of models that has been actively researched during the past two decades, but for which many challenging problems still wait for their solution.

資料詳細

表示:

非表示:

言語: eng - English

日付: 受理: 2003-05-07

出版の状態: 受理 / 印刷中

ページ: 89 pp

出版情報: Frankfurt am Main : Johann Wolfgang Goethe-Universität

目次: 1 Introduction 1
1.1 Alignments and biology . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Alignments and genealogies . . . . . . . . . . . . . . . 1
1.1.2 The stochastic process of mutations . . . . . . . . . . 2
1.1.3 Getting rid of the ancestral sequence . . . . . . . . . . 4
1.1.4 Local similarities . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Alignments in biological practice . . . . . . . . . . . . 6
1.1.6 The main problem: statistical signi cance . . . . . . . 7
1.1.7 Many sequences vs. two sequences . . . . . . . . . . . 8
1.2 Optimal alignments and percolation theory . . . . . . . . . . 9
1.2.1 Classical rst-passage percolation . . . . . . . . . . . . 9
1.2.2 Optimal alignments as a percolation model . . . . . . 10
1.2.3 Questions arising from the model of optimal scores . . 12
2 Basic ideas and concepts 15
2.1 The level of large deviations . . . . . . . . . . . . . . . . . . . 15
2.1.1 The gapless case . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 The gapped case . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Rate functions and concentration inequalities . . . . . 22
2.2 Convergence rates . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Superadditive global maxima 27
3.1 Independent superadditive processes . . . . . . . . . . . . . . 27
3.2 Large deviations . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Large deviations for the global maximum . . . . . . . . . . . 36
4 Alignment speci c results 43
4.1 Alignments and combinatorial optimization . . . . . . . . . . 43
4.2 Introducing randomness . . . . . . . . . . . . . . . . . . . . . 47
4.3 Concentration inequalities . . . . . . . . . . . . . . . . . . . . 49
4.3.1 Martingale technique . . . . . . . . . . . . . . . . . . . 49
4.3.2 Talagrand's concentration inequality . . . . . . . . . . 51
4.4 Large deviations for optimal global scores . . . . . . . . . . . 53
4.5 Maximizing with a xed starting point . . . . . . . . . . . . . 54
4.6 Large deviations for optimal local scores . . . . . . . . . . . . 55
4.7 Phase transition . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7.1 The logarithmic phase . . . . . . . . . . . . . . . . . . 56
4.7.2 The linear phase . . . . . . . . . . . . . . . . . . . . . 58
4.8 Rate of convergence to the growth constant . . . . . . . . . . 60
4.9 Rate of convergence to . . . . . . . . . . . . . . . . . . . . 63
4.10 Multiple splitting . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.10.1 A relation between LMGFs . . . . . . . . . . . . . . . 67
4.10.2 Exploiting the upper bound . . . . . . . . . . . . . . . 68
5 Conclusions and outlook 71
5.1 Finer tail asymptotics . . . . . . . . . . . . . . . . . . . . . . 71
5.1.1 The gapless case . . . . . . . . . . . . . . . . . . . . . 71
5.1.2 An intermediate case . . . . . . . . . . . . . . . . . . . 72
5.1.3 The gapped case . . . . . . . . . . . . . . . . . . . . . 73
5.2 Assessing the statistical signi cance in practice . . . . . . . . 74
5.2.1 Calculating the parameters . . . . . . . . . . . . . . . 75
A Appendix 77
A.1 Asymptotic relations . . . . . . . . . . . . . . . . . . . . . . . 77
A.2 Convex conjugation . . . . . . . . . . . . . . . . . . . . . . . . 78
A.3 Some calculations . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.4 A lemma for almost additive sequences . . . . . . . . . . . . . 81
A.5 More calculations . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.5.1 Proof of Lemma 16 . . . . . . . . . . . . . . . . . . . . 81
A.5.2 Calculation of ~ f (x) . . . . . . . . . . . . . . . . . . . 83
Bibliography 85
List of Figures
1.1 An alignment of two DNA sequences . . . . . . . . . . . . . . 2
1.2 Time-reversibility . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Path-representation of alignments . . . . . . . . . . . . . . . . 11
2.1 A random walk with negative drift . . . . . . . . . . . . . . . 17
2.2 The two convex conjugate functions r and
. . . . . . . . . . 23
2.3 Splitting paths . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 The problem in Kingman's proof . . . . . . . . . . . . . . . . 33
3.2 The problem with Gartner-Ellis . . . . . . . . . . . . . . . . . 35
3.3 Illustrating Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Illustrating Theorem 2 . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Calculating alignment scores . . . . . . . . . . . . . . . . . . 45
4.2 More splitting paths . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Even more splitting paths . . . . . . . . . . . . . . . . . . . . 61
A.1 Convex conjugation . . . . . . . . . . . . . . . . . . . . . . . . 78

査読: -

識別子（DOI, ISBNなど）: eDoc: 194865

学位: 博士号 (PhD)

アイテム詳細

基本情報

ファイル

関連URL

作成者

内容説明

資料詳細

関連イベント

訴訟

Project information

出版物