Context-specific independence mixture models for cluster analysis of biological data

Georgi, Benjamin

Georgi, B. (in preparation). Context-specific independence mixture models for cluster analysis of biological data.

Item is released

Basic data

Record permalink: https://hdl.handle.net/11858/00-001M-0000-0010-7D6F-0
Version permalink: https://hdl.handle.net/11858/00-001M-0000-0010-7D70-A

Genre: Thesis

Files

Georgi.zip (any full text), 3 MB

File permalink:
https://hdl.handle.net/11858/00-001M-0000-0010-7D6E-1

Name:
Georgi.zip

Description:
-

OA status:

Visibility:
Public

MIME type / checksum:
application/zip / [MD5]

Copyright date:
-

Copyright Info:
eDoc_access: PUBLIC

License:
-

External references

Creators

Creators:
Georgi, Benjamin¹, Author

Affiliations:
¹ Max Planck Society, ou_persistent13

Content

Keywords: clustering, mixture models, context-specific independence, transcription factors, proteins, heart disease

Abstract: Clustering is a crucial first step in the exploratory analysis of biological data. This thesis is concerned with cluster analysis of biological data using mixture models. Mixture models are a class of powerful and versatile statistical models. We develop an extension of conventional mixtures in the form of the context-specific independence (CSI) framework. CSI mixtures are particularly suited to the analysis of biological data because they perform robustly in the presence of noise and uninformative features. This is achieved by adapting the model complexity to the degree of variation observed in a given data set. We present a learning algorithm for CSI mixtures in a Bayesian framework. We apply CSI mixture clustering to data sets of transcription factor binding sites, protein sequences, and complex disease phenotype data.

Details

Language(s): eng - English

Date: Modified: 2009-06-10

Publication status: Not specified

Pages: XII, 124

Place, publisher, edition: Berlin : Freie Universität Berlin

Table of contents:
Preface xi
1 Introduction 1
1.1 Biological Mass data 1
1.2 Thesis Overview 4
2 Finite Mixture Models 7
2.1 Mixture Models 7
2.1.1 Atomic Distributions 10
2.1.2 Mixture Models from Different Perspectives 11
2.1.3 Sampling from a Mixture 12
2.2 Expectation Maximization (EM) Algorithm 12
2.2.1 General Formulation 13
2.2.2 EM for Mixture Models 15
2.2.3 Parameter Estimators 16
2.2.4 Drawbacks of the EM Algorithm 18
2.3 Mixture Models for Clustering 19
2.3.1 Model Selection 19
2.3.2 Clustering Evaluation 21
2.3.3 Handling of Missing Data 22
2.3.4 Dealing with Noisy Data Sets 23
2.4 Bayesian Mixture Models 24
2.4.1 Conjugate Priors 26
2.4.2 Parameter Estimators 27
2.5 Partially-supervised learning 29
3 Context-specific Independence Mixture Models 31
3.1 Prior Work 31
3.2 Context-specific Independence (CSI) 32
3.3 CSI for Mixture Models 33
3.3.1 CSI from Different Perspectives 35
3.4 Bayesian CSI Mixtures 36
3.5 Structural EM Algorithm 37
3.5.1 General Formulation 37
3.5.2 Structural EM for Bayesian CSI Mixture Models 38
3.5.3 Structure Parameter Estimators 38
3.6 CSI Mixtures and Clustering 41
3.6.1 Interpretation of the CSI Structure 42
3.6.2 Feature Ranking 42
4 Structure Learning Algorithm 45
4.1 Algorithm Overview 45
4.2 Combinatorial Complexity 45
4.3 Structure Space Search Strategies 46
4.3.1 Choosing the Structure Prior 46
4.3.2 Search Strategy Evaluation 48
4.4 Running Time Optimization 50
4.4.1 Feature-wise Caching 51
4.4.2 Candidate Structure Graph 51
4.4.3 Posterior bounds 52
4.4.4 Structure Learning Running Time 55
5 Mixture Modeling for Transcription Factor Binding Sites 59
5.1 Introduction 59
5.2 TFBS Modeling 62
5.3 Results 62
5.3.1 Simulation Studies 62
5.3.2 Analysis of TF LEU3 64
5.3.3 Conservation Statistics 65
5.3.4 Examples of Binding Site Subgroups 68
6 Clustering of Protein Families Using Mixtures 71
6.1 Introduction 71
6.2 Dirichlet Mixture Priors 73
6.3 Prior Parameter Derivation 74
6.4 Feature Ranking 76
6.5 Results 76
6.5.1 L-lactate Dehydrogenase Family 76
6.5.2 Protein Kinase Family 78
6.5.3 Nucleotidyl Cyclase Family 80
6.5.4 Partially-supervised Protein Clustering 82
7 Clustering of Heart Disease Phenotype Data 85
7.1 Introduction 85
7.2 Data Set 86
7.3 Results 87
8 Discussion 93
8.1 CSI Mixture Models & Structure Learning 93
8.2 Transcription Factor Data 94
8.3 Protein Family Data 95
8.4 Heart Disease Phenotype Data 96
Bibliography 99
A Notation 115
B Abbreviations 117
C Nucleotide & Amino Acid Codes 119
D Random CSI Models 121
E Zusammenfassung 123

Review type: -

Identifiers: eDoc: 446312

Degree type: Doctoral thesis
