  Context-specific independence mixture models for cluster analysis of biological data

Georgi, B. (in preparation). Context-specific independence mixture models for cluster analysis of biological data.

Files

Georgi.zip (Any fulltext), 3MB
Name: Georgi.zip
Description: -
OA-Status:
Visibility: Public
MIME-Type / Checksum: application/zip / [MD5]
Technical Metadata:
Copyright Date: -
Copyright Info: eDoc_access: PUBLIC
License: -

Creators

 Creators:
Georgi, Benjamin (1), Author
Affiliations:
(1) Max Planck Society, ou_persistent13

Content

Free keywords: Clustering; mixture models; context-specific independence; transcription factors; proteins; heart disease
 Abstract: Clustering is a crucial first step in the exploratory analysis of biological data. This thesis is concerned with cluster analysis of biological data using mixture models, a class of powerful and versatile statistical models. We develop an extension to conventional mixtures in the form of the context-specific independence (CSI) framework. CSI mixtures are particularly suited to the analysis of biological data because they perform robustly in the presence of noise and uninformative features. This is achieved by adapting the model complexity to the degree of variation observed in a given data set. We present a learning algorithm for CSI mixtures in a Bayesian framework. We apply CSI mixture clustering to data sets of transcription factor binding sites, protein sequences and complex disease phenotype data.

Details

Language(s): eng - English
 Dates: 2009-06-10
 Publication Status: Not specified
 Pages: XII, 124
 Publishing info: Berlin : Freie Universität Berlin
 Table of Contents:
Preface xi
1 Introduction 1
  1.1 Biological Mass data 1
  1.2 Thesis Overview 4
2 Finite Mixture Models 7
  2.1 Mixture Models 7
    2.1.1 Atomic Distributions 10
    2.1.2 Mixture Models from Different Perspectives 11
    2.1.3 Sampling from a Mixture 12
  2.2 Expectation Maximization (EM) Algorithm 12
    2.2.1 General Formulation 13
    2.2.2 EM for Mixture Models 15
    2.2.3 Parameter Estimators 16
    2.2.4 Drawbacks of the EM Algorithm 18
  2.3 Mixture Models for Clustering 19
    2.3.1 Model Selection 19
    2.3.2 Clustering Evaluation 21
    2.3.3 Handling of Missing Data 22
    2.3.4 Dealing with Noisy Data Sets 23
  2.4 Bayesian Mixture Models 24
    2.4.1 Conjugate Priors 26
    2.4.2 Parameter Estimators 27
  2.5 Partially-supervised learning 29
3 Context-specific Independence Mixture Models 31
  3.1 Prior Work 31
  3.2 Context-specific Independence (CSI) 32
  3.3 CSI for Mixture Models 33
    3.3.1 CSI from Different Perspectives 35
  3.4 Bayesian CSI Mixtures 36
  3.5 Structural EM Algorithm 37
    3.5.1 General Formulation 37
    3.5.2 Structural EM for Bayesian CSI Mixture Models 38
    3.5.3 Structure Parameter Estimators 38
  3.6 CSI Mixtures and Clustering 41
    3.6.1 Interpretation of the CSI Structure 42
    3.6.2 Feature Ranking 42
4 Structure Learning Algorithm 45
  4.1 Algorithm Overview 45
  4.2 Combinatorial Complexity 45
  4.3 Structure Space Search Strategies 46
    4.3.1 Choosing the Structure Prior 46
    4.3.2 Search Strategy Evaluation 48
  4.4 Running Time Optimization 50
    4.4.1 Feature-wise Caching 51
    4.4.2 Candidate Structure Graph 51
    4.4.3 Posterior bounds 52
    4.4.4 Structure Learning Running Time 55
5 Mixture Modeling for Transcription Factor Binding Sites 59
  5.1 Introduction 59
  5.2 TFBS Modeling 62
  5.3 Results 62
    5.3.1 Simulation Studies 62
    5.3.2 Analysis of TF LEU3 64
    5.3.3 Conservation Statistics 65
    5.3.4 Examples of Binding Site Subgroups 68
6 Clustering of Protein Families Using Mixtures 71
  6.1 Introduction 71
  6.2 Dirichlet Mixture Priors 73
  6.3 Prior Parameter Derivation 74
  6.4 Feature Ranking 76
  6.5 Results 76
    6.5.1 L-lactate Dehydrogenase Family 76
    6.5.2 Protein Kinase Family 78
    6.5.3 Nucleotidyl Cyclase Family 80
    6.5.4 Partially-supervised Protein Clustering 82
7 Clustering of Heart Disease Phenotype Data 85
  7.1 Introduction 85
  7.2 Data Set 86
  7.3 Results 87
8 Discussion 93
  8.1 CSI Mixture Models & Structure Learning 93
  8.2 Transcription Factor Data 94
  8.3 Protein Family Data 95
  8.4 Heart Disease Phenotype Data 96
Bibliography 99
A Notation 115
B Abbreviations 117
C Nucleotide & Amino Acid Codes 119
D Random CSI Models 121
E Zusammenfassung 123
 Rev. Type: -
 Identifiers: eDoc: 446312
 Degree: PhD
