Context-specific independence mixture models for cluster analysis of biological data

Georgi, Benjamin



Georgi, B. (in preparation). Context-specific independence mixture models for cluster analysis of biological data.

Item is Released


Basic


Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0010-7D6F-0

Version Permalink: https://hdl.handle.net/11858/00-001M-0000-0010-7D70-A

Genre: Thesis

Files


Georgi.zip (Any fulltext), 3MB


File Permalink:
https://hdl.handle.net/11858/00-001M-0000-0010-7D6E-1

Name:
Georgi.zip

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/zip / [MD5]


Copyright Date:
-

Copyright Info:
eDoc_access: PUBLIC

License:
-


Creators


Creators:
Georgi, Benjamin¹, Author

Affiliations:
¹ Max Planck Society, ou_persistent13

Content


Free keywords: Clustering, mixture models, context-specific independence, transcription factors, proteins, heart disease

Abstract: Clustering is a crucial first step in the exploratory analysis of biological data. This thesis is concerned with cluster analysis of biological data using mixture models, a class of powerful and versatile statistical models. We develop an extension of conventional mixtures in the form of the context-specific independence (CSI) framework. CSI mixtures are particularly suited to the analysis of biological data because they perform robustly in the presence of noise and uninformative features. This is achieved by adapting the model complexity to the degree of variation observed in a given data set. We present a learning algorithm for CSI mixtures in a Bayesian framework and apply CSI mixture clustering to data sets of transcription factor binding sites, protein sequences, and complex disease phenotype data.

Details


Language(s): eng - English

Dates: Modified: 2009-06-10

Publication Status: Not specified

Pages: XII, 124

Publishing info: Berlin : Freie Universität Berlin

Table of Contents:
Preface xi
1 Introduction 1
1.1 Biological Mass data 1
1.2 Thesis Overview 4
2 Finite Mixture Models 7
2.1 Mixture Models 7
2.1.1 Atomic Distributions 10
2.1.2 Mixture Models from Different Perspectives 11
2.1.3 Sampling from a Mixture 12
2.2 Expectation Maximization (EM) Algorithm 12
2.2.1 General Formulation 13
2.2.2 EM for Mixture Models 15
2.2.3 Parameter Estimators 16
2.2.4 Drawbacks of the EM Algorithm 18
2.3 Mixture Models for Clustering 19
2.3.1 Model Selection 19
2.3.2 Clustering Evaluation 21
2.3.3 Handling of Missing Data 22
2.3.4 Dealing with Noisy Data Sets 23
2.4 Bayesian Mixture Models 24
2.4.1 Conjugate Priors 26
2.4.2 Parameter Estimators 27
2.5 Partially-supervised learning 29
3 Context-specific Independence Mixture Models 31
3.1 Prior Work 31
3.2 Context-specific Independence (CSI) 32
3.3 CSI for Mixture Models 33
3.3.1 CSI from Different Perspectives 35
3.4 Bayesian CSI Mixtures 36
3.5 Structural EM Algorithm 37
3.5.1 General Formulation 37
3.5.2 Structural EM for Bayesian CSI Mixture Models 38
3.5.3 Structure Parameter Estimators 38
3.6 CSI Mixtures and Clustering 41
3.6.1 Interpretation of the CSI Structure 42
3.6.2 Feature Ranking 42
4 Structure Learning Algorithm 45
4.1 Algorithm Overview 45
4.2 Combinatorial Complexity 45
4.3 Structure Space Search Strategies 46
4.3.1 Choosing the Structure Prior 46
4.3.2 Search Strategy Evaluation 48
4.4 Running Time Optimization 50
4.4.1 Feature-wise Caching 51
4.4.2 Candidate Structure Graph 51
4.4.3 Posterior bounds 52
4.4.4 Structure Learning Running Time 55
5 Mixture Modeling for Transcription Factor Binding Sites 59
5.1 Introduction 59
5.2 TFBS Modeling 62
5.3 Results 62
5.3.1 Simulation Studies 62
5.3.2 Analysis of TF LEU3 64
5.3.3 Conservation Statistics 65
5.3.4 Examples of Binding Site Subgroups 68
6 Clustering of Protein Families Using Mixtures 71
6.1 Introduction 71
6.2 Dirichlet Mixture Priors 73
6.3 Prior Parameter Derivation 74
6.4 Feature Ranking 76
6.5 Results 76
6.5.1 L-lactate Dehydrogenase Family 76
6.5.2 Protein Kinase Family 78
6.5.3 Nucleotidyl Cyclase Family 80
6.5.4 Partially-supervised Protein Clustering 82
7 Clustering of Heart Disease Phenotype Data 85
7.1 Introduction 85
7.2 Data Set 86
7.3 Results 87
8 Discussion 93
8.1 CSI Mixture Models & Structure Learning 93
8.2 Transcription Factor Data 94
8.3 Protein Family Data 95
8.4 Heart Disease Phenotype Data 96
Bibliography 99
A Notation 115
B Abbreviations 117
C Nucleotide & Amino Acid Codes 119
D Random CSI Models 121
E Zusammenfassung 123

Rev. Type: -

Identifiers: eDoc: 446312

Degree: PhD
