Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

Dubey, A; Hwang, S; Rangel, C; Rasmussen, CE; Ghahramani, Z; Wild, DL

Conference Paper

MPS-Authors

Rasmussen, CE
Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society;
Max Planck Institute for Biological Cybernetics, Max Planck Society;

External Resource

http://psb.stanford.edu/previous/psb04/
(Table of contents)

Fulltext (public)

pdf2373.pdf
(Any fulltext), 182KB

Citation

Dubey, A., Hwang, S., Rangel, C., Rasmussen, C., Ghahramani, Z., & Wild, D. (2004). Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models. In Pacific Symposium on Biocomputing (PSB 2004) (pp. 399-410). Singapore: World Scientific Publishing.

Cite as: https://hdl.handle.net/11858/00-001M-0000-0013-F3A7-5

Abstract

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies, etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures, and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.
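
The abstract does not say how the probability that two proteins share a cluster is computed. One common way to estimate such a quantity, assuming the model is fitted by a sampler that returns a cluster labelling of all sequences at each iteration, is to count how often each pair receives the same label. The sketch below illustrates that idea only; the function name `co_clustering_probabilities` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def co_clustering_probabilities(samples):
    """Estimate P(items i and j share a cluster) as the fraction of
    sampled labellings in which i and j carry the same cluster label.

    `samples` is a list of labellings; each labelling is a list with one
    cluster label per item. This is an illustrative assumption about the
    sampler output, not the paper's data format.
    """
    n_items = len(samples[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in samples:
        if len(labels) != n_items:
            raise ValueError("every sample must label the same items")
        # An item always shares a cluster with itself.
        for i in range(n_items):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it for symmetry.
        for i, j in combinations(range(n_items), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]


# Hypothetical labels for four sequences over four sampler iterations.
# The number of distinct clusters differs between samples, which is the
# sense in which the number of components is not fixed in advance.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
    [0, 1, 2, 2],
]
probs = co_clustering_probabilities(samples)
print(probs[0][1])  # sequences 0 and 1 share a label in 3 of 4 samples
print(probs[2][3])  # sequences 2 and 3 share a label in 2 of 4 samples
```

Because only equality of labels within a single sample is compared, the estimate is unaffected by cluster labels being renumbered from one iteration to the next.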

A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/~wild/PSB04