



Conference Paper

Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources


Seldin, Y
Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society;


Seldin, Y., Bejerano, G., & Tishby, N. (2001). Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources. In 18th International Conference on Machine Learning (ICML 2001) (pp. 513-520).

Abstract
We present a novel information-theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm appears to be self-regulated and automatically avoids over-segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages, where it correctly infers both the number of languages and the language switching points. When applied to protein sequence families, the method identifies biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.
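
The competitive-learning idea in the abstract can be illustrated with a much simplified sketch. It is not the authors' algorithm: it substitutes fixed-order Markov models with add-one smoothing for Prediction Suffix Trees, uses hard window assignments instead of rate-distortion clustering with deterministic annealing, and omits the MDL criterion. The function names `train` and `segment`, the window length, and the block-wise initialization are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train(windows, alphabet_size, order=1):
    """Fit a fixed-order Markov model (a stand-in for a PST, by assumption)."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in windows:
        for i in range(order, len(w)):
            counts[w[i - order:i]][w[i]] += 1

    def log_prob(window):
        # Log-likelihood of a window under the model, with add-one smoothing.
        total = 0.0
        for i in range(order, len(window)):
            ctx = counts.get(window[i - order:i], {})
            seen = ctx.get(window[i], 0)
            total += math.log((seen + 1) / (sum(ctx.values()) + alphabet_size))
        return total

    return log_prob


def segment(seq, k=2, window=10, order=1, max_iter=20):
    """Hard competitive learning: models compete for fixed-length windows."""
    windows = [seq[i:i + window] for i in range(0, len(seq), window)]
    alphabet_size = len(set(seq))
    # Hypothetical initialization: split the windows into k contiguous blocks.
    labels = [i * k // len(windows) for i in range(len(windows))]
    for _ in range(max_iter):
        # Retrain each model on the windows it currently owns.
        models = [
            train([w for w, lab in zip(windows, labels) if lab == j],
                  alphabet_size, order)
            for j in range(k)
        ]
        # Reassign each window to the model that predicts it best.
        new_labels = [max(range(k), key=lambda j: models[j](w)) for w in windows]
        if new_labels == labels:
            break
        labels = new_labels
    # A switching point is the start offset of a window whose label changes.
    switches = [i * window for i in range(1, len(labels))
                if labels[i] != labels[i - 1]]
    return labels, switches


labels, switches = segment("ab" * 20 + "cd" * 40)
print(labels, switches)
```

In this toy setting the number of sources `k` is fixed in advance and switching points can only fall on window boundaries; the abstract describes a method that infers both the number of sources and the switching points.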