hide
Free keywords:
-
Abstract:
We describe a novel algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources, first presented in [SBT01]. The algorithm is based on competitive learning between Markov models, when
implemented as Prediction Suffix Trees [RST96] using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of
sequences between alternating Markov sources. The method is applied successfully to unsupervised segmentation of multilingual texts into languages
where it is able to infer correctly both the number of languages and the language switching points. When applied to protein sequence families (results of the [BSMT01] work), we demonstrate the methods ability to identify biologically meaningful sub-sequences within the proteins, which correspond to
signatures of important functional sub-units called domains. Our approach to proteins classification (through the obtained signatures) is shown to have both conceptual and practical advantages over the currently used methods.