de.mpg.escidoc.pubman.appbase.FacesBean
English
 
Help Guide Privacy Policy Disclaimer Contact us
  Advanced SearchBrowse

Item

ITEM ACTIONSEXPORT
 
 
 
 
DownloadE-Mail
  Algorithms for efficient alignment-free sequence comparison

Domazet-Lošo, M. (2010). Algorithms for efficient alignment-free sequence comparison. PhD Thesis, University of Zagreb, Zagreb.

Item is

Basic

show hide
Item Permalink: http://hdl.handle.net/11858/00-001M-0000-000F-D451-8 Version Permalink: http://hdl.handle.net/11858/00-001M-0000-000F-D452-6
Genre: Thesis
Alternative Title : Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja

Files

show Files
hide Files
:
Thesis_Mirjana Domazet-Loso.pdf (Publisher version), 2MB
Description:
-
Visibility:
Public
MIME-Type / Checksum:
application/pdf / [MD5]
Technical Metadata:
Copyright Date:
-
Copyright Info:
-
License:
-

Locators

show

Creators

show
hide
 Creators:
Domazet-Lošo, Mirjana1, Author              
Haubold, Bernhard1, Advisor              
Ristov, Strahil, Referee
Affiliations:
1Research Group Bioinformatics, Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society, escidoc:1445644              

Content

show
hide
Free keywords: alignment-free method; evolutionary distance; local sequence homology; genome comparison; HIV; horizontal gene transfer; suffix tree; suffix array; shortest unique substringing
 Abstract: Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

Details

show
hide
Language(s): eng - English
 Dates: 2010-10-07
 Publication Status: Accepted / In Press
 Pages: III, 127 S.
 Publishing info: Zagreb : University of Zagreb
 Table of Contents: Table of Contents
1. GENERAL INTRODUCTION.........................................................................1
1.1. Suffix trees and other index data structures used in biological sequence analysis.....................................................................................................................9
1.1.1. Suffix Tree..........................................................................................11
1.1.2. The space and the time complexity of the algorithms for the suffix tree construction.......................................................................................................13
1.1.3. Suffix Array........................................................................................14
1.1.4. The space and the time complexity of the algorithms for suffix array construction.......................................................................................................15
1.1.5. Enhanced Suffix Array.......................................................................17
1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm 21
1.1.7. Self-index...........................................................................................22
1.1.8. Burrows-Wheeler transform..............................................................23
1.1.9. The FM-Index and the backward search algorithm..........................25
1.1.10. The space and the time-complexity of the FM-index.........................29
2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES...............................................................................................................31
2.1. Introduction................................................................................................31
2.2. Methods.....................................................................................................33
2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr 33
2.2.2. Problem statement.............................................................................35
2.2.3. Time complexity analysis of the previous approach (kr 1)................35
2.2.4. Time complexity analysis of the new approach (kr 2).......................37
2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences................................................................38
2.2.6. The implementation of kr version 2...................................................44
2.3. Analysis of Kr on simulated data sets........................................................45
2.3.1. Auxiliary programs............................................................................45
2.3.2. Consistency of Kr...............................................................................46 i
2.3.3. The affect of horizontal gene transfer on the accuracy of Kr............48
2.3.4. The effect of genome duplication on the accuracy of Kr....................49
2.3.5. Run time comparison of kr 1 and kr 2...............................................50
2.4. Application of kr version 2........................................................................53
2.4.1. Auxililary software used for the analysis of real data sets................56
2.4.2. The analysis of 12 Drosophila genomes............................................57
2.4.3. The analysis of 13 Escherichia coli and Shigella genomes...............58
2.4.4. The analysis of 825 HIV-1 pure subtype genomes.............................61
2.5. Discussion..................................................................................................62
3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY....................................................................................66
3.1. Introduction................................................................................................66
3.2. Methods.....................................................................................................69
3.2.1. Problem statement – determining subtype(s) of a query sequence....69
3.2.2. Construction of locally homologous segments..................................71
3.2.3. Time complexity of computing a list of intervals Ii............................72
3.2.4. Algorithm 2: Construction of an interval tree...................................73
3.2.5. Computing a list of segements Gi.......................................................80
3.3. Analysis of st on simulated data sets.........................................................82
3.3.1. Run-time and memory usage analysis of st........................................82
3.3.2. Consistency of st................................................................................85
3.3.3. Comparison to SCUEAL on simulated data sets...............................92
3.4. Application of st.........................................................................................97
3.4.1. The analysis of Neisseria meningitidis..............................................98
3.4.2. The analysis of a recombinant form of HIV-1...................................99
3.4.3. The analysis of circulating recombinant forms of HIV-1................103
3.4.4. The analysis of an avian pathogenic Escherichia coli strain..........104
3.5. Discussion................................................................................................107
4. CONCLUSION..............................................................................................110
5. REFERENCES..............................................................................................112
6. ELECTRONIC SOURCES...........................................................................121
7. LIST OF ABBREVIATIONS AND SYMBOLS.........................................122 ii
iii
ABSTRACT............................................................................................................124
SAŽETAK..............................................................................................................125
CURRICULUM VITAE........................................................................................126
ŽIVOTOPIS...........................................................................................................127
 Rev. Method: -
 Identifiers: eDoc: 544378
Other: Diss/12255
 Degree: PhD

Event

show

Legal Case

show

Project information

show

Source

show