Algorithms for efficient alignment-free sequence comparison

Domazet-Lošo, Mirjana

Local TagsRelease HistoryDetailsSummary

Algorithms for efficient alignment-free sequence comparison

Domazet-Lošo, M. (2010). Algorithms for efficient alignment-free sequence comparison. PhD Thesis, University of Zagreb, Zagreb.

Item is Released

show all hide all

Basic

show hide

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-000F-D451-8 Version Permalink: https://hdl.handle.net/11858/00-001M-0000-000F-D452-6

Genre: Thesis

Alternative Title : Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja

Files

show Files

hide Files

:

Thesis_Mirjana Domazet-Loso.pdf (Publisher version), 2MB

View Save

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-000F-D450-A

Name:
Thesis_Mirjana Domazet-Loso.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Technical Metadata:

View

Copyright Date:
-

Copyright Info:
-

License:
-

Locators

show

Creators

show

hide

Creators:
Domazet-Lošo, Mirjana¹, Author
Haubold, Bernhard¹, Advisor
Ristov, Strahil, Referee

Affiliations:
1Research Group Bioinformatics, Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society, ou_1445644

Content

show

hide

Free keywords: alignment-free method; evolutionary distance; local sequence homology; genome comparison; HIV; horizontal gene transfer; suffix tree; suffix array; shortest unique substringing

Abstract: Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

Details

show

hide

Language(s): eng - English

Dates: Accepted: 2010-10-07

Publication Status: Accepted / In Press

Pages: III, 127 S.

Publishing info: Zagreb : University of Zagreb

Table of Contents: Table of Contents
1. GENERAL INTRODUCTION.........................................................................1
1.1. Suffix trees and other index data structures used in biological sequence analysis.....................................................................................................................9
1.1.1. Suffix Tree..........................................................................................11
1.1.2. The space and the time complexity of the algorithms for the suffix tree construction.......................................................................................................13
1.1.3. Suffix Array........................................................................................14
1.1.4. The space and the time complexity of the algorithms for suffix array construction.......................................................................................................15
1.1.5. Enhanced Suffix Array.......................................................................17
1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm 21
1.1.7. Self-index...........................................................................................22
1.1.8. Burrows-Wheeler transform..............................................................23
1.1.9. The FM-Index and the backward search algorithm..........................25
1.1.10. The space and the time-complexity of the FM-index.........................29
2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES...............................................................................................................31
2.1. Introduction................................................................................................31
2.2. Methods.....................................................................................................33
2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr 33
2.2.2. Problem statement.............................................................................35
2.2.3. Time complexity analysis of the previous approach (kr 1)................35
2.2.4. Time complexity analysis of the new approach (kr 2).......................37
2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences................................................................38
2.2.6. The implementation of kr version 2...................................................44
2.3. Analysis of Kr on simulated data sets........................................................45
2.3.1. Auxiliary programs............................................................................45
2.3.2. Consistency of Kr...............................................................................46 i
2.3.3. The affect of horizontal gene transfer on the accuracy of Kr............48
2.3.4. The effect of genome duplication on the accuracy of Kr....................49
2.3.5. Run time comparison of kr 1 and kr 2...............................................50
2.4. Application of kr version 2........................................................................53
2.4.1. Auxililary software used for the analysis of real data sets................56
2.4.2. The analysis of 12 Drosophila genomes............................................57
2.4.3. The analysis of 13 Escherichia coli and Shigella genomes...............58
2.4.4. The analysis of 825 HIV-1 pure subtype genomes.............................61
2.5. Discussion..................................................................................................62
3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY....................................................................................66
3.1. Introduction................................................................................................66
3.2. Methods.....................................................................................................69
3.2.1. Problem statement – determining subtype(s) of a query sequence....69
3.2.2. Construction of locally homologous segments..................................71
3.2.3. Time complexity of computing a list of intervals Ii............................72
3.2.4. Algorithm 2: Construction of an interval tree...................................73
3.2.5. Computing a list of segements Gi.......................................................80
3.3. Analysis of st on simulated data sets.........................................................82
3.3.1. Run-time and memory usage analysis of st........................................82
3.3.2. Consistency of st................................................................................85
3.3.3. Comparison to SCUEAL on simulated data sets...............................92
3.4. Application of st.........................................................................................97
3.4.1. The analysis of Neisseria meningitidis..............................................98
3.4.2. The analysis of a recombinant form of HIV-1...................................99
3.4.3. The analysis of circulating recombinant forms of HIV-1................103
3.4.4. The analysis of an avian pathogenic Escherichia coli strain..........104
3.5. Discussion................................................................................................107
4. CONCLUSION..............................................................................................110
5. REFERENCES..............................................................................................112
6. ELECTRONIC SOURCES...........................................................................121
7. LIST OF ABBREVIATIONS AND SYMBOLS.........................................122 ii
iii
ABSTRACT............................................................................................................124
SAŽETAK..............................................................................................................125
CURRICULUM VITAE........................................................................................126
ŽIVOTOPIS...........................................................................................................127

Rev. Type: -

Identifiers: eDoc: 544378
Other: Diss/12255

Degree: PhD

Event

show

Legal Case

show

Project information

show

Source

show