Multiple Genome Alignment: Chaining Algorithms Revisited

Multiple Genome Alignment: Chaining Algorithms Revisited www.phwiki.com

Multiple Genome Alignment: Chaining Algorithms Revisited

Christie, Jeanne, Freelance Journalist has reference to this Academic Journal, PHwiki organized this Journal Multiple Genome Alignment: Chaining Algorithms Revisited Mohamed I. Abouelhoda University of Bielefeld Joint work with Enno Ohlebusch University of Ulm Germany 2003 Comparative Genomics The practice of analysing the genomic material of a species by comparing it with the genomic material of other species. Why is this important The next logical step after the high throughput sequencing projects. Deducing the mechanism in addition to history of genome evolution. Discovering genes in addition to regulatory elements. Identifying exons in eukaryotic genes. Revealing the role of non-coding conserved sequences. The Strategy Closely-related organisms No (or few) genome rearrangements.

Andrews University US www.phwiki.com

This Particular University is Related to this Particular Journal

Multiple Genome Alignment is Difficult St in addition to ard dynamic programming takes O(N k) inpractical even as long as k=2 N is very large Mega bases Given k genomes, each of average length N Heuristic algorithms are there as long as e employed These tools use anchor-based alignment method MGA MGA uses a strategy composed of three steps: First Genome G1 Second Genome G2 Third Genome G3 Computation of fragments (maximal multiple exact matches). Computation of an optimal chain of colinear non-overlapping fragments. Detailed alignment of the regions between the fragments of the optimal chain. MGA MGA uses a strategy composed of three steps: Computation of fragments (maximal multiple exact matches). Computation of an optimal chain of colinear non-overlapping fragments. Detailed alignment of the regions between the fragments of the optimal chain.

MGA MGA uses a strategy composed of three steps: Computation of fragments (maximal multiple exact matches). Computation of an optimal chain of colinear non-overlapping fragments (anchors). Detailed alignment of the regions between the fragments of the optimal chain. anchors The Chaining Problem Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments such that its total score is maximum over all other chains. score(C)= i [ fi+1 .weight – g(fi+1, fi)] where g(fi+1, fi) is the gap cost of connecting fi+1 to fi The Chaining Problem Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments such that its total score is maximum over all other chains. First Genome G1 Second Genome G2 Third Genome G3 score(C)= i [ fi+1 .weight – g(fi+1, fi)] fi+1 fi where g(fi+1, fi) is the gap cost of connecting fi+1 to fi

Previous Work Graph based solution takes O(n2) time. Geometric based algorithm is subquadratic (sparse dynamic programming): Zhang et al. (1994) used space division with a kd-tree (no complexity analysis was given). Myers in addition to Miller (1995) used orthogonal range search with a range tree yielding a complexity of O(n log k n) time in addition to O(n log k-1 n) space. Soble-Martinez, 1986 Wilbur-Lipman, 1983 Eppstein-Giancarlo, 1992 Previous Work For two genomes the complexities are also higher than those of known 2-dim. chaining algorithms O(n log 2 n) time in addition to O(n log n) space. Here, it is improved by almost two log factors in time in addition to one log factor in space O(n log n) time in addition to O(n) space. For k=2 For k>2 O(n log k-2 n log log n) time in addition to O(n log k-2 n) space. Myers & Miller The Problem Revisited fi fi+1: end( fi).xr < start(fi+1).xr as long as all r, 0 < r <= k Any kind of fragment can be used (fragments can contain also mismatches, insertions/deletions). A fragment fi is represented as a hyper-rectangle in a k-dimensional space. A fragment fi is identified with its start in addition to end points: start(fi) in addition to end( fi). We add two imaginary fragments O in addition to t with weight zero. Any two fragments fi in addition to fi+1 in the chain must be colinear in addition to non-overlapping The Solution fj.score=fj.weight+max{fi.score-g( fi , fj ): fi fj} where fi fj : end( fi).xr < start(fj).xr as long as all r, 0 < r <= k g( fi , fj ) is the gap cost of connecting fi to fj score(C)= i [ fi .weight - g(fi, fi-1)] An optimal chain is a chain of maximum score The score of a chain C is The maximum score can be computed by the recurrence fj A graph based solution takes O(n2) time. Geometric-based Solution The recurrence fj.score=fj.weight+max{fi.score-g( fi , fj ): fi fj} can be written as RMQ (Range Maximum Query) Retrieves the fragment fi whose end point Iies in the hyper-rectangle bounded by start(fj) in addition to O such that fi.score-g( fi , fj ) is maximum. fj.score=fj.weight+RMQ{O, start(fj)} fj Overview of the Algorithm The algorithm uses techniques from computational geometry 1. Line-sweep algorithm. 2. The algorithm works on the start in addition to end points of the fragments. 3. RMQ using a semi-dynamic data structure: the range tree. 4. Proper inclusion of the gap costs into the fragment weight. The Algorithm without Gap Costs fj.score=fj.weight+RMQ{O, start(fj)} The recurrence is Line-sweep algorithm 1. Sort the start in addition to end points of the fragments w.r.t. x1 2. If a start point of a fragment, say fj, is scanned apply the RMQ(O, (start(fj).x2, , start(fj).xk)) to the set of active end points in addition to update the score of the end point of fragment fj. 3. Otherwise, add the end point to the set of active end points (already scanned end points). The first step reduces the dimension of the RMQ to k-1. If the gap cost is zero, a RMQ returns the end point of the fragment fi such that is maximum. The Complexity of the Algorithm The complexity of the algorithm depends on the complexity of the RMQ in d= k-1 dimensions The required data structure D to manipulate the set of points is a semi-dynamic data structure over all end points in d= k-1 dimensions supports the operations: Activate an end point. Per as long as m a RMQ. D is implemented as a range tree O(n log k-2 n log log n) time in addition to O(n log k-2 n) space For n fragments in addition to dimension d, the RMQ in addition to activation takes: Since d= k-1>1, the complexity of the algorithm is O(n log d-1 n log log n) time in addition to O(n log d-1 n) space supported by fractional cascading. enhanced with priority queues. Willard, 1985 van Emde Boas, 1977 Johnson, 1982 Including Gap Costs The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic. fj.score=fj.weight+max{fi.score-g( fi , fj ): fi fj} fj.score=fj.weight+RMQ{O, start(fj)} Recall the recurrence How to define the gap costs How to include the gap costs without affecting the complexity

Types of Gap Costs L1 L The gap costs g can be described geometrically: ACC XX – – – – – ACC ACC – – YYY – – ACC ACC -XX ACC ACC YYY ACC ACC – – – – – ZZ ACC ACC – ZZ ACC The L in addition to the sum-of-pairs gap cost follow the same idea as the L1 Including Gap Costs The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic. fj.score=fj.weight+max{fi.score-g( fi , fj ): fi fj} fj.score=fj.weight+RMQ{O, start(fj)} Recall the recurrence Including Gap Costs in L1 gc( f) = d1(t, end(f)) We define the geometric cost of a fragment f as follows: where d1(t, end(f) is the distance in the L1 metric between t in addition to end(f). f 1.score – g( f 1 , f ) > f 2.score – g( f 2 , f ) iff f 1.score – gc( f 1) > f 2.score – gc( f 2) f 1 f 2 gc( f) is a constant that can be precomputed in addition to attached to the fragment’s weight O(n log k-2 n log log n) time in addition to O(n log k-2 n) space For k>2, the complexity is O(n log n) time in addition to O(n) space For k=2, the complexity is f

Gap Costs in L gc( f) = d(t, end(f)) The geometric cost of a fragment f is then: f 1.score – g( f 1 , f ) > f 3.score – g( f 3 , f ) iff f 1.score – gc( f 1) > f 3.score – gc( f 3) f 1 f 2 gc( f) is a constant that can be precomputed in addition to attached to the fragment’s weight In the octant O1, In the octant O2, f 3 O1 O2 RMQ must be per as long as med in every octant in addition to the point of maximum score is chosen. RMQ on Octants Because the RMQ requires an orthogonal range, we use the octant-to-quadrant trans as long as mation: For the octant O1 For the octant O2 The total complexity of the algorithm depends on the space divison. O(k! n log k-2 n log log n) time in addition to O(k! n log k-2 n) space For k>2, the complexity is O(2 n log n) time in addition to O(2 n) space For k=2, the complexity is Example: 4 Strains Staphylococcus 1. Staphlyococcus aureus N315 NC-002745 (2853924) 2. Staphlyococcus aureus Mu50 NC-002758 (2919236) 3. Staphlyococcus aureus MW2 NC-003923 (2860842) 4. Staphlyococcus epidermidis NC-004461 (2535068) Fragments min. len. 15 of 1-2 200,000 Fragments

Christie, Jeanne Christie, Jeanne Freelance Journalist www.phwiki.com

Example: 4 Strains Staphylococcus 1-2 1-3 1-4 1. Staphlyococcus aureus N315 NC-002745 (2853924) 2. Staphlyococcus aureus Mu50 NC-002758 (2919236) 3. Staphlyococcus aureus MW2 NC-003923 (2860842) 4. Staphlyococcus epidermidis NC-004461 (2535068) 3-4 2-4 2-3 Example: 3 Strains E. coli 1-2 1-3 2-3 1: E.coli O157:H7 EDL 993(5608027bp) 2: Ecoli O157:H7 (5577057bp) 3: E.coli k12 (4705567) Fragments min. len. 30 of 1-2 60,000 Fragments Conclusions Our algorithm solves an open problem w.r.t. the chaining of n fragments of k genomes. Other data structures than the range tree can be used, e.g., the kd-tree. It takes as long as k>2 The sum-of-pairs gap cost is addressed in the paper. We reduced the time complexity by almost two log factors in addition to the space complexity by one log factor.

Example Local chains Fragments min. len. 12 M. Genetalium M. Pneimonia M. Pneimonia M. Genetalium M. Genetalium Abouelhoda-Ohlebusch, WABI 2003 Example Local chains Fragments min. len. 15 E. coli V. cholera Acknowledgement Enno Ohlebusch, Ulm University Stefan Kurtz, , Hamburg University Robert Giegerich, Bielefeld University Janina Scholz, Bielefeld University Thanks as long as your attention This work was funded by the Graduiertenkolleg Bioin as long as matik, Universität Bielefeld, Germany.

Christie, Jeanne Freelance Journalist

Christie, Jeanne is from United States and they belong to Christie, Jeanne and they are from  Tuscaloosa, United States got related to this Particular Journal. and Christie, Jeanne deal with the subjects like Wine

Journal Ratings by Andrews University

This Particular Journal got reviewed and rated by Andrews University and short form of this particular Institution is US and gave this Journal an Excellent Rating.