Assembly

Assembling with Repeats

Mate Pairs

Whole genome shotgun
- Input:
  - Shotgun sequence fragments (reads)
  - Mate pairs
- Output:
  - A single sequence created by consensus of overlapping reads
- First generation of assemblers did not include mate-pairs (Phrap, CAP, ...)
- Second generation: CA, Arachne, Euler
- We will discuss Arachne, a freely available sequence assembler (2nd generation)

Arachne: Details
- Initial processing
- Alignment module

Alignment Module
- Input: Collection of DNA sequences of arbitrary length
- Output: Pairwise alignments between them.

Overlap detection
- Option 1: Compute an alignment between every pair.
- G = 150 Mb, L = 500 bp
- Coverage $LN/G = 10$
- $N = 10 \cdot 150 \cdot 10^{6} / 500 = 3 \cdot 10^{6}$
- Not good!
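As a quick check of the arithmetic on the "Overlap detection" slide, the sketch below recomputes the number of reads from G, L and the coverage, and then counts how many pairwise alignments Option 1 would need. The variable names are illustrative and not taken from the slides.

```python
# Values from the slide: genome size, read length, target coverage.
G = 150 * 10**6   # genome length in bp (150 Mb)
L = 500           # read length in bp
coverage = 10     # coverage = L * N / G

# Solve coverage = L * N / G for the number of reads N.
N = coverage * G // L

# Option 1 aligns every unordered pair of reads.
pairs = N * (N - 1) // 2

print(f"reads N    = {N:,}")
print(f"read pairs = {pairs:,}")
```

With these numbers the pair count is on the order of 10^12, which is why aligning every pair is impractical.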
- (Only a small fraction of the pairs are true overlaps.)

K-mer based overlap
- Working assumption: a 25-bp sequence appears at most once in the genome!
- Two overlapping sequences should share a 25-mer
- Two non-overlapping sequences should not!

Sorting k-mers
- Build a list of k-mers that appear in the sequences and their reverse complements
- Create a record with 4 entries:
  - K-mer
  - Sequence number
  - Position in the sequence
  - Reverse complementation flag
- Sort a vector of these according to k-mer
- If the number of records for a k-mer exceeds a threshold, discard them (why?)

Phase 2-4 of Alignment module
- Coalesce k-mer hits into longer, gap-free partial alignments.
- These extended k-mer hits are saved.
- For each pair of sequences, form a directed graph.
- For each maximal path in the graph, construct an alignment.
- Refine the alignment via banded DP

Detecting Chimeric reads
- Chimeric reads: Reads that contain sequence from two genomic locations.
- Good overlaps: G(a,b) if a and b overlap with a high score
- Transitive overlap: T(a,c) if G(a,b) and G(b,c)
- Find a point x across which only transitive overlaps occur. x is a point of chimerism

Repeats

Contig assembly
- Reads are merged into contigs up to repeat boundaries.
- If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also, shift(a,c) = shift(a,b) + shift(b,c)
- Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
- Some contigs might be entirely within repeats. These must be detected

Detecting Repeat Contigs 1: Read Density
- Compute the log-odds ratio of two hypotheses:
  - H1: The contig is from a unique region of the genome.
  - H2: The contig is from a region that is repeated at least twice

Creating Super Contigs

Supercontig assembly
- Supercontigs are built incrementally
- Initially, each contig is a supercontig.
- In each round, a pair of supercontigs is merged, until no more merges can be performed.
- Create a Priority Queue with a score for every pair of ‘mergeable supercontigs’.
- Score has two terms:
  - A reward for multiple mate-pair links
  - A penalty for distance between the links.

Supercontig merging
- Remove the top scoring pair ($S_1$, $S_2$) from the priority queue.
- Merge ($S_1$, $S_2$) to form supercontig T.
- Remove all pairs in Q containing $S_1$ or $S_2$
- Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
- Detect Repeated Supercontigs and remove

Repeat Supercontigs
- If the
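The greedy loop from the "Supercontig assembly" and "Supercontig merging" slides can be sketched with a binary heap. This is a minimal illustration, not Arachne's implementation. The `score` function, its `reward` and `penalty` weights, the `min_links` cutoff, and the `links` data layout are all assumptions made for the example. Stale queue entries are skipped on pop rather than removed eagerly. The repeat-supercontig detection step is omitted, and the member order inside a merged supercontig is simply concatenation.

```python
import heapq
from itertools import count


def score(n_links, distance, reward=1.0, penalty=0.001):
    """Hypothetical score: reward mate-pair links, penalize distance."""
    return reward * n_links - penalty * distance


def merge_supercontigs(contigs, links, min_links=2):
    """Greedy merging; `links` maps frozenset({a, b}) -> (n_links, distance)."""
    # Initially, each contig is its own supercontig.
    members = {c: (c,) for c in contigs}
    links = dict(links)
    tie = count()      # tie-breaker so equal scores never compare names
    new_id = count()
    heap = []

    def push(a, b):
        n, d = links[frozenset((a, b))]
        if n >= min_links:  # only 'mergeable' pairs enter the queue
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(heap, (-score(n, d), next(tie), a, b))

    for pair in links:
        push(*sorted(pair))

    while heap:
        _, _, s1, s2 = heapq.heappop(heap)
        # Lazy deletion: skip pairs that mention an already merged supercontig.
        if s1 not in members or s2 not in members:
            continue
        t = f"T{next(new_id)}"
        members[t] = members.pop(s1) + members.pop(s2)
        # Transfer every link of S1 or S2 to the new supercontig T.
        merged = {}
        for pair in [p for p in links if s1 in p or s2 in p]:
            n, d = links.pop(pair)
            others = pair - {s1, s2}
            if not others:
                continue  # this was the (S1, S2) link itself
            (w,) = others
            old_n, old_d = merged.get(w, (0, d))
            merged[w] = (old_n + n, min(old_d, d))
        for w, link in merged.items():
            links[frozenset((t, w))] = link
            push(t, w)
    return sorted(members.values())


example_links = {
    frozenset("AB"): (5, 100),
    frozenset("BC"): (3, 200),
    frozenset("CD"): (1, 50),
}
print(merge_supercontigs(["A", "B", "C", "D"], example_links))
```

In the example, contig D stays separate because a single mate-pair link falls below the assumed `min_links` cutoff.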