UCSD CSE 182 - Assembly



Outline:
- Assembly
- Assembling with Repeats
- Mate Pairs
- Whole genome shotgun
- Arachne: Details
- Alignment Module
- Overlap detection
- K-mer based overlap
- Sorting k-mers
- Phase 2-4 of Alignment module
- Detecting Chimeric reads
- Repeats
- Contig assembly
- Detecting Repeat Contigs 1: Read Density
- Creating Super Contigs
- Supercontig assembly
- Supercontig merging
- Repeat Supercontigs
- Filling gaps in Supercontigs
- Consensus Derivation
- Summary
- The central dogma again
- Much other analysis is possible
- A Static picture of the cell is insufficient

Assembly

Assembling with Repeats

Mate Pairs

Whole genome shotgun
- Input:
  - Shotgun sequence fragments (reads)
  - Mate pairs
- Output:
  - A single sequence created by consensus of overlapping reads
- First generation of assemblers did not include mate pairs (Phrap, CAP, ...)
- Second generation: CA, Arachne, Euler
- We will discuss Arachne, a freely available sequence assembler (2nd generation)

Arachne: Details
- Initial processing
- Alignment module

Alignment Module
- Input: Collection of DNA sequences of arbitrary length
- Output: Pairwise alignments between them

Overlap detection
- Option 1: Compute an alignment between every pair.
- G = 150 Mb, L = 500
- Coverage LN/G = 10
- N = 10 * 150 * 10^6 / 500 = 3 * 10^6
- Not good! (Only a small fraction of the pairs are true overlaps)

K-mer based overlap
- A 25-bp sequence appears at most once in the genome!
- Two overlapping sequences should share a 25-mer
- Two non-overlapping sequences should not!

Sorting k-mers
- Build a list of k-mers that appear in the sequences and their reverse complements
- Create a record with 4 entries:
  - K-mer
  - Sequence number
  - Position in the sequence
  - Reverse complementation flag
- Sort a vector of these records according to k-mer
- If the number of records for a k-mer exceeds a threshold, discard them (why?)

Phase 2-4 of Alignment module
- Coalesce k-mer hits into longer, gap-free partial alignments.
- These extended k-mer hits are saved.
- For each pair of sequences, form a directed graph.
- For each maximal path in the graph, construct an alignment.
- Refine the alignment via banded DP

Detecting Chimeric reads
- Chimeric reads: Reads that contain sequence from two genomic locations.
- Good overlaps: G(a,b) if a, b overlap with a high score
- Transitive overlap: T(a,c) if G(a,b) and G(b,c)
- Find a point x across which only transitive overlaps occur. x is a point of chimerism

Repeats

Contig assembly
- Reads are merged into contigs up to repeat boundaries.
- If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also, shift(a,c) = shift(a,b) + shift(b,c)
- Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
- Some contigs might be entirely within repeats. These must be detected

Detecting Repeat Contigs 1: Read Density
- Compute the log-odds ratio of two hypotheses:
  - H1: The contig is from a unique region of the genome.
  - H2: The contig is from a region that is repeated at least twice

Creating Super Contigs

Supercontig assembly
- Supercontigs are built incrementally
- Initially, each contig is a supercontig.
- In each round, a pair of supercontigs is merged, until no more merges can be performed.
- Create a priority queue with a score for every pair of 'mergeable supercontigs'.
- Score has two terms:
  - A reward for multiple mate-pair links
  - A penalty for distance between the links.

Supercontig merging
- Remove the top-scoring pair (S1, S2) from the priority queue.
- Merge (S1, S2) to form contig T.
- Remove all pairs in Q containing S1 or S2
- Find all supercontigs W that share mate-pair links with T and insert (T, W) into the priority queue.
- Detect repeated supercontigs and remove them

Repeat Supercontigs
- If the
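The read-count estimate on the "Overlap detection" slide can be checked with a few lines of arithmetic. The sketch below uses the slide's numbers (G = 150 Mb, L = 500, coverage 10); the function names are illustrative and not taken from the slides. It also counts how many read pairs "Option 1" would have to align.

```python
def reads_needed(genome_length, read_length, coverage):
    """Solve coverage = L * N / G for N (number of reads)."""
    return coverage * genome_length // read_length


def all_pairs(n):
    """Number of unordered pairs among n reads."""
    return n * (n - 1) // 2


n_reads = reads_needed(genome_length=150_000_000, read_length=500, coverage=10)
print(n_reads)             # 3000000
print(all_pairs(n_reads))  # 4499998500000, roughly 4.5 * 10^12 alignments
```

Aligning every pair means trillions of alignments, almost all between reads that do not overlap, which is why the slides move on to a k-mer filter.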
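The "Sorting k-mers" steps can be sketched as follows. This is a minimal illustration, not Arachne's actual code: the function names, the `max_records` default, and the choice to report candidates as pairs of read indices are all assumptions. Each record holds the four entries listed on the slide, records are sorted by k-mer, and any k-mer with too many records is skipped.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Build (k-mer, sequence number, position, reverse flag) records, sorted by k-mer."""
    records = []
    for idx, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], idx, pos, is_rc))
    records.sort()
    return records


def candidate_pairs(reads, k=25, max_records=50):
    """Return pairs of read indices that share at least one non-frequent k-mer."""
    pairs = set()
    for _, group in groupby(kmer_records(reads, k), key=lambda rec: rec[0]):
        hits = list(group)
        if len(hits) > max_records:
            continue  # over the threshold (hypothetical value): discard this k-mer
        read_ids = sorted({rec[1] for rec in hits})
        pairs.update(combinations(read_ids, 2))
    return pairs


# Toy example with k=5 so that the short reads can share a k-mer.
reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
print(candidate_pairs(reads, k=5))  # {(0, 1)}
```

Only the candidate pairs would then be passed on to the later alignment phases, instead of aligning every pair of reads.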
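The priority-queue loop on the "Supercontig assembly" and "Supercontig merging" slides can be sketched as below. The slides give no scoring formula, so `link_score`, the `min_links` rule for what counts as "mergeable", and the way links are combined after a merge (counts added, smallest distance kept) are assumptions made only for illustration. Repeat detection is left out. Stale queue entries are skipped when popped, which has the same effect as removing all pairs containing S1 or S2.

```python
import heapq
from itertools import count


def link_score(num_links, distance, distance_penalty=0.001):
    """Hypothetical score: reward link count, penalize distance between the links."""
    return num_links - distance_penalty * distance


def merge_supercontigs(contigs, links, min_links=2):
    """Greedily merge supercontigs that are joined by mate-pair links.

    contigs: iterable of contig names.
    links:   {(a, b): (num_links, distance)} between distinct contig names.
    Returns a sorted list of supercontigs, each a sorted list of contig names.
    """
    alive = {frozenset([c]) for c in contigs}  # initially each contig is a supercontig
    link = {frozenset([frozenset([a]), frozenset([b])]): v
            for (a, b), v in links.items()}
    heap, tie = [], count()

    def push(pair):
        n, d = link[pair]
        if n >= min_links:  # assumed meaning of 'mergeable'
            s, w = tuple(pair)
            # heapq is a min-heap, so store the negated score; tie avoids comparing sets
            heapq.heappush(heap, (-link_score(n, d), next(tie), s, w))

    for pair in link:
        push(pair)

    while heap:
        _, _, s1, s2 = heapq.heappop(heap)
        if s1 not in alive or s2 not in alive:
            continue  # stale entry: one side was already merged away
        t = s1 | s2  # merge (S1, S2) to form T
        alive -= {s1, s2}
        alive.add(t)
        # Re-attach every link that touched S1 or S2 to T.
        for pair in [p for p in link if s1 in p or s2 in p]:
            n, d = link.pop(pair)
            rest = pair - {s1, s2}
            if rest:
                (w,) = rest
                key = frozenset([t, w])
                n0, d0 = link.get(key, (0, d))
                link[key] = (n0 + n, min(d0, d))
        # Insert (T, W) for every supercontig W that shares links with T.
        for pair in [p for p in link if t in p]:
            push(pair)
    return sorted(sorted(s) for s in alive)


links = {("c1", "c2"): (5, 1000), ("c2", "c3"): (3, 2000), ("c3", "c4"): (1, 500)}
print(merge_supercontigs(["c1", "c2", "c3", "c4"], links))
# [['c1', 'c2', 'c3'], ['c4']]
```

In this toy run c4 stays separate because a single mate-pair link falls below the assumed `min_links` threshold.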

