Mizzou INFOINST 8010 - Genomic sequencing and its data analysis

Slide titles: Genomic sequencing and its data analysis; Lecture Outline; What is DNA Sequencing?; Sequencing; Importance of Sequencing; New Sequencers; Illumina (Solexa) Workflow; Paired-end Reads; Accelerating Technology & Plummeting Cost; Analysis tasks; Initial Data Analysis workflow; Short read mapping; Multiple mapping; Inexact matching; Short-read analysis software; Repeat Problems; Lander-Waterman Model; Lander-Waterman Assumptions; In practice…; Sequence Assembly Algorithms; Sequence Reconstruction Algorithm; Greedy Algorithm for the Shortest Superstring Problem; Celera Assembler; Screening reads; Overlapper; Unitigs; Celera Scaffolding; Scaffold picture; Assembly for short reads; Current approaches; De Bruijn graph method; De Bruijn graph; Summary; Homework; Acknowledgments

Genomic sequencing and its data analysis
Dong Xu
Digital Biology Laboratory
Computer Science Department
Christopher S. Bond Life Sciences Center
University of Missouri, Columbia
E-mail: [email protected]
http://digbio.missouri.edu

Lecture Outline
- Introduction to sequencing
- Next-generation sequencers
- Role of bioinformatics in sequencing
- Theory of sequence assembly
- Celera assembler
- Assembly of short reads

What is DNA Sequencing?
- A DNA sequence is the order of the bases on one strand.
- By convention, we order the DNA sequence from 5' to 3', from left to right.
- Often, only one strand of the DNA sequence is written, but usually both strands have been sequenced as a check.

Sequencing
- Bacteria
- Fungi, yeast
- Insects: mosquito, fruit fly, moth, honey bee
- Plants: Arabidopsis, rice, corn, grapevine, …
- Animals: mouse, hedgehog, armadillo, cat, dog, horse, cow, elephant, platypus, …
- Humans

Importance of Sequencing
- Basic blueprint for life
- Foundation of genomic studies
- Vision: personalized medicine
- Genetic disorders
- Diagnostics
- Therapies
- $1000 genome

New Sequencers
- Illumina / Solexa Genome Analyzer
- Applied Biosystems ABI 3730XL
- Roche / 454 Genome Sequencer FLX
- Applied Biosystems SOLiD

Illumina (Solexa) Workflow
[Figure slides: Illumina (Solexa) workflow]

Paired-end Reads
- Paired-end sequencing (mate pairs)
- Sequence the two ends of a fragment of known size.
- Currently the fragment length (insert size) can range from 200 bp to 10,000 bp.

Accelerating Technology & Plummeting Cost
- Next Generation Sequencing

Analysis tasks
- Initial analysis: base calling
- Mapping to a reference genome
- De novo or assisted genome assembly
- SNP, deletion/insertion, copy number
- Transcriptome profiling
- DNA methylation studies
- ChIP-Seq

Initial Data Analysis workflow
Analysis pipeline: Images (.tif) → Image Analysis → Base Calling → Sequence Analysis
- Image Analysis, for each tile: cluster intensities, cluster noise
- Base Calling, for each tile: cluster sequence, cluster probabilities, corrected cluster intensities
- Sequence Analysis, for all data: quality filtering, sequence alignment, statistics visualization
- Machines: Instrument PC, Analysis PC

Short read mapping
Input:
- A reference genome
- A collection of many 25-100 bp tags
- User-specified parameters
Output:
- One or more genomic coordinates for each tag
In practice, only 70-75% of tags successfully map to the reference genome.

Multiple mapping
- A single tag may occur more than once in the reference genome.
- The user may choose to ignore tags that appear more than n times.
- As n gets large, you get more data, but also more noise in the data.

Inexact matching
- An observed tag may not exactly match any position in the reference genome.
- Sometimes, the tag almost matches.
- Such mismatches may represent a SNP or a bad read-out.
- The user can specify the maximum number of mismatches, or a quality score threshold.
- As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.

Short-read analysis software

Sequencing Procedure
Library Creation → Sequencing → Assembly → Gap Closure → Finishing → Annotation

Genome Sequence Analysis - Step One: Assemble Sequences into Contigs
[Figure: sequenced fragmented DNA aligned into an assembled DNA sequence; CONTIG 1, CONTIG 2, CONTIG 3]

Repeat Problems
- Repeats at read ends can be assembled in multiple ways.
[Figure: correct vs. incorrect assembly]

Genome Sequence Analysis - Step One: Initial Problem with Assembly
[Figure: sequenced fragmented DNA; incorrectly assembled DNA sequence; CONTIG 1, CONTIG 2]

Genome Sequence Analysis - Step One: Need to Mask Repeats
[Figure: sequenced fragmented DNA; masked DNA sequence; assembled DNA sequence; CONTIG 1 through CONTIG 5]

Lander-Waterman Model
- Poisson estimate
- Number of reads
- Average length of a read
- Probability of base read
Lander ES, Waterman MS (1988) "Genomic mapping by fingerprinting random clones: a mathematical analysis." Genomics 2 (3): 231-239

Lander-Waterman Assumptions
1. Sequencing reads will be randomly distributed in the genome.
2. The ability to detect an overlap between two truly overlapping reads does not vary from clone to clone.

In practice…
Lander-Waterman is almost always an underestimate:
- cloning biases in shotgun libraries
- repeats
- GC/AT rich regions
- other low complexity regions

Sequence Assembly Algorithms
- Different from similarity searching
- Look for ungapped overlaps at the ends of fragments (method of Wilbur and Lipman, SIAM J. Appl. Math. 44; 557-567, 1984)
- High degree of identity over a short region
- Want to exclude chance matches, but not be thrown off by sequencing errors

Sequence Reconstruction Algorithm
- In the shotgun approach to sequencing, small fragments of DNA are reassembled back into the original sequence. This is an example of the Shortest Common Superstring (SCS) problem, where we are given fragments and we wish to find the shortest sequence containing all the fragments.
- A superstring of the set P is a single string that contains every string in P as a
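
To make the Shortest Common Superstring formulation concrete, here is a minimal Python sketch of a greedy heuristic of the kind named in the slide title "Greedy Algorithm for the Shortest Superstring Problem". The lecture's own version is not part of this preview, so the function names `overlap` and `greedy_superstring`, the tie-breaking rule, and the example fragments are assumptions made for illustration. The sketch assumes error-free reads from a single strand and exact overlaps only. Greedy merging always returns a superstring, but not necessarily the shortest one.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Repeatedly merge the pair of fragments with the largest exact overlap.

    Hypothetical helper for illustration; ties are broken by taking the
    first best pair found.
    """
    # Remove duplicates, then fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Try every ordered pair (i, j): suffix of frags[i] vs prefix of frags[j].
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the shared region only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ATCGAT", "GATTAC", "TACAGG"]  # made-up example fragments
    print(greedy_superstring(reads))
```

With the three made-up fragments above, the two 3-base overlaps (GAT and TAC) are merged in turn, giving the 12-base string ATCGATTACAGG. Real assemblers must also handle sequencing errors, reverse complements, and repeats, which this sketch ignores.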

