Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA MicroarraysMotivationTalk OverviewSlide 5Slide 6Probe Design ConstraintsAlgorithmAlgorithm Probe GenerationSuffix TreeProbe GenerationSlide 13Algorithm Hybridization PredictionHybridization PredictionSlide 16Algorithm Probe SelectionNegative Probe SelectionSlide 20Slide 22ExperimentationDiscussionSlide 26Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tasctas@csCMSC 838 PresentationCMSC 838T – PresentationMotivationDNA microarrays techniques are used intensely for identification of biological agentsGene Expression StudiesDiagnostic PurposesIdentification of Microorganisms in samplesItem ExtractionComplex ProblemFind the necessary probes and the temperatureProbe sets should be reliably detect and differentiate target sequencesLarge DatabasesNEW!! Homologous Genes (how to find specific probes)CMSC 838T – PresentationTalk OverviewOverview of talkMotivationProblem StatementAlgorithmMathematical AspectsExperimentationDiscussionCMSC 838T – PresentationProblem StatementPositive ProbesDatabase set S0Target S1For each sequence in S1, find at least one probeFor S0 - S1 try to avoid it (but do not care if happens)High Specificity: # of non-target matches are minimizedHigh Sensitivity: # of covered target seq. is maximized S0S1CMSC 838T – PresentationProblem StatementNegative ProbesDetermine as few as possible probes which together hybridizes with all sequences in S0 - S1 but with NONE in S1.High Specificity: No seq. in S1 may hybridizeHigh Sensitivity: Max # of seq. in S0 - S1 be coveredS1S0CMSC 838T – PresentationProbe Design ConstraintsSequence RelatedLength of probesDeviation of melting temperature of probe-target hybrids must be low (for physical reasons)No self complementary regions longer than four nucleotides (not descriptive enough)Melting temperatures of target and non-target seq. must be larger than a predefined (too close, too hard to identify)Ensuring a minimum number of mismatches is enough (homologous sequences)System RelatedExecution TimeUsabilityCMSC 838T – PresentationAlgorithmOverviewProbe GenerationHybridization PredictionProbe SelectionCMSC 838T – PresentationAlgorithmProbe GenerationSubproblem: Generate probe candidates for the sequencesKeep the set as small as possible without losing any optimal candidate (exclude infeasible ones)Suffix Tree Why?Allows fast recognition of repetitive subsequencesIdentifies non-unique probes (i.e. with more than one target)Efficient for memory and for T computation (reduce time)How?Tree is constructed from the sequencesTraversed (Watson-Crick complement)CMSC 838T – PresentationSuffix TreeInput: TACTACATACTACAACTACACTACATACAACACAA$ denotes end of stringConstructed in linear timeCMSC 838T – PresentationProbe GenerationFurther ImprovementsFilters applied for cut offProbe length (predefined)G-C content (for temperature)Self-complementarity Probes should not contain complements as subsequencesFinally, remove highly conserved (non-specific) regions Insert into hashtables according to their lengthsCMSC 838T – PresentationAlgorithmCMSC 838T – PresentationAlgorithmHybridization PredictionSubproblem:Search for the right probeSearch is expensive, Intelligent Hashing usedDesignA frame is moved over target and nontarget seqs. with several lengthsPrevious algorithm (Kaderali 2002): Use the suffix treeAt each step, hash values are calculated. If hit, predict melting temperature, store in hybridization matrix.If there are too many hits for a probe, then it is not unique, remove itWhy intelligent?Hash time is linearAllows inexact matching because of hashing (No analysis)ParallelizationSeveral threads are searching for probe targets.Tree and hashtables are fixed.One thread writes to the final matrixCMSC 838T – PresentationHybridization PredictionEmpirical Simulation:One million random probe-target pairings generatedFour mismatches or one insertion or deletion plus one strong central mismatch chosenT<20 C for 93%Complexity is O ( |S0| |S1| )Possible probe candidates is |S1| (linear)Each position of database S0 must be checkedCMSC 838T – PresentationAlgorithmCMSC 838T – PresentationAlgorithmProbe SelectionSubproblem:Use the hybridization matrix to finalize the probe selectionWe have positive probes and negative probes to proceedAlgorithm Analysis:For each probe candidateg: #of matches in S1b: #of matches in S0 - S1t: highest melting point in S1Probes for which g or b values is too large, are removedSort according to g,b and t.Apply Depth First SearchAdvantagesPerforms well (No comparison though)Guarantees to choose all specific probes if any were found.Disadvantagescan NOT guarantee optimal selection in terms of coverageCMSC 838T – PresentationNegative Probe SelectionLet S2 =S0 - S1 and B subset of S2 . The probes that detect S1 also detects some of B elements.Algorithm for Negative ProbesApply probe generating and preselection for B.Conduct hybridization on B U S1 .Remove the probes which hybridizes with S1 .Sort the remaining probes according to their hit number.Successively select the probes which covers most target seq.Guarantees optimal solution for coverage and probe number usageCMSC 838T – PresentationAlgorithmProbe SelectionCMSC 838T – PresentationMathematical AspectsCMSC 838T – PresentationExperimentationParallelized on SMP platformClassic worker-producerIntel Dual Pentium III 933 MHz, 1 GB memoryTest datassu rRNA of ARB project20.282 ssu rRNA sequences1.401 < lengths < 4.179 %97 of them are shorter than 2.000 basesCMSC 838T – PresentationDiscussionHigh PerformanceExecution is linear with size of database, decreases if longer probes are usedLow Memory ConsumptionDepends on the size of the sequence selection, NOT database sizeAutomatic Design of Group Probes and negative probesHigh Quality Probe DesignCMSC 838T – PresentationDiscussionComparison with previous workvs. ARBNot suited for large scale probe designvs. LCFDoes not consider highly conserved dataMemory
View Full Document