UMD CMSC 838T - Accurate method for fast design of diagnostic oligonucleotide probe sets for DNA microarrays - D1865415



School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 9



Accurate method for fast design of diagnostic oligonucleotide probe sets for DNA microarrays

Andreas Krause, Markus Kräutner, Harald Meier*
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR)
Institut für Informatik, Technische Universität München,
Boltzmannstrasse 3, 85748 Garching, Germany
*email: [email protected]: +49 (89) 289-19476, fax: +49 (89) 289-17662

Abstract

We present a method for the automatic generation of oligonucleotide probe sets for DNA microarrays. This approach is particularly well suited for specificity evaluation of designed probes in large data sets. Algorithms for probe preselection, hybridization prediction and probe selection are presented. Combinatorial techniques are introduced for the selection of probe sets of high differentiation capability even from sequence databases of homologous conserved genes. These techniques include the automatic generation of group specific probes and the design of excluding probes. A basic prototype was implemented including a shared memory parallelization and distribution. The principal applicability of our method to a database of very conserved sequence data was shown and the run-time performance estimated.

1. Introduction

Oligonucleotide microarrays are currently used in gene expression studies [7] as well as for diagnostic purposes, e.g. the identification of microorganisms in clinical, food and environmental samples [2, 12, 20].

Apart from engineering aspects of microarray technologies, the main problem is to determine suitable sets of oligonucleotide probes. These probe sets should reliably detect and differentiate target sequences of interest (or rather the respective genes, gene transcripts or organisms). Large databases therefore have to be analyzed efficiently.

Algorithms and their implementations have been published which solve these problems automatically [10, 15, 16].
The resulting programs work well for data sets that enable the finding of sequence specific probes due to a high variability within the analyzed target sequences. None of these algorithms and their implementations, however, solve the problems that occur when sequence databases of homologous genes need to be analyzed.

This problem occurs, for instance, when a microarray is designed for the identification of organisms on the basis of sequence variabilities of conserved genes, e.g. the ribosomal ribonucleic acid sequences. In these cases additional functionalities of the chip designing program are required to fulfill the identification demands of the DNA microarray. If it is impossible to design sequence specific probes due to high similarities, it could be advantageous to determine probes that are specific for groups of closely related sequences. In some cases it could even be helpful to select probes that detect target sequences as well as some non-target sequences. These probes have identification value when applied in combination with probes that react specifically with the respective non-target sequences, so called excluding or negative probes. In this paper we describe an algorithm and its parallel implementation that meets the requirements mentioned above and is, therefore, well suited for the fast design of oligonucleotide probe sets from nucleic acid sequence databases of both variable and highly conserved genes.

2. Problem statement

Probe design is a combinatorial optimization problem, where sensitivity and specificity must be optimized while several qualitative criteria, e.g. guanine/cytosine (G/C) content and melting temperature range of the probes, have to be maintained. Maximizing specificity and sensitivity are often conflicting goals. In practice it is often impossible to design specific probes for each selected sequence, especially when dealing with highly conserved data.
To nevertheless obtain a feasible working solution, one has to make compromises. Thus we extended the typical probe design problem 2.1 by another problem 2.2, which allows criteria to be relaxed if the data is highly conserved.

Problem 2.1 (Positive probes). Given a selected subset S1 in a database S0 of sequences, find for each sequence s in S1 at least one positive probe p which hybridizes within S1 only with s; it may however cross-hybridize with some sequences B, where B ⊆ S2 := S0 \ S1, if this cannot be avoided. High specificity means that the number of non-target matches is minimized, while high sensitivity means that a maximum number of selected target sequences is covered.

Problem 2.2 (Negative probes). Given the positive probes identified in Problem 2.1, determine as few negative probes as possible which together hybridize with all sequences in B but with none in S1. High specificity means in this context that no sequences in S1 may cross-hybridize with any negative probe, while high sensitivity means that a maximum number of sequences in B must be covered.

2.1. Probe design constraints

The following constraints are imposed onto the probe selection process:
• Minimum and maximum length of the probes.
• The melting temperatures of the probe-target hybrids must not differ by more than a maximum value. This can be accomplished e.g. by specifying a range of percentage of G/C content.
• Probes should not contain self complementary regions that are longer than four sequential nucleotides.
• There must be a minimum difference in the melting temperatures of target and non-target sequences. This can be accomplished by ensuring a minimum number of mismatches (stronger than G-U/T) to all other sequences.

2.2. System constraints

For practical reasons, other constraints are imposed onto the development of such a system.
The most important are:
• Execution time: A probe set for a chip must be computable in at most a few hours.
• Usability: The software must be able to read different standardized sequence and alignment data formats; a user interface should allow the selection of sequences and provide methods to visualize the specificity of designed probe sets.

3. Algorithmics

Our approach to the generation of oligonucleotide probe sets comprises three steps (as shown in Figure 1). Firstly, a pool of suitable probe candidates is generated (see 3.1). In the next step the hybridization behaviour of these probes is predicted (see 3.2). Based on the results the probes are selected (see 3.3).

3.1. Probe preselection

All possible probe candidates are generated. For performance reasons these should be as few as possible, while all optimal candidates must be kept.

3.1.1 Attributizing a suffix tree

To find suitable
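
As a rough illustration of the preselection step in Section 3.1 combined with the constraints from Section 2.1, the following Python sketch enumerates candidate probes from a single sequence and discards those that violate the length, G/C content and self-complementarity constraints. It scans all substrings directly and does not use the suffix tree of Section 3.1.1. The function names, the DNA alphabet ACGT, the default length range of 18 to 20 nucleotides and the G/C range of 40 to 60 percent are assumptions chosen for illustration, not values taken from the paper.

```python
from typing import Iterator, Tuple

# Assumed DNA alphabet; maps each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(probe: str) -> float:
    """Fraction of G and C bases in the probe."""
    return sum(base in "GC" for base in probe) / len(probe)


def has_self_complement(probe: str, max_run: int = 4) -> bool:
    """True if some stretch longer than max_run bases has its reverse
    complement elsewhere (or at the same place) in the probe."""
    k = max_run + 1
    revcomp = probe.translate(COMPLEMENT)[::-1]
    # A k-mer w occurs in revcomp(probe) exactly when revcomp(w) occurs in probe.
    return any(probe[i:i + k] in revcomp for i in range(len(probe) - k + 1))


def candidate_probes(
    seq: str,
    min_len: int = 18,                           # hypothetical default
    max_len: int = 20,                           # hypothetical default
    gc_range: Tuple[float, float] = (0.4, 0.6),  # hypothetical default
) -> Iterator[Tuple[int, str]]:
    """Yield (start, probe) for every substring that passes all filters."""
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            probe = seq[start:start + length]
            if not gc_range[0] <= gc_fraction(probe) <= gc_range[1]:
                continue
            if has_self_complement(probe):
                continue
            yield start, probe
```

Calling `list(candidate_probes(sequence))` on one target sequence gives the pool that the later hybridization prediction would have to evaluate. The mismatch-based constraint against non-target sequences is deliberately left out here, since it requires the whole database and belongs to the prediction step.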
