Predicting E. Coli Promoters Using SVM
DIEP MAI ([email protected])
Course: CS/ECE 539 – Fall 2008
Instructor: Prof. Yu Hen Hu

Outline
◦ Purpose
◦ Dataset
◦ Data preprocessing
◦ Approach
◦ Results
◦ Observation/Conclusion

Purpose
Build and train an SVM system to predict E. Coli promoters from the given gene sequences.
Example: given the gene sequence
aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc
is it an E. Coli promoter?
For more theoretical background on E. Coli promoters, see:
http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt

Dataset
The data file is obtained from:
http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt
Dataset information:
◦ Number of instances: 106
◦ Attributes: 57; non-numeric nominal values (A, C, G, or T)
◦ Classes: 2; Positive (+1) or Negative (-1)

Data preprocessing
Randomly partition the dataset into TRAINSET and TESTSET:
◦ Ratio = TESTSET / (TRAINSET + TESTSET)
Encode the non-numeric attributes:
◦ A = 0001₂ = 1₁₀
◦ C = 0010₂ = 2₁₀
◦ G = 0100₂ = 4₁₀
◦ T = 1000₂ = 8₁₀
Scale each feature to [-1, 1] to keep large values from dominating small ones.

Approach
An RBF kernel is used, so "good" C (cost) and G (gamma) parameters must be found.
Parameter scanning:
◦ Set the range of C to [2⁻¹⁵, 2⁵] and of G to [2⁻¹⁵, 2²]
◦ For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates
◦ This process is repeated a few times; the pair that "often" produces high accuracy rates is more likely to be selected
Training/Testing:
◦ Use the selected parameters and the whole TRAINSET to train the system
◦ Use the trained system to predict the TRAINSET; the preferred accuracy rate is 100%
◦ Use the trained system to predict the TESTSET

Results
Configuration:
◦ Ratio for partitioning the dataset = 1/5 (split the dataset into 5 roughly equal sets; one is reserved as TESTSET)
◦ K-fold = 15 (15 folds in total)
◦ Number of repetitions to select parameters
= 10

After running the system several times, accuracy rates of the testing process:

Occr. freq.   TRAINSET        TESTSET
Often         85/85 = 100%    19/21 = 90.48%
Sometimes     85/85 = 100%    18/21 = 85.71%, 20/21 = 95.23%
Rare          85/85 = 100%    21/21 = 100%, 15/21 = 71.43%

Training result:

Accuracy rate         "Best" (C, G)
Avg.        Best      C         G
84.35%      85.88%    0.7071    0.0371
84.23%      85.88%    1.1892    0.0313
88.82%      90.59%    1.0000    0.0743
83.52%      85.88%    1.1892    0.0625

Observation/Conclusion
SVM:
◦ For this dataset the number of attributes is not large, so using the RBF kernel to map the features into a higher dimension seems appropriate
Scanning (C, G) takes a large amount of time. One approach to speed up this process:
◦ Split the range into "large" equal intervals
◦ Pick the interval that yields high accuracy rates
◦ Divide this interval into smaller equal intervals
◦ Repeat
K-fold method:
◦ The larger the number of folds, the more time the process requires
◦ For this dataset the number of instances is not large, so large numbers of folds seem to work
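The encoding and scaling steps described under Data preprocessing can be sketched in plain Python. This is a minimal sketch; the helper names `encode` and `scale_features` are illustrative, not from the report:

```python
# Encode nucleotides as the one-hot-style integers from the report
# (A=1, C=2, G=4, T=8), then linearly scale each feature (column)
# to [-1, 1] so that large values do not dominate small ones.
CODES = {"A": 1, "C": 2, "G": 4, "T": 8}

def encode(seq):
    """Map a nucleotide string to a list of integer features."""
    return [CODES[base] for base in seq.upper()]

def scale_features(rows):
    """Scale each feature (column) of a list of rows to [-1, 1]."""
    cols = list(zip(*rows))
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        if hi == lo:
            # Constant feature: map to 0 to avoid division by zero.
            scaled_cols.append([0.0] * len(col))
        else:
            scaled_cols.append([2.0 * (v - lo) / (hi - lo) - 1.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]
```

In practice the scaling bounds would be computed on TRAINSET and reused for TESTSET, so both sets are mapped consistently.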
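The parameter scan (leave-one-out accuracy over a grid of (C, G) pairs for an RBF-kernel SVM) can be sketched with scikit-learn. The library choice is an assumption, since the report does not name its SVM implementation, and the exponent grids here are narrowed from the report's [2⁻¹⁵, 2⁵] × [2⁻¹⁵, 2²] ranges to keep the example fast; the function name `scan_parameters` is hypothetical:

```python
# Sketch of the (C, gamma) scan: a leave-one-out grid search over an
# RBF-kernel SVM, as in the report's parameter-scanning step.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

def scan_parameters(X, y, c_exps=range(-5, 6, 2), g_exps=range(-5, 3, 2)):
    """Return the (C, gamma) pair with the best leave-one-out accuracy."""
    grid = {"C": [2.0 ** e for e in c_exps],
            "gamma": [2.0 ** e for e in g_exps]}
    search = GridSearchCV(SVC(kernel="rbf"), grid,
                          cv=LeaveOneOut(), scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

In the report's procedure this scan is repeated several times, the pair that "often" scores well is kept, and the selected pair is then used to retrain on the whole TRAINSET.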
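The coarse-to-fine speed-up suggested in the conclusion can be sketched as a one-dimensional scan over an exponent range. This is a hypothetical helper, not the report's code; `evaluate` stands in for a leave-one-out accuracy measurement at a given exponent:

```python
# Coarse-to-fine scan: evaluate a coarse grid over [lo, hi], then
# repeatedly zoom into the neighborhood of the best point, as the
# conclusion suggests for speeding up the (C, G) search.
def refine_search(evaluate, lo, hi, levels=3, points=5):
    """Return the best exponent found after `levels` rounds of zooming."""
    for _ in range(levels):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = max(grid, key=evaluate)
        # Narrow the range to one step on either side of the best point.
        lo, hi = best - step, best + step
    return best
```

The same idea would be applied to the C and G exponents jointly (a 2-D grid), trading a small risk of missing an isolated optimum for a large reduction in the number of (C, G) pairs evaluated.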