ECE 539, Fall 2008, Instructor: Prof. Yu Hen Hu
Final Project

Predicting body weight in chicken using SNP markers: application of generalized radial basis functions

Nanye Long
Dept. of Animal [email protected] 11, 2008


Background

- Genetic variants (e.g., single nucleotide polymorphisms, or SNPs) can determine variation of phenotypes.
- Standard linear model assumes linear and additive relationship between genetic variants and phenotype, which may not be appropriate.
- We need a tool to do non-linear mapping.


Data, objective

How do SNPs determine chicken body weight?

Observation   SNP1   SNP2   ...   SNP10000   Body weight
1             0      2      ...   1          d1
2             1      1      ...   0          d2
...           ...    ...    ...   ...        ...
170           0      1      ...   2          d170

- SNP: input features, taking three values: 0 (aa), 1 (Aa), 2 (AA). About 10,000 SNPs.
- Body weight: a continuous outcome, from 170 chickens.
- Problem formulation: $\mathbb{R}^p \to \mathbb{R}^1$, where $p$ is the number of SNPs.

Why RBF?
- Nonlinear approximation
- Can handle a large number of input features


Methods: Tuning parameters

Tuning parameters: m and k

Generalized Gaussian RBF:

$$F(x) = \sum_{i=1}^{k} w_i \, \varphi(\|x - t_i\|), \qquad \varphi(\|x - t_i\|) = \exp\left(-\frac{\|x - t_i\|^2}{2\sigma^2}\right)$$

The number of basis functions < the number of data points.

- m: number of SNPs used to compute the distance in $\varphi$
  - Rank all SNPs by ANOVA p-values, select top 25, 30, ..., 100
- k: number of centers $t_i$, $i = 1, 2, \ldots, k$
  - k-medoids clustering, try k = 2, 3, ..., 100
  - $\sigma = d_{\max} / \sqrt{2k}$, where $d_{\max}$ is the maximum distance between chosen centers
- Choose optimal values for m and k by 10-fold cross validation


Methods: Baseline, model comparison

Baseline model: linear regression on SNPs

$$F(x_j) = x_j^T \beta$$

- $x_j^T$: input SNP vector of the jth chicken, whose length is also tuned
- $\beta$: a vector of weights associated with each SNP

Comparison between RBF and linear model: prediction on test set

[Diagram: five replications (Rep 1 to Rep 5), each split into a test set and a training set; the training set is used to find optimal m and k via 10-fold CV.]


Results: Optimal numbers of SNPs and centers for RBF

RBF: optimal values for m and k

[Figure: five panels (Replication 1 to 5) plotting CV error against number of centers (0 to 100), with one curve per number of SNPs (25, 30, ..., 100).]


Results: Optimal number of SNPs for LM

Linear model: optimal numbers of SNPs

[Figure: five panels (Replication 1 to 5) plotting CV error against number of SNPs.]


Results: Prediction error on test data

Table: Mean squared errors of test data predicted by RBF and linear model.

Repetition     1        2       3       4        5
RBF            73.67    64.17   47.47   71.45    40.37
Linear model   111.35   83.67   90.10   106.58   56.43

- RBF model substantially exceeds the linear model in generalization performance.
- "Optimal" values for m and k: w.r.t. a range of reasonable candidates. May not be truly optimal.
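
The RBF fitting step described in the Methods slides can be sketched as follows. This is a minimal illustration, not the author's code: it assumes NumPy, uses a simple alternating k-medoids routine in place of whatever clustering implementation the project used, solves for the weights by ordinary least squares, and runs on synthetic 0/1/2 genotype data. The names `pairwise_dist`, `k_medoids`, `fit_rbf`, and `predict_rbf` are hypothetical, and the SNP ranking and cross-validation steps are left out.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (n x m) and rows of B (k x m)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def k_medoids(X, k, n_iter=50, seed=0):
    """Simple alternating k-medoids (assumed variant); returns medoid row indices."""
    rng = np.random.default_rng(seed)
    D = pairwise_dist(X, X)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)  # assign each point to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # new medoid: member with the smallest total distance to its cluster
                within = D[np.ix_(members, members)].sum(axis=1)
                new[j] = members[within.argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

def fit_rbf(X, y, k, seed=0):
    """Fit F(x) = sum_i w_i * exp(-||x - t_i||^2 / (2 sigma^2)) with k centers."""
    centers = X[k_medoids(X, k, seed=seed)]
    d_max = pairwise_dist(centers, centers).max()
    sigma = d_max / np.sqrt(2 * k)  # width heuristic from the slides
    Phi = np.exp(-pairwise_dist(X, centers) ** 2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares weights (assumption)
    return centers, sigma, w

def predict_rbf(X, centers, sigma, w):
    """Evaluate the fitted RBF network on the rows of X."""
    Phi = np.exp(-pairwise_dist(X, centers) ** 2 / (2 * sigma ** 2))
    return Phi @ w

# Synthetic stand-in for the data: 60 "chickens", m = 10 selected SNPs coded 0/1/2.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(60, 10)).astype(float)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]  # made-up nonlinear phenotype

centers, sigma, w = fit_rbf(X, y, k=8)
pred = predict_rbf(X, centers, sigma, w)
mse = float(np.mean((y - pred) ** 2))
print("training MSE:", mse)
```

In the project setup, a fit like this would be repeated for each candidate pair (m, k) inside the 10-fold cross validation, with the pair giving the lowest CV error kept for prediction on the test set.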