MIT HST 950J - Linking Genotypes and Phenotypes - D2682650

School name Massachusetts Institute of Technology

Pages 38

Linking Genotypes and Phenotypes
Peter J. Park, PhD
Children’s Hospital Informatics Program, Harvard Medical School
HST 950 Lecture #21

Harvard-MIT Division of Health Sciences and Technology
HST.950J: Medical Computing
Peter Szolovits, PhD; Isaac Kohane, MD, PhD; Lucila Ohno-Machado, MD, PhD

Introduction
• There is an increasingly large amount of gene expression data; other types of genomic data, e.g., single nucleotide polymorphisms, are accumulating rapidly.
• A large amount of phenotypic data exists as well, especially in clinical settings, e.g., diagnosis, age, gender, race, survival time, smoking history, clinical stage of tumor, size of tumor, type of tumor, treatment parameters.
• We need to find relationships between genomic and phenotypic data. What genes or variables are correlated with a particular phenotype? What should we use as predictors?

Introduction
• We need to correlate predictor variables with response variables. A classic example: is smoking related to lung cancer?
• One of the difficulties with genomic data is that there are many possible predictors.
• Eventually, we would like to have a comprehensive and coherent statistical framework for relating different types of predictors with outcome variables.
• Today: we will use microarray data as an example.

Overview
• Microarrays have become an essential tool.
  • cDNA arrays: basic biology labs with their own arrays (competitive hybridization; measures the ratio between the sample of interest and the reference sample)
  • Oligonucleotide arrays (Affymetrix): everyone else (attempts to measure absolute abundance level)
  • There are a few other types (SAGE, commercial arrays)
• Biological validation is necessary.
  • Northern blots; RT-PCR; RNAi
• A crude analysis may be sufficient for finding prominent features in the data, e.g., genes with very large fold ratios.
• More sophisticated analysis is important for getting the most out of your data.

An Observation
• There is a disconnect between statisticians/mathematicians/computer scientists who invent techniques and biologists/clinicians who use them.
• There have been numerous models for describing microarray data, but most of them are not used in practice.
• Biologists/clinicians are justifiably reluctant to apply methods they do not understand.
• Trade-off between complexity and adoptability

Useful Techniques
Dimensionality Reduction
• Principal components analysis
• Singular value decomposition
Discrimination and Classification
• Binary and discrete response variable
• Continuous response variable
• Parametric vs. nonparametric tests
• Partial least squares
Censored Data
• Kaplan-Meier estimator
• Cox’s proportional hazards model
• Generalized linear models

Statistical challenges
• People have been studying the relationship between predictors and responses for a long time. So what’s new?
[Figure: two data matrices with dimension labels "p observations" and "n variables"]
• The usual paradigm in a clinical study is having few variables and many samples.
• Many statistical methods may not be valid without modifications; methods need to be applied with caution.

Too many variables (genes)
• Underdetermined system: e.g., fitting a cubic polynomial through two points
• Multivariate normal distribution: the density involves the inverse of the covariance matrix, but with more variables than observations the sample covariance matrix is singular!

Statistical challenges
• One example: we need to be careful with P-values.
• Suppose you flip a coin 10 times and get all heads. Is it biased? What if there are 10,000 people flipping coins and one person gets 10 heads? (A fair coin gives 10 heads with probability 1/1024, so among 10,000 people about 10 are expected to do so.)
• Even if the null hypothesis is true for every gene, about 500 out of 10,000 genes are expected to be significant at the .05 level by chance.
• We are testing 10,000 hypotheses at the same time; we need to perform a “multiple-testing adjustment.”

Dimensionality Reduction
• There are too many genes in the expression data.
• “Feature selection” in computer science
• Filter genes
  • software built-in filters
  • threshold value for minimum expression
  • variational filtering
  • use information from replicates
• Principal components
• Singular value decomposition
• Multi-dimensional scaling

Principal Component Analysis
• We want to describe the covariance structure of a set of variables through a few linear combinations of these variables.
• Geometrically, principal components represent a new coordinate system, with axes in the directions of maximum variability.
• Provides a more parsimonious description
[Figure: scatter plot of data points]
• We want maximum variance and orthogonality: eigenvectors!

Principal Component Analysis
• Identify directions with greatest variation.
• Linear combinations are given by eigenvectors of the covariance matrix.
• Eigenvectors and eigenvalues.
• Total variation explained is related to the eigenvalues. Proportion of total variance due to the kth component: the kth eigenvalue divided by the sum of all eigenvalues.
• Reduces data volume by projecting into lower dimensions
• Can be applied to rows or columns.

Singular Value Decomposition
SVD is a matrix factorization that reveals many important properties of a matrix: A = U D V^T, where U, V are orthonormal and D is diagonal. Let ui be the ith column of U. Then the best vector that captures the column space of A is u1; the best two column vectors that capture the columns of A are u1 and u2, etc.
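A minimal NumPy sketch of the factorization described on this slide, using a small synthetic matrix. The 8 x 5 shape, the random seed, and names such as `rank_k` are illustrative assumptions, not part of the lecture. It checks that A = U D V^T, that the leading singular vectors give progressively better low-rank approximations, and that the covariance eigenvalues from the PCA slides equal the squared singular values of the column-centered data divided by (number of rows - 1).

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed; the data are synthetic
A = rng.normal(size=(8, 5))         # hypothetical matrix: 8 rows, 5 columns

# Thin SVD: A = U @ diag(d) @ Vt, singular values in decreasing order.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
reconstruction_ok = np.allclose(A, U @ np.diag(d) @ Vt)
orthonormal_ok = np.allclose(U.T @ U, np.eye(5)) and np.allclose(Vt @ Vt.T, np.eye(5))

def rank_k(k):
    """Approximation of A built from the first k singular triplets."""
    return U[:, :k] @ np.diag(d[:k]) @ Vt[:k]

# Frobenius error of the rank-k approximation is sqrt(sum of discarded d_i^2).
err1 = np.linalg.norm(A - rank_k(1))
err2 = np.linalg.norm(A - rank_k(2))

# PCA link: treat columns as variables, center them, form the covariance matrix.
Ac = A - A.mean(axis=0)
S = Ac.T @ Ac / (A.shape[0] - 1)
eigvals = np.linalg.eigvalsh(S)[::-1]        # eigenvalues, largest first
dc = np.linalg.svd(Ac, compute_uv=False)     # singular values of the centered data
explained = eigvals / eigvals.sum()          # proportion of variance per component

print(reconstruction_ok, orthonormal_ok)
print(np.round(explained, 3))
```

Whether rows or columns play the role of variables is a choice made here for illustration; the slides note that the method can be applied to either.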
These vectors show the dominant underlying behavior. In PCA, the factorization is applied to the covariance matrix rather than the data matrix itself.

Classification
• The binary classification problem using gene expression data has been studied extensively.
[Figure labels: normal vs. cancer; genes]
Typical questions:
• What genes best discriminate the two classes?
• Can we divide the samples correctly into two classes if the labels were unknown?
• Can we make accurate predictions on new samples?
• Are there unknown subclasses?

Discrimination: Variable Selection by T-test
• Are the means in the two populations significantly different? (two independent sample case)
• The test statistic follows a t-distribution under the null hypothesis.
• Requires normality! Otherwise p-values can be misleading!
[Figure: normal distribution vs. t-distribution]

Variable selection: Wilcoxon Test
• Nonparametric or “distribution-free” test

Actual value:  26  28  52  70  77  80  115  130  141  170
Rank:           1   2   3   4   5   6    7    8    9   10

Rank sums of the two groups: 2+3+4+6+9 = 24 and 1+5+7+8+10 = 31; p = .547 under H0.

An aside: hypothesis testing
• The usual form of a hypothesis test statistic is [equation not preserved in this preview]
• For large samples, this often converges to N(0,1) under the null hypothesis.

Parametric vs. Nonparametric Tests
Parametric tests assume certain distributions. (they may be robust to deviations from Gaussian
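The numbers on the Wilcoxon slide above can be reproduced with a short, standard-library-only sketch. The split of the ten values into two groups is inferred from the ranks listed on the slide, and the function name `rank_sum_exact` is hypothetical. The exact two-sided p-value is obtained by enumerating every way of assigning 5 of the 10 ranks to the first group, which assumes no tied values.

```python
from itertools import combinations

def rank_sum_exact(x, y):
    """Exact two-sided Wilcoxon rank-sum test; assumes no tied values."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w_obs = sum(rank[value] for value in x)          # observed rank sum of x
    n, total = len(x), len(pooled)
    mean_w = n * (total + 1) / 2                     # expected rank sum under H0
    # Under H0 every choice of n ranks out of `total` is equally likely.
    sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    extreme = sum(1 for s in sums if abs(s - mean_w) >= abs(w_obs - mean_w))
    return w_obs, extreme / len(sums)

# Groups inferred from the slide: ranks 2, 3, 4, 6, 9 versus ranks 1, 5, 7, 8, 10.
group_a = [28, 52, 70, 80, 141]
group_b = [26, 77, 115, 130, 170]

w, p = rank_sum_exact(group_a, group_b)
print(w, round(p, 4))   # 24 0.5476
```

Full enumeration is only practical for small samples like this one (252 assignments here); for larger samples a normal approximation or a library routine would be the usual choice.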
