From Promoter Sequence to Expression: A Probabilistic Framework
By Eran Segal, Yoseph Barash, Itamar Simon, Nir Friedman, and Daphne Koller
Presented at RECOMB 2002

Eugene Ke
Bioinformatics Program
May 29, 2002

Key Points

- Biology is in the post-genomic era.
  - We can sequence the whole genome or DNA library of an organism.
  - The challenge now is to understand how DNA works at a detailed level.
- This paper attempts to model the mechanics of gene expression.
- The model is ambitious because it incorporates data from multiple experimental sources.

Central Dogma of Molecular Biology

- DNA is a long text composed of a 4-letter alphabet (A, C, G, T).
- Genes are the meaningful portions of DNA.
- DNA is copied into messenger RNA (mRNA) via transcription.
- mRNA is used to build proteins via translation.
- Proteins perform most of the work in the cell.
- Different cell types perform different functions.
- Therefore, different cell types must have different proteins.

Transcription Factors (TFs)

- Proteins that bind to DNA are called TFs.
- TF binding must be specific.
- Where do TFs bind?
  - Before (upstream of) the coding portion of genes.
  - Close enough to affect expression.
- Sites of binding are called promoters.
- A promoter region is the sequence before a gene, where the promoter(s) are located.

Measuring Gene Expression

- Expression level represents the amount of mRNA present in a cell.
- One DNA array can measure thousands of genes simultaneously.
- Each array is lined with DNA “probes,” one for each specific gene.
- mRNA is extracted from cells and placed on the array.
- If a DNA probe responds, the corresponding gene is being expressed.

Clustering Expression Levels

- Using expression data, cluster similarly expressed genes.
  - Genes in a cluster probably have related functions.
- After clustering, we can search the promoter regions of each cluster.
  - Genes in a cluster are affected by the same TFs, and therefore will have the same promoters.
  - Search promoter regions for similar strings, which are called motifs.
  - A motif is a putative promoter.
- Identify probable TFs using motifs.

Finding Promoters Via Sequence

- We now know the entire genome of some organisms.
- We can search directly for motifs.
- Step 1: Search promoter regions of known genes to find motifs.
- Step 2: Group genes by similar motifs. The logic is that if genes are controlled by the same TFs, the genes will have similar promoters.
- Step 3: Using databases of known transcription factors, search for probable matches.
- Step 4: Experimentally verify using expression levels of multiple TF combinations.

Experimentally Finding Binding Sites

- Localization arrays measure DNA-protein binding.
- They are similar to DNA arrays.
  - Run two experiments.
  - The ratio of intensities shows true binding.
- However…
  - They only indicate whether a TF can bind to a promoter, not whether the TF actually does.
  - The data are very noisy.

Authors’ Goals

- Analyze two different types of information simultaneously:
  - Expression data
  - Sequence data
- Both methods are trying to answer the same question: which genes are co-regulated by the same transcription factors?
- Logically, it is advantageous to combine the data.
  - Expression data provides gene expression with respect to time.
  - Sequence data provides hints about whether a TF binds to a
    gene.
- By combining the data, it should be possible to determine whether a transcription factor regulates a gene AND in what context.

Probabilistic Relational Model (PRM)

- A PRM is an organizational tool.
  - It separates expression and sequence data.
  - It provides a method of relating expression and sequence data.
- Gene
  - Sequence data
  - Localization data
- Experiment & Expression
  - Expression data
- R(t) = hidden variable
  - Whether a transcription factor regulates a gene

Understanding the gene objects

- Transcription factor t
  - Implicit
  - Enumerated and known at the beginning: t1…tm
  - Described by a Position Specific Scoring Matrix (PSSM)
- Gene object gi
  - Contains a promoter region, divided into individual bases S1..Sn.
  - Contains a Regulates variable R(tj)
    - Whether TF tj regulates the gene
    - One R(tj) value for every TF
  - May contain a Localization variable L(tj) for a TF tj

Organizing expression data

- DNA array a
  - Each array has multiple clusters, called ACluster.
  - Each array comes from a specific phase of the cell cycle, denoted by Phase.
    - Specific to the data set
- Expression e
  - Contains expression levels of a gene cluster.
  - Level is the expression level of the gene under a specific context.
  - Array describes the parent experiment.
  - Gene relates the expression level to its gene.

Expression Model

- Expression level depends on three factors:
  - Gene cluster
  - Cell-cycle phase
  - TF regulation, R(t)
- Dependency is modeled as tree-structured conditional distributions.
  - Context-specific effects, i.e.
    phases
  - Combinatorial interactions, such as not R(Swi6) and not R(Fkh2)
- Expression levels are shown at the leaves.
  - Univariate Gaussian distributions

Understanding the Expression Model

- For all genes in cluster 3, when they are not in the S phase and are bound by neither TF Swi6 nor TF Fkh1, the expression level is centered at 0.2.

Position Specific Scoring Matrix (PSSM)

- Binding sites are “degenerate.”
  - Specific, but not absolutely so.
  - Some mutations in the motif are acceptable while others are not.
  - Some positions in the motif are highly conserved.
  - In the diagram at left, the height of each letter represents its degree of conservation.
- A PSSM models acceptable TF binding sites.
  - Each position is represented by a probability of being A, C, G, or T.
  - A PSSM is a 4×N matrix, where N is the length of the motif.

L(t) is noisy evidence concerning R(t)

- Localization data is labeled as g.L(t).
- Experimental data gives a p-value for each L(t).
- If R(t) is true, we want L(t) to be small.
  - Because L(t) is a p-value, a small value means we have high confidence.
- If R(t) is false, we want L(t) to be large.
  - The data is due to background noise, so we have low confidence.
- We assume the probability density function is:

  $\mathrm{pdf}(L(t) = p \mid R(t) = \mathrm{true}) = c\,e^{-wp}$

  - p is the experimental p-value
  - w is an arbitrary weighting factor
  - c is a normalizing constant equal to $c = \dfrac{w}{1 - e^{-w}}$
- In other words, L(t) is a noisy sensor, used only as “guidance” for R(t).

Expression Model Learning

- Two main goals of the expression model:
  - Learn distributions of expression levels
  - Learn qualitative aspects of the tree structure
- Tree Structure
  - Scoring Function
    - Data Set, Tree Structure, Gaussian distribution parameters
  - Greedy local search
    - Trim operation removes nodes
    - Split operation adds nodes

Sequence Model

In essence, a
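
As a rough illustration of the PSSM slide, the sketch below stores a 4×N matrix of per-position base probabilities and scores each window of a promoter by its log-odds against a uniform background. The matrix values, the uniform background, the example promoter string, and the function names are all illustrative assumptions; none of them come from the paper.

```python
import math

BASES = "ACGT"

# Hypothetical 4xN PSSM for a motif of length N = 4 (values are made up).
# Rows follow BASES order; every column sums to 1.
# Columns 0-2 are strongly conserved; column 3 is fully degenerate.
PSSM = [
    [0.7, 0.1, 0.1, 0.25],  # A
    [0.1, 0.7, 0.1, 0.25],  # C
    [0.1, 0.1, 0.7, 0.25],  # G
    [0.1, 0.1, 0.1, 0.25],  # T
]


def site_log_odds(site, pssm=PSSM, background=0.25):
    """Log-odds of `site` under the PSSM versus a uniform background."""
    if len(site) != len(pssm[0]):
        raise ValueError("site length must equal motif length")
    score = 0.0
    for pos, base in enumerate(site):
        row = BASES.index(base)
        score += math.log(pssm[row][pos] / background)
    return score


def best_site(promoter, pssm=PSSM):
    """Slide the motif along the promoter; return (offset, score) of the best window."""
    n = len(pssm[0])
    scored = [
        (site_log_odds(promoter[i:i + n], pssm), i)
        for i in range(len(promoter) - n + 1)
    ]
    score, offset = max(scored)
    return offset, score


offset, score = best_site("TTACGTTT")
```

A window that matches the conserved columns scores well regardless of the base in the degenerate last column, which is the sense in which a binding site is specific but not absolutely so.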
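
The localization density on the "L(t) is noisy evidence" slide, read here as pdf(L(t) = p | R(t) = true) = c·exp(-w·p) with c = w / (1 - exp(-w)), can be sanity-checked numerically. The sketch below assumes p-values lie in [0, 1] and uses an arbitrary weight w = 5 chosen only for illustration; the function names are hypothetical.

```python
import math


def normalizer(w):
    """Constant c that makes c * exp(-w * p) integrate to 1 over p in [0, 1]."""
    return w / (1.0 - math.exp(-w))


def localization_pdf(p, w):
    """Density of an observed p-value given R(t) = true, as read from the slide."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    return normalizer(w) * math.exp(-w * p)


def integrate(f, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


w = 5.0  # arbitrary illustrative weight, not a value from the paper
total = integrate(lambda p: localization_pdf(p, w))
```

With this form, small p-values receive much higher density than large ones when R(t) is true, which matches the slide's point that a small L(t) indicates high confidence in binding.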