UCSD CSE 254 - From Promoter Sequence to Expression - D146508



School name University of California, San Diego

Course CSE 254 - Seminar on Learning Algorithms

Pages 14



From Promoter Sequence to Expression: A Probabilistic Framework
By Eran Segal, Yoseph Barash, Itamar Simon, Nir Friedman, and Daphne Koller
Presented at RECOMB 2002

Eugene Ke
Bioinformatics Program
May 29, 2002

Key Points
- Biology is in the post-genomic era.
- We can sequence the whole genome or DNA library of organisms.
- The challenge now is to understand how DNA works on a detailed level.
- This paper attempts to model the mechanics of gene expression.
- The model is ambitious, as it incorporates data from multiple experimental sources.

Central Dogma of Molecular Biology
- DNA is a long text composed from a 4-letter alphabet (A, C, G, T).
- Genes are the meaningful portions of DNA.
- DNA is converted into messenger RNA (mRNA) via transcription.
- mRNA is used to build proteins via translation.
- Proteins perform all the work in the cell.
- Different cell types perform different functions.
- Therefore, cell types must have different proteins.

Transcription Factors (TFs)
- Proteins that bind to DNA are called TFs.
- TF binding must be specific.
- Where do TFs bind?
  - Before the coding portion of genes.
  - Close enough to affect expression.
- Sites of binding are called promoters.
- A promoter region is the sequence before a gene, where the promoter(s) are.

Measuring Gene Expression
- Expression level represents the amount of mRNA present in a cell.
- One DNA array can measure thousands of genes simultaneously.
- Each array is lined with DNA “probes,” one for each specific gene.
- mRNA is extracted from cells and placed on the array.
- If a DNA probe responds, the corresponding gene is being expressed.

Clustering Expression Levels
- Using expression data, cluster similarly expressed genes.
- Genes in a cluster probably have related functions.
- After clustering, we can search the promoter regions of clusters.
- Genes in a cluster are affected by the same TFs, and therefore will have the same promoters.
- Search promoter regions for similar strings, which are called motifs.
- A motif is a putative promoter.
- Identify probable TFs using motifs.

Finding Promoters Via Sequence
- We now know the entire genome of some organisms.
- We can search directly for motifs.
- Step 1: Search promoter regions of known genes, find motifs.
- Step 2: Group genes by similar motifs. The logic is that if genes are controlled by the same TFs, the genes will have similar promoters.
- Step 3: Using databases of known transcription factors, search for probable matches.
- Step 4: Experimentally verify using expression levels of multiple TF combinations.

Experimentally Finding Binding Sites
- Localization arrays measure DNA-protein binding.
- They are similar to DNA arrays.
- Run two experiments.
- The ratio of intensities shows true binding.
- However…
  - Only indicates if a TF can bind to a promoter, not if the TF actually does.
  - Very noisy.

Authors’ Goals
- Analyze two different types of information simultaneously:
  - Expression data
  - Sequence data
- Both methods are trying to answer the same question: what genes are co-regulated by the same transcription factors?
- Logically, it is advantageous to combine data.
  - Expression data provides gene expression with respect to time.
  - Sequence data provides hints whether a TF binds to a gene.
- By combining data, it should be possible to determine whether a transcription factor regulates a gene AND under what context.

Probabilistic Relational Model (PRM)
- PRM is an organizational tool.
- Separate expression and sequence data.
- Method of relating expression and sequence data.
- Gene: sequence data, localization data
- Experiment & Expression: expression data
- R(t) = hidden variable: whether a transcription factor regulates a gene

Understanding the gene objects
- Transcription factor t
  - Implicit
  - Enumerated and known at the beginning: t1…tm
  - Described by a Position Specific Scoring Matrix (PSSM)
- Gene object gi
  - Contains a promoter region, divided into individual bases S1..Sn.
  - Contains a Regulates variable R(tj)
    - Whether a TF tj regulates a gene
    - R(tj) value for every TF
  - May contain a Localization variable L(tj) for a TF tj

Organizing expression data
- DNA array a
  - Each array has multiple clusters, called ACluster.
  - Each array comes from a specific phase of the cell cycle, denoted by Phase
    - Specific to data set
- Expression e
  - Contains expression levels of a gene cluster.
  - Level is the expression level under a specific context.
  - Array describes the parent experiment.
  - Gene correlates the gene and expression level.

Expression Model
- Expression level depends on three factors:
  - Gene cluster
  - Cell-cycle phase
  - TF regulation, R(t)
- Dependency is modeled as tree-structured conditional distributions
  - Context-specific effects, i.e. phases
  - Combinatorial interactions, such as not R(Swi6) and not R(Fkh2)
- Expression levels are shown at leaves
  - Univariate Gaussian distributions

Understanding the Expression Model
- For all genes in cluster 3: when not in the S phase and bound by neither TF Swi6 nor TF Fkh1, the expression level is centered at 0.2.

Position Specific Scoring Matrix (PSSM)
- Binding sites are “degenerate”
  - Specific but not absolutely so.
  - Some mutations in the motif are acceptable while others are not.
  - Some positions in the motif are highly conserved.
  - In diagram at left, height of letter represents degree of conservation.
- PSSM models acceptable TF binding sites.
  - Each position is represented by a probability of being A, C, G, or T.
  - PSSM is a 4xN matrix, where N is the length of the motif.

L(t) is noisy evidence concerning R(t)
- Localization data is labeled as g.L(t).
- Experimental data gives a p-value for each L(t).
- If R(t) is true, we want L(t) to be small.
  - This means we have high confidence, as L(t) is a p-value.
- If R(t) is false, we want L(t) to be large.
  - Data is due to background noise; we have low confidence.
- We assume the probability distribution function is: $\mathrm{pdf}(L(t) = p \mid R(t) = \mathrm{true}) = c\,e^{-wp}$
  - p is the experimental p-value
  - w is an arbitrary weighting factor
  - c is a normalizing constant equal to $c = \frac{w}{1 - e^{-w}}$
- In other words, L(t) is a noisy sensor, used only as “guidance” for R(t).

Expression Model Learning
- Two main goals of the expression model
  - Learn distributions of expression levels
  - Learn qualitative aspects of the tree structure
- Tree Structure
- Scoring Function
- Data Set, Tree Structure, Gaussian distribution parameters
- Greedy local search
  - Trim operation removes nodes
  - Split operation adds nodes

Sequence Model
In essence, a
