DARTMOUTH BIOL 039 - Semantic Similarity


Semantic similarity and its use in biology
Viktor Martyanov
Bio 39
Feb 21, 2008

What is similarity?
• Wikipedia defines it as 'some degree of symmetry in either analogy or resemblance between two or more concepts or objects. The notion of similarity rests on either exact or approximate repetitions of patterns in the compared items.'

Types of protein similarity
• There are two major similarity assessment schemes (Ouzounis et al. 2003):
  – Structural classification measures similarity on the basis of protein sequence or tertiary structure
  – Functional classification assesses similarity in terms of functional features

Functional classification
• It includes enzyme reaction mechanisms, biochemical pathways, functional roles, cellular localization, etc.
• This categorization can be based on the notion of semantic similarity

Semantic similarity
• A set of objects is assigned a metric based on the likeness of their meaning (similarity in knowledge content)
• Linguistics – two words with one or more sense(s)
• Biology – two gene products with one or more descriptive term(s)
• Semantic similarity is based on statistical and topological information about terms and their interrelationship in a vocabulary (Gene Ontology)

Introduction to Gene Ontology
• The amount of biological data is enormous
• There are many species-specific databases
• There is an understanding of a single universe of genes and proteins
• It is important to organize this shared knowledge in an organism-independent manner
• Therefore, we need to consistently describe gene products in different databases by using a common language

Applications of Gene Ontology
• Integrating proteomic information from different organisms
• Assigning functions to protein domains
• Finding functional similarities in disease/age-related processes
• Analyzing groups of co-expressed genes
• Automating literature search for gene function
• Verifying models of genetic, metabolic and protein interaction networks

Gene Ontology structure
• It is a shared and structured system of annotating molecular features across organisms
• It comprises three ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC)
• Each category is represented as a hierarchical network
• Multiple inheritance and multiple progeny are allowed
• Two types of child-to-parent relationships: is-a (subclass) and part-of (component)
Wang et al. '07

Gene Ontology statistics
• As of February 2008:
  – 40 model organisms including:
    • A. thaliana, B. anthracis, B. taurus, C. elegans, D. rerio, D. melanogaster, G. gallus, H. sapiens, M. musculus, O. sativa, R. norvegicus, S. cerevisiae, S. pombe, V. cholerae
  – 24911 descriptive terms, including:
    • 14589 Biological Process:
      – cellular physiological process, pyrimidine metabolism
    • 8258 Molecular Function:
      – catalytic activity, adenylate cyclase activity
    • 2064 Cellular Component:
      – nucleus, proteasome

Sample GO tree
[Figure: sample GO tree]

Biology example
• Cluster genes by their expression profiles and estimate cluster coherence
• Calculate p-value and evaluate functional enrichment
• Use semantic similarity to quantify functional relationship between gene products
Han et al. '05

Variable semantic similarity approaches
• Edge-based metrics:
  – Measure distance between nodes in terms of number of edges
  – More advanced metrics include additional criteria:
    • Network density, node depth, link type
• Node-based metrics:
  – Measure similarity between nodes as the extent to which they share information in common
  – Look at either parent node only or at both parent and child nodes
• Combined metrics:
  – Try to account for both approaches

What things to look for?
• Number of edges in the shortest path
• Parent/child node depth
• Local/average network density
• Link type
• Parent node information content
• Parent/child node information content

Semantic similarity timeline
Year            Author(s)             Approach
1989            Rada and Bicknell     Simple edge counting
1995            Resnik                Information content of the immediate parent
1998            Lin                   Information content of the parent and children
1998            Jiang and Conrath     Combination of edge-based and information content metrics
2003-present                          Biological application
2006-present                          Development of biology-oriented semantic similarity metrics

Rada and Bicknell
• First edge-counting method
• Distance between two terms is the length of the shortest path between them:
    dist(c1, c2) = minimal number of edges on a path connecting c1 and c2
• More intuitive, natural way of evaluating semantic similarity
• Assumes that nodes and edges are uniformly distributed and all edges have equal weight

Rada and Bicknell
[Slide content not preserved]

Resnik
• First information content method
• Similarity between two concepts expressed as the amount of shared information:
    sim(c1, c2) = max over c in S(c1, c2) of [−log p(c)]
• c1, c2 – terms
• S(c1, c2) – the set of common ancestors of the terms
• p(c) is calculated as a relative frequency
• Consistently shown to outperform other metrics
• Disregards network structure

Resnik
[Slide content not preserved]

Lin
• Alternative information-content approach
• Similarity expressed as the ratio of the common information to the full description information:
    sim(c1, c2) = 2 · log p(c) / (log p(c1) + log p(c2)), where c is the most informative term in S(c1, c2)
• Considers information content of parent and child terms
• Does not reflect the level of detail of the parent node

Lin
[Slide content not preserved]

Jiang and Conrath
[Slide content not preserved]

Jiang and Conrath
[Slide content not preserved]

Schlicker et al.
• Combines Resnik and Lin measures into a single semantic similarity metric
• Combines different ontologies into a single similarity score

Semantic similarity applications
• Four major groups:
  – Correlation between semantic and sequence similarity
  – Correlation between semantic similarity and gene expression
  – Genetic and protein interactions
  – Gene clustering and microarray data analysis

Semantic and sequence similarity
• Lord et al. (2003)
  – Found semantic similarity metrics to correlate well with gene sequence similarity
• Björklund et al. (2005)
  – Studied domain rearrangements in protein evolution
  – Found domain distance metric to agree with sequence similarity and semantic similarity (MF)
• Friedberg and Godzik (2005)
  – Combined semantic similarity with fragment-based interfold similarity in order to analyze protein fold space
  – Interfold similarity correlated well with semantic similarity
• Schlicker et al. (2006)
  – Applied a combination of Resnik and Lin metrics to protein analysis
  – Their method demonstrated good agreement with traditional sequence similarity approaches

Semantic similarity and gene expression
• Azuaje and Bodenreider (2004)
  – Found that high degree of semantic similarity is associated with significant levels of expression correlation
• Wang et al. (2004)
  – Demonstrated significant quantitative relationships between semantic similarity and expression correlation for more than 3 million
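
Worked sketch of the metrics
A minimal Python sketch of the metrics surveyed in these slides: edge counting after Rada and Bicknell, and the information-content measures of Resnik, Lin, and Jiang and Conrath. The toy ontology, the annotation counts, and all function names are invented for illustration; they do not come from the lecture or from Gene Ontology. The sketch assumes p(c) is the fraction of annotations attached to a term or to any of its descendants, uses natural logarithms, counts edges without regard to direction, and implements Jiang and Conrath in the commonly used simplified distance form IC(c1) + IC(c2) − 2·IC(most informative common ancestor).

```python
import math
from collections import deque

# Hypothetical toy ontology (child -> parents); real GO terms may have several parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "nucleotide metabolism": ["metabolism"],
    "pyrimidine metabolism": ["nucleotide metabolism"],
    "purine metabolism": ["nucleotide metabolism"],
}

# Made-up numbers of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0, "metabolism": 2, "transport": 4,
    "nucleotide metabolism": 2,
    "pyrimidine metabolism": 4, "purine metabolism": 4,
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_probabilities():
    """p(c): share of all annotations attached to c or to any descendant of c."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    return {term: n / total for term, n in cumulative.items()}


P = term_probabilities()


def ic(term):
    """Information content -log p(c), written as log(1/p); needs p(c) > 0."""
    return math.log(1.0 / P[term])


def edge_distance(c1, c2):
    """Edge counting: fewest edges between two terms, ignoring edge direction."""
    neighbours = {t: set(ps) for t, ps in PARENTS.items()}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbours[parent].add(child)
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        term, dist = queue.popleft()
        if term == c2:
            return dist
        for nxt in neighbours[term] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return math.inf  # the two terms are not connected


def resnik(c1, c2):
    """Information content of the most informative common ancestor."""
    return max(ic(c) for c in ancestors(c1) & ancestors(c2))


def lin(c1, c2):
    """Shared information relative to the information in both terms."""
    denom = ic(c1) + ic(c2)
    # Assumed convention: two zero-information terms count as identical.
    return 2 * resnik(c1, c2) / denom if denom > 0 else 1.0


def jiang_conrath_distance(c1, c2):
    """Simplified Jiang-Conrath distance (smaller means more similar)."""
    return ic(c1) + ic(c2) - 2 * resnik(c1, c2)


pair = ("pyrimidine metabolism", "purine metabolism")
print("edges:", edge_distance(*pair))
print("Resnik:", round(resnik(*pair), 3))
print("Lin:", round(lin(*pair), 3))
print("Jiang-Conrath distance:", round(jiang_conrath_distance(*pair), 3))
```

On this toy data the two sibling terms share "nucleotide metabolism" as their most informative common ancestor, whereas "pyrimidine metabolism" and "transport" share only the root, so their Resnik and Lin similarities are both 0. Edge counting sees only the path length, which is why it cannot distinguish a short path near the root from an equally short path deep in the hierarchy.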

