
UW-Madison COMPSCI 838 Topic - Modeling Protein-Protein Interactions in Biomedical Abstracts with Latent Dirichlet Allocation


Modeling Protein-Protein Interactions in Biomedical Abstracts with Latent Dirichlet Allocation

David Andrzejewski
CS 838 - Final Project
June 14, 2006

Abstract

A major goal in biomedical text processing is the automatic extraction of protein interaction information from scientific articles or abstracts. We approach this task with a topic-based generative model. Under the model, sentences in biomedical abstracts can be generated by either an 'interaction' topic if they contain or discuss interacting proteins, or a 'background' topic otherwise. This structure is implemented as a Latent Dirichlet Allocation (LDA) model. The model structure was previously developed as part of work with Mark Craven and Jerry Zhu. During this project, parameter inference equations and algorithms were derived. Future work will consist of implementation and experimental testing.

1 Introduction

Proteins are biomolecules made up of amino acids which occupy a central role in cellular biology. After water, they make up the next highest proportion of cellular weight [8]. Interactions between proteins are very important in many vital biological processes. Because of this, protein-protein interaction information can be very useful for both biological scientists and computational systems designed to analyze biological data. This information can be found in a structured format in protein-protein interaction databases like the Database of Interacting Proteins (DIP) [10]. These databases are populated by human readers who read the relevant research articles and then enter the interaction data into the database. This manual entry step can be a severe bottleneck in such a system, especially given the explosive growth of the biosciences literature.
The total number of articles indexed by Medline, for example, has been growing exponentially, adding an average of 1800 new articles per day in 2005 [5]. This situation motivates the need for tools to assist in the extraction of protein-protein interaction information from the scientific literature. An early approach by the DIP team used discriminating words to identify Medline abstracts that were likely to discuss protein interactions [7]. Human curators could then focus their attention on these high-scoring articles, yielding a more efficient use of valuable human time. More ambitious systems aim to extract interacting pairs directly from text, often using rule- or pattern-based approaches [4]. Our system is designed to automatically extract interacting pairs using a 'bag of words' approach with sentence-level topics.

2 The model

The central feature of the model is the sentence-level topic structure. Each sentence is considered to be generated by either an 'interaction' topic or a 'background' topic. Each topic is associated with a different 'bag of words' multinomial. Furthermore, each interaction sentence is associated with exactly one pair of proteins. Words in an interaction sentence can be drawn either from the interaction word bag or from a 'protein pair bag' which contains all possible identifiers for each of the two proteins. This is necessary because there may be multiple identifiers which refer to the same protein.

The p value represents the probability of a sentence in the abstract having the interaction topic. For each document, a new p value is generated from a Dirichlet distribution. This allows the proportion of interacting sentences to vary between different documents. This flexibility should be valuable for modeling the abstracts of different types of articles, such as articles that are primarily concerned with protein-protein interactions or those that mention them only in passing (if at all).
The use of a multinomial over latent topics whose parameters are themselves generated by a Dirichlet distribution is characteristic of the Latent Dirichlet Allocation (LDA) model [2]. The general outline of our generative model is as follows:

for each doc d in our corpus C
    p_d = dir(α)
    for each sentence ℓ
        r_ℓ = mult(p_d)
        t_ℓ = mult(θ)
        for each word k
            s_ℓk = bern(µ)
            if r_ℓ == 0
                w_ℓk = mult(β_0)
            else if r_ℓ == 1
                if s_ℓk == 0
                    w_ℓk = mult(β_1)
                else if s_ℓk == 1
                    w_ℓk = pair(t_ℓ)

The model parameters and variables are:

• θ = the protein pair selection multinomial (θ_t = probability of selecting protein pair t)
• α = Dirichlet hyperparameter for the topic selection variable
• β = word bags for the 'interaction' and 'background' topics (β_jw = probability of word w under topic j)
• µ = probability for the pair switch variable in 'interaction' sentences (µ_0 = P(s = 0))
• r = topic switch variable (r = 0 means background topic, r = 1 means interaction topic)
• s = pair switch variable for 'interaction' sentences (s = 1 means select a protein identifier, else s = 0 means select from the word bag)
• t = protein pair switch variable (specifies a pair of proteins)
• w = the observed words (w_dℓk is the word in document d, sentence ℓ, position k)

Figure 1: Graphical representation of the model (plate diagram with plates over the K words, L sentences, and D documents, containing the nodes θ, α, p, β, t, r, µ, s, and w).

3 Parameter estimation

3.1 Document likelihood

The parameters of our model are α, θ, β, and µ. The hidden variables are p, t, r, and s, and the only observed values are the actual words w.
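The generative process above can be sketched as a small simulation. The following is a minimal sketch only: the vocabulary size, the identifier sets, and all parameter values (α, θ, µ_0, β) are hypothetical illustrations, not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): 6-word vocabulary, where word
# indices 4 and 5 act as protein identifiers; 2 candidate protein pairs.
alpha = np.array([1.0, 1.0])          # Dirichlet hyperparameter for p_d
theta = np.array([0.5, 0.5])          # protein pair selection multinomial
mu0 = 0.4                             # P(s = 0): draw from interaction word bag
beta = np.array([[0.3, 0.3, 0.2, 0.2, 0.0, 0.0],   # beta_0: background bag
                 [0.1, 0.1, 0.4, 0.4, 0.0, 0.0]])  # beta_1: interaction bag
sigma = [[4], [5]]                    # identifier set sigma_t for each pair t

def generate_doc(n_sentences=3, n_words=5):
    """Sample one document under the sentence-level topic model."""
    p_d = rng.dirichlet(alpha)                 # per-document topic proportions
    doc = []
    for _ in range(n_sentences):
        r = int(rng.choice(2, p=p_d))          # topic switch: 0 bg, 1 interaction
        t = int(rng.choice(len(sigma), p=theta))  # protein pair for this sentence
        sent = []
        for _ in range(n_words):
            if r == 0:
                w = rng.choice(len(beta[0]), p=beta[0])
            elif rng.random() < mu0:           # pair switch s = 0
                w = rng.choice(len(beta[1]), p=beta[1])
            else:                              # pair switch s = 1
                w = rng.choice(sigma[t])       # emit a protein identifier
            sent.append(int(w))
        doc.append((r, t, sent))
    return doc

doc = generate_doc()
```

Note that because β_0 places zero mass on the identifier words in this toy setup, protein identifiers can only appear in 'interaction' sentences, mirroring the role of the pair bag in the model.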
The log likelihood of a single document can be obtained by marginalizing over the hidden variables:

\log P(d \mid \theta, \alpha, \beta, \mu) = \int_{p} \sum_{\ell} \sum_{(r_\ell, t_\ell)} \sum_{k} \sum_{s} \Big[ \log P(w_{\ell k} \mid r_\ell, t_\ell, s, \beta) + \log P(r_\ell \mid p) + \log P(p \mid \alpha) + \log P(t_\ell \mid \theta) + \log P(s \mid \mu) \Big] \, dp

where ℓ and k are indices into sentences and words within sentences, respectively. P(w_{ℓk} | r_ℓ, t_ℓ, s_{ℓk}, β, θ) uses indicator functions of t_ℓ, r_ℓ, and s_{ℓk} to model the probability of a single word:

P(w \mid \beta, t, s, r) = \mathbf{1}_{r=0}\, \beta_{0w} + \mathbf{1}_{r=1} \big( \mathbf{1}_{s=0}\, \beta_{1w} + \mathbf{1}_{s=1}\, y_t(w) \big)

y_t(w) represents the protein pair specified by the switch variable t:

y_t(w) = \begin{cases} 1 & w \in \sigma_t \\ 0 & \text{else} \end{cases}

where σ_t is the set of identifiers for the two proteins in the protein-protein pair specified by the index t.

3.2 Corpus likelihood

In order to determine the log likelihood of a corpus C, we assume the documents to be independent of one another. This allows us to simply extend the above equation by summing over all training documents:

\log P(C \mid \theta, \alpha, \beta, \mu) = \sum_{d \in C} \log P(d \mid \theta, \alpha, \beta, \mu)

3.3 Variational EM

To set the parameters for our model, we want to find the parameter values that maximize the log likelihood of the training corpus (maximum likelihood estimation, or MLE). We cannot directly optimize by taking the derivatives with respect to the parameters and setting them to zero, due to the presence of the hidden variables. Furthermore, we cannot use a standard
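Returning to the word probability of Section 3.1, its indicator-function decomposition can be checked numerically. The sketch below uses hypothetical β, µ_0, and identifier sets (none of these values come from the paper) and also shows the pair switch s marginalized out with P(s = 0) = µ_0:

```python
import numpy as np

# Hypothetical parameters for illustration: 6-word vocabulary, where word
# indices 4 and 5 serve as the identifiers for two candidate protein pairs.
beta = np.array([[0.3, 0.3, 0.2, 0.2, 0.0, 0.0],   # beta_0: background topic
                 [0.1, 0.1, 0.4, 0.4, 0.0, 0.0]])  # beta_1: interaction topic
mu0 = 0.4                     # P(s = 0)
sigma = [{4}, {5}]            # identifier sets sigma_t for each pair t

def y(t, w):
    """y_t(w): 1 if word w is an identifier of protein pair t, else 0."""
    return 1.0 if w in sigma[t] else 0.0

def word_prob(w, r, t, s):
    """P(w | beta, t, s, r) via the indicator-function decomposition."""
    if r == 0:                       # background sentence
        return beta[0, w]
    return beta[1, w] if s == 0 else y(t, w)

def word_prob_marginal_s(w, r, t):
    """Sum out the pair switch: mu0 * beta_1w + (1 - mu0) * y_t(w)."""
    if r == 0:
        return beta[0, w]
    return mu0 * beta[1, w] + (1.0 - mu0) * y(t, w)
```

For example, with these toy values an identifier word of pair 0 in an interaction sentence gets `word_prob_marginal_s(4, 1, 0)` = 0.4 * 0.0 + 0.6 * 1.0 = 0.6.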

