Penn COGS 502 - Computational linguistic - D2452163

Home> Schools> University of Pennsylvania> Cognitive Science (COGS) > COGS 502> Computational linguistic

Penn COGS 502 - Computational linguistic

Course Cogs 502- Mathematical Foundations for the Study of Language

Pages 12

Download Save

Unformatted text preview:

Computational linguistics: A new tool for exploring biopolymer structures and statistical mechanicsHow computational linguistics applies to the structures of RNA and proteinsParsing sentences using dynamic programmingHow the CKY method works: the detailsA dynamic programming algorithm for hierarchical protein foldingThe hierarchical search method is also a useful model for the physical folding processThis search method can account for the Levinthal paradoxPhysical kinetics is more parallel than in Monte CarloThis algorithm captures the Plaxco-Simons-Baker relationship between the native structure and folding rateZAMDP identifies slow- and fast-folding proteinsZAMDP identifies slow- and fast-folding routesDynamic programming is also useful for computing statistical mechanical partition functions of heteropolymersThe Ascending Level Model (ALM) of helix bundlesPredictions of the thermal properties of helices and helix bundlesOther applications of computational linguistics to biological polymersConclusionsAcknowledgementsReferencesFeature ArticleComputational linguistics: A new tool for exploring biopolymerstructures and statistical mechanicsKen A. Dilla,*, Adam Lucasb, Julia Hockenmaierc, Liang Huangc,David Chiangd, Aravind K. JoshicaDepartment of Pharmaceutical Chemistry, University of California, San Francisco, N 472-F, 600 16th Street, San Francisco, CA 94143-2240, United StatesbDepartment of Mathematics and Computer Science, Saint Mary’s College of California, Moraga, CA 94575, United StatescInstitute for Research in Cognitive Science and Department of Computer and Information Science, University of Pennsylvania,3401 Walnut Street, Suite 400A, Philadelphia, PA 19104, United StatesdUSC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, United StatesReceived 1 March 2007; received in revised form 4 May 2007; accepted 8 May 2007Available online 23 May 2007AbstractUnlike homopolymers, biopolymers are composed of specific sequences of different types of monomers. In proteins and RNA molecules,one-dimensional sequence information encodes a three-dimensional fold, leading to a corresponding molecular function. Such folded structuresare not treated adequately through traditional methods of polymer statistical mechanics. A promising new way to solve problems of the statisticalmechanics of biomolecules comes from computational linguistics, the field that uses computers to parse and understand the sentences in naturallanguages. Here, we give two examples. First, we show that a dynamic programming method of computational linguistics gives a fast way tosearch protein models for native structures. Interestingly, the computational search process closely resembles the physical folding process. Sec-ond, linguistics-based dynamic programming methods are also useful for computing partition functions and densities of states for some foldablebiopolymers e helix-bundle proteins are reviewed here. In these ways, computational linguistics is helping to solve problems of the searchingand counting of biopolymer conformations.Ó 2007 Elsevier Ltd. All rights reserved.Keywords: Biopolymers; Proteins; Lattice models1. How computational linguistics applies to the structuresof RNA and proteinsWe review here new computational ways to enumerate theconformations of biopolymers, drawn from the seeminglydistant field of computational linguistics. Why is computa-tional linguistics relevant to biopolymer statistical mechanics?A biopolymer chain encodes a one-dimensional information,resembling the way a sentence encodes information in a linearstring of words. Just as a sentence is a linear chain of wordstaken from a vocabulary, a linear heteropolymer moleculeis a covalent linear chain of monomers taken from an ‘‘alpha-bet’’ of different chemical moieties. Just as every sentence ina natural language has a particular grammatical structure thatencodes its meaning in a one-dimensional string of words, pro-teins encode their three-dimensiona l structures (and functions)in a one-dimensional string of amino acids. In this paper, weillustrate the computational linguistics approach with two ex-amples: (1) an efficient algorithm for predicting the nativestates and folding routes of proteins and (2) an algorithm forcomputing the partition functions and stabilities of helix-bundle proteins.Consider the process of extracting information from a spo-ken or written sentence. This is called diagramming or parsinga sentence. Parsing is a process by which a listener: (a) beginswith the one-dimensional string of words, (b) searches through* Corresponding author.E-mail address: [email protected] (K.A. Dill).0032-3861/$ - see front matter Ó 2007 Elsevier Ltd. All rights reserved.doi:10.1016/j.polymer.2007.05.018Polymer 48 (2007) 4289e4300www.elsevier.com/locate/polymera potentially large number of different topologies that representthe many possible relationship s among different words andphrases, and (c) chooses the one that conveys the correct singlemeaning of the sentence. This is how a listener comes to under-stand a sentence. A similar process is needed for predicting thethree-dimensional structure of a protein molecule from itssequence of monomer units (see Fig. 1). Computer proteinstructure prediction, too, is a process: (a) starting with aone-dimensional string information, (b) considering all the pos-sible topologies (conformations) that could represent the possi-ble native structure of the protein, and then (c) choosing the one(native structure) having the global minimum free energy.In this review, we descr ibe how the automated diagram-ming or parsing algorithms used in computational linguisticsare beginning to contribute insights into biopolymer struct uresand statistical mechanics. These algorithms are all instances ofdynamic programming. Dynamic programming algorithms cansolve large search problems (such as the search for the lowestenergy structure of a protein) by recursively decomposing theminto smaller problems that can be solved independently (e.g.the search for the lowest energy structure of fragments of thisprotein) and using the solutions of these smaller problems tosolve the larger problem. When applied to language, such al-gorithms require a formal grammar, a mathematical descriptionof the possible sentences and associated grammatical struc-tures for a particular language.The key insight is in recognizing that for certain topologicaloptimization problems e

View Full Document


School:
Email:
New Password:
Confirm Password:

Penn COGS 502 - Computational linguistic

Sign up for free to view:

Please select your school