CMU 11722 Grammar Formalism - Transformer Language Models


10-423/10-623 Generative AI
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Transformer Language Models
Matt Gormley, Lecture 2, Jan 22, 2024

Reminders
- Homework 0 (PyTorch, Weights & Biases): out Wed, Jan 17; due Wed, Jan 24 at 11:59pm.
- Two parts: (1) a written part submitted to Gradescope, (2) a programming part submitted to Gradescope.
- Unique policy for this assignment: we will grant essentially any and all extension requests.

Some History of LARGE LANGUAGE MODELS

Noisy Channel Models
- Prior to 2017, two tasks relied heavily on language models: speech recognition and machine translation.
- Definition: a noisy channel model combines a transduction model (the probability of converting y to x) with a language model (the probability of y):
      ŷ = argmax_y p(y | x) = argmax_y p(x | y) p(y)
  where p(x | y) is the transduction model and p(y) is the language model.
- Goal: recover y from x.
- For speech recognition, x is the acoustic signal and y is the transcription.
- For machine translation, x is a sentence in the source language and y is a sentence in the target language.

Large n-Gram Language Models
- The earliest truly large language models were n-gram models.
- Google n-Grams:
  - 2006: first release; English n-grams trained on ~1 trillion tokens of web text (~95 billion sentences); included 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
  - 2009-2010: n-grams released for Japanese, Chinese, Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, and Czech.
- The English n-gram model alone is ~3 billion parameters:
  - Number of unigrams: 13,588,391
  - Number of bigrams: 314,843,401
  - Number of trigrams: 977,069,902
  - Number of fourgrams: 1,313,818,354
  - Number of fivegrams: 1,176,470,663
- Sample entries from the released counts include "serve as the incoming 92", "serve as the incubator 99", ..., "serve as the industry 607", and, from the French data, "accessoire Acheter cet 1402", "accessoire Ajouter au 160", etc.
- Q: Is this a large training set? A: Yes.
- Q: Is this a large model? A: Yes.
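To make the two ingredients above concrete, here is a minimal sketch (not the actual Google n-grams tooling) of a count-based bigram language model p(y) and the noisy-channel decision rule ŷ = argmax_y p(x | y) p(y). The toy corpus, the add-alpha smoothing, and the stand-in channel model are all illustrative assumptions.

    import math
    from collections import Counter

    # Toy corpus standing in for web-scale text (illustrative only).
    corpus = [
        "the bat made noise at night",
        "the bat flew at night",
        "the cat made noise",
    ]

    # Count unigrams and bigrams, padding each sentence with boundary tokens.
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    def lm_logprob(sentence, alpha=1.0):
        """Add-alpha smoothed bigram estimate of log p(y)."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        V = len(unigrams)
        return sum(
            math.log((bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * V))
            for prev, cur in zip(toks, toks[1:])
        )

    def noisy_channel_decode(x, candidates, channel_logprob):
        """Noisy-channel rule: y_hat = argmax_y  log p(x | y) + log p(y)."""
        return max(candidates, key=lambda y: channel_logprob(x, y) + lm_logprob(y))

    # Stand-in channel score (word overlap), just so the argmax is runnable.
    channel = lambda x, y: float(len(set(x.split()) & set(y.split())))
    print(noisy_channel_decode("bat noise night",
                               ["the bat made noise at night", "the cat made noise"],
                               channel))

In a real system the channel term would come from an acoustic or translation model; here it is only a placeholder so the decision rule can be exercised end to end.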
How large are LLMs?
Comparison of some recent large language models (LLMs):

Model               | Creators | Year of release | Training data (tokens) | Model size (parameters)
GPT-2               | OpenAI   | 2019            | 10 billion (40 GB)     | 1.5 billion
GPT-3 (cf. ChatGPT) | OpenAI   | 2020            | 300 billion            | 175 billion
PaLM                | Google   | 2022            | 780 billion            | 540 billion
Chinchilla          | DeepMind | 2022            | 1.4 trillion           | 70 billion
LaMDA (cf. Bard)    | Google   | 2022            | 1.56 trillion          | 137 billion
LLaMA               | Meta     | 2023            | 1.4 trillion           | 65 billion
LLaMA-2             | Meta     | 2023            | 2 trillion             | 70 billion
GPT-4               | OpenAI   | 2023            | ?                      | ?

FORGETFUL RNNS

Ways of Drawing Neural Networks (Recall)

Computation Graph
- The diagram represents an algorithm.
- Nodes are rectangles: one node per intermediate variable in the algorithm. A node is labeled with the function that it computes (inside the box) and with the variable name (outside the box).
- Edges are directed. Edges do not have labels, since they don't need them.
- For neural networks:
  - Each intercept term should appear as a node (if it's not folded in somewhere).
  - Each parameter should appear as a node.
  - Each constant, e.g. a true label or a feature vector, should appear in the graph.
  - It's perfectly fine to include the loss.

Neural Network Diagram
- The diagram represents a neural network.
- Nodes are circles: one node per hidden unit. A node is labeled with the variable corresponding to the hidden unit.
- For a fully connected feed-forward neural network, a hidden unit is a nonlinear function of nodes in the previous layer.
- Edges are directed, and each edge is labeled with its weight. (Side note: we should be careful about how a matrix can be used to indicate the labels of the edges, and the pitfalls there.)
- Other details:
  - Following standard convention, the intercept term is NOT shown as a node; rather, it is assumed to be part of the nonlinear function that yields a hidden unit, i.e. its weight does NOT appear in the picture anywhere.
  - The diagram does NOT include any nodes related to the loss computation.

The accompanying two-layer network example defines the following quantities:
- (A) Input: given x_i, for all i; parameters alpha_{ji} given, for all i, j
- (B) Hidden (linear): a_j = sum_{i=0}^{M} alpha_{ji} x_i, for all j
- (C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), for all j; parameters beta_j given, for all j
- (D) Output (linear): b = sum_{j=0}^{D} beta_j z_j
- (E) Output (sigmoid): y = 1 / (1 + exp(-b)); label y* given
- (F) Loss: J = 1/2 (y - y*)^2

RNN Language Model (Recall)
- Running example: START The bat made noise at night → The bat made noise at night END, where the model emits p(w_1 | h_1), p(w_2 | h_2), ..., p(w_7 | h_7) from hidden states h_1, ..., h_7.
- Key idea:
  1. Convert all previous words to a fixed-length vector h_t = f(w_{t-1}, ..., w_1).
  2. Define the distribution p(w_t | f(w_{t-1}, ..., w_1)) that conditions on that vector.

RNNs and Forgetting

Long Short-Term Memory (LSTM)
- Motivation: standard RNNs have trouble learning long-distance dependencies; LSTMs combat this issue.
  [Figure: an RNN unrolled over time, with inputs x_1, ..., x_T, hidden states h_1, ..., h_T, and outputs y_1, ..., y_T.]
- Motivation: the vanishing gradient problem for standard RNNs. The figure shows sensitivity (darker = more sensitive) to the input at time t = 1. (Figure from Graves, 2012.)
- LSTM units have a rich internal structure. The various gates determine the propagation of information and can choose to remember or forget information. (Figure 4.4, preservation of gradient, from Graves, 2012.)
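As a quick numerical check of the two-layer network quantities (A)-(F) listed under "Ways of Drawing Neural Networks" above, here is a forward-pass sketch in NumPy. The sizes (M = 3 inputs, D = 2 hidden units) and random values are arbitrary illustrations, and the intercept terms are omitted for brevity, as in the neural-network-diagram convention.

    import numpy as np

    # Arbitrary small sizes: M = 3 inputs, D = 2 hidden units; values are made up.
    rng = np.random.default_rng(0)
    x = rng.normal(size=3)             # (A) input x_i
    alpha = rng.normal(size=(2, 3))    # parameters alpha_{ji}
    beta = rng.normal(size=2)          # parameters beta_j
    y_star = 1.0                       # given label y*

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    a = alpha @ x                      # (B) hidden, linear:  a_j = sum_i alpha_{ji} x_i
    z = sigmoid(a)                     # (C) hidden, sigmoid: z_j = 1 / (1 + exp(-a_j))
    b = beta @ z                       # (D) output, linear:  b = sum_j beta_j z_j
    y = sigmoid(b)                     # (E) output, sigmoid: y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_star) ** 2        # (F) loss:            J = 1/2 (y - y*)^2
    print(a, z, b, y, J)

Each line corresponds to one node (or group of nodes) that would appear in the computation graph, including the parameters, the label, and the loss.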

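Returning to the RNN Language Model slide, here is a minimal PyTorch sketch of the key idea: compress the prefix w_1, ..., w_{t-1} into a fixed-length vector h_t and define p(w_t | h_t) as a softmax over the vocabulary. The hyperparameters, vocabulary size, and random token batch are placeholders, not the course's reference implementation.

    import torch
    import torch.nn as nn

    class RNNLanguageModel(nn.Module):
        """Minimal RNN LM: h_t summarizes the prefix so far; p(w_t | h_t) is a softmax."""
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):               # tokens: (batch, T) word ids
            h, _ = self.rnn(self.embed(tokens))  # h: (batch, T, hidden_dim); each h_t encodes the prefix
            return self.out(h)                   # logits over the vocabulary at each position

    # Shifted-by-one training objective: predict token t+1 from the prefix up to t.
    vocab_size = 10_000
    model = RNNLanguageModel(vocab_size)
    tokens = torch.randint(0, vocab_size, (2, 7))   # e.g. "START The bat made noise at night"
    logits = model(tokens[:, :-1])
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))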

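To make the LSTM's "rich internal structure" concrete, here is one cell step with the gates written out explicitly in PyTorch, following the standard input/forget/output-gate formulation. The parameters are random and for shape-checking only; this is a sketch, not the exact parameterization from Graves (2012).

    import torch

    def lstm_cell_step(x_t, h_prev, c_prev, W, U, bias):
        """One LSTM step with the gates written out (standard i/f/g/o formulation)."""
        gates = x_t @ W.T + h_prev @ U.T + bias   # (batch, 4 * hidden)
        i, f, g, o = gates.chunk(4, dim=-1)
        i = torch.sigmoid(i)            # input gate: how much new information to write
        f = torch.sigmoid(f)            # forget gate: how much of c_prev to keep
        g = torch.tanh(g)               # candidate cell update
        o = torch.sigmoid(o)            # output gate: how much of the cell to expose
        c_t = f * c_prev + i * g        # cell state: additive path that preserves gradients
        h_t = o * torch.tanh(c_t)       # hidden state passed to the next step
        return h_t, c_t

    # Random parameters, shape-checking only: batch = 2, input dim = 8, hidden dim = 16.
    B, D_in, H = 2, 8, 16
    x_t = torch.randn(B, D_in)
    h_prev, c_prev = torch.zeros(B, H), torch.zeros(B, H)
    W, U, bias = torch.randn(4 * H, D_in), torch.randn(4 * H, H), torch.zeros(4 * H)
    h_t, c_t = lstm_cell_step(x_t, h_prev, c_prev, W, U, bias)

The additive update c_t = f * c_prev + i * g is the path that lets gradient information survive over long spans, which is the property the Graves figures above illustrate; in practice one would simply swap nn.RNN for nn.LSTM in the RNN language model sketch.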