Stanford CS 224n - Document Classification using Deep Belief Nets

Lawrence McAfee (5434149)
[email protected]
CS 224n, 6/4/08

Document Classification using Deep Belief Nets

Overview

This paper covers the implementation and testing of a Deep Belief Network (DBN) for document classification. Document classification is important because of the increasing need for document organization, particularly of journal articles and online information. Popular applications of document classification include email spam filtering, topic-specific search engines (where we only want to search documents that fall under a certain topic), and sentiment detection (determining whether a document has a more positive or negative tone) [1].

The DBN implemented here is trained with a mostly unsupervised clustering algorithm that groups together documents falling under the same category. Once the documents are grouped, a supervised learning algorithm (an SVM in this case) is used to "assign" labels to the groups. More information and background on DBNs are given in the following section.

Testing of the DBN is performed primarily along three axes: 1) the number of hidden neurons in each layer of the DBN (which determines a "shape" for the overall network), 2) the number of layers in the DBN, and 3) the number of iterations used to train each layer of the DBN. Additionally, the DBN's classification results are compared with those obtained from support vector machine (SVM) and Naïve Bayes (NB) classifiers. Some final experiments investigate how vocabulary size and word preprocessing affect performance.

Preliminary results indicate that the DBN, given the implementations used in this paper, is not a viable alternative to other forms of document classifiers. The DBN has a significantly longer training time, and no matter how many training iterations are used, it appears to have worse accuracy than either the SVM or NB classifiers. A number of suggestions for improvement are given in the final section, at least one of which should greatly increase the performance of the DBN.

Background on Deep Belief Networks

A Deep Belief Network is a generative model consisting of multiple stacked levels of neural networks, each of which can efficiently represent non-linearities in the training data. Typically, these building-block networks are Restricted Boltzmann Machines (more on these later). The idea of a DBN is that since each Restricted Boltzmann Machine (RBM) is non-linear, composing several RBMs together should make it possible to represent highly non-linear, highly varying patterns in the training data.

A Boltzmann Machine (the unrestricted version of an RBM) is a network consisting of a layer of visible neurons and a layer of hidden neurons. Frequently, and as used in this paper, the neurons are binary stochastic neurons, meaning they can be in only one of two states, "on" or "off". The probability of a neuron turning on is a sigmoid function of its bias, the weights on the connections it has with all other neurons in the network, and the states of all other neurons. A Boltzmann machine is defined in terms of energies of configurations of the visible and hidden neurons, where a configuration simply specifies the state of each neuron. To reduce training time complexity, DBNs use the Restricted Boltzmann Machine, which allows connections only between a hidden neuron and a visible neuron; there are no connections between two visible neurons or between two hidden neurons.
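To make the binary stochastic neuron concrete, here is a minimal Python/NumPy sketch, not from the paper: the neuron's probability of turning on is a sigmoid of its bias plus the weighted states of the neurons it connects to, and its state is then sampled from that probability. The function name sample_neuron and the positive sign on the bias/weight terms are illustrative assumptions (the paper's RBM equations below use a negated convention).

```python
import numpy as np

def sigm(x):
    # Logistic sigmoid, as used for binary stochastic neurons.
    return 1.0 / (1.0 + np.exp(-x))

def sample_neuron(bias, weights, states, rng):
    # P(neuron turns on) is a sigmoid of its bias plus the weighted
    # states of the neurons it is connected to; then flip a coin.
    p_on = sigm(bias + weights @ states)
    return int(rng.random() < p_on)

rng = np.random.default_rng(0)
states = np.array([1.0, 0.0, 1.0, 1.0])        # 0/1 states of neighbors
weights = np.array([0.5, -0.2, 0.1, 0.3])      # connection weights
print(sample_neuron(-0.4, weights, states, rng))  # prints 0 or 1
```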
In an RBM, the energy of a configuration is defined as

    Energy(v, h) = h'Wv + b'v + c'h

where (v, h) is the configuration, W is the weight matrix, and b and c are the bias vectors on the visible and hidden neurons, respectively. In training a Boltzmann machine, we present inputs to the network through the visible neurons, and the goal is to update the weights and biases so as to minimize the energy of the network's configuration when the training data is used as input.

In text classification, when presenting the training data (binary bag-of-words vectors indicating the occurrence of each word) to an RBM, most documents within a given category are expected to have similar training vectors and are therefore expected to generate similar configurations on the RBM. For example, assuming we have several documents falling under the two categories "Golf" and "Cycling", after training the RBM (explained next) we will have an energy graph that should look something like the following:

    [Figure: energy as a function of configuration (v, h), with low-energy wells at the configurations corresponding to "Golf" and "Cycling" documents.]

Documents pertaining to "Golf" and "Cycling" have lower energy for their configurations because the training data is made up of documents from these two categories. The probability of a configuration is determined by its energy, with lower-energy configurations having higher probability of occurring:

    P(v, h) = (1/Z) e^{-Energy(v, h)}

where Z is a normalization constant computed by summing over the energies of all possible configurations of the visible and hidden units.

Training an RBM consists of presenting a training vector to the visible units, then using the Contrastive Divergence algorithm [2], alternately sampling the hidden units from p(h|v) and the visible units from p(v|h). Fortunately, when using an RBM we never have to calculate the joint probability P(v, h) above, and sampling is easy:

    P(v_k = 1 | h) = sigm(-b_k - Σ_j W_{jk} h_j)
    P(h_j = 1 | v) = sigm(-c_j - Σ_k W_{jk} v_k)

After doing just a single iteration of Gibbs sampling (according to Contrastive Divergence), we can update the weights and biases of the RBM as follows:

    W_{jk} = W_{jk} - α (v^0_k h^0_j - v^1_k h^1_j)
    b_k = b_k - α (v^0_k - v^1_k)
    c_j = c_j - α (h^0_j - h^1_j)

where α is the learning rate, v^0 is sampled from the training data, h^0 is sampled from P(h|v^0), v^1 is sampled from P(v|h^0), and h^1 is sampled from P(h|v^1). We repeat this update for several samples of the training data.

To train a stack of RBMs, we train each layer iteratively: the lowest layer is trained on the training data, then each next higher layer is trained by treating the activities of the previous layer's hidden units as the input data/visible units for that higher layer (a code sketch of this procedure follows below). This is best illustrated as a picture:

    [Figure: a stack of RBMs trained layer by layer, with each layer's hidden activities serving as the next layer's visible input.]

The idea of the layer-by-layer training is that each layer "learns" ...
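Pulling the pieces above together, the following is a minimal NumPy sketch of CD-1 training for a single RBM and of the greedy layer-wise stacking, using the paper's sign convention (Energy(v, h) = h'Wv + b'v + c'h, hence the negated terms inside sigm and the subtractive updates). The paper itself gives no implementation; the function names (cd1_update, train_rbm, train_dbn) and hyperparameter values here are illustrative assumptions, not the author's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Draw 0/1 states from elementwise Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(W, b, c, v0, alpha=0.1):
    # One Contrastive Divergence (CD-1) step: a single alternation of
    # Gibbs sampling, followed by the parameter updates given above.
    h0 = sample(sigm(-c - W @ v0))      # h0 ~ P(h | v0)
    v1 = sample(sigm(-b - W.T @ h0))    # v1 ~ P(v | h0)
    h1 = sample(sigm(-c - W @ v1))      # h1 ~ P(h | v1)
    W -= alpha * (np.outer(h0, v0) - np.outer(h1, v1))
    b -= alpha * (v0 - v1)
    c -= alpha * (h0 - h1)

def train_rbm(data, n_hidden, iters=50, alpha=0.1):
    # `data` holds binary bag-of-words row vectors, one per document.
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(iters):
        for v0 in data:
            cd1_update(W, b, c, v0, alpha)
    return W, b, c

def train_dbn(data, layer_sizes, iters=50):
    # Greedy layer-wise training: the hidden activities of each trained
    # RBM become the visible data for the next RBM in the stack.
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden, iters)
        layers.append((W, b, c))
        x = sample(sigm(-c - x @ W.T))  # propagate data up one layer
    return layers

# Example: a tiny 3-layer DBN on random binary "documents".
docs = (rng.random((20, 100)) < 0.1).astype(float)
dbn = train_dbn(docs, layer_sizes=[50, 25, 10], iters=5)
```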

