# ILLINOIS CS 446 - 102617.1 (8 pages)

## 102617.1


- Pages: 8
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


**CS446 Machine Learning, Fall 2017**

**Lecture 17: MLP II**

Lecturer: Sanmi Koyejo. Scribe: Karthik Bala. Oct 26th, 2017.

In this lecture we finished discussing multilayer perceptrons, determined appropriate loss functions for different types of classification, and discussed dropout.

## Recap: Multilayer Perceptrons

Multilayer perceptrons, also known as deep feed-forward neural networks, can be drawn as a layered graph. Each node in hidden layer $k$ represents the nonlinear computation

$$z_i^{(k)} = g\left(w_i^{(k)\,T} z^{(k-1)}\right),$$

where $g$ is a nonlinear function. We can write the same computation for the whole layer as a matrix computation. Let $m$ be the number of nodes in layer $k$; then layer $k$ of our feedforward NN can be written as

$$z^{(k)} = \begin{bmatrix} z_1^{(k)} \\ z_2^{(k)} \\ \vdots \\ z_m^{(k)} \end{bmatrix} = g\left(\begin{bmatrix} w_1^{(k)\,T} z^{(k-1)} \\ w_2^{(k)\,T} z^{(k-1)} \\ \vdots \\ w_m^{(k)\,T} z^{(k-1)} \end{bmatrix}\right).$$

Examples of nonlinear functions $g$ are the ReLU function and the sigmoid function.

## Output Layers / Loss Functions

Consider binary classification, $y \in \{-1, 1\}$. We can write the output $z_l$ in terms of all the nodes that enter it:

$$z_l = w_l^T z_{l-1} + \text{bias} = f(x).$$

Given a choice of sigmoid for $g$, what is a reasonable loss function?

$$\ell(y_i, f(x_i)) = -\left[\, y_i \log f(x_i) + (1 - y_i)\log\left(1 - f(x_i)\right)\right] = \log\left(1 + e^{-y\, w_l^T z_{l-1}}\right)$$

(the first expression uses the $\{0,1\}$ label convention; the second, equivalent logistic form uses $y \in \{-1,1\}$). The function above is referred to as the log loss, binary cross-entropy, or Bernoulli log-likelihood.

Now consider the case in which $z_l$ is linear:

$$z_l = f(x) = w^T z_{l-1} + b_{l-1}.$$

An alternative loss function for binary classification is then the hinge loss:

$$\ell(y_i, f(x_i)) = \max\left(0,\, 1 - y_i f(x_i)\right).$$

## Multiclass Classification

Multiclass classification is defined by $y \in \{1, \dots, k\}$. Here we introduce the softmax function:

$$z_l[j] = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} = \mathrm{Prob}(y_i = j) \quad \text{for some } j \in \{1, \dots, k\}.$$

Thus

$$\sum_{j=1}^{k} z_l[j] = 1.$$

This is essentially a generalization of the sigmoid function, which corresponds to the case $k = 2$. Note that $w_l \in \mathbb{R}^{k \times (\text{size of } z_{l-1})}$, as we need a weight vector for each class. Then the loss function is

$$\ell(y_i, f(x_i)) = -\sum_{j=1}^{k} y_j \log f(x)_j,$$

which is often called the discrete cross-entropy. Note that here $y$ is stored in a one-hot encoding, so the loss is smallest when $f(x)_j$ is highest for the class with $y_j = 1$. As an alternative, if $f(x) = z_l$ is a linear activation, the loss function is the multiclass hinge loss.

## Regression

Regression is defined by $y \in \mathbb{R}^k$. Recall that here $f(x)$ is linear. Then the loss function is

$$\ell(y_i, f(x_i)) = \left\lVert y - f(x) \right\rVert_2^2,$$

and when $y \in \mathbb{R}$ this is simply $(y_i - f(x_i))^2$.

## Multilabel Classification

Multilabel classification is defined by $y \in \{-1, 1\}^k$. The last-layer output is a sigmoid function, and the loss function $\ell(y, f)$ is a sum over the $k$ labels:

$$\ell(y, f) = \sum_{k} \text{binary cross-entropy}_k.$$

## Optimization

Some optimization techniques …
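The per-layer matrix computation $z^{(k)} = g(W^{(k)} z^{(k-1)})$ described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the lecture; the network shape, weight initialization, and function names are all assumptions.

```python
import numpy as np

def relu(x):
    # ReLU nonlinearity: g(x) = max(0, x), applied elementwise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid nonlinearity: g(x) = 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases, g=relu, out=sigmoid):
    # One forward pass: z_k = g(W_k z_{k-1} + b_k) for each hidden layer,
    # with a separate activation `out` on the final (output) layer.
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = g(W @ z + b)                      # hidden layer: matrix computation, then g
    return out(weights[-1] @ z + biases[-1])  # output layer

# Tiny illustrative network: 3 inputs -> 4 hidden units -> 1 sigmoid output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros(4), np.zeros(1)]
p = mlp_forward(np.array([1.0, -2.0, 0.5]), weights, biases)
```

With a sigmoid output, `p` lands in $(0, 1)$ and can be read as a class probability, matching the binary-classification setup above.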
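The two binary-classification losses above can likewise be written down directly. A minimal sketch, assuming labels $y \in \{-1, +1\}$ and a raw linear score $w_l^T z_{l-1}$; the function names are illustrative, not from the lecture:

```python
import numpy as np

def log_loss(y, score):
    # Log loss (logistic form) with y in {-1, +1} and a linear score:
    # log(1 + exp(-y * score)); log1p improves accuracy for small arguments.
    return np.log1p(np.exp(-y * score))

def hinge_loss(y, score):
    # Hinge loss with y in {-1, +1}: max(0, 1 - y * f(x)).
    return np.maximum(0.0, 1.0 - y * score)

# A confident correct prediction (y = +1, large positive score) costs
# little under both losses; a wrong-signed score is penalized heavily.
```

Note the hinge loss is exactly zero once the margin $y_i f(x_i)$ exceeds 1, while the log loss only approaches zero asymptotically.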
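For the multiclass case, here is a small sketch of the softmax and the discrete cross-entropy, again with illustrative names; subtracting the max inside the softmax is a standard numerical-stability trick not mentioned in the notes.

```python
import numpy as np

def softmax(scores):
    # Softmax: exp(w_j^T z) / sum_i exp(w_i^T z). Subtracting the max
    # score leaves the result unchanged but avoids overflow in exp.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    # Discrete cross-entropy: -sum_j y_j log f(x)_j with one-hot y.
    return -np.sum(y_onehot * np.log(probs))

scores = np.array([2.0, 1.0, 0.1])   # w_j^T z_{l-1} for k = 3 classes
p = softmax(scores)                  # entries sum to 1
y = np.array([1.0, 0.0, 0.0])        # one-hot label for class 1
loss = cross_entropy(y, p)
```

With $k = 2$ and scores $(a, 0)$, the first softmax entry reduces to the sigmoid $1/(1 + e^{-a})$, which is the generalization claim made above.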
