Recognizing Human Actions in Video Sequences

Goutham
[email protected]

Abstract

Recognizing human actions is a challenging problem in computer vision. In this work, a bag-of-words approach is taken to solving this problem. Each video sequence is represented as a document, and each frame in a sequence corresponds to a "word". Although a bag-of-words model may not seem intuitive for data other than text, it has been shown to be quite successful in object recognition problems in computer vision, and may prove to be a simple yet powerful model for recognizing human actions in video sequences. This particular model is a probabilistic one, trained in a semi-supervised way using a variant of the Latent Dirichlet Allocation model.

1 Introduction

The bag-of-words approach to human action recognition is motivated by recent success in applying the same approach to object recognition problems in computer vision. In this model, each frame in a video sequence corresponds to a visual word, and each video sequence corresponds to a "bag" of these words. It is important to note that the order in which the words occur in a video sequence does not matter; hence, some structure is lost by moving to this representation. On the plus side, this is a much simpler model than those that explicitly model temporal structure. Instead of capturing temporal structure, this model captures "temporal smoothing" via co-occurrence statistics among visual words. In particular, the model is based on the Latent Dirichlet Allocation (LDA) [2] model.

One major issue with the LDA model is that it is not clear how to choose the right number of latent topics. Usually this is done in an ad hoc way, i.e., several different values are tried, but this is not a realistic approach. This issue is dealt with in two ways here. First, the bag-of-words model for video sequences used here is much simpler than previous ones: each frame corresponds to a "visual word" rather than a "collection of words" computed at spatio-temporal interest points [3]. Second, some of the latent variables in the LDA model are observed in the training phase, which solves the problem of choosing the right number of latent topics. Section 2 describes the training and testing algorithms for the LDA model in detail. Experimental results are presented in Section 3 and conclusions in Section 4.

2 Approach

Video sequences are represented as a "bag of words": in this representation, a "word" corresponds to a frame and a "document" corresponds to a video sequence. The model is trained in a semi-supervised fashion using a variant of the LDA algorithm. First, a vocabulary of visual words, called a codebook, is created. Using this codebook, a dataset of video sequences is converted to a bag-of-words representation, which is then used to build a probabilistic model. The model is then used to classify video sequences. The algorithms are described in detail below.

2.1 Building a Codebook

The first step in moving toward a bag-of-words model is to build a vocabulary, or codebook. For this, there needs to be a way to compare video frames so that frames that are "similar" can correspond to the same word. In this work, the motion descriptor of Efros et al. [4] is used. The first step is to track the human figures in the video sequences; for this, the algorithm of Sabzmeydani and Mori [5][6] is used.

Given a video sequence centered around the human figure, the optical flow at each frame is computed using the Lucas-Kanade [7][9] algorithm. The optical flow field $F$ is then split into $F_x$ and $F_y$ (flow fields corresponding to movement in the $x$ and $y$ directions, respectively). $F_x$ and $F_y$ are further split into $F^-_x$, $F^+_x$, $F^-_y$ and $F^+_y$ such that $F_x = F^+_x - F^-_x$ and $F_y = F^+_y - F^-_y$. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels $Fb^-_x$, $Fb^+_x$, $Fb^-_y$ and $Fb^+_y$.

The motion descriptors of two frames are compared as follows: if the four channels for frame $A$ are $a_1$, $a_2$, $a_3$ and $a_4$, and the four channels for frame $B$ are $b_1$, $b_2$, $b_3$ and $b_4$, then the similarity between frames $A$ and $B$ is

$$S(A, B) = \sum_{c=1}^{4} \sum_{x, y \in I} a_c(x, y)\, b_c(x, y) \qquad (1)$$

where $I$ is the set of indices of the motion descriptors.

Now, to build the codebook, a subset of all the frames is randomly selected and an affinity matrix $A$ is computed for these frames, where entry $(i, j)$ is the affinity between frame $i$ and frame $j$. k-medoid clustering is then run on the affinity matrix $A$, yielding $V$ clusters; the words are defined as the centers of these clusters. Finally, all video sequences are converted to "documents" by replacing each frame with its corresponding word.
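To make the channel construction and Eq. (1) concrete, here is a minimal Python/NumPy sketch. It assumes the optical flow fields $F_x$ and $F_y$ have already been computed (e.g., by an off-the-shelf Lucas-Kanade implementation); the function names and the blur width `sigma` are illustrative choices, not taken from the paper, and the normalization scheme is one reasonable reading of "blurred and normalized":

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_descriptor(fx, fy, sigma=1.5):
    """Half-wave rectify the flow into four non-negative channels, then
    blur and normalize them (after Efros et al. [4]).
    fx, fy : 2-D arrays of horizontal/vertical optical flow.
    sigma  : illustrative blur width; the paper does not specify it."""
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),   # F+x, F-x
                np.maximum(fy, 0), np.maximum(-fy, 0)]   # F+y, F-y
    blurred = [gaussian_filter(c, sigma) for c in channels]
    # One plausible normalization: joint L2 norm across all four channels.
    norm = np.sqrt(sum(np.sum(b ** 2) for b in blurred)) + 1e-8
    return [b / norm for b in blurred]                   # Fb+x, Fb-x, Fb+y, Fb-y

def similarity(desc_a, desc_b):
    """Eq. (1): sum the pointwise products over the four channels."""
    return sum(np.sum(a * b) for a, b in zip(desc_a, desc_b))
```

With descriptors in hand, the pairwise similarities are exactly what populates the affinity matrix used in the clustering step.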
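The clustering step can be sketched as well. Common libraries ship k-means rather than k-medoids on a precomputed affinity matrix, so the snippet below is a simple alternating re-implementation under the assumed convention that higher affinity means more similar; it is illustrative, not the paper's code:

```python
import numpy as np

def kmedoid_codebook(A, V, n_iter=100, seed=0):
    """Alternating k-medoid clustering on an affinity matrix A, where
    A[i, j] = S(frame i, frame j) from Eq. (1). Returns the medoid
    (codeword) indices and a cluster label per frame.
    Illustrative re-implementation, not the paper's code."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    medoids = rng.choice(n, size=V, replace=False)
    for _ in range(n_iter):
        # Assignment step: each frame joins its most similar medoid.
        labels = np.argmax(A[:, medoids], axis=1)
        # Update step: each medoid becomes the member with the highest
        # total affinity to the rest of its cluster.
        new_medoids = medoids.copy()
        for k in range(V):
            members = np.flatnonzero(labels == k)
            if members.size:
                within = A[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmax(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmax(A[:, medoids], axis=1)
    return medoids, labels
```

Converting a video sequence to a "document" is then just mapping each of its frames to the codeword (medoid) it is most similar to.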
2.2 Latent Dirichlet Allocation

The model used in this work is based on the Latent Dirichlet Allocation (LDA) [2] model. The LDA model in the context of video sequences is described below.

A dataset of videos is described as a collection $D$ of video sequences $(\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3, \ldots, \mathbf{w}_M)$. Each video sequence $\mathbf{w}$ is represented as a collection of frames $\mathbf{w} = (w_1, w_2, w_3, \ldots, w_N)$. A word $w_i$ is an item from the codebook, indexed by $(1, 2, 3, \ldots, V)$. Assuming there are $K$ underlying latent topics (action label classes), each topic is represented as a multinomial distribution over the $V$ motion words. A video sequence (represented by $\theta$) is generated by sampling a mixture of the topics. LDA is a generative model; the generative process is given in Figure 1:

1. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
2. For each of the $N$ motion words $w_n$:
   (a) Choose an action label (topic) $z_n \sim \mathrm{Mult}(\theta)$.
   (b) Choose a motion word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on $z_n$.

Figure 1: The generative LDA process.

Figure 2: Representation of the LDA model.

The parameter $\theta$ indicates the mixing proportions of the different action labels in a video sequence. $\alpha$ is the Dirichlet prior (a $K$-element vector such that $\alpha_i > 0$ for all $i$) that controls how $\theta$ varies among different video sequences. $\beta$ is the parameter of a set of multinomial distributions giving the distribution of motion words within an action label. Learning the LDA model from a collection of video sequences $D$ involves finding the $\alpha$ and $\beta$ that maximize the log likelihood of the data,

$$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log P(\mathbf{w}_d \mid \alpha, \beta).$$

2.3 Semi-Latent Dirichlet Allocation

In LDA, the topic $z_i$ for word $w_i$ is not known, and neither is the mixing proportion $\theta$ for a document. In the action classification framework, all the frames in a training data set have class labels associated with them; each frame is
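The generative process in Figure 1 can be made concrete with a small simulation. The following sketch samples one synthetic "video" under assumed, illustrative values of $K$, $V$, $N$, $\alpha$ and $\beta$ (none of these numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 6, 50, 100        # topics (action classes), codebook size, frames -- all assumed
alpha = np.full(K, 0.5)     # Dirichlet prior over topic mixtures (assumed value)
beta = rng.dirichlet(np.ones(V), size=K)  # K multinomials over the V motion words

# Step 1: choose theta ~ Dir(alpha), the video's mixture over action labels.
theta = rng.dirichlet(alpha)

# Step 2: generate the N motion words.
words = np.empty(N, dtype=int)
for n in range(N):
    z_n = rng.choice(K, p=theta)           # 2(a): topic z_n ~ Mult(theta)
    words[n] = rng.choice(V, p=beta[z_n])  # 2(b): word w_n ~ p(w | z_n, beta)
```

Fitting $\alpha$ and $\beta$ by maximizing $\ell(\alpha, \beta)$ is the harder direction and requires approximate inference (e.g., the variational EM procedure of Blei et al. [2]), which this sketch does not attempt.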

