Learning and Recognizing Activities in Streams of Video

Dinesh Govindaraju and Manuela Veloso
Computer Science Department, Carnegie Mellon University
[email protected], [email protected]

Abstract

This paper presents an algorithm for learning the underlying models that generate the streams of observations found in video data that encode activities performed by a person appearing in the video. With these learned models, we then aim to carry out recognition in new video streams that display the same activities as the ones that were learned. Our algorithm represents the underlying models as regular Hidden Markov Models, since the problem involves sequential and temporally discrete observations, and uses the Baum-Welch algorithm to learn the underlying models.

Introduction

Effective behavior recognition using a Camera Assisted Meeting Event Observer [1] in the context of an office meeting is a challenging problem with many potential benefits. If, given a video sequence of an office meeting, we can accurately segment the video and recognize consistent activities of meeting attendees, we can then support higher-level functionality such as recognizing behaviors and, with these, inferring which attendees are presenting agenda items and even gauging attendee interest levels during the meeting.

Previous work in activity recognition includes using optical flow [8, 12] as well as variants of HMMs [7] to carry out the recognition. Instances where regular HMMs have been used for activity recognition include cases where the structure of the models is already known [6] or has been generated using implicit knowledge the researcher has about the problem [4, 9, 10, 11], such as the number of states.

In this paper we are interested in extending the earlier work of [13] by learning the underlying activity models automatically from video data, with which we can subsequently carry out activity recognition. We want the learning process to be automatic so that recognition remains feasible on video streams containing a large number of activities; generating the underlying models by hand when there are many activities would be time consuming to the point of being impractical. Automating the process also allows us to learn from very general datasets, which makes the process versatile: the learning is applicable to a wide variety of data and is not constrained by particular aspects of a specific problem.

This paper is structured as follows. In the next section we define the problem in more detail and identify some specific problems that arise when generating a solution. We then outline the approach taken by our algorithm and present some results by evaluating it against a hand-labeled control case. We finally discuss the results of the evaluation and some possible future improvements to the algorithm.

Problem Definition

In this paper we focus on the problem of activity recognition given a video stream of a person engaged in a particular set of activities. In particular, we first extract a stream of movement deltas of the person along the x and y coordinates, given an image sequence with the position of the person's face labeled in each image.
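To make this preprocessing step concrete, the following is a minimal sketch, not the authors' implementation; the input format, the quantization step, and the function name are assumptions:

```python
# Sketch: turn labeled per-frame face positions into a stream of
# quantized (dx, dy) movement deltas. The input representation and the
# quantization step are assumptions, not taken from the paper.

def movement_deltas(face_positions, step=5):
    """face_positions: list of (x, y) pixel coordinates of the face,
    one per frame. Returns a list of quantized (dx, dy) deltas."""
    deltas = []
    for (x0, y0), (x1, y1) in zip(face_positions, face_positions[1:]):
        # Quantize raw pixel deltas into coarse bins so the observation
        # space stays small and discrete, as the HMM formulation requires.
        dx = round((x1 - x0) / step)
        dy = round((y1 - y0) / step)
        deltas.append((dx, dy))
    return deltas
```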
Given this sequence of movement deltas and a few training segments of the various activities the person is engaged in (also expressed in terms of x and y deltas), we wish to accurately recognize the correct activity the person is engaged in for each image in the unlabeled sequence. We want to do this by first learning the parameters of some underlying model for each of the activities using the training segments, and subsequently using these learned models to carry out activity recognition in the image sequence.

One issue that arises in solving this problem is that we do not know the segment length of each activity in the unlabeled image sequence, as we make no assumptions on how to segment the sequence. Another problem is that we have no implicit knowledge about the activities we wish to recognize, and so cannot make any assumptions about the underlying model, such as the number of states generating the observations or the space of observations that can be emitted.

Approach

Our approach involves two stages. In the first, we use the training data to learn the underlying models; in the second, we use the learned models to recognize activities in the unlabeled image sequence.

Learning Underlying Models

We start our approach by generating the underlying models for each activity that we will later use for recognition. At this point we make the assumption that the underlying models can be closely approximated by regular Hidden Markov Models [3] consisting of (1), as the problem involves sequential and temporally discrete observations.

$N = \{s_i\}$ - the states in the model
$M = \{o_i\}$ - the observation space
$A = \{a_{ij}\}$ - the state transition matrix, where $a_{ij} = P(S_{t+1} = s_j \mid S_t = s_i)$, $1 \le i, j \le N$
$B = \{b_i(o_k)\}$ - the observation probabilities, where $b_i(o) = P(o \mid S_t = s_i)$, $1 \le i \le N$
$\pi = \{\pi_i\}$ - the initial state distribution, where $\pi_i = P(S_1 = s_i)$, $1 \le i \le N$     (1)

We first obtain a set of training segments for each of the activities that will be found in the final unlabeled image sequence. The training segments for each activity are hand-labeled streams of observations and can be of varying lengths. We then run the Baum-Welch algorithm to learn the optimum model for each activity using the training segments. To do this, we first need to determine the space of observations that the learned model can emit; we obtain these observations by running through all the training segments for each activity and including every observation we find. In the case that we find an observation in the unlabeled sequence that we have not encountered before, we augment the learned observation probability matrices with the new observation, assigning it a small non-zero probability, and re-normalize them.

We also need to find the number of states, Q, that our model will contain before we can commence Baum-Welch. We choose this optimum value of Q by carrying out N-fold cross validation. To do this, we first set the upper bound for Q, Qu, to the length of the longest training segment (a value of Q corresponding to the case where each observation in that segment is generated by a distinct state).
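As a sketch of the learning stage just described, one could use hmmlearn's CategoricalHMM, whose `fit` runs Baum-Welch over discrete observations (it was named MultinomialHMM in older hmmlearn releases); the segment encoding, function names, and fold count below are assumptions, not the authors' code:

```python
# Sketch: learn one discrete HMM per activity with Baum-Welch, choosing
# the number of states Q by cross validation as described above.
# Assumes hmmlearn; not the authors' implementation.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

def fit_activity_model(segments, vocab, max_q, k=5, seed=0):
    """segments: hand-labeled observation streams for one activity, each
    a sequence of symbols drawn from the shared vocabulary `vocab`
    (e.g. quantized (dx, dy) deltas). Returns the best fitted model."""
    index = {obs: i for i, obs in enumerate(vocab)}
    coded = [np.array([[index[obs]] for obs in seg]) for seg in segments]

    def fit(train, q):
        X = np.concatenate(train)          # stacked observation columns
        lengths = [len(s) for s in train]  # segment boundaries
        model = CategoricalHMM(n_components=q, n_iter=100,
                               random_state=seed)
        return model.fit(X, lengths)

    best_q, best_ll = 1, -np.inf
    for q in range(1, max_q + 1):          # max_q plays the role of Qu
        folds = np.array_split(np.arange(len(coded)), k)
        scores = []
        for held in folds:
            train = [coded[i] for i in range(len(coded)) if i not in held]
            if len(held) == 0 or not train:
                continue
            model = fit(train, q)
            # Score held-out segments under the learned model; this
            # assumes the training folds cover every symbol the
            # held-out folds use.
            scores.extend(model.score(coded[i]) for i in held)
        if scores and np.mean(scores) > best_ll:
            best_q, best_ll = q, np.mean(scores)
    return fit(coded, best_q)              # refit on all segments
```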
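For the unseen-observation case described above, a sketch of widening a learned emission matrix and re-normalizing might look as follows; the function name and smoothing value are assumptions:

```python
import numpy as np

def add_unseen_observation(model, eps=1e-4):
    """Augment a fitted CategoricalHMM with one extra symbol, giving it
    a small non-zero emission probability in every state, then
    re-normalize so each state's emission row still sums to one."""
    B = np.hstack([model.emissionprob_,
                   np.full((model.n_components, 1), eps)])
    model.emissionprob_ = B / B.sum(axis=1, keepdims=True)
    model.n_features += 1  # the model now accepts one more symbol index
    return model
```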

