Inferring Human Interactions From Sparse Visual Data


Paul E. Rybski, Manuela M. Veloso
Robotics Institute, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
{prybski,mmv}@cs.cmu.edu

Abstract

We have recently engaged in the challenging development of an agent to assist users in everyday office-related tasks. In particular, the agent needs to keep track of the state of its users so it can anticipate their needs and proactively address them. The state of the user may be easily available when the user directly interacts with the agent through a PC or PDA interface. However, when the user attends a meeting and interacts with other people, PC and PDA interfaces are not sufficient to give the agent a general view of the environment in which its user is interacting. In this paper, we introduce CAMEO, the Camera Assisted Meeting Event Observer, which is a physical awareness system designed for use by an agent-based electronic assistant. We then present a particular aspect of CAMEO and the main contribution of the paper, namely how CAMEO addresses the problem of extracting and reasoning about high-level features from real-time and continuous observation of a meeting environment. Contextual information about meetings and the interactions that take place within them is used to define Dynamic Bayesian Network classifiers to effectively infer the state of the users as well as a higher-level state of the meeting. We present and show results of the state inference algorithm.

1. Introduction

Electronic agents designed to interact with humans and help them carry out their day-to-day business in an office domain require a good estimate of their user's state. Such a state estimate might consist of the projects the user is responsible for completing, the set of resources that the user has at his/her disposal, and the user's daily schedule, to name a few. By obtaining a good state estimate, an electronic agent will be able to reason about its user's needs and address them as best as it can. Ideally, it would be able to anticipate future needs and prepare for them.

Obtaining an accurate estimate of the user's state is a difficult challenge. Electronic agents that interact directly with humans (as opposed to those that might only handle email or scheduling information) can obtain information from a range of different sources, including traditional workstation/PDA input devices, spoken audio processing, and video processing systems. Workstation/PDA interfaces require that the user be using the device in question so that the data or queries/requests can be entered directly. Audio and video processing systems are more flexible in that the user can interact with an agent in a manner that is closer to interacting with a co-worker than with a data-entry device.

However, regardless of the interface, many important human interactions take place outside of the office and typically not in a fashion in which the agent can observe or participate. Of particular interest are interactions that take place in formal meeting environments. Having a sensor suite present in a meeting environment would give an agent useful information about what tasks it could assist its user with. Afterwards, this information could be automatically organized such that the agent could easily answer questions posed by its user such as "What was the third bullet on slide 15?" or "What was the action item decided on while I was out of the room?" An agent that could recognize these events would provide its user with the ability to recall events throughout the working day whose importance might have been initially missed.

In order to address this challenge, our group is developing a physical awareness system for an agent-based electronic assistant called CAMEO (Camera Assisted Meeting Event Observer) [12]. CAMEO is an omni-directional camera system consisting of four or five firewire cameras (CAMEO supports both configurations) mounted in a circle, as shown in Figure 1. The individual data streams coming from each of the cameras are merged into a single panoramic image of the world. The cameras are connected to a Small Form-Factor 3.0GHz Pentium 4 PC that captures the video data and does the image processing.

Figure 1. The CAMEO† system consists of a set of firewire cameras arranged in a panoramic fashion and a small-form-factor PC.

† Special thanks to Fernando de la Torre, Raju Patil, Carlos Vallespi, Brett Browning and Betsy Ricker for their help with the development of CAMEO.

The panoramic video stream is scanned for human activity by identifying the positions of human faces found in the image. This low-level visual information is fed into a Dynamic Bayesian Network (DBN) classifier system. The classifier determines the state of the individual people in the meeting. These individual person state estimates are then used to infer high-level state estimates of the meeting itself. Our approach makes use of a very specific set of contextual information regarding the meeting domain to generate the Bayesian classification system, rather than attempting to solve the general image understanding problem.

2. Related Work

Research in human/agent activity recognition is spread across a variety of different areas. On one side is gesture recognition, which attempts to use sensor input and signal processing techniques to recognize arm or hand gestures such as sign language [14]. On the other side is plan recognition [1], which ultimately attempts to classify a high-level set of goals, intentions, or belief states about agents (human or otherwise). Our work falls somewhere between those two areas of research in that we are interested in inferring high-level behavioral interactions (as restricted to a meeting domain) from fairly sparse sensor information.

Dynamic Bayesian networks are used by [5] to recognize gestures such as writing in different languages on a whiteboard, as well as activities such as using a glucose monitor. The gesture recognition system described in this work is probably the most similar to ours. However, instead of attempting to classify the specific kinds of actions that a human is doing, which tend to be very viewpoint dependent, we infer body stance and motion by tracking the user's face. This is a more general method of tracking and works well with the notion that CAMEO is designed to be set up and operate in relatively unstructured environments.

In [7], finite state machine models of gestures are constructed that by learning the

