Image Mosaicing with Motion Segmentation from Video
Augusto Román and Taly Gilat
EE392J – Digital Video Processing, Winter 2002

Introduction:

Many digital cameras these days include the capability to record video sequences. A video taken while standing in one place and looking around can be used to create a single panoramic image. Using motion estimation to align each frame, the overall image can be accumulated across the range of view covered by the video. Assuming a reasonable frame rate and a reasonable angular speed (little or no blurring), the motion from one frame to the next should be predictably small.

Some current methods of taking panoramic pictures are:

(1) Using a very-wide-angle lens (up to 360 degrees!)
(2) Taking a number of pictures and manually aligning and stitching them together

Method (1) can be rather expensive and is not accessible to the average consumer. Method (2) can be difficult and time-consuming, and may have problems with seams between images due to differences in lighting, angle, and motion. Our method should allow a number of potential advantages:

• Cheap - Any device capable of generating a movie (DV camcorders and many digital cameras) could be used.
• Seamless - Combining a number of frames for each output pixel should reduce seam artifacts.
• Infinite FOV - Can also be used simply to capture large images by scanning across a desired scene.

Depending on time constraints, the following possible extensions offer further advantages:

• Occlusion - Using motion segmentation, moving objects can be eliminated.
• Motion playback - Using motion segmentation, the resulting panorama could include moving objects.
• Superresolution - Multiple overlapping images from a low-resolution video could be combined to create a higher-resolution panorama.
• Substituting layers - By identifying similar layers across frames, layers can be extracted and replaced in a video, allowing, for example, the substitution of the background.

The approach we are implementing for transforming the video sequence into a panoramic image is based on the work of Wang and Adelson. The algorithm proposed by Wang and Adelson (referred to as the WA algorithm) segments objects in a video sequence based on motion, so that the video is represented by overlapping layers. Each layer contains objects undergoing similar motion, together with a motion model describing those objects through the video sequence. The algorithm accumulates the layers by combining the contributions from all video frames; these accumulated objects are exactly what is needed for panoramic image mosaicing. The algorithm also presents a method for determining the depth relationships of the segmented objects, so that occlusions can be dealt with correctly. Although Wang and Adelson present several uses for the algorithm, their primary motivation for the layered representation is improved video coding. In this report we will focus on the application of the WA algorithm to image mosaicing. Since [1] and [2] provide a thorough description of the algorithm, we will give a brief overview and elaborate on our proposed enhancements, modifications, and results.

The WA Algorithm

The goal of the algorithm is to segment the images into regions with similar motion, and then to describe this motion. The motion models are necessary for accurate segmentation, while the segmented regions are required for determining the motion models. The algorithm uses an iterative approach to solve this chicken-and-egg problem. A basic flow chart of the implementation is shown in Figure 1.

Figure 1. Flow chart of algorithm implementation. Adapted from [1].
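
To make the iterative loop concrete, the following is a minimal sketch, in Python rather than our actual implementation, of one way to alternate between fitting affine motion models to the current layer assignment and reassigning each pixel to the best-fitting model. The function name segment_layers, the random initialization, and the parameter choices are illustrative assumptions; the affine model and least-squares fitting follow the description in [1].

    import numpy as np

    def segment_layers(flow, n_layers=4, n_iters=10):
        # Hypothetical sketch of the WA-style iteration: alternate between
        # fitting an affine motion model to each layer and reassigning every
        # pixel to the model that best predicts its dense-flow vector.
        h, w, _ = flow.shape
        ys, xs = np.mgrid[0:h, 0:w]
        # Affine model: u = a0 + a1*x + a2*y (and similarly for v)
        A = np.stack([np.ones(h * w), xs.ravel(), ys.ravel()], axis=1)
        u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
        labels = np.random.randint(0, n_layers, h * w)  # arbitrary seeding
        for _ in range(n_iters):
            cost = np.full((n_layers, h * w), np.inf)
            for k in range(n_layers):
                mask = labels == k
                if mask.sum() < 3:      # degenerate layer: leave cost infinite
                    continue
                pu, *_ = np.linalg.lstsq(A[mask], u[mask], rcond=None)
                pv, *_ = np.linalg.lstsq(A[mask], v[mask], rcond=None)
                cost[k] = (A @ pu - u) ** 2 + (A @ pv - v) ** 2
            labels = cost.argmin(axis=0)  # reassign pixels to best model
        return labels.reshape(h, w)

The actual WA algorithm additionally clusters the models in affine-parameter space and leaves pixels with a large residual unassigned; the sketch omits those refinements.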
Dense Motion Estimation

Because every subsequent stage depends on and references the original dense motion field, it is essentially the "gold standard" and as such must be as accurate as possible for the rest of the algorithm to work. The dense motion estimation was initially implemented using a full-search block-based routine. Estimating the motion on a per-pixel level makes this method unacceptably slow – on the order of two hours per image pair. In addition, while this approach minimizes the MSE difference between the two frames, it does not necessarily correspond well to the true motion between the frames.

Instead, we used phase correlation to determine the motion between pairs of frames. This method works by taking a relatively large block of the image and computing the cross-correlation using the Fourier transform. This identifies several possible motions contained within the block. Each pixel in the center of the block is then tested against each of those possible motions, and the motion with the minimum MSE is selected for that pixel. Because the number of possible motions is small, generally not more than 8 or so, this method offers an enormous efficiency boost over the full-search approach.

Only the pixels in a central region of the block are assigned motion vectors, to avoid noise due to aliasing. Thus, subsequent blocks must overlap the previous block such that the central regions fill the image. That is, if a 16x16 region in the center of a 128x128 block is assigned motion vectors from the larger block's phase correlation, then the next image block must be shifted over by only 16 pixels so that the center regions line up.

Pseudocode for the algorithm is as follows:

    f1 = normalize( frame(i) )                        % select input frames
    f2 = normalize( frame(i+1) )
    for each block B                                  % iterate over each block, choosing
        b1 = f1(B) * window(B)                        %   corresponding blocks from each image
        b2 = f2(B) * window(B)
        FB1 = fft2( b1 )                              % compute spectrum of each block
        FB2 = fft2( b2 )
        CPS = ( FB1 * conj(FB2) ) / abs( FB1 * conj(FB2) )   % cross-power spectrum
        PCF = ifft2( CPS )                            % phase correlation function
        possible_motions = peaks( PCF )               % select possible motions
        for each pixel in the central region of B     % test each pixel against the
            for each possible_motion p                %   candidate motions
                err(p) = block_match_error( f1 shifted by p, f2 )
            assign the p with minimum err(p) to that pixel

For a more detailed description of the theory behind the basic phase correlation algorithm, see Section 6.4.5 (pp. 162–163) of [3]. For our algorithm, we used a block size of 64 and a block shift of 16. Using the phase correlation algorithm, the processing time was reduced from over two hours per frame pair (with the full search) to under 30 seconds. In attempting to improve the accuracy of the resulting motion fields, we tried
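
For reference, here is a runnable NumPy rendering of the per-block phase correlation step from the pseudocode above. The function name candidate_motions, the Hanning window, and the peak-picking details are illustrative assumptions rather than a transcription of our code.

    import numpy as np

    def candidate_motions(b1, b2, n_peaks=8):
        # Phase correlation between two corresponding blocks; returns up to
        # n_peaks candidate (dy, dx) shifts. Windowing and peak selection
        # are illustrative choices, not necessarily those of the report.
        win = np.outer(np.hanning(b1.shape[0]), np.hanning(b1.shape[1]))
        F1 = np.fft.fft2(b1 * win)
        F2 = np.fft.fft2(b2 * win)
        cps = F1 * np.conj(F2)
        cps /= np.abs(cps) + 1e-12           # keep phase only (cross-power spectrum)
        pcf = np.real(np.fft.ifft2(cps))     # phase correlation function
        flat = np.argsort(pcf.ravel())[::-1][:n_peaks]
        peaks = np.column_stack(np.unravel_index(flat, pcf.shape))
        # FFT indices above half the block size wrap to negative shifts
        size = np.array(b1.shape)
        peaks = np.where(peaks > size // 2, peaks - size, peaks)
        return [tuple(p) for p in peaks]

Each candidate shift would then be scored per pixel by the block-match error, as in the inner loop of the pseudocode.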
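
The block-overlap arithmetic described above reduces to a simple tiling rule. The helper below is a hypothetical illustration: with a block size of 64 and a shift of 16 (the values used in our algorithm), the 16x16 central regions of successive blocks tile the interior of the frame.

    def block_origins(height, width, block=64, shift=16):
        # Top-left corners of the overlapping analysis blocks; the central
        # shift-by-shift region of each block starts `margin` pixels in.
        margin = (block - shift) // 2
        origins = [(y, x)
                   for y in range(0, height - block + 1, shift)
                   for x in range(0, width - block + 1, shift)]
        return origins, margin

    origins, m = block_origins(240, 320)
    print(len(origins), "blocks; first 16x16 center starts at",
          (origins[0][0] + m, origins[0][1] + m))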

