Block-Matching Motion Compensation
Introduction
Predictive coding is widely used in video transmission, especially for
low bit-rate coding. Typically only some fraction of an image changes from
frame to frame, allowing straightforward prediction from previous frames.
Motion compensation is used as part of the predictive process. If an image
sequence shows moving objects, then their motion within the scene can be
measured, and the information used to predict the content of frames later
in the sequence.
Frame-based block-matching motion compensation
Fixed Size Block-Matching (FSBM)
The technique was originally described by Jain and Jain [1]; it is easy
to implement, and thus widely adopted. Each image frame is divided into
a fixed number of blocks, usually square. For each block in the frame, a
search is made in the reference frame over an area of the image that allows
for the maximum translation that the coder can use. The search is for the
best-matching block, i.e. the one giving the least prediction error, usually
minimising either the mean square difference or the mean absolute difference,
the latter being easier to compute.
Typical block sizes are of the order of 16×16 pixels, and the maximum
displacement might be ±64 pixels from a block's original position. Several
search strategies are possible, usually using some kind of sampling mechanism,
but the most straightforward approach is exhaustive search. This is computationally
demanding in terms of data throughput, but algorithmically simple, and
relatively easily implemented in hardware.
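The exhaustive search described above can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the block size, search range, and the mean-absolute-difference (MAD) criterion are illustrative choices.

```python
import numpy as np

def fsbm_search(cur, ref, bx, by, block=16, max_disp=16):
    """Exhaustive fixed-size block-matching: find the displacement (dx, dy)
    that minimises the mean absolute difference for the block whose top-left
    corner is (bx, by) in the current frame."""
    h, w = ref.shape
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best, best_mad = (0, 0), np.inf
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            mad = np.abs(target - cand).mean()
            if mad < best_mad:
                best_mad, best = mad, (dx, dy)
    return best, best_mad
```

A block that has simply translated between frames is recovered exactly; for real imagery the search returns the displacement with the smallest residual error.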
A good match during the search means that a good prediction can be made,
but the improvement in prediction must outweigh the cost of transmitting
the motion vector. A good match requires that the whole block has undergone
the same translation, and the block should not overlap objects in the image
that have different degrees of motion, including the background.
The choice of block-size to use for motion compensation is always a
compromise: smaller and more numerous blocks can represent complex
motion better than fewer large ones, reducing the work and transmission costs
of subsequent correction stages, but at greater cost for the motion information
itself. The problem has been investigated by Ribas-Corbera and Neuhoff
[2] and they conclude that the choice of block-size can be affected not
only by motion vector accuracy but also by other scene characteristics
such as texture and inter-frame noise.
Figure 1: FSBM example.
It is well known that the motion vectors
resulting from FSBM are well correlated. Vector information can be coded
differentially using variable length codes. This is done in a number of
codecs (e.g. ITU-T H.263 [3]) and it is proposed for the MPEG-4 video
standard [4].
An example of the block structure generated is shown in figure 1 (a
frame from the MPEG-4 test sequence known as "Foreman"). It can be noticed
that the stationary background is represented by large numbers of blocks
with very similar motion vectors (represented by the short lines starting
from each block centre).
These motion vectors are subsequently variable length coded using
a differential 2D prediction mechanism.
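The differential stage can be sketched as below, using a simple left-neighbour predictor in place of the full 2D prediction mechanism, and omitting the variable-length entropy coding; names are illustrative. Because neighbouring vectors are well correlated, most of the resulting differences are small or zero, which is what makes variable-length coding effective.

```python
def diff_encode(vectors):
    """Differentially encode a raster-scan list of (dx, dy) motion vectors
    against the left neighbour (the first block is predicted as (0, 0))."""
    out, prev = [], (0, 0)
    for dx, dy in vectors:
        out.append((dx - prev[0], dy - prev[1]))
        prev = (dx, dy)
    return out

def diff_decode(deltas):
    """Invert diff_encode: accumulate the differences back into vectors."""
    out, prev = [], (0, 0)
    for ddx, ddy in deltas:
        prev = (prev[0] + ddx, prev[1] + ddy)
        out.append(prev)
    return out
```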
Variable Size Block-Matching (VSBM)
Proposals have been made for improvements to FSBM by varying the size
of blocks to more accurately match moving areas. Such methods are known
as variable size block matching (VSBM) methods. Chan, Yu and Constantinides
[5] have proposed a scheme that starts with relatively large blocks, which
are then repeatedly divided; this is a so-called top-down approach. If
the best matching error for a block is above some threshold, the block
is divided into four smaller blocks, until either the maximum number of blocks
is reached or locally minimum errors are obtained. The application of such top-down
methods may generate block structures for an image that match real moving
objects, but it seems that an approach which more directly seeks out areas
of uniform motion might be more effective.
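The top-down splitting step might be sketched as follows. Here `match_error` stands in for a full block-matching search and simply measures the zero-motion MAD; the threshold and minimum block size are illustrative assumptions, not values from [5].

```python
import numpy as np

def match_error(cur, ref, x, y, size):
    # Zero-motion MAD as a stand-in for a full block-matching search.
    a = cur[y:y + size, x:x + size].astype(np.int32)
    b = ref[y:y + size, x:x + size].astype(np.int32)
    return np.abs(a - b).mean()

def split_top_down(cur, ref, x, y, size, threshold, min_size=4):
    """Top-down splitting: if the best matching error for a block exceeds
    `threshold`, divide it into four quadrants and recurse. Returns a list
    of (x, y, size) leaf blocks."""
    err = match_error(cur, ref, x, y, size)
    if err <= threshold or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for qy in (y, y + half):
        for qx in (x, x + half):
            blocks += split_top_down(cur, ref, qx, qy, half, threshold, min_size)
    return blocks
```

Regions that already match well stay as large blocks; only the quadrants containing poorly predicted detail are divided further.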
We developed a VSBM technique [6] that detects areas of common motion,
grouping them into variable sized blocks with a coding strategy based on
the use of quad-trees. Use of a quad-tree obviates the need to describe
the size and position of each block explicitly, only the tree description
is needed. The vectors for each block in the tree are identical in nature
to those of FSBM. As the process is a grouping together of smaller blocks
to form larger ones, it is generally regarded as a bottom-up technique.
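The grouping can be sketched as a bottom-up quad-tree build over a grid of per-block motion vectors. For simplicity this sketch merges four siblings only when their vectors are identical; the method in [6] is based on the prediction error rather than exact equality.

```python
def build_quadtree(grid, x, y, size):
    """Bottom-up quad-tree over a 2-D grid of per-block (dx, dy) vectors,
    with `size` a power of two (in block units). Returns either a single
    vector (a merged leaf covering the whole square) or a list of four
    subtrees in the order NW, NE, SW, SE."""
    if size == 1:
        return grid[y][x]
    half = size // 2
    kids = [build_quadtree(grid, x, y, half),
            build_quadtree(grid, x + half, y, half),
            build_quadtree(grid, x, y + half, half),
            build_quadtree(grid, x + half, y + half, half)]
    if all(not isinstance(k, list) and k == kids[0] for k in kids):
        return kids[0]  # all four children carry the same vector: merge them
    return kids
```

Only the tree shape and one vector per leaf need be transmitted, which is why the quad-tree avoids describing block sizes and positions explicitly.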
Figure 2: VSBM example.
For the same number of blocks per frame
as FSBM, our VSBM method results in a smaller mean square error (MSE),
or better prediction. More significantly, for a similar level of MSE as
FSBM, the VSBM technique can represent the inherent motion using fewer
(variable-sized) blocks, and thus a reduced number of motion vectors.
An example of the block structure generated is shown in figure 2. It
can be seen that the stationary background is represented by comparatively
few (large) blocks, whereas the moving parts of the image are represented
by smaller blocks, and hence a larger number of motion vectors.
Motion vectors are subsequently variable length coded using a quad-tree
based 2D predictor mechanism [7].
Object-based block-matching motion compensation
Evolving object-based video coding standards, such as MPEG-4, permit
arbitrary-shaped objects to be encoded and decoded as separate video object
planes (VOPs). There are several motivating scenarios behind the use of
objects:
-
where transmission bandwidth or decoder performance is limited, the user
may be able to select some subset of the video objects that are of particular
interest,
-
the user may wish to manipulate objects at the receiver, i.e. change position,
size and depth ordering, depending on interest,
-
it may be possible to replace the content of an object with material generated
later or local to the receiver/display, which can be used for enhanced
visualisation and "augmented reality".
The exact method of producing VOPs from
the source imagery is not defined, but it is assumed that "natural" objects
are represented by shape information in addition to the usual luminance
and chrominance components. Shape data is provided as a binary segmentation
mask, or a grey scale alpha plane to represent multiple overlaid objects.
Figure 3 shows a cropped frame from the MPEG-4 test sequence known as
"Weather". This is made up of the synthetic weather-map, and the presenter
part of the sequence for which there is an object mask or shape description.
Once the shape information has been used to derive the partial image
corresponding to the object, the resulting pixel data is coded much as
if it were a full image.
Figure 3: VOP example.
Fixed Size Block-Matching (FSBM)
Figure 4: FSBM padding example.
An example object is shown in figure 4; the object appears in both
current and reference frames, with the bounding rectangle made up of
coding blocks.
The origin of the rectangle is adjusted to minimise the number of separate
blocks required, although in the example this is not a particular issue.
Two kinds of padding are shown:
(i) normal padding is used to fill partially populated blocks, based
on a form of bi-directional interpolation,
(ii) extended padding is used to fill an extra layer of blocks by pixel
replication in the appropriate direction.
The padding is designed to feed the best possible input into the next
stage of coding when, owing to some change in the object, parts of it
do not exist in the reference frame.
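A simplified sketch of normal padding follows, assuming a two-pass fill (horizontal, then vertical) in which each undefined pixel takes the average of the nearest defined pixels on either side, or a copy when only one side exists. The exact MPEG-4 procedure differs in detail, and extended padding by replication is omitted here.

```python
import numpy as np

def pad_line(vals, ok):
    """1-D padding: each undefined sample (ok[i] False) takes the average of
    the nearest defined samples on either side, or a copy when only one side
    exists. A line with no defined samples is returned unchanged."""
    out = vals.astype(np.float64).copy()
    idx = np.flatnonzero(ok)
    if idx.size == 0:
        return out, ok.copy()
    for i in range(len(vals)):
        if ok[i]:
            continue
        left, right = idx[idx < i], idx[idx > i]
        if left.size and right.size:
            out[i] = (vals[left[-1]] + vals[right[0]]) / 2.0
        elif left.size:
            out[i] = vals[left[-1]]
        else:
            out[i] = vals[right[0]]
    return out, np.ones(len(vals), dtype=bool)

def normal_pad(vop, mask):
    """Horizontal pass then vertical pass over the bounding rectangle;
    rows with no object pixels are filled by the vertical pass."""
    out = vop.astype(np.float64).copy()
    ok = mask.astype(bool).copy()
    for r in range(out.shape[0]):
        out[r], ok[r] = pad_line(out[r], ok[r])
    for c in range(out.shape[1]):
        out[:, c], ok[:, c] = pad_line(out[:, c], ok[:, c])
    return out
```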
Figure 5 shows a padded example of the weather presenter object. The
resulting VOP is used as the reference frame for the block matching operation,
but only pixels inside the object proper are matched against this data.
Motion estimation is performed by matching the 16×16 blocks within a prescribed
search area of the previous frame.
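Restricting the match to pixels inside the object can be sketched as a masked error measure; the function name and the transparent-block convention are illustrative assumptions.

```python
import numpy as np

def masked_mad(target, mask, cand):
    """Mean absolute difference computed only over pixels inside the object:
    padded pixels outside the shape mask do not contribute to the error."""
    m = mask.astype(bool)
    if not m.any():
        return 0.0  # fully transparent block: nothing to match
    diff = np.abs(target.astype(np.int32) - cand.astype(np.int32))
    return float(diff[m].mean())
```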
For maximum accuracy, an exhaustive search is used, and matching is
conducted to half-pixel precision.
Figure 5: Padded VOP.
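Half-pixel matching can be sketched as a refinement step around an integer-pel result, using bilinear interpolation of the reference. This is a common simplification rather than the method prescribed here; the candidate block is assumed to stay inside the reference frame.

```python
import numpy as np

def bilinear_block(ref, y, x, block):
    """Bilinearly interpolated block whose top-left corner lies at the
    (possibly fractional) reference position (y, x); the caller must keep
    the block, plus one pixel of margin, inside the reference frame."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    fy, fx = y - y0, x - x0
    r = ref.astype(np.float64)
    a = r[y0:y0 + block, x0:x0 + block]
    b = r[y0:y0 + block, x0 + 1:x0 + block + 1]
    c = r[y0 + 1:y0 + block + 1, x0:x0 + block]
    d = r[y0 + 1:y0 + block + 1, x0 + 1:x0 + block + 1]
    return (1 - fy) * (1 - fx) * a + (1 - fy) * fx * b \
        + fy * (1 - fx) * c + fy * fx * d

def half_pel_refine(cur, ref, bx, by, dx, dy, block=16):
    """Refine an integer-pel vector (dx, dy) by testing the surrounding
    half-pel positions and keeping the one with the smallest MAD."""
    target = cur[by:by + block, bx:bx + block].astype(np.float64)
    best, best_mad = (float(dx), float(dy)), np.inf
    for hy in (-0.5, 0.0, 0.5):
        for hx in (-0.5, 0.0, 0.5):
            cand = bilinear_block(ref, by + dy + hy, bx + dx + hx, block)
            mad = np.abs(target - cand).mean()
            if mad < best_mad:
                best_mad, best = mad, (dx + hx, dy + hy)
    return best, best_mad
```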
The block-matching process is similar to the full-frame approach described
earlier. An example FSBM block structure can be seen in figure 6a.
Variable Size Block-Matching (VSBM)
Since frame-based VSBM results in a better estimate of "true" motion
and hence more efficient coding of vector information, one would expect
that it can be applied to object-based systems with similar effects [8].
There are two problems to overcome when using a basic block matching
approach to find true motion:
-
The first is the majority effect, where any small area of motion inside
a block will simply be lost as the matching error for the block is determined
by the majority of the block. This is an argument against the use of large
block sizes. Furthermore, a single block cannot effectively represent more
than one motion, so there is always a trade-off between block size and
quality of match.
-
The aperture problem is the second difficulty, in this case associated
with small block sizes. The fewer pixels there are to match, the more spurious
matches there will be due to ambiguity. Additionally there is little point
in having the overhead of many small blocks if they all have the same vector
(if they don't, the vectors are unlikely to all be correct).
Figure 6: (a) FSBM block structure, (b) VSBM block structure.
Figure 6b shows the result of applying VSBM to the same frame of the "Weather"
sequence as in 6a, for the same quality of prediction. While FSBM requires
109 blocks, VSBM needs only 44, a saving of almost 60%.
Motion vectors are then variable length coded using a differential,
object-based 2D prediction strategy [9].
References
[1] J.R. Jain and A.K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Trans. Commun., Vol. COM-29, No. 12, pp. 1799-1808, Dec. 1981.
[2] J. Ribas-Corbera and D.L. Neuhoff, "On the optimal block size for block-based, motion compensated video coders", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3024, pp. 1132-1143, February 1997.
[3] ITU-T Recommendation H.263, "Video coding for low bit rate communication", Geneva, March 1996.
[4] MPEG-4 Video Group, "MPEG-4 Video Verification Model Version 10.0", ISO/IEC JTC1/SC29/WG11 Document MPEG98/N1992, San Jose, February 1998.
[5] M.H. Chan, Y.B. Yu and A.G. Constantinides, "Variable size block matching motion compensation with applications to video coding", IEE Proceedings, Vol. 137, Pt. I, No. 4, August 1990.
[6] G.R. Martin, R.A. Packwood and I. Rhee, "Variable size block matching estimation with minimal error", SPIE Conference on Digital Video Compression: Algorithms and Technologies 1996, San Jose, USA, Vol. 2668, pp. 324-333, February 1996.
[7] G.R. Martin, R.A. Packwood and M.K. Steliaros, "Reduced entropy motion compensation using variable sized blocks", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3024, pp. 293-302, February 1997.
[8] R.A. Packwood, M.K. Steliaros and G.R. Martin, "Variable size block matching motion compensation for object-based video coding", IEE 6th International Conference on Image Processing and its Applications, Dublin, Ireland, pp. 56-60, July 1997.
[9] M.K. Steliaros, G.R. Martin and R.A. Packwood, "Locally-accurate motion estimation for object-based video coding", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3309, pp. 306-316, January 1998.