Block-Matching Motion Compensation
Introduction
Predictive coding is widely used in video transmission, especially for
low bit-rate coding. Typically only some fraction of an image changes from
frame to frame, allowing straightforward prediction from previous frames.
Motion compensation is used as part of the predictive process. If an image
sequence shows moving objects, then their motion within the scene can be
measured, and the information used to predict the content of frames later
in the sequence.
Frame-based block-matching motion compensation
Fixed Size Block-Matching (FSBM)
The technique was originally described by Jain and Jain [1]; it is easy
to implement, and thus widely adopted. Each image frame is divided into
a fixed number of blocks, usually square. For each block in the frame, a
search is made in the reference frame over an area of the image that allows
for the maximum translation that the coder can use. The search is for the
best-matching block, i.e. the one giving the least prediction error, usually
minimising either the mean square difference or the mean absolute difference,
the latter being easier to compute.
Typical block sizes are of the order of 16×16 pixels, and the maximum
displacement might be ±64 pixels from a block's original position. Several
search strategies are possible, usually using some kind of sampling mechanism,
but the most straightforward approach is exhaustive search. This is computationally
demanding in terms of data throughput, but algorithmically simple, and
relatively easily implemented in hardware.
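The exhaustive search described above can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the block size, search range, and the mean-absolute-difference (MAD) criterion are illustrative choices.

```python
import numpy as np

def fsbm_search(cur, ref, bx, by, block=16, max_disp=16):
    """Exhaustive fixed-size block-matching: find the displacement (dx, dy)
    that minimises the mean absolute difference for the block whose top-left
    corner is (bx, by) in the current frame."""
    h, w = ref.shape
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best, best_mad = (0, 0), np.inf
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            mad = np.abs(target - cand).mean()
            if mad < best_mad:
                best_mad, best = mad, (dx, dy)
    return best, best_mad
```

A block that has simply translated between frames is recovered exactly; for real imagery the search returns the displacement with the smallest residual error.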
A good match during the search means that a good prediction can be made,
but the improvement in prediction must outweigh the cost of transmitting
the motion vector. A good match requires that the whole block has undergone
the same translation, and the block should not overlap objects in the image
that have different degrees of motion, including the background.
The choice of block-size to use for motion compensation is always a
compromise: smaller and more numerous blocks can represent complex
motion better than fewer large ones, reducing the work and transmission costs
of subsequent correction stages, but at greater cost for the motion information
itself. The problem has been investigated by Ribas-Corbera and Neuhoff
[2] and they conclude that the choice of block-size can be affected not
only by motion vector accuracy but also by other scene characteristics
such as texture and inter-frame noise.
Figure 1: FSBM example.
It is well known that the motion vectors
resulting from FSBM are well correlated. Vector information can be coded
differentially using variable length codes. This is done in a number of
codecs (e.g. ITU-T H.263 [3]) and it is proposed for the MPEG-4 video
standard [4].
An example of the block structure generated is shown in figure 1 (a
frame from the MPEG-4 test sequence known as "Foreman"). It can be noticed
that the stationary background is represented by large numbers of blocks
with very similar motion vectors (represented by the short lines starting
from each block centre).
These motion vectors are subsequently variable length coded using
a differential 2D prediction mechanism.
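The differential stage can be sketched as below, using a simple left-neighbour predictor in place of the full 2D prediction mechanism, and omitting the variable-length entropy coding; names are illustrative. Because neighbouring vectors are well correlated, most of the resulting differences are small or zero, which is what makes variable-length coding effective.

```python
def diff_encode(vectors):
    """Differentially encode a raster-scan list of (dx, dy) motion vectors
    against the left neighbour (the first block is predicted as (0, 0))."""
    out, prev = [], (0, 0)
    for dx, dy in vectors:
        out.append((dx - prev[0], dy - prev[1]))
        prev = (dx, dy)
    return out

def diff_decode(deltas):
    """Invert diff_encode: accumulate the differences back into vectors."""
    out, prev = [], (0, 0)
    for ddx, ddy in deltas:
        prev = (prev[0] + ddx, prev[1] + ddy)
        out.append(prev)
    return out
```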
Variable Size Block-Matching (VSBM)
Proposals have been made for improvements to FSBM by varying the size
of blocks to more accurately match moving areas. Such methods are known
as variable size block matching (VSBM) methods. Chan, Yu and Constantinides
[5] have proposed a scheme that starts with relatively large blocks, which
are then repeatedly divided; this is a so-called top-down approach. If
the best matching error for a block is above some threshold, the block
is divided into four smaller blocks, until either the maximum number of blocks
is reached or locally minimum errors are obtained. The application of such top-down
methods may generate block structures for an image that match real moving
objects, but it seems that an approach which more directly seeks out areas
of uniform motion might be more effective.
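The top-down splitting step might be sketched as follows. Here `match_error` stands in for a full block-matching search and simply measures the zero-motion MAD; the threshold and minimum block size are illustrative assumptions, not values from [5].

```python
import numpy as np

def match_error(cur, ref, x, y, size):
    # Zero-motion MAD as a stand-in for a full block-matching search.
    a = cur[y:y + size, x:x + size].astype(np.int32)
    b = ref[y:y + size, x:x + size].astype(np.int32)
    return np.abs(a - b).mean()

def split_top_down(cur, ref, x, y, size, threshold, min_size=4):
    """Top-down splitting: if the best matching error for a block exceeds
    `threshold`, divide it into four quadrants and recurse. Returns a list
    of (x, y, size) leaf blocks."""
    err = match_error(cur, ref, x, y, size)
    if err <= threshold or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for qy in (y, y + half):
        for qx in (x, x + half):
            blocks += split_top_down(cur, ref, qx, qy, half, threshold, min_size)
    return blocks
```

Regions that already match well stay as large blocks; only the quadrants containing poorly predicted detail are divided further.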
We developed a VSBM technique [6] that detects areas of common motion,
grouping them into variable sized blocks with a coding strategy based on
the use of quad-trees. Use of a quad-tree obviates the need to describe
the size and position of each block explicitly, only the tree description
is needed. The vectors for each block in the tree are identical in nature
to those of FSBM. As the process is a grouping together of smaller blocks
to form larger ones, it is generally regarded as a bottom-up technique.
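The grouping can be sketched as a bottom-up quad-tree build over a grid of per-block motion vectors. For simplicity this sketch merges four siblings only when their vectors are identical; the method in [6] is based on the prediction error rather than exact equality.

```python
def build_quadtree(grid, x, y, size):
    """Bottom-up quad-tree over a 2-D grid of per-block (dx, dy) vectors,
    with `size` a power of two (in block units). Returns either a single
    vector (a merged leaf covering the whole square) or a list of four
    subtrees in the order NW, NE, SW, SE."""
    if size == 1:
        return grid[y][x]
    half = size // 2
    kids = [build_quadtree(grid, x, y, half),
            build_quadtree(grid, x + half, y, half),
            build_quadtree(grid, x, y + half, half),
            build_quadtree(grid, x + half, y + half, half)]
    if all(not isinstance(k, list) and k == kids[0] for k in kids):
        return kids[0]  # all four children carry the same vector: merge them
    return kids
```

Only the tree shape and one vector per leaf need be transmitted, which is why the quad-tree avoids describing block sizes and positions explicitly.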
Figure 2: VSBM example.
For the same number of blocks per frame
as FSBM, our VSBM method results in a smaller mean square error (MSE),
or better prediction. More significantly, for a similar level of MSE as
FSBM, the VSBM technique can represent the inherent motion using fewer
(variable-sized) blocks, and thus a reduced number of motion vectors.
An example of the block structure generated is shown in figure 2. It
can be seen that the stationary background is represented by comparatively
few (large) blocks, whereas the moving parts of the image are represented
by smaller blocks, and hence a larger number of motion vectors.
Motion vectors are subsequently variable length coded using a quad-tree
based 2D predictor mechanism [7].
Object-based block-matching motion compensation
Evolving object-based video coding standards, such as MPEG-4, permit
arbitrary-shaped objects to be encoded and decoded as separate video object
planes (VOPs). There are several motivating scenarios behind the use of
objects:
-
where transmission bandwidth or decoder performance is limited, the user
may be able to select some subset of the video objects that are of particular
interest,
-
the user may wish to manipulate objects at the receiver, i.e. change position,
size and depth ordering, depending on interest,
-
it may be possible to replace the content of an object with material generated
later or local to the receiver/display, which can be used for enhanced
visualisation and "augmented reality".
The exact method of producing VOPs from
the source imagery is not defined, but it is assumed that "natural" objects
are represented by shape information in addition to the usual luminance
and chrominance components. Shape data is provided as a binary segmentation
mask, or a grey scale alpha plane to represent multiple overlaid objects.
Figure 3 shows a cropped frame from the MPEG-4 test sequence known as
"Weather". This is made up of the synthetic weather-map, and the presenter
part of the sequence for which there is an object mask or shape description.
Once the shape information has been used to derive the partial image
corresponding to the object, the resulting pixel data is coded much as
if it were a full image.
Figure 3: VOP example.
Fixed Size Block-Matching (FSBM)
Figure 4: FSBM padding example.
An example object is shown in figure 4; the object appears in both
current and reference frames, with the bounding rectangle made up of
coding blocks.
The origin of the rectangle is adjusted to minimise the number of separate
blocks required, although in the example this is not a particular issue.
Two kinds of padding are shown:
(i) normal padding is used to fill partially populated blocks, based
on a form of bi-directional interpolation,
(ii) extended padding is used to fill an extra layer of blocks by pixel
replication in the appropriate direction.
The padding is designed to feed the best possible input into the next
stage of coding when, owing to some change in the object, parts of it
do not exist in the reference frame.
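A simplified sketch of normal padding follows, assuming a two-pass fill (horizontal, then vertical) in which each undefined pixel takes the average of the nearest defined pixels on either side, or a copy when only one side exists. The exact MPEG-4 procedure differs in detail, and extended padding by replication is omitted here.

```python
import numpy as np

def pad_line(vals, ok):
    """1-D padding: each undefined sample (ok[i] False) takes the average of
    the nearest defined samples on either side, or a copy when only one side
    exists. A line with no defined samples is returned unchanged."""
    out = vals.astype(np.float64).copy()
    idx = np.flatnonzero(ok)
    if idx.size == 0:
        return out, ok.copy()
    for i in range(len(vals)):
        if ok[i]:
            continue
        left, right = idx[idx < i], idx[idx > i]
        if left.size and right.size:
            out[i] = (vals[left[-1]] + vals[right[0]]) / 2.0
        elif left.size:
            out[i] = vals[left[-1]]
        else:
            out[i] = vals[right[0]]
    return out, np.ones(len(vals), dtype=bool)

def normal_pad(vop, mask):
    """Horizontal pass then vertical pass over the bounding rectangle;
    rows with no object pixels are filled by the vertical pass."""
    out = vop.astype(np.float64).copy()
    ok = mask.astype(bool).copy()
    for r in range(out.shape[0]):
        out[r], ok[r] = pad_line(out[r], ok[r])
    for c in range(out.shape[1]):
        out[:, c], ok[:, c] = pad_line(out[:, c], ok[:, c])
    return out
```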
Figure 5 shows a padded example of the weather presenter object. The
resulting VOP is used as the reference frame for the block matching operation,
but only pixels inside the object proper are matched against this data.
Motion estimation is performed by matching the 16×16 blocks within a prescribed
search area of the previous frame.
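Restricting the match to pixels inside the object can be sketched as a masked error measure; the function name and the transparent-block convention are illustrative assumptions.

```python
import numpy as np

def masked_mad(target, mask, cand):
    """Mean absolute difference computed only over pixels inside the object:
    padded pixels outside the shape mask do not contribute to the error."""
    m = mask.astype(bool)
    if not m.any():
        return 0.0  # fully transparent block: nothing to match
    diff = np.abs(target.astype(np.int32) - cand.astype(np.int32))
    return float(diff[m].mean())
```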
For maximum accuracy, an exhaustive search is used, and matching is
conducted to half-pixel precision.
Figure 5: Padded VOP.
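Half-pixel matching can be sketched as a refinement step around an integer-pel result, using bilinear interpolation of the reference. This is a common simplification rather than the method prescribed here; the candidate block is assumed to stay inside the reference frame.

```python
import numpy as np

def bilinear_block(ref, y, x, block):
    """Bilinearly interpolated block whose top-left corner lies at the
    (possibly fractional) reference position (y, x); the caller must keep
    the block, plus one pixel of margin, inside the reference frame."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    fy, fx = y - y0, x - x0
    r = ref.astype(np.float64)
    a = r[y0:y0 + block, x0:x0 + block]
    b = r[y0:y0 + block, x0 + 1:x0 + block + 1]
    c = r[y0 + 1:y0 + block + 1, x0:x0 + block]
    d = r[y0 + 1:y0 + block + 1, x0 + 1:x0 + block + 1]
    return (1 - fy) * (1 - fx) * a + (1 - fy) * fx * b \
        + fy * (1 - fx) * c + fy * fx * d

def half_pel_refine(cur, ref, bx, by, dx, dy, block=16):
    """Refine an integer-pel vector (dx, dy) by testing the surrounding
    half-pel positions and keeping the one with the smallest MAD."""
    target = cur[by:by + block, bx:bx + block].astype(np.float64)
    best, best_mad = (float(dx), float(dy)), np.inf
    for hy in (-0.5, 0.0, 0.5):
        for hx in (-0.5, 0.0, 0.5):
            cand = bilinear_block(ref, by + dy + hy, bx + dx + hx, block)
            mad = np.abs(target - cand).mean()
            if mad < best_mad:
                best_mad, best = mad, (dx + hx, dy + hy)
    return best, best_mad
```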
The block-matching process is similar to the full-frame approach described
earlier. An example FSBM block structure can be seen in figure 6a.
Variable Size Block-Matching (VSBM)
Since frame-based VSBM results in a better estimate of "true" motion
and hence more efficient coding of vector information, one would expect
that it can be applied to object-based systems with similar effects [8].
There are two problems to overcome when using a basic block matching
approach to find true motion:
-
The first is the majority effect, where any small area of motion inside
a block will simply be lost as the matching error for the block is determined
by the majority of the block. This is an argument against the use of large
block sizes. Furthermore, a single block cannot effectively represent more
than one motion, so there is always a trade-off between block size and
quality of match.
-
The aperture problem is the second difficulty, in this case associated
with small block sizes. The fewer pixels there are to match, the more spurious
matches there will be due to ambiguity. Additionally there is little point
in having the overhead of many small blocks if they all have the same vector
(if they don't, the vectors are unlikely to all be correct).
Figure 6: (a) FSBM block structure, (b) VSBM block structure.
Figure 6b shows the result of applying VSBM to the same frame of the "Weather"
sequence as in 6a, for the same quality of prediction. While FSBM requires
109 blocks, VSBM needs only 44, a saving of almost 60%.
Motion vectors are then variable length coded using a differential,
object-based 2D prediction strategy [9].
References
[1] J.R. Jain and A.K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Trans. Commun., Vol. COM-29, No. 12, pp. 1799-1808, Dec. 1981.
[2] J. Ribas-Corbera and D.L. Neuhoff, "On the optimal block size for block-based, motion compensated video coders", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3024, pp. 1132-1143, February 1997.
[3] ITU-T Recommendation H.263, "Video coding for low bit rate communication", Geneva, March 1996.
[4] MPEG-4 Video Group, "MPEG-4 Video Verification Model Version 10.0", ISO/IEC JTC1/SC29/WG11 Document MPEG98/N1992, San Jose, February 1998.
[5] M.H. Chan, Y.B. Yu and A.G. Constantinides, "Variable size block matching motion compensation with applications to video coding", IEE Proceedings, Vol. 137, Pt. I, No. 4, August 1990.
[6] G.R. Martin, R.A. Packwood and I. Rhee, "Variable size block matching estimation with minimal error", SPIE Conference on Digital Video Compression: Algorithms and Technologies 1996, San Jose, USA, Vol. 2668, pp. 324-333, February 1996.
[7] G.R. Martin, R.A. Packwood and M.K. Steliaros, "Reduced entropy motion compensation using variable sized blocks", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3024, pp. 293-302, February 1997.
[8] R.A. Packwood, M.K. Steliaros and G.R. Martin, "Variable size block matching motion compensation for object-based video coding", IEE 6th International Conference on Image Processing and its Applications, Dublin, Ireland, pp. 56-60, July 1997.
[9] M.K. Steliaros, G.R. Martin and R.A. Packwood, "Locally-accurate motion estimation for object-based video coding", SPIE Proceedings of Visual Communications and Image Processing, Vol. 3309, pp. 306-316, January 1998.