Network Working Group                                      T. Terriberry
Internet-Draft                                       Mozilla Corporation
Intended status: Informational                              July 6, 2015
Expires: January 7, 2016


             Overlapped Block Motion Compensation for NETVC
                     draft-terriberry-netvc-obmc-00

Abstract

   This document proposes a scheme for overlapped block motion
   compensation that could be incorporated into a next-generation video
   codec.  The scheme described is that currently used by Xiph's Daala
   project, which supports variable block sizes without introducing any
   discontinuities at block boundaries.  This makes the scheme suitable
   for use with lapped transforms or other techniques where encoding
   such discontinuities is expensive.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of


Terriberry               Expires January 7, 2016                [Page 1]

Internet-Draft                Coding Tools                     July 2015


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Adaptive Subdivision OBMC . . . . . . . . . . . . . . . . . .   3
   3.  Implementation and Motion Estimation  . . . . . . . . . . . .   7
     3.1.  Initial Estimates . . . . . . . . . . . . . . . . . . . .  10
     3.2.  Adaptive Subdivision  . . . . . . . . . . . . . . . . . .  10
     3.3.  Iterative Refinement  . . . . . . . . . . . . . . . . . .  12
       3.3.1.  Rate and Distortion Changes . . . . . . . . . . . . .  12
       3.3.2.  Complexity Reduction  . . . . . . . . . . . . . . . .  13
       3.3.3.  Subpel Refinement . . . . . . . . . . . . . . . . . .  14
   4.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  15
     4.1.  Informative References  . . . . . . . . . . . . . . . . .  15
     4.2.  URIs  . . . . . . . . . . . . . . . . . . . . . . . . . .  16
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  16

1.  Introduction

   Most current video codecs still use straightforward Block Matching
   Algorithms (BMA) to perform motion compensation, despite their
   simplicity.  These algorithms simply copy a block of pixels from a
   reference frame, possibly after applying a sub-pixel (subpel) filter
   to allow for increased motion resolution.  When the motion vectors
   (MVs) of two adjacent blocks differ, a discontinuity is likely to be
   created along the block edge.  These discontinuities are expensive to
   correct with transform stages that do not themselves have
   discontinuities along block edges, such as lapped transforms
   [I-D.egge-videocodec-tdlt].  Even with a more traditional block-based
   DCT as the transform stage, the creation of these discontinuities
   requires that some residual be coded to correct them (and to activate
   loop filtering along these edges) and requires that the size of a
   transform block used to code that residual be no larger than the size
   of a prediction block (or they will suffer the same efficiency
   problem as lapped transforms in correcting them)

   Overlapped Block Motion Compensation (OBMC) avoids discontinuities on
   the block edges by copying slightly larger blocks of pixels, and
   blending them together with those of neighboring blocks, in an
   overlapping fashion.  Under the assumption that pixels in the
   reference frames are highly spatially correlated, this blending
   compensates for motion uncertainty at the pixels farthest from the
   estimated MVs.  This improves the accuracy of the prediction near
   block edges, making the expected error more uniform across a block,
   and improving coding performance over a similar BMA scheme (with


Terriberry               Expires January 7, 2016                [Page 2]

Internet-Draft                Coding Tools                     July 2015


   fixed-size blocks) by 0.4 dB [KO95] to 1.0 dB [KK97], depending on
   the search strategy used.

   Non-overlapped BMA schemes that support varying the block size give
   much better compression than fixed-size schemes [Lee95].  Although
   previous formats such as Dirac use OBMC, it has always been with a
   (small) fixed blending window size.  The size of a block might vary
   from, say, 4x4 to 16x16 pixels, with each block given a single MV,
   but the overlap with neighboring blocks remains fixed, limited by the
   smallest supported block size to, say, 2 pixels on either side of an
   edge (the exact sizes in Dirac are configurable, but do not vary
   within a frame).  This is essentially equivalent to performing
   prediction for the entire frame at the smallest block size (4x4) with
   an efficient scheme for specifying that the same MV be used for many
   of the blocks.

   We propose a subdivision scheme for OBMC that supports adaptive
   blending window sizes, allowing much larger blending windows in
   blocks that are not subdivided, which previous research has suggested
   should improve prediction performance compared to fixed-size
   windows [ZAS98].  Our scheme uses simple window functions that can be
   computed on the fly, rather than stored in large tables, allowing it
   to scale to very large block sizes.  It admits a dynamic programming
   algorithm to optimize the subdivision levels with a reasonable
   complexity.

2.  Adaptive Subdivision OBMC

   Traditional BMA schemes and previous OBMC schemes have a one-to-one
   correspondence between blocks and MVs, with each block having a
   single MV.  That MV is most reliable in the center of the block,
   where the prediction error is generally the smallest [ZSNKI02].
   Instead, we use a grid of MVs at the corners of blocks.  With a
   fixed-size grid, away from the edges of a frame, this difference is
   mostly academic: equivalent to shifting block boundaries by half the
   size of a block in each direction.  However, with variable-sized
   blocks, the distinction becomes more relevant: there is no longer a
   one-to-one correspondence between blocks and MVs.  Under the scheme
   where MVs correspond to the center of a block, splitting a block
   removes the old MV at the center of the old block and produces new
   MVs at the centers of the new blocks.  Under the scheme where MVs
   belong to the corners, splitting a block retains the MVs at the
   existing corners (corresponding to the same motion as before), but
   may add new MVs at the new block corners.

   We use square blocks with an origin in the upper-left corner and x
   and y coordinates that vary between 0 and 1 within a block.  The


Terriberry               Expires January 7, 2016                [Page 3]

Internet-Draft                Coding Tools                     July 2015


   vertices and edges of a block are indexed in a clockwise manner, as
   illustrated in Figure 1.

                     mv[0]                       mv[1]
                           \  x --->           /
                            \                 /
                             0---------------1
                          y  |       0       |
                             |               |
                          |  |               |
                          v  |3             1|
                     mv[3]   |       mv[2]   |
                           \ |             \ |
                            \|       2      \|
                             3---------------2

               Figure 1: Vertex and Edge Indices for a Block

   In a block with MVs at all four corners, we use normal bilinear
   weights to blend the predictions from each MV.  The bilinear weights
   for each vertex, w[k], at a pixel location (x, y) are defined as

                          w[0] = (1 - x)*(1 - y)
                          w[1] = x*(1 - y)
                          w[2] = x*y
                          w[3] = (1 - x)*y

   Let "I" be the reference image, and for simplicity denote the
   predictor I[x + mv[k].x, y + mv[k].y] for the pixel at location
   (x, y) with motion vector mv[k] as simply I(mv[k]).  In a regular
   grid, with unique motion vectors defined at all four corners of a
   block, we predict the interior of the block using

       I(mv[0])*w[0] + I(mv[1])*w[1] + I(mv[2])*w[2] + I(mv[3])*w[3]

   In order to extend OBMC to handle variable block sizes while
   maintaining continuity along the edges, we start by imposing the
   restriction that the size of two adjacent blocks differ by at most a
   factor of two, which greatly simplifies the problem.  To do this, we
   borrow a data structure from the related graphics problem of surface
   simplification, the semi-regular 4-8 mesh [VG00].  This is normally
   used to represent subdivisions in a triangle mesh, but we use it for
   variable-sized quadrilaterals.


Terriberry               Expires January 7, 2016                [Page 4]

Internet-Draft                Coding Tools                     July 2015


                                                         H
         0---------------0   0---------------0   0-------2-------0
         |               |   |       :    _/ |   |       H       |
         |               |   |       : __/   |   |       H       |
         |               |   |       :/      |   |       H       |
         |               |   |.......1.......|  =2=======1=======2=
         |               |   |    __/:       |   |       H       |
         |               |   |  _/   :       |   |       H       |
         |               |   | /     :       |   |       H       |
         0---------------0   0---------------0   0-------2-------0
                                                         H
              Level 0             Level 1             Level 2

                                           H       H
                   0-------2-------0   0---4---2---4---0
                   |   :/  : \_:   |   |   H   :   H   |
                   |...3...:...3...|  =4===3===4===3===4=
                   | / :   :   :\_ |   |   H   :   H   |
                   2...:...1...:...2   2...4...1...4...2
                   | \_:   :   : / |   |   H   :   H   |
                   |...3...:...3...|  =4===3===4===3===4=
                   |   :\_ : / :   |   |   H   :   H   |
                   0-------2-------0   0---4---2---4---0
                                           H       H
                        Level 3             Level 4

    The first four subdivision levels in a 4-8 mesh.  Numbers indicate
    vertices with transmitted MVs.  Diagonal lines (on odd levels) and
    double lines (on even levels) connect each new vertex to its two
    parents at the previous level (in some cases, this parent may lie
    in an adjacent block).  Dotted lines indicate additional block
    boundaries.

                Figure 2: Subdivision Levels in a 4-8 Mesh

   Subdivision in a 4-8 mesh proceeds in two phases, as illustrated in
   Figure 2.  In the first phase, a new vertex is added to the center of
   a quadrilateral.  This subdivides the quadrilateral into four
   "quadrants", but does not add any aditional vertices to the edges.
   Such edges are called "unsplit".  In the second phase, each of the
   quadrilateral's edges may be split and connected to the center
   vertex, forming four new quadrilaterals.  One useful property of this
   two-phase subdivision is that the number of vertices in the mesh
   merely doubles during each phase, instead of quadrupling as it would
   under normal quadtree subdivision.  This provides more fine-grained
   control over the number of MVs transmitted.


Terriberry               Expires January 7, 2016                [Page 5]

Internet-Draft                Coding Tools                     July 2015


   To ensure that the size of two adjacent blocks differs by no more
   than a factor of two, we assign every vertex two parents in the mesh,
   which are the two adjacent vertices from the immediately preceding
   subdivision level.  A vertex may not be added to the mesh until both
   of its parents are present.  That is, a level 2 vertex may not be
   added to an edge until the blocks on either side have had a level 1
   vertex added, and a level 3 vertex may not be added to the center of
   a block until both of the level 2 vertices have been added to its
   corners, and so on.

   Therefore, we need only specify how to handle a block that has
   undergone phase one subdivision, but still has one or more unsplit
   edges, as illustrated in Figure 3.  Such a block is divided into
   quadrants, and each is interpolated separately using a modified
   version of the previous bilinear weights.

                      c   mv[0]                  mv[1]
                        0------------------------------1
                  mv[0] | mv[0] # mv[1] : mv[0]  mv[1] |
                        |###############:              |
                        |###############:              |
                        |###############:              |
                        |###############:              |
                        |###############:              |
                        | mv[3] # mv[4] : mv[4]  mv[5] |
                        |...............4--------------5
                        | mv[0]   mv[4] |              |
                        |               |              |
                        |               |              |
                        |               |              |
                        |               |              |
                        |               |              |
                  mv[3] | mv[3]   mv[6] |              |
                        3---------------6--------------2

              Figure 3: Interpolation Setup for Unsplit Edges

   The same two MVs are used along the unsplit edge(s) as before, but we
   shift some of the weight used for blending from the middle of the
   edge to the exterior corner.  More precisely, the weights w[k] are
   replaced by modified weights s[k].  For example, if c is the index of
   the vertex in the exterior corner, (+) denotes addition modulo four,
   and c (+) 1 is the index of the corner bisecting the unsplit edge
   (the top edge in the figure), then

                          s[c] = w[c] + 0.5*w[c (+) 1]
                    s[c (+) 1] = 0.5*w[c (+) 1]


Terriberry               Expires January 7, 2016                [Page 6]

Internet-Draft                Coding Tools                     July 2015


   The remaining weights are unchanged.  A similar modification is used
   if it is c (+) 3 that lies on the unsplit edge.  The modifications
   are cumulative.  That is, if both c (+) 1 and c (+) 3 lie on unsplit
   edges (as in the hashed region in Figure 3),

                  s[c] = w[c] + 0.5*w[c (+) 1] + 0.5*w[c (+) 3]
            s[c (+) 1] = 0.5*w[c (+) 1]
            s[c (+) 3] = 0.5*w[c (+) 3]
            s[c (+) 2] = w[c (+) 2]

   This defintiion of the blending weights clearly matches an adjacent
   block along an unsplit edge, regardless of whether or not that block
   has been split.  Careful examination will verify that it also matches
   other quadrants along the interior edges.  Each weight can be
   evaluated with finite differences at the cost of one add per pixel,
   plus setup overhead.  The blending can be done with three multiplies
   per pixel by taking advantage of the fact that the weights sum to
   one, just as with regular bilinear interpolation.

   The mesh itself may require more vertices than an unconstrained mesh
   to achieve a given level of subdivision in a local area, but requires
   fewer bits to encode the subdivision itself, simply because there are
   fewer admissable meshes.  As long as a (0, 0) MV residual can be
   efficiently encoded, the worst-case rate of the 4-8 mesh should be
   close to that of a similar, unconstrained mesh.

   This process can be extended to handle blocks that differ by more
   than one level of subdivision, so long as the edge between them
   remains entirely unsplit.  For example, to handle block sizes that
   differ by a factor of four, instead of shifting half the blending
   weight from one vertex to the other, one simply needs to shift 1/4,
   1/2, or 3/4 of the weight, depending on the location of the block
   along the unsplit edge.  However, the 4-8 mesh is no longer suitable
   for describing which vertices can appear in the mesh, and some
   modifications ot the adaptive subdivision algorithm in Section 3.2
   are required.  We have not yet implemented these extensions.

3.  Implementation and Motion Estimation

   The algorithms in Section 2 have been implemented in the Daala video
   codec [Daala-website].  We use them to produce a complete "motion
   compensated reference frame", which is then lapped and transformed
   (in both the encoder and decoder) [I-D.egge-videocodec-tdlt] to make
   it available as a frequency-domain predictor for the transform
   stage [I-D.valin-netvc-pvq].  The full source code, including all of
   the OBMC work described in this draft is available in the project git
   repository at [1].


Terriberry               Expires January 7, 2016                [Page 7]

Internet-Draft                Coding Tools                     July 2015


   Luma blocks are square with sizes ranging from 32x32 to 4x4.  The
   corners of the MV blocks are aligned with the corners of the
   transform blocks.  An earlier design had the MV blocks offset from
   the transform blocks, so that MVs remained in the center of the
   transform blocks at the coarsest level, with an extra ring of
   implicit (0, 0) MVs around the frame (to keep the minimum number of
   transmitted MVs the same as with BMA).  However, we found that there
   was essentially no performance difference between the two approaches
   (see commit 461310929fc5).  Some things are simpler with the current
   approach (all of the special cases for the implicit (0, 0) MVs go
   away), and some things are more complicated, but most of the
   complications are confined to the computation of MV predictors.

   The encoder performs rate-distortion (R-D) optimization during motion
   estimation to balance the prediction error (D) attainable against the
   number of bits required to achieve it (R), e.g., minimizing

                             J = D + lambda*R

   The value of lambda is obtained directly from the target quantizer
   setting.

   We use the Sum of Absolute Differences (SAD) in the luma plane as our
   distortion metric for the first three stage of the search, and use
   the Sum of Absolute Transformed Difference (SATD) during a final
   subpel refinement stage (with an appropriate adjustment to lambda).

   We approximate the rate of small MV components (with a magnitude less
   than 3 after subtracting a predictor) using statistics from the
   previous frame, plus one sign bit for each non-zero component.
   Larger MV components have an additional non-adaptive rate cost added
   that increases logarithmically with the MV magnitude.  The rate term
   for each vertex also includes a bit for each flag that indicates the
   presence of a child (2 per vertex on average).  The real motion
   information is adaptively arithmetic
   encoded [I-D.terriberry-netvc-codingtools], but these approximations
   avoid having to update the rate term for every MV every time a single
   MV is changed.

   We use a median-of-four predictor for almost every MV, as illustrated
   in Figure 4.  The middle two values of each MV component are
   averaged, rounding towards even values.  There are two exceptions.
   If an MV required for prediction lies outside the frame, a (0, 0) MV
   is substituted in its place.  If an MV required for prediction lies
   inside a 32x32 block that comes after the current one in raster
   order, then that MV is ignored and we use the median of the three
   remaining MVs.  This occurs when predicting MVs on even levels that
   lie on the right or bottom edges of a 32x32 block.  MVs on the top


Terriberry               Expires January 7, 2016                [Page 8]

Internet-Draft                Coding Tools                     July 2015


   and left edges of the frame are considered to belong to the 32x32
   block below or to the right, respectively (that is, the corresponding
   MV in that block is not ignored).

                    |               |              |
                    |               |              |
                ----O---------------O--------------O
                    |\__            H           __/|
                    |   \_          H         _/   |
                    |     \_        H       _/     |
                    |       \_      H     _/       |
                    |         \__   H  __/         |
                    |            \  H /            |
                    |             _|VL             |
                ----O==============>X--------------+
                                 Level 0

                                              H
                  O-------+-------O   +-------O-------+
                  | \_    |    _/ |   |       H       |
                  |   \__ | __/   |   |       H       |
                  |      \|/      |   |       H       |
                  +-------X-------+  =O=======X=======O=
                  |    __/|\__    |   |       H       |
                  |  _/   |   \_  |   |       H       |
                  | /     |     \ |   |       H       |
                  O-------+-------O   +-------O-------+
                                              H
                      Odd Levels         Even Levels

      Key:
      X - MV being predicted
      O - MV used for prediction.  Except at level 0, these are all
          ancestors of the MV being predicted, and thus are required
          to be present.
      + - MV grid point not used for prediction (might not be coded)

            Figure 4: The Predictors Used for MVs at Each Level

   The current bitstream encodes MVs level-by-level for the entire
   frame.  It is expected at some point that this will be migrated to
   code all the MVs for a single 32x32 block at a time.  This is the
   reason for excluding predictors outside of the current 32x32 block.

   The number of combinations of subdivision levels and MVs available
   make finding a globally optimal set of parameters impractical.  The
   problem of finding optimal subdivision levels alone is known to be
   NP-hard [AS94].  The estimation procedure outlined below attempts to


Terriberry               Expires January 7, 2016                [Page 9]

Internet-Draft                Coding Tools                     July 2015


   balance speed with compression performance, though it could certainly
   be improved with future research.

3.1.  Initial Estimates

   First we produce an initial MV estimate at each point in the fully-
   subdivided grid using BMA.  We compute several MV candidates from
   spatial and temporal neighbors and assuming constant speed or
   acceleration.  The candidates are grouped into sets by reliability
   and the search terminates early if the best candidate from a set has
   a SAD below a dynamically chosen threshold.  Otherwise, a local
   gradient search is performed using a square pattern around the best
   candidate vector.  The thresholds ensure extra computation time is
   spent only on blocks whose predictor can be reasonably expected to
   improve.  Although we look solely at SAD to determine whether to
   continue the search, the candidates themselves are ranked using the
   full R-D cost metric, J.

   Level 0 searches using (non-overlapped) 32x32 blocks centered on the
   corresponding grid points, while the next two levels use 16x16
   blocks, the next two levels 8x8, and so on.  MVs are estimated from
   the coarsest levels to the finest, to allow for the accurate
   computation of MV predictors used in the rate estimates.  As the mesh
   is subdivided, existing grid points do not have their MVs re-
   estimated with smaller block sizes, even though the area that those
   MVs would influence in a grid subdivided to that level is reduced.
   All MVs are estimated only up to whole-pel accuracy at this stage.

3.2.  Adaptive Subdivision

   The second stage of motion estimation fixes the mesh subdivision.
   During this stage, the SAD for each block is computed using full OBMC
   instead of BMA.  The MVs produced in the previous stage are held
   fixed in this one.  Only the mesh subdivision level changes.

   The extra subdivision required to add a vertex to the 4-8 mesh is
   similar to the implicit subdivision used by Zhang et al. in their
   variable block size OBMC scheme [ZAS98].  The difference is that we
   optimize over and encode such subdivision explicitly.  We use a
   global R-D optimization strategy with general mesh decimations, as
   proposed by Balmeli [Bal01].  This is a greedy approach that starts
   with a full mesh and successively decimates vertices.  Restricting
   decimation candidates to the leaves of the mesh can frequently
   produce sequences where decimating a MV (reducing rate) causes
   distortion to go _down_, clearly indicating that the previous rate
   allocation was not optimal.  General mesh decimations, on the other
   hand, allow any MV to be removed at a given step, not just the
   leaves.  If a non-leaf is decimated, all of its children are


Terriberry               Expires January 7, 2016               [Page 10]

Internet-Draft                Coding Tools                     July 2015


   decimated as well.  This helps smooth out non-monotonicities in the
   distortion measure during the decimation process, especially at low
   rates

   The following notation is used to describe the algorithm.  The
   current mesh is denoted by M, and M_v is the "merging domain" of v in
   M: the set of vertices in M that must be removed to remove v.  This
   includes v and all of its undecimated children.  Additionally, the
   variation dU(M_v) contains the pairs (dD(M_v), dR(M_v)): the change
   in distortion (SAD) and rate (bits) caused by removing M_v from M.
   We also refer to the change in SAD in a single block b caused by
   removing a single vertex v as dD_b(v).  Finall, A_v is the set of
   ancestors of v in M.  Some minor additions to Balmelli's original
   algorithm are made to handle the fact that distortion is measured
   over squares, not triangles.  The steps of the algorithm are:

   1.  For all v, compute dU(M_v).

   2.  Do

       (a)  Let v* be the value of v in M for which -dD(M_v)/dR(M_v) is
            the smallest.

       (b)  If -dD(M_v*)/dR(M_v*) > lambda, stop.

       (c)  For all w in M_v*, sorted by depth from deepest to
            shallowest:

            i.     For all a in A_w, subtract dU(M_w) from dU(M_a).

            ii.    Remove w from the mesh.

            iii.   If w was on an even level, then for each adjacent
                   block b with a w' in M such that w' lies on the same
                   level as w:

                   A.  Let d be change in dD_b(w') before and after
                       decimating w.

                   B.  For all w in {w'} U A_w' \ A_w, add d to dD(M_a).

   These steps ensure that dU(M_v) contains the up-to-date changes in
   the global rate and distortion after each merging domain is
   decimated.  This update process properly accounts for overlapping
   merging domains due to an inclusion-exclusion principle.  See
   Balmelli for details [Bal01].  Step 2(c)iii handles the case of
   decimating one corner of a block, w, when the opposite corner, w',
   remains.  This changes dD_b(w'), the cost of decimating the opposite


Terriberry               Expires January 7, 2016               [Page 11]

Internet-Draft                Coding Tools                     July 2015


   corner, and that change must be propagated to each merging domain to
   which w' belongs.  No change needs to be made to the common ancestors
   of w and w' however: once dD(M_w') is updated, the normal update
   process that will be executed when w' is decimated is sufficient.
   The addition of these extra steps does not affect the computational
   complexity of the algorithm which is Theta(n log n), where n is the
   size of the initial mesh.

   The distortion measurements needed to initialize and update dU(M_v)
   can be computed once, in advance, by computing the SAD value of each
   block in all sizes and with all possible combinations of unsplit
   edges.  All told, each pixel in the image is used in exactly 13 SAD
   computations (one for the largest block size, with no unsplit edges,
   and for for each additional block size).  Also, since the mesh only
   undergoes six levels of subdivision, there are only a small number of
   unique merging domains and ancestor sets.  These can be computed
   offline and stored in tables to simplify the decimation process.  To
   compute the set difference A_w' \ A_w, we note that w and w' share a
   single common parent, p.  The common ancestors of w and w' are now
   formed by the set {p} U A_p, meaning one can add d to the nodes in
   A_w' and then subtract it from the nodes in {p} U A_p to effect the
   set difference in Step 2(c)iiiB.  Alternatively, one could simply use
   a larger set of lookup tables.

3.3.  Iterative Refinement

   The next stage uses the iterated dynamic programming (DP) proposed by
   Chen and Willson to refine the MVs, accounting for their
   interdependencies [CW00].  In this scheme, a single row (resp.
   column) of MVs is optimized at a time using a Viterbi
   trellis [For73], while the rest remain fixed.  If there is no direct
   block edge between two consecutive MVs in a row (resp.  column) then
   the trellis stops, and a new one is started.  This continues until
   the entire row (resp. column) has been examined.  The process is then
   repeated until the total change in Lagrangian cost, J, falls below a
   given threshold.

3.3.1.  Rate and Distortion Changes

   We use the change in rate and distortion to compute the cost of each
   path in the trellis.  A single MV can influence the distortion of as
   many as 12 neighboring blocks.  Only the ones to the left (resp.
   above) are added to the current cost of each path.  When the
   following MV is chosen, an additional 2 to 8 blocks may be added.  If
   necessary, the blocks to the right (resp. below) are added after the
   last MV in the trellis.


Terriberry               Expires January 7, 2016               [Page 12]

Internet-Draft                Coding Tools                     July 2015


   Unfortunately, the rate of a MV depends on the values of the MVs used
   to predict it.  Chen and Willson assume MVs use 1-D differential
   coding, as in MPEG-1.  With our prediction scheme, several (not
   necessarily consecutive) MVs on the DP path may be used to predict a
   given MV, and the corresponding change in rate is not known until a
   MV has been chosen for all of them.

   If we were to consider all possible combinations of candidates for
   the predictors, the number of trellis edges would increase by several
   orders of magnitude.  This seems excessively wasteful, since as long
   as the changes to the MVs are small, the median operation ensures
   only one or two of them are likely to have any influence on the
   predicted value in the first place.  Instead, we immediately compute
   the rate change in each predicted vector---excluding those that
   themselves lie further along the DP path, since we do not yet know
   what MV will be encoded.  We do this assuming all MVs not already
   considered by the DP remain fixed, and add the change to the cost of
   the current path.  If changing a subsequent MV on the path causes the
   rate of one of these predicted MVs to change again, the new rate
   change is used from then on.

   Because we essentially discard a large number of trellis states of
   limited utility, we might theoretically discard the path that does
   not change any MVs, even though its true cost is lower than the ones
   we keep.  Thus, as a safety precaution, we check the final cost of
   the best path, and do not apply it if it is greater than zero.  This
   does occur in practice, but very rarely.

   Other, simpler alternatives to this approach are also possible.  For
   example, we tried only considering rate changes for MVs on the actual
   DP path, which is much like Chen and Willson's approach.  However, on
   frames with complex motion, we have seen dramatic improvements in
   visual quality and motion field smoothness by properly accounting for
   all rate changes.  This is because a level 0 MV, for example, may be
   used to predict up to 24 other MVs, only 8 of which lie on a given DP
   path.  In a dense mesh, the rate changes off the path may dominate
   the ones on it.

3.3.2.  Complexity Reduction

   Chen and Willson showed that using a logarithmic search instead of an
   exhaustive one for the DP resulted in an average PSNR loss of only
   0.05 dB and an average MV bitrate increase of 55 bits per frame.  We
   take an even more aggressive approach, and replace the logarithmic
   search with a diamond search.  Because the complexity of a given DP
   chain increases quadratically in the number of MV candidates at each
   node, reducing the candidate count can give a substantial performance
   boost.


Terriberry               Expires January 7, 2016               [Page 13]

Internet-Draft                Coding Tools                     July 2015


   The logarithmic search uses a 9-candidate square pattern in each
   stage.  The displacements used in the pattern shrink by a factor of
   two in each phase.  Chen and Willson used a 3-phase search to achieve
   an effective search radius of +/- 7x7 pixels.  In our framework, this
   requires examining 3x9**2 = 243 trellis edges per MV for one full
   iteration.  However, the large displacement patterns only alter a
   small number of MVs that have usually been grossly mis-estimated.
   They are even likely to cause further mis-estimation in areas with
   repeated structure or lack of texture, which makes further refinement
   less effective.

   One alternative is to simply discard them, and only perform the
   square pattern search with one-pixel displacements.  The chance of
   being trapped in a local minimum is increased, but three times as
   many iterations can be performed in the same amount of time.  On the
   other hand, a small diamond search pattern has only 5 candidates,
   making it even more attractive.  This allows more than nine times as
   many iterations as the full logarithmic search in the same amount of
   time.  We support using the diamond search pattern, the square search
   pattern, and the square search pattern with logarithmic step sizes
   with various complexity settings, but by default run with only the
   diamond pattern with a single step size.  The logarithmic square
   pattern search saves between 0.6% and 1.2% rate on metrics, but adds
   50% to the total encode time.

   The computational complexity of this iterative refinement is still
   relatively high.  In a single iteration, each of the four edges of a
   block are traversed exactly once by a DP path, during which its SAD
   is evaluted 25 times, for a total of 100 SAD calculations per block.
   This is nearly as many as full search BMA with a +/- 6x6 window, and
   computing our blended predictors already has higher complexity.  Thus
   it is not suitable for a real-time implementation, but it can easily
   be disabled, or even lighter-weight versions designed.

3.3.3.  Subpel Refinement

   The same samll diamond-pattern search can be used to refine the
   motion vectors to subpel precision.  A square pattern is also
   supported at the highest complexity level, and saves an additional
   0.5% to 0.7% on bitrate, but is half the speed of the default
   settings.  Our implementation supports half-, quarter-, or eigth-pel
   resolution MVs.  First, the DP process is iterated with the small
   diamond and half-pel displacements until the change in Lagrangian
   cost, J, for an iteration falls below a given threshold.

   Finer resolutions are only used if they provide an overall R-D
   benefit, which is tested on a frame-by-frame basis.  First, iteration
   is done with quarter-pel displacements, followed, if successful, by


Terriberry               Expires January 7, 2016               [Page 14]

Internet-Draft                Coding Tools                     July 2015


   eigth-pel.  If the decrease in SAD from the finer resolution MVs
   cannot balance the (approximately) 2 bit per MV increase in bitrate,
   then the coarser vectors are used instead.

   Subpel interpolation is performed using a separable 6-tap polyphase
   filter bank.  Only eight filters are currently used, one for each
   subpel offset at eigth-pel resolution.  If chroma is decimated (for
   4:2:0 video) and eigth-pel MVs are used, then the MV is divided by
   two and rounded to the nearest even value to select an appropriate
   subpel filter.

4.  References

4.1.  Informative References

   [I-D.egge-videocodec-tdlt]
              Egge, N. and T. Terriberry, "Time Domain Lapped Transforms
              for Video Coding", draft-egge-videocodec-tdlt-01 (work in
              progress), March 2015.

   [I-D.terriberry-netvc-codingtools]
              Terriberry, T., "Coding Tools for a Next Generation Video
              Codec", draft-terriberry-netvc-codingtools-00 (work in
              progress), June 2015.

   [I-D.valin-netvc-pvq]
              Valin, J., "Pyramid Vector Quantization for Video Coding",
              draft-valin-netvc-pvq-00 (work in progress), June 2015.

   [AS94]     Agarwal, P. and S. Suri, "Surface Approximation and
              Geometric Partitions", Proc. of the 5th ACM-SIAM Symposium
              on Discrete Algorithms (SODA'94) pp. 24--33, January 1994.

   [Bal01]    Balmelli, L., "Rate-Distortion Optimal Mesh Simplification
              for Communications", PhD thesis, Ecole Polytechnique
              Federale de Lausanne, Switzerland, 2001.

   [CW00]     Chen, M. and A. Willson, "Motion-Vector Optimization of
              Control Grid Interpolation and Overlapped Block Motion
              Compensation Using Iterated Dynamic Programming", IEEE
              Transactions on Image Processing 9(7):1145--1157, July
              2000.

   [For73]    Forney, G., "The Viterbi Algorithm", Proceedings of the
              IEEE 61(3):268--278, March 1973.


Terriberry               Expires January 7, 2016               [Page 15]

Internet-Draft                Coding Tools                     July 2015


   [KO95]     Katto, J. and M. Ohta, "An Analytical Framework for
              Overlapped Motion Compensation", Proc. of the
              International Conference on Acoustics, Speech, and Signal
              Processing (ICASSP'95) vol. 4, pp. 2189--2192, May 1995.

   [KK97]     Kuo, T. and C. Kuo, "A Hybrid BMC/OBMC Motion Compensation
              Scheme", Proc. of the International Conference on Image
              Processing (ICIP'97) vol. 2, pp. 795--798, October 1997.

   [Lee95]    Lee, J., "Optimal Quadtree for Variable Block Size Motion
              Estimation", Proc. of the IEEE International Conference on
              Image Processing vol. 3, pp. 480--483, October 1995.

   [VG00]     Velho, L. and J. Gomes, "Variable Resolution 4-k Meshes:
              Concepts and Applications", Computer Graphics Forum
              19(4):195--212, December 2000.

   [ZAS98]    Zhang, J., Ahamd, M., and M.  Swamy, "New Windowing
              Techinques for Variable-Size Block Motion Compensation",
              IEE Proceedings---Vision, Image, and Signal Processing
              145(6):399--407, December 1998.

   [ZSNKI02]  Zhen, W., Shishikui, Y., Naemure, M., Kanatsugu, Y., and
              S. Itoh, "Analysis of Space-Dependent Characteristics of
              Motion-Compensated Frame Differences Based on a
              Statistical Motion Distribution Model", IEEE Transactions
              on Image Processing 11(4):377--386, April 2002.

   [Daala-website]
              "Daala website", Xiph.Org Foundation , <https://xiph.org/
              daala/>.

4.2.  URIs

   [1] https://git.xiph.org/daala.git

Author's Address

   Timothy B. Terriberry
   Mozilla Corporation
   331 E. Evelyn Avenue
   Mountain View, CA  94041
   USA

   Phone: +1 650 903-0800
   Email: tterribe@xiph.org


Terriberry               Expires January 7, 2016               [Page 16]