Internet Engineering Task Force       Audio-Video Transport Working Group
Internet Draft                                                Neil Harris
draft-harris-rtp-pro-av-00.txt                            Sohonet Limited
                                                            June 13, 1996
                                                          Expires: Jan 97

          A Professional Profile for Audio and Video over RTP?

STATUS OF THIS MEMO

This document is an Internet-Draft. Internet-Drafts are working documents
of the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time. It
is inappropriate to use Internet-Drafts as reference material or to cite
them other than as ``work in progress''.

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

ABSTRACT

This note discusses a proposed embedding for uncompressed video and audio
within an RTP stream. Whilst this is not a finished document, and
although it was not written with wider distribution in mind, I think it
may be useful as a 'straw man' proposal to provoke further discussion of
the use of RTP for professional media.

1. Rationale

At the time of writing [early 1996] professional audio and video
production is undergoing a fundamental change, from dedicated hardware
processing and dedicated digital transports, to computer-based processing
and general-purpose network transports.

Digital data compression will undoubtedly be the dominant format in
domestic and broadcast media delivery, due to its advantages in
performance and bandwidth, combined with high perceived quality. However,
for the origination of material, data compression is not an ideal
technology. Firstly, compression of material generates artifacts which
can accumulate over multiple generations of processing. Secondly,
compressed media are not in a suitable form for processing: they
typically have to be decompressed, either in hardware or software,
processed, and re-compressed again.

Data compression is currently popular because it has allowed late-1980's
technology to process pictures at near-professional quality. However, for
these professional applications, network technologies are now available
which can meet the bandwidth needs of current professional production,
and beyond. In the next two years, these technologies will move from
being relatively expensive to being relatively inexpensive. With the
widespread availability of wideband I/O busses, and 500 MHz processors
and beyond, processing uncompressed images on standard workstations will
become commonplace. In fact, working with compressed images may well
represent an overall increase in complexity and use of system resources.

At the same time, research into new and better compression methods and
improved picture and sound transducers continues. Any attempt to create a
new standard around an existing compression method will fix the bandwidth
and computation compromises of present-day technology into professional
media for the life of that standard.
This will have detrimental effects on programmes made using these
standards: in five, ten or twenty years, the effects of their compression
artifacts and compromises will stand out as clearly as 1970's
colour-camera artifacts and composite-video standards do today.

In contrast, material originated on film continues to preserve its value,
as it was based on the best technology of its day, and continues to
evolve to this day. Note, however, that the sprocket holes and film
gauges are unchanged from the 1930's. This has not stopped film-makers
from using a large number of frame formats, aspect ratios and types of
stock and processing over that period to create their artistic effects.

Adopting a poor standard in order to replace it later will have adverse
effects on the growth of the professional media industry: it would seem
reasonable that a standard designed in 1996 should be targeted for the
inexpensive hardware of 1997-8, rather than the special-purpose hardware
of 1996. Such a standard can be adopted today for high-end systems, with
either high-performance computing resources or hardware acceleration,
with low-end systems following early next year.

The adoption of a lossless standard for interworking should not unfairly
disadvantage compression-based systems. These systems must already be
able to compress and decompress material in real time in order to work at
all: exchanging video and audio over the network in an uncompressed form
should not compromise their internal workings in their proprietary
compressed internal formats and protocols.

Good standards already exist for professional media, in the form of CCIR
601 video and AES/EBU audio, together with EBU/SMPTE timecode. These
formats are vendor-neutral, and have already allowed for the explosive
growth of the video and audio industries. Any new standard should provide
for all the capabilities of these existing standards, whilst providing
for future expansion to full uniform sampling, RGB colour space, and
higher tonal depths and resolutions. The mapping to the underlying
network layer should be as simple as possible, without loss of generality
or connectivity.

Such a standard already exists. The IETF has established a draft standard
for transport of real-time data over networks: RTP.

2. RTP: The Internet Real-Time Protocol

RTP provides a means of transmitting time-stamped, sequence-numbered
packets with information relating to synchronisation source and
contributing entities. Whilst RTP is currently used for low-bandwidth
video- and audio-conferencing over networks, it can naturally be extended
to high-bandwidth media, such as professional audio and video. To do
this, we need to define a profile that describes how digital video or
audio fits into the stream of packets defined by RTP.

Note that RTP does not provide any means for error correction or retries.
However, it does allow for error detection based on packets being dropped
by the lower-level transport, which is typically equipped with CRC error
detection. We assume that the underlying transport is already fast and
reliable, and that any further error detection or correction is only
there to make an already reliable system even more reliable.
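For reference, the fixed header that RTP places at the start of every
packet carries the version, marker bit, payload type, sequence number,
timestamp and synchronisation source fields referred to throughout this
note. The following minimal C sketch parses that fixed header as laid
out in the RTP specification (RFC 1889); the struct and function names
are illustrative, and are not part of this profile.

   #include <stdint.h>
   #include <stddef.h>

   /* The 12-byte fixed RTP header of RFC 1889, read from network
      (big-endian) byte order.  Field names are illustrative only. */
   typedef struct {
       unsigned version;      /* 2 bits: protocol version (2)          */
       unsigned padding;      /* 1 bit:  padding octets follow payload */
       unsigned extension;    /* 1 bit:  header extension present      */
       unsigned csrc_count;   /* 4 bits: number of CSRC identifiers    */
       unsigned marker;       /* 1 bit:  profile-defined marker        */
       unsigned payload_type; /* 7 bits: profile-defined payload type  */
       uint16_t sequence;     /* increments by one per packet          */
       uint32_t timestamp;    /* media sampling instant                */
       uint32_t ssrc;         /* synchronisation source identifier     */
   } rtp_header;

   /* Returns 0 on success, -1 if the buffer is too short. */
   int rtp_parse(const uint8_t *p, size_t len, rtp_header *h)
   {
       if (len < 12)
           return -1;
       h->version      = p[0] >> 6;
       h->padding      = (p[0] >> 5) & 1;
       h->extension    = (p[0] >> 4) & 1;
       h->csrc_count   = p[0] & 0x0f;
       h->marker       = p[1] >> 7;
       h->payload_type = p[1] & 0x7f;
       h->sequence     = (uint16_t)((p[2] << 8) | p[3]);
       h->timestamp    = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16)
                       | ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
       h->ssrc         = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16)
                       | ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
       return 0;
   }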
3. Goals for a professional media RTP profile

The professional media profile should be:

- non-proprietary: in the public domain, not owned by any company; this
  allows for the development of multiple compatible implementations,
  without 'turf wars'.
- based on existing standards, wherever possible: there is no need to
  re-invent existing standards for colourimetry, sampling, time-coding
  and so forth.
- architecture-neutral: not dependent on any hardware or software
  architecture features of current computers; in particular, it should
  be designed for efficiency in both hardware and software
  implementations.
- transport-neutral: independent of the underlying transport
  architecture and protocols, whether ATM, HIPPI, Fibre Channel, SMDS or
  other; whether IP-based or hardware-layer based; whether
  connection-oriented or connectionless.
- scalable: to higher frame rates, bit-depths, resolutions and large
  associations of sources and destinations, without need for re-writing
  of software or re-design of hardware.
- extensible: to accommodate new features, without breaking existing
  implementations.
- compatible with existing professional video and audio formats.
- designed explicitly for professional use: not competing with other
  profiles or standards such as MPEG, JPEG, DAVIC.
- provide support for precise synchronisation: a need greatly overlooked
  in non-professional media.
- provide support for variable picture formats and sampling rates.
- provide support for varispeed and frame skipping.
- provide optional support for forward error correction.
- provide support for infra-black and super-white values: these allow
  for better matting, colour gamut and resampling behaviour in extreme
  conditions.
- provide support for labelling of image streams: this includes
  timecode, ARRI code, Aaton code, Keykode and other formats, as well as
  shot/scene and other information.
- not rely on, nor exclude, the use of data compression.
- limited in ambition: there is no attempt to define a standard for
  consumer media, or for medical imaging, pre-press, astronomy, or
  multispectral satellite imagery.

The professional media profile should not:

- attempt to enforce ones-density: this is a network hardware layer
  function.
- attempt to provide channel coding, such as 8-to-10 bit coding: this is
  a function of the network hardware layer.
- code horizontal or vertical blanking or sync pulses: this is a
  function of the analogue video I/O interface.
- perform packet sequence numbering: this is a function of RTP.
- provide multiplexing: this is a function of RTP.
- attempt to provide framing: this is provided by the network layer.
- restrict colour component values, slew rates, or bandwidth: these are
  restrictions of certain conventional video standards, and may be
  applied as part of post-processing or viewing, if necessary.
- attempt to distribute precise timing information along the media
  stream: this is better done by methods such as NTP or GPS.
- attempt to provide error detection: this is a function of the network
  layer.

4. Recommended migration path

Although the parametric header format allows many possible formats to be
defined, this profile does not recommend the selection and use of
arbitrary formats. The recommended migration path is:

1. Encapsulation of existing CCIR video and AES/EBU audio formats
2. Movement to RGB picture formats only
3. Movement to progressive scan picture formats only
4. Movement to higher bit depths, sample rates, and picture resolutions

In particular, 10-bit video and 20-bit audio are recommended.

5. Design principles

The profile consists of two parts:

- format description
- data encoding

6. Format Description: General

The format of the data is described by a binary header transmitted at
intervals with the data. The format of the binary header follows the
'chunk' format of EA IFF '85, as used in formats such as PNG. This is
easy to parse and generate, whilst allowing both extensibility and
efficiency.

This header information is transmitted once per 'frame', at the head of
an RTP packet. Such packets will have the RTP profile-dependent marker
bit set, to indicate that they contain header information. The header
information is not regarded as taking up any 'time' in the
synchronisation timestamp.

The header is a parametric description of the contents of the stream.
The description format is intended to be orthogonal, even though only a
subset of the possible image formats may ever be used, for compatibility
and efficiency reasons.
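To make the framing concrete: an EA IFF '85-style chunk consists of a
four-character type identifier, a 32-bit big-endian payload length, and
the payload itself, padded to an even byte. This profile has not yet
pinned down its exact chunk layout, so the following C sketch assumes
the classic IFF '85 framing; the even-padding rule and all names here
are assumptions for illustration, not normative.

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   /* Handler invoked once per chunk found in the header buffer. */
   typedef void (*chunk_fn)(const char id[4],
                            const uint8_t *body, uint32_t len);

   /* Walk a buffer of IFF '85-style chunks: 4-character ID, 32-bit
      big-endian length, payload padded to an even byte.  Returns 0 on
      success, -1 on a truncated chunk. */
   int walk_chunks(const uint8_t *buf, size_t len, chunk_fn handle)
   {
       size_t off = 0;
       while (off + 8 <= len) {
           char id[4];
           memcpy(id, buf + off, 4);
           uint32_t body_len = ((uint32_t)buf[off + 4] << 24)
                             | ((uint32_t)buf[off + 5] << 16)
                             | ((uint32_t)buf[off + 6] << 8)
                             |  (uint32_t)buf[off + 7];
           if (body_len > len - off - 8)
               return -1;                        /* truncated chunk */
           handle(id, buf + off + 8, body_len);
           off += 8 + body_len + (body_len & 1); /* skip pad byte   */
       }
       return 0;
   }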
7. Data encoding: General

Wherever possible, the data encoding uses the most 'neutral' possible
design decisions. Network byte ordering (that is, 'big endian', with the
most significant bits/bytes first) is used throughout. Padding of bulk
data to word boundaries is not used by default, as this is a gratuitous
waste of network resources.

8. General header parameters

Profile version: 4 bits, coded, mandatory

      0   undefined/invalid
      1   this specification
   2..15  RESERVED FOR FUTURE USE

Media flag: 2 bits, coded, mandatory

      0   undefined/invalid
      1   audio
      2   video
      3   RESERVED FOR FUTURE USE

Synchronisation flag: 2 bits, coded, mandatory

   This represents the synchronisation of the source's 'media sync
   spindle'.

      0   unsynchronised, free-running or internal sync
      1   synchronised to external hardware sync or RTP sync source
      2   synchronised to UTC
      3   RESERVED FOR FUTURE USE

FEC parity stripe factor: 8 bits, unsigned, mandatory

   This is the number of RTP data packets to be coded in each group of
   packets. The parity stripe packet is not included in this count. The
   value 0 indicates that no parity striping is present.
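The draft specifies field widths (4 + 2 + 2 + 8 bits) but no explicit
octet map for these parameters. The C sketch below assumes the fields
are packed most significant bit first, in the order listed, per the
network-ordering rule of section 7; treat the layout and names as
illustrative rather than normative.

   #include <stdint.h>

   /* Pack the general header parameters of section 8 into two octets,
      MSB first in the order listed (an assumed layout). */
   void pack_general_header(uint8_t out[2],
                            unsigned version,    /* 1 = this spec      */
                            unsigned media,      /* 1 audio, 2 video   */
                            unsigned sync,       /* 0..2, section 8    */
                            unsigned fec_stripe) /* 0 = no striping    */
   {
       out[0] = (uint8_t)(((version & 0x0f) << 4)
                        | ((media   & 0x03) << 2)
                        |  (sync    & 0x03));
       out[1] = (uint8_t)(fec_stripe & 0xff);
   }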
9. Audio-specific parameters

Sample rate: 2 x 32 bits, unsigned, mandatory

   This is the audio sampling frequency, expressed in Hertz, as a ratio
   of two 32-bit unsigned numbers. This represents the nominal
   presentation rate of the sample stream in its natural form. Note that
   this is not necessarily the rate of transport of the sample stream
   over RTP; the stream may be presented faster or slower than the
   natural sample rate, for example during varispeed play. This is a
   nominal frequency: the exact frequency will depend on the
   synchronisation source. The true frequency should, however, be within
   100 ppm of this value. Frequently-used rates will be:

      32000  44100  48000  32000/1.001  44100/1.001  48000/1.001

Play rate: 32 bits, signed, mandatory

   This is the current playout rate of the material, expressed in units
   of 1/65536 of the nominal sample rate. Note that the playout rate may
   be zero, or reverse.

Loudness reference: 8 bits, unsigned, mandatory

   This is a reference for the 'natural' level of the sound. The value
   encoded is the level, in dBA, that a full-intensity 1 kHz
   peak-to-peak sine wave in the current encoding represents. The value
   255 is reserved for unknown/undefined levels.

Channels: 16 bits

   This is the number of sound channels carried. All sound channels
   carried together are regarded as being in sync; that is, the data
   values from a group of sound channels should be emitted from the
   audio outputs of those channels simultaneously.

Bit-depth: 8 bits

   This is the number of bits in the sound samples. Whilst any number
   may be coded in this field, only the values 16, 20 and 24 are
   recommended. Audio samples are always linearly sampled.

Timecode

   This is a timecode descriptor: see the section 'Timecode and labels'
   for more information.

AES-EBU auxiliary information

   This contains pre-emphasis, sample rate, and other auxiliary
   information, compatible with the AES-EBU audio data format.

10. Video-specific parameters

Sample rate: 2 x 32 bits, unsigned, mandatory

   This is expressed in Hertz, as a ratio of two 32-bit numbers. In this
   context, a video 'sample' is a frame. Note that this is not
   necessarily the rate of transport of the sample stream over RTP; the
   stream may be presented faster or slower than the natural sample
   rate, for example during varispeed play. This is a nominal frequency:
   the exact frequency will depend on the synchronisation source. The
   true frequency should, however, be within 100 ppm of this value.
   Typical values will be:

      24  25  30  24/1.001  30/1.001

Play rate: 32 bits, signed, mandatory

   This is the current playout rate of the material, expressed in units
   of 1/65536 of the nominal sample rate. Note that the playout rate may
   be zero, or reverse.

Horizontal samples: 32 bits, unsigned, mandatory

   This is a 32-bit unsigned integer, specifying the number of sampling
   cells across the picture.

Vertical samples: 32 bits, unsigned, mandatory

   This is a 32-bit unsigned integer, specifying the number of lines of
   sampling cells down the picture.

Image aspect ratio: 2 x 32 bits, unsigned, mandatory

   This is the ratio of the width of the picture to its height, when
   displayed in its natural form, specified as the ratio of two 32-bit
   unsigned integers. These integers should be reduced so as to have no
   common factors: 4:3 rather than 768:576. Note that this, combined
   with the vertical and horizontal sample parameters, defines the pixel
   aspect ratio. For compatibility with CCIR 601, a pixel aspect ratio
   of 15:16 may be used. A square pixel is preferred whenever possible,
   unless deliberate anamorphic processing is desired.

Bit-depth: 8 bits

   This is the number of bits in each colour channel's samples. Whilst
   any number may be coded in this field, only the values 8, 10, 12 and
   16 are recommended.

Interlace factor: 4 bits, unsigned, mandatory

   This is the number of 'fields' the picture is divided into. The
   number of lines is defined to be divisible by the number of fields.
   Where a field structure is present, the data values are coded
   field-sequentially, in an order determined by the field ordering
   parameter. Typical values are:

      1   progressive scan: no field structure
      2   two fields

   Note: interlace is deprecated, and is only present to accommodate
   historical picture formats. For this reason, the field is limited to
   four bits, in the earnest hope that more will never be needed, and
   that interlace will die a quiet death over the next few years. The
   value zero is reserved.

Field ordering: 16 bits, unsigned, mandatory

   The value in this field is determined as follows: take the
   permutation of [0..n-1] made by the indices of the first lines of
   each field, in order of transmission. The field ordering parameter is
   the ordinal number of this permutation in the table of all
   permutations of [0..n-1] arranged in 'dictionary' order, where the
   first entry is counted as zero.

   This may seem rather elaborate, but is designed to accommodate any
   conceivable field ordering, for up to 8 fields, within a 16-bit
   identifier. In particular, it gives simple encodings for progressive
   scan and two-field interlace. For example:

   Progressive scan: permutations of [0]; there is only one, so code
   zero in this field.

   Two fields: permutations of [0 1]. There are two:

      [0 1]: even field first, code 0
      [1 0]: odd field first, code 1

   Three fields: permutations of [0 1 2]. There are six:

      [0 1 2]: code 0
      [0 2 1]: code 1
      [1 0 2]: code 2
      [1 2 0]: code 3
      [2 0 1]: code 4
      [2 1 0]: code 5

   Values greater than (n! - 1), where n is the interlace factor, are
   reserved, and should not be coded.
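The ordinal described above is what is sometimes called the
lexicographic rank of a permutation, and can be computed directly. The
following C sketch is one way to do it; the function name is
illustrative.

   #include <stdint.h>

   /* Field-ordering parameter: the ordinal of the transmission-order
      permutation of [0..n-1] in the dictionary-ordered table of all
      such permutations, counting from zero.  n is the interlace
      factor, at most 8, so the result fits in 16 bits (8! - 1 is
      40319). */
   uint16_t field_ordering(const unsigned perm[], unsigned n)
   {
       uint32_t rank = 0;
       for (unsigned i = 0; i < n; i++) {
           /* Count later entries smaller than perm[i] ...          */
           unsigned smaller = 0;
           for (unsigned j = i + 1; j < n; j++)
               if (perm[j] < perm[i])
                   smaller++;
           /* ... and weight the count by (n - 1 - i) factorial.    */
           uint32_t fact = 1;
           for (unsigned k = 2; k <= n - 1 - i; k++)
               fact *= k;
           rank += smaller * fact;
       }
       return (uint16_t)rank;
   }

   /* Examples from the text: progressive scan {0} codes as 0;
      two fields {1,0} (odd field first) codes as 1;
      three fields {2,0,1} codes as 4. */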
Colour component count: 4 bits, unsigned, mandatory

   This describes the number of colour components in the image.
   Typically, this value is 1, 3 or 4.

Colour space descriptor: 4 bits, coded

   This describes the image's colour space, to a first approximation.
   Currently, the colour space descriptors available are:

      0   unknown, undefined or not appropriate
      1   monochrome
      2   CIE XYZ system [for purists only]
      3   EBU RGB phosphors
      4   SMPTE RGB phosphors
      5   colour film positive
      6   colour film negative
      7   colour film interpositive

   Whilst this field is not intended to replace exact colourimetry
   values (for which, see the section 'Extensions for precise
   colourimetry description'), the nearest appropriate value should be
   coded whenever possible.

Colour space encoding law: 4 bits, coded, mandatory

   This describes the colour space encoding law of the image, in the
   simplest terms. The options available are:

      0   unknown, undefined or not appropriate
      1   linear
      2   gamma law other than linear
      3   logarithmic intensity
      4   logarithmic density

   Wherever possible, the nearest approximate value should be coded.

Component coding structure: 8 bits, coded, mandatory

      0   undefined
      1   monochrome
      2   RGB
      3   A
      4   RGBA [?: separate RGB and A signals are more general]
      5   YUV, 4:4:4 [deprecated: RGB is preferred]
      6   YUV, 4:2:2 [that is, conventional D-1 video]
      7   YUV CIF, 4:1:1 [deprecated, but useful for MPEG coding
          purposes]

   Where YUV coding is applied, the coding of CCIR recommendation 601 is
   assumed. Note that this includes the assumption that the transformed
   signals are gamma-corrected, and precludes any other
   component-encoding laws.

Extension for simple colour reference  ///PROVISIONAL///

Colour space examples: in data encoding format, optional

   The following picture values should be coded, as if appearing in
   picture material:

      10%  white, yellow, cyan, green, red, blue; black
      90%  white, yellow, cyan, green, red, blue; black
     100%  white, yellow, cyan, green, red, blue; black

   Note: this is not intended to replace detailed colourimetry
   information. However, an attempt should always be made to encode
   these values for all encodings. By incorporating these values in the
   'picture', we ensure that even after incorrect processing, some
   colour reference information is preserved.

Audio data encoding rules

   Samples are packed in big-endian bit order into a stream of bits.
   This stream is then broken into a stream of bytes in big-endian bit
   order. Samples are packed in the following order:

      most significant:   time, earliest samples first
      least significant:  sample channel, lowest numbered first

Video data encoding rules

   Samples are packed in big-endian bit order into a stream of bits.
   This stream is then broken into a stream of bytes in big-endian bit
   order. Samples are packed in the following order:

      most significant:   time, earliest frames first
                          field: first transmitted field first
                          line: topmost lines first
                          pixel: leftmost pixel first
      least significant:  sample channel, lowest numbered first

Channel ordering for RGB, YUV and XYZ formats

   Red, then Green, then Blue [then optional Alpha].
   X, then Y, then Z.
   Y, then U, then V.

Sub-sampled formats [YUV 4:2:2, YUV 4:1:1 CIF]

   Where 4:2:2 subsampled video is used, the samples are stored as
   follows:

      even samples [0,2,4...]  Y, U, V [co-sited]
      odd samples  [1,3,5...]  Y only, no U or V coded

   Where 4:1:1 CIF subsampled video is used, the samples are coded as
   follows:

      even lines [0,2,4...]
         even samples [0,2,4...]  Y, U, V [co-sited]
         odd samples  [1,3,5...]  Y only, no U or V coded
      odd lines [1,3,5...]
         even samples [0,2,4...]  Y only, no U or V coded
         odd samples  [1,3,5...]  Y only, no U or V coded

Forward error correction: optional

   Forward error correction (FEC) is preferred to selective
   re-transmission. The reasons for this are:

   - Selective re-transmission is difficult in a high-delay transmission
     path, such as a satellite or transatlantic cable link, as it needs
     an extra delay of at least a round-trip time to allow for
     re-transmission. It also requires extra buffering at both the
     transmitter and the receiver.
   - Selective re-transmission does not scale well to large multicast
     conferences.

   On a high-performance network, forward error correction may be
   unnecessary, or may be performed at the lower hardware layers, using
   techniques like interleaved Reed-Solomon codes. We believe that if
   FEC is necessary, it is best implemented in the network hardware
   layer. However, there may be times when added FEC is necessary, when
   operating over a network with degraded performance or no provision
   for hardware error correction.

   The preferred mode is parity striping. For every n packets of image
   or audio data, a parity stripe packet will be sent, consisting of the
   bit-wise XOR of the data of the preceding n packets. Note that this
   requires that all the data packets be of equal length. The group of
   packets counter is reset after a parameter header.

   After reception, any single lost packet can be reconstructed from the
   surrounding group of packets and their parity stripe packet. The loss
   of two packets in a group precludes reconstruction of either, but
   this is at least no worse than having no FEC at all.

   This simple FEC strategy involves an overhead of 1/n extra data to be
   sent, and is not recommended except in the presence of excessive data
   loss. There is also a buffering delay of n packets for any receiver
   that desires to perform parity-stripe reconstruction.

   Encoders are not required to emit parity-striped data, but the
   capability is recommended. All decoders must be able to receive
   parity-striped data, even if they cannot perform the reconstruction
   function. In this case, the parity stripe packets should be
   discarded.
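The following C sketch shows the parity striping scheme described
above; the function names are illustrative, not part of the profile.

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   /* Build the parity stripe for a group of n equal-length data
      packets: the byte-wise XOR of all of them. */
   void make_parity_stripe(uint8_t *stripe, const uint8_t *const pkts[],
                           unsigned n, size_t pkt_len)
   {
       memset(stripe, 0, pkt_len);
       for (unsigned i = 0; i < n; i++)
           for (size_t b = 0; b < pkt_len; b++)
               stripe[b] ^= pkts[i][b];
   }

   /* Recover a single missing packet of the group: XOR the stripe
      with the n-1 packets that did arrive.  The result is written to
      'lost'.  (If two or more packets are missing, no recovery is
      possible, per the text above.) */
   void recover_lost_packet(uint8_t *lost, const uint8_t *stripe,
                            const uint8_t *const received[],
                            unsigned n_received, size_t pkt_len)
   {
       memcpy(lost, stripe, pkt_len);
       for (unsigned i = 0; i < n_received; i++)
           for (size_t b = 0; b < pkt_len; b++)
               lost[b] ^= received[i][b];
   }

Note that the stripe is computed afresh for each group of n packets,
matching the reset of the group-of-packets counter at each parameter
header.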
Lossless compression of sampled values

   ///TO BE WRITTEN///

Timecode and labels

   Support for missing, mis-synchronised and degraded timecode
   LTC and VITC
   Off-by-one encoding
   Leap seconds to match UTC: once per year
   Generic label format
   Extended timecode including full NTP timestamp???

   ///TO BE WRITTEN///

Extensions for precise colourimetry description

   ///TO BE WRITTEN///

Extensions for precise geometry description

   ///TO BE WRITTEN///

Extensions for encryption and authentication

   ///TO BE WRITTEN///

Transport over an arbitrary byte-stream [TCP, HTTP etc.]

   ///TO BE WRITTEN///

Storage in files

   ///TO BE WRITTEN///

Reference implementation

   ///TO BE WRITTEN///

Profile examples

   ///TO BE WRITTEN///

Go-faster stripes

   MTU, page alignment and efficiency
   Hardware acceleration and alignment issues

   ///TO BE WRITTEN///

Specification of 'artistic frame' vs. sampled frame vs. reference frame

   ///TO BE WRITTEN///

11. Address of author

   Neil Harris
   Sohonet Limited
   11 Bear Street
   London WC2H 7AS
   England

   neil@sohonet.co.uk