Internet Engineering Task Force       Audio-Video Transport Working Group
Internet Draft                                                Neil Harris
draft-harris-rtp-pro-av-00.txt                            Sohonet Limited
                                                            June 13, 1996
                                                          Expires: Jan 97

          A Professional Profile for Audio and Video over RTP?

STATUS OF THIS MEMO

This document is an Internet-Draft. Internet-Drafts are working documents
of the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time. It
is inappropriate to use Internet-Drafts as reference material or to cite
them other than as ``work in progress''.

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

ABSTRACT

This note discusses a proposed embedding for uncompressed video and audio
within an RTP stream. Whilst this is not a finished document, and
although it was not written with wider distribution in mind, I think it
may be useful as a 'straw man' proposal to provoke further discussion of
the use of RTP for professional media.

1. Rationale

At the time of writing [early 1996] professional audio and video
production is undergoing a fundamental change, from dedicated hardware
processing and dedicated digital transports, to computer-based processing
and general-purpose network transports.

Digital data compression will undoubtedly be the dominant format in
domestic and broadcast media delivery, due to its advantages in
performance and bandwidth, combined with high perceived quality. However,
for the origination of material, data compression is not an ideal
technology. Firstly, compression of material generates artifacts which
can accumulate over multiple generations of processing. Secondly,
compressed media are not in a suitable form for processing: they
typically have to be decompressed, either in hardware or software,
processed, and re-compressed again.

Data compression is currently popular because it has allowed late-1980's
technology to process pictures at near-professional quality. However, for
these professional applications, network technologies are now available
which can meet the bandwidth needs of current professional production,
and beyond. In the next two years, these technologies will move from
being relatively expensive to being relatively inexpensive. With the
widespread availability of wideband I/O busses, and 500 MHz processors
and beyond, processing uncompressed images on standard workstations will
become commonplace. In fact, working with compressed images may well
represent an overall increase in complexity and use of system resources.

At the same time, research into new and better compression methods and
improved picture and sound transducers continues. Any attempt to create a
new standard around an existing compression method will fix the bandwidth
and computation compromises of present-day technology into professional
media for the life of that standard.
This will have detrimental effects on programmes made using these
standards: in five, ten or twenty years, the effects of their compression
artifacts and compromises will stand out as clearly as 1970's
colour-camera artifacts and composite-video standards do today.

In contrast, material originated on film continues to preserve its value,
as it was based on the best technology of its day, and continues to
evolve to this day. Note, however, that the sprocket holes and film
gauges are unchanged from the 1930's. This has not stopped film-makers
from using a large number of frame formats, aspect ratios and types of
stock and processing over that period to create their artistic effects.

Adopting a poor standard in order to replace it later will have adverse
effects on the growth of the professional media industry: it would seem
reasonable that a standard designed in 1996 should be targeted for the
inexpensive hardware of 1997-8, rather than the special-purpose hardware
of 1996. Such a standard can be adopted today for high-end systems, with
either high-performance computing resources or hardware acceleration,
with low-end systems following early next year.

The adoption of a lossless standard for interworking should not unfairly
disadvantage compression-based systems. These systems must already be
able to compress and decompress material in real time in order to work at
all: exchanging video and audio over the network in an uncompressed form
should not compromise their internal workings in their proprietary
compressed internal formats and protocols.

Good standards already exist for professional media, in the form of CCIR
601 video and AES/EBU audio, together with EBU/SMPTE timecode. These
formats are vendor-neutral, and have already allowed for the explosive
growth of the video and audio industries. Any new standard should provide
for all the capabilities of these existing standards, whilst providing
for future expansion to full uniform sampling, RGB colour space, and
higher tonal depths and resolutions. The mapping to the underlying
network layer should be as simple as possible, without loss of generality
or connectivity.

Such a standard already exists. The IETF has established a draft standard
for transport of real-time data over networks: RTP.

2. RTP: The Internet Real-Time Protocol

RTP provides a means of transmitting time-stamped, sequence-numbered
packets with information relating to synchronisation source and
contributing entities. Whilst RTP is currently used for low-bandwidth
video- and audio-conferencing over networks, it can naturally be extended
to high-bandwidth media, such as professional audio and video. To do
this, we need to define a profile that describes how digital video or
audio fits into the stream of packets defined by RTP.

Note that RTP does not provide any means for error correction or retries.
However, it does allow for error detection based on packets being dropped
by the lower-level transport, which is typically equipped with CRC error
detection. We assume that the underlying transport is already fast and
reliable, and that any further error detection or correction is only
there to make an already reliable system even more reliable.
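For reference, the fixed header that RTP places at the start of every
packet carries the version, marker bit, payload type, sequence number,
timestamp and synchronisation source fields referred to throughout this
note. The following minimal C sketch parses that fixed header as laid
out in the RTP specification (RFC 1889); the struct and function names
are illustrative, and are not part of this profile.

   #include <stdint.h>
   #include <stddef.h>

   /* The 12-byte fixed RTP header of RFC 1889, read from network
      (big-endian) byte order.  Field names are illustrative only. */
   typedef struct {
       unsigned version;      /* 2 bits: protocol version (2)          */
       unsigned padding;      /* 1 bit:  padding octets follow payload */
       unsigned extension;    /* 1 bit:  header extension present      */
       unsigned csrc_count;   /* 4 bits: number of CSRC identifiers    */
       unsigned marker;       /* 1 bit:  profile-defined marker        */
       unsigned payload_type; /* 7 bits: profile-defined payload type  */
       uint16_t sequence;     /* increments by one per packet          */
       uint32_t timestamp;    /* media sampling instant                */
       uint32_t ssrc;         /* synchronisation source identifier     */
   } rtp_header;

   /* Returns 0 on success, -1 if the buffer is too short. */
   int rtp_parse(const uint8_t *p, size_t len, rtp_header *h)
   {
       if (len < 12)
           return -1;
       h->version      = p[0] >> 6;
       h->padding      = (p[0] >> 5) & 1;
       h->extension    = (p[0] >> 4) & 1;
       h->csrc_count   = p[0] & 0x0f;
       h->marker       = p[1] >> 7;
       h->payload_type = p[1] & 0x7f;
       h->sequence     = (uint16_t)((p[2] << 8) | p[3]);
       h->timestamp    = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16)
                       | ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
       h->ssrc         = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16)
                       | ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
       return 0;
   }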
3. Goals for a professional media RTP profile

The professional media profile should be:

- non-proprietary: in the public domain, not owned by any company; this
  allows for the development of multiple compatible implementations,
  without 'turf wars'.
- based on existing standards, wherever possible: there is no need to
  re-invent existing standards for colourimetry, sampling, time-coding
  and so forth.
- architecture-neutral: not dependent on any hardware or software
  architecture features of current computers; in particular, it should
  be designed for efficiency in both hardware and software
  implementations.
- transport-neutral: independent of the underlying transport
  architecture and protocols, whether ATM, HIPPI, Fibre Channel, SMDS or
  other; whether IP-based or hardware-layer based; whether
  connection-oriented or connectionless.
- scalable: to higher frame rates, bit-depths, resolutions and large
  associations of sources and destinations, without need for re-writing
  of software or re-design of hardware.
- extensible: to accommodate new features, without breaking existing
  implementations.
- compatible with existing professional video and audio formats.
- designed explicitly for professional use: not competing with other
  profiles or standards such as MPEG, JPEG, DAVIC.
- provide support for precise synchronisation: a need greatly overlooked
  in non-professional media.
- provide support for variable picture formats and sampling rates.
- provide support for varispeed and frame skipping.
- provide optional support for forward error correction.
- provide support for infra-black and super-white values: these allow
  for better matting, colour gamut and resampling behaviour in extreme
  conditions.
- provide support for labelling of image streams: this includes
  timecode, ARRI code, Aaton code, Keykode and other formats, as well as
  shot/scene and other information.
- not rely on, nor exclude, the use of data compression.
- limited in ambition: there is no attempt to define a standard for
  consumer media, or for medical imaging, pre-press, astronomy, or
  multispectral satellite imagery.

The professional media profile should not:

- attempt to enforce ones-density: this is a network hardware layer
  function.
- attempt to provide channel coding, such as 8-to-10 bit coding: this is
  a function of the network hardware layer.
- code horizontal or vertical blanking or sync pulses: this is a
  function of the analogue video I/O interface.
- perform packet sequence numbering: this is a function of RTP.
- provide multiplexing: this is a function of RTP.
- attempt to provide framing: this is provided by the network layer.
- restrict colour component values, slew rates, or bandwidth: these are
  restrictions of certain conventional video standards, and may be
  applied as part of post-processing or viewing, if necessary.
- attempt to distribute precise timing information along the media
  stream: this is better done by methods such as NTP or GPS.
- attempt to provide error detection: this is a function of the network
  layer.

4. Recommended migration path

Although the parametric header format allows many possible formats to be
defined, this profile does not recommend the selection and use of
arbitrary formats. The recommended migration path is:

1. Encapsulation of existing CCIR video and AES/EBU audio formats
2. Movement to RGB picture formats only
3. Movement to progressive scan picture formats only
4. Movement to higher bit depths, sample rates, and picture resolutions

In particular, 10-bit video and 20-bit audio are recommended.

5. Design principles

The profile consists of two parts:

- format description
- data encoding

6. Format Description: General

The format of the data is described by a binary header transmitted at
intervals with the data. The format of the binary header follows the
'chunk' format of EA IFF '85, as used in formats such as PNG. This is
easy to parse and generate, whilst allowing both extensibility and
efficiency.

This header information is transmitted once per 'frame', at the head of
an RTP packet. Such packets will have the RTP profile-dependent marker
bit set, to indicate that they contain header information. The header
information is not regarded as taking up any 'time' in the
synchronisation timestamp.

The header is a parametric description of the contents of the stream.
The description format is intended to be orthogonal, even though only a
subset of the possible image formats may ever be used, for compatibility
and efficiency reasons.
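To make the framing concrete: an EA IFF '85-style chunk consists of a
four-character type identifier, a 32-bit big-endian payload length, and
the payload itself, padded to an even byte. This profile has not yet
pinned down its exact chunk layout, so the following C sketch assumes
the classic IFF '85 framing; the even-padding rule and all names here
are assumptions for illustration, not normative.

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   /* Handler invoked once per chunk found in the header buffer. */
   typedef void (*chunk_fn)(const char id[4],
                            const uint8_t *body, uint32_t len);

   /* Walk a buffer of IFF '85-style chunks: 4-character ID, 32-bit
      big-endian length, payload padded to an even byte.  Returns 0 on
      success, -1 on a truncated chunk. */
   int walk_chunks(const uint8_t *buf, size_t len, chunk_fn handle)
   {
       size_t off = 0;
       while (off + 8 <= len) {
           char id[4];
           memcpy(id, buf + off, 4);
           uint32_t body_len = ((uint32_t)buf[off + 4] << 24)
                             | ((uint32_t)buf[off + 5] << 16)
                             | ((uint32_t)buf[off + 6] << 8)
                             |  (uint32_t)buf[off + 7];
           if (body_len > len - off - 8)
               return -1;                        /* truncated chunk */
           handle(id, buf + off + 8, body_len);
           off += 8 + body_len + (body_len & 1); /* skip pad byte   */
       }
       return 0;
   }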
7. Data encoding: General

Wherever possible, the data encoding uses the most 'neutral' possible
design decisions. Network byte ordering (that is, 'big endian', with the
most significant bits/bytes first) is used throughout. Padding of bulk
data to word boundaries is not used by default, as this is a gratuitous
waste of network resources.

8. General header parameters

Profile version: 4 bits, coded, mandatory

      0   undefined/invalid
      1   this specification
   2..15  RESERVED FOR FUTURE USE

Media flag: 2 bits, coded, mandatory

      0   undefined/invalid
      1   audio
      2   video
      3   RESERVED FOR FUTURE USE

Synchronisation flag: 2 bits, coded, mandatory

   This represents the synchronisation of the source's 'media sync
   spindle'.

      0   unsynchronised, free-running or internal sync
      1   synchronised to external hardware sync or RTP sync source
      2   synchronised to UTC
      3   RESERVED FOR FUTURE USE

FEC parity stripe factor: 8 bits, unsigned, mandatory

   This is the number of RTP data packets to be coded in each group of
   packets. The parity stripe packet is not included in this count. The
   value 0 indicates that no parity striping is present.
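The draft specifies field widths (4 + 2 + 2 + 8 bits) but no explicit
octet map for these parameters. The C sketch below assumes the fields
are packed most significant bit first, in the order listed, per the
network-ordering rule of section 7; treat the layout and names as
illustrative rather than normative.

   #include <stdint.h>

   /* Pack the general header parameters of section 8 into two octets,
      MSB first in the order listed (an assumed layout). */
   void pack_general_header(uint8_t out[2],
                            unsigned version,    /* 1 = this spec      */
                            unsigned media,      /* 1 audio, 2 video   */
                            unsigned sync,       /* 0..2, section 8    */
                            unsigned fec_stripe) /* 0 = no striping    */
   {
       out[0] = (uint8_t)(((version & 0x0f) << 4)
                        | ((media   & 0x03) << 2)
                        |  (sync    & 0x03));
       out[1] = (uint8_t)(fec_stripe & 0xff);
   }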
9. Audio-specific parameters

Sample rate: 2 x 32 bits, unsigned, mandatory

   This is the audio sampling frequency, expressed in Hertz, as a ratio
   of two 32-bit unsigned numbers. This represents the nominal
   presentation rate of the sample stream in its natural form. Note that
   this is not necessarily the rate of transport of the sample stream
   over RTP; the stream may be presented faster or slower than the
   natural sample rate, for example during varispeed play. This is a
   nominal frequency: the exact frequency will depend on the
   synchronisation source. The true frequency should, however, be within
   100 ppm of this value. Frequently-used rates will be:

      32000  44100  48000  32000/1.001  44100/1.001  48000/1.001

Play rate: 32 bits, signed, mandatory

   This is the current playout rate of the material, expressed in units
   of 1/65536 of the nominal sample rate. Note that the playout rate may
   be zero, or reverse.

Loudness reference: 8 bits, unsigned, mandatory

   This is a reference for the 'natural' level of the sound. The value
   encoded is the level, in dBA, that a full-intensity 1 kHz
   peak-to-peak sine wave in the current encoding represents. The value
   255 is reserved for unknown/undefined levels.

Channels: 16 bits

   This is the number of sound channels carried. All sound channels
   carried together are regarded as being in sync; that is, the data
   values from a group of sound channels should be emitted from the
   audio outputs of those channels simultaneously.

Bit-depth: 8 bits

   This is the number of bits in the sound samples. Whilst any number
   may be coded in this field, only the values 16, 20 and 24 are
   recommended. Audio samples are always linearly sampled.

Timecode

   This is a timecode descriptor: see the section 'Timecode and labels'
   for more information.

AES-EBU auxiliary information

   This contains pre-emphasis, sample rate, and other auxiliary
   information, compatible with the AES-EBU audio data format.

10. Video-specific parameters

Sample rate: 2 x 32 bits, unsigned, mandatory

   This is expressed in Hertz, as a ratio of two 32-bit numbers. In this
   context, a video 'sample' is a frame. Note that this is not
   necessarily the rate of transport of the sample stream over RTP; the
   stream may be presented faster or slower than the natural sample
   rate, for example during varispeed play. This is a nominal frequency:
   the exact frequency will depend on the synchronisation source. The
   true frequency should, however, be within 100 ppm of this value.
   Typical values will be:

      24  25  30  24/1.001  30/1.001

Play rate: 32 bits, signed, mandatory

   This is the current playout rate of the material, expressed in units
   of 1/65536 of the nominal sample rate. Note that the playout rate may
   be zero, or reverse.

Horizontal samples: 32 bits, unsigned, mandatory

   This is a 32-bit unsigned integer, specifying the number of sampling
   cells across the picture.

Vertical samples: 32 bits, unsigned, mandatory

   This is a 32-bit unsigned integer, specifying the number of lines of
   sampling cells down the picture.

Image aspect ratio: 2 x 32 bits, unsigned, mandatory

   This is the ratio of the width of the picture to its height, when
   displayed in its natural form, specified as the ratio of two 32-bit
   unsigned integers. These integers should be reduced so as to have no
   common factors: 4:3 rather than 768:576. Note that this, combined
   with the vertical and horizontal sample parameters, defines the pixel
   aspect ratio. For compatibility with CCIR 601, a pixel aspect ratio
   of 15:16 may be used. A square pixel is preferred whenever possible,
   unless deliberate anamorphic processing is desired.

Bit-depth: 8 bits

   This is the number of bits in each colour channel's samples. Whilst
   any number may be coded in this field, only the values 8, 10, 12 and
   16 are recommended.

Interlace factor: 4 bits, unsigned, mandatory

   This is the number of 'fields' the picture is divided into. The
   number of lines is defined to be divisible by the number of fields.
   Where a field structure is present, the data values are coded
   field-sequentially, in an order determined by the field ordering
   parameter. Typical values are:

      1   progressive scan: no field structure
      2   two fields

   Note: interlace is deprecated, and is only present to accommodate
   historical picture formats. For this reason, the field is limited to
   four bits, in the earnest hope that more will never be needed, and
   that interlace will die a quiet death over the next few years. The
   value zero is reserved.

Field ordering: 16 bits, unsigned, mandatory

   The value in this field is determined as follows: take the
   permutation of [0..n-1] made by the indices of the first lines of
   each field, in order of transmission. The field ordering parameter is
   the ordinal number of this permutation in the table of all
   permutations of [0..n-1] arranged in 'dictionary' order, where the
   first entry is counted as zero.

   This may seem rather elaborate, but is designed to accommodate any
   conceivable field ordering, for up to 8 fields, within a 16-bit
   identifier. In particular, it gives simple encodings for progressive
   scan and two-field interlace. For example:

   Progressive scan: permutations of [0]; there is only one, so code
   zero in this field.

   Two fields: permutations of [0 1]. There are two:

      [0 1]: even field first, code 0
      [1 0]: odd field first, code 1

   Three fields: permutations of [0 1 2]. There are six:

      [0 1 2]: code 0
      [0 2 1]: code 1
      [1 0 2]: code 2
      [1 2 0]: code 3
      [2 0 1]: code 4
      [2 1 0]: code 5

   Values greater than (n! - 1), where n is the interlace factor, are
   reserved, and should not be coded.
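The ordinal described above is what is sometimes called the
lexicographic rank of a permutation, and can be computed directly. The
following C sketch is one way to do it; the function name is
illustrative.

   #include <stdint.h>

   /* Field-ordering parameter: the ordinal of the transmission-order
      permutation of [0..n-1] in the dictionary-ordered table of all
      such permutations, counting from zero.  n is the interlace
      factor, at most 8, so the result fits in 16 bits (8! - 1 is
      40319). */
   uint16_t field_ordering(const unsigned perm[], unsigned n)
   {
       uint32_t rank = 0;
       for (unsigned i = 0; i < n; i++) {
           /* Count later entries smaller than perm[i] ...          */
           unsigned smaller = 0;
           for (unsigned j = i + 1; j < n; j++)
               if (perm[j] < perm[i])
                   smaller++;
           /* ... and weight the count by (n - 1 - i) factorial.    */
           uint32_t fact = 1;
           for (unsigned k = 2; k <= n - 1 - i; k++)
               fact *= k;
           rank += smaller * fact;
       }
       return (uint16_t)rank;
   }

   /* Examples from the text: progressive scan {0} codes as 0;
      two fields {1,0} (odd field first) codes as 1;
      three fields {2,0,1} codes as 4. */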
Colour component count: 4 bits, unsigned, mandatory

   This describes the number of colour components in the image.
   Typically, this value is 1, 3 or 4.

Colour space descriptor: 4 bits, coded

   This describes the image's colour space, to a first approximation.
   Currently, the colour space descriptors available are:

      0   unknown, undefined or not appropriate
      1   monochrome
      2   CIE XYZ system [for purists only]
      3   EBU RGB phosphors
      4   SMPTE RGB phosphors
      5   colour film positive
      6   colour film negative
      7   colour film interpositive

   Whilst this field is not intended to replace exact colourimetry
   values (for which, see the section 'Extensions for precise
   colourimetry description'), the nearest appropriate value should be
   coded whenever possible.

Colour space encoding law: 4 bits, coded, mandatory

   This describes the colour space encoding law of the image, in the
   simplest terms. The options available are:

      0   unknown, undefined or not appropriate
      1   linear
      2   gamma law other than linear
      3   logarithmic intensity
      4   logarithmic density

   Wherever possible, the nearest approximate value should be coded.

Component coding structure: 8 bits, coded, mandatory

      0   undefined
      1   monochrome
      2   RGB
      3   A
      4   RGBA [?: separate RGB and A signals are more general]
      5   YUV, 4:4:4 [deprecated: RGB is preferred]
      6   YUV, 4:2:2 [that is, conventional D-1 video]
      7   YUV CIF, 4:1:1 [deprecated, but useful for MPEG coding
          purposes]

   Where YUV coding is applied, the coding of CCIR recommendation 601 is
   assumed. Note that this includes the assumption that the transformed
   signals are gamma-corrected, and precludes any other
   component-encoding laws.

Extension for simple colour reference  ///PROVISIONAL///

Colour space examples: in data encoding format, optional

   The following picture values should be coded, as if appearing in
   picture material:

      10%  white, yellow, cyan, green, red, blue; black
      90%  white, yellow, cyan, green, red, blue; black
     100%  white, yellow, cyan, green, red, blue; black

   Note: this is not intended to replace detailed colourimetry
   information. However, an attempt should always be made to encode
   these values for all encodings. By incorporating these values in the
   'picture', we ensure that even after incorrect processing, some
   colour reference information is preserved.

Audio data encoding rules

   Samples are packed in big-endian bit order into a stream of bits.
   This stream is then broken into a stream of bytes in big-endian bit
   order. Samples are packed in the following order:

      most significant:   time, earliest samples first
      least significant:  sample channel, lowest numbered first

Video data encoding rules

   Samples are packed in big-endian bit order into a stream of bits.
   This stream is then broken into a stream of bytes in big-endian bit
   order. Samples are packed in the following order:

      most significant:   time, earliest frames first
                          field: first transmitted field first
                          line: topmost lines first
                          pixel: leftmost pixel first
      least significant:  sample channel, lowest numbered first

Channel ordering for RGB, YUV and XYZ formats

   Red, then Green, then Blue [then optional Alpha].
   X, then Y, then Z.
   Y, then U, then V.

Sub-sampled formats [YUV 4:2:2, YUV 4:1:1 CIF]

   Where 4:2:2 subsampled video is used, the samples are stored as
   follows:

      even samples [0,2,4...]  Y, U, V [co-sited]
      odd samples  [1,3,5...]  Y only, no U or V coded

   Where 4:1:1 CIF subsampled video is used, the samples are coded as
   follows:

      even lines [0,2,4...]
         even samples [0,2,4...]  Y, U, V [co-sited]
         odd samples  [1,3,5...]  Y only, no U or V coded
      odd lines [1,3,5...]
         even samples [0,2,4...]  Y only, no U or V coded
         odd samples  [1,3,5...]  Y only, no U or V coded

Forward error correction: optional

   Forward error correction (FEC) is preferred to selective
   re-transmission. The reasons for this are:

   - Selective re-transmission is difficult in a high-delay transmission
     path, such as a satellite or transatlantic cable link, as it needs
     an extra delay of at least a round-trip time to allow for
     re-transmission. It also requires extra buffering at both the
     transmitter and the receiver.
   - Selective re-transmission does not scale well to large multicast
     conferences.

   On a high-performance network, forward error correction may be
   unnecessary, or may be performed at the lower hardware layers, using
   techniques like interleaved Reed-Solomon codes. We believe that if
   FEC is necessary, it is best implemented in the network hardware
   layer. However, there may be times when added FEC is necessary, when
   operating over a network with degraded performance or no provision
   for hardware error correction.

   The preferred mode is parity striping. For every n packets of image
   or audio data, a parity stripe packet will be sent, consisting of the
   bit-wise XOR of the data of the preceding n packets. Note that this
   requires that all the data packets be of equal length. The group of
   packets counter is reset after a parameter header.

   After reception, any single lost packet can be reconstructed from the
   surrounding group of packets and their parity stripe packet. The loss
   of two packets in a group precludes reconstruction of either, but
   this is at least no worse than having no FEC at all.

   This simple FEC strategy involves an overhead of 1/n extra data to be
   sent, and is not recommended except in the presence of excessive data
   loss. There is also a buffering delay of n packets for any receiver
   that desires to perform parity-stripe reconstruction.

   Encoders are not required to emit parity-striped data, but the
   capability is recommended. All decoders must be able to receive
   parity-striped data, even if they cannot perform the reconstruction
   function. In this case, the parity stripe packets should be
   discarded.
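The following C sketch shows the parity striping scheme described
above; the function names are illustrative, not part of the profile.

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   /* Build the parity stripe for a group of n equal-length data
      packets: the byte-wise XOR of all of them. */
   void make_parity_stripe(uint8_t *stripe, const uint8_t *const pkts[],
                           unsigned n, size_t pkt_len)
   {
       memset(stripe, 0, pkt_len);
       for (unsigned i = 0; i < n; i++)
           for (size_t b = 0; b < pkt_len; b++)
               stripe[b] ^= pkts[i][b];
   }

   /* Recover a single missing packet of the group: XOR the stripe
      with the n-1 packets that did arrive.  The result is written to
      'lost'.  (If two or more packets are missing, no recovery is
      possible, per the text above.) */
   void recover_lost_packet(uint8_t *lost, const uint8_t *stripe,
                            const uint8_t *const received[],
                            unsigned n_received, size_t pkt_len)
   {
       memcpy(lost, stripe, pkt_len);
       for (unsigned i = 0; i < n_received; i++)
           for (size_t b = 0; b < pkt_len; b++)
               lost[b] ^= received[i][b];
   }

Note that the stripe is computed afresh for each group of n packets,
matching the reset of the group-of-packets counter at each parameter
header.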
Lossless compression of sampled values

   ///TO BE WRITTEN///

Timecode and labels

   Support for missing, mis-synchronised and degraded timecode
   LTC and VITC
   Off-by-one encoding
   Leap seconds to match UTC: once per year
   Generic label format
   Extended timecode including full NTP timestamp???

   ///TO BE WRITTEN///

Extensions for precise colourimetry description

   ///TO BE WRITTEN///

Extensions for precise geometry description

   ///TO BE WRITTEN///

Extensions for encryption and authentication

   ///TO BE WRITTEN///

Transport over an arbitrary byte-stream [TCP, HTTP etc.]

   ///TO BE WRITTEN///

Storage in files

   ///TO BE WRITTEN///

Reference implementation

   ///TO BE WRITTEN///

Profile examples

   ///TO BE WRITTEN///

Go-faster stripes

   MTU, page alignment and efficiency
   Hardware acceleration and alignment issues

   ///TO BE WRITTEN///

Specification of 'artistic frame' vs. sampled frame vs. reference frame

   ///TO BE WRITTEN///

11. Address of author

   Neil Harris
   Sohonet Limited
   11 Bear Street
   London WC2H 7AS
   England

   neil@sohonet.co.uk