Internet Engineering Task Force                               AVT WG 
   Internet-Draft                          Ostermann/Rurainsky/Civanlar 
   draft-ietf-avt-rtp-pfap-00.txt                  AT&T Labs - Research 
   Expires: April 2002                                     October 2001 
    
    
     RTP Payload Format for Phoneme/Facial Animation Parameter (PFAP) 
                                  Streams  
    
    
1. Status of this Memo 
    
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026. 
    
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts. 
    
   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsolete by other documents at any 
   time. It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 
    
   The list of current Internet-Drafts can be accessed at 
        http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
        http://www.ietf.org/shadow.html. 
    
    
2. Abstract 
    
   This document describes a Real-Time Transport Protocol (RTP) payload 
   format for transporting phoneme and facial animation parameter (PFAP) 
   streams over the Internet according to the TtsFAPInterface that is 
   defined as an internal interface of an MPEG-4 client in ISO/IEC 
   14496-3 (MPEG-4 Audio, Subpart 6: Text-to-Speech Interface, 
   TtsFAPInterface) [2]. A recovery strategy for loss-tolerant 
   transmission of such streams is described.  
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 1 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
   Table of Contents 
    
   1. Status of this Memo............................................1 
   2. Abstract.......................................................1 
   3. Introduction...................................................3 
   4. Requirements language..........................................4 
   5. The MPEG-4 class TtsFAPInterface...............................4 
   6. Payload Format.................................................6 
   6.1.  Packet descriptor...........................................7 
   6.2.  Phoneme descriptor..........................................8 
   6.3.  FAP descriptor..............................................9 
   6.4.  Recovery information, type 1...............................10 
   7. RTP header fields usage:......................................10 
   8. Recovery Strategy.............................................11 
   9. Security Considerations.......................................11 
   10. References....................................................13 
   11. Author's Addresses............................................13 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 2 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
    
3. Introduction 
    
   Animated talking heads based on MPEG-4 [1] may be implemented on a 
   client that renders the head and synthesizes the speech using a Text-
   to-Speech (TTS) application on the client. The MPEG-4 standard 
   defines only the input interface and two output interfaces for a 
   compliant TTS application. The output interfaces are supposed to be 
   internal to the MPEG-4 client and, thus, no transport protocol is 
   defined related to transmission of the output data. However, advanced 
   TTS servers may need to be implemented on network-based machines and 
   shared by many users. In order to animate talking heads on a client 
   using a network-based TTS server it will be necessary to stream the 
   outputs of the TTS server to the client. 
    
   The input to an MPEG-4 compliant TTS server is the ŸMPEG-4 audio 
   text-to-speech payload÷ [2] defined for transmitting text to a TTS 
   server. The TTS server synthesizes speech as an audio signal from the 
   text. The text may contain bookmarks that enable the control of the 
   talking head with facial animation parameters (FAP) synchronized with 
   the speech. FAPs may define facial expressions like joy and disgust, 
   head orientation and other deformations of flexible parts of the 
   head. Bookmarks do not influence the synthesized speech. The ŸMPEG-4 
   audio text-to-speech payload÷ may also transport optional TTS control 
   information like Gender, Age, and Speech_Rate. The ŸMPEG-4 audio 
   text-to-speech payload÷ may be transported using the MPEG-4 payload 
   format as specified in [3]. 
    
   One of the outputs of the TTS server is the audio stream. This audio 
   stream with the related timing information is handed to the 
   compositor of the MPEG-4 client. The compositor enables synchronized 
   playback of MPEG-4 supported media. In a network based TTS server, 
   the compositor will be located at the client side and the audio 
   stream produced by the TTS server needs to be transmitted to the 
   client. Several RTP payload formats for audio streams already exist 
   and may be used in this context. 
    
   The other output of the TTS server is the TTS markup information. 
   MPEG-4 defines the class TtsFAPInterface that holds the TTS markup 
   information [2]. This class is used to hand the TTS markup 
   information from the TTS server to the face renderer within the 
   compositor of the MPEG-4 client. The TTS markup information enables 
   an MPEG-4 client to create the animation of the talking head such 
   that the head produces visual speech (mainly lip motion) synchronized 
   with the audio. The TTS markup information contains phonemes, 
   bookmarks, and related timing information.  
    
   A phoneme is the basic spoken unit in a language. Pronouncing a 
   phoneme involves coordinating movements of the lungs, vocal cavities, 
   larynx, lips, tongue, and teeth. The TTS server translates the text 
   to be synthesized into phonemes. Furthermore, the TTS server computes 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 3 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
   the start time and duration of each phoneme in the synthesized 
   speech. 
    
   A bookmark is the exact copy of the bookmark in the text sent to the 
   TTS server. MPEG-4 specifies that the start time of a FAP in a 
   bookmark is the start time of the first phoneme of the first word 
   following the bookmark of the current sentence. If there is no word 
   after the bookmark in the current sentence, the start time of the FAP 
   is the same as the start time of the last phoneme of the previous 
   word. Hence, the start time of a FAP always coincides with a phoneme. 
   MPEG-4 allows up to 40 consecutive bookmarks that can be used to 
   render complicated expressions.  
    
   In order to enable networked TTS servers to be used with MPEG-4, a 
   novel payload format for TTS markup information needs to be defined. 
    
   In this document we define an RTP payload format for transporting 
   Phoneme/FAP (PFAP) streams over the Internet using RTP. The payload 
   format is based on the TtsFAPInterface defined in Subpart 6 of the 
   ISO/IEC International Standard 14496-3 [2] and outlined in Section 5 
   of this document. The payload format includes packet loss recovery 
   information. 
    
    
4. Requirements language 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC-2119 [5]. 
    
    
5. The MPEG-4 class TtsFAPInterface 
    
   In this section, we describe the class TtsFAPInterface, its 
   parameters and its usage since it is the basic structure carried by 
   the new payload format proposed in this document. The class 
   TtsFAPInterface is used to hand the TTS markup information from the 
   TTS server to the face renderer within the compositor of the MPEG-4 
   client. This class holds one phoneme and related information, namely 
   PhonemeSymbol, PhonemeDuration, f0Average, Stress, WordBegin, 
   Bookmark, and Starttime. 
    
   PhonemeSymbol: 
        This field identifies a phoneme using an 8 bit unsigned integer 
        (PhonemeSymbol). A language usually uses around 50 phonemes. 
        Phonemes may be specified by Unicode. Since MPEG-4 uses the 
        class TtsFAPInterface only internally in a client, it does not 
        specify the mapping of a phoneme specified in Unicode to this 8 
        bit PhonemeSymbol. 
    
   PhonemeDuration: 
        This field identifies the duration of the PhonemeSymbol in units 
        of milliseconds using a 12 bit unsigned integer. 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 4 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
    
   f0Average: 
        This field defines the frequency of the synthesized audio signal 
        for this phoneme in units of 2 Hz using an 8 bit unsigned 
        integer. 
    
   Stress: 
        Stress indicates a stressed phoneme using 1 bit. 
    
   Bookmark: 
        This field is a string that contains one or more bookmarks that 
        are associated with the current PhonemeSymbol. A definition of 
        the bookmark structure is given in [1], Annex C. A bookmark 
        starts with Ÿ<FAP÷ and ends with Ÿ>÷. Between the start and end 
        strings of a bookmark, there are four fields defined: n (FAP 
        number 2<=n<=68), FAPfield (see below), T (transition time), and 
        C (time curve for computation of the amplitude during the 
        transition time).  
         
        In case of n=2, FAPfield holds the four numbers Ÿe1 a1 e2 a2÷, 
        with the two facial expressions e1 and e2 and their target 
        amplitudes a1 and a2, respectively. There are six different 
        facial expressions (1<= e1,e2<=6) defined in Annex C of [1]. In 
        case of 3<=n<=68, FAPfield holds only the target amplitude Ÿa÷ 
        for FAP n. 
         
        Amplitudes are given in different units. The unit of an 
        amplitude is determined by the FAP n. The maximum value of the 
        amplitude is signed 2529600. It may be reached for head and eye 
        rotations. In these cases, the unit is AU (Angle Units, 0.00001 
        RAD), and the maximum value corresponds to 25.296 RAD.  
         
        There are no limits on the transition time T specified in ms.  
         
        The field C can be 1, 2, or 3, which is an identifier for a time 
        curve equation defined in [1], Annex C. The time curve describes 
        the transition of the FAP amplitude from its current amplitude 
        to the target amplitude a (a1 and a2 in case of n=2) of the FAP 
        at the end of the transition time T. The amplitude of the FAP at 
        the beginning of the transition depends on the previous 
        bookmarks and can be equal to: 
                - 0 if no bookmark with FAP number was used before. 
                - a of the previous bookmark with the same FAP number if 
                a time longer than the previous transition time T has 
                elapsed between these two FAP bookmarks. 
                - The actual reached amplitude due to a previous 
                bookmark with the same FAP number if a time shorter than 
                the previous transition time T has elapsed between the 
                previous bookmark and the current one. 
        At the end of the transition time T, target amplitude a is 
        maintained until another bookmark gives a new target amplitude. 
        To reset a FAP, a bookmark with the same FAP number with a=0 is 
        included in the text. 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 5 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
         
        In case of C=1, the face renderer will linearly change the 
        amplitude of FAP n from its current amplitude to the target 
        amplitude within the transition time T. In case of C=2, a 
        triangle function is used which linearly changes the amplitude 
        of FAP n from its current value to the target amplitude a within 
        the transition time T/2. After that the amplitude is linearly 
        changed back to the value prior to encountering the bookmark 
        within the transition time T/2. In case of C=3, a spline 
        function is used to change the amplitude from its current 
        amplitude to the target amplitude a within the transition time 
        T. 
         
        Bookmarks with n=2 allow to change the facial expression of the 
        face (joy, anger, etc.), and n in the range of 3 to 68 allow to 
        animate parts of the head (lips, eyebrow, etc.)  
         
 
   Starttime: 
        Start time for this phonemeSymbol with respect to the start of 
        the MPEG-4 session in ms using a long int. MPEG-4 computes the 
        duration of the phonemes by subtracting the start times of 
        consecutive phonemes. In the PFAP payload format, we transmit 
        time durations with each phoneme. 
    
    
6. Payload Format 
    
   The PFAP payload consists of three types of information: phoneme 
   descriptor, FAP descriptor, and recovery information. Each payload 
   starts with a Ÿpacket descriptor÷ field followed by optional recovery 
   information. Phoneme descriptors and FAP descriptors may follow the 
   packet descriptor or the recovery information if available. FAPs are 
   associated with phonemes to determine their timing in a sentence (see 
   section 3, or [2]). The start time of a FAP is the same as the start 
   time of the first phoneme following the FAP(s).  In case that the 
   input to the TTS server ends with a bookmark, the server could send 
   these bookmarks as FAPs prior to the last phoneme of the previous 
   word. Alternatively, the server could create a short silence phoneme 
   that is sent after the final FAP. Therefore, a packet MUST end with a 
   phoneme if it contains any information other than recovery 
   information.  
    
   The following sections define the specific formats for the packet 
   descriptor and each of the three information types. 
    
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 6 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |Pkt descriptor |           (optional)Recovery Info             | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                                                               | 
   |(optional)Recovery Info, (optional)((optional)FAP and Phoneme),| 
   |..., (optional)((optional)FAP and Phoneme)                     | 
   |-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 
    
                        Figure 1 “ PFAP Payload 
    
    
  6.1. Packet descriptor 
 
                            0 1 2 3 4 5 6 7 
                           +-+-+-+-+-+-+-+-+ 
                           |C| T | PP  |IB | 
                           +-+-+-+-+-+-+-+-+ 
    
                       Figure 2 “ Packet descriptor 
    
   Complete (C): 1 bit 
        Distinguish between dynamic, and complete recovery information. 
        Zero stands for dynamic, and one for complete recovery 
        information. In case of complete recovery information, the 
        packet MUST only contain recovery information. Recovery 
        information is defined in section Ÿ6.4 Recovery information, 
        type 1÷. 
    
   Type (T): 2 bits 
        This field identifies the structure of recovery information with 
        the following meaning: 
           00   no recovery information 
           01   recovery information (defined in Ÿ6.4 Recovery 
                information, type 1÷) 
           10   reserved 
           11   reserved 
    
   prevPackets (PP): 3 bits 
        For dynamical recovery (C=0) this field defines the number of 
        previous packets that can be recovered with the following 
        recovery information. For complete recovery information (C=1) 
        this field can be ignored. The interpretation of these three 
        bits is given as follows: 
        (Every packet counts.) 
           000  reserved 
           001  one previous packet is covered 
           010  two previous packets are covered 
           011  four previous packets are covered 
           100  seven previous packets are covered 
           101  15 previous packets are covered 
           110  25 previous packets are covered 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 7 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
           111  40 previous packets are covered 
    
   InfoBits (IB): 2 bits 
        Indicate the type of the descriptor following the recovery 
        information: 
           00   a Phoneme descriptor follows 
           01   a FAP descriptor follows 
           10   end of packet 
           11   reserved  
    
    
  6.2. Phoneme descriptor 
    
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | PhonemeSymbol |    PhonemeDuration    |   f0Average   |S|W|IB | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
                       Figure 3 “ Phoneme descriptor 
    
   PhonemeSymbol: 8 bits 
        This field identifies each phoneme from a phoneme alphabet. The 
        mapping of Phonemes to this 8 bit number is signaled out of 
        band. 
    
   PhonemeDuration: 12 bits 
        This field identifies the duration of the PhonemeSymbol in units 
        of milliseconds. 
    
   f0Average: 8 bits 
        This field defines the frequency of the synthesized audio signal 
        for this phoneme in units of 2 Hz. 
    
   Stress (S): 1 bit 
        S=1 indicates a stressed phoneme.  
    
   WordBegin (W): 1 bit 
        W=1 indicates the beginning of a word. 
    
   InfoBits (IB): 2 bits 
      These bits identify the following descriptor (phoneme, FAP) in the 
      stream or indicate the end of text, which can be after a sentence 
      or a paragraph. The meanings of the binary combinations are: 
           00   a Phoneme descriptor follows 
           01   a FAP descriptor follows 
           10   end of packet 
           11   end of text, which implies end of packet. End of text 
                can be at the end of a sentence or paragraph. The 
                renderer/client should expect a pause of undefined 
                length prior to the next utterance. 
    
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 8 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
    
  6.3. FAP descriptor 
    
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |   FAPind    |s|           Amplitude (22 bits)             |     
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     Transition (14 bits)  | C |IB | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
                         Figure 4 “ FAP descriptor 
    
    
   FAPind: 7 bits 
        This field identifies FAPs in the range of 3 to 74. Facial 
        expressions are indicated using FAP numbers larger than 68. For 
        FAP numbers larger than 68, (FAP number “ 68) gives the facial 
        expression number e1 or e2. Amplitude, transition and curve are 
        not mapped, they stay the same. 
        Example: 
                bookmark sequence for expression: <FAP 2 1 40 2 30 2000 
                3> 
                transformed bookmark sequence: <FAP 69 40 2000 3><FAP 70 
                30 2000 3> 
    
   Sign (s): 1 bit 
        Sign of the FAP target amplitude. 0 stands for plus, and 1 for 
        minus. 
    
   Amplitude: 22 bits 
        This field holds the target amplitude for this FAP. The maximum 
        possible target amplitude is 2529600.  
    
   Transition: 14 bits 
        Holds the desired transition time during which the target 
        amplitude of the FAP has to be reached. The maximum transition 
        time is not specified in MPEG-4. In this payload format, it is 
        limited to 16383 ms. 
    
   Curve (C): 2 bits 
        Describes the time curve (1, 2, or 3) used for computation of 
        the FAP amplitude. 
    
   InfoBits (IB): 2 bits 
        These bits identify the following descriptor (phoneme, FAP) in 
        the stream. The meanings of the binary combinations are: 
           00   a Phoneme descriptor follows 
           01   a FAP descriptor follows 
           10   reserved 
           11   reserved 
    
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                 9 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
    
  6.4. Recovery information, type 1 
    
   Only FAPs can be recovered with the recovery information. In case of 
   complete recovery information, only FAPs with nonzero amplitudes are 
   specified. In case of dynamic recovery, only FAPs from bookmarks that 
   were specified during the prevPackets packets and still have an 
   effect on the FAPs are specified. This might include FAPs with a 
   target amplitude of 0. As an example, if a FAP is changed during a 
   previous packet using a triangle function (C=2) and the transition 
   time is already in the past, the FAP is not included in the recovery 
   bit structure. 
    
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |   FAPind    |s|           Amplitude (22 bits)             |     
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     Transition (14 bits)  | C |IB | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
                  Figure 5 “ Recovery information, type 1 
    
    
   FAPind: 7 bits 
        see FAP descriptor in Ÿ6.3 FAP descriptor÷ 
    
   Sign (s): 1 bit 
        see FAP descriptor in Ÿ6.3 FAP descriptor÷ 
    
   Amplitude: 22 bits 
        see FAP descriptor in Ÿ6.3 FAP descriptor÷ 
    
   Transition: 14 bits 
        Holds the transition time adjusted for the moment of sending of 
        each transmitted FAP. This new transition time should be set to 
        the greater of 0 or the end time of transition minus the 
        timestamp of the packet. 
    
   Curve (C): 2 bits 
        see FAP descriptor in Ÿ6.3 FAP descriptor÷ 
    
   InfoBits (IB): 2 bits 
        These Bits are describing the following data. 
        The meanings of the binary combinations are: 
           00   recovery information, type 1 follows 
           01   reserved 
           10   reserved 
           11   indicates the end of recovery information 
    
    
7. RTP header fields usage: 
    
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                10 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
  Payload Type: The assignment of an RTP payload type for this payload 
  format is outside the scope of this document, and will not be 
  specified here. It is expected that the RTP profile for a particular 
  class of applications will assign a payload type for this format, or 
  if that is not done then a payload type in the dynamic range shall be 
  chosen. 
    
   M bit: Marker Bit equals one indicates the start of a sentence with 
   the first phoneme in the current packet. This non-speech related 
   information is to be used with the renderer. 
    
   Timestamp: Represents the presentation time of the first phoneme in 
   this packet based on a 44.1 kHz clock unless specified otherwise out-
   of-band. For packets without phonemes (complete recovery) the 
   timestamp specifies the time when the state of the bookmarks was 
   sampled. 
    
    
8. Recovery Strategy 
    
   Recovery information is sent using the 6.4 Recovery information, type 
   1. Complete recovery information MAY be sent between two regular data 
   packets. Dynamical recovery information MAY be sent with each regular 
   data packet. Dynamical recovery information contains FAPs that were 
   transmitted during the recovery period prevPackets. Complete recovery 
   only contains non-zero FAPs. Complete recovery packets are only sent 
   for new clients/users or burst losses exceeding the limits of 
   dynamical recovery. 
    
9. Security Considerations 
    
   RTP packets using the payload format defined in this specification 
   are subject to the security considerations discussed in the RTP 
   specification [5], and any appropriate profile. This implies that 
   confidentiality of the media streams is achieved by encryption.  
   Because the data encoding used with this payload format is applied 
   end-to-end, encryption may be performed after encoding so there is no 
   conflict between the two operations. 
    
   A potential denial-of-service threat exists for data encodings using 
   receiver side decoding.  The attacker can inject pathological 
   datagrams into the stream, which are complex to decode and cause the 
   receiver to be overloaded.  The decoder software should consider this 
   possibility and take the necessary precautions. 
    
   As with any IP-based protocol, in some circumstances, a receiver may 
   be overloaded simply by the receipt of too many packets, either 
   desired or undesired.  Network-layer authentication may be used to 
   discard packets from undesired sources, but the processing cost of 
   the authentication itself may be too high.  In a multicast 
   environment, pruning of specific sources may be implemented in future 
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                11 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
   versions of IGMP [6] and in multicast routing protocols to allow a 
   receiver to select which sources are allowed to reach it. 
    
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                12 
                        RTP Payload Format for            October 2001 
              Phoneme/Facial Animation Parameter (PFAP) 
    
10.     References 
    
   [1] ISO/IEC International Standard 14496-2; "Generic coding of audio-
   visual objects - Part 2: Visual", 1998 
    
   [2] ISO/IEC International Standard 14496-3; "Generic coding of audio-
   visual objects - Subpart 6: Text-to-Speech Interface", 1998 
    
   [3] Avaro, et. al., ŸRTP Payload Format for MPEG-4 Streams÷, IETF 
   work in progress, draft-ietf-avt-mpeg4-multisl-00.txt, June 2001. 
    
   [4] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, 
   "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, 
   January 1996. 
    
   [5] RFC 2119 Bradner, S., "Key words for use in RFCs to Indicate 
   Requirement Levels", BCP 14, RFC 2119, March 1997 
   
   [6] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC 
   1112, August 1989. 
    
 
 
11.     Author's Addresses 
    
   Joern Ostermann 
   AT&T Labs - Research, Rm A5-4E02 
   200 Laurel Ave South                 Phone:  1-732-420-9116 
   Middletown, NJ 07748 USA             Email:  
   osterman@research.att.com 
    
   Juergen Th. Rurainsky 
   AT&T Labs - Research, Rm A5-4F27 
   200 Laurel Ave South                 Phone:  1-732-420-9138 
   Middletown, NJ 07748 USA             Email:  jru@research.att.com 
    
   M. Reha Civanlar 
   AT&T Labs - Research, Rm A5-4D04 
   200 Laurel Ave South                 Phone:  1-732-420-9170 
   Middletown, NJ 07748 USA             Email: civanlar@research.att.com  
     
   Ostermann/Rurainsky/Civanlar - Expires April 2002                13