INTERNET-DRAFT       PGM Reliable Transport Protocol       Tony Speakman
Expires 24 December 1999                                   Nidhi Bhaskar
                                                      Richard Edmonstone
                                                          Dino Farinacci
                                                              Steven Lin
                                                            Alex Tweedly
                                                        Lorenzo Vicisano
                                                           cisco Systems

                                                             Jim Gemmell
                                                               Microsoft

                                                            24 June 1999

             PGM Reliable Transport Protocol Specification

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

Pragmatic General Multicast (PGM) is a reliable multicast transport
protocol for applications that require ordered or unordered,
duplicate-free, multicast data delivery from multiple sources to
multiple receivers. PGM guarantees that a receiver in the group either
receives all data packets from transmissions and repairs, or is able to
detect unrecoverable data packet loss. PGM is specifically intended as a
workable solution for multicast applications with basic reliability
requirements. Its central design goal is simplicity of operation with
due regard for scalability and network efficiency.

Revision History

draft-speakman-pgm-spec-00.txt                    January 1998

   Original draft.

draft-speakman-pgm-spec-01.txt                    January 1998

   Deleted reference to proprietary trademark.

draft-speakman-pgm-spec-02.txt                    August 1998

   This revision benefited from general discussions in the forum of
   the Reliable Multicast IRTF as well as from individual discussion
   with Dan Leshchiner concerning source addressing and NAK
   elimination, with Chetan Rai concerning TPDU ordering and local
   retransmission, and with Jim Gemmell, Luigi Rizzo, and Lorenzo
   Vicisano concerning FEC.

   Clarified that RDATA from DLRs and NCFs from network elements must
   bear the ODATA source's source NLA.

   Added NAK elimination timer and corresponding procedures to
   network elements.

   Added procedures and packet formats to incorporate FEC.

   Changed all the packet type encodings to anticipate versioning and
   extension.

   Added work-in-progress items for RDATA delay at the source and
   minimum NAK back-off at receivers.

   Added work-in-progress items for SPMRs.

draft-speakman-pgm-spec-03.txt                    June 1999

   The polling and implosion control procedures in this document were
   developed jointly with Jim Gemmell who contributed invaluable
   review, revision, and critique to this revision. This revision
   was edited by Nidhi Bhaskar, Richard Edmonstone, Jim Gemmell, and
   Lorenzo Vicisano all of whom contributed to the simplification and
   clarification of the text as well as new ideas for PGM operation,
   polling, and implosion control. The work on SPMRs arose from
   discussions with Dan Leshchiner.

   Removed range NAKs for re-working.

   Generalized and simplified methods for advancing transmit window.

   Removed increment sequence number from SPM packets.

   Removed Appendix B's information for congestion avoidance.

   Removed "local retransmission" in favour of full DLR
   functionality.

   Added generic polling capability within a single PGM hop.

   Added procedures to adjust NAK_BO_IVL dynamically and to address
   potential NAK implosion problems.

   Added SPMR procedures and packet formats.

Table of Contents

   Abbreviations
   1. Introduction and Overview
   2. Architectural Description
   3. Terms and Concepts
   4. Procedures - General
   5. Procedures - Sources
   6. Procedures - Receivers
   7. Procedures - Network Elements
   8. Packet Formats
   9. Options
   10. Security Considerations
   Appendix A - Forward Error Correction
   Appendix B - Congestion Avoidance
   Appendix C - Flow Control
   Appendix D - SPM Requests
   Appendix E - Poll Mechanism
   Appendix F - Implosion Prevention
   Work in Progress
   Acknowledgements
   References

Abbreviations

   ACK      Acknowledgement
   AFI      Address Family Indicator
   ALF      Application Level Framing
   APDU     Application Protocol Data Unit
   ARQ      Automatic Repeat reQuest
   DLR      Designated Local Repairer
   GSI      Globally Unique Source Identifier
   FEC      Forward Error Correction
   MD5      Message-Digest Algorithm
   MTU      Maximum Transmission Unit
   NAK      Negative Acknowledgement
   NCF      NAK Confirmation
   NLA      Network Layer Address
   NNAK     Null Negative Acknowledgement
   ODATA    Original Data
   RDATA    Repair Data
   RSN      Receive State Notification
   SPM      Source Path Message
   SPMR     SPM Request
   TG       Transmission Group
   TGSIZE   Transmission Group Size
   TPDU     Transport Protocol Data Unit
   TSI      Transport Session Identifier
   TSN      Transmit State Notification

1. Introduction and Overview

A variety of reliable protocols have been proposed for multicast data
delivery, each with an emphasis on particular types of applications,
network characteristics, or definitions of reliability ([1], [2], [3],
[4]). In this tradition, Pragmatic General Multicast (PGM) is a
reliable transport protocol for applications that require ordered or
unordered, duplicate-free, multicast data delivery from multiple
sources to multiple receivers.

PGM is specifically intended as a workable solution for multicast
applications with basic reliability requirements rather than as a
comprehensive solution for multicast applications with sophisticated
ordering, agreement, and robustness requirements. Its central design
goal is simplicity of operation with due regard for scalability and
network efficiency.

PGM has no notion of group membership. It simply provides reliable
multicast data delivery within a transmit window advanced by a source
according to a purely local strategy.
Reliable delivery is provided
within a source's transmit window from the time a receiver joins the
group until it departs. PGM guarantees that a receiver in the group
either receives all data packets from transmissions and repairs, or is
able to detect unrecoverable data packet loss. PGM supports any number
of sources within a multicast group, each fully identified by a globally
unique Transport Session Identifier (TSI), but since these
sources/sessions operate entirely independently of each other, this
specification is phrased in terms of a single source and extends without
modification to multiple sources.

More specifically, PGM is not intended for use with applications that
depend either upon acknowledged delivery to a known group of recipients,
or upon total ordering amongst multiple sources.

Rather, PGM is best suited to those applications in which members may
join and leave at any time, and that are either insensitive to
unrecoverable data packet loss or are prepared to resort to application
recovery in the event. Through its optional extensions, PGM provides
specific mechanisms to support applications as disparate as stock and
news updates, data conferencing, and low-delay, real-time video
transfer.

In the following text, transport-layer originators of PGM data packets
are referred to as sources, transport-layer consumers of PGM data
packets are referred to as receivers, and network-layer entities in the
intervening network are referred to as network elements. Unless
otherwise specified, the term "repair" will be used to indicate either
the actual retransmission of a copy of a missing packet or the
transmission of an FEC repair packet.

1.1. Summary of Operation

PGM runs over a datagram multicast protocol such as IP multicast [5].
In the normal course of data transfer, a source multicasts sequenced
data packets (ODATA), and receivers unicast selective negative
acknowledgements (NAKs) for data packets detected to be missing from the
expected sequence. Network elements forward NAKs PGM-hop-by-PGM-hop to
the source, and confirm each hop by multicasting a NAK confirmation
(NCF) in response on the interface on which the NAK was received.
Repairs (RDATA) may be provided either by the source itself or by a
Designated Local Repairer (DLR) in response to a NAK.

Since NAKs provide the sole mechanism for reliability, PGM is
particularly sensitive to their loss. To minimize NAK loss, PGM defines
a network-layer hop-by-hop procedure for reliable NAK forwarding.

Upon detection of a missing data packet, a receiver repeatedly unicasts
a NAK to the last-hop PGM network element on the distribution tree from
the source. A receiver repeats this NAK until it receives a NAK
confirmation (NCF) multicast to the group from that PGM network element.
That network element responds with an NCF to the first occurrence of the
NAK and any further retransmissions of that same NAK from any receiver.
In turn, the network element repeatedly forwards the NAK to the upstream
PGM network element on the reverse of the distribution path from the
source of the original data packet until it also receives an NCF from
that network element. Finally, the source itself receives and confirms
the NAK by multicasting an NCF to the group.

While NCFs are multicast to the group, they are not propagated by PGM
network elements since they act as hop-by-hop confirmations.

To avoid NAK implosion, PGM specifies procedures for subnet-based NAK
suppression amongst receivers and NAK elimination within network
elements. The usual result of this procedure is the propagation of just
one copy of a given selective NAK along the reverse of the distribution
path from any network with directly connected receivers to a source.

The net effect is that unicast NAKs return from a receiver to a source
on the reverse of the path on which ODATA was forwarded, that is, on the
reverse of the distribution tree from the source. More specifically,
they return through exactly the same sequence of PGM network elements
through which ODATA was forwarded, but in reverse. The reasons for
handling NAKs this way will become clear in the discussion of
constraining repairs, but first it's necessary to describe the
mechanisms for establishing the requisite source path state in PGM
network elements.

To establish source path state in PGM network elements, the basic data
transfer operation is augmented by Source Path Messages (SPMs) from a
source, periodically interleaved with ODATA. SPMs function primarily to
establish source path state for a given TSI in all PGM network elements
on the distribution tree from the source. PGM network elements use this
information to address returning unicast NAKs directly to the upstream
PGM network element toward the source, and thereby ensure that NAKs
return from a receiver to a source on the reverse of the distribution
path for the TSI.

SPMs also act to alert receivers that the oldest data in the transmit
window is about to be retired from the transmit window and will,
thereafter, not be available for repair from the source. SPMs are sent
by a source at least at the rate at which the transmit window is
advanced, and they serve to provoke further NAKs from receivers as well
as to maintain receive window state in the receivers.
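
As an informal illustration of the source path state described above,
the following Python sketch models a chain of PGM network elements that
each remember, per TSI, the upstream PGM hop from which an SPM arrived,
and then walks a NAK back along that state. It is not part of the
protocol. The names PgmHop, send_spm, and nak_route, the node names, and
the TSI value are hypothetical, and the sketch simply assumes that an
arriving SPM lets an element identify the previous PGM hop; how that is
conveyed is not described in this section.

```python
class PgmHop:
    """A PGM network element holding source path state (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.upstream = {}  # TSI -> name of the upstream PGM hop for that session

    def on_spm(self, tsi, previous_hop):
        # Assumption: the SPM lets this element identify the PGM hop it came from.
        self.upstream[tsi] = previous_hop


def send_spm(tsi, source, hops):
    """Carry an SPM from the source down a chain of PGM hops, in order."""
    previous = source
    for hop in hops:
        hop.on_spm(tsi, previous)
        previous = hop.name


def nak_route(tsi, last_hop, hops_by_name):
    """Return the hop-by-hop route of a NAK from the last-hop element to the source."""
    route = [last_hop]
    current = last_hop
    while current in hops_by_name:
        current = hops_by_name[current].upstream[tsi]
        route.append(current)
    return route


tsi = ("example-gsi", 7500)  # GSI plus data-source port; both values are made up
hops = [PgmHop("ne1"), PgmHop("ne2")]
send_spm(tsi, "source", hops)
print(nak_route(tsi, "ne2", {hop.name: hop for hop in hops}))
# ['ne2', 'ne1', 'source']
```

The NAK route is the SPM path in reverse, which is the property the
text relies on for constraining repairs.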

As a further efficiency, PGM specifies procedures for the constraint of
repairs by network elements so that they reach only those group members
that missed the original transmission. As NAKs traverse the reverse of
the ODATA path (upward), they establish repair state in the network
elements which is used in turn to constrain the (downward) forwarding of
the corresponding RDATA.

Besides procedures for the source to provide repairs, PGM also specifies
options and procedures that permit designated local repairers (DLRs) to
announce their availability and to redirect repair requests (NAKs) to
themselves rather than to the original source. In addition to these
conventional procedures for loss recovery through selective ARQ,
Appendix A specifies Forward Error Correction (FEC) procedures for
sources to provide and receivers to request general error correcting
parity packets rather than selective retransmissions.

Finally, since PGM operates without regular return traffic from
receivers, conventional feedback mechanisms for transport flow and
congestion control cannot be applied. Appendix B will specify some
preliminary strategies for congestion avoidance to be modified and
proven or discarded as experience dictates. Appendix C specifies a basic
and optional flow control supplement native to PGM itself that
introduces a degree of receiver feedback, but it is entirely elective
and not meant as a replacement for reservation protocols or other
out-of-band resource and conference management strategies. In its basic
operation, therefore, PGM relies on a purely rate-limited transmission
strategy in the source to bound the bandwidth consumed by PGM transport
sessions and to define the transmit window maintained by the source.

PGM defines four basic packet types: three that flow downstream (SPMs,
DATA, NCFs), and one that flows upstream (NAKs).

1.2. Design Goals and Constraints

PGM has been designed to serve that broad range of multicast
applications that have relatively simple reliability requirements, and
to do so in a way that realizes the much advertised but often unrealized
network efficiencies of multicast data transfer. The usual impediments
to realizing these efficiencies are the implosion of negative and
positive acknowledgements from receivers to senders, repair latency from
the source, and the propagation of repairs to disinterested receivers.

1.2.1. Reliability

Reliable data delivery across an unreliable network is conventionally
achieved through an end-to-end protocol in which a source (implicitly or
explicitly) solicits receipt confirmation from a receiver, and the
receiver responds positively or negatively. While the frequency of
negative acknowledgements is a function of the reliability of the
network and the receiver's resources (and so, potentially quite low),
the frequency of positive acknowledgements is fixed at at least the rate
at which the transmit window is advanced, and usually more often.

Negative acknowledgements primarily determine repairs and reliability.
Positive acknowledgements primarily determine transmit buffer
management.

When these principles are extended without modification to multicast
protocols, the result, at least for positive acknowledgements, is a
burden of positive acknowledgements transmitted to the source that
quickly threatens to overwhelm it as the number of receivers grows. More
succinctly, ACK implosion keeps ACK-based reliable multicast protocols
from scaling well.

One of the goals of PGM is to get as strong a definition of reliability
as possible from as simple a protocol as possible. ACK implosion can be
addressed in a variety of effective but complicated ways, most of which
require re-transmit capability from other than the original source.

An alternative is to dispense with positive acknowledgements altogether,
and to resort to other strategies for buffer management while retaining
negative acknowledgements for repairs and reliability. The approach
taken in PGM is to retain negative acknowledgements, but to dispense
with positive acknowledgements and resort instead to timeouts at the
source to manage transmit resources.

The definition of reliability with PGM is a direct consequence of this
design decision. PGM guarantees that a receiver either receives all
data packets from transmissions and repairs, or is able to detect
unrecoverable data packet loss.

PGM includes strategies for repeatedly soliciting NAKs from receivers,
and for adding reliability to the NAKs themselves. By reinforcing the
NAK mechanism, PGM minimizes the probability that a receiver will detect
a missing data packet so late that the packet is unavailable for repair
either from the source or from a designated local repairer (DLR).
Without ACKs and knowledge of group membership, however, PGM cannot
eliminate this possibility.

1.2.2. Group Membership

A second consequence of eliminating ACKs is that knowledge of group
membership is neither required nor provided by the protocol. Although a
source may receive some PGM packets (NAKs for instance) from some
receivers, the identity of the receivers does not figure in the
processing of those packets. Group membership may change during the
course of a PGM transport session without the knowledge of or
consequence to the source or the remaining receivers.

1.2.3. Efficiency

While PGM avoids the implosion of positive acknowledgements simply by
dispensing with ACKs, the implosion of negative acknowledgements is
addressed directly.

Receivers observe a random back-off prior to generating a NAK during
which interval the NAK is suppressed by the receiver upon receipt of a
matching NCF. In addition, PGM network elements eliminate duplicate
NAKs received on different interfaces on the same network element. The
combination of these two strategies usually results in the source
receiving just a single NAK for any given lost data packet.

Whether a repair is provided from a DLR or the original source, it is
important to constrain that repair to only those network segments
containing members that negatively acknowledged the original
transmission rather than propagating it throughout the group. PGM
specifies procedures for network elements to use the pattern of NAKs to
define a sub-tree within the group upon which to forward the
corresponding repair so that it reaches only those receivers that missed
it in the first place.

1.2.4. Simplicity

PGM is designed to achieve the greatest improvement in reliability (as
compared to the usual UDP) with the least complexity. As a result, PGM
does NOT address conference control, global ordering amongst multiple
sources in the group, nor recovery from network partitions.

1.2.5. Operability

PGM is designed to function, albeit with less efficiency, even when some
or all of the network elements in the multicast tree have no knowledge
of PGM. To that end, all PGM data packets can be conventionally
multicast routed by non-PGM network elements with no loss of
functionality, but with some inefficiency in the propagation of RDATA
and NCFs.
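
As an informal illustration of the receiver-side NAK suppression
described in Section 1.2.3, the following Python sketch counts how many
NAKs one subnet emits when every receiver misses the same data packet.
It is a toy model and not a protocol procedure. The function naks_sent
and its parameters are hypothetical, and it assumes that each receiver
draws its back-off uniformly from a NAK_BO_IVL-sized interval and that
the matching NCF reaches all receivers a fixed delay after the first
NAK is sent.

```python
import random


def naks_sent(num_receivers, nak_bo_ivl, ncf_delay, rng):
    """Count NAKs sent on one subnet when every receiver misses the same packet."""
    # Assumption: each receiver draws its back-off uniformly from [0, nak_bo_ivl].
    backoffs = sorted(rng.uniform(0, nak_bo_ivl) for _ in range(num_receivers))
    first, rest = backoffs[0], backoffs[1:]
    # Assumption: the matching NCF reaches every receiver ncf_delay after the first NAK.
    ncf_time = first + ncf_delay
    # Receivers whose back-off expires before the NCF arrives also send a NAK;
    # all others see the NCF first and suppress theirs.
    return 1 + sum(1 for backoff in rest if backoff < ncf_time)


rng = random.Random(1)
print(naks_sent(num_receivers=50, nak_bo_ivl=0.5, ncf_delay=0.01, rng=rng))
```

In this model the number of NAKs falls as the back-off interval grows
relative to the NCF delay, at the cost of a longer wait before the
first NAK is sent.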
411 In addition, since NAKs are unicast to the last-hop PGM network element 412 and NCFs are multicast to the group, NAK/NCF operation is also con- 413 sistent across non-PGM network elements. Note that for NAK suppression 414 to be most effective, receivers should always have a PGM network element 415 as a first hop network element between themselves and every path to 416 every PGM source. If receivers are several hops removed from the first 417 PGM network element, the efficacy of NAK suppression may degrade. 419 1.3. Options 421 In addition to the basic data transfer operation described above, PGM 422 specifies several end-to-end options to address specific application 423 requirements. PGM specifies options to support fragmentation, late 424 joining, time-stamping, reception quality reports, sequence number dro- 425 pout, redirection, and Forward Error Correction (FEC). Options may be 426 appended to PGM packet headers only by their original transmitters. 427 While they may be interpreted by network elements, options are neither 428 added nor removed by network elements. 430 All options are receiver-significant (i.e., they must be interpreted by 431 receivers). Some options are also network-significant (i.e., they must 432 be interpreted by network elements). 434 Fragmentation may be used in conjunction with data packets to allow a 435 transport-layer entity at the source to break up application-layer data 436 packets into multiple PGM data packets to conform with the maximum 437 transmission unit (MTU) supported by the network layer. Fragmentation 438 is incompatible with the sequence number dropout option. 440 Late joining allows a source to indicate whether or not receivers may 441 request all available repairs when they initially join a particular 442 transport session. 444 Time stamps may be used in conjunction with NAKs to allow receivers to 445 specify the interval in which the requested RDATA is relevant to them. 
446 That interval is interpreted by both network elements and sources to 447 determine whether to continue with or abandon a given repair. 449 Reception quality reports may be used in conjunction with NAKs to allow 450 receivers to provide a reception quality metric for local interpretation 451 at the source for the purpose of congestion control. 453 Sequence number dropout may be used in conjunction with data packets to 454 allow sources and network elements to selectively eliminate PGM data 455 packets and convey the resulting sequence-number discontinuity to 456 receivers so that reliability can be preserved across the dropout. 457 Sequence number dropout is incompatible with the fragmentation option. 459 Redirection may be used in conjunction with NCFs to allow a DLR to 460 respond to normal NCFs with a redirecting NCF advertising its own 461 address as an alternative to the original source. Recipients of 462 redirecting NCFs may then direct subsequent NAKs to the DLR rather than 463 to the original source. In addition, DLRs that receive redirected NAKs 464 for which they have RDATA must send a NULL NAK to provide flow control 465 to the original source without also provoking a repair from that source. 467 FEC techniques may be applied by receivers to use source-provided parity 468 packets rather than selective retransmissions to effect loss recovery. 470 2. Architectural Description 472 As an end-to-end transport protocol, PGM specifies packet formats and 473 procedures for sources to transmit and for receivers to receive data. 474 To enhance the efficiency of this data transfer, PGM also specifies 475 packet formats and procedures for network elements to improve the relia- 476 bility of NAKs and to constrain the propagation of repairs. The divi- 477 sion of these functions is described in this section and expanded in 478 detail in the next section. 480 2.1. 
Source Functions 482 Data Transmission 484 Sources multicast ODATA packets to the group within the transmit 485 window at a given transmit rate. 487 Source Path State 489 Sources multicast SPMs to the group, interleaved with ODATA if 490 present, to establish source path state in PGM network elements. 492 NAK Reliability 494 Sources multicast NCFs to the group in response to any NAKs they 495 receive. 497 Repairs 498 Sources multicast RDATA packets to the group in response to NAKs 499 received for data packets within the transmit window. 501 Transmit Window Advance 503 Sources may advance the trailing edge of the window according to 504 one of a number of strategies. Implementations may support 505 automatic adjustments such as keeping the window at a fixed size 506 in bytes, a fixed number of packets or a fixed real time duration. 507 In addition, they may optionally delay window advancement based on 508 NAK-silence for a certain period. Some possible strategies are 509 outlined later in this document. 511 2.2. Receiver Functions 513 Source Path State 515 Receivers use SPMs to determine the last-hop PGM network element 516 for a given TSI to which to direct their NAKs. 518 Data Reception 520 Receivers receive ODATA within the transmit window and eliminate 521 any duplicates. 523 Repair Requests 525 Receivers unicast NAKs to the last-hop PGM network element and may 526 optionally multicast a NAK with TTL=1 to the local group for data 527 packets within the receive window detected to be missing from the 528 expected sequence. A receiver must repeatedly transmit a given 529 NAK until it receives a matching NCF. 531 NAK Suppression 533 Receivers suppress NAKs for which a matching NCF or NAK is 534 received during the NAK transmit back-off interval. 536 Receive Window Advance 538 Receivers immediately advance their receive windows upon receipt 539 of any PGM data packet or SPM within the receive window that 540 advances the receive window. 542 2.3. 
Network Element Functions 544 Network elements forward ODATA without intervention. 546 Source Path State 548 Network elements intercept SPMs and use them to establish source 549 path state for the corresponding source and group before multicast 550 forwarding them in the usual way. 552 NAK Reliability 554 Network elements multicast NCFs to the group in response to any 555 NAK they receive. For each NAK received, network elements create 556 repair state recording the transport session identifier, the 557 sequence number of the NAK, and the input interface on which the 558 NAK was received. 560 Constrained NAK Forwarding 562 Network elements repeatedly unicast forward only the first copy of 563 any NAK they receive to the upstream PGM network element on the 564 distribution path for the TSI and in addition they may optionally 565 multicast this NAK upstream with TTL=1. They do this until they 566 receive an NCF in response. 568 NAK Elimination 570 Network elements discard exact duplicates of any NAK for which 571 they already have repair state (i.e., that has been forwarded 572 either by themselves or a neighbouring PGM network element), and 573 respond with a matching NCF. 575 Constrained RDATA Forwarding 577 Network elements use NAKs to maintain repair state consisting of a 578 list of interfaces upon which a given NAK was received, and they 579 return the corresponding RDATA only on these interfaces. 581 NAK Anticipation 583 If a network element hears an upstream NCF (i.e., on the upstream 584 interface for the distribution tree for the TSI), it establishes 585 repair state without outgoing interfaces in anticipation of 586 responding to and eliminating duplicates of the NAK that may 587 arrive from downstream. 589 3. Terms and Concepts 591 Before proceeding from the preceding overview to the detail in the sub- 592 sequent Procedures, this section presents some concepts and definitions 593 that make that detail more intelligible. 595 3.1. 
Transport Session Identifiers

Every PGM packet is identified by a:

   TSI   transport session identifier

TSIs must be globally unique, and only one source at a time may act as
the source for a transport session. (Note that repairers do not change
the TSI in any RDATA they transmit). TSIs are composed of the
concatenation of a globally unique source identifier (GSI) and a
source-assigned data-source port.

Since all PGM packets originated by receivers are in response to PGM
packets originated by a source, receivers simply echo the TSI heard from
the source in any corresponding packets they originate.

Since all PGM packets originated by network elements are in response to
PGM packets originated by a receiver, network elements simply echo the
TSI heard from the receiver in any corresponding packets they originate.

3.2. Sequence Numbers

PGM uses a circular sequence number space from 0 through ((2**32) - 1)
to identify and order ODATA packets. Sources must number ODATA packets
in unit increments in the order in which the corresponding application
data is submitted for transmission. Within a transmit or receive window
(defined below), a sequence number x is "less" or "older" than sequence
number y if it numbers an ODATA packet preceding ODATA packet y, and a
sequence number y is "greater" or "more recent" than sequence number x
if it numbers an ODATA packet subsequent to ODATA packet x.

3.3. Transmit Window

The description of the operation of PGM rests fundamentally on the
definition of the source-maintained transmit window. This definition in
turn is derived directly from the amount of transmitted data (in
seconds) a source retains for repair (TXW_SECS), and the maximum
transmit rate (in bytes/second) maintained by a source to regulate its
bandwidth utilization (TXW_MAX_RTE).

The size of the transmit window in seconds is simply TXW_SECS.
The size of the transmit window in bytes (TXW_BYTES) is
(TXW_MAX_RTE * TXW_SECS). The size of the transmit window in sequence
numbers (TXW_SQNS) is (TXW_BYTES / bytes-per-packet).

In terms of sequence numbers, the transmit window is the range of
sequence numbers consumed by the source for sequentially numbering and
transmitting the most recent TXW_SECS of ODATA packets. The trailing
(or left) edge of the transmit window (TXW_TRAIL) is defined as the
sequence number of the oldest data packet available for repair from a
source. The leading (or right) edge of the transmit window (TXW_LEAD)
is defined as the sequence number of the most recent data packet a
source has transmitted.

The size of the transmit window in sequence numbers (TXW_SQNS) (i.e.,
the difference between the leading and trailing edges) must be no
greater than half the PGM sequence number space less one.

The fraction of the transmit window size (in seconds of data) by which
the transmit window is advanced (TXW_ADV_SECS) is called the window
increment. The trailing (oldest) such fraction of the transmit window
itself is called the increment window.

In terms of sequence numbers, the increment window is the range of
sequence numbers that will be the first to be expired from the transmit
window. The trailing (or left) edge of the increment window is just
TXW_TRAIL, the trailing (or left) edge of the transmit window. The
leading (or right) edge of the increment window (TXW_INC) is defined as
one less than the sequence number of the first data packet transmitted
by the source TXW_ADV_SECS after transmitting TXW_TRAIL.

A data packet is described as being "in" the transmit or increment
window, respectively, if its sequence number is in the range defined by
the transmit or increment window, respectively.
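
The circular ordering of sequence numbers and the window-membership rule
described above can be illustrated with a short sketch. It is not part
of the protocol definition: Python is used only for illustration, the
names sqn_distance and in_window are hypothetical, and the sketch
assumes that "in" the window means lying on the closed range from the
trailing edge to the leading edge, modulo 2**32.

```python
SQN_SPACE = 2 ** 32                 # size of the circular sequence number space
MAX_TXW_SQNS = SQN_SPACE // 2 - 1   # half the sequence number space less one


def sqn_distance(older: int, newer: int) -> int:
    """Unit increments needed to step from `older` forward to `newer`."""
    return (newer - older) % SQN_SPACE


def in_window(sqn: int, trail: int, lead: int) -> bool:
    """True if `sqn` lies between `trail` and `lead` inclusive (assumed
    reading of "in" the window), with wrap-around taken into account."""
    span = sqn_distance(trail, lead)
    if span > MAX_TXW_SQNS:
        raise ValueError("window exceeds half the sequence space less one")
    return sqn_distance(trail, sqn) <= span


# A window that straddles the wrap from 2**32 - 1 back to 0.
trail, lead = SQN_SPACE - 3, 2
print(in_window(0, trail, lead))              # inside the window
print(in_window(3, trail, lead))              # just past the leading edge
print(in_window(SQN_SPACE - 4, trail, lead))  # just before the trailing edge
```

Bounding the window to less than half the sequence space is what makes
the modular distance unambiguous: any sequence number in the window is
then reachable from the trailing edge in fewer than 2**31 increments.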

The transmit window is advanced across the increment window by the
source when it increments TXW_TRAIL to TXW_INC. When the transmit
window is advanced across the increment window, the increment window is
emptied (i.e., TXW_TRAIL is momentarily equal to TXW_INC), begins to
refill immediately as transmission proceeds, is full again TXW_ADV_SECS
later (i.e., TXW_TRAIL is separated from TXW_INC by TXW_ADV_SECS of
data), at which point the transmit window is advanced again, and so on.

Consider the following example:

   Assuming a constant transmit rate of 128kbps and a constant data
   packet size of 1500 bytes, if a source maintains the past 30 seconds
   of data for repair and increments its transmit window in 5 second
   increments, then

      TXW_MAX_RTE  = 16kBps,
      TXW_ADV_SECS = 5 seconds,
      TXW_SECS     = 35 seconds,
      TXW_BYTES    = 560kB (573440 bytes, taking k as 1024),
      TXW_SQNS     = 383 (rounded up),

   and the size of the increment window in sequence numbers
   (TXW_MAX_RTE * TXW_ADV_SECS / 1500) = 54 (rounded down).

Continuing this example, the following is a diagram of the transmit
window and the increment window therein in terms of sequence numbers.

   TXW_TRAIL                                      TXW_LEAD
       |                                              |
       |                                              |
    |--|--------------- Transmit Window --------------|---|
    v  |                                              |   v
       v                                              v
... +-----+-----+-...-+------+------+-...-+-------+-------+ .....
n-1 |  n  | n+1 | ... | n+53 | n+54 | ... | n+381 | n+382 | n+383
... +-----+-----+-...-+------+------+-...-+-------+-------+ .....
                          ^
    ^                     |  ^
    |-- Increment Window -|--|
                          |
                          |
                       TXW_INC

So the values of the sequence numbers defining these windows are:

   TXW_TRAIL = n
   TXW_INC   = n+53
   TXW_LEAD  = n+382

   NOTA BENE: In this example the window sizes in terms of sequence
   numbers can be determined only because of the assumption of a
   constant data packet size of 1500 bytes.
   When the data packet sizes are variable, more or fewer sequence
   numbers may be consumed transmitting the same amount (TXW_BYTES) of
   data.

So, for a given transport session identified by a TSI, a source
maintains:

   TXW_MAX_RTE   a maximum transmit rate in kBytes per second, the
                 cumulative transmit rate of some combination of SPMs,
                 ODATA, and RDATA depending on the transmit window
                 advancement strategy

   TXW_TRAIL     the sequence number defining the trailing edge of the
                 transmit window, the sequence number of the oldest data
                 packet available for repair

   TXW_LEAD      the sequence number defining the leading edge of the
                 transmit window, the sequence number of the most
                 recently transmitted ODATA packet

   TXW_INC       the sequence number defining the leading edge of the
                 increment window, the sequence number of the most
                 recently transmitted data packet amongst those that
                 will expire upon the next increment of the transmit
                 window

PGM does not constrain the strategies that a source may use for
advancing the transmit window. A source may implement any scheme or
number of schemes. This is possible because a PGM receiver must obey
the window provided by the source in its packets. Three strategies are
suggested within this document.

In the first, called "Advance with Time", the transmit window maintains
the last TXW_SECS of data in real-time, regardless of whether any data
was sent in that real time period or not. The actual number of bytes
maintained at any instant in time will vary between 0 and TXW_BYTES,
depending on traffic during the last TXW_SECS. In this case,
TXW_MAX_RTE is the cumulative transmit rate of SPMs and ODATA.

In the second, called "Advance with Data", the transmit window maintains
the last TXW_BYTES bytes of data for repair.
That is, it maintains the theoretical maximum amount of data that could
be transmitted in the time period TXW_SECS, regardless of when it was
transmitted. In this case, TXW_MAX_RTE is the cumulative transmit rate
of SPMs, ODATA, and RDATA.

The third strategy leaves control of the window in the hands of the
application. The API provided by a source implementation for this
could allow the application to control the window in terms of APDUs and
to manually step the window. This gives a form of Application Level
Framing (ALF). In this case, TXW_MAX_RTE is the cumulative transmit
rate of SPMs, ODATA, and RDATA.

Happily, everything else in this section is a LOT easier to explain than
the transmit window.

3.4. Receive Window

The receive window at the receivers is determined entirely by PGM
packets from the source. That is, a receiver simply obeys what the
source tells it in terms of window state and advancement.

For a given transport session identified by a TSI, a receiver maintains:

   RXW_TRAIL   the sequence number defining the trailing edge of the
               receive window, the sequence number (known from data
               packets and SPMs) of the oldest data packet available
               for repair from the source

   RXW_LEAD    the sequence number defining the leading edge of the
               receive window, the greatest sequence number of any
               received data packet

The receive window is the range of sequence numbers a receiver is
expected to use to identify receivable ODATA.

A data packet is described as being "in" the receive window if its
sequence number is in the receive window.

The receive window is advanced by the receiver when it receives an SPM
or ODATA packet within the transmit window that increments RXW_TRAIL.
Receivers also advance their receive windows upon receipt of any PGM
data packet within the receive window that advances the receive window.

3.5.
Source Path State

To establish the repair state required to constrain RDATA, it's
essential that NAKs return from a receiver to a source on the reverse of
the distribution tree from the source. That is, they must return
through the same sequence of PGM network elements through which the
ODATA was forwarded, but in reverse. There are two reasons for this,
the less obvious one being by far the more important one.

The first and obvious reason is that RDATA is forwarded on the same path
as ODATA and so repair state must be established on this path if it is
to constrain the propagation of RDATA.

The second and less obvious reason is that in the absence of repair
state, PGM network elements do NOT forward RDATA, so the default
behaviour is to discard repairs. If repair state is not properly
established for interfaces on which ODATA went missing, then receivers
on those interfaces will continue to NAK for lost data and ultimately
experience unrecoverable data loss.

The principal function of SPMs is to provide the source path state
required for PGM network elements to forward NAKs from one PGM network
element to the next on the reverse of the distribution tree for the TSI,
establishing repair state each step of the way. This source path state
is simply the address of the upstream PGM network element on the reverse
of the distribution tree for the TSI. That upstream PGM network element
may be more than one subnet hop away. SPMs establish the identity of
the upstream PGM network element on the distribution tree for each TSI
in each group in each PGM network element, a sort of virtual PGM
topology. So although NAKs are unicast addressed, they are NOT unicast
routed by PGM network elements in the conventional sense. Instead PGM
network elements use the source path state established by SPMs to direct
NAKs PGM-hop-by-PGM-hop toward the source.
The idea is to constrain NAKs to the pure PGM topology spanning the more
heterogeneous underlying topology of both PGM and non-PGM network
elements.

The result is repair state in every PGM network element between the
receiver and the source so that the corresponding RDATA is never
discarded by a PGM network element for lack of repair state.

SPMs also maintain transmit window state in receivers by advertising the
trailing and leading edges of the transmit window (SPM_TRAIL and
SPM_LEAD). In the absence of data, SPMs may be used to close the
transmit window in time by advancing the transmit window until SPM_TRAIL
and SPM_LEAD are equal.

3.6. Packet Contents

This section just provides enough short-hand to make the Procedures
intelligible. For the full details of packet contents, please refer to
Packet Formats below.

3.6.1. Source Path Messages

3.6.1.1. SPMs

SPMs are transmitted by sources to establish source-path state in PGM
network elements, and to provide transmit-window state in receivers.

SPMs are multicast to the group and contain:

   SPM_TSI     the source-assigned TSI for the session to which the SPM
               corresponds

   SPM_SQN     a sequence number assigned sequentially by the source in
               unit increments and scoped by SPM_TSI

      NOTA BENE: this is an entirely separate sequence from the one
      used to number ODATA and RDATA.

   SPM_TRAIL   the sequence number defining the trailing edge of the
               source's transmit window (TXW_TRAIL)

   SPM_LEAD    the sequence number defining the leading edge of the
               source's transmit window (TXW_LEAD)

   SPM_PATH    the network-layer address (NLA) of the interface on the
               PGM network element on which the SPM is forwarded

3.6.2. Data Packets

3.6.2.1. ODATA - Original Data

ODATA packets are transmitted by sources to send application data to
receivers.

ODATA packets are multicast to the group and contain:

   OD_TSI     the globally unique source-assigned TSI

   OD_TRAIL   the sequence number defining the trailing edge of the
              source's transmit window (TXW_TRAIL)

              OD_TRAIL makes the protocol more robust in the face of
              lost SPMs. By including the trailing edge of the
              transmit window on every data packet, receivers that have
              missed any SPMs that advanced the transmit window can
              still detect the case, recover the application, and
              potentially resynchronize to the transport session.

   OD_SQN     a sequence number assigned sequentially by the source in
              unit increments and scoped by OD_TSI

3.6.2.2. RDATA - Repair Data

RDATA packets are repair packets transmitted by sources or DLRs in
response to NAKs.

RDATA packets are multicast to the group and contain:

   RD_TSI     OD_TSI of the ODATA packet for which this is a repair

   RD_TRAIL   the sequence number defining the trailing edge of the
              source's transmit window (TXW_TRAIL), not necessarily the
              same as OD_TRAIL of the ODATA packet for which this is a
              repair

   RD_SQN     OD_SQN of the ODATA packet for which this is a repair

3.6.3. Negative Acknowledgements

3.6.3.1. NAKs - Negative Acknowledgments

NAKs are transmitted by receivers to request repairs for missing data
packets.

NAKs are unicast (PGM-hop-by-PGM-hop) to the source and contain:

   NAK_TSI   OD_TSI of the ODATA packet for which a repair is
             requested

   NAK_SQN   OD_SQN of the ODATA packet for which a repair is
             requested

   NAK_SRC   the unicast NLA of the original source of the missing
             ODATA

   NAK_GRP   the multicast group NLA

3.6.3.2. NNAKs - Null Negative Acknowledgments

NNAKs are transmitted by a DLR that receives NAKs redirected to it by
either receivers or network elements to provide flow-control feed-back
to a source.

NNAKs are unicast (PGM-hop-by-PGM-hop) to the source and contain:

   NNAK_TSI   NAK_TSI of the corresponding re-directed NAK.

   NNAK_SQN   NAK_SQN of the corresponding re-directed NAK.

   NNAK_SRC   NAK_SRC of the corresponding re-directed NAK.

   NNAK_GRP   NAK_GRP of the corresponding re-directed NAK.

3.6.4. Negative Acknowledgement Confirmations

3.6.4.1. NCFs - NAK confirmations

NCFs are transmitted by network elements and sources in response to
NAKs.

NCFs are multicast to the group and contain:

   NCF_TSI   NAK_TSI of the NAK being confirmed

   NCF_SQN   NAK_SQN of the NAK being confirmed

   NCF_SRC   NAK_SRC of the NAK being confirmed

   NCF_GRP   NAK_GRP of the NAK being confirmed

3.6.5. Option Encodings

   OPT_FRAGMENT - Fragmentation

   OPT_JOIN     - Late Joining

   OPT_TIME     - Time Stamp

   OPT_RXQ      - Reception Quality Report

   OPT_DROP     - Sequence Number Dropout

   OPT_REDIRECT - Redirect

   OPT_PARITY   - Forward Error Correction

4. Procedures - General

Since SPMs, NCFs, and RDATA must be treated conditionally by PGM network
elements, they must be distinguished from other packets in the chosen
multicast network protocol if PGM network elements are to extract them
from the usual switching path.

The most obvious way for network elements to achieve this is to examine
every packet in the network for the PGM transport protocol and packet
types. However, the overhead of this approach is costly for
high-performance, multi-protocol network elements. An alternative, and
a requirement for PGM over IP multicast, is that SPMs, NCFs, and RDATA
must be transmitted with the IP Router Alert Option [6]. This option
gives network elements a network-layer indication that a packet should
be extracted from IP switching for more detailed processing.

5. Procedures - Sources

5.1.
Data Transmission

Since PGM relies on a purely rate-limited transmission strategy in the
source to bound the bandwidth consumed by PGM transport sessions, an
assortment of techniques is assembled here to make that strategy as
conservative and robust as possible. These techniques are the minimum
required of a PGM source, and others may be added as experience
dictates.

5.1.1. Maximum Cumulative Transmit Rate

A source must number ODATA packets in the order in which they are
submitted for transmission by the application. A source must transmit
ODATA packets in sequence and only within the transmit window beginning
with TXW_TRAIL at no greater a rate than TXW_MAX_RTE.

In the advance with data strategy, TXW_MAX_RTE is the maximum cumulative
transmit rate of SPMs, ODATA, and RDATA. The reason for calculating
TXW_MAX_RTE in this way is so that the aggregate bandwidth remains
within TXW_MAX_RTE.

In the advance with time strategy, TXW_MAX_RTE is the maximum cumulative
transmit rate of SPMs and ODATA only. The assumption in calculating
TXW_MAX_RTE in this way is that delivery at a constant rate is the main
concern.

Other transmission strategies may define TXW_MAX_RTE as appropriate for
the implementation.

5.1.2. Transmit Rate Regulation

To regulate its transmit rate, a source must use a token bucket scheme
or any other traffic management scheme that yields equivalent behaviour.
A token bucket [7] is characterized by a continually sustainable data
rate (the token rate) and the extent to which the data rate may exceed
the token rate for short periods of time (the token bucket size). Over
any arbitrarily chosen interval, the number of bytes the source may
transmit cannot exceed the token bucket size plus the product of the
token rate and the chosen interval.
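
The token bucket bound described above can be sketched as follows. This
is only an illustration under stated assumptions, not a required
implementation: the class name TokenBucket and its method try_send are
hypothetical, the clock is passed in explicitly so that the behaviour is
deterministic, and the bucket is assumed to start full.

```python
class TokenBucket:
    """Illustrative token bucket; names and interface are assumptions."""

    def __init__(self, token_rate: float, bucket_size: float, now: float = 0.0):
        self.token_rate = token_rate    # sustainable rate in bytes per second
        self.bucket_size = bucket_size  # largest permitted burst in bytes
        self.tokens = bucket_size       # assumed to start with a full bucket
        self.last = now                 # time of the previous refill

    def try_send(self, nbytes: int, now: float) -> bool:
        """Consume tokens for a packet of `nbytes`; False means hold it."""
        elapsed = max(0.0, now - self.last)
        # Refill at the token rate, never beyond the bucket size.
        self.tokens = min(self.bucket_size,
                          self.tokens + elapsed * self.token_rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Illustrative values: 16384 bytes/second with a burst of two 1500-byte packets.
bucket = TokenBucket(token_rate=16384, bucket_size=3000)
print(bucket.try_send(1500, now=0.0))   # first packet of the burst
print(bucket.try_send(1500, now=0.0))   # second packet of the burst
print(bucket.try_send(1500, now=0.0))   # bucket is empty, packet is held
print(bucket.try_send(1500, now=0.1))   # enough tokens have accumulated
```

Because tokens never exceed the bucket size and accrue only at the token
rate, the bytes admitted over any interval cannot exceed the bucket size
plus the token rate multiplied by that interval.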

In addition, a source must bound the maximum rate at which successive
packets may be transmitted using a leaky bucket scheme drained at a
maximum transmit rate, or equivalent mechanism.

5.1.3. TPDU Ordering

To preserve the logic of PGM's transmit window, a source must implement
strict priority queueing of pending SPMs, pending RDATA, and pending
ODATA from three separate queues in that order, or implement any
mechanism that results in equivalent behaviour.

5.1.4. Ambient SPMs

Interleaved with ODATA and RDATA, a source must transmit SPMs at a rate
at least sufficient to maintain current source path state in PGM network
elements. Note that source path state in network elements does not
track underlying changes in the distribution tree from a source until an
SPM traverses the altered distribution tree. The consequence is that
NAKs may go unconfirmed both at receivers and amongst network elements
while changes in the underlying distribution tree take place.

5.1.5. Heartbeat SPMs

In the absence of data to transmit, a source should transmit SPMs at a
decaying rate in order to assist early detection of lost data, to
maintain current source path state in PGM network elements, and to
maintain current receive window state in the receivers.

In this scheme [8], a source maintains an inter-heartbeat timer IHB_TMR
which times the interval between the most recent packet (ODATA, RDATA,
or SPM) transmission and the next heartbeat transmission. IHB_TMR is
initialized to a minimum interval IHB_MIN after the transmission of any
data packet. If IHB_TMR expires, the source transmits a heartbeat SPM
and initializes IHB_TMR to double its previous value. The transmission
of consecutive heartbeat SPMs doubles IHB_TMR each time up to a maximum
interval IHB_MAX. The transmission of any data packet initializes
IHB_TMR to IHB_MIN once again.
The effect is to provoke prompt detection of missing packets in the
absence of data to transmit, and to do so with minimal bandwidth
overhead.

5.1.6. Ambient and Heartbeat SPMs

Ambient and heartbeat SPMs are described as driven by separate timers in
this specification to highlight their contrasting functions. Ambient
SPMs are driven by a count-down timer that expires regularly while
heartbeat SPMs are driven by a count-down timer that keeps being reset
by data, and the interval of which changes once it begins to expire.
The first timer is just counting down in real-time while the second is
measuring the inter-data-packet interval.

In the presence of data, no heartbeat SPMs will be transmitted since the
transmission of data keeps setting the IHB_TMR back to its initial
value. At the same time however, ambient SPMs must be interleaved into
the data as a matter of course, not necessarily as a heartbeat
mechanism. This ambient transmission of SPMs is required to keep the
distribution tree information in the network current and to allow new
receivers to synchronize with the session.

It is in the interest of an implementation to de-couple ambient and
heartbeat SPM timers sufficiently to permit them to be configured
independently of each other.

5.2. Negative Acknowledgement Confirmation

A source must immediately multicast an NCF in response to any NAK it
receives. The NCF is required since the alternative of responding
immediately with RDATA would not allow other PGM network elements on the
same subnet to do NAK anticipation, nor would it allow DLRs on the same
subnet to provide repairs. The generation of NCFs should be
rate-limited to protect against a denial of service in the presence of a
NAK storm.

5.3.
Repairs

A source must then multicast RDATA (while respecting TXW_MAX_RTE) in
response to any NAK it receives for data packets within the transmit
window. A source should transmit RDATA at priority over concurrent
ODATA. The effect of this priority is to back off the transmission of
ODATA in favour of RDATA.

Note that work in progress is looking at algorithms for delaying RDATA
transmission, to make the overall repair strategy more efficient.
Implementations should not preclude a delay being introduced before
RDATA transmission.

5.4. Transmit Window Advance

5.4.1. Advancing across the Increment Window

In anticipation of advancing the transmit window, the source starts a
timer TXW_ADV_IVL_TMR which runs for time period TXW_ADV_IVL.
TXW_ADV_IVL has a value in the range (0, TXW_ADV_SECS). The value may
be configurable or may be determined statically by the strategy used for
advancing the transmit window.

When TXW_ADV_IVL_TMR is running, a source may reset TXW_ADV_IVL_TMR if
NAKs are received for packets in the increment window. In addition, a
source may transmit RDATA in the increment window with priority over
other data within the transmit window.

When TXW_ADV_IVL_TMR expires, a source should advance the trailing edge
of the transmit window from TXW_TRAIL to TXW_INC.

Once the transmit window is advanced across the increment window,
SPM_TRAIL, OD_TRAIL and RD_TRAIL are set to the new value of TXW_TRAIL
in all subsequent transmitted packets, until the next window
advancement.

PGM does not constrain the strategies that a source may use for
advancing the transmit window. The source may implement any scheme or
number of schemes. Three suggested strategies are outlined below.

5.4.2.
Advancing with Data

In the first strategy, TXW_MAX_RTE is calculated from SPMs and both
ODATA and RDATA, and NAKs reset TXW_ADV_IVL_TMR. In this mode of
operation the transmit window maintains the last TXW_BYTES bytes of data
for repair. That is, it maintains the theoretical maximum amount of
data that could be transmitted in the time period TXW_SECS. This means
that the following timers are not treated as real-time timers; instead
they are "data driven". That is, they expire when the amount of data
that could be sent in the time period they define is sent. They are the
SPM ambient time interval, TXW_ADV_SECS, TXW_SECS, TXW_ADV_IVL,
TXW_ADV_IVL_TMR and the join interval. Note that the SPM heartbeat
timers still run in real-time.

While TXW_ADV_IVL_TMR is running, a source uses the receipt of a NAK for
ODATA within the increment window to reset timer TXW_ADV_IVL_TMR to
TXW_ADV_IVL so that transmit window advancement is delayed until no NAKs
for data in the increment window are seen for TXW_ADV_IVL seconds. If
the transmit window should fill in the meantime, further transmissions
would be suspended until the transmit window can be advanced.

A source must advance the transmit window across the increment window
only upon expiry of TXW_ADV_IVL_TMR.

This mode of operation is intended for non-real-time, messaging
applications based on the receipt of complete data at the expense of
delay.

5.4.3. Advancing with Time

This strategy advances the transmit window in real-time. In this mode
of operation, TXW_MAX_RTE is calculated from SPMs and ODATA only to
maintain a constant data throughput rate by consuming extra bandwidth
for repairs. TXW_ADV_IVL has the value 0 which advances the transmit
window without regard for whether NAKs for data in the increment window
are still being received.

In this mode of operation, all timers are treated as real-time timers.

This mode of operation is intended for real-time, streaming applications
based on the receipt of timely data at the expense of completeness.

5.4.4. Advancing under explicit application control

Some applications may wish more explicit control of the transmit window
than that provided by the advance with data / time strategies above. An
implementation may provide this mode of operation and allow an
application to explicitly control the window in terms of APDUs.

6. Procedures - Receivers

6.1. Data Reception

Initial data reception

A receiver should initiate data reception beginning with the first data
packet it receives within the advertised transmit window. This packet's
sequence number (OD_SQN) temporarily defines the trailing edge of the
transmit window from the receiver's perspective. That is, it is
assigned to RXW_TRAIL_INIT within the receiver, and until the trailing
edge sequence number advertised in subsequent packets (SPMs or ODATA or
RDATA) increments through RXW_TRAIL_INIT, the receiver must only request
repairs for sequence numbers subsequent to RXW_TRAIL_INIT. Thereafter,
it may request repairs anywhere in the transmit window. This temporary
restriction on repair requests prevents receivers from requesting a
potentially large amount of history when they first begin to receive a
given PGM transport session.

Note that the JOIN option, discussed later, can be used to provide a
different value for RXW_TRAIL_INIT.

Receiving and discarding data packets

Within a given transport session, a receiver must receive any ODATA or
RDATA packets within the receive window. A receiver must discard any
data packet that duplicates one already received in the transmit window.
A receiver must discard any data packet outside of the receive window.
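
The receive and discard rules above can be sketched as a small
classifier. This is a hedged illustration only: the class name
ReceiveWindow, the method on_data, and the returned strings are
hypothetical. The sketch also assumes that a packet is "outside" the
receive window when it is older than RXW_TRAIL or more than half the
sequence space less one ahead of it, an interpretation drawn from the
limit on the size of the transmit window rather than from an explicit
rule in this section.

```python
SQN_SPACE = 2 ** 32
MAX_WINDOW = SQN_SPACE // 2 - 1   # upper bound on the transmit window size


class ReceiveWindow:
    """Illustrative receiver state for one TSI; names are assumptions."""

    def __init__(self, rxw_trail: int):
        self.rxw_trail = rxw_trail   # RXW_TRAIL
        self.rxw_lead = None         # RXW_LEAD, unset until data arrives
        self.received = set()        # sequence numbers already held

    def _offset(self, sqn: int) -> int:
        """Forward distance from RXW_TRAIL to `sqn`, modulo 2**32."""
        return (sqn - self.rxw_trail) % SQN_SPACE

    def on_data(self, sqn: int) -> str:
        """Classify an ODATA or RDATA packet by its sequence number."""
        if self._offset(sqn) > MAX_WINDOW:
            return "outside"         # discard: not in the receive window
        if sqn in self.received:
            return "duplicate"       # discard: already received
        self.received.add(sqn)
        if self.rxw_lead is None or self._offset(sqn) > self._offset(self.rxw_lead):
            self.rxw_lead = sqn      # greatest sequence number received so far
        return "accepted"


rxw = ReceiveWindow(rxw_trail=10)
for sqn in (10, 12, 12, 9):
    print(sqn, rxw.on_data(sqn))
```

A set is used here only for brevity; an implementation would normally
prune entries as RXW_TRAIL advances.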

Contiguous data

Contiguous data comprises those data packets within the receive window
that have been received and are in the range from RXW_TRAIL up to (but
not including) the first missing sequence number in the receive window.
The most recently received data packet of contiguous data defines the
leading edge of contiguous data.

As its default mode of operation, a receiver must deliver only
contiguous data packets to the application, and it must do so in the
order defined by those data packets' sequence numbers. This provides
applications with a reliable ordered data flow.

Non contiguous data

PGM receiver implementations may optionally provide a mode of operation
in which data is delivered to an application in the order received.
However, the implementation must only deliver complete application
protocol data units (APDUs) to the application. That is, APDUs that
have been fragmented into different TPDUs must be reassembled before
delivery to the application.

6.2. Source Path Messages

Receivers must receive and sequence SPMs for any TSI they are receiving.
For each TSI, receivers must use the most recent SPM to determine the
NLA of the upstream PGM network element for use in NAK addressing. Note
that a receiver cannot initiate repair requests until it has received at
least one SPM for the corresponding TSI.

6.3. Negative Acknowledgment

Detecting missing data packets

Receivers must detect gaps in the expected data sequence by comparing
the sequence number on the most recently received ODATA or RDATA packet
with the leading edge of contiguous data. If the receiver has not
received all intervening data packets, it must initiate selective NAK
generation for each intervening missing sequence number.
Receivers should temper the initiation of NAK generation to account for
simple mis-ordering introduced by the network.

Receivers must also detect gaps in the expected data sequence by
comparing SPM_LEAD of the most recently received SPM with the leading
edge of contiguous data. If the receiver has not received all
intervening data packets, it must initiate selective NAK generation for
each missing sequence number.

Generating NAKs

NAK generation requires that a receiver listen to NCFs and NAKs for the
same transport session.

NAK generation also requires that a receiver observe four time out
intervals for any given NAK (i.e., per NAK_TSI and NAK_SQN).

The first time out interval, the NAK random back-off interval
NAK_RB_IVL, randomly delays the transmission of a given NAK from a
receiver. NAK_RB_IVL is counted down from the time a missing data
packet is detected. Expiry of NAK_RB_IVL causes NAK transmission. NAK
transmission is defined as sending a unicast NAK to the PGM upstream
neighbour and a multicast NAK with TTL=1.

The second time out interval, the NAK repeat interval NAK_RPT_IVL,
limits the length of time for which a receiver will repeat a NAK while
waiting for a corresponding NCF. NAK_RPT_IVL is counted down from the
transmission of a NAK. Expiry of NAK_RPT_IVL cancels NAK generation and
indicates unrecoverable data loss (due to missing NCF).

The third time out interval, the NAK RDATA interval NAK_RDATA_IVL,
limits the length of time for which a receiver will wait for the RDATA
corresponding to a confirmed NAK. NAK_RDATA_IVL is counted down from
the time a matching NCF is received. Expiry of NAK_RDATA_IVL causes the
receiver to select a new value of NAK_RB_IVL, and start again.
The fourth time out interval, the NAK generation interval NAK_GEN_IVL,
limits the length of time for which a receiver will retry a NAK while
waiting for the corresponding RDATA. NAK_GEN_IVL is counted down from
the time a missing data packet is detected. Expiry of NAK_GEN_IVL
cancels NAK generation and indicates unrecoverable data loss (due to
missing RDATA).

NAK generation follows the detection of a missing data packet and is
the cycle of waiting for NAK_RB_IVL, listening for matching NCFs or
NAKs, transmitting a NAK if a matching NCF or NAK is not heard,
waiting NAK_RDATA_IVL, and recommencing NAK generation if the matching
data is not received. During NAK_RB_IVL, a NAK is said to be pending.
During NAK_RDATA_IVL, a NAK is said to be outstanding.

Suspending NAK generation

Suspending NAK generation just means waiting for either NAK_RB_IVL or
NAK_RDATA_IVL to pass.

A receiver must suspend NAK generation if a duplicate of the NAK is
already pending from this receiver. A NAK is pending from this
receiver if NAK_RB_IVL for this NAK has been initiated in this
receiver but has not yet passed.

A receiver must suspend NAK generation if a duplicate of the NAK is
already outstanding from this or another receiver. A NAK is
outstanding from this or another receiver if NAK_RDATA_IVL for this
NAK has been initiated in this receiver but has not yet passed.

Backing off NAK transmission

Before transmitting a NAK, a receiver must wait some interval
NAK_RB_IVL chosen randomly and uniformly over NAK_BO_IVL during which
it listens for a matching NAK that may have been transmitted by
another receiver or a matching NCF that may be transmitted in response
to the same NAK from another receiver.

When a receiver has to transmit a sequence of NAKs, it should transmit
the NAKs in order from oldest to newest.
The receiver should pace the NAK sequence so as not to cause a NAK
storm on the network.

NAK suppression

A receiver must suspend NAK generation and wait at least NAK_RDATA_IVL
before recommencing NAK generation if it hears a matching NCF or NAK
during NAK_RB_IVL. A matching NCF must match NCF_TSI with NAK_TSI, and
NCF_SQN with NAK_SQN.

Transmitting a NAK

Upon expiry of NAK_RB_IVL, a receiver must transmit a NAK to the
upstream PGM network element for the TSI specifying the transport
session identifier and missing sequence number. It must repeat the NAK
at a rate of NAK_RPT_RTE for an interval of NAK_RPT_IVL until it
receives a matching NCF. It must then wait NAK_RDATA_IVL before
recommencing NAK generation. If it hears a matching NCF during
NAK_RDATA_IVL, it must wait anew for NAK_RDATA_IVL before recommencing
NAK generation (i.e., NCFs restart NAK_RDATA_IVL).

Receivers should transmit NAKs for data packets in the increment
window with priority over NAKs for data packets in the remainder of
the receive window.

Completion of NAK generation

NAK generation is complete only upon the reception of the matching
RDATA (or even ODATA) packet at any time during NAK generation.

Cancellation of NAK generation

NAK generation is canceled upon the advancing of the receive window so
as to exclude the matching sequence number of a pending or outstanding
NAK, or the expiry of NAK_GEN_IVL. Cancellation of NAK generation
indicates unrecoverable data loss.

Addressing NAKs

A receiver addresses a (unicast) NAK to the upstream PGM network
element for the TSI. In addition, it may optionally multicast a NAK
with TTL=1 to the group. It also records both the address of the
source of the corresponding ODATA and the address of the group in the
NAK header.
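The receiver procedures for NAK generation can be read as a small
state machine kept per NAK_TSI and NAK_SQN. The following Python
sketch is illustrative only and is not part of the protocol
definition. The class, method, and state names are invented for this
example, time is passed in explicitly rather than read from a clock,
and the repetition of the NAK at NAK_RPT_RTE while awaiting an NCF is
left out for brevity.

```python
import enum
import random


class NakState(enum.Enum):
    BACK_OFF = "pending"        # counting down NAK_RB_IVL
    WAIT_NCF = "wait_ncf"       # NAK sent, counting down NAK_RPT_IVL
    WAIT_DATA = "outstanding"   # counting down NAK_RDATA_IVL
    COMPLETE = "complete"       # matching RDATA (or ODATA) received
    CANCELLED = "cancelled"     # unrecoverable data loss


class NakGeneration:
    """Hypothetical repair state for one (NAK_TSI, NAK_SQN) pair."""

    def __init__(self, now, nak_bo_ivl, nak_rpt_ivl, nak_rdata_ivl,
                 nak_gen_ivl, rng=None):
        self.rng = rng or random.Random()
        self.nak_bo_ivl = nak_bo_ivl
        self.nak_rpt_ivl = nak_rpt_ivl
        self.nak_rdata_ivl = nak_rdata_ivl
        # NAK_GEN_IVL is counted down from detection of the missing packet.
        self.gen_deadline = now + nak_gen_ivl
        self.naks_sent = 0
        self._back_off(now)

    def _back_off(self, now):
        # NAK_RB_IVL is chosen randomly and uniformly over NAK_BO_IVL.
        self.state = NakState.BACK_OFF
        self.deadline = now + self.rng.uniform(0, self.nak_bo_ivl)

    def _wait_data(self, now):
        self.state = NakState.WAIT_DATA
        self.deadline = now + self.nak_rdata_ivl

    def _active(self):
        return self.state not in (NakState.COMPLETE, NakState.CANCELLED)

    def on_ncf(self, now):
        # A matching NCF suppresses a pending NAK, confirms a transmitted
        # one, and restarts NAK_RDATA_IVL for an outstanding one.
        if self._active():
            self._wait_data(now)

    def on_nak_heard(self, now):
        # A matching NAK from another receiver suppresses only during back-off.
        if self.state is NakState.BACK_OFF:
            self._wait_data(now)

    def on_data(self):
        if self._active():
            self.state = NakState.COMPLETE

    def on_window_advanced_past(self):
        if self._active():
            self.state = NakState.CANCELLED

    def on_tick(self, now):
        """Advance the timers; return True if a NAK must be transmitted now."""
        if not self._active():
            return False
        if now >= self.gen_deadline:
            self.state = NakState.CANCELLED      # NAK_GEN_IVL expired
            return False
        if now < self.deadline:
            return False
        if self.state is NakState.BACK_OFF:      # NAK_RB_IVL expired
            self.state = NakState.WAIT_NCF
            self.deadline = now + self.nak_rpt_ivl
            self.naks_sent += 1
            return True
        if self.state is NakState.WAIT_NCF:      # NAK_RPT_IVL expired, no NCF
            self.state = NakState.CANCELLED
            return False
        self._back_off(now)                      # NAK_RDATA_IVL expired
        return False
```

A caller would create one such object when a gap is detected, feed it
matching NCFs, NAKs, and data as they arrive, and call on_tick()
periodically; a True return is the cue to send the unicast NAK
upstream.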
Receiving NCFs and multicast NAKs

A receiver must discard any NCFs or NAKs it hears for data packets
outside the receive window.

If a receiver hears an NCF or NAK for a data packet in the receive
window for which it has no repair state, it should discard the NCF/NAK
only if it has already received the matching data packet. If it has
not already received the matching data packet, it should wait
NAK_RDATA_IVL and then commence NAK generation itself, beginning with
the random back off procedure.

7. Procedures - Network Elements

7.1. Source Path State

Upon receipt of an SPM, a network element records the Source Path
Address SPM_PATH with the multicast routing information for the TSI.
If the receiving network element is on the same subnet as the
forwarding network element, this address will be the same as the
address of the immediately upstream network element on the
distribution tree for the TSI. If, however, non-PGM network elements
intervene between the forwarding and the receiving network elements,
this address will be the address of the first PGM network element
across the intervening network elements.

The network element then forwards the SPM on each outgoing interface
for that TSI. As it does so, it encodes the network address of the
outgoing interface in SPM_PATH in each copy of the SPM it forwards.

7.2. NAK Confirmation

Network elements must immediately transmit an NCF in response to any
unicast NAK they receive. The NCF must be multicast to the group on
the interface on which the NAK was received.

   NOTA BENE: In order to avoid creating multicast routing state
   for PGM network elements across non-PGM-capable clouds, NCFs
   transmitted by network elements must bear the ODATA source's
   NLA, not the network element's NLA as might be expected.
The generation of NCFs should be rate-limited to protect against a
denial of service in the presence of a NAK storm.

Simultaneously, network elements must establish repair state for the
NAK if such state does not already exist, and add the interface on
which the NAK was received to the corresponding repair interface list
if the interface is not already listed.

7.3. Constrained NAK Forwarding

The NAK forwarding procedures for network elements are quite similar
to those for receivers, but three important differences should be
noted.

First, network elements do NOT back off before forwarding a NAK (i.e.,
there is no NAK_BO_IVL) since the resulting delay of the NAK would
compound with each hop. Note that NAK arrivals will be randomized by
the receivers from which they originate, and this factor in
conjunction with NAK anticipation and elimination will combine to
forestall NAK storms on subnets with a dense network element
population.

Second, network elements do NOT retry confirmed NAKs (i.e., there is
no NAK_GEN_IVL) if RDATA is not seen; they simply discard the repair
state and rely on receivers to re-request the repair. This approach
keeps the repair state in the network elements relatively ephemeral
and responsive to underlying routing changes.

Third, note that ODATA does NOT cancel NAK forwarding in network
elements since it is switched by network elements without
transport-layer intervention.

NAK forwarding requires that a network element listen to NCFs for the
same transport session. NAK forwarding also requires that a network
element observe two time out intervals for any given NAK (i.e., per
NAK_TSI and NAK_SQN).

The first, the NAK repeat interval NAK_RPT_IVL, limits the length of
time for which a network element will repeat a NAK while waiting for a
corresponding NCF.
NAK_RPT_IVL is counted down from the transmission of a NAK. Expiry of
NAK_RPT_IVL cancels NAK forwarding (due to missing NCF).

The second, the NAK RDATA interval NAK_RDATA_IVL, limits the length of
time for which a network element will wait for the corresponding
RDATA. NAK_RDATA_IVL is counted down from the time a matching NCF is
received. Expiry of NAK_RDATA_IVL causes the network element to
discard the corresponding repair state (due to missing RDATA).

During NAK_RPT_IVL, a NAK is said to be pending. During NAK_RDATA_IVL,
a NAK is said to be outstanding.

A network element must forward NAKs only to the upstream PGM network
element for the TSI.

A network element must repeat a NAK at a rate of NAK_RPT_RTE for an
interval of NAK_RPT_IVL until it receives a matching NCF. A matching
NCF must match NCF_TSI with NAK_TSI, and NCF_SQN with NAK_SQN.

Upon reception of the corresponding NCF, network elements must wait at
least NAK_RDATA_IVL for the corresponding RDATA. Receipt of the
corresponding RDATA at any time during NAK forwarding cancels NAK
forwarding and tears down the corresponding repair state in the
network element.

7.4. NAK Elimination

Two NAKs duplicate each other if they bear the same NAK_TSI and
NAK_SQN. Network elements must discard all duplicates of a NAK that is
pending.

Once a NAK is outstanding, network elements must discard all
duplicates of that NAK for NAK_ELIM_IVL. Upon expiry of NAK_ELIM_IVL,
network elements must suspend NAK elimination for that TSI/SQN until
the first duplicate of that NAK is seen after the expiry of
NAK_ELIM_IVL. This duplicate must be forwarded in the usual manner.
Once this duplicate NAK is outstanding, network elements must once
again discard all duplicates of that NAK for NAK_ELIM_IVL, and so on.
NAK_RDATA_IVL must be reset each time a NAK for the corresponding
TSI/SQN is confirmed (i.e., each time NAK_ELIM_IVL is reset).
NAK_ELIM_IVL must be some small fraction of NAK_RDATA_IVL.

NAK_ELIM_IVL acts to balance implosion prevention against repair state
liveness. That is, it results in the elimination of all but at most
one NAK per NAK_ELIM_IVL thereby allowing repeated NAKs to keep the
repair state alive in the PGM network elements.

7.5. NAK Anticipation

An unsolicited NCF is one that is received by a network element when
the network element has no corresponding pending or outstanding NAK.
Network elements must process unsolicited NCFs differently depending
on the interface on which they are received.

If the interface on which an NCF is received is the same interface the
network element would use to reach the upstream PGM network element,
the network element simply establishes repair state for NCF_TSI and
NCF_SQN without adding the interface to the repair interface list, and
discards the NCF. If the repair state already exists, the network
element restarts the NAK_RDATA_IVL and NAK_ELIM_IVL timers and
discards the NCF.

If the interface on which an NCF is received is not the same interface
the network element would use to reach the upstream PGM network
element, the network element does not establish repair state and just
discards the NCF.

Anticipated NAKs permit the elimination of any subsequent matching
NAKs from downstream. Upon establishing anticipated repair state,
network elements must eliminate subsequent NAKs only for a period of
NAK_ELIM_IVL. Upon expiry of NAK_ELIM_IVL, network elements must
suspend NAK elimination for that TSI/SQN until the first duplicate of
that NAK is seen after the expiry of NAK_ELIM_IVL. This duplicate must
be forwarded in the usual manner.
Once this duplicate NAK is outstanding, network elements must once
again discard all duplicates of that NAK for NAK_ELIM_IVL, and so on.
NAK_RDATA_IVL must be reset each time a NAK for the corresponding
TSI/SQN is confirmed (i.e., each time NAK_ELIM_IVL is reset).
NAK_ELIM_IVL must be some small fraction of NAK_RDATA_IVL.

7.6. NAK Shedding

Network elements may implement local procedures for withholding NAK
confirmations for receivers detected to be reporting excessive loss.
The result of these procedures would ultimately be unrecoverable data
loss in the receiver.

7.7. Addressing NAKs

A PGM network element uses the *contained* source and group addresses
to find the source/group multicast routing information, looks up the
corresponding upstream PGM network element's address, uses it to
re-address the (unicast) NAK, and unicasts it on the upstream
interface for the distribution tree for the TSI.

7.8. Constrained RDATA Forwarding

Network elements must maintain repair state for each interface on
which a given NAK is received at least once. Network elements must
then use this list of interfaces to constrain the forwarding of the
corresponding RDATA packet only to those interfaces in the list. An
RDATA packet corresponds to a NAK if it matches NAK_TSI and NAK_SQN.

Network elements must maintain this repair state only until either the
corresponding RDATA is received and forwarded, or NAK_RDATA_IVL passes
after forwarding the most recent instance of a given NAK. Thereafter,
the corresponding repair state must be discarded.

Network elements should discard and not forward RDATA packets for
which they have no repair state. Note that the consequence of this
procedure is that, while it constrains repairs to the interested
sub-set of the network, loss of repair state precipitates further NAKs
from neglected receivers.
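The interplay of NAK elimination, the repair interface list, and
constrained RDATA forwarding can be sketched in Python as follows.
This is a simplified illustration under stated assumptions rather than
a prescribed implementation: the names RepairState and forward_rdata
are hypothetical, interfaces are represented by plain strings, time is
supplied by the caller, and the NAK_RPT_IVL time out on a pending NAK
is not modelled.

```python
class RepairState:
    """Hypothetical repair state for one (NAK_TSI, NAK_SQN) pair."""

    def __init__(self, nak_elim_ivl, nak_rdata_ivl):
        self.nak_elim_ivl = nak_elim_ivl
        self.nak_rdata_ivl = nak_rdata_ivl
        self.interfaces = []      # repair interface list
        self.pending = False      # NAK forwarded, NCF not yet received
        self.elim_until = None    # end of the current NAK_ELIM_IVL
        self.expires = None       # end of the current NAK_RDATA_IVL

    def on_nak(self, iface, now):
        """Record the interface; return True if the NAK must be forwarded."""
        if iface not in self.interfaces:
            self.interfaces.append(iface)
        if self.pending:
            return False          # duplicate of a pending NAK
        if self.elim_until is not None and now < self.elim_until:
            return False          # duplicate within NAK_ELIM_IVL
        self.pending = True       # first NAK, or first after NAK_ELIM_IVL
        return True

    def on_ncf(self, now):
        """A matching NCF arrived from upstream: the NAK is outstanding."""
        self.pending = False
        self.elim_until = now + self.nak_elim_ivl
        self.expires = now + self.nak_rdata_ivl

    def expired(self, now):
        return self.expires is not None and now >= self.expires


def forward_rdata(table, key, now):
    """Return the interfaces on which to forward RDATA; tear down state."""
    state = table.pop(key, None)
    if state is None or state.expired(now):
        return []                 # no repair state: discard the RDATA
    return list(state.interfaces)
```

Calling on_ncf() on a freshly created RepairState, with no prior NAK,
also models the anticipated repair state of NAK Anticipation: the
interface list stays empty while duplicates are eliminated for
NAK_ELIM_IVL.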
8. Packet Formats

All of the packet formats described in this section are
transport-layer headers that must immediately follow the network-layer
header in the packet. Only data packet headers (ODATA and RDATA) may
be followed in the packet by application data. For each packet type,
the source and destination network-layer addresses (NLAs) are
specified in addition to the format and contents of the transport
layer header. Recall from General Procedures that, for PGM over IP
multicast, SPMs, NCFs, and RDATA must also bear the IP Router Alert
Option.

For PGM over IP, the IP protocol number is 113.

In all packets the descriptions of Data-Source Port, Data-Destination
Port, Options, Checksum, Global Source ID (GSI), and TPDU Length are:

Data-Source Port:

   A random port number generated by the source. This port number
   must be unique within the source. Source Port together with
   Global Source ID forms the TSI.

Data-Destination Port:

   A globally well-known port number assigned to the given PGM
   application.

Options:

   This field encodes binary indications of the presence and
   significance of any options. It also directly encodes some
   options.

   bit 0 set => One or more Option Extensions are present

   bit 1 set => One or more Options are network-significant

      Note that this bit is clear when OPT_FRAGMENT and/or OPT_JOIN
      are the only options present.

   bit 6 set => Parity packet for a variable-size transmission group
      (OPT_VAR_SIZE). This can only be present in parity packets,
      i.e. when OPT_PARITY is present

   bit 7 set => Packet is a parity packet (OPT_PARITY)

   All the other options (option extensions) are encoded in
   extensions to the PGM header.

Checksum:

   This field is the usual 1's complement of the 1's complement sum
   of the entire PGM packet including header.
   The checksum does not include a network-layer pseudo header for
   compatibility with network address translation. If the computed
   checksum is zero, it is transmitted as all ones. A value of zero
   in this field means the transmitter generated no checksum.

   Note that if any entity between a source and a receiver modifies
   the PGM header for any reason (such as editing the Previous
   Sequence Number field of OPT_DROP), it must either recompute the
   checksum or clear it. The checksum is mandatory on data packets
   (ODATA and RDATA) that do NOT also have OPT_DROP.

Global Source ID:

   A globally unique source identifier. This ID must not change
   throughout the duration of the transport session. A recommended
   identifier is the low-order 48 bits of the MD5 [9] signature of
   the DNS name of the source. Global Source ID together with
   Data-Source Port forms the TSI.

TPDU Length:

   The length in octets of the PGM packet including the size of the
   header and any options.

The high-order two bits of the Type field encode a version number, 0x0
in this instance. The low-order nibble of the type field encodes the
specific packet type. The intervening two bits (the low-order two bits
of the high-order nibble) are reserved and must be zero.

Within the low-order nibble of the Type field:

   values in the range 0x0 through 0x3 represent SPM-like packets
   (i.e., session-specific, sourced by a source, periodic),

   values in the range 0x4 through 0x7 represent DATA-like packets
   (i.e., data and repairs),

   values in the range 0x8 through 0xB represent NAK-like packets
   (i.e., hop-by-hop reliable NAK forwarding procedures),

   and values in the range 0xC through 0xF represent SPMR-like packets
   (i.e., session-specific, sourced by a receiver, asynchronous).

Address Family Indicators (AFIs) are as specified in [10].

8.1. Source Path Messages

SPMs are sent by a source to establish source path state in network
elements and to provide transmit window state to receivers.

The source NLA of an SPM is the unicast NLA of the entity that
originates the SPM.

The destination NLA of an SPM is a multicast group NLA.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |    Options    |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Global Source ID                   ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...    Global Source ID       |          TPDU Length          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     SPM's Sequence Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Trailing Edge Sequence Number                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Leading Edge Sequence Number                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            NLA AFI            |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Path NLA                     ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+-+
|              Option Extensions when present ...               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+

Source Port:

   SPM_SPORT

   Data-Source Port, together with SPM_GSI forms SPM_TSI

Destination Port:

   SPM_DPORT

   Data-Destination Port

Type:

   SPM_TYPE = 0x00

Global Source ID:

   SPM_GSI

   Together with SPM_SPORT forms SPM_TSI

SPM's Sequence Number:

   SPM_SQN

   The sequence number assigned to the SPM by the source.
Trailing Edge Sequence Number:

   SPM_TRAIL

   The sequence number defining the current trailing edge of the
   source's transmit window (TXW_TRAIL).

Leading Edge Sequence Number:

   SPM_LEAD

   The sequence number defining the current leading edge of the
   source's transmit window (TXW_LEAD).

Path NLA:

   SPM_PATH

   The NLA of the interface on the network element on which this SPM
   was forwarded. Initialized by a source to the source's NLA,
   rewritten by each PGM network element upon forwarding.

Option Extensions:

   SPMs may bear OPT_JOIN.

8.2. Data Packets

Data packets carry application data from a source or a repairer to
receivers.

ODATA:

   Original data packets transmitted by a source.

RDATA:

   Repairs transmitted by a source or by a designated local repairer
   (DLR) in response to a NAK.

The source NLA of a data packet is the unicast NLA of the entity that
originates the data packet.

The destination NLA of a data packet is a multicast group NLA.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |    Options    |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Global Source ID                   ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...    Global Source ID       |          TPDU Length          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Trailing Edge Sequence Number                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Data Packet Sequence Number                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Option Extensions when present ...               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Data ...
+-+-+- ...

Source Port:

   OD_SPORT, RD_SPORT

   Data-Source Port, together with Global Source ID forms:

   OD_TSI, RD_TSI

Destination Port:

   OD_DPORT, RD_DPORT

   Data-Destination Port

Type:

   OD_TYPE = 0x04
   RD_TYPE = 0x05

Global Source ID:

   OD_GSI, RD_GSI

   Together with Source Port forms:

   OD_TSI, RD_TSI

Trailing Edge Sequence Number:

   OD_TRAIL, RD_TRAIL

   The sequence number defining the current trailing edge of the
   source's transmit window (TXW_TRAIL). In RDATA, this may not be
   the same as OD_TRAIL of the ODATA packet for which it is a repair.

Data Packet Sequence Number:

   OD_SQN, RD_SQN

   The sequence number originally assigned to the ODATA packet by the
   source.

Option Extensions:

   Data packets may bear OPT_FRAGMENT or OPT_DROP (not both)

Data:

   Application data.

8.3. Negative Acknowledgements and Confirmations

NAK:

   Negative Acknowledgements are sent by receivers to request the
   repair of an ODATA packet detected to be missing from the expected
   sequence.

N-NAK:

   Null Negative Acknowledgements are sent by DLRs to provide flow
   control feedback to the source of ODATA for which the DLR has
   provided the corresponding RDATA.

The source NLA of a NAK is the unicast NLA of the entity that
originates the NAK. The source NLA of a NAK is rewritten by each PGM
network element with its own.

The destination NLA of a NAK is initialized by the originator of the
NAK (a receiver) to the unicast NLA of the upstream PGM network
element known from SPMs. The destination NLA of a NAK is rewritten by
each PGM network element with the unicast NLA of the upstream PGM
network element to which this NAK is forwarded. On the final hop, the
destination NLA of a NAK is rewritten by the PGM network element with
the unicast NLA of the original source or the unicast NLA of a DLR.
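The hop-by-hop rewriting of NAK addresses described above can be
pictured with a short Python sketch. It is an illustration only: the
Nak record, the readdress function, and the upstream_by_sg table
(mapping a contained source/group pair to the upstream NLA learned
from SPM_PATH) are hypothetical names, NLAs are shown as dotted-quad
strings, and returning None when no source path state exists is an
assumption made for this example rather than a stated procedure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Nak:
    src_nla: str   # network-layer source for the current hop
    dst_nla: str   # network-layer destination for the current hop
    nak_src: str   # NAK_SRC: NLA of the original ODATA source (in header)
    nak_grp: str   # NAK_GRP: multicast group NLA (in header)
    nak_sqn: int   # NAK_SQN: requested sequence number


def readdress(nak, own_nla, upstream_by_sg):
    """Return the copy of `nak` that this network element unicasts upstream.

    The lookup uses the addresses contained in the NAK header, never the
    network-layer addresses of the arriving packet.
    """
    upstream = upstream_by_sg.get((nak.nak_src, nak.nak_grp))
    if upstream is None:
        return None  # assumption: without source path state, do not forward
    # Source becomes this element's own NLA; destination becomes the
    # upstream PGM network element (or the source or DLR on the final hop).
    return replace(nak, src_nla=own_nla, dst_nla=upstream)
```

Because a source initializes SPM_PATH to its own NLA, the same lookup
yields the source's address on the final hop without a special case.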
NCF:

   NAK Confirmations are sent by network elements and sources to
   confirm the receipt of a NAK.

The source NLA of an NCF is the ODATA source's NLA, not the network
element's NLA as might be expected.

The destination NLA of an NCF is a multicast group NLA.

Note that in NAKs and N-NAKs, unlike the other packets, the field
SPORT contains the Data-Destination port and the field DPORT contains
the Data-Source port. As a general rule, the content of SPORT/DPORT is
determined by the direction of the flow: in packets which travel
downstream SPORT is the port number chosen in the data source
(Data-Source Port) and DPORT is the data destination port number
(Data-Destination Port). The opposite holds for packets which travel
upstream. This makes DPORT the protocol endpoint in the recipient
host, regardless of the direction of the packet.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |    Options    |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Global Source ID                   ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...    Global Source ID       |          TPDU Length          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Requested Sequence Number                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            NLA AFI            |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Source NLA                    ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+-+
|            NLA AFI            |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast Group NLA               ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+-+
|              Option Extensions when present ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ...

Source Port:

   NAK_SPORT, NNAK_SPORT

   Data-Destination Port

   NCF_SPORT

   Data-Source Port, together with Global Source ID forms NCF_TSI

Destination Port:

   NAK_DPORT, NNAK_DPORT

   Data-Source Port, together with Global Source ID forms:

   NAK_TSI, NNAK_TSI

   NCF_DPORT

   Data-Destination Port

Type:

   NAK_TYPE = 0x08
   NNAK_TYPE = 0x09

   NCF_TYPE = 0x0A

Global Source ID:

   NAK_GSI, NNAK_GSI, NCF_GSI

   Together with Data-Source Port forms

   NAK_TSI, NNAK_TSI, NCF_TSI

Requested Sequence Number:

   NAK_SQN, NNAK_SQN

   NAK_SQN is the sequence number of the ODATA packet for which a
   repair is requested.

   NNAK_SQN is the sequence number of the RDATA packet for which a
   repair has been provided by a DLR.

   NCF_SQN

   NCF_SQN is NAK_SQN from the NAK being confirmed.

Source NLA:

   NAK_SRC, NNAK_SRC, NCF_SRC

   The unicast NLA of the original source of the missing ODATA.

Multicast Group NLA:

   NAK_GRP, NNAK_GRP, NCF_GRP

   The multicast group NLA.

Option Extensions:

   NAKs may bear OPT_TIME
   NCFs may bear OPT_REDIRECT

9. Options

PGM specifies several end-to-end options to address specific
application requirements. PGM specifies options to support
fragmentation, late joining, time-stamping, reception quality reports,
sequence number dropout, and redirection.

Options may be appended to PGM packet headers only by their original
transmitters. While they may be interpreted by network elements,
options are neither added nor removed by network elements.

   NOTA BENE: PGM network elements and receivers must pass over
   any options for which they do not have a definition and
   process the packet as though it did not bear those undefined
   options.
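The rule of passing over undefined options can be illustrated with a
short Python sketch that walks a block of option extensions. Several
things here are assumptions rather than statements of this
specification: that multi-octet fields are in network byte order, that
each Option Length counts the whole extension including its Option
Type and Option Length octets (consistent with the 4, 8, and 12 octet
lengths given for the extension formats in this section), and the
function name walk_options itself. The option type values are those
given in the extension formats.

```python
import struct

OPT_LENGTH = 0x00
OPT_FRAGMENT = 0x01
OPT_JOIN = 0x03
OPT_TIME = 0x04
OPT_END = 0x80  # set in the Option Type of the trailing option extension

KNOWN_OPTIONS = {OPT_FRAGMENT, OPT_JOIN, OPT_TIME}


def walk_options(buf):
    """Return {option type: bytes after the type/length octets}.

    Extensions whose type is not in KNOWN_OPTIONS are passed over, so
    the packet is processed as though it did not bear them.
    """
    if len(buf) < 4:
        raise ValueError("truncated OPT_LENGTH")
    opt_type, opt_len, total = struct.unpack_from("!BBH", buf, 0)
    if opt_type != OPT_LENGTH or opt_len != 4 or total > len(buf):
        raise ValueError("malformed OPT_LENGTH")
    found = {}
    offset = 4
    while offset + 2 <= total:
        opt_type, opt_len = struct.unpack_from("!BB", buf, offset)
        if opt_len < 2 or offset + opt_len > total:
            raise ValueError("malformed option extension")
        kind = opt_type & 0x7F            # strip the OPT_END flag
        if kind in KNOWN_OPTIONS:
            found[kind] = bytes(buf[offset + 2:offset + opt_len])
        if opt_type & OPT_END:            # trailing extension reached
            break
        offset += opt_len
    return found
```

For example, an OPT_JOIN payload returned by this function could be
decoded with struct.unpack("!2xI", payload) to recover the Minimum
Sequence Number, the two skipped octets being the unused part of the
first word.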
9.1. Option extension length - OPT_LENGTH

When option extensions are appended to the standard PGM header, the
extensions must be preceded by an option extension length field
specifying the total length of all option extensions.

In addition, the PGM packet length must be incremented by the total
length of all options, and the presence of the options must be encoded
in the Options field of the standard PGM header before the Checksum is
computed.

All network-significant options must be appended before any
exclusively receiver-significant options.

To provide an indication of the end of option extensions, OPT_END
(0x80) must be set in the Option Type field of the trailing option
extension.

9.1.1. OPT_LENGTH - Packet Extension Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |  Total length of all options  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x00

Option Length = 4 octets

Total length of all options

   The total length in octets of all option extensions including
   OPT_LENGTH.

9.2. Fragmentation Option - OPT_FRAGMENT

Fragmentation allows transport-layer entities at a source to break up
application protocol data units (APDUs) into multiple PGM data packets
(TPDUs) to conform with the MTU supported by the network layer. The
fragmentation option may be applied to ODATA and RDATA packets only.

   This option is incompatible with the sequence number dropout
   option since dropout is based upon application-layer
   information available only at the beginning of the APDU.
   Trailing fragments of such packets would not have sufficient
   information to which to apply the drop out algorithm and so
   would pass through filters designed to discard the APDU as a
   whole.

Architecturally, the accumulation of TPDUs into APDUs is applied to
TPDUs that have already been received, duplicate eliminated, and
contiguously sequenced by the receiver. Thus APDUs may be reassembled
across increments of the transmit window.

9.2.1. OPT_FRAGMENT - Packet Extension Contents

OPT_FRAG_OFF   the offset of the fragment from the beginning of the
               APDU

OPT_FRAG_LEN   the total length of the original APDU

9.2.2. OPT_FRAGMENT - Procedures - Sources

A source fragments APDUs into a contiguous series of fragments no
larger than the MTU supported by the network layer. A source
sequentially and uniquely assigns OD_SQNs to these fragments in the
order in which they occur in the APDU. A source then sets OPT_FRAG_OFF
to the value of the offset of the fragment in the original APDU (where
the first byte of the APDU is at offset 0, and OPT_FRAG_OFF numbers
the first byte in the fragment), and sets OPT_FRAG_LEN to the value of
the total length of the original APDU.

9.2.3. OPT_FRAGMENT - Procedures - Receivers

Receivers detect and accumulate fragmented packets until they have
received an entire contiguous sequence of packets comprising an APDU.
This sequence begins with the fragment bearing OPT_FRAG_OFF of 0, and
terminates with the fragment whose length added to its OPT_FRAG_OFF is
OPT_FRAG_LEN.

9.2.4. OPT_FRAGMENT - Procedures - Network Elements

This option is not network-significant.

9.2.5. OPT_FRAGMENT - Packet Extension Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Offset                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Length                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x01

Option Length = 12 octets

Offset

   The offset of the fragment from the beginning of the APDU
   (OPT_FRAG_OFF).

Length

   The total length of the original APDU (OPT_FRAG_LEN).

9.3. Late Joining Option - OPT_JOIN

Late joining allows a source to bound the amount of repair history
receivers may request when they initially join a particular transport
session.

This option indicates that receivers that join a transport session in
progress may request repair of all data as far back as the given
minimum sequence number from the time they join the transport session.
The default is for receivers to receive data only from the first
packet they receive and onward.

9.3.1. OPT_JOIN - Packet Extension Contents

OPT_JOIN_MIN   the minimum sequence number for repair

9.3.2. OPT_JOIN - Procedures - Receivers

If a PGM packet (ODATA, RDATA, or SPM) bears OPT_JOIN, a receiver may
initialize the trailing edge of the receive window (RXW_TRAIL_INIT) to
the given Minimum Sequence Number and proceed with normal data
reception.

9.3.3. OPT_JOIN - Procedures - Network Elements

This option is not network-significant.

9.3.4. OPT_JOIN - Packet Extension Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Minimum Sequence Number                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x03

Option Length = 8 octets

Minimum Sequence Number

   The minimum sequence number defining the initial trailing edge of
   the receive window for a late joining receiver.

9.4. Time Stamp Option - OPT_TIME

Time stamps may be used in conjunction with NAKs to allow receivers to
specify the interval in which the requested RDATA is relevant to them.
That interval is interpreted by both network elements and sources to
determine whether to continue with or abandon a given repair.

9.4.1. OPT_TIME - Packet Extension Contents

OPT_TIME_STAMP   absolute time interval in milliseconds

9.4.2. OPT_TIME - Procedures - Receivers

Receivers may append the Time Stamp option to a NAK to indicate the
absolute interval from the time of transmitting the NAK during which
the receiver can usefully receive the corresponding RDATA.

9.4.3. OPT_TIME - Procedures - Network Elements

Network elements should use the time stamp of a NAK to age the
associated repair state for the specified interval and discard it if
the corresponding RDATA has not already torn it down.

Network elements must eliminate a time-stamped NAK only if its time
stamp is smaller than the remaining time associated with the matching
repair state. Otherwise, such a NAK must be forwarded instead of
eliminated, and its time stamp must be used to replace the time stamp
of existing repair state.

9.4.4. OPT_TIME - Procedures - Sources

A source should abandon any attempt to transmit RDATA in response to a
time stamped NAK if that repair cannot be completed within the
specified interval.

9.4.5. OPT_TIME - Packet Extension Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Time Stamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x04

Option Length = 8 octets

Time Stamp

   Absolute time interval in milliseconds (OPT_TIME_STAMP).

9.5. Reception Quality Option - OPT_RXQ

Reception quality reports may be used in conjunction with NAKs to
allow receivers to provide a reception quality metric to the source.

9.5.1. OPT_RXQ - Packet Extension Contents

OPT_RXQ_METRIC   A reception quality metric defined by a source's
                 local flow- and congestion-control procedures.

9.5.2. OPT_RXQ - Procedures - Receivers

Receivers may append the Reception Quality option to a NAK to indicate
the rate of packet loss detected at the receiver. Receivers must bias
the transmission of NAKs bearing OPT_RXQ by scaling NAK_BO_IVL with
respect to the reception quality metric. That is, as reception quality
deteriorates, NAK_BO_IVL should be reduced, and as reception quality
improves, NAK_BO_IVL should be increased.

The procedures for NAK suppression apply unchanged with the exception
that NAKs bearing OPT_RXQ are only suppressed by other matching NAKs
bearing OPT_RXQ and a worse reception quality metric.

9.5.3.
OPT_RXQ - Procedures - Network Elements 2242 Network elements must eliminate a NAK bearing OPT_RXQ only if its recep- 2243 tion quality metric is larger (worse) than the reception quality metric 2244 associated with the matching repair state. Otherwise, such a NAK must 2245 be forwarded instead of eliminated, and its reception quality metric 2246 must be used to replace the reception quality metric of existing repair 2247 state. 2249 9.5.4. OPT_RXQ - Procedures - Sources 2251 Sources may interpret reception quality reports in a local manner to 2252 adjust their transmission rate. 2254 9.5.5. OPT_RXQ - Packet Extension Format 2256 0 1 2 3 2257 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2258 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2259 | Option Type | Option Length | | 2260 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2261 | Reception Quality Metric | 2262 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2264 Option Type = 0x05 2266 Option Length = 8 octets 2268 Reception Quality Metric 2270 TBD 2272 9.6. Sequence Number Dropout Option - OPT_DROP 2274 Sequence number dropout may be used in conjunction with data packets to 2275 allow sources and network elements to selectively eliminate PGM data 2276 packets and convey the resulting sequence-number discontinuity to 2277 receivers so that sequencing can be preserved across the dropout. 2278 Sequence number dropout is incompatible with the fragmentation option. 2280 This option is incompatible with fragmentation since dropout 2281 is based upon application-layer information available only at 2282 the beginning of the APDU. Trailing fragments of such packets 2283 would not have sufficient information to which to apply the 2284 drop out algorithm and so would be pass through filters 2285 designed to discard the APDU as a whole. 2287 9.6.1. 
OPT_DROP - Packet Extensions Contents 2289 OPT_DROP_PREV the sequence number of the packet that should be regarded 2290 by the receiver as the logical predecessor to the packet 2291 bearing this option 2293 9.6.2. OPT_DROP - Procedures - Sources 2295 On a per-packet basis, a source may selectively permit intermediate 2296 application-layer filters to be applied to a data packet by appending 2297 OPT_DROP to ODATA/RDATA packets and setting the value of OPT_DROP_PREV 2298 to OD_SQN/RD_SQN. 2300 9.6.3. OPT_DROP - Procedures - Network Elements 2302 Network elements may apply intermediate application-layer filters only 2303 to ODATA/RDATA packets bearing OPT_DROP. If such a data packet passes 2304 the filters, it must be forwarded out each interface with OPT_DROP_PREV 2305 set to the value of the sequence number of the highest numbered data 2306 packet within OD_TSI/RD_TSI that has already been forward on that inter- 2307 face. 2309 9.6.4. OPT_DROP - Procedures - Receivers 2311 Receivers must do drop detection on packets bearing OPT_DROP by verify- 2312 ing that they have also received the data packet numbered OPT_DROP_PREV 2313 rather than checking for the numerical predecessor of OD_SQN/RD_SQN. If 2314 a receiver has received OPT_DROP_PREV, then no drop has occurred. If a 2315 receiver has not received OPT_DROP_PREV, then a receiver must NAK only 2316 for OPT_DROP_PREV and no other intervening sequence numbers. 2318 9.6.5. 
OPT_DROP - Packet Extension Format 2320 0 1 2 3 2321 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2322 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2323 | Option Type | Option Length | | 2324 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2325 | Previous Sequence Number | 2326 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2328 Option Type = 0x06 2330 Option Length = 8 octets 2332 Previous Sequence Number 2334 The sequence number of the packet that should be regarded by the 2335 receiver as the logical predecessor to the packet bearing this 2336 option (OPT_DROP_PREV). 2338 9.7. Redirect Option - OPT_REDIRECT 2340 Redirection may be used in conjunction with NCFs to allow a designated 2341 local repairer (DLR) to respond to normal NCFs with a redirecting NCF 2342 advertising its own address as an alternative to the original source. 2343 Recipients of redirecting NCFs may then direct NAKs for subsequent ODATA 2344 sequence numbers to the DLR rather than to the original source. In 2345 addition, DLRs that receive redirected NAKs for which they have RDATA 2346 must send a NULL NAK to provide flow control to the original source 2347 without also provoking a repair from that source. 2349 9.7.1. OPT_REDIRECT - Packet Extensions Contents 2351 OPT_REDIR_NLA the DLR's own unicast network-layer address to which 2352 recipients of the redirecting NCF may direct subsequent 2353 NAKs for the corresponding TSI. 2355 9.7.2. OPT_REDIRECT - Procedures - DLRs 2357 A DLR must receive any PGM sessions for which it wishes to provide a 2358 source of repairs. In addition to acting as an ordinary PGM receiver, a 2359 DLR may then respond to NCFs sourced by neighbouring network elements 2360 (or even by the source itself) by multicasting a repeat of that NCF and 2361 OPT_REDIRECT providing its own network-layer address. 
if, however, this 2362 NCF completes NAK transmission for this DLR, it must not send a 2363 redirecting NCF. 2365 Further, a DLR must act as an ordinary PGM source in responding to any 2366 NAK it receives (i.e., directed to it). That is, it should respond 2367 first with a normal NCF and then RDATA as usual. In addition a DLR that 2368 receives redirected NAKs for which it has RDATA must send a NULL NAK to 2369 provide flow control to the original source. If it cannot provide the 2370 RDATA it forwards the NAK to the upstream PGM neighbour as usual. 2372 NOTA BENE: In order to propagate on exactly the same distribu- 2373 tion tree as ODATA, RDATA packets transmitted by DLRs and 2374 other receivers must bear the ODATA source's NLA, not the 2375 DLR's or the receiver's NLA as might be expected. 2377 9.7.3. OPT_REDIRECT - Procedures - Network Elements 2379 Upon receiving a redirecting NCF, network elements should record the 2380 redirecting information for the TSI, and should redirect subsequent NAKs 2381 for the same TSI to the network address provided in the redirecting NCF 2382 rather than to the PGM neighbour known via the SPMs. Note, however, 2383 that a redirecting NCF is NOT regarded as matching the NAK that provoked 2384 it, so it does not complete the transmission of that NAK. Only a normal 2385 matching NCF can complete the transmission of a NAK. 2387 For subsequent NAKs, if the network element has recorded redirection 2388 information for the corresponding TSI, it may change the destination 2389 network address of those NAKs and attempt to transmit them to the DLR. 2390 If, however, a corresponding NCF is not received from the DLR within 2391 NAK_RPT_IVL, the network element must discard the redirecting informa- 2392 tion for the TSI and re-attempt to forward the NAK towards the PGM 2393 upstream neighbour. 2395 A NULL NAK is forwarded only if matching repair state has not already 2396 been created. 
Network elements must not confirm or retry NULL NAKs and 2397 they must not add the receiving interface to the repair state. If a 2398 NULL NAK is used to initially create repair state, this fact must be 2399 recorded so that any subsequent non-NULL NAK will not be eliminated, but 2400 rather will be forwarded to provoke an actual repair. State created by a 2401 NULL NAK exists only for NAK_ELIM_IVL. 2403 9.7.4. OPT_REDIRECT - Procedures - Receivers 2405 These procedures are intended to be applied in instances where a 2406 receiver's first hop router on the reverse path to the source is not a 2407 PGM Network Element. So, receivers must ignore a redirecting NCF from a 2408 DLR on the same IP subnet that the receiver resides on. 2410 Upon receiving a redirecting NCF, receivers should record the redirect- 2411 ing information for the TSI, and may redirect subsequent NAKs for the 2412 same TSI to the network address provided in the redirecting NCF rather 2413 than to the PGM neighbour for the corresponding ODATA for which the 2414 receiver is requesting repair. Note, however, that a redirecting NCF is 2415 NOT regarded as matching the NAK that provoked it, so it does not com- 2416 plete the transmission of that NAK. Only a normal matching NCF can com- 2417 plete the transmission of a NAK. 2419 For subsequent NAKs, if the receiver has recorded redirection informa- 2420 tion for the corresponding TSI, it may change the destination network 2421 address of those NAKs and attempt to transmit them to the DLR. If, how- 2422 ever, a corresponding NCF is not received within NAK_RPT_IVL, the 2423 receiver must discard the redirecting information for the TSI and re- 2424 attempt to forward the NAK to the PGM neighbour for the original source 2425 of the missing ODATA. 2427 9.7.5. 
OPT_REDIRECT - Packet Extension Format 2429 0 1 2 3 2430 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2431 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2432 | Option Type | Option Length | | 2433 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2434 | NLA AFI | reserved | 2435 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2436 | DLR's NLA ... | 2437 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+-+ 2439 Option Type = 0x07 2441 Option Length = 4 + NLA length 2443 DLR's NLA 2445 The DLR's own unicast network address to which recipients of the 2446 redirecting NCF may direct subsequent NAKs. 2448 10. Security Considerations 2450 In addition to the usual problems of end-to-end authentication, PGM is 2451 vulnerable to a number of security risks that are specific to the 2452 mechanisms it uses to establish source path state, to establish repair 2453 state, to forward NAKs, to identify DLRs, and to distribute repairs. 2454 These mechanisms expose PGM network elements themselves to security 2455 risks since network elements not only switch but also interpret SPMs, 2456 NAKs, NCFs, and RDATA, all of which may legitimately be transmitted by 2457 PGM sources, receivers, and DLRs. Short of full authentication of all 2458 neighbouring sources, receivers, DLRs, and network elements, the proto- 2459 col is not impervious to abuse. 2461 So putting aside the problems of rogue PGM network elements for the 2462 moment, there are enough potential security risks to network elements 2463 associated with sources, receivers, and DLRs alone. These risks include 2464 denial of service through the exhausting of both CPU bandwidth and 2465 memory, as well as loss of (repair) data connectivity through the mud- 2466 dling of repair state. 
False SPMs may cause PGM network elements to mis-direct NAKs intended
for the legitimate source with the result that the requested RDATA
would not be forthcoming.

False NAKs may cause PGM network elements to establish spurious repair
state that will expire only upon time-out and could lead to memory
exhaustion in the meantime.

False NCFs may cause PGM network elements to suspend NAK forwarding
prematurely (or to mis-direct NAKs in the case of redirecting NCFs)
resulting eventually in loss of RDATA.

False RDATA may cause PGM network elements to tear down legitimate
repair state resulting eventually in loss of legitimate RDATA.

The development of precautions for network elements to protect
themselves against incidental or unsophisticated versions of these
attacks is work in progress and includes:

   Damping of jitter in the value of either the source NLA of SPMs or
   the path NLA in SPMs. While the source NLA is expected to change
   seldom, the path NLA is expected to change occasionally as a
   consequence of changes in underlying multicast routing information.

   The extension of NAK shedding procedures to control the volume, not
   just the rate, of confirmed NAKs. In either case, these procedures
   assist network elements in surviving NAK attacks at the expense of
   maintaining service. More efficiently, network elements may use the
   knowledge of TSIs and their associated transmit windows gleaned from
   SPMs to control the proliferation of repair state.

   A three-way handshake between network elements and DLRs that would
   permit a network element to ascertain with greater confidence that
   an alleged DLR is identified by the alleged NLA, and is PGM
   conversant.

11. Appendix A - Forward Error Correction

11.1. Introduction

The following procedures incorporate packet-level Reed Solomon Erasure
correcting techniques as described in [11] and [12] into PGM. This
approach to Forward Error Correction (FEC) is based upon the
computation of h parity packets from k data packets for a total of n
packets such that a receiver can reconstruct the k data packets out of
any k of the n packets. More specifically, it is characteristic of the
parity packets that any x of them can be used to reconstruct any x of
the original k data packets for x less than or equal to k. The
original k data packets are referred to as the Transmission Group, and
the total n packets as the FEC Block.

These procedures permit any combination of pro-active FEC or on-demand
FEC with conventional ARQ within a given TSI to provide any flavour of
layered or integrated FEC. Once provided by a source, the actual use
of FEC or ARQ for loss recovery in the session is entirely at the
discretion of the receivers. Note that receivers may still resort to
selective NAKs even when parity is available, and sources must still
provide selective retransmissions in response. The two approaches can
be used by the same or different receivers in a single transport
session without conflict.

Pro-active FEC refers to the technique of computing parity packets at
transmission time and transmitting them as a matter of course following
the data packets. Pro-active FEC is recommended for providing loss
recovery over simplex or asymmetric multicast channels over which
returning repair requests is either impossible or costly. It provides
increased reliability at the expense of bandwidth.

On-demand FEC refers to the technique of computing parity packets at
repair time and transmitting them only upon demand (i.e., receiver-
based loss detection and repair request). On-demand FEC is recommended
for providing loss recovery of uncorrelated loss in very large receiver
populations in which the probability of any single packet being lost is
substantial. It provides equivalent reliability to selective NAKs
(ARQ) at the expense of no more and typically less bandwidth.

Selective NAKs are NAKs that request the retransmission of specific
packets by sequence number corresponding to the sequence number of any
data packets detected to be missing from the expected sequence
(conventional ARQ). Selective NAKs are recommended for recovering
losses occurring in trailing partial transmission groups.

Parity NAKs are NAKs that request the transmission of a specific number
of parity packets by count corresponding to the count of the number of
data packets detected to be missing from a group of k data packets
(on-demand FEC).

The objective of these procedures is to incorporate these FEC
techniques into PGM so that:

   sources may provide parity packets either pro-actively or on-demand,
   interchangeably within the same TSI,

   receivers may use either selective or parity NAKs interchangeably
   within the same TSI,

   network elements may maintain repair state based on either selective
   or parity NAKs in the same data structure, altering only search,
   RDATA constraint, and deletion algorithms in either case,

   and only OPTION additions to the basic packet formats are required.

11.2. Overview

Advertising FEC parameters in the transport session

Sources add OPT_PARITY_PRM to SPMs to provide session-specific
parameters such as the number of packets (TGSIZE == k) in a
transmission group. This option lets receivers know how many packets
are in a transmission group, and it lets network elements sort repair
state by transmission group number. This option includes an indication
of whether pro-active and/or on-demand parity is available from the
source.

Distinguishing parity packets from data packets

Sources send pro-active parity packets as ODATA and on-demand parity
packets as RDATA. A source must add OPT_PARITY to the ODATA/RDATA
packet header of parity packets to permit network elements and
receivers to distinguish them from data packets.

Data and parity packet numbering

Parity packets must be calculated over a fixed number k of data packets
known as the Transmission Group. Grouping of packets into transmission
groups effectively partitions a packet sequence number into a high-
order portion (TG_SQN) specifying the transmission group (TG), and a
low-order portion (PKT_SQN) specifying the packet number (PKT-NUM in
the range 0 through k-1) within that group. From an implementation
point of view, it is convenient if k, the TG size, is a power of 2. If
so, TG_SQN and PKT_SQN can be mapped side-by-side into the 32-bit SQN,
and log2(TGSIZE) is the size in bits of PKT_SQN.

This mapping does not diminish the effective sequence number space
since parity packets are marked with OPT_PARITY that allows the
sequence space (PKT_SQN) to be reused to number the h parity packets
for as long as h is not greater than k.

In case h is greater than k, a source must add OPT_PARITY_GRP to any
parity packet numbered j greater than k-1 specifying the number m of
the group of k parity packets to which the packet belongs where m is
just the quotient from the integer division of j by k.
Correspondingly, PKT-NUM for such parity packets is just j modulo k.

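As an illustration of the numbering just described, the following
Python sketch splits a 32-bit SQN into TG_SQN and PKT_SQN for a
power-of-2 transmission group size, and derives the OPT_PARITY_GRP
group number m and PKT-NUM for a parity packet numbered j. The
function names are hypothetical and not part of this specification,
and returning TG_SQN shifted down to a standalone integer is an
assumption made for readability.

```python
def split_sqn(sqn, tgsize):
    """Split a 32-bit SQN into (TG_SQN, PKT_SQN).

    Assumes tgsize (k) is a power of 2, so that PKT_SQN occupies the
    low-order log2(tgsize) bits of the SQN.
    """
    if tgsize <= 0 or tgsize & (tgsize - 1):
        raise ValueError("tgsize must be a power of 2")
    pkt_bits = tgsize.bit_length() - 1       # log2(TGSIZE)
    sqn &= 0xFFFFFFFF                        # SQN is 32 bits wide
    return sqn >> pkt_bits, sqn & (tgsize - 1)


def parity_numbering(j, k):
    """Return (m, PKT-NUM) for parity packet number j in a TG of size k.

    m is the value carried in OPT_PARITY_GRP; per the text it is only
    needed when j is greater than k-1 (that is, when m is at least 1).
    """
    return j // k, j % k


if __name__ == "__main__":
    print(split_sqn(0x12345678, 16))   # TG_SQN and PKT_SQN for k = 16
    print(parity_numbering(10, 8))     # group number and PKT-NUM
```

With k = 16 the low 4 bits of the SQN carry PKT_SQN; a parity packet
numbered j = 10 with k = 8 belongs to parity group m = 1 with
PKT-NUM 2.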
Note that parity NAKs (and consequently their corresponding parity
NCFs) must also be distinguished by the addition of OPT_PARITY, and
that in these packets, PKT_SQN contains PKT-CNT, the number of missing
packets, rather than PKT-NUM, the number of a specific missing packet.
Details follow in the procedures below.

Variable TPDU length

If a non-constant TPDU length is used within a given transmission
group, the size of parity packets in the corresponding FEC block must
be equal to the size of the largest original data packet in the block.
Parity packets must be computed over the original packets padded with
zeros up to the size of the largest. Note that original data packets
are transmitted without padding. Receivers that use a combination of
original packets and FEC packets to rebuild missing packets must pad
original packets in the same way as the sender does before feeding the
original packets to the FEC decoder. The decoder produces original
packets padded with zeros up to the size of the largest original packet
in the group. In order to eliminate the padding, the original size of
the packet must be known. This is accomplished as follows:

   The sender, along with the packet payloads, must also encode the
   TPDU lengths and append the 2-byte encoded length to the padded FEC
   packets.

   Receivers which feed the FEC decoder with original packets must also
   append their TPDU length to the packets after padding them and
   before passing them to the decoder.

This way the decoder produces padded original packets with their
original TPDU length appended. Receivers use this length to remove the
padding.

A sender that transmits variable-size packets must take into account
the fact that FEC packets will have a size equal to the maximum size of
the original packets plus the size of the length field (2 bytes).

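The padding and length handling above can be sketched as follows. This
is a minimal illustration only: the Reed Solomon encoder and decoder
are not shown, the function names are hypothetical, and the use of
network byte order for the 2-byte length is an assumption, since the
text does not state the encoding of that field.

```python
import struct


def pad_for_fec(tpdus):
    """Prepare the TPDUs of one transmission group for the FEC coder.

    Each TPDU is zero-padded to the size of the largest TPDU in the
    group and its original length is appended as 2 bytes (network byte
    order is assumed here).
    """
    longest = max(len(t) for t in tpdus)
    if longest > 0xFFFF:
        raise ValueError("TPDU length does not fit in 2 bytes")
    return [
        t + b"\x00" * (longest - len(t)) + struct.pack("!H", len(t))
        for t in tpdus
    ]


def strip_padding(decoded):
    """Recover an original TPDU from a decoder output symbol.

    The last 2 bytes carry the original TPDU length; everything beyond
    that length in the payload is padding.
    """
    (length,) = struct.unpack("!H", decoded[-2:])
    return decoded[:-2][:length]


if __name__ == "__main__":
    symbols = pad_for_fec([b"abc", b"hello"])
    print(symbols)
    print([strip_padding(s) for s in symbols])
```

All symbols handed to the coder then have the same size (largest TPDU
plus 2 bytes), which matches the parity packet size noted above.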
If a fixed packet size is used within a transmission group, the encoded
length is not appended to the parity packets. The presence of the
option OPT_VAR_SIZE in parity packets allows receivers to distinguish
between variable-size transmission groups and fixed-size ones, and
behave accordingly.

11.3. Packet Contents

This section provides just enough short-hand to make the Procedures
intelligible. For the full details of packet contents, please refer to
Packet Formats below.

OPT_PARITY      indicated in pro-active (ODATA) and on-demand (RDATA)
                parity packets to distinguish them from data packets.
                This option is directly encoded in the "Option" field
                of the PGM header

OPT_VAR_SIZE    can be present in pro-active (ODATA) and on-demand
                (RDATA) parity packets to indicate that the
                corresponding transmission group is composed of
                variable size data packets. This option is directly
                encoded in the "Option" field of the PGM header

OPT_PARITY_PRM  appended by sources to SPMs to specify session-specific
                parameters such as the transmission group size and the
                availability of pro-active and/or on-demand parity from
                the source

OPT_PARITY_GRP  the number of the group (greater than 0) of k parity
                packets to which the parity packet belongs when more
                than k parity packets are provided by the source

11.3.1. Parity NAKs

NAK_TG_SQN      the high-order portion of NAK_SQN specifying the
                transmission group for which parity packets are
                requested

NAK_PKT_CNT     the low-order portion of NAK_SQN specifying the number
                of missing data packets for which parity packets are
                requested

11.3.2. Parity NCFs

NCF_TG_SQN      the high-order portion of NCF_SQN specifying the
                transmission group for which parity packets were
                requested

NCF_PKT_CNT     the low-order portion of NCF_SQN specifying the number
                of missing data packets for which parity packets were
                requested

11.3.3. On-demand Parity

RDATA_TG_SQN    the high-order portion of RDATA_SQN specifying the
                transmission group to which the parity packet belongs

RDATA_PKT_SQN   the low-order portion of RDATA_SQN specifying the
                parity packet sequence number within the transmission
                group

11.3.4. Pro-active Parity

ODATA_TG_SQN    the high-order portion of ODATA_SQN specifying the
                transmission group to which the parity packet belongs

ODATA_PKT_SQN   the low-order portion of ODATA_SQN specifying the
                parity packet sequence number within the transmission
                group

11.4. Procedures - Sources

If a source elects to provide parity for a given transport session, it
must first provide the transmission group size PARITY_PRM_TGS in the
OPT_PARITY_PRM option of its SPMs. If a source elects to provide
pro-active parity for a given transport session, it must set
PARITY_PRM_PRO in the OPT_PARITY_PRM option of its SPMs. If a source
elects to provide on-demand parity for a given transport session, it
must set PARITY_PRM_OND in the OPT_PARITY_PRM option of its SPMs.

A source must send any pro-active parity packets for a given
transmission group only after it has first sent all of the
corresponding k data packets in that group. Pro-active parity packets
must be sent as ODATA with OPT_PARITY.

If a source elects to provide on-demand parity, it must respond to a
parity NAK for a transmission group with a parity NCF. The source must
complete the transmission of the k original data packets and the
pro-active parity packets, possibly scheduled, before starting the
transmission of on-demand parity packets. Subsequently, the source
must send the number of parity packets requested by that parity NAK.
On-demand parity packets must be sent as RDATA with OPT_PARITY.
Previously transmitted pro-active parity packets cannot be reused as
on-demand parity packets; these must be computed with new, previously
unused, indexes.

In either case, the source must be prepared to also respond to
selective NAKs in the usual way.

In the absence of data to transmit, a source should pad out the
transmission group with padded packets before calculating and providing
parity packets either pro-actively or on demand.

A source may consolidate requests for on-demand parity in the same
transmission group according to the following procedures. If the
number of pending (i.e., unsent) parity packets from a previous request
for on-demand parity packets is equal to or greater than NAK_PKT_CNT in
a subsequent NAK, that subsequent NAK must be confirmed but may
otherwise be ignored. If the number of pending (i.e., unsent) parity
packets from a previous request for on-demand parity packets is less
than NAK_PKT_CNT in a subsequent NAK, that subsequent NAK must be
confirmed but the source need only increase the number of pending
parity packets to NAK_PKT_CNT.

When a source provides parity packets for a variable-size transmission
group, it must compute parity packets over the padded original packets,
must append the encoded TPDU lengths, and must add the OPT_VAR_SIZE
option as specified in the overview description.

11.5. Procedures - Receivers

If a receiver elects to make use of parity packets for loss recovery,
it must first learn the transmission group size PARITY_PRM_TGS from
OPT_PARITY_PRM in the SPMs for the TSI. The transmission group size is
used by a receiver to determine the sequence number boundaries between
transmission groups.

Thereafter, if PARITY_PRM_PRO is also set in the SPMs for the TSI, a
receiver may use any pro-active parity packets it receives for loss
recovery, and if PARITY_PRM_OND is also set in the SPMs for the TSI, it
may solicit on-demand parity packets upon loss detection. Parity
packets are ODATA (pro-active) or RDATA (on-demand) packets
distinguished by OPT_PARITY which lets receivers know that
ODATA/RDATA_TG_SQN identifies the group of PARITY_PRM_TGS packets to
which the parity may be applied for loss recovery in the corresponding
transmission group, and that ODATA/RDATA_PKT_SQN is being reused to
number the parity packets within that group. Receivers order parity
packets and eliminate duplicates within a transmission group based on
ODATA/RDATA_PKT_SQN and on OPT_PARITY_GRP if present.

To solicit on-demand parity packets, a receiver must send parity NAKs
upon loss detection. For the purposes of soliciting on-demand parity,
loss detection occurs at transmission group boundaries, i.e., upon
receipt of the last data packet in a transmission group, upon receipt
of any data packet in any subsequent transmission group, or upon
receipt of any parity packet in the current or a subsequent
transmission group.

A parity NAK is simply a NAK with OPT_PARITY and NAK_PKT_CNT set to the
count of the number of packets detected to be missing from the
transmission group specified by NAK_TG_SQN. Note that this constrains
the receiver to request no more parity packets than there are data
packets in the transmission group.

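As a rough illustration of how a receiver might derive the fields of a
parity NAK at a transmission group boundary, consider the Python sketch
below. The function name and the dictionary representation are
hypothetical; the sketch returns NAK_TG_SQN and NAK_PKT_CNT as separate
values rather than packing them into NAK_SQN, and it ignores sequence
number wrap-around for simplicity.

```python
def parity_nak_fields(received_sqns, tg_sqn, tgsize):
    """Return the fields of a parity NAK for one transmission group.

    received_sqns: set of data packet sequence numbers received so far
    tg_sqn:        number of the transmission group being checked
    tgsize:        PARITY_PRM_TGS learned from OPT_PARITY_PRM
    Returns None when nothing is missing from the group.
    """
    first = tg_sqn * tgsize                   # first SQN of the group
    missing = sum(
        1 for sqn in range(first, first + tgsize)
        if sqn not in received_sqns
    )
    if missing == 0:
        return None
    # The count can never exceed tgsize, which is the constraint the
    # text notes on the number of parity packets requested.
    return {"OPT_PARITY": True, "NAK_TG_SQN": tg_sqn, "NAK_PKT_CNT": missing}


if __name__ == "__main__":
    got = {16, 17, 19, 20, 23}                # group 2 of size 8 is 16..23
    print(parity_nak_fields(got, tg_sqn=2, tgsize=8))
```

In this example packets 18, 21 and 22 are missing, so a single parity
NAK requesting three parity packets replaces three selective NAKs.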
A receiver should bias the value of NAK_BO_IVL for parity NAKs
inversely proportional to NAK_PKT_CNT so that NAKs for larger losses
are likely to be scheduled ahead of NAKs for smaller losses in the same
receiver population.

A confirming NCF for a parity NAK is a parity NCF with NCF_PKT_CNT
equal to or greater than that specified by the parity NAK.

A receiver's NAK_RDATA_IVL timer is not cancelled until all requested
parity packets have been received.

In the absence of data (detected from SPMs bearing SPM_LEAD equal to
RXW_LEAD) on non-transmission-group boundaries, receivers should resort
to selective NAKs for any missing packets in that trailing transmission
group.

When a receiver handles parity packets belonging to a variable-size FEC
block (detected from the presence of the OPT_VAR_SIZE option in the
parity packets), it must decode them as specified in the overview
description and use the decoded TPDU length to remove the padding from
the decoded packet.

11.6. Procedures - Network Elements

Pro-active parity packets (ODATA with OPT_PARITY) are switched by
network elements without transport-layer intervention.

On-demand parity packets (RDATA with OPT_PARITY) necessitate modified
request, confirmation and repair constraint procedures for network
elements. In the context of these procedures, repair state is
maintained per NAK_TSI and NAK_TG_SQN, and in addition to recording the
interfaces on which corresponding NAKs have been received, records the
largest value of NAK_PKT_CNT seen in corresponding NAKs on each
interface. This value is referred to as the known packet count. The
largest of the known packet counts recorded for any interface in the
repair state for the transmission group or carried by an NCF is
referred to as the largest known packet count.

Upon receipt of a parity NAK, a network element responds with the
corresponding parity NCF. The corresponding parity NCF is just an NCF
formed in the usual way (i.e., a multicast copy of the NAK with the
packet type changed), but with the addition of OPT_PARITY and with
NCF_PKT_CNT set to the larger of NAK_PKT_CNT and the known packet count
for the receiving interface. The network element then creates repair
state in the usual way with the following modifications.

If repair state for the receiving interface does not exist, the network
element must create it and additionally record NAK_PKT_CNT from the
parity NAK as the known packet count for the receiving interface.

If repair state for the receiving interface already exists, the network
element must eliminate the NAK only if NAK_ELIM_IVL has not expired and
NAK_PKT_CNT is equal to or less than the largest known packet count. If
NAK_PKT_CNT is greater than the known packet count for the receiving
interface, the network element must update the latter with the larger
NAK_PKT_CNT.

Upon either adding a new interface or updating the known packet count
for an existing interface, the network element must determine if
NAK_PKT_CNT is greater than the largest known packet count. If so or if
NAK_ELIM_IVL has expired, the network element must forward the parity
NAK in the usual way with a value of NAK_PKT_CNT equal to the largest
known packet count.

Upon receipt of an on-demand parity packet, a network element must
locate existing repair state for the corresponding RDATA_TSI and
RDATA_TG_SQN. If no such repair state exists, the network element must
discard the RDATA as usual.

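The parity NAK handling in the preceding paragraphs can be sketched
roughly as below. This is one possible reading, not a definitive
implementation: it assumes that the comparison against the largest
known packet count uses the value held before the receiving interface
is updated, and that the forwarded NAK carries the value held after the
update. RepairState, on_parity_nak and the interface names are
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class RepairState:
    """Repair state for one (NAK_TSI, NAK_TG_SQN) pair."""
    known_pkt_cnt: Dict[str, int] = field(default_factory=dict)  # per interface
    ncf_pkt_cnt: int = 0  # largest count carried by an NCF, 0 if none seen

    def largest_known(self) -> int:
        # Largest of the per-interface counts and the NCF-carried count.
        return max([self.ncf_pkt_cnt, *self.known_pkt_cnt.values()])


def on_parity_nak(state: RepairState, iface: str, nak_pkt_cnt: int,
                  elim_ivl_expired: bool) -> Tuple[int, Optional[int]]:
    """Process a parity NAK received on iface.

    Returns (NCF_PKT_CNT for the confirming NCF,
             NAK_PKT_CNT to forward upstream, or None if eliminated).
    """
    prior_largest = state.largest_known()        # assumption: compare pre-update
    known = state.known_pkt_cnt.get(iface, 0)

    # The NCF carries the larger of NAK_PKT_CNT and the known packet count
    # for the receiving interface.
    ncf_cnt = max(nak_pkt_cnt, known)

    # Create per-interface state, or raise its known packet count.
    if iface not in state.known_pkt_cnt or nak_pkt_cnt > known:
        state.known_pkt_cnt[iface] = max(known, nak_pkt_cnt)

    # Forward if the NAK asks for more than was already known, or if the
    # elimination interval has expired; otherwise eliminate it.
    if nak_pkt_cnt > prior_largest or elim_ivl_expired:
        return ncf_cnt, state.largest_known()
    return ncf_cnt, None
```

In this sketch a NAK for 2 packets arriving on a second interface,
while 3 packets are already known on another, is confirmed with an NCF
but not forwarded until NAK_ELIM_IVL expires.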
If corresponding repair state exists, the largest known packet count
must be decremented by one, then the network element must forward the
RDATA on all interfaces in the existing repair state, and decrement the
known packet count for each by one. Any interfaces whose known packet
count is thereby reduced to zero must be deleted from the repair state.
If the number of interfaces is thereby reduced to zero, the repair
state itself must be deleted.

Upon reception of a parity NCF, network elements must cancel pending
NAK retransmission only if NCF_PKT_CNT is greater than or equal to the
largest known packet count. Network elements must use parity NCFs to
anticipate NAKs in the usual way with the addition of recording
NCF_PKT_CNT from the parity NCF as the largest known packet count with
the anticipated state so that any subsequent NAKs received with
NAK_PKT_CNT equal to or less than NCF_PKT_CNT will be eliminated, and
any with NAK_PKT_CNT greater than NCF_PKT_CNT will be forwarded.
Network elements which receive a parity NCF with NCF_PKT_CNT larger
than the largest known packet count must also use it to anticipate
NAKs, increasing the largest known packet count to reflect NCF_PKT_CNT
(partial anticipation).

Parity NNAKs follow the usual elimination procedures with the exception
that NNAKs are eliminated only if existing NAK state has a NAK_PKT_CNT
greater than NNAK_PKT_CNT.

11.7. Procedures - DLRs

A DLR with the ability to provide FEC repairs must indicate this by
setting the OPT_PARITY bit in the redirecting NCF. It must then process
any redirected FEC NAKs in the usual way.

11.8. Packet Formats

11.8.1. OPT_PARITY_PRM - Packet Extension Format

OPT_PARITY_PRM may be appended only to SPMs.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  Option Type  | Option Length |                            P O|
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Transmission Group Size                    |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x08

Option Length = 8 octets

P-bit (PARITY_PRM_PRO)

   Indicates when set that the source is providing pro-active parity
   packets.

O-bit (PARITY_PRM_OND)

   Indicates when set that the source is providing on-demand parity
   packets.

At least one of PARITY_PRM_PRO and PARITY_PRM_OND must be set.

Transmission Group Size (PARITY_PRM_TGS)

   The number of data packets in the transmission group over which
   the parity packets are calculated.

11.8.2. OPT_PARITY_GRP - Packet Extension Format

OPT_PARITY_GRP may be appended only to parity packets.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  Option Type  | Option Length |      Parity Group Number      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x09

Option Length = 4 octets

Parity Group Number (PRM_GROUP)

   The number of the group of k parity packets amongst the h parity
   packets within the transmission group to which the parity packet
   belongs where the first k parity packets are in group zero.
   PRM_GROUP must not be zero.

12. Appendix B - Congestion Avoidance

A source should implement strategies for congestion avoidance, aimed at
providing overall network stability, fairness among competing PGM flows
and some degree of fairness towards coexisting TCP flows [13]. This is
work in progress and will be expanded in a later version of this
document.

13. Appendix C - Flow Control

A degree of flow control native to PGM itself is provided through the
exchange of elective, periodic state notifications between sources
(Transmit State Notifications - TSNs) and receivers (Receive State
Notifications - RSNs). The goal of the flow control strategies in PGM
is to conservatively adapt a source's transmit rate so as to minimize
NAKs due to receiver overrun and to do so with as simple and efficient
an exchange of protocol packets as possible. These strategies are
intended to augment, not substitute for, source-based adaptive
strategies for rate-limiting transmissions based solely on the
frequency of NAKs.

Since PGM has no conference control mechanisms, these mechanisms simply
act to modify a source's transmit rate to suit the slowest receiver the
source is willing to accommodate. The use and frequency of TSNs and
RSNs is left to the discretion of the implementation.

TSNs enable a source to adapt its transmit rate as network and receiver
resources permit. A source may distinguish congestion from flow control
by noting that in the absence of RSNs, it is likely that most NAKs the
source may see are the result of congestion and not end-to-end flow
control problems. So a source may also reduce its transmit rate simply
in response to the pattern of NAKs it receives.

These mechanisms are entirely elective and not meant as a replacement
for reservation protocols or other out-of-band resource and conference
management strategies. They are intended simply to provide a workable
strategy in the absence of anything more sophisticated. PGM's reliable
data transfer service is in no way dependent upon the use of TSNs and
RSNs.

13.1. Architectural Description

To provide an optional mechanism for flow control, PGM specifies packet
formats and procedures for sources and receivers to exchange resource
state notifications.

13.1.1. Source Functions

A source may periodically multicast TSNs to the group to advertise its
transmit window and its minimum and current transmit rates.

In response to corresponding RSNs, a source must reduce its transmit
rate to at most the least rate specified in any RSN, and reflect this
reduced current rate in subsequent TSNs.

In the absence of corresponding RSNs, a source may conservatively
increase its transmit rate, and reflect this increased current rate in
subsequent TSNs.

To find the local maximum current transmit rate, a source may continue
to increase its current transmit rate until it receives RSNs (or NAKs)
in response, and then back off appropriately.

13.1.2. Receiver Functions

A receiver unicasts an RSN to a source in response to a TSN only if the
transmit rate advertised in the TSN exceeds the receiver's capacity. To
prevent RSN implosion, receivers must observe a random back-off over an
interval three times the TSN period, and monitor TSNs in the meantime
for a reduction in the current transmit rate.

13.1.3. Network Element Functions

Network elements forward TSNs and RSNs without intervention.

13.2. Terms and Concepts

For a given transport session identified by a TSI, a source maintains:

TXW_MIN_RTE  a fixed minimum transmit rate in kBps, the minimum the
             transmitter will consider maintaining, equal to or less
             than TXW_MAX_RTE

The reduction of TXW_MAX_RTE to TXW_MIN_RTE is negotiated through
exchanges of TSNs and RSNs.

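The source functions above amount to a simple rate adjustment per TSN
period, sketched below. The function name next_transmit_rate and the
step parameter are hypothetical; the text leaves the size of a
"conservative" increase to the implementation. Treating TXW_MIN_RTE as
a floor follows from its definition as the minimum rate the transmitter
will consider maintaining. All rates are in kBps.

```python
from typing import Iterable


def next_transmit_rate(current: int, txw_min_rte: int,
                       rsn_rates: Iterable[int], step: int) -> int:
    """Return the transmit rate to advertise in the next TSN.

    rsn_rates holds RSN_MAX_RTE from the corresponding RSNs received
    since the last TSN; step is an assumed, implementation-chosen
    increment applied when no RSN was received.
    """
    rates = list(rsn_rates)
    if rates:
        # Reduce to at most the least rate in any RSN, but never below
        # the minimum the source is willing to maintain.
        return max(txw_min_rte, min(current, min(rates)))
    # No RSNs: probe upwards conservatively.
    return current + step
```

A source using this sketch that receives RSNs for 800 and 600 kBps
while sending at 1000 kBps would drop to 600 kBps, and with no RSNs it
would move from 1000 to 1000 + step.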
For a given transport session identified by a TSI, a receiver
maintains:

RXW_MAX_RTE  a fixed maximum reception rate in kBps, the maximum the
             receiver will consider maintaining

The reduction of the current transmit rate (advertised in TSNs) to
RXW_MAX_RTE is negotiated through exchanges of TSNs and RSNs.

13.3. Packet Contents

This section just provides enough short-hand to make the Procedures
intelligible. For the full details of packet contents, please refer to
Packet Formats below.

13.3.1. Transmit State Notification (TSN)

TSNs are formed by adding OPT_TSN to SPMs and contain:

TSN_TSI      (a.k.a. SPM_TSI) the source-assigned TSI for which RSNs
             are solicited

TSN_SQN      (a.k.a. SPM_SQN) a sequence number assigned sequentially
             by the source in unit increments and scoped by TSN_TSI

   NOTA BENE: this is an entirely separate sequence from that used
   to number ODATA and RDATA.

TSN_TRAIL    (a.k.a. SPM_TRAIL) the source's TXW_TRAIL

TSN_LEAD     (a.k.a. SPM_LEAD) the source's TXW_LEAD

TSN_MIN_RTE  the source's TXW_MIN_RTE

TSN_MAX_RTE  the source's TXW_MAX_RTE

13.3.2. Receive State Notification (RSN)

RSNs are unicast to the source and contain:

RSN_TSI      TSN_TSI from the TSN to which this is a response

RSN_SQN      TSN_SQN from the TSN to which this is a response

RSN_TRAIL    TSN_TRAIL from the TSN to which this is a response

RSN_MAX_RTE  the receiver's RXW_MAX_RTE

13.4. Procedures - Sources

13.4.1. Data Transmission Initialization

Sources must sequence TSNs by assigning each a TSN_SQN using a number
sequence separate from that used to number data packets. In addition,
sources associate each TSN with a specific instance of the transmit
window by setting TSN_TRAIL to TXW_TRAIL.

A source may precede initial data transmission to a transport session
by sending TSNs at a rate of TSN_IDL_RTE for an interval of
TSN_IDL_IVL.
TSNs are used by the source in this instance simply to provoke RSNs
from any receivers that may protest the advertised TSN_MAX_RTE. A
source may use this procedure to find the largest acceptable initial
values for TXW_MAX_RTE before initiating data transmission.

In the ordinary course of data transmission, a source may periodically
transmit TSNs and adjust the current transmit rate to establish the
optimum rate for the current population of tuned-in receivers.

Specifically, a source may increase the values in the TSN without
increasing them in fact until it provokes RSNs. It should then use the
values in the RSNs to back off to the highest acceptable values for
actual use.

Note, then, that a source may advertise higher values for TSN_MAX_RTE
in its TSNs than it actually uses, but it must never actually use
higher values for TXW_MAX_RTE than it advertises in its TSNs.

13.4.2. Transmit Resource Management

An RSN corresponds to a TSN if RSN_TSI matches TSN_TSI, RSN_SQN matches
TSN_SQN, and RSN_TRAIL matches TSN_TRAIL. That is, an RSN corresponds
to a TSN if it bears the same transport session, sequence, and transmit
window identifiers as the TSN.

Sources should respond to RSNs that correspond to the current TSN by
reducing TXW_MAX_RTE to the minimum values heard in any such RSN as
long as these values are no lower than TXW_MIN_RTE.

13.5. Procedures - Receivers

13.5.1. Data Reception Initialization

TSNs must be sequenced by receivers based on a combination of TSN_SQN
(which numbers TSNs separately from data packets) and TSN_TRAIL which
relates the TSN to a specific transmit window. TSNs bearing the same
TSN_TRAIL may be ordered relative to one another using TSN_SQN. The
highest numbered such TSN should be used to maintain the receiver's
notion of the transmit window and the current and maximum transmit
rates.
Ordering of TSNs is particularly important for TSNs in which transmit
rates are increasing or decreasing.

For a given transport session identified by TSI, a receiver may precede
initial data reception by first receiving and accepting the values for
TXW_MAX_RTE in a matching TSN. Accepting this value implies that the
receiver is capable of receiving data at the rate of TXW_MAX_RTE.

If a receiver accepts the advertised value for TXW_MAX_RTE in a
matching TSN, it may initiate data reception in the transmit window
provided by the TSN.

If the TSN bears OPT_JOIN, the receiver initializes the trailing edge
of the receive window to TXW_TRAIL and proceeds with normal data
reception.

If the TSN does not bear OPT_JOIN, the receiver may initiate data
reception beginning only with the first ODATA_SQN it receives within
the advertised transmit window. This sequence number temporarily
defines the trailing edge of the transmit window from the receiver's
perspective. That is, it is assigned to RXW_TRAIL_INIT within the
receiver, and until the trailing edge sequence number advertised in
subsequent packets (TSNs or ODATA or RDATA or SPMs) increments through
RXW_TRAIL_INIT, the receiver must only request repairs for sequence
numbers subsequent to RXW_TRAIL_INIT. Thereafter, it may request
repairs anywhere in the transmit window. This temporary restriction on
repair requests prevents receivers from requesting a potentially large
amount of history when they first begin to receive a given PGM
transport session.

13.5.2. Receive Resource Management

From a receiver's perspective, an acceptable TSN is one in which
TSN_MIN_RTE is equal to or less than RXW_MAX_RTE. The current value of
TSN_MAX_RTE may or may not be within the receiver's capacity.

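The acceptability test just stated, together with the earlier rule that
an RSN is sent only when the advertised rate exceeds the receiver's
capacity, gives three possible outcomes for a received TSN. The sketch
below merely illustrates that classification; TsnVerdict and
classify_tsn are hypothetical names, and rates are in kBps.

```python
from enum import Enum


class TsnVerdict(Enum):
    UNACCEPTABLE = "unacceptable"          # TSN_MIN_RTE above RXW_MAX_RTE
    EXCEEDS_CAPACITY = "exceeds capacity"  # acceptable, but TSN_MAX_RTE too high
    WITHIN_CAPACITY = "within capacity"    # acceptable, no RSN needed


def classify_tsn(tsn_min_rte: int, tsn_max_rte: int,
                 rxw_max_rte: int) -> TsnVerdict:
    """Classify a TSN against the receiver's fixed maximum reception rate."""
    if tsn_min_rte > rxw_max_rte:
        return TsnVerdict.UNACCEPTABLE
    if tsn_max_rte > rxw_max_rte:
        return TsnVerdict.EXCEEDS_CAPACITY
    return TsnVerdict.WITHIN_CAPACITY
```

A receiver with RXW_MAX_RTE of 500 would thus treat a TSN advertising a
minimum of 200 and a maximum of 800 as acceptable but exceeding its
capacity.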
If a receiver receives an unacceptable TSN, the receiver must neither
initiate nor continue data reception for the given transport session.
In addition, it must not respond to the TSN with an RSN, although it
may continue to receive and inspect TSNs for an acceptable one.

If a receiver receives an acceptable TSN, but the advertised values of
TSN_MAX_RTE exceed RXW_MAX_RTE, the receiver should respond with a
corresponding RSN advertising the maximum value RSN_MAX_RTE with which
it can operate. The receiver may simultaneously initiate or continue
data reception, and it should continue to respond to subsequent TSNs
with this RSN until it receives a TSN advertising a value of
TSN_MAX_RTE with which it can operate.

13.6. Packet Formats

13.6.1. OPT_TSN - Packet Extension Format

The source NLA of a TSN is the unicast address of the entity that
originates the TSN.

The destination NLA of a TSN is a multicast group NLA.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  Option Type  | Option Length |                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Minimum Transmit Rate                     |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Maximum Transmit Rate                     |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x0A

Option Length = 12 octets

Minimum Transmit Rate (TSN_MIN_RTE)

   The minimum rate of transmission required for receivers to
   participate in the group (TXW_MIN_RTE).

Maximum Transmit Rate (TSN_MAX_RTE)

   The current rate of transmission required by receivers to
   participate in the group (TXW_MAX_RTE).

13.6.2. RSN - Receive State Notification

The source NLA of an RSN is the unicast address of the entity that
originates the RSN.

The destination NLA of an RSN is the unicast address of the source of
the corresponding TSN.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |          Source Port          |       Destination Port        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |    Options    |           Checksum            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Global Source ID ...                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     ... Global Source ID      |          TPDU Length          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     RSN's Sequence Number                     |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                 Trailing Edge Sequence Number                 |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Receive Rate                          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Source Port:

   RSN_SPORT

   Data-Destination Port

Destination Port:

   RSN_DPORT

   Data-Source Port, together with Global Source ID forms RSN_TSI

Type:

   RSN_TYPE = 0x0D

Options:

   RSNs may bear only OPT_JOIN.

RSN's Sequence Number (RSN_SQN)

   TSN_SQN from the corresponding TSN.

Trailing Edge Sequence Number (RSN_TRAIL)

   TSN_TRAIL from the corresponding TSN.

Receive Rate (RSN_MAX_RTE)

   The maximum rate of transmission the receiver can sustain
   (RXW_MAX_RTE).

14. Appendix D - SPM Requests

14.1. Introduction

SPM Requests (SPMRs) may be used to solicit an SPM from a source in a
non-implosive way. The typical application is for late-joining
receivers to solicit SPMs directly from a source in order to be able to
NAK for missing packets without having to wait for a regularly
scheduled SPM from that source.

14.2. Overview

Allowing for SPMR implosion protection procedures, a receiver may
unicast an SPMR to a source to solicit the most current session,
window, and path state from that source any time after the receiver has
joined the group. A receiver may learn the TSI and source to which to
direct the SPMR from any other PGM packet it receives in the group, or
by any other means such as from local configuration or directory
services. The receiver must use the usual SPM procedures to glean the
unicast address to which it should direct its NAKs from the solicited
SPM.

14.3. Packet Contents

This section just provides enough short-hand to make the Procedures
intelligible. For the full details of packet contents, please refer to
Packet Formats below.

14.3.1. SPM Requests

SPMRs are transmitted by receivers to solicit SPMs from a source.

SPMRs are unicast to a source and contain:

SPMR_TSI  the source-assigned TSI for the session to which the SPMR
          corresponds

14.4. Procedures - Sources

A source must respond immediately to an SPMR with the corresponding
SPM, rate limited to once per IHB_MIN per TSI. The corresponding SPM
matches SPM_TSI to SPMR_TSI and SPM_DPORT to SPMR_DPORT.

14.5. Procedures - Receivers

To moderate the potentially implosive behaviour of SPMRs at least on a
densely populated subnet, receivers must use the following back-off and
suppression procedure based on multicasting the SPMR with a TTL of 1
ahead of and in addition to unicasting the SPMR to the source. The role
of the multicast SPMR is to suppress the transmission of identical
SPMRs from the subnet.

More specifically, before unicasting a given SPMR, receivers must
choose a random delay on SPMR_BO_IVL (~250 msecs) during which they
listen for a multicast of an identical SPMR.
If a receiver does not see a matching multicast SPMR within its chosen
random interval, it must first multicast its own SPMR to the group with
a TTL of 1 before then unicasting its own SPMR to the source. If a
receiver does see a matching multicast SPMR within its chosen random
interval, it must refrain from unicasting its SPMR and wait instead for
the corresponding SPM.

In addition, receipt of the corresponding SPM within this random
interval should cancel transmission of an SPMR.

In either case, the receiver must wait at least SPMR_SPM_IVL before
attempting to repeat the SPMR by choosing another delay on SPMR_BO_IVL
and repeating the procedure above.

The corresponding SPMR matches SPMR_TSI to SPMR_TSI and SPMR_DPORT to
SPMR_DPORT. The corresponding SPM matches SPM_TSI to SPMR_TSI and
SPM_DPORT to SPMR_DPORT.

14.6. Procedures - Network Elements

There are no SPMR procedures for network elements.

14.7. SPM Requests

SPMR:

   SPM Requests are sent by receivers to request the immediate
   transmission of an SPM for the given TSI from a source.

The source NLA of an SPMR is the unicast NLA of the entity that
originates the SPMR.

The destination NLA of an SPMR is the unicast NLA of the source from
which the corresponding SPM is requested.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |          Source Port          |       Destination Port        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |    Options    |           Checksum            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Global Source ID ...                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     ... Global Source ID      |          TPDU Length          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Option Extensions when present ...
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ...

Source Port:

   SPMR_SPORT

   Data-Destination Port

Destination Port:

   SPMR_DPORT

   Data-Source Port, together with Global Source ID forms SPMR_TSI

Type:

   SPMR_TYPE = 0x0C

Global Source ID:

   SPMR_GSI

   Together with Destination Port forms SPMR_TSI

15. Appendix E - Poll Mechanism

15.1. Introduction

These procedures provide PGM network elements and sources with the
ability to poll their downstream PGM neighbours to solicit replies in
an implosion-controlled way.

Both general polls and specific polls are possible. The former provide
a PGM (parent) node with a way to check if there are any PGM (children)
nodes connected to it, both network elements and receivers, and to
estimate their number. The latter can be used by PGM parent nodes to
search for nodes with specific properties among their PGM children. One
example application is DLR discovery.

Polling is implemented using two additional PGM packets:

POLL  a Poll Request that PGM parent nodes multicast to the group to
      perform the poll. Similarly to NCFs, POLL packets stop at the
      first PGM node they reach, as they are not forwarded by PGM
      network elements.

POLR  a Poll Response that PGM children nodes (either network elements
      or receivers) use to reply to a Poll Request by addressing it to
      the NLA of the interface from which the triggering POLL was sent.

The polling mechanism dictates that PGM children nodes that receive a
POLL packet reply to it only if certain conditions are satisfied and
ignore the POLL otherwise. Two types of condition are possible: a
random condition that defines a probability of replying for the polled
child, and a deterministic condition.
Both the random condition and the deterministic condition are
controlled by the polling PGM parent node by specifying the probability
of replying and defining the deterministic condition(s) respectively.
Random-only polls, deterministic-only polls or a combination of the two
are possible.

The random condition in polls prevents implosion of replies by
controlling their number. Given a probability of replying P and
assuming that each receiver makes an independent decision, the number
of expected replies to a poll is P*N where N is the number of PGM
children relative to the polling PGM parent. The polling node can
control the number of expected replies by specifying P in the POLL
packet.

15.2. Packet Contents

This section just provides enough short-hand to make the Procedures
intelligible. For the full details of packet contents, please refer to
Packet Formats below.

15.2.1. POLL (Poll Request)

POLL_SQN     a sequence number assigned sequentially by the polling
             parent in unit increments and scoped by POLL_PATH and the
             TSI of the session.

POLL_PATH    the network-layer address (NLA) of the interface on the
             PGM network element or source on which the POLL is
             transmitted

POLL_BO_IVL  the back-off interval that must be used to compute the
             random back-off time to wait before sending the response
             to a poll.

POLL_RAND    a random string used to implement the randomness in
             replying

POLL_MASK    a bit-mask used to determine the probability of random
             replies

POLL_S_TYPE  the sub-type of the poll request

A poll request may also contain options which specify deterministic
conditions for the reply. No options are currently defined.

15.2.2. POLR (Poll Response)

POLR_SQN     POLL_SQN of the poll request to which this is a reply

A poll response may also contain options. No options are currently
defined.

15.3. Procedures - General

Although the poll mechanism can be used for both general polls and
specific polls, no specific polls are currently defined. This section
hence will only specify general polls and extension mechanisms to
incorporate specific polls.

15.3.1. General Polls

General Polls can be used to check for and count PGM children that are
1 PGM hop downstream of an interface of a given node. They have
POLL_S_TYPE equal to PGM_POLL_GENERAL. PGM children that receive a
general poll decide whether to reply to it only based on the random
condition present in the POLL.

To prevent response implosion, PGM parents that initiate a general poll
should establish the probability of replying to the poll, P, so that
the expected number of replies is contained. The expected number of
replies is N * P, where N is the number of children. To be able to
compute this number, PGM parents should already have a rough estimate
of the number of children. If they do not have a recent estimate of
this number, they should send the first poll with a very low
probability of replying and increase it in subsequent polls in order to
get the desired number of replies.

PGM children observe a random back-off in replying to a poll. This
spreads out the replies in time and allows a PGM parent to abort the
poll if too many replies are being received. To abort an ongoing poll a
PGM parent must initiate another poll with a different POLL_SQN. PGM
children that receive a POLL must cancel any pending reply for POLLs
with POLL_SQN different from the one of the last POLL received.

For a given poll with probability of replying P, a PGM parent estimates
the number of children as M / P, where M is the number of responses
received.
PGM parents should keep polling periodically and use some average of
the result of recent polls as their estimate for the number of
children.

15.3.2. Specific Polls

Specific polls provide a way to search for PGM children that meet
specific requirements. As an example, a specific poll could be used to
search for down-stream DLRs. A specific poll is characterized by a
POLL_S_TYPE different from PGM_POLL_GENERAL. PGM children decide
whether to reply to a specific poll or not based on the POLL_S_TYPE, on
the random condition and on options possibly present in the POLL. The
way options should be interpreted is defined by POLL_S_TYPE. The random
condition must be interpreted as an additional condition to be
satisfied. To disable the random condition PGM parents must specify a
probability of replying P equal to 1.

PGM children must ignore a POLL packet if they do not understand
POLL_S_TYPE. Some specific POLL_S_TYPE may also require that the
children ignore a POLL if they do not fully understand all the PGM
options present in the packet.

15.4. Procedures - PGM Parents (Sources or Network Elements)

A PGM parent (source or network element) that wants to poll the first
PGM-hop children connected to one of its outgoing interfaces must send
a POLL packet on that interface with:

POLL_SQN     equal to POLL_SQN of the last POLL sent incremented by
             one

POLL_PATH    set to the NLA of the outgoing interface

POLL_BO_IVL  set to the desired reply back-off interval. NAK_BO_IVL is
             usually a conservative choice; a smaller value can be
             used if the number of expected replies can be determined
             with good confidence or if a conservatively low
             probability of reply (P) is being used (see POLL_MASK
             next).
               When the number of expected replies is unknown, a
               large POLL_BO_IVL should be used, so that the poll can
               be effectively aborted if the number of replies being
               received is too large.

POLL_RAND      should be a random string re-computed each time a new
               poll is sent on a given interface

POLL_MASK      determines the probability of replying, P, according
               to the relationship P = 1 / ( 2 ^ B ), where B is the
               number of bits set in POLL_MASK. If this is a
               deterministic poll, B must be 0, i.e. POLL_MASK must
               be an all-zeroes bit-mask.

POLL_S_TYPE    the type of the poll. For a general poll use
               PGM_POLL_GENERAL

   NOTA BENE: POLLs transmitted by network elements must bear the
   ODATA source's NLA, not the network element's NLA. POLLs must
   also be transmitted with the IP Router Alert Option [6], to
   allow PGM network elements to intercept them.

A PGM parent that has started a poll by sending a POLL packet should
wait at least POLL_BO_IVL before starting another poll. During this
interval it should collect all the valid responses (those with
POLR_SQN equal to the POLL_SQN of the outstanding POLL) and process
them at the end of the collection interval.

A PGM parent should observe the rules mentioned in the description of
general procedures to prevent implosion of responses. These rules
should in general be observed for both general polls and specific
polls. The latter, however, can be performed using a deterministic
poll (with no implosion prevention) if the expected number of replies
is known to be small.

A PGM parent that has started a poll should monitor the number of
replies. If this becomes too large, the PGM parent should abort the
poll by immediately starting a new poll (different POLL_SQN) and
specifying a very low probability of replying.

15.5. Procedures - PGM Children (Receivers or Network Elements)

PGM receivers and network elements must compute a 32-bit random node
identifier (RAND_NODE_ID) at startup time. When a PGM child (receiver
or network element) receives a POLL, it must use its RAND_NODE_ID to
match POLL_RAND of the incoming POLL. The match is limited to the
bits specified by POLL_MASK. If the incoming POLL contains a POLL_MASK
made of all zeroes, the match is successful regardless of the content
of POLL_RAND (deterministic reply). If the match fails, the receiver
or network element must discard the POLL without any further action;
otherwise it must check the field POLL_S_TYPE and any PGM options
included in the POLL to determine whether it should reply to the poll.

If POLL_S_TYPE is equal to PGM_POLL_GENERAL, the PGM child must
schedule a reply to the POLL regardless of the presence of PGM options
in the POLL packet.

If POLL_S_TYPE is different from PGM_POLL_GENERAL, the decision on
whether a reply should be scheduled depends on the actual type and on
any options present in the POLL.

If POLL_S_TYPE is unknown to the recipient of the POLL, it must not
reply and must ignore the poll. Currently the only POLL_S_TYPE defined
is PGM_POLL_GENERAL.

If a PGM receiver or network element has decided to reply to a POLL,
it must schedule the transmission of a single POLR at a random time in
the future. The random delay is chosen in the interval
[0, POLL_BO_IVL], where POLL_BO_IVL is the value contained in the POLL
received. When this timer expires, it must send a POLR using the
POLL_PATH of the received POLL as the destination address. POLR_SQN
must be equal to POLL_SQN. The POLR may contain PGM options according
to the semantics of POLL_S_TYPE or the semantics of any PGM options
present in the POLL. If POLL_S_TYPE is PGM_POLL_GENERAL, no option is
required.
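
As an illustration only, the matching and back-off rules for PGM
children can be sketched in Python as follows. The function names are
hypothetical and not part of the protocol. Reading "the match is
limited to the bits specified by POLL_MASK" as bit-wise equality on
the bits set in the mask is an assumption, chosen because it yields
the reply probability P = 1 / ( 2 ^ B ) stated for POLL_MASK.

```python
import random

PGM_POLL_GENERAL = 0x0000  # the only POLL_S_TYPE defined in the text


def poll_matches(rand_node_id, poll_rand, poll_mask):
    """True when RAND_NODE_ID equals POLL_RAND on every bit set in POLL_MASK.

    Assumed interpretation of the match rule. An all-zeroes mask always
    matches (deterministic reply).
    """
    return ((rand_node_id ^ poll_rand) & poll_mask & 0xFFFFFFFF) == 0


def reply_probability(poll_mask):
    """P = 1 / (2 ^ B), where B is the number of bits set in POLL_MASK."""
    bits_set = bin(poll_mask & 0xFFFFFFFF).count("1")
    return 1.0 / (2 ** bits_set)


def polr_delay(rand_node_id, poll_rand, poll_mask, poll_s_type, poll_bo_ivl):
    """Return the back-off delay before sending a POLR, or None to ignore.

    Hypothetical helper: only PGM_POLL_GENERAL is handled, so every other
    sub-type is treated as unknown and the POLL is ignored.
    """
    if not poll_matches(rand_node_id, poll_rand, poll_mask):
        return None  # random condition failed: discard the POLL
    if poll_s_type != PGM_POLL_GENERAL:
        return None  # unknown POLL_S_TYPE: must not reply
    # Single POLR at a random time chosen in [0, POLL_BO_IVL].
    return random.uniform(0, poll_bo_ivl)
```

A child would compute RAND_NODE_ID once at startup, call polr_delay()
for each incoming POLL, and arm a timer with the returned value when
it is not None.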

A PGM receiver or network element must cancel any pending transmission
of POLRs if a new POLL is received with a POLL_SQN different from the
POLL_SQN of the poll for which the POLRs were scheduled.

15.6. Constant Definition

PGM_POLL_GENERAL is equal to 0x0000. This is the only POLL_S_TYPE
value currently defined.

15.7. Packet Formats

The packet formats described in this section are transport-layer
headers that must immediately follow the network-layer header in the
packet.

The descriptions of Data-Source Port, Data-Destination Port, Options,
Checksum, Global Source ID (GSI), and TPDU Length are those provided
in Section 8.

15.7.1. Poll Request

POLLs are sent by PGM parents (sources or network elements) to
initiate a poll among their first PGM-hop children.

POLLs are sent to the ODATA multicast group. The source NLA of a POLL
is the ODATA source's NLA. POLLs must be transmitted with the IP
Router Alert Option.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Options    |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Global Source ID                   ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...    Global Source ID       |           TPDU Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    POLL's Sequence Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            NLA AFI            |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Path NLA                     ...   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+-+
|                   POLL's Back-off Interval                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Random String                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Matching Bit-Mask                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        POLL's Sub-type        |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Extensions when present ...                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+

Source Port:

   POLL_SPORT

   Data-Source Port, together with POLL_GSI forms POLL_TSI

Destination Port:

   POLL_DPORT

   Data-Destination Port

Type:

   POLL_TYPE = 0x01

Global Source ID:

   POLL_GSI

   Together with POLL_SPORT forms POLL_TSI

POLL's Sequence Number:

   POLL_SQN

   The sequence number assigned to the POLL by the originator.

Path NLA:

   POLL_PATH

   The NLA of the interface on the source or network element on which
   this POLL was forwarded.

POLL's Back-off Interval:

   POLL_BO_IVL

   The back-off interval used to compute a random back-off for the
   reply.

Random String:

   POLL_RAND

   A random string used to implement the random condition in replying.

Matching Bit-Mask:

   POLL_MASK

   A bit-mask used to determine the probability of random replies.

POLL's Sub-type:

   POLL_S_TYPE

   The sub-type of the poll request.

Option Extensions:

   No option is currently defined.

15.7.2. Poll Response

POLRs are sent by PGM children (receivers or network elements) to
reply to a POLL.

The source NLA of a POLR is the unicast NLA of the entity that
originates the POLR.
The destination NLA of a POLR is initialized by the originator of the
POLL to the unicast NLA of the upstream PGM element (source or network
element) known from the POLL that triggered the POLR.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Options    |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Global Source ID                   ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...    Global Source ID       |           TPDU Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    POLR's Sequence Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Extensions when present ...                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+

Source Port:

   POLR_SPORT

   Data-Destination Port

Destination Port:

   POLR_DPORT

   Data-Source Port, together with Global Source ID forms POLR_TSI

Type:

   POLR_TYPE = 0x02

Global Source ID:

   POLR_GSI

   Together with POLR_DPORT forms POLR_TSI

POLR's Sequence Number:

   POLR_SQN

   The sequence number (POLL_SQN) of the POLL packet to which this is
   a reply.

Option Extensions:

   No option is currently defined.

16. Appendix F - Implosion Prevention

16.1. Introduction

These procedures are intended to prevent NAK implosion and to limit
its extent in case of the loss of all or part of the suppressing
multicast distribution tree. They also provide a means to adaptively
tune the NAK back-off interval, NAK_BO_IVL.

The PGM virtual topology is established and refreshed by SPMs.
Between one SPM and the next, PGM nodes can have an out-of-date view
of the PGM topology due to multicast routing changes, flapping, or a
link/router failure. If any of the above happens relative to a PGM
parent node, a potential NAK implosion problem arises because the
parent node is unable to suppress the generation of duplicate NAKs, as
it cannot reach its children using NCFs. The procedures described
below introduce an alternative way of performing suppression in this
case. They also attempt to prevent implosion by adaptively tuning
NAK_BO_IVL.

16.2. Tuning NAK_BO_IVL

Sources and network elements continuously monitor the number of
duplicate NAKs received and use this observation to tune the NAK
back-off interval (NAK_BO_IVL) for the first PGM-hop receivers
connected to them. Receivers learn the current value of NAK_BO_IVL
through OPT_NAK_BO_IVL appended to NCFs or SPMs.

16.2.1. Procedures - Sources and Network Elements

For each TSI, sources and network elements advertise the value of
NAK_BO_IVL that their first PGM-hop receivers should use. They
advertise a separate value on each outgoing interface for the TSI and
keep track of the last values advertised.

For each interface and TSI, sources and network elements count the
number of NAKs received for a specific repair state (i.e., per
sequence number per TSI) from the time the interface was first added
to the repair state list until the time the repair state is discarded.
Then they use this number to tune the current value of NAK_BO_IVL as
follows:

   Increase the current value of NAK_BO_IVL when the first duplicate
   NAK is received for a given SQN on a particular interface.

   Decrease the value of NAK_BO_IVL if no duplicate NAKs are received
   on a particular interface for the last NAK_PROBE_NUM measurements,
   where each measurement corresponds to the creation of a new repair
   state.

An upper and a lower limit are defined for the possible value of
NAK_BO_IVL at any time. These are NAK_BO_IVL_MAX and NAK_BO_IVL_MIN
respectively. The initial value that should be used as a starting
point to tune NAK_BO_IVL is NAK_BO_IVL_DEFAULT. The policies
recommended for increasing and decreasing NAK_BO_IVL are multiplying
by two and dividing by two respectively.

Sources and network elements advertise the current value of NAK_BO_IVL
through the OPT_NAK_BO_IVL that they append to NCFs. They may also
append OPT_NAK_BO_IVL to outgoing SPMs.

In order to avoid forwarding the NAK_BO_IVL advertised by the parent,
network elements must be able to recognize OPT_NAK_BO_IVL. Network
elements that receive SPMs containing OPT_NAK_BO_IVL must either
remove the option or over-write its content (NAK_BO_IVL) with the
current value of NAK_BO_IVL for the outgoing interface(s) before
forwarding the SPMs.

Sources may advertise the values of NAK_BO_IVL_MAX and NAK_BO_IVL_MIN
to the session by appending an OPT_NAK_BO_RNG to SPMs.

16.2.2. Procedures - Receivers

Receivers learn the value of NAK_BO_IVL to use through the option
OPT_NAK_BO_IVL, when this is present in NCFs or SPMs. The initial
value of NAK_BO_IVL is set to NAK_BO_IVL_DEFAULT.

Receivers that receive an SPM containing OPT_NAK_BO_RNG must use its
content to set the local values of NAK_BO_IVL_MAX and NAK_BO_IVL_MIN.

16.2.3. Adjusting NAK_BO_IVL in the absence of NAKs

Monitoring the number of duplicate NAKs provides a means to track
indirectly the change in the size of the first PGM-hop receiver
population and adjust NAK_BO_IVL accordingly.
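
As an illustration only, the tuning rules of Section 16.2.1 can be
sketched in Python as follows, for one interface and one TSI. The
class and method names are hypothetical, and restarting the count of
duplicate-free measurements after each decrease is an assumption; the
text does not specify it.

```python
class NakBackoffTuner:
    """Tracks NAK_BO_IVL for one (interface, TSI) pair. Illustrative only."""

    def __init__(self, default, minimum, maximum, probe_num):
        self.nak_bo_ivl = default      # starts at NAK_BO_IVL_DEFAULT
        self.minimum = minimum         # NAK_BO_IVL_MIN
        self.maximum = maximum         # NAK_BO_IVL_MAX
        self.probe_num = probe_num     # NAK_PROBE_NUM
        self.nak_counts = {}           # SQN -> NAKs seen for its repair state
        self.clean_run = 0             # consecutive duplicate-free measurements

    def on_nak(self, sqn):
        """Record a NAK; double the interval on the first duplicate for sqn."""
        count = self.nak_counts.get(sqn, 0) + 1
        self.nak_counts[sqn] = count
        if count == 2:
            self.nak_bo_ivl = min(self.nak_bo_ivl * 2, self.maximum)
            self.clean_run = 0

    def on_repair_state_discarded(self, sqn):
        """Close one measurement; halve after NAK_PROBE_NUM clean ones."""
        count = self.nak_counts.pop(sqn, 0)
        if count > 1:
            return  # duplicates were seen; not a clean measurement
        self.clean_run += 1
        if self.clean_run >= self.probe_num:
            self.nak_bo_ivl = max(self.nak_bo_ivl // 2, self.minimum)
            self.clean_run = 0  # assumption: start a fresh run after decreasing
```

For example, with NAK_BO_IVL_DEFAULT = 1000, NAK_BO_IVL_MIN = 250,
NAK_BO_IVL_MAX = 4000 and NAK_PROBE_NUM = 2 (all arbitrary values),
two NAKs for the same SQN raise the interval to 2000, and two
subsequent repair states without duplicates bring it back to 1000.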
Note that the number of duplicate NAKs for a given SQN is related to
the number of first PGM-hop children that scheduled (or forwarded) a
NAK and not to the absolute number of first PGM-hop children. This
mechanism, however, does not work in the absence of packet loss; hence
a large number of duplicate NAKs is possible after a period without
NAKs if many new receivers have joined the session in the meanwhile.
To address this issue, PGM sources and network elements should
periodically poll the number of first PGM-hop children using the
"general poll" procedures described in Appendix E. If the result of
the polls shows that the population size has increased significantly
during a period without NAKs, they should increase NAK_BO_IVL as a
safety measure.

16.3. Containing Implosion in the Presence of Network Failures

16.3.1. Detecting Network Failures

In some cases PGM (parent) network elements can promptly detect the
loss of all or part of the suppressing multicast distribution tree
(due to network failures or route changes) by checking their multicast
connectivity when they receive NAKs. In some other cases this is not
possible, as the connectivity problem might occur at some other
non-PGM node downstream or might take time to be reflected in the
multicast routing table. To address these latter cases, PGM uses a
simple heuristic: a failure is assumed for a TSI when the count of
duplicate NAKs received for a repair state reaches the value
DUP_NAK_MAX on one of the interfaces.

16.3.2. Containing Implosion

When a PGM source or network element detects or assumes a failure for
which it loses multicast connectivity to down-stream PGM agents
(either receivers or other network elements), it sends unicast NCFs to
them in response to NAKs.
Downstream PGM network elements which receive unicast NCFs and have
multicast connectivity to the multicast session send special SPMs to
prevent further NAKs until a regular SPM sent by the source refreshes
the PGM tree.

Procedures - Sources and Network Elements

PGM sources or network elements which detect or assume a failure that
prevents them from reaching down-stream PGM agents through multicast
NCFs revert to confirming NAKs through unicast NCFs for a given TSI on
a given interface. If the PGM agent is the source itself, then it must
generate an SPM for the TSI, in addition to sending the unicast NCF.

Network elements must keep using unicast NCFs until they receive a
regular SPM from the source.

When a unicast NCF is sent for the reasons described above, it must
contain the OPT_NEIGHBOUR_UNREACH option and the OPT_PATH_NLA option.
The former indicates that the sender is unable to use multicast to
reach downstream PGM agents. The latter reports the network layer
address of the sender, namely the NLA of the interface leading to the
unreachable subtree.

When a PGM network element receives an NCF containing the
OPT_NEIGHBOUR_UNREACH option, it must ignore it if OPT_PATH_NLA
specifies an upstream neighbour different from the one currently
known. Assuming the network element matches the OPT_PATH_NLA to the
upstream neighbour address, it must stop forwarding NAKs for the TSI
until it receives a regular SPM for the TSI. In addition, it must also
generate a special SPM to prevent downstream receivers from sending
more NAKs. This special SPM must contain the OPT_NEIGHBOUR_UNREACH
option and should have an SPM_SQN equal to the SPM_SQN of the last
regular SPM forwarded. The OPT_NEIGHBOUR_UNREACH option invalidates
the windowing information in SPMs (SPM_TRAIL and SPM_LEAD).
These fields should be filled with zeros by the PGM network element
that adds the OPT_NEIGHBOUR_UNREACH option.

PGM network elements which receive an SPM containing the
OPT_NEIGHBOUR_UNREACH option and whose SPM_PATH matches the currently
known PGM parent must forward it in the normal way and must stop
forwarding NAKs for the TSI until they receive a regular SPM for the
TSI. If the SPM_PATH does not match the currently known PGM parent,
the SPM containing the OPT_NEIGHBOUR_UNREACH option must be ignored.

Procedures - Receivers

PGM receivers which receive either an NCF or an SPM containing the
OPT_NEIGHBOUR_UNREACH option must stop sending NAKs until a regular
SPM is received for the TSI.

On reception of a unicast NCF containing the OPT_NEIGHBOUR_UNREACH
option, receivers must generate a multicast copy of the packet with
TTL set to one on the RPF interface for the data source. This will
prevent other receivers in the same subnet from generating NAKs.

Receivers must ignore windowing information in SPMs which contain the
OPT_NEIGHBOUR_UNREACH option.

Receivers must ignore NCFs containing the OPT_NEIGHBOUR_UNREACH option
if the OPT_PATH_NLA specifies a neighbour different from the one
currently known to be the PGM parent neighbour. Similarly, receivers
must ignore SPMs containing the OPT_NEIGHBOUR_UNREACH option if
SPM_PATH does not match the current PGM parent.

16.4. Packet Formats

16.4.1. OPT_NAK_BO_IVL - Packet Extension Format

OPT_NAK_BO_IVL may be appended to NCFs or SPMs.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     NAK Backoff Interval                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x0B

NAK Backoff Interval

   The value of NAK-generation Backoff Interval in microseconds.

16.4.2. OPT_NAK_BO_RNG - Packet Extension Format

OPT_NAK_BO_RNG may be appended to SPMs.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Maximum NAK Backoff Interval                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Minimum NAK Backoff Interval                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x0B

Maximum NAK Backoff Interval

   The maximum value of NAK-generation Backoff Interval in
   microseconds.

Minimum NAK Backoff Interval

   The minimum value of NAK-generation Backoff Interval in
   microseconds.

16.4.3. OPT_NEIGHBOUR_UNREACH - Packet Extension Format

OPT_NEIGHBOUR_UNREACH may be appended to SPMs and NCFs.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x0C

When present in SPMs, it invalidates the windowing information.

16.4.4. OPT_PATH_NLA - Packet Extension Format

OPT_PATH_NLA may be appended to NCFs.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  | Option Length |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Path NLA                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Option Type = 0x0D

Path NLA

   The NLA of the interface on the originating PGM network element
   that it uses to send multicast SPMs to the recipient of the packet
   containing this option.

Work in Progress

In addition to the explicitly speculative material in the foregoing,
work is also in progress on:

   Congestion avoidance through transmit rate control.

   Throughput control through shedding of lossy receivers.

   Reducing the latency of the alignment of source-path state with
   underlying multicast routing changes.

   Header compression.

   Strategies for securing PGM against the black-hole attacks outlined
   in Security Considerations.

   Heuristics for delaying the transmission of RDATA from a source to
   balance the tradeoff between the repair latency experienced by
   receivers and the overhead of duplicate RDATA packets experienced
   by the network.

Acknowledgements

The design and specification of PGM have been substantially influenced
by reviews and revisions provided by several people who took the time
to read and critique this document.
These include, in alphabetical order:

   Bob Albrightson       albright@cisco.com
   Nidhi Bhaskar         nbhaskar@cisco.com
   Joel Bion             jpbion@cisco.com
   Mark Bowles           bowles@tibco.com
   Jon Crowcroft         j.crowcroft@cs.ucl.ac.uk
   Steve Deering         deering@cisco.com
   Richard Edmonstone    redmonst@cisco.com
   Tugrul Firatli        tf@tibco.com
   Jim Gemmell           jgemmell@microsoft.com
   Dan Harkins           dharkins@cisco.com
   Dima Khoury           dkhoury@cisco.com
   Dan Leshchiner        dleshc@tibco.com
   Todd Montgomery       tmont@gcast.com
   Gerard Newman         gkn@network-alchemy.com
   Dave Oran             oran@cisco.com
   Denny Page            denny@tibco.com
   Ken Pillay            ken@cisco.com
   Chetan Rai            crai@cs.stanford.edu
   Yakov Rekhter         yakov@cisco.com
   Luigi Rizzo           luigi@iet.unipi.it
   Dave Rossetti         rossetti@cisco.com
   Paul Stirpe           paul.stirpe@reuters.com
   Lorenzo Vicisano      lorenzo@cisco.com
   Brian Whetten         whetten@gcast.com
   Kyle York             kyork@cisco.com

References

[1]  B. Whetten, T. Montgomery, S. Kaplan, "A High Performance Totally
     Ordered Multicast Protocol", in "Theory and Practice in
     Distributed Systems", Springer Verlag LNCS 938, 1994

[2]  S. Floyd, V. Jacobson, C. Liu, S. McCanne, L. Zhang, "A Reliable
     Multicast Framework for Light-weight Sessions and Application
     Level Framing", ACM Transactions on Networking, November 1996

[3]  J. C. Lin, S. Paul, "RMTP: A Reliable Multicast Transport
     Protocol", ACM SIGCOMM August 1996

[4]  K. Miller, K. Robertson, A. Tweedly, M. White, "Multicast File
     Transfer Protocol (MFTP) Specification", INTERNET DRAFT
     draft-miller-mftp-spec-02, January 1997

[5]  S. Deering, "Host Extensions for IP Multicasting", INTERNET
     RFC1112, STD 5, August 1989

[6]  D. Katz, "IP Router Alert Option", INTERNET DRAFT
     draft-katz-router-alert-04, January 1997

[7]  C. Partridge, "Gigabit Networking", Addison Wesley 1994

[8]  H. W. Holbrook, S. K. Singhal, D. R. Cheriton, "Log-Based
     Receiver-Reliable Multicast for Distributed Interactive
     Simulation", ACM SIGCOMM 1995

[9]  R. Rivest, "The MD5 Message-Digest Algorithm", INTERNET RFC1321,
     INFORMATIONAL, April 1992

[10] J. Reynolds, J. Postel, "Assigned Numbers", INTERNET RFC1700,
     STD 2, October 1994

[11] J. Nonnenmacher, E. Biersack, D. Towsley, "Parity-Based Loss
     Recovery for Reliable Multicast Transmission", ACM SIGCOMM
     September 1997

[12] L. Rizzo, "Effective Erasure Codes for Reliable Computer
     Communication Protocols", Computer Communication Review, April
     1997

[13] V. Jacobson, "Congestion Avoidance and Control", ACM SIGCOMM
     August 1988

Authors' Addresses

   Tony Speakman
   speakman@cisco.com

   Nidhi Bhaskar
   nbhaskar@cisco.com

   Richard Edmonstone
   redmonst@cisco.com

   Dino Farinacci
   dino@cisco.com

   Steven Lin
   slin@cisco.com

   Alex Tweedly
   agt@cisco.com

   Lorenzo Vicisano
   lorenzo@cisco.com

   Cisco Systems, Inc.
   170 West Tasman Drive,
   San Jose, CA 95134

   Jim Gemmell
   jgemmell@microsoft.com
   Microsoft Bay Area Research Center
   301 Howard Street
   San Francisco, CA 94105