idnits 2.17.1 draft-bormann-mtp-so-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-20) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. 
** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 8 instances of too long lines in the document, the longest one being 14 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 651 has weird spacing: '...rameter bits...' == Line 920 has weird spacing: '...lti/uni sent...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 1997) is 9653 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: '0' is mentioned on line 937, but not defined ** Downref: Normative reference to an Informational RFC: RFC 1301 (ref. '1') -- Possible downref: Non-RFC (?) normative reference: ref. '2' -- Possible downref: Non-RFC (?) normative reference: ref. '3' Summary: 12 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 INTERNET-DRAFT Carsten Bormann, 3 Expires: May 1998 Joerg Ott 4 Universitaet Bremen 5 Nils Seifert 6 TU Berlin 7 November 1997 9 MTP/SO: Self-Organizing Multicast 10 draft-bormann-mtp-so-01.txt 12 Status of this memo 14 This document is an Internet-Draft. Internet-Drafts are 15 working documents of the Internet Engineering Task Force (IETF), its 16 areas, and its working groups. Note that other groups may also 17 distribute working documents as Internet-Drafts. 19 Internet-Drafts are draft documents valid for a maximum of six months 20 and may be updated, replaced, or obsoleted by other documents at any 21 time. It is inappropriate to use Internet-Drafts as reference 22 material or to cite them other than as ``work in progress.'' 24 To learn the current status of any Internet-Draft, please check the 25 ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow 26 Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), 27 munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or 28 ftp.isi.edu (US West Coast). 30 Distribution of this document is unlimited. 
32 Abstract 34 Multiparty cooperative applications have recently received much 35 attention, as has the multicasting of datagrams in the internet. The 36 internet datagram multicasting mechanism is not reliable, often 37 requiring a higher level protocol to achieve the level of reliability 38 required for an application. 40 Much of the extensive work on reliable multicast protocols has 41 assumed relatively stable groups that need to ensure that all 42 messages are received by all members of this well-defined group. 43 Recently, work on loosely coupled teleconferencing has directed 44 attention to a class of multicast applications that scale up to an 45 extent where this assumption is no longer practical. 47 An interesting multicast transport protocol is defined in RFC 1301. 48 MTP provides globally ordered, receiver reliable, rate controlled and 49 atomic transfer of messages to multiple recipients. A revised, more 50 practical version of MTP, the Multicast Transport Protocol MTP-2 has 51 been in use for some time. 53 Self-Organizing Multicast, MTP/SO, uses MTP-2 as a basis and adds 54 spontaneous self-organization of the members of the group into local 55 regions. Scalability is increased by providing passive group joining 56 and local retransmission of lost packets. 58 This version of the document is not yet complete but contains most of 59 the vital parts. 61 1. Introduction 63 Multiparty cooperative applications have recently received much 64 attention, as has the multicasting of datagrams in the internet. The 65 internet datagram multicasting mechanism is inherently unreliable, 66 often requiring a higher level protocol to achieve the level of 67 reliability required for an application. 
Just as TCP has proven to 68 be a useful basis for many applications that could in theory motivate 69 the design of application specific transport protocols, it is likely 70 that generally available reliable multicast protocols would relieve 71 many multiparty applications from the details of efficiently coping 72 with unreliable delivery in their application protocol designs. 74 Much of the extensive work on reliable multicast protocols has 75 assumed relatively stable groups that need to ensure that all 76 messages are eventually received by all members of this well-defined 77 group. Recently, work on loosely coupled teleconferencing has 78 directed attention to a class of multicast applications that scale up 79 to an extent where this assumption is no longer practical. Many 80 other applications in the area of synchronous groupware also do not 81 need the strong property of reliability, but can nonetheless benefit 82 from a multicast protocol providing some weaker form of reliable 83 transport. 85 An interesting multicast transport protocol with a somewhat relaxed 86 view of reliability is defined in RFC 1301 [1]. MTP can be used with 87 unreliable and not necessarily sequence preserving underlying 88 multicast (or broadcast) network protocols such as IP multicast. MTP 89 provides globally ordered, receiver reliable, rate controlled and 90 atomic transfer of messages to multiple recipients. 92 A revised version of MTP, the Multicast Transport Protocol MTP-2, has 93 been used for a number of applications for some time [2]. MTP-2 has 94 been designed to avoid some of the practical problems experienced in 95 using MTP and introduces a number of additional facilities that 96 increase its utility. In particular, MTP-2 no longer has a single 97 point of failure. 99 This document defines Self-Organizing Multicast, MTP/SO. MTP/SO uses 100 MTP-2 as a basis and adds spontaneous self-organization of the 101 members of the group into a hierarchy of local regions. 
Scalability 102 is increased by providing passive group joining and local 103 retransmission of lost packets. 105 2. Requirements 107 Even more so than for unicast protocols, there are difficult trade- 108 offs in designing a multicast protocol. It is unlikely that a single 109 reliable multicast protocol can be applicable to all kinds of 110 multicast applications, from a small set of replicated database 111 systems synchronizing their updates to distributed interactive 112 simulation systems with hundreds of thousands of processes joining 113 and leaving large numbers of groups with high frequency. 115 Any design of a protocol that aims to cover a part of the ground must 116 therefore be explicit about the specific requirements the designers 117 had in mind. Concentrating on any single objective is unlikely to 118 yield a generally applicable protocol. In this section, we list what 119 we perceive to be the main requirements that went into the design of 120 MTP/SO, in order of importance. 122 o Scalability 124 While the actual usage pattern of synchronous group communication 125 software is not yet known, it is clear that groups of wildly 126 different sizes will need to be accommodated. A protocol that is not 127 scalable to large groups with a significant rate of membership change 128 will not be a viable multicast platform. 130 Many existing protocols that focus on reliability require a positive 131 acknowledgement from each recipient to the sender of each message. 132 This obviously does not scale to large groups. Also, group 133 management algorithms that require an acknowledgement from each member 134 to accept a new member are not acceptable in large groups (in 135 particular, building a group creates an n-square problem). 137 As a first level of attack, this scaling problem can be circumvented 138 by using negative acknowledgements (NAKs). 
Unfortunately, this also 139 conflicts with a strict reliability requirement: Not every failure 140 will be immediately detected, since the normal behavior of a 141 recipient, i.e. being silent, cannot be distinguished from a silent 142 failure. There is a trade-off between scalability and the kind of 143 reliability that can be realized. 145 o Efficiency 147 A reliable multicast protocol should be comparable in performance to 148 special protocols specifically designed for an application. Just as 149 TCP generally is slightly less efficient than a specially designed 150 protocol would be, some more packets and additional per-packet 151 overhead as well as some additional processing time will be 152 tolerable. However, the protocol needs to be in the same class of 153 overhead to be applicable to an application. 155 o Robustness and Reliability 157 A reliable multicast protocol should obviously be ``reliable'' in 158 some sense. Given the conflict with scalability, we define 159 reliability to mean: A recipient can (within bounded time) find out 160 when it is failing or being partitioned from active senders. A 161 sender is assured (with sufficient probability) that all its messages 162 reach within bounded time all recipients that are not failing or 163 being partitioned. 165 Obviously, this strict definition of reliability needs to be 166 complemented by some measure of robustness: A protocol that declares 167 failure or creates significant delays in the face of trivial errors 168 may meet this definition but is not useful. In a teleconferencing 169 environment, a desirable robustness property is the ability to 170 continue operating within partitions should the group become 171 partitioned. Ultimately, the applications that use the multicast 172 transport platform should be the ones to decide when the situation 173 has deteriorated to a point where continuing is meaningless. 
175 o Ordering 177 Many applications are simplified considerably when all (or at least a 178 certain subset of all) messages exchanged in the group arrive in the 179 same order at all recipients, even if originated at different 180 senders. 182 3. Overview 184 This section gives an overview of the protocol functions of MTP/SO. 185 (Note to readers who have seen MTP or MTP-2: This overview is given 186 in terms that are more generic than those used in older protocol 187 definitions. In particular, the terms group, coordinator, sender, 188 and receiver have been substituted for the traditional terms web, 189 master, producer, and consumer.) 191 In MTP/SO there are three different roles of members in a group: 192 coordinator, sender and receiver. The coordinator provides the 193 message ordering for all members in a group and oversees the rate 194 control. Senders send data in messages (each sent as a sequence of 195 one or more data packets) after obtaining a token from the designated 196 coordinator. Receivers receive these messages and request the 197 retransmission of packets that did not arrive. 199 In MTP/SO, many actions like retransmitting control packets or 200 requesting retransmissions depend on a time interval that is a 201 parameter of the whole group. This interval is called heartbeat and 202 is measured in microseconds. 204 3.1. Global ordering 206 The coordinator assigns a global sequence number to each message. In 207 the simplest mode of transmission, before a sender is allowed to 208 start sending a new message, it has to obtain a token from the 209 coordinator. This can be done by transmitting a special request 210 packet to the coordinator or by sending the request along with data 211 packets belonging to other messages. The coordinator answers with a 212 confirm packet, which contains the sequence number for the new 213 message. Senders will then send this sequence number in every data 214 packet belonging to the message. 
It is the responsibility of the 215 receivers to deliver messages in the correct order to the 216 applications, if sequenced delivery has been specified for a message. 218 This results in an ordering class called "global ordering", which 219 means that even when there are many senders simultaneously sending 220 messages, every receiver will receive the messages in the same order 221 which corresponds to the order in which they were sent. 223 As the sequencing will quite often result in an additional delay (for 224 example when a short message is preceded by a very long one), 225 applications can assign messages to different streams. A message is 226 delivered irrespective of messages belonging to other streams, even 227 if these carry lower sequence numbers. By using streams, 228 applications can avoid unnecessary delays, simply by assigning 229 independent messages to different streams. 231 A message that can be processed independent of the ones preceding it 232 can be marked with a sequencing_off bit. Messages so marked can be 233 immediately delivered to the application by receivers, even if the 234 stream numbers of preceding messages are still unknown. 236 Normally the coordinator grants the tokens in the same order the 237 token request packets are received. If there is a need to transmit 238 some messages with a higher priority, applications can assign a 239 priority to every message. This priority is only considered while 240 granting a token (hence only when there are many tokens requested at 241 the same time) and has no effect on the transmission rate of the 242 message once a token has been assigned. As a result, when a sender 243 sends messages with different priorities, it is no longer guaranteed 244 that these are received in the same order they were queued for 245 sending -- if they are in the same stream, they are, however, 246 received in the same order by all receivers (including the sender). 248 3.2. 
Rate control 250 Rate control is overseen by the coordinator. A parameter global to 251 the group defines the maximum throughput of the group. The 252 coordinator dynamically adjusts a per-message parameter called window 253 to the number of tokens granted (up to 11). Senders are not allowed 254 to send data packets belonging to one message at an interval smaller 255 than window (measured in microseconds). So the coordinator can 256 ensure that the maximum throughput for the group is not exceeded. 258 3.3. Atomicity 260 At any point in time, each message is assigned a state by the 261 coordinator: pending, accepted, or rejected. 263 The state of a message is set to accepted when the coordinator did 264 receive the complete message. As soon as a sender notices one of its 265 messages to be accepted, it sends an acknowledgement of successful 266 transmission to its application. Such an acknowledgement does not 267 mean that every receiver received the message. It only guarantees 268 that at least the coordinator was able to receive it correctly. (It 269 also provides the sequence number assigned to the message so that the 270 application can order its own messages with respect to other messages 271 it may have received). 273 A message marked as rejected was not completely received (even after 274 requesting retransmissions) by the coordinator. Normally, every 275 receiver will drop such a message and the sender of the message will 276 indicate an unsuccessful-transmission error to its application. 278 Receivers do not deliver pending or rejected messages to the 279 application. If a specific receiver does not completely receive a 280 message (even after requesting retransmissions) that is finally 281 marked by the coordinator as accepted, it will signal this as an 282 unsuccessful-reception error to its application. 
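The delivery rules of this subsection can be summarized in a short sketch. The Python fragment below is illustrative only and not part of the protocol definition; the names MessageState and receiver_action, the returned action strings, and the flag retransmissions_exhausted are assumptions introduced here.

```python
from enum import Enum


class MessageState(Enum):
    """Message states assigned by the coordinator (Section 3.3)."""
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


def receiver_action(state, complete, retransmissions_exhausted):
    """Return what a receiver does with a message (hypothetical helper).

    state: state most recently learned from the coordinator
    complete: True if every data packet of the message has arrived
    retransmissions_exhausted: True if no retransmission can be requested any more
    """
    if state is MessageState.REJECTED:
        return "drop"             # rejected messages are never delivered
    if state is MessageState.PENDING:
        return "wait"             # pending messages are not delivered yet
    # The message is ACCEPTED from here on.
    if complete:
        return "deliver"
    if retransmissions_exhausted:
        return "reception-error"  # unsuccessful-reception error to the application
    return "wait"                 # keep requesting retransmissions
```

For example, an accepted message that never completes at a receiver ends in "reception-error", which corresponds to the unsuccessful-reception indication described above.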
284 In summary, it is guaranteed that a message was either delivered 285 correctly to every receiver, that it was delivered to no receiver and 286 the sender is signalled an error, or that any receiver that did not 287 receive the message is signalled an error. (Of course, the protocol 288 works hard to minimize the number of such errors, but the above 289 statements are guarantees of the protocol.) 291 Atomicity increases the message latency: applications need to wait 292 for the accepted state propagating from the coordinator before they 293 can act on a message. In order to allow every member to quickly 294 learn about the state of messages, every packet contains a copy of 295 the most recent information available about the state of the most 296 recent messages. If application semantics do not require atomicity, 297 unnecessary delay can be avoided by marking a message with 298 atomicity_off. 300 3.4. Retransmission 302 Receivers request retransmissions of data packets when there is a gap 303 in the sequence numbers of data packets received for a message or if 304 no further data packet has arrived for more than one heartbeat while 305 the message is still incomplete. In case all data packets for a 306 message have been lost, this will be recognized from the message 307 state of packets from following messages or when the coordinator 308 propagates the state of the most recent messages. In any case the 309 request for retransmission can be generated at the latest after two 310 full heartbeats. 312 Retransmission requests, or NAKs (negative acknowledgements) for 313 short, are multicast to the group to reduce the implosion problem. 314 Receivers dither the time at which they send NAKs and postpone 315 sending a NAK when they have recently received one or more NAKs that 316 together cover the same set of packets. 318 In order to answer NAKs, senders keep a copy of every data packet 319 they sent. 
To limit the number of packets stored, senders are 320 allowed to discard these copies after a defined period of time, which 321 is measured in heartbeats and depends on a special factor called 322 retention. After retention+4 heartbeats the copies are no longer 323 available and requests for retransmissions received after that period 324 are denied with a special control packet. This makes sure packets 325 are available for at least retention retransmissions. 327 Nonetheless, there is a nonzero probability that all retransmissions 328 (or retransmission requests) related to a packet are lost and some 329 receivers do not receive the message correctly. For example, a 330 network partition that lasts longer than heartbeat*retention will 331 result in lost messages. 333 This sounds undesirable, but it is similar to the retry limit used in 334 positively acknowledged protocols, except that the normally relatively 335 small value of heartbeat*retention limits the length of an 336 outage that can be tolerated. We assume that the application 337 protocol will have a way to handle receivers that experience such a 338 long gap in reception, because it already needs a way to treat new 339 members that appear late in the group. (Note that for applications 340 where this is undesirable, MTP/SO could be augmented by log servers 341 as in [3].) In any case, MTP/SO guarantees that when a message was 342 not completely received by every receiver, either the affected 343 receivers or the responsible sender will indicate the error to the 344 application. 346 3.5. Self-organization and Repeaters 348 Once MTP/SO groups get large, even the handling of NAK-based 349 retransmission traffic becomes a scalability problem. As with many 350 scaling problems, the obvious solution is to introduce some form of 351 hierarchy into the group. This allows at least some of the NAKs and 352 resulting retransmissions to be handled locally within trunks and 353 branches of that hierarchy. 
As MTP/SO is a many-to-many protocol, it 354 does not make much sense to base the hierarchy on the multicast tree 355 from any specific sender (including the coordinator, which generally 356 is not the sole sender and which may transfer its role to another 357 member during the activity of the group). 359 Instead, MTP/SO introduces the concept of a regional repeater. 360 Receivers multicast NAKs locally before multicasting them to the 361 entire group. Repeaters that have previously received the requested 362 data retransmit locally after receiving a local NAK. Repeaters that 363 don't have the data just relay the NAK to the next higher level of 364 hierarchy, up to the whole group (where, finally, the sender replies 365 with another copy of the data). 367 A prerequisite to this mechanism is a way to multicast locally 368 (a NAK as well as a retransmission). In current IP multicast 369 implementations, one way to define such regions is with TTL threshold 370 scoping; with IPv6, administrative scoping will provide a similar 371 method. The algorithms described in the rest of this section work 372 best when such a scoping mechanism is in effect; leaks or other 373 imperfections in the scoping boundaries do not cause catastrophic 374 failures, though. The following discussion assumes three levels of 375 scopes, e.g., site, country, and continent; the exact choice of 376 number and extent of scopes is a global parameter of the group. 378 With three local and one global scope, each group member is by 379 definition in four scopes, where each local scope is contained in the 380 next higher scope in the hierarchy. Any member that takes on a 381 receiver role can in principle also be a repeater for any of the 382 local scopes (each member can decide whether it wants to be a 383 potential repeater or not, e.g. depending on the cost structure of 384 the Internet service or on the availability of local memory space). 
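How a repeater reacts to a NAK under the example scope hierarchy can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a definition of the protocol: the scope names are taken from the example in the text, and the function name repeater_response and the returned tuples are hypothetical.

```python
# Scope names follow the example in the text (three local scopes plus the
# global scope); the actual set of scopes is a global parameter of the group.
SCOPES = ["site", "country", "continent", "global"]


def repeater_response(nak_scope, has_data):
    """Decide how a repeater reacts to a NAK seen in nak_scope (hypothetical helper).

    Returns a pair (action, scope):
      ("retransmit", s)  retransmit the data within scope s
      ("relay", s)       relay the NAK into the next higher scope s
      ("none", s)        nothing to do; at the global scope the sender replies
    """
    if has_data:
        return ("retransmit", nak_scope)
    level = SCOPES.index(nak_scope)
    if level + 1 < len(SCOPES):
        return ("relay", SCOPES[level + 1])
    return ("none", nak_scope)
```

A repeater without the data at the site scope would thus relay the NAK to the country scope, and so on up to the whole group.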
386 For scopes that contain only one member, it does not matter whether a 387 member considers itself to be a repeater for that scope or not. For 388 scopes that contain more than one member, a protocol is needed that 389 makes this fact known and selects one member as the repeater. This 390 protocol need not ensure that there is exactly one 391 repeater for each scope at any time, as the retransmission protocol 392 still works without a repeater or with more than one repeater per 393 scope, albeit less efficiently. 395 Repeater selection should favor the ``best'' member in the scope, 396 i.e. a member that has particularly good reception from the senders, 397 as it is most likely that this member will have received the data to 398 be able to perform a local retransmission. Each potential repeater 399 therefore maintains a reception quality parameter that, to a first 400 approximation, tracks the ratio of the number of 401 recently correctly received packets to the number of packets that 402 should have been received. 404 Members that consider themselves the repeater for a scope periodically 405 multicast a repeater announcement message within the scope, 406 containing the current value of the reception quality parameter. 407 Potential repeaters observe these messages. If, within the most 408 local scope, a potential repeater has a considerably better reception 409 quality parameter than the current repeater, it sends a repeater 410 announcement at the start of its next heartbeat interval and assumes 411 the role of the repeater. Only the repeaters of the most local scope 412 compete for the repeater role of the next higher scope, and so on. 413 (A new repeater that displaces a member that was repeater at higher 414 level scopes also announces itself as repeater at these higher level 415 scopes.) 
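The reception quality parameter and the takeover decision might be sketched as below. This is an illustration only: the draft does not quantify "considerably better", so the margin of 0.1 is an assumption, as are the function names and the choice to report a quality of 0.0 when no packets were expected.

```python
def reception_quality(received, expected):
    """Ratio of recently received packets to packets that should have arrived.

    Returning 0.0 when nothing was expected is an assumption of this sketch.
    """
    if expected == 0:
        return 0.0
    return received / expected


def should_take_over(own_quality, repeater_quality, margin=0.1):
    """True if a potential repeater should announce itself (hypothetical rule).

    margin stands in for "considerably better"; its value is an assumption.
    """
    return own_quality > repeater_quality + margin
```

With these definitions, a potential repeater that received 95 of 100 packets would displace a repeater announcing a quality of 0.8, while one that received 85 of 100 would not.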
417 To better cope with repeater failure, receivers that are not 418 repeaters send NAKs at the most local scope first and escalate them 419 up the hierarchy if neither a retransmission nor a more global NAK 420 follows within one heartbeat. Repeaters for a set of scopes begin 421 sending NAKs within the next higher scope and then escalate them the 422 same way. Retransmissions always occur at the highest level of scope 423 that the NAKs leading to that retransmission carried (NAKs have a 424 scope field for this purpose). 426 A repeater that leaves a group simply sends a repeater announcement 427 with reception quality zero. A repeater that crashes stops sending 428 repeater announcements, causing potential repeaters to start sending 429 repeater announcements after a time interval that is inversely 430 related to their reception quality parameter. 432 3.6. Coordinator function 434 As it is responsible for assigning tokens and updating the message 435 state, the coordinator plays a central role in MTP/SO. If the member 436 carrying the coordinator function leaves the group, the coordinator 437 function will be passed to one of the remaining members 438 automatically. 440 To avoid the coordinator being a single point of failure, MTP/SO 441 provides a coordinator recovery function. This allows the group to 442 elect a new coordinator when the old one crashes or becomes 443 unreachable. The new coordinator will then collect all information 444 needed from the group members so that absolutely no information is 445 lost. (This protocol should be, but is not yet, integrated with the 446 repeater function.) 448 In order to enhance the performance of MTP/SO it may be useful to 449 actively influence which member performs the coordinator function. 
450 For example, if only one member will send messages for a longer period 451 of time, the group can migrate the coordinator function to that 452 member, thereby avoiding the overhead caused by requesting and 453 obtaining tokens (between one and two packets for every message). 454 MTP/SO allows either a member to request the coordinator function for itself 455 or the coordinator to pass the coordinator function to another 456 member. 458 3.7. Membership classes 460 Not all members of the group will be in a position to take over the 461 functions of a coordinator or of a repeater. We therefore 462 distinguish several ``classes'' of members:
464       |
465 class | description
466 ------+----------------------------------------------------
467   1   | normal member, potential coordinator and repeater
468   2   | normal member, potential repeater
469   3   | normal member
470   4   | unreliable receiver, normal sender
471   5   | unreliable member
473 Most members of an MTP/SO group will be class 1 members, i.e. they 474 are prepared to take over the coordinator role if this is required in 475 a coordinator recovery. Class 2 members do not want to take on this 476 role (for application reasons or for reasons of limited resources), 477 but compete for the repeater function. Class 3 members take over 478 neither special function, but take part as normal members in the 479 group; in particular, they are allowed to send NAKs. 481 Class 4 members never send NAKs. Their reception of messages in the 482 group is therefore unreliable. Nonetheless, they can originate 483 messages that are reliably received by the members of classes 1 to 3 484 of the group. One way to join an MTP/SO group is to start as a class 485 4 member, send a message at an appropriate time, and upgrade to a 486 higher class when the message has been accepted by the coordinator. 
488 Class 5 members listen only; the only packet type they can send to 489 the group is unreliable multicast datagrams (not yet described in 490 this version of the draft). When a minimum quality of 491 transmission/reception is defined for the group (see group[info] 492 packets below), members may have to downgrade themselves to class 5 493 when they find out their own quality has dropped below the acceptable 494 level. 496 4. Protocol Definition 498 4.1. Notational Conventions 500 For convenience, the datagrams transmitted by MTP/SO group members 501 are called packets in this document. 503 MTP/SO packet types are written major[minor], where major is the 504 major type of the packet and minor is the subtype within the major 505 type. E.g., there are data[data] packets as well as data[eom] 506 packets. 508 4.2. Protocol Functions and Packet Types 510 o Heartbeat 511 All members operate on a time line that is divided into heartbeats. 512 The nominal length of a heartbeat is a global parameter of the group. 513 The actual heartbeat boundaries (or heartbeats for short) are 514 dithered around the nominal value. Most protocol actions are 515 performed at the start of a new heartbeat interval. The only 516 exception is the actual transmission of data packets, which is evenly 517 distributed over the heartbeat interval to which the data packets are 518 allocated. 520 o Global Ordering 522 A sender that wants to send a message applies for a token by 523 unicasting a token[request] packet to the coordinator. 524 Alternatively, the sender can include a token request field in a data 525 packet that is sent under a previously obtained token. 527 As soon as a token becomes available, the coordinator replies with a 528 token[confirm] containing a new global sequence number, under 529 consideration of the queue of token requests and the priority of the 530 token request. 
The sender uses this global sequence number as the
531 message number in every data packet pertaining to this message.

533 o Message Acceptance

535 The coordinator maintains the message acceptance state for recent
536 messages. For the 12 most recent messages, the message acceptance
537 state is disseminated in every packet. Packets sent by the
538 coordinator contain the current message acceptance state; packets
539 sent by other members contain a copy of the most recent message
540 acceptance state available to that sender (for data packets, this is
541 often the state obtained via the token[confirm] packet). As the
542 field that is used to disseminate that state only has 12 entries, the
543 number of messages that can be pending at any point in time is
544 limited.

546 To ensure that the most recent message acceptance state is always
547 disseminated, the coordinator sends an empty[info] packet in every
548 heartbeat in which no other member is scheduled to send packets based
549 on tokens sent out.

551 o Retransmissions

553 At each heartbeat, receivers that are missing packets of a message
554 multicast nak[request] packets (see also the discussion of
555 self-organization and repeaters above). A nak[request] contains a
556 list of ranges of sequence numbers for one or more messages. Ranges
557 can be open, i.e. implicitly include all further packets when the
558 ending packet number is not known. A receiver that receives a
559 nak[request] postpones sending its own nak[request] for the set of
560 packets listed in the received one. Empty nak[request] packets are
561 never sent.

563 If a sender no longer has a copy of the data that needs to be
564 retransmitted, it multicasts a nak[deny] packet.

566 4.3. Addresses

568 An MTP/SO group has one group address and as many member addresses as
569 there are members.

571 The member address is the combination of a 128-bit IPv6 host address
572 (possibly in IPv4 compatibility format, i.e.
with 96 bits of leading
573 zeroes) and a 16-bit UDP port number.

575 The group address is the pair of a 128-bit IPv6 multicast address
576 (again, possibly IPv4 compatible) and a group-ID. The group-ID
577 simply is the member address of the current coordinator.

579 MTP/SO multicasts always use the UDP destination port number 47112
580 (to be assigned) and the UDP source port number from the member
581 address. MTP/SO unicasts use UDP source and destination port numbers
582 in the range 47112+1 to 49152-1 (note that the number 49152 marks the
583 end of the medium priority port number space in some current IP
584 multicast router implementations).

586 4.4. Packet Formats
587 Figure 1: Standard packet header
588 0 1 2 3
589 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
590 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
591 | Version | Type | Mod | (Port part) |
592 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- -+-+
593 | (Address part) |
594 +- -+
595 | |
596 +- For multicast packets: Group ID -+
597 | |
598 +- -+
599 | |
600 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
601 | Heartbeat | Coordinator State Sequence Number |
602 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
603 | Retention | Message Acceptance Sequence Number |
604 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
605 |T| Number|Prio | |Mes|sag|e A|cce|pta|nce| St|ate| Ar|ray| |
606 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
607 | Window |
608 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

610 The standard packet header contains the following fields:

612 o Version

614 For the current version of MTP/SO, version is always 3.

616 o Type, Mod

618 Packet type and type modifier (subtype).

620 o Group ID

622 For multicast packets, this field gives the member address of the
623 current coordinator. For unicast packets, this field is not used.
625 o Coordinator State Sequence Number

627 A sequence number for the version of the coordinator state that is
628 disseminated with this message.

630 o Message Acceptance Sequence Number, Message Acceptance State Array

632 Let n be the message acceptance sequence number; then the message
633 acceptance state array contains the most recent message acceptance
634 states known for messages n-1 to n-12:

636 0 pending
637 1 accepted
638 2 rejected
639 3 (reserved)

641 o T, Number, Prio

643 If the T bit is set, Number gives the serial number and Prio the
644 priority of a token request piggybacked in this packet.

646 o Heartbeat, Retention, Window

648 Current values for these three global parameters of the group. These
649 parameters are given as pseudo-floating-point numbers:

651 parameter  bits  mantissa (msb)  exponent (lsb)  unit
652 ------------------------------------------------------------------------------
653 heartbeat     8               3               5  microseconds (0 to 7*2^32)
654 retention     8               4               4  1 (0 to 15*2^16)
655 window       16              11               5  microseconds (0 to 2047*2^32)

657 Figure 2: token[request]
658 0 1 2 3
659 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
660 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
661 |1| Number|Prio |1| Number|Prio |
662 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
663 |1| Number|Prio | . . .
664 +-+-+-+-+-+-+-+-+-

666 A token[request] packet is unicast from a member to the coordinator
667 to apply for one or more tokens. Each of these requests for a token
668 contains a serial number of that request plus a request priority.
669 The first token request is carried in the token request part of the
670 standard header; additional token requests can be sent in the packet
671 type specific part following the standard header.
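
The small header fields described above can be unpacked with a few
shifts and masks. The following is a minimal sketch under stated
assumptions, not a normative decoder. All function names are
hypothetical. The token request octet is read as 1 + 4 + 3 bits (T,
Number, Prio), as drawn in Figures 1 and 2. For the message
acceptance state array it is assumed that the most significant 2-bit
entry belongs to message n-1, and sequence number wrap-around is
ignored. For the pseudo-floating-point parameters it is assumed that
the value is mantissa * 2^exponent; the ranges quoted in the table do
not match that reading exactly, so the real mapping may differ.

```python
STATES = ("pending", "accepted", "rejected", "reserved")


def unpack_token_request(octet: int) -> tuple[bool, int, int]:
    """Split a T|Number|Prio octet into (T, Number, Prio)."""
    t_bit = bool(octet >> 7)          # bit 0 (most significant)
    number = (octet >> 3) & 0x0F      # bits 1 to 4: request serial number
    prio = octet & 0x07               # bits 5 to 7: request priority
    return t_bit, number, prio


def unpack_acceptance_array(field: int, n: int) -> dict[int, str]:
    """Map messages n-1 .. n-12 to their states from the 24-bit array.

    Assumption: the first (most significant) entry is message n-1.
    """
    return {n - 1 - i: STATES[(field >> (22 - 2 * i)) & 0x3]
            for i in range(12)}


def decode_pseudo_float(raw: int, exponent_bits: int) -> int:
    """Decode mantissa (high bits) and exponent (low bits).

    Assumption: value = mantissa * 2**exponent.
    """
    mantissa = raw >> exponent_bits
    exponent = raw & ((1 << exponent_bits) - 1)
    return mantissa << exponent
```

For example, a heartbeat octet with mantissa 3 and exponent 4 would
decode to 48 microseconds under the assumed mapping.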
673 Figure 3: token[confirm], token[cancel]
674 0 1 2 3
675 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
676 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
677 | New Message Sequence |
678 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
679 | Number |
680 +-+-+-+-+-+-+-+-+

682 A token[confirm] is unicast from the coordinator to the member that
683 requested the token. A token[cancel] can be used by the
684 token-holding member to return the token to the coordinator.

686 Figure 4: data packets (except data[eom])
687 0 1 2 3
688 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
689 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
690 | stream number |
691 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
692 |S|A|R|0 0|O| L | Message Sequence Number |
693 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
694 | Packet Sequence Number |
695 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
696 | Data |
697 : :

699 The S bit, if set, indicates that ordered delivery is not required
700 for this message (``sequencing_off''). The A bit, if set, indicates
701 that atomic delivery is not required for this message
702 (``atomicity_off''). The R bit, if set, indicates that this message
703 is not transmitted reliably, which means that the producer is not
704 going to answer any nak[request]s. Consumers are expected to wait
705 for any missing packet of this message for one heartbeat and then
706 mark the message as not received. The O bit (``original'') is set
707 only for the first transmission of the data packet by the original
708 sender. It is reset for any kind of retransmission (regardless of
709 whether it is performed by the original sender or not).
710 L (``level'') is a binary number ranging from 0 to 3. Level 0
711 indicates a global transmission; levels 1 to 3 indicate transmission
712 of the packets at the second most global to most local level scope,
713 resp.
(For a retransmission, the transmission level indicates the
714 scope in which this data packet was sent; lower level repeaters can
715 use this information to decide whether they can defer their own
716 retransmissions.)
717 Figure 4a: data[eom]
718 0 1 2 3
719 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
720 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
721 | stream number |
722 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
723 |S|A|0 0 0|O| L | Message Sequence Number |
724 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
725 | Packet Sequence Number |
726 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
727 | 0 (AL) | |
728 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- -+
729 | |
730 +- -+
731 | |
732 +- original sender's member address -+
733 | |
734 +- -+
735 | |
736 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
737 : authentication information (optional) :
738 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
739 | Data |
740 : :

742 To ensure that the original sender of a message becomes known even if
743 the only packets a receiver has received from this message were
744 repeater retransmissions, the data[eom] packet differs from the other
745 data packets in that it contains a copy of the original sender's
746 member address. (Note that this information is redundant for packets
747 that have the O-bit set; it is retained in favor of a common packet
748 format for all cases.) With an optional authentication protocol (not
749 specified in this version of the document), authentication
750 information can be given with this last packet of the message; the
751 length in 32-bit words is in AL.
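
The first ten octets of the type-specific part are laid out the same
way in Figures 4 and 4a, so one routine can read them for every data
packet. The sketch below assumes network byte order and that the
buffer starts at the stream number, i.e. directly after the standard
header. The names DataPrefix and parse_data_prefix are hypothetical.
In data[eom] packets the position of the R bit is shown as zero, so
the unreliable field would simply read as false there.

```python
import struct
from dataclasses import dataclass


@dataclass
class DataPrefix:
    stream: int
    sequencing_off: bool   # S bit
    atomicity_off: bool    # A bit
    unreliable: bool       # R bit
    original: bool         # O bit
    level: int             # L, 0 (global) to 3 (most local)
    message_seq: int       # 24-bit message sequence number
    packet_seq: int        # 32-bit packet sequence number


def parse_data_prefix(buf: bytes) -> DataPrefix:
    """Parse stream number, flag octet and both sequence numbers."""
    stream, word, packet_seq = struct.unpack("!HII", buf[:10])
    flags = word >> 24                 # S A R 0 0 O L L
    return DataPrefix(
        stream=stream,
        sequencing_off=bool(flags & 0x80),
        atomicity_off=bool(flags & 0x40),
        unreliable=bool(flags & 0x20),
        original=bool(flags & 0x04),
        level=flags & 0x03,
        message_seq=word & 0xFFFFFF,
        packet_seq=packet_seq,
    )
```

A repeater could use the original and level fields of a received
retransmission to decide whether to defer its own retransmission, as
described for the L field above.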
753 Figure 5: nak[request], nak[deny]
754 0 1 2 3
755 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
756 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
757 | 0 | L |
758 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
759 |F| 0 | Message Sequence Number |
760 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
761 | Packet Sequence Number (Low) |
762 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
763 | Packet Sequence Number (High) |
764 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
765 |F| 0 | Message Sequence Number |
766 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
767 | Packet Sequence Number (Low) |
768 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
769 | Packet Sequence Number (High) |
770 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
771 : :

773 The F bit, if set, indicates that, starting at the packet sequence
774 number (low), all packets from the given message are missing. As
775 with data packets, L gives the scope level at which this NAK is being
776 multicast/replied to. NAK request and deny packets inhibit the
777 transmission of further such packets from other potential
778 transmitters (for one heartbeat) only at the level of scope given. A
779 retransmission that is a response to a NAK request should be sent at
780 the level of scope given.
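
The list of ranges in a NAK packet can be sketched as a sequence of
three-word entries. This is an illustration under assumptions, not a
definitive encoding: the F bit is taken to be the most significant
bit of the first word, followed by 7 zero bits and a 24-bit message
sequence number; the High word is assumed to be present but ignored
(written as zero) when F is set; and the 16-bit field carrying L
ahead of the entries is left out. The function names are
hypothetical.

```python
import struct
from typing import List, Optional, Tuple

# (message sequence number, low packet number, high packet number);
# high is None for an open range (F bit set).
Range = Tuple[int, int, Optional[int]]


def encode_nak_entries(ranges: List[Range]) -> bytes:
    """Encode ranges as consecutive 12-octet entries."""
    out = bytearray()
    for message_seq, low, high in ranges:
        first = message_seq & 0xFFFFFF
        if high is None:
            first |= 0x80000000        # F bit: everything from low onwards
        out += struct.pack("!III", first, low, 0 if high is None else high)
    return bytes(out)


def decode_nak_entries(buf: bytes) -> List[Range]:
    """Decode consecutive 12-octet entries back into ranges."""
    ranges: List[Range] = []
    for first, low, high in struct.iter_unpack("!III", buf):
        is_open = bool(first & 0x80000000)
        ranges.append((first & 0xFFFFFF, low, None if is_open else high))
    return ranges
```

Since empty nak[request] packets are never sent, a caller would skip
transmission altogether when the list of ranges is empty.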
782 Figure 6: status[request], status[deny]
783 0 1 2 3
784 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
785 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
786 | scope | 0 |
787 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
788 | 0 | Message Sequence Number |
789 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
790 | 0 | Message Sequence Number |
791 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
792 : :

794 A status request packet can be multicast by a member to request
795 status for messages that already have scrolled off the message
796 acceptance state array in the standard header. A status deny
797 response indicates that the retention time for keeping information
798 about the status of the messages has passed.

800 Figure 7: status[info]
801 0 1 2 3
802 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
803 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
804 | 0 |
805 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
806 |U| S | 0 | Message Sequence Number |
807 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
808 |U| S | 0 | Message Sequence Number |
809 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
810 : :

812 Responding to status requests, a repeater (for local scopes) or the
813 coordinator can multicast status info. The U bit, if set, indicates
814 that the status of the given message is unknown. The S field gives
815 the message acceptance state as in the message acceptance state
816 array.

818 Figure 8: group[seek]
819 0 1 2 3
820 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
821 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
822 | Scope | 0 |C|K|
823 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
824 | Group Name . . .
825 +-+-+-+-+-+-+-+-+-

827 The K-bit, if set, indicates that reliable receiver status
828 (membership class 1 to 3) is intended, i.e., that an explicit
829 acknowledgement for this member has to be given within a group[info].
830 The C-bit, if set, indicates that the transmitter is a potential
831 coordinator (membership class 1); it causes other potential
832 coordinators with a higher member address to back off. The scope
833 field gives the actual scope in which this packet was transmitted
834 (this cannot just be given as a scope level number as the actual
835 scope levels used in this group may not yet be known to the
836 transmitter).

838 Figure 9: group[info]
839 0 1 2 3
840 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
841 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
842 | Quality |
843 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
844 | Activity | 0 |U|E| L |
845 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
846 | TTL Scope 0 | TTL Scope 1 | TTL Scope 2 | TTL Scope 3 |
847 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
848 | Network Packet Size |
849 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
850 | min. Receive Quality | min. Send Quality |
851 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
852 | Group Name Length | Group Name ... :
853 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
854 : (zeros) 4 byte alignment |
855 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
856 : type | length | extension :
857 :-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
858 : :
859 :-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-:
860 : type | length | extension :
861 :-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
862 : :

864 The group[info] packet is periodically transmitted by the coordinator
865 and by each repeater to ensure that all group members are aware of
866 the global parameters of the group and of the quality of the current
867 repeater.
869 Two parameters give dynamic information about the transmitter and
870 about the group: Quality is the (0,16 bit fixed point) product of
871 reception and transmission quality of the transmitter. Activity is a
872 measure for the recent activity of this group (useful for merging
873 decisions by applications).

875 The other fields of the packet give global group parameters that
876 usually are constant: The U-Bit (``unreliable''), when set, indicates
877 that this group operates entirely without NAKs and retransmissions.
878 The E-Bit (``elect'') is set for group[info] packets originated by
879 the coordinator in case it is willing to transfer the coordinator
880 function to a higher quality member; it requests other potential
881 coordinators to announce their quality (if better) via group[info].
882 L gives the scope level, and, indirectly, the source of the
883 group[info]: level 0 packets are originated by the coordinator or by
884 other potential coordinators (the latter if the source address is not
885 equal to the coordinator part of the group address), level 1 to 3
886 packets are originated by repeaters of the respective level.
887 Analogously, the TTL fields provide the TTL scopes of the levels: TTL
888 0 is the scope of the entire group, TTL 1 to TTL 3 give the scopes of
889 the most global to most local repeater levels. Setting the scope for
890 a level to zero indicates that this level is not in use. The fields
891 minimal send quality and minimal receive quality give minimum levels
892 of quality for a member that wants to send reliable messages or that
893 wants to request retransmissions (reliable reception); if not met,
894 they cause the member to assume a lower membership class.

896 At the end of the fixed part of group[info] packets, extensions can
897 be added. Their type is identified by a one-byte type code, their
898 length by a one-byte length field giving the number of 32-bit
899 words beyond the initial one in this extension.
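
The type/length framing of the extensions can be walked with a short
loop. The following is a minimal sketch, assuming the buffer starts
at the first extension (after the padded group name) and contains
nothing but extensions; the function name iter_extensions is
hypothetical. Each extension occupies 4 * (1 + length) octets, of
which the first two are the type and length octets.

```python
from typing import Iterator, Tuple


def iter_extensions(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, payload) for each extension in buf."""
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 4:
            raise ValueError("truncated extension header")
        ext_type = buf[offset]
        extra_words = buf[offset + 1]  # 32-bit words beyond the first
        end = offset + 4 * (1 + extra_words)
        if end > len(buf):
            raise ValueError("extension runs past end of packet")
        # Payload: rest of the first word plus all following words.
        yield ext_type, buf[offset + 2:end]
        offset = end
```

With a length of 4, the payload is 18 octets, which is exactly the
size of a member address (16-bit port plus 128-bit address).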
901 Figure 9a: group[info] extension for member acks
902 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
903 : 1 | 4 | (Port part) :
904 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- -+-+
905 : (Address part) :
906 +- -+
907 : Acknowledged :
908 +- Member-Address -+
909 : :
910 +- -+
911 : :
912 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

914 Type 1 group[info] extensions are used to carry an acknowledgement for
915 a group[seek] request by a member that needs to achieve reliable
916 reception status quickly (K-bit in group[seek] set).

918 4.5. Summary of packet types

920 packet type         type[code]  multi/uni  sent by   see Figure
921 ----------------------------------------------------------------------------
922 data[data]          0[0]        m          C,R,s     4
923 data[eom]           0[1]        m          C,R,s     4a
924 data[dally]         0[2]        m          C,R,s     4 *)
925 data[ceom]          0[3]        m          C,R,s     4 *)
926 nak[request]        1[0]        m          r         5
927 nak[deny]           1[1]        m          C,s       5
928 group[info]         2[0]        m          C,R       9
929 group[seek]         2[1]        m          C,R,s,r   8
930 quit[order]         3[0]        u          C,R       *)
931 token[request]      4[0]        u          s         2
932 token[confirm]      4[1]        u          C         3
933 token[cancel]       4[2]        u          s         3
934 status[request]     5[0]        m          C,R,s,r   6
935 status[deny]        5[1]        m          C         6
936 status[info]        5[2]        m          C,R       7
937 coord[suspected]    6[0]        m          R,s,r     *)
938 coord[established]  6[1]        m          C         *)
939 coord[seek]         6[2]        m          C         *)

941 multi/uni: m is multicast, u is unicast.

943 sent by: C is coordinator, R is repeater, s is sender, r is receiver.

945 *) Not yet described in the present version of the document.

947 5. References

949 [1] S. Armstrong, A. Freier, K. Marzullo: ``Multicast Transport
950 Protocol'', RFC 1301, February 1992.

952 [2] C. Bormann, J. Ott, H.-C. Gehrcke, T. Kerschat and N. Seifert:
953 ``MTP-2: Towards Achieving the S.E.R.O. Properties for Multicast
954 Transport'', International Conference on Computer Communications
955 and Networks (ICCCN 94), 1994 (available from
956 ftp://ftp.cs.tu-berlin.de/pub/local/kbs/mtp/doc/sero.ps).
958 [3] H.W. Holbrook, S.K. Singhal and D.R. Cheriton: ``Log-based
959 Receiver-Reliable Multicast for Distributed Interactive
960 Simulation'', SIGCOMM '95, Cambridge, MA, August 1995.

962 6. Authors' addresses

964 Carsten Bormann, Joerg Ott
965 Universitaet Bremen FB3 TZI
966 Postfach 330440
967 D-28334 Bremen, GERMANY
968 cabo, jo@tzi.org
969 phone +49.421.218-7024

971 Nils Seifert,
972 Technische Universitaet Berlin FR6-3
973 Franklinstrasse 28/29
974 D-10587 Berlin, GERMANY
975 nilss@cs.tu-berlin.de
976 phone +49.30.314-73389