IP Over NBMA Working Group                                   Eric Mannie
INTERNET-DRAFT                                            Marc De Preter
Expires 21st of April 1997                                     (ULB-STC)

                                                            October 1996

               Multicast Synchronization Protocol (MSP)

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Abstract

This document defines a Multicast Synchronization Protocol (MSP)
designed to avoid the traditional problems related to the use of
unicast (point-to-point) synchronization protocols, such as those
encountered with OSPF, P-NNI, SCSP and epidemic protocols. These
protocols imply the establishment and maintenance of a topology of
servers (tree, star, line, mesh, ...). It is not obvious to find either
the best topology for a given synchronization protocol or the best
algorithm to create this topology.
An attempt to study the influence of the spatial distribution
(topology) has been made, for instance, for the Epidemic algorithms of
Xerox PARC, and showed interesting results. Moreover, traditional
synchronization protocols notably imply a convergence time and a
traffic load which are proportional to the size of the topology. We
believe that reducing the topology to the set of members of a single
multicast group can reduce both the convergence time and the traffic.
Note that, in that case, no configuration algorithm is required.

1. Introduction

MSP allows the synchronization of a replicated database between servers
which are all members of the same server group (SG). It takes advantage
of the multicast capabilities provided by a growing number of
underlying layers. It is suitable in environments supporting multicast
(group) addresses (such as Ethernet and SMDS) or point-to-multipoint
connections (such as ATM). In this context, all servers can directly
communicate with all other servers in the same group, using the
underlying multicast capability. No particular topology (except the
underlying multicast topology itself) and no configuration algorithm
(such as the Spanning Tree) are required. No problem due to critical
links or topology partitions occurs. The protocol is very robust, as an
update generated by a server is directly received by all other servers
in the same group. Finally, MSP is a generic protocol, defined
independently of the particular database or cache to synchronize.

2. Overview

MSP is a generic protocol defined independently of the particular
database to synchronize. MSP pdus may either be self-transported or be
carried as fields of other protocols. For instance, MSP can be
supported on top of an IP multicast service, an Ethernet network or an
ATM service, or it can be a part of Classical IP, MARS or NHRP. It can
even be used to synchronize different databases in parallel, in the
same pdus.

Each server is the owner of a part of the database (e.g. the bindings
registered by its local clients) and has a unique server ID. It
maintains a copy of the complete database (replicated database); each
entry is either locally owned or learned from another server. An entry
which belongs to a particular server is tagged with a timestamp (event
time) and the server ID of its owner. This timestamp identifies an
update of the entry and is set each time the entry is updated. The
timestamp is unique in the context of the owner and is the
concatenation of a standard time and a sequence number.

All servers are directly connected by a multicast group or a mesh of
point-to-multipoint connections. Each entry update generated by a
server is sent with its timestamp and the server ID. The update is
directly received by all other servers, and each local database is
updated accordingly. Updates are packed with their timestamps in pdus,
which are logically grouped into transactions. Transactions speed up
the detection of the loss of a part of the pdus.

Pdus are not positively acknowledged (ACK) by receiving servers. If a
server detects some missing pdus, it sends a NACK to the multicast
group for the corresponding missing updates, identified by their
timestamps. In order to avoid an implosion of concurrent NACKs and to
reduce the total number of transmitted NACKs, a technique similar to
IGMP is used. A server waits for a delay randomly chosen between zero
and D milliseconds before sending a NACK. If, during this period,
another NACK is seen for the same timestamps, the NACK generation is
cancelled and the server waits for the retransmitted updates. This
scheme is based on the fact that, if a pdu is lost in the context of a
multicast group, more than one server has probably missed it. In
addition, it should be noted that such pdus are essentially small and
that the expected error rate of the underlying layer should be very low
(e.g. ATM). This suppression scheme is sketched below.
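As a non-normative illustration, the following C fragment sketches the
NACK suppression scheme; all names, types and the value of D are
assumptions of the sketch, not part of the protocol:

   #include <stdlib.h>

   #define D_MS 500                      /* assumed upper bound D (ms) */

   struct timestamp { unsigned long time, seq; };

   static int ts_leq(struct timestamp a, struct timestamp b)
   {
       return a.time < b.time || (a.time == b.time && a.seq <= b.seq);
   }

   struct nack_state {
       int    pending;                   /* a NACK is scheduled        */
       long   fire_at;                   /* absolute time to send it   */
       struct timestamp from, to;        /* missing interval           */
   };

   /* A gap was detected: wait a random delay in [0, D] milliseconds
    * before sending the NACK. */
   static void schedule_nack(struct nack_state *s, long now,
                             struct timestamp from, struct timestamp to)
   {
       s->pending = 1;
       s->fire_at = now + rand() % (D_MS + 1);
       s->from = from;
       s->to   = to;
   }

   /* Another server NACKed an interval covering ours first: cancel
    * our own NACK and wait for the retransmitted updates instead. */
   static void foreign_nack_seen(struct nack_state *s,
                                 struct timestamp from,
                                 struct timestamp to)
   {
       if (s->pending && ts_leq(from, s->from) && ts_leq(s->to, to))
           s->pending = 0;
   }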
Small HELLO pdus are generated periodically, at each hello interval,
and include the server ID and the last attributed timestamp of the
sender. HELLO pdus are needed to detect the loss of a set of complete
transactions.

When a server or a connection (re)starts, either the whole database has
been lost or only the most recent updates have not been received. The
corresponding server sends a SYNC pdu to the multicast group, either
indicating that the whole database has to be retransmitted or listing
each server ID with its last received timestamp. SYNCs are not sent
directly; the same scheme as the one used for NACKs is applied again,
which allows a server to take advantage of a resynchronization
requested by another server. Receiving servers send transactions for
the missing information, and may use the same scheme as the one used
with NACKs and SYNCs. Each server only sends its own updates, and only
if their timestamps are greater than the requested one. If some
timestamps are no longer available, the corresponding information has
been replaced by a more recent update, and only this last update has to
be retransmitted by its owner. Only more recent updates are
retransmitted. Selective SYNCs are resent if a part of the requested
updates has not been received; obsolete timestamps are advertised.

From the configuration point of view, only a single multicast address
or a list of server addresses has to be configured (e.g. obtained
through a "LECS"-like configuration server [LANE]). No particular
topology has to be built, no configuration algorithm is needed, and no
topology has to be rebuilt when a server fails.

Finally, if we only consider a topology of servers connected by
point-to-point connections, MSP acts like a traditional synchronization
protocol, as explained in the Unicast MSP chapter.

   S1   S2   S3               S1     S2
    |    |    |                 \    /\
    |    |    |                  \  /  \
   +-----------------+            S3    S4
   |                 |           /
   | Multicast Group |---S4     /
   |                 |        S5-----S6
   +-----------------+       /
    |    |    |             /
    |    |    |           S7
   S5   S6   S7

        Multicast topology vs. traditional topology

3. Server Group

MSP allows the synchronization of a replicated database between servers
which are all members of the same server group (SG). It ensures that,
within a short duration, all servers in the SG will have exactly the
same copy of the database. The scope of a server group could be
restricted to servers which are all connected to the same LIS
[Classical], LAG [NHRP], cluster [MARS], ELAN [LANE] or IASG [MPOA].

Each server group (SG) is identified by a server group ID (SGID). This
allows multiple server groups to be supported at the same time in the
same domain. This SGID only needs to be unique in this domain and
could, for instance, consist of the multicast address used to identify
the servers.

Moreover, each server in a SG is uniquely identified by a server ID
(SID). This ID could be the internetwork layer address of the server
itself, e.g. its IP address. Each pdu transmitted by a server is tagged
with its server group ID (SGID) and server ID (SID). Using a complete
internetwork layer address as SID makes it possible to quickly identify
a server and facilitates management. Both the SGID and the SID are
represented on 32 bits.
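The pdu encoding itself is left for further study (see Appendix 1). As
a purely illustrative, non-normative sketch, the SGID/SID tagging could
look as follows in C; only the 32-bit field widths are taken from the
text above:

   #include <stdint.h>

   struct msp_tag {
       uint32_t sgid;        /* server group ID, unique per domain */
       uint32_t sid;         /* server ID, e.g. the IP address     */
   };

   /* A receiver silently drops pdus of other server groups. */
   static int pdu_is_for_us(const struct msp_tag *t, uint32_t my_sgid)
   {
       return t->sgid == my_sgid;
   }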
4. Underlying Layer

MSP is designed to take advantage of the multicast capabilities
provided by the underlying layer. This version mainly focuses on the
use of unidirectional point-to-multipoint connections and of a full
multicast service such as the one supported by an Ethernet network. MSP
may be supported over the IP layer itself or over any NBMA network such
as an SMDS or ATM (UNI 3.0/3.1 or later) network.

It should be noted that, if no multicast capability at all is supported
by the underlying layer, MSP mainly acts like a traditional
point-to-point synchronization protocol (see the Unicast MSP chapter).

5. Topology

The topology is reduced to the scope of a single multicast group. No
particular algorithm or protocol is required to establish and maintain
this topology. The address of the multicast group or the list of all
members may be directly obtained through a configuration server such as
the LECS [LANE].

All servers are directly connected by a multicast group or a mesh of
point-to-multipoint connections. In the worst case, n
point-to-multipoint connections are needed when n servers have to be
synchronized. In the near future, when multipoint-to-multipoint
connections become available, the need to support n connections will
disappear.

It is important to note that these point-to-multipoint connections are
only supported by servers (never by clients) and that these servers
could be efficiently implemented over internetworking units, such as
ATM switches. In addition, the behaviour of these servers is very
static, as they do not appear and disappear continuously in a server
group. This considerably simplifies the management of the connections.

A multicast topology is very robust, as every message is directly
received by all servers in the SG and as there is no single point of
failure or forwarding point at the server level. The complexity of
establishing the topology and of dealing with dynamic topology
partitioning is left to the underlying layer, at the level of routers
or switches.

The resources required by servers for message generation are reduced,
since a message is sent once and is never repeated from server to
server. The forwarding of messages is better achieved by transport
protocols than by synchronization protocols. As a result, the time
needed by an update to reach all servers is very short and not directly
proportional to the number of servers.

More resources are required to receive messages, but this process may
be optimized by applying appropriate filtering to incoming messages and
database lookups, e.g. using dedicated hardware such as a CAM (content
addressable memory). The same kind of problem is solved in Ethernet
networks. Filtering an incoming UPDATE pdu received from a given server
is easily done by comparing the recorded last received timestamp from
that server with the smallest (first) or largest (last) timestamp of
that pdu. If the first timestamp in the pdu is greater than the
recorded timestamp, the complete pdu updates the local database. If the
last timestamp in the pdu is smaller than the recorded timestamp, the
complete pdu may be dropped (except if it is in response to a NACK).
Otherwise, all entries between the recorded timestamp and the end of
the UPDATE pdu update the local database. A sketch of this filtering
rule follows.
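A minimal, non-normative C sketch of the filtering rule (the entry
layout and the 'apply' callback are assumptions of the sketch):

   #include <stddef.h>

   struct timestamp { unsigned long time, seq; };

   static int ts_cmp(struct timestamp a, struct timestamp b)
   {
       if (a.time != b.time) return a.time < b.time ? -1 : 1;
       if (a.seq  != b.seq)  return a.seq  < b.seq  ? -1 : 1;
       return 0;
   }

   struct entry { struct timestamp ts; /* opaque payload not shown */ };

   /* Filter an incoming UPDATE pdu of n entries, sorted by timestamp,
    * against the recorded last received timestamp from its sender.
    * 'apply' installs one entry in the local database.  Pdus answering
    * a NACK are assumed to be handled before this test. */
   static void filter_update(const struct entry *e, size_t n,
                             struct timestamp recorded,
                             void (*apply)(const struct entry *))
   {
       size_t i;

       if (n == 0)
           return;
       if (ts_cmp(e[0].ts, recorded) > 0) {
           for (i = 0; i < n; i++)       /* first > recorded:         */
               apply(&e[i]);             /* the whole pdu is new      */
       } else if (ts_cmp(e[n - 1].ts, recorded) <= 0) {
           ;                             /* nothing newer: drop pdu   */
       } else {
           for (i = 0; i < n; i++)       /* apply only the entries    */
               if (ts_cmp(e[i].ts, recorded) > 0)  /* after recorded  */
                   apply(&e[i]);
       }
   }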
6. Database and Timestamp

MSP is defined independently of the particular database (cache) to
synchronize. Database entries are transparently transported; they are
defined by the particular protocols whose databases are synchronized.
The encoding of these entries will be defined in specific appendixes of
this document or in companion documents. The database entries describe,
for instance, bindings between IP addresses and ATM addresses. Unlike
other synchronization protocols, MSP does not use summaries of database
entries. MSP identifies events and ensures that the most recent event
for each entry has been received by all servers.

Each server is the owner of the part of the database which is generated
by its own clients. It maintains a copy of the complete database but is
only responsible for the entries it owns and can only transmit these
entries. For instance, each entry in a database may be tagged as either
locally owned or learned from another server.

Each event (client action) in the database is identified by a timestamp
which is unique in the context of the server that generates the event.
This timestamp is the concatenation of a standard time (e.g. GMT) and a
contiguous sequence number. This time may be local to each server; a
global time is not needed by this protocol, i.e. clocks do not need to
be synchronized. The sequence number is incremented by one at each new
event. It also allows more than one event per tick of the standard
time.

Each entry in the database is associated with a timestamp set by the
owner server at the moment when the client generates or updates the
entry. A server cannot change the timestamps of entries it does not
own. Timestamps are used by servers to identify missing information in
their database.

If a server receives an entry which is already included in its
database, it must compare the two timestamps and keep the entry with
the most recent (greater) timestamp.

A timestamp X is greater (more recent) than a timestamp Y if the
standard time of X is greater than the standard time of Y or, if they
are equal, if the sequence number of X is greater than the sequence
number of Y.

When a server starts for the first time or loses the knowledge of its
current timestamp (after a crash, for instance), it only has to
generate a new value for the standard time and to restart the sequence
numbering at zero. The same scheme is applied when the sequence number
wraps around. The standard time is never modified when a timestamp
sequence number is incremented; it is only used to ensure a unique
value during the lifetime of a server.
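These timestamp rules can be summarized by the following non-normative
C sketch (the types and the use of ULONG_MAX as wrap limit are
assumptions of the sketch):

   #include <limits.h>

   struct timestamp { unsigned long time, seq; };

   /* X is more recent than Y if its standard time is greater, or if
    * the times are equal and its sequence number is greater. */
   static int ts_more_recent(struct timestamp x, struct timestamp y)
   {
       return x.time > y.time || (x.time == y.time && x.seq > y.seq);
   }

   /* Timestamp for the next locally generated event.  'now' is any
    * local clock; clocks need not be synchronized between servers.
    * The standard time is never touched by a simple increment; a
    * fresh one is taken only when the sequence number wraps. */
   static struct timestamp ts_next(struct timestamp cur,
                                   unsigned long now)
   {
       if (cur.seq == ULONG_MAX) {      /* wrap: restart at (now, 0) */
           cur.time = now;
           cur.seq  = 0;
       } else {
           cur.seq++;
       }
       return cur;
   }

   /* After a first start or a crash, simply take (now, 0). */
   static struct timestamp ts_restart(unsigned long now)
   {
       struct timestamp t;
       t.time = now;
       t.seq  = 0;
       return t;
   }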
7. Transactions

Database entries are transmitted in variable-length UPDATE pdus. The
size of an UPDATE pdu may be limited in order to fit in the particular
MTU supported by the underlying layer. UPDATE pdus related to events
close in time are logically grouped into transactions. A transaction is
delimited by a start flag in its first UPDATE pdu and by a stop flag in
its last UPDATE pdu. The shortest transaction contains a single UPDATE
pdu with both the start and stop flags set.

   +------------+-------------------+
   | Pdu header | Entry 1...        |
   | (#seq,...) |        ...Entry n |
   +------------+-------------------+

              UPDATE pdu

Transactions are generated in response to a synchronization request
(SYNC received), in response to a NACK, or when a set of entries has to
be flooded. An empty transaction is sent to indicate that the requested
timestamps are not available any more (obsolete), following a
synchronization request or a NACK. Such a transaction is made of one
empty UPDATE pdu identifying the requested timestamps.

Transactions are not numbered, but all UPDATE pdus are numbered
sequentially across all transactions generated by the same server. All
entries in a transaction and in its subsequent UPDATE pdus are sorted
in timestamp order. No sort algorithm is needed at all; only a sorted
list per server has to be maintained (trivial, since timestamps are
generated in order). The numbering of UPDATE pdus is contiguous, so
that a server can immediately detect missing pdus. UPDATE pdu sequence
numbers should not be confused with timestamp sequence numbers. Pdu
sequence numbers are needed because timestamp sequence numbers are not
always contiguous (if a server [re-]starts or when entries are
obsoleted).

   +-----
   |
   |  +------------+-------------------+
   |  | Pdu header | Entry 1...        |  UPDATE
   |  | (#seq,...) |        ...Entry n |  pdu
   |  +------------+-------------------+
   |                .
   |                .
   |                .
   |  +------------+-------------------+
   |  | Pdu header | Entry 1...        |  UPDATE
   |  | (#seq,...) |        ...Entry n |  pdu
   |  +------------+-------------------+
   +-----

              A transaction

Transactions are useful to detect faster the loss of the last UPDATE
pdus sent by a server. This loss may be detected by a timer waiting for
the end of the current transaction, when the first UPDATE pdu of the
next transaction is received, or when the next HELLO pdu is received
from the server generating that transaction. These HELLO pdus are
generated regularly by each server to indicate its last timestamp
value.

A transaction starts either immediately, when its first UPDATE pdu is
built, or after a small random delay in order to avoid multicast
storms. During this delay, new updates may be generated by clients and
added to the transaction.

If a gap in the numbering of UPDATE pdus is detected, either a part of
a transaction has been lost or a complete transaction or set of
transactions has been lost. A NACK pdu is sent, indicating the last
received timestamp before the gap and the first received timestamp just
after the gap. The last received timestamp before the gap was received
in an UPDATE pdu of the same transaction or of a previous transaction.

If, after a timeout (NACK interval), the requested entries have not
been retransmitted (in a new transaction), a NACK pdu is retransmitted
for the same timestamps. After a maximum number of NACK
retransmissions, the corresponding server is considered as no longer
available. All its updates are physically kept in the database, but the
server memorizes the fact that the last valid received timestamp was
the last one received before the gap. When the previously unreachable
server becomes available again, it sends a HELLO, a SYNC or a
transaction, and the current server discovers that a set of timestamps
is missing. A NACK pdu consists of a list of server IDs with a list of
timestamp intervals for each of these IDs. In the simplest case, a NACK
pdu is made of a single ID with a single interval. A specific timestamp
value is used to indicate an open interval such as [x, [ (i.e. from
value x to infinity). The gap detection is sketched below.
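As a non-normative illustration, the following C fragment sketches the
gap detection on incoming UPDATE pdus; the 'after_sync' flag
anticipates the rule, given below, that the first pdu received after a
SYNC is accepted without checking for a gap. All names are assumptions
of the sketch:

   struct timestamp { unsigned long time, seq; };

   /* Per-peer receive state. */
   struct peer {
       unsigned long    last_pdu_seq;  /* last UPDATE pdu number seen */
       struct timestamp last_ts;       /* last timestamp received     */
       int              after_sync;    /* next number accepted as-is  */
   };

   /* Check the sequence number of an incoming UPDATE pdu.  Returns
    * nonzero when a gap is revealed, filling the interval to NACK:
    * from the last timestamp received before the gap to the first
    * timestamp carried just after the gap.  The caller updates
    * 'last_ts' as entries are accepted. */
   static int check_pdu_seq(struct peer *p, unsigned long pdu_seq,
                            struct timestamp first_ts,
                            struct timestamp *nack_from,
                            struct timestamp *nack_to)
   {
       int gap = !p->after_sync && pdu_seq != p->last_pdu_seq + 1;

       if (gap) {
           *nack_from = p->last_ts;    /* last ts before the gap      */
           *nack_to   = first_ts;      /* first ts just after the gap */
       }
       p->after_sync   = 0;
       p->last_pdu_seq = pdu_seq;
       return gap;
   }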
If a server crashes and loses its current UPDATE pdu sequence number,
it restarts the numbering at zero. In that case no problem occurs, as
it resynchronizes its database as described in the Synchronization
chapter. In order to be able to detect a gap in UPDATE streams, a
server keeps the last pdu sequence number received from each server.
After having received a SYNC pdu, a server must transparently set that
number to the value contained in the next UPDATE pdu received from the
corresponding server, without checking for a gap.

MSP may be implemented over a service which does not prevent pdu
misordering. In that case, a server should wait for a small timer
before deciding that a pdu is lost, in order to have a chance to
re-order the pdus. When that timer expires, a gap is detected and a
NACK is sent.

8. Synchronization

The synchronization of information is required in two cases: when an
update has to be flooded and when a server has to build or rebuild its
database. The first case is detailed in the Update Flooding chapter.
The second case is the synchronization process itself. It occurs, for
instance, when a server joins the SG for the first time without any
knowledge of the existing bindings, or when a server joins the SG while
already having a part of the bindings (e.g. when a broken underlying
connection is rebuilt).

A server having to synchronize with the rest of the group first builds
a list of all server IDs included in its database (possibly empty). For
each entry in the list, it adds the corresponding highest known
timestamp. This list is inserted in a SYNC pdu, together with the total
number of elements, the synchronizing server ID and its own highest
timestamp. This pdu is multicast to the group.

Each server receiving a SYNC pdu scans the list for its own ID. If it
finds its ID, it builds a transaction containing all entries which are
locally owned and whose timestamps are greater than the required one.
If the required timestamp is greater than or equal to its own highest
timestamp, no entries have to be sent and an empty transaction is
built, signalling that the synchronizing server is up to date. Finally,
the transaction is multicast to the group.

If a server does not find its ID in the list included in a SYNC pdu,
either it has joined the group while the synchronizing server was
unreachable, or it does not own any entry in the global database. In
the first case, it has to build and send a transaction for all entries
it owns. In the second case, it does not have to send any transaction.

Finally, each server receiving a SYNC pdu checks the synchronizing
server timestamp. If the local timestamp associated with the
synchronizing server is lower than the received one, the server has
missed a part of the synchronizing server's own database and sends a
NACK. The decision process of a server receiving a SYNC pdu is sketched
below.
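A minimal, non-normative C sketch of that decision process (all names
are assumptions of the sketch):

   #include <stddef.h>

   struct timestamp { unsigned long time, seq; };
   struct sync_elem { unsigned long sid; struct timestamp highest; };

   static int ts_more_recent(struct timestamp x, struct timestamp y)
   {
       return x.time > y.time || (x.time == y.time && x.seq > y.seq);
   }

   enum sync_answer { SEND_NOTHING, SEND_EMPTY, SEND_NEWER, SEND_ALL };

   /* Decide how to answer a received SYNC pdu.  'list'/'n' is the
    * (SID, highest timestamp) list carried by the SYNC, 'my_sid' our
    * own ID, 'my_highest' our own highest timestamp, 'own_entries'
    * tells whether we own any entry at all.  On SEND_NEWER, '*from'
    * is the timestamp above which our own entries must be sent. */
   static enum sync_answer answer_sync(const struct sync_elem *list,
                                       size_t n, unsigned long my_sid,
                                       struct timestamp my_highest,
                                       int own_entries,
                                       struct timestamp *from)
   {
       size_t i;

       for (i = 0; i < n; i++) {
           if (list[i].sid != my_sid)
               continue;
           if (!ts_more_recent(my_highest, list[i].highest))
               return SEND_EMPTY;       /* requester is up to date   */
           *from = list[i].highest;
           return SEND_NEWER;           /* send own entries > *from  */
       }
       /* Our ID is unknown to the requester: send everything we own,
        * or nothing if we own no entry at all. */
       return own_entries ? SEND_ALL : SEND_NOTHING;
   }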
The SYNC pdu is a kind of summary of the database known by the
synchronizing server. Its length is bounded by the total number of
servers in the SG. If a SYNC pdu does not reach a subset of the
servers, the synchronizing server will not receive any transaction in
response from these servers and will retransmit, after a timeout, a
SYNC pdu for these servers only.

The multicasting of each transaction improves the robustness of the
protocol and also allows other servers to learn entries before starting
their own synchronization (if still needed after this silent
listening). Transactions in response to a SYNC may be multicast on a
separate multicast address on which only the servers which have to
synchronize are listening. This last solution reduces the traffic which
is globally multicast and also allows each server to decide
independently whether or not it wants to receive the synchronization
traffic.

If all servers or a large number of servers have lost their
connectivity at the same time, the multicast scheme is very efficient.
If real multicasting is not supported and if the transmission on
point-to-multipoint connections is not desired, it is possible to use
on-demand point-to-point connections.

9. Update Flooding

Each time a client modifies its entry in a server, a new update is
generated and has to be flooded to all servers in the SG. The owner
server associates a timestamp with the update and builds a transaction
to flood that update. This update is included in an UPDATE pdu. The
server may send a transaction for that single update, or it may group a
number of updates together in one or more UPDATE pdus in the same
transaction. Transactions may be sent immediately or after a small
random delay (see the Multicast Storm chapter).

10. Hello Pdu

Small HELLO pdus are sent periodically, at each hello interval. Each
pdu includes the server ID and the last used timestamp of the sender.
This timestamp allows the detection of the loss of a part of the
updates sent by the server which has generated the HELLO pdu. In
particular, it allows the detection of the loss of a complete
transaction or of a set of complete transactions.

When a server receives a given number of HELLO pdus indicating that it
has missed a few updates (its timestamps for the sending servers are
out of date), it may decide to resynchronize its database and generate
a SYNC. Otherwise, it may send a NACK pdu for each of these servers.
This check is sketched below.
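As a non-normative illustration (the threshold of three HELLOs is an
assumption of the sketch; the protocol only says "a given number"):

   struct timestamp { unsigned long time, seq; };

   static int ts_more_recent(struct timestamp x, struct timestamp y)
   {
       return x.time > y.time || (x.time == y.time && x.seq > y.seq);
   }

   #define STALE_HELLO_LIMIT 3     /* assumed threshold */

   struct peer { struct timestamp last_ts; int stale_hellos; };

   /* Process the last used timestamp carried by a HELLO pdu: if it is
    * more recent than what was recorded, some updates from that
    * server were missed.  After a few such HELLOs, resynchronize (a
    * NACK per out-of-date server would be the alternative). */
   static int hello_requires_resync(struct peer *p,
                                    struct timestamp hello_ts)
   {
       if (ts_more_recent(hello_ts, p->last_ts))
           return ++p->stale_hellos >= STALE_HELLO_LIMIT;
       p->stale_hellos = 0;        /* up to date with that server */
       return 0;
   }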
11. Multicast Storm

In order to avoid a multicast storm of NACKs when some UPDATE pdus are
lost, a storm of SYNCs when many servers have to synchronize at the
same time, or a storm of transactions, a technique similar to IGMP may
be used. Before sending a NACK, a SYNC or the beginning of a
transaction, a server may wait for a small random delay between 0 and D
milliseconds. During this delay, the server listens to MSP pdus,
receives all transactions and updates its database. This silent
listening may result in decreased traffic and cancel some local
operations, as explained hereafter.

If a server wants to send a NACK and, during the random delay, a NACK
is seen for the same set of timestamps or a subset of them, the server
waits for the responding transactions. If, after that delay, all of its
requested timestamps have been received, the generation of the NACK is
cancelled. The previous explanations on NACK retransmission are also
applicable here.

If a server wants to send a SYNC and, during the random delay, other
compatible SYNCs have been seen, it waits for the corresponding
transactions and then decides whether its SYNC is still needed.

If a server wants to send a transaction in response to a SYNC, it may
also wait for a random delay in order to limit the number of
simultaneous transactions transmitted and/or received, and thereby
decrease the amount of resources needed.

12. Unicast MSP

MSP is a multicast or point-to-multipoint protocol but may also be used
in a unicast or point-to-point environment. In that case it acts like a
traditional synchronization protocol, except mainly that UPDATE pdus do
not need to be acknowledged one by one and that, after a failure, no
complete database summary has to be exchanged both ways each time. In
this last case, only two small SYNC pdus are exchanged, and each server
acts as a proxy for the information owned by the other servers behind
it. Random delays are no longer needed, since there is only one sender
per direction.

Of course, as we are back in the point-to-point case, an algorithm is
again needed to establish and maintain the topology. Each server
automatically knows the servers for which it must act as a proxy by
listening to the HELLO pdus and learning the position of each server. A
proxy server generates or forwards HELLO pdus for the servers it
represents.

A server wishing to synchronize sends a SYNC pdu to each server it is
connected to. These servers respond with their own updates and those of
the servers they represent. The SYNC scheme is the same as the one used
in the multicast case, except for the random delay technique.

A server knows which servers it represents by keeping track of the
connections on which HELLO pdus are received. It represents all the
servers that send HELLO pdus on all its connections other than the one
on which it has received the SYNC pdu.

When a transaction is received from a neighbour server, the receiving
server must directly repeat this update on all its connections, except
the one on which it was received (see the sketch at the end of this
chapter).

A server must also respond to each received NACK destined to itself or
to one of the servers for which it acts as a proxy. In addition, it
does not need to wait for a random delay when it generates a NACK.
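A minimal, non-normative C sketch of the unicast forwarding rule
(connection handles and the 'send' callback are assumptions of the
sketch):

   #include <stddef.h>

   struct conn;                    /* opaque connection handle */

   /* Repeat a transaction pdu received on connection 'received_on'
    * over every other connection. */
   static void forward_transaction(struct conn **conns, size_t n,
                                   size_t received_on,
                                   const void *pdu, size_t len,
                                   void (*send)(struct conn *,
                                                const void *, size_t))
   {
       size_t i;

       for (i = 0; i < n; i++)
           if (i != received_on)
               send(conns[i], pdu, len);
   }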
13. Further Study (not in this version)

In order to reduce the number of pdu formats, a SYNC pdu could be
implemented as a NACK pdu. In that case, a flag is used to indicate
that the NACK is a synchronization. Only the last received timestamp
(open interval) is given for each known server. A server which does not
see its ID in that list must retransmit all its own entries.

Transactions could be suppressed if the HELLO pdu rate is high enough
to quickly detect the loss of the last pdu transmitted by a given
server. In the current version, the reception of the last pdu of a
transaction indicates that a bundle of pdus has been transmitted and
that no further pdu must be waited for before the beginning of the next
transaction.

14. Security

When MSP is embedded in another protocol, security considerations are
mainly covered by that specific protocol. A detailed security analysis
of this protocol is for further study.

Conclusion

MSP is a generic multicast synchronization protocol which may also act
as a traditional unicast protocol. It reduces the traffic by
identifying events in the database instead of using database summaries,
and by supporting negative acknowledgments (NACKs) instead of
systematic ACKs. It is particularly suitable in environments with a low
error rate, such as ATM. It reduces the convergence time and improves
the robustness by using a multicast topology where updates are directly
received by all servers. No single point of failure exists. No
configuration algorithm or protocol is needed, and no specific problem
occurs if the topology partitions. It takes advantage of the fact that
the forwarding of information and the dynamic routing are better
achieved by well-known dedicated protocols, such as internetworking and
routing protocols, which are implemented anyway to support the normal
data transfer service.

References

[MARS]      "Support for Multicast over UNI 3.0/3.1 based ATM
            Networks", Armitage, draft-ietf-ipatm-ipmc-12.txt.

[NHRP]      "NBMA Next Hop Resolution Protocol (NHRP)", Luciani, Katz,
            Piscitello, Cole, draft-ietf-rolc-nhrp-09.txt.

[MPOA]      "Baseline Text for MPOA", draft, C. Brown, ATM Forum
            95-0824R6, February 1996.

[Classical] "Classical IP and ARP over ATM", Laubach, RFC 1577.

[SCSP]      "Server Cache Synchronization Protocol (SCSP) - NBMA",
            J. Luciani et al., draft-luciani-rolc-scsp-03.txt.

[Epidemic]  "Epidemic Algorithms for Replicated Database Maintenance",
            Demers et al., Xerox PARC.

[LANE]      "LAN Emulation over ATM Version 1.0", ATM Forum
            af-lane-0021.000, January 1995.

[LNNI]      "LAN Emulation over ATM Version 2 - LNNI Specification",
            Draft 3, ATM Forum 95-1082R3, April 1996.

[IGMP]      "Host Extensions for IP Multicasting", S. Deering, STD 5,
            RFC 1112, Stanford University, February 1989.

[OSPF]      "OSPF Version 2", Moy, RFC 1583.

[PNNI]      "PNNI Specification Version 1", Dykeman, Goguen, ATM Forum
            af-pnni-055.000, March 1996.

Acknowledgments

Thanks to all who have contributed, with particular thanks to Andy
Malis from Nexen and Ramin Najmabadi Kia from ULB.

Authors' Addresses

Eric Mannie
Brussels University (ULB)
Service Telematique et Communication
CP 230, bld du Triomphe
1050 Brussels, Belgium
phone: +32-2-650.57.17
fax:   +32-2-629.38.16
email: mannie@helios.iihe.ac.be

Marc De Preter
Brussels University (ULB)
Service Telematique et Communication
CP 230, bld du Triomphe
1050 Brussels, Belgium
phone: +32-2-650.57.17
fax:   +32-2-629.38.16
email: depreter@helios.iihe.ac.be

Appendix 1 - PDU Format

For further study.