<?xml version="1.0" ?>
	<!DOCTYPE rfc SYSTEM 'rfc2629.dtd' [
	<!ENTITY rfc2629 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml'>
	]>
<rfc
	category="exp"
	docName="draft-bestler-transactional-subset-multicast-00"
	ipr="trust200902"
>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>

	<front>
		<title abbrev='Transactional Subset Multicast Groups'>
			Creation of Transactional Subset Multicast Groups 
		</title>
		<author initials="C" surname="Bestler" fullname="Caitlin Bestler" role="editor">
			<organization abbrev='Nexenta'>Nexenta Systems</organization>	
			<address>
			<postal>
			<street>455 El Camino Real</street>
			<city>Santa Clara</city>
			<region>CA</region>
			<country>US</country>
			</postal>
			<email>caitlin.bestler@nexenta.com, cait@asomi.com</email>
			</address>
		</author>
		<author initials="R" surname="Novak" fullname="Robert Novak">
			<organization abbrev='Nexenta'>Nexenta Systems</organization>	
			<address>
			<postal>
			<street>455 El Camino Real</street>
			<city>Santa Clara</city>
			<region>CA</region>
			<country>US</country>
			</postal>
			<email>robert.novak@nexenta.com</email>
			</address>
		</author>
		<date month="September" year="2014" />
		<area>Transport</area>
		<workgroup>TSVWG</workgroup>
		<keyword>RFC</keyword>
		<keyword>Request for Comment</keyword>
		<keyword>I-D</keyword>
		<keyword>Multicast Group</keyword>
		<abstract>
		<t>
			This memo presents techniques for controlling the membership of
			multicast groups that are constrained to be subsets of a pre-existing
			reference multicast group. Such subset groups are used only for
			short-duration transactions that are multicast to a subset of the
			larger group.
		</t>				
		</abstract>
		<note title="Editor's Note">
		<t>
			The proper working group for this draft has not yet been determined.
			Alternate working groups include PIM and INT.
		</t>
		<t>
			Nexenta has been developing a multicast-based transport/storage
			protocol for Object Clusters. This protocol applies multicast
			datagrams to the creation and replication of objects such as those
			supported by the Amazon Simple Storage Service ("S3") protocol or the
			OpenStack Object Storage service ("Swift"). Creating replicas of
			object payload on multiple servers is an inherent part of any storage
			cluster, which makes multicast addressing very inviting. There are
			issues of congestion control and reliability to settle, but new
			Layer 2 capabilities such as DCB (Data Center Bridging) make this
			feasible.
		</t>
		<t>
			However, we found that the existing protocols for controlling
			multicast group membership (IGMP and MLD) are not suitable for our
			storage application. The authors doubt this problem is unique to a
			single application; it should apply to any cluster that needs to
			distribute transactional messages to dynamically selected subsets of
			known recipients within a larger group.
		</t>
		<t>
			Computational clusters using MPI are also potential users of
			transactional multicasting. Inter-server replication in a pNFS
			cluster is another.
		</t>
		<t>
			These are just examples of synchronizing cluster data where the
			synchronization does not replicate all of the shared data across the
			entire cluster. But these are merely initial hunches; working group
			feedback is expected to refine the characterization of the
			applicability of transactional subset multicast groups.
		</t>
		<t>
			This submission, and the ensuing discussion of this draft and its
			successors, will make reference to specific applications, including
			the Nexenta Replicast protocol for multicast replication in Nexenta's
			Cloud Copy-on-Write (CCOW) Object Cluster used in the NexentaEdge
			product. Such examples are merely for illustrative purposes. Any IETF
			standardization of the Replicast storage protocols would be done via
			the Storm or NFS working groups, and would require adoption of a
			definition of Object Storage as a service before standardizing any
			specific protocol for providing Object Storage services.
		</t>
		<t>
			At this stage in drafting, message formats have not yet been set for the
			standardized version of the protocol. The pre-standard version was 
			limited to a single L2 physical network, which would be an inappropriate
			limitation for an IETF standard. Working Group feedback on the format
			of these messages will be sought during the consensus building process.
		</t>
		</note>
	</front>
	<middle>
	<section anchor='intro' title='Introduction'>
		<t>
			Existing standards for controlling the membership of multicast groups
			can be characterized as being Join-driven. These include
			<xref target="RFC3376" />, <xref target="RFC3810" />,
			<xref target="RFC4541" /> and <xref target="RFC4604" />.
			Due to their inherent latency, these techniques prove to be unsuitable
			for maintaining large sets of related multicast groups. This memo
			details a new method of maintaining such large sets of related
			multicast groups when they are all subsets of a single master
			reference group. This is not a restriction for most cluster-oriented
			applications that could use transactional multicasting.
		</t>
		<t>
			Transactional Subset Multicasting defines techniques that extend
			existing control of a reference multicast group to a potentially
			large set of multicast addresses used within a VLAN in each local
			subnet that the reference multicast group reaches.
		</t>
		<t>
			This specification makes no modifications to the forwarding of
			multicast packets nor to the communications between mrouters.
			New methods are defined to set Layer 2 multicast forwarding rules
			on switches within each of the relevant Layer 2 subnets.
		</t>
		<section title='Requirements Notation'>
		<t>
		   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   			"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   			document are to be interpreted as described in RFC 2119
   			<xref target="RFC2119" />.
		</t>
		</section>
	</section>
	<section title="Motivation">
		<t>
			Transactional Subset Multicast groups are maintained within each VLAN.
			A 'Forwarding Control Agent' is defined within each VLAN that is
			responsible for applying the forwarding information known for a
			reference multicast group to efficiently set layer 2 multicast
			forwarding rules within each local network.
		</t>
		<t>
			The functionality of the Forwarding Control Agent is best understood
			as extending the functionality of IGMP/MLD Snooping (See
			<xref target="RFC4541" />).
		</t>
		<t>
			An IGMP/MLD snooper interprets IGMP (see <xref target="RFC3376" />) or
			MLD (see <xref target="RFC3810" />) messages to translate their Layer 3
			objectives into Layer 2 multicast forwarding rules.
		</t>
		<t>
			A Forwarding Control Agent translates new messages, defined in this
			specification for a newly defined class of transactional subset
			multicast groups, into the same kind of Layer 2 multicast forwarding
			rules. Strategies for implementing Forwarding Control Agents include
			extending IGMP/MLD snooping implementations or building the
			Forwarding Control Agent external to the existing L2 switch software.
		</t>
		<t>
			The per-transaction costs of using such groups are far lower than with
			the existing methods. The ongoing maintenance work for multicast
			forwarding elements is limited to the reference multicast group; it is
			not replicated for each of the subset transactional multicast groups.
		</t>
	</section>
	<section title="An Example Application">
		<t>
			The Replicast (see <xref target="Replicast" />) usage of transactional
			subset multicasting involves:
			<list style="symbols">
				<t>
					Taking a Cryptographic Hash of each chunk to be stored.
					This "hash id" is used with a distributed hash table to determine
					a conventional multicast group which will be used to negotiate
					placement of the chunk. This is the reference multicast group.
					Replicast refers to it as a "Negotiating Group".
				</t>
				<t>
					Multicasting a request to put the chunk to the reference multicast
					group. Receiving storage nodes will respond with a bid on when they
					could store that chunk, or an indication that they already have
					that chunk stored. Each of the storage nodes is offering a
					provisional reservation of its input capacity for a specific
					time window.
				</t>
				<t>
					Assuming that the chunk is not already stored, selecting the
					best responses to make a transactional subset group. Determination
					of 'best' typically is driven by the earliest possible completion
					of the transaction, but may factor the current available storage
					capacity on each of the storage nodes as well.
				</t>
				<t>
					Forming or selecting a "rendezvous group" which will be used to transfer
					the chunk. When the core network is non-blocking, the transfer
					will be able to proceed at close to full wire speed at the
					reserved time because each of the selected storage nodes has
					reserved its input capacity for bulk payload exclusively. A
					multicast message to the reference group informs both those
					selected and those not selected for the rendezvous transfer.
					Those not selected will release the provisional reservation.
				</t>
				<t>
					At the designated time, multicast the chunk payload to the
					transactional subset multicast group.
				</t>
				<t>
					Each recipient validates the cryptographic hash of the received
					data, and unicasts a positive or negative acknowledgement to the
					sender.
				</t>
				<t>
					If sufficient valid copies have been positively acknowledged,
					the transaction is complete. Otherwise it is retried.
				</t>
			</list>
		</t>
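		<t>
			The first step above, deriving the Negotiating Group for a chunk from
			its hash id, can be sketched as follows. This sketch is purely
			illustrative and is not part of the Replicast specification; the hash
			function, group count and address block are assumed configuration
			values.
		</t>
		<figure><artwork><![CDATA[
```python
# Illustrative sketch: map a chunk's cryptographic hash to its
# Negotiating Group via a distributed hash table row. The hash
# function, group count and address block are assumptions.
import hashlib

NUM_NEGOTIATING_GROUPS = 64      # assumed cluster configuration
GROUP_BASE = "239.1.1."          # assumed multicast address block

def negotiating_group(chunk: bytes) -> str:
    """Hash the chunk, then select the group that will negotiate
    placement of this chunk (its reference multicast group)."""
    hash_id = hashlib.sha256(chunk).digest()
    row = int.from_bytes(hash_id[:8], "big") % NUM_NEGOTIATING_GROUPS
    return GROUP_BASE + str(row)
```
]]></artwork></figure>
		<t>
			Because every node computes the same mapping, the put request can be
			multicast to the Negotiating Group without any prior signalling.
		</t>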
	</section>
	<section title="Generalized Usage of Transactional Subset Multicast Groups">
		<t>
			Beyond a specific application, the generalized potential for dramatic
			savings is that transactional messaging within a cluster is a radically
			different use-case from traditional multicast.
			The set of factors that differentiates this class of applications can be
			examined through a series of questions:
		</t>
		<t>
			<list style="symbols">
				<t>How is the group Selected? <xref target='GroupSelect' /></t>
				<t>What are the endpoints that receive the messages?
					<xref target='WhoReceives' /></t>
				<t>What is the duration of the group? <xref target='Duration' /></t>
				<t>Who are the potential members of the group?
					<xref target='Members' /></t>
				<t>How much latency does the application tolerate?
					<xref target='Latency' /></t>
				<t>What must be done to maintain the group?
					<xref target='Maintenance' /></t>
			</list>
		</t>
	</section>

<section title='Transactional Subset Multicast Groups'>
<section title='Definition'>
<t>
	A Transactional Subset Multicast Group is a multicast group which:
	<list style='symbols'>
		<t>
			Is derived from a pre-existing multicast group created by means
			independent of this standard. The membership of this derived group
			is a subset of the reference existing multicast group.
		</t>
		<t>
			Has a multicast group address which is part of a block allocated for
			transactional multicast groups.
    	</t>
		<t>
			Will only be used for the duration of a transaction. A network failure
			or re-configuration during the transaction will require an upper layer
			retry of the transaction. Transactional Subset Multicast groups are not
			suitable for streaming of content. Transactional subset multicast groups
			may be persistent, in that the same group continues to exist and be used
			for a series of transactions. But each message sent to the group is
			part of a single short duration transaction.
		</t>
	</list>
</t>
<section title='Dynamic Specification versus Dynamic Selection'>
<t>
	There are two basic strategies for managing the membership of subset
	multicast groups:
	<list style="symbols">
	<t>
		Dynamic Specification: The selected members join a group
		that had been pre-selected for the transaction.
	</t>
	<t>
		Dynamic Selection: A pre-existing group is selected to
		match the subset desired. That group is allocated for this
		purpose and used for the transaction.
	</t>
	</list>
</t>
<t>
	These two strategies can also be combined to form a hybrid strategy. If there is
	a pre-existing group for the desired membership list it is allocated and used,
	otherwise an available group is allocated and re-configured to have the required
	membership.
</t>
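<t>
	The hybrid strategy can be sketched as follows. This is an illustrative
	allocator, not a required implementation; all names are hypothetical.
</t>
<figure><artwork><![CDATA[
```python
# Illustrative sketch of the hybrid strategy: reuse a pre-existing
# group whose membership already matches (Dynamic Selection),
# otherwise allocate a free address and reconfigure it (Dynamic
# Specification). All names are hypothetical.
class GroupAllocator:
    def __init__(self, addresses):
        self.free = list(addresses)     # unused transactional addresses
        self.by_members = {}            # frozenset(members) -> address

    def acquire(self, members):
        key = frozenset(members)
        if key in self.by_members:      # Dynamic Selection: reuse as-is
            return self.by_members[key], False
        addr = self.free.pop()          # Dynamic Specification: push
        self.by_members[key] = addr     # the new membership for addr
        return addr, True

alloc = GroupAllocator(["239.1.2.%d" % i for i in range(16)])
a1, push1 = alloc.acquire({"A", "B", "C"})   # new group, push needed
a2, push2 = alloc.acquire({"C", "B", "A"})   # same membership, reused
```
]]></artwork></figure>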
</section>
<section title='Push vs. Join'>
<t>
	Existing methods for managing membership of a multicast group can be characterized
	as Join protocols. The receivers may join the group, or subscribe to a specific
	source within a group, but the receivers of multicast messages control their
	reception of multicast messages.
</t>
<t>
	This model is well suited for multimedia transmission where the sender does not
	necessarily know the full set of endpoints receiving its multicast content.
	In many cluster applications, however, the sender has determined the set of
	receivers. Requiring the sender to communicate with the recipients so that
	they can Join the group adds latency to the entire transaction.
</t>
<t>
	However, there would be a serious security concern if transactional multicasting
	were not limited to transactional subset multicasting. Requiring that every member
	of a subset multicast group already be a member of a reference multicast group
	ensures that no new method of sending traffic is being created. Without this
	guarantee a denial-of-service attacker could simply push a multicast group
	membership list of 1000 members, then flood that multicast group. The amount
	of traffic delivered to the aggregate destinations would be multiplied by
	a factor of 1000.
</t>
<t>
	Transactional subset multicasting is defined to eliminate the latency required
	for Join-directed multicast group membership, while avoiding creating a new
	attack vector for denial-of-service flooding.
</t>
</section>
</section>
<section title='Applicability'>
<t>
	Transactional Subset Multicast Groups are applicable for applications that
	want to reduce overall latency by reducing the number of round-trips required
	for their transactions when identical content must be delivered to multiple
	cluster members, but the selected members are a subset of a larger group
	that must be dynamically selected.
</t>
<t>
	Parallel processing of payload and/or storage of payload are the primary
	examples of such a pattern of communications.
</t>
<t>
	Examples of such applications include:
	<list style="symbols">
		<t>
			Computational Clusters, particularly those using MPI
			(see <xref target='MPI' />)
		</t>
		<t>
			Storage applications, including:
			<list style="symbols">
				<t>
					pNFS (See <xref target="RFC5661" />).
				</t>
				<t>
					Amazon Simple Storage Service (S3) (See <xref target="AmazonS3" />).
				</t>
				<t>
					OpenStack Object Storage (Swift) (See <xref target="Swift" />).
				</t>
			</list>
		</t>
	</list>
</t>
<t>
	Dynamic selection of subsets ultimately enables multiple concurrent transfers
	to occur, which would not have been possible if the message had been sent to
	the entire reference multicast group. Applications with relatively small
	payload to be multicast may find it easier to use simple multicast and
	slightly over-deliver the message.
</t>
<section anchor='GroupSelect' title='How is the Group Selected?'>
<t>
	In Join-directed multicasting the membership of a multicast group is controlled
	by the listeners joining and leaving the group. The sender does not control or
	even know the recipients. This matches the multicast streaming use-case very
	well. However it does not match a cluster that needs to distribute a
	transactional message to a subset of a known cluster.
</t>
<t>
	Join-directed protocols also assume the target group is stable for a long
	sequence of packets, such as a streamed video. The applications targeted
	here direct each transaction to a subset of a stable group.
</t>
<t>
	One example of the need to distribute a transactional message to a subset of a
	known cluster is replication of data within an object cluster. A set of targets
	has been selected through a higher layer protocol. Join-directed group setup here
	adds excessive latency to the process. The targets must be informed of their
	selection, they must execute IGMP joins, and they must confirm their joining to
	the source before the multicast delivery can begin. Only replication of large
	storage assets can tolerate this setup penalty.
</t>
<t>
	A distributed computation may similarly have data that is relevant to a specific
	set of recipients within the cluster. Performing the distribution serially to
	each target over unicast point-to-point connections uses excessive bandwidth
	and increases the transactions' latency. It is also undesirable to incur the
	latency of Join-driven multicast group setup.  
</t>
<t>
	This specification creates two methods for a sender to form or select a multicast
	group for transactional purposes. With these methods no further transmissions are
	required from the selected targets until the full transfer is complete.
</t>
<t>
	The restriction that the targeted group must be a subset of an existing multicast
	group is necessary to prevent a denial-of-service flooding attack. Transactional
	multicast groups that were not restricted to being a subset of an existing 
	multicast group could be used to flood a large number of targets that were
	unprepared to process incoming multicast datagrams.
</t>
</section>


<section anchor='WhoReceives' title='What are the endpoints that receive the messages?'>
<t>
	The endpoints of the transactional messages may be higher layer entities, where
	each network endpoint supports multiple instances of the higher layer entities.
	For example, a storage application may have IP addresses associated with specific
	virtual drives, as opposed to an IP address associated with a server that hosts
	multiple virtual drives.
</t>
<t>
	Having an IP address for each drive makes migrating control over that drive to
	a new server easier, and allows the servers to direct incoming payload to the
	correct drive.
</t>
</section>


<section anchor='Duration' title='What is the duration of the group?'>

<t>
	Join-directed multicasting is designed primarily for the multicast streaming
	use-case. A group has an indefinite lifespan, and members come and go at any
	time during this lifespan, which might be measured in minutes, hours or days.
</t>
<t>
	Transaction multicasting is designed to support applications where a transaction
	lasts for microseconds or milliseconds (possibly even seconds).
	Transactional multicasting seeks to identify a multicast group for the duration of
	sending a set of multicast datagrams related to a specific transaction. Recipients
	either receive the entire set of datagrams or they do not. Multicast streaming
	typically is transmitting error tolerant content, such as MPEG encoded material.
	Transaction multicasting will typically transmit data with some form of validating
	signature and transaction identifier that allows each recipient to confirm full
	reception of the transaction.
</t>
<t>
	This obviously needs to be combined with applicable congestion control strategies
	being deployed by the upper layer protocols. The Nexenta Replicast protocol only
	does bulk transfers against reserved bandwidth, but there are probably as many
	solutions for this problem as there are applications. Replicast relies upon
	IEEE 802.1 Data Center Bridging (DCB) protocols such as Priority Flow Control
	and Congestion Notification to provide no-drop service. The DCB protocols deal
	with the fine timing of congestion avoidance, but require higher layer transport
	or application protocols to keep the sustained traffic rates below the sustained
	capacity. Creating explicit reservations for bulk transfers is the main method
	for accomplishing this.
</t>
<t>
	The relevant DCB protocols include:
	<list style="symbols">
		<t>Congestion Notification: <xref target="IEEE.802.1Qau-2011" /></t>
		<t>Enhanced Transmission Selection: <xref target="IEEE.802.1Qaz-2011" /></t>
		<t>Priority Flow Control: <xref target="IEEE.802.1Qbb-2011" /></t>
	</list>
</t>
<t>
	The important distinction between Replicast and conventional multicast
	applications is that there is no need to dynamically adjust multicast
	forwarding tables during the lifespan of a transaction, while IGMP and
	MLD are designed to allow the addition and deletion of members while a
	multicast group is in use. This distinction is not unique to any single
	storage application. Transactional replication is a common element in cluster
	protocol design.
</t>
<t>
	The limited duration of a transactional multicast group implies that there is
	no need for the multicast forwarding element to rebuild its forwarding tables
	after it restarts. Any transaction in progress will have failed, and been retried
	by the higher-layer protocol. Merely limiting the rate at which it fails and
	restarts is all that is required of each forwarding element.
</t>
<t>
	Another implication is that there is no need for the forwarding elements to
	rebuild the membership list of a transactional multicast group after the
	forwarding element has been reset. The transactions using the forwarding
	element will all fail, and be retried by a higher layer transport or application
	protocol. Assuming that forwarding elements do not reset multiple times a minute
	this will have very limited impact on overall application throughput.
</t>
<t>
	The duration of a transaction is application specific, but inherently limited.
	A failed transaction will be retried at the application layer, so obviously it
	has a duration measured in seconds at the longest.
</t>
</section>


<section anchor='Members' title='Who are the members of the group?'>
<t>
	Join-directed multicasting allows any number of recipients to join or leave a
	group at will.
</t>
<t>
	Transactional multicast requires that the group be identified as a small subset of
	a pre-existing multicast group.
</t>
<t>
	Building forwarding rules that are a subset of forwarding rules for an existing
	multicast group can be done substantially faster than creating forwarding rules
	to arbitrary and potentially previously unknown destinations.
</t>
<t>
	Some applications, including Object Clusters, benefit from considering the
	members to be higher layer entities (such as virtual drives) rather than simply
	the base IP addresses of the servers that host the higher layer entities. Doing
	so allows groups to be defined for each set of logical endpoints, not merely
	sets of physical endpoints. An Object Cluster, for example, could have two
	different groups ([A,B,C] vs [A,B,D]) even when the destinations are the same
	Layer 2 MAC address (i.e., C and D are hosted by the same server). This allows
	the server hosting both C and D to distinguish which entity is addressed using
	the Destination IP Address.
</t>
</section>

<section anchor='Latency' title='How much latency does the application tolerate?'>
<t>
	While no application likes latency, multicast streaming is very tolerant of setup
	latency. If the end application is viewing or listening to media, the few
	milliseconds required to subscribe to the group will not have a measurable impact
	on the end user.
</t>
<t>
	For transactions in a cluster, however, every millisecond delays forward
	progress. The time it takes to do an IGMP join would be a significant addition
	to the latency of storing an object in an object cluster using a relatively
	fast storage technology (such as SSD, Flash or Memristor).
</t>
</section>

<section anchor='Maintenance' title='What must be done to maintain the Group?'>
<t>
	The Join-directed multicast protocols specify methods for the required
	maintenance of multicast groups. Multicast forwarders, switches or mrouters,
	must deal with new routes and new locations for endpoints.
</t>
<t>
	The reference multicast group will still be maintained by the existing
	Join-directed multicast group protocols. The existing IGMP/MLD snooping
	procedures will keep the L2 multicasting forwarding rules updated as changes
	in the network topology are detected. Nothing in this specification changes
	the handling of the reference multicast group.
</t>
<t>
	Transactional subset multicast groups are defined to be used only for short
	transactions, allowing them to piggy-back on the maintenance of the reference
	multicast group.
</t>
</section>
</section>
</section>

<section title='Forwarding Control Agent'>
<t>
	The Forwarding Control Agent is responsible for translating forwarding
	control messages as defined in <xref target="FCAMethods" /> into Layer 2
	multicast forwarding rules for one or more IP subnets associated with a
	single physical Layer 2 network.
</t>
<t>
	Each Forwarding Control Agent can be thought of as extending the IGMP/MLD
	snooping capabilities of an L2 forwarding element. It translates the
	forwarding control agent messages into configuration of L2 multicast forwarding
	just as an IGMP/MLD snooper translates IGMP/MLD messages into configuration of
	Layer 2 multicast forwarding. This MAY be done external to the existing
	implementation, or it MAY be integrated with the IGMP/MLD snooper
	implementation.
</t>
<t>
	Each Forwarding Control Agent:
	<list style='symbols'>
		<t>
			MUST accept authenticated forwarding control agent messages controlling
			the creation and membership of Transactional Subset Multicast Groups
			within the context of a specified VLAN.
		</t>
		<t>
			MUST support at least one VLAN.
		</t>
		<t>
			MAY support multiple VLANs.
		</t>
		<t>
			MUST update the controlled Layer 2 forwarding element's multicast
			forwarding rules to reflect the subset specified for the group.
		</t>
		<t>
			MUST update the controlled L2 forwarding element's multicast forwarding
			rules to reflect changes in the mapping of IP addresses to L2 MAC
			addresses between transactions for persistent transactional subset
			multicast groups when informed of a prior transactional failure with a
			Refresh Membership message (see <xref target='RefreshMembershipSetMsg' />).
		</t>
		<t>
			MAY refresh the Layer 2 multicast forwarding rules at any time.
		</t>
	</list>
</t>
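<t>
	The requirements above can be summarized in a skeleton such as the
	following. This is an illustrative sketch, not a normative interface;
	the class, method and field names are assumptions, and authentication
	of incoming messages is elided.
</t>
<figure><artwork><![CDATA[
```python
# Illustrative skeleton of a Forwarding Control Agent honoring the
# requirements above. Names are assumptions; this is a sketch, not
# a normative interface.
class ForwardingControlAgent:
    def __init__(self, vlans):
        if not vlans:
            raise ValueError("an FCA MUST support at least one VLAN")
        self.vlans = set(vlans)       # MAY support multiple VLANs
        self.rules = {}               # (vlan, group) -> set of ports

    def apply_membership(self, vlan, group, ports):
        """MUST update L2 multicast forwarding rules to reflect the
        subset specified for the group."""
        if vlan not in self.vlans:
            raise ValueError("unknown VLAN")
        self.rules[(vlan, group)] = set(ports)

    def refresh_membership(self, vlan, group, ports):
        """MUST refresh rules when informed of a prior transaction
        failure; MAY also refresh at any time."""
        self.apply_membership(vlan, group, ports)

fca = ForwardingControlAgent([100])
fca.apply_membership(100, "239.2.0.7", {1, 2, 9})
```
]]></artwork></figure>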
<section title="Network Topology">
<t>
	Forwarding Control Agents are applicable for networks which consist of one or
	more local subnets which have direct links with each other.
</t>
</section>
<section anchor="IsolatedVLANs" title="Isolated VLANs Strategy">
<t>
	Transactional Subset Multicast groups define a very large number of multicast
	addresses which must be delivered within a closed set of IP subnets without
	having to dynamically co-ordinate allocation of these multicast addresses with
	a wider network.
</t>
<t>
	This MAY be accomplished using an "Isolated VLANs Strategy" where the reference
	multicast group and all transactional multicast groups derived from it are
	used strictly inside of a single VLAN or a set of interconnected VLANs which
	route these multicast groups solely within this closed set.
</t>
<t>
	Specifically, an implementation using the Isolated VLANs Strategy:
	<list style="symbols">
		<t>
			MUST include only a pre-defined set of subnets, each enforced with a VLAN.
		</t>
		<t>
			MUST provide for routing or forwarding of all packets using the reference
			multicast group and all transactional subset multicast groups derived
			from it amongst these subnets.
		</t>
		<t>
			MUST NOT allow any packet using the reference multicast group or any
			transactional subset multicast groups derived from it to be routed to
			any subnet that is not part of the identified Isolated VLAN set.
		</t>
		<t>
			SHOULD guard the confidentiality of multicast packets routed between
			subnets when they transit subnets that are not part of the Isolated
			VLAN set.
		</t>
	</list>
</t>
<t>
	Applications MAY use the Isolated VLAN Strategy. Virtually all applications
	will elect to do so because allocating a very large block of adjacent multicast
	addresses would be very difficult. Confining usage of these addresses to a
	single VLAN is highly desirable.
</t>
<t>
	Direct connections between the VLANs hosting Forwarding Control Agents are
	required because the Transactional Subset Multicast Groups are not known to
	any intermediate multicast routers that would implement indirect links.
	Co-locating Forwarding Control Agents with RBridges (<xref target='RFC6325' />)
	MAY be a solution.
</t>
</section>
</section>
<section anchor='FCAMethods' title='Forwarding Control Agent Methods'>
<section anchor='DynamicPush' title='Dynamically Pushed Subset Groups'>
<t>
	Each Pushed Subset Membership command MUST contain the following:
	<list style="symbols">
		<t>
			Subset Transactional Multicast Group: the multicast group address that
			is to have its multicast forwarding rules updated. This address must be
			within a block of Transactional Multicast Groups previously created
			using the Create Transactional Multicast Address Block command
			(<xref target="CreateAddressBlock" />).
		</t>
		<t>
			Target List: the list of IP Addresses which are to be the targets of
			this group. These addresses are intended to be members of the reference
			group. When formulating the list, non-members MUST NOT be included.
			However, there is no transaction lock placed upon the group, and
			therefore there may be changes in the group membership before the
			message is received. Therefore the Forwarding Control Agent MUST ignore
			any listed target that is not a member of the reference group.
		</t>
	</list>
</t>
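<t>
	The handling of the Target List can be sketched as follows. Message
	formats have not yet been set (see the Editor's Note), so the field
	names and types here are assumptions for illustration only.
</t>
<figure><artwork><![CDATA[
```python
# Illustrative sketch of Target List validation by a Forwarding
# Control Agent. Field names are assumptions; the on-the-wire
# format is not yet specified.
from dataclasses import dataclass

@dataclass
class PushedSubsetMembership:
    subset_group: str    # transactional multicast group address
    target_list: list    # intended member IP addresses

def effective_targets(cmd, reference_members):
    """Silently ignore any listed target that is not (or is no
    longer) a member of the reference group."""
    return [t for t in cmd.target_list if t in reference_members]

cmd = PushedSubsetMembership(
    "239.2.0.7", ["10.0.0.1", "10.0.0.2", "10.0.9.9"])
members = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
# 10.0.9.9 is not in the reference group, so it is dropped.
```
]]></artwork></figure>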
<t>
	This sets the multicast forwarding rules for the pre-existing multicast
	address X to be the subset of the forwarding rules for existing group Y
	required to reach a specified member list.
</t>
<t>
	This is done by communicating the same instruction (above) to each multicast
	forwarding network element.  This can be done by unicast addressing with each
	of them, or by multicasting the instructions.
</t>
<t>
	Each multicast forwarder will modify its multicast forwarding port set to be
	the union of the unicast forwarding ports it has for the listed members, but
	the result must be a subset of the forwarding ports for the parent group.
</t>
<t>
	For example, consider an instruction to modify a transactional multicast
	group I, which is a subset of multicast group J, to reach addresses A, B and C.
</t>
<t>
	Addresses A and B are attached directly to multicast forwarder X,
	while C is attached to multicast forwarder Y.
</t>
<t>
On forwarder X the forwarding rule for new group I contains:
<list style="symbols">
	<t>The forwarding port for A.</t>
	<t>The forwarding port for B.</t>
	<t>
		The forwarding port to forwarder Y (a hub link). This eventually
		leads to C.
	</t>
</list>
</t>
<t>
	While on forwarder Y the forwarding rule for the new group I will contain:
	<list style="symbols">
		<t>The forwarding port for forwarder X (a hub link).
			This eventually leads to A and B.
		</t>
		<t>The forwarding port for C.</t>
	</list>
</t>
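<t>
	The per-forwarder computation can be sketched as follows: the subset
	group's port set is the union of the unicast forwarding ports for the
	listed members, constrained to the parent group's port set. The port
	numbers and topology below are the illustrative values from the
	example above, not a required implementation.
</t>
<figure><artwork><![CDATA[
```python
# Illustrative sketch of the per-forwarder rule computation for a
# transactional subset group. Port numbers are example values.
def subset_ports(members, unicast_port, parent_ports):
    """Union of the members' unicast forwarding ports, never
    exceeding the parent (reference) group's port set."""
    ports = {unicast_port[m] for m in members if m in unicast_port}
    return ports & parent_ports

# Forwarder X: A and B are local; the hub link toward Y carries C.
unicast_x = {"A": 1, "B": 2, "C": 9}   # port 9 is the hub link to Y
parent_x = {1, 2, 3, 9}                # ports used by reference group J
rule_x = subset_ports({"A", "B", "C"}, unicast_x, parent_x)
```
]]></artwork></figure>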
<t>
	This assumes that the Forwarding Control Agent can perform a two-step
	translation: first from IP Address to MAC Address, and then from MAC Address 
	to forwarding port. For typical applications of Transactional Subset Multicasting,
	all of the referenced IP Addresses will have been involved in recent messaging,
	and therefore will frequently already be cached.
</t>
<t>
	Many ethernet switches already support command line and/or SNMP methods of
	setting these multicast forwarding rules, but it is challenging for an
	application to reliably apply the same changes using multiple vendor specific
	methods. Having a standardized method of pushing the membership of a multicast
	group from the sender would be desirable.
</t>
<t>
	A Forwarding Control Agent MAY accept a request where the Target List is
	expressed as a list of destination L2 MAC addresses.
</t>
</section>


<section title='Persistent Transactional Subset Groups'>
<t>
	A large block of pre-configured multicast groups enumerates the possible
	subsets of a specific size of a master group, such as all combinations of
	3 members of multicast group X. These groups are enumerated and assigned
	successive multicast addresses within a block.
</t>
<t>
	The sender first obtains exclusive permission to utilize a portion of the reception
	capacity of each desired target, and then selects the multicast address that will
	reach that group.
</t>
<t>
	In a straightforward enumeration of 3 members out of a group of 20, there are
	20*19*18/(3*2*1), or 1140, possible groups. Typically the higher layer protocol
	will have negotiated the right to send the transaction with each member prior to
	selecting the multicast group. In making the final selection, the actual multicast
	group is selected and some offered targets are declined.
</t>
<t>
	Those 1140 possible groups can be enumerated in order (starting with M1, M2 and M3
	and ending with M18, M19 and M20) and assigned multicast addresses from N to N+1139.
</t>
<t>
	When the transaction requires reaching M4, M5 and M19, the sender simply selects that group.
	Because exclusive rights to use multicasting to M4, M5 and M19 have already been
	obtained through the higher layer protocol the group [M4,M5,M19] is already
	exclusively claimed.
</t>
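<t>
	The enumeration can be sketched as follows: a minimal Python
	illustration (helper name and base address are hypothetical) that maps
	a chosen 3-member subset onto its address within the block.
</t>
<figure><artwork>
```python
# Hypothetical helper: enumerate all 3-member subsets of the group in
# order and map each onto a consecutive multicast address from base N.
from itertools import combinations

def subset_address(base, members, chosen):
    """Address assigned to one enumerated 3-member subset."""
    groups = list(combinations(members, 3))   # 1140 groups for 20 members
    return base + groups.index(tuple(sorted(chosen)))

members = list(range(1, 21))                  # M1 .. M20
addr = subset_address(0x5000, members, (4, 5, 19))   # base N is illustrative
```
</artwork></figure>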
<t>
	These 1140 groups may be set up through any of the following means:
	<list style="symbols">
		<t>Traditional IGMP/MLD joining/leaving.</t>
		<t>
			Setting static forwarding rules using SNMP MIBs and/or switch-specific
			command line interfaces. Note that the widespread existence of command line
			interfaces to custom-set multicast forwarding rules is an indicator that
			there are existing applications that find the existing IGMP/MLD protocols
			to be inadequate to fulfill their needs.
		</t>
		<t>
			The Dynamically Pushed Multicast Group method.
			See <xref target='DynamicPush' />
		</t>
	</list>
</t>
</section>
</section>
<section title='Relationship to Existing Multicast Membership Protocols'>
<t>
	TBD: briefly describe and cite IGMP, MLD and PIM.
</t>
<t>
	Transactional Subset Multicast Groups are not a replacement for Join-based
	management of Multicast Groups. Rather, they extend the group maintenance
	performed by the Join-based multicast control protocols from the reference
	group to an entire set of multicast addresses that are subsets of it.
</t>
<t>
	This extension requires no modification to the existing data-plane
	multicast forwarding protocols or implementations. Transactional Subset
	Multicast groups may be implemented solely in the sender, receivers and
	the Forwarding Control Agents associated with each multicast forwarder
	supporting the reference group.
</t>
<t>
	The maintenance work of the Join-based multicast protocols performed on the
	reference multicast group is leveraged to allow maintenance of a potentially
	large number of derived Transactional Multicast groups. This allows
	identification of a large number of subsets of the reference group,
	without requiring a matching increase in the maintenance traffic which
	would have been required had the derived groups been formed with a
	Join-based protocol.
</t>
</section>
<section title='Control Protocol'>
<t>
	Note: the pre-standard protocol relies on multicasting of commands within a
	single secure VLAN. More general usage of these techniques will require
	transmitting Forwarding Control Agent instructions between subnets where
	they may be subject to interception and even alteration. Therefore a more
	secure method of delivering Forwarding Control Agent instructions is required.
</t>
<t>
	The methods standardized by KARP (Keying and Authentication for Routing
	Protocols) are, in the Authors' opinion, fully applicable to this protocol.
	See <xref target="RFC6518" />. 
	Working Group feedback is sought as to how to expand this section,
	whether to split the Control Protocol to a separate document, or other
	methods of dealing with the control protocol.
</t>
<t>
	The following requirements apply to any Control Protocol used:
	<list style="symbols">
	<t>
		Each request MUST be uniquely identified. This identification MUST include
		the source IP address of the requester.
	</t>
	<t>
		The message MUST be authenticated.
	</t>
	<t>
		WG discussion is needed to reach a consensus as to whether the message
		contents need to be kept confidential, or whether preventing alteration
		is sufficient.
	</t>
	<t>
		The sender MUST NOT be required to transmit the command more than once other
		than as required for retries. For example, requiring SSH connections with each
		Forwarding Control Agent is not acceptable.
	</t>
	<t>
		Barring network errors, the message MUST be delivered to all Forwarding Control
		Agents that can receive the reference master group.
	</t>
	</list>
</t>
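<t>
	One way the uniqueness requirement above could be met is sketched here
	(a hypothetical scheme, not specified by this draft): a request
	identifier combining the requester's source IP address with a
	per-sender sequence number, reused on retries so agents can deduplicate.
</t>
<figure><artwork>
```python
import itertools

class RequestIdentifier:
    """Hypothetical request-ID generator: (source IP, sequence number)."""
    def __init__(self, source_ip):
        self.source_ip = source_ip
        self._seq = itertools.count(1)

    def next_id(self):
        # A retry of the same command would reuse the previous ID.
        return (self.source_ip, next(self._seq))

ids = RequestIdentifier("10.0.0.9")
first = ids.next_id()
```
</artwork></figure>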
</section>
<section title="Forwarding Control Agent Methods">
<section anchor="CreateAddressBlock" 
		 title='Create Transactional Multicast Address Block'>
<t>
	TBD: This section will define the fields required for the command to create a block
	of transactional subset multicast addresses within a specific VLAN. The command
	defined here is delivered within a control protocol.
</t>
<figure anchor='CreateAddressBlockMsg'
        title='Create Transactional Multicast Address Block Message'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Opcode=CreateTransactionalMulticast            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Base Multicast Group Number                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Number of Addresses required in Block           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</artwork></figure>
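<t>
	On-the-wire packing of the message above can be sketched as three
	32-bit words in network byte order. The opcode value used here is a
	placeholder; the actual code point is left to IANA.
</t>
<figure><artwork>
```python
import struct

OP_CREATE = 1  # hypothetical opcode value, pending IANA assignment

def pack_create_block(base_group, count):
    """Opcode, Base Multicast Group Number, Number of Addresses."""
    return struct.pack("!III", OP_CREATE, base_group, count)

msg = pack_create_block(0x00ABCD, 64)   # 12-byte message
```
</artwork></figure>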
<t>
	The Multicast Group Number is the 24-bit L2 Multicast MAC address.
	This matches both the IPV4 and IPV6 addresses which map to it.
	A given UDP datagram is sent using either an IPV4 or an IPV6 address,
	so the membership of a Multicast Group is either IPV4 endpoints or
	IPV6 endpoints at any given instant.
</t>
<t>
	This command does not allow creating a numerically scattered group of
	addresses. Doing so would have made the job of each Forwarding Control
	Agent more complex, and would be of no benefit in the recommended
	Isolated VLANs strategy (See <xref target='IsolatedVLANs' />).
</t>
<t>
	note: add IANA language here
</t>
</section>
<section title='Release Transactional Multicast Address Block'>
<figure anchor='ReleaseAddressBlockMsg'
        title='Release Transactional Multicast Address Block Message'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              Opcode=ReleaseTransactionalMulticast             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Base Multicast Group Number                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

</artwork></figure>
<t>
	note: add IANA language here
</t>
</section>
<section title='Set Dynamic Transactional Multicast Group Membership IPV6'>
<figure anchor='PushMembershipMsgIPV6'
        title='Set Dynamic Transactional Multicast Group Membership Message'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Opcode=PushTransactionalMulticastMembershipIPV6       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | # members     |        Multicast Group Number                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                    IPV6 Address of 1st Member                 |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	...
</artwork></figure>
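<t>
	Packing of the IPV6 push message above can be sketched as follows
	(opcode value is a placeholder, pending IANA). The 8-bit member count
	shares a 32-bit word with the 24-bit Multicast Group Number.
</t>
<figure><artwork>
```python
import socket
import struct

OP_PUSH_V6 = 3  # hypothetical opcode value, pending IANA assignment

def pack_push_v6(group, members):
    """Opcode word, count/group word, then one 16-byte address per member."""
    word = (len(members) * 2**24) + (group % 2**24)
    msg = struct.pack("!II", OP_PUSH_V6, word)
    for addr in members:
        msg = msg + socket.inet_pton(socket.AF_INET6, addr)
    return msg

msg = pack_push_v6(0x000042, ["2001:db8::1", "2001:db8::2"])
```
</artwork></figure>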
<t>
	# members: 8-bit unsigned count of the IPV6 addresses that are to be
	the targets of the specified Multicast Group Number.
</t>
<t>
	note: add IANA language here
</t>
</section>
<section title='Set Dynamic Transactional Multicast Group Membership IPV4'>
<figure anchor='PushMembershipMsgIPv4'
        title='Set Dynamic Transactional Multicast Group Membership Message'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Opcode=PushTransactionalMulticastMembershipIPV4       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | # members     |        Multicast Group Number                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     IPV4 Address of 1st member                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	...
</artwork></figure>
<t>
	# members: 8-bit unsigned count of the IPV4 addresses that are to be
	the targets of the specified Multicast Group Number.
</t>
<t>
	note: add IANA language here
</t>
</section>
<section title='Set Persistent Transactional Multicast Groups IPv6'>
<figure anchor='PushMembershipSetMsgIPV6'
        title='Set Persistent Transactional Multicast Groups Message IPV6'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |            Opcode=PushPersistentMulticastMembershipIPV6       | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | select N      |      Base Multicast Group Number to be        | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | # members     |        Reference Multicast Group Num          | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
   |                    IPV6 Address of 1st Member                 |  
   |                                                               |  
   |                                                               |  
   |                                                               |  
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
	...
</artwork></figure>
<t>
	Members: 8 bit unsigned number of Members that are to be included
	in each Transactional Subset Group set by this command.
</t>
<t>
	Base Multicast Group Number to be set.
</t>
<t>
	# Members in the following list of IPV6 addresses. These must
	all be members of the Reference Multicast Group.
</t>
<t>
	Reference Multicast Group Num: 24 bit L2 Multicast Group Number.
</t>
<t>
	The motivation for supplying the list of IP addresses is to avoid
	race conditions where an IGMP or MLD join is in progress. If there
	were a method to refer to a specific generation of a multicast
	group membership then it would be possible to omit this list.
	Working Group suggestions are encouraged on this topic.
</t> 
<t>
	note: add IANA language here
</t> 
</section>
<section title='Set Persistent Transactional Multicast Groups IPv4'>
<figure anchor='PushMembershipSetMsgIPV4'
        title='Set Persistent Transactional Multicast Groups Message IPv4'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |            Opcode=PushPersistentMulticastMembershipIPV4       | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | select N      |      Base Multicast Group Number to be        | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | # members     |        Reference Multicast Group Num          | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
   |                    IPV4 Address of 1st Member                 | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
	...
</artwork></figure>
<t>
	Members: 8 bit unsigned number of Members that are to be included
	in each Transactional Subset Group set by this command.
</t>
<t>
	Base Multicast Group Number to be set.
</t>
<t>
	# Members in the following list of IPV4 addresses. These must
	all be members of the Reference Multicast Group.
</t>
<t>
	Reference Multicast Group Num: 24 bit L2 Multicast Group Number.
</t>
<t>
	note: add IANA language here
</t> 
</section>
<section title='Refresh Persistent Transactional Multicast Group'>
<figure anchor='RefreshMembershipSetMsg'
        title='Refresh Persistent Transactional Multicast Groups Message'>
<artwork>
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |               Opcode=RefreshMulticastMembership               | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | reserved      |    Multicast Group Number to be Refreshed     | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | reserved      |        Reference Multicast Group Num          | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
</artwork></figure>
<t>
	The existing Join-directed multicast group control protocols maintain
	delivery of a multicast group to the subscribers independent of network
	topology changes at either Layer 2 or Layer 3. If a unicast IP datagram
	to a member would be delivered, then the multicast forwarding can be
	expected to also be current.
</t>
<t>
	Transactional subset multicast groups do not require the same effort
	for maintenance. For a given transaction the entire set of datagrams is
	either delivered or it is not. There is no benefit to the application
	that the Forwarding Control Agent can achieve by promptly updating the
	L2 multicast forwarding tables after a network topology change. The current
	transaction will miss at least one datagram, and therefore does not care if
	it misses multiple datagrams.
</t>
<t>
	However, a Persistent Transactional Subset Multicast Group is used for a
	sequence of transactions targeting the same group. The upper layer protocol
	sender must have obtained exclusive rights to use the group for the period
	of time that it will be sending the transaction.
</t>
<t>
	One method that it MAY use is to obtain the exclusive right to send the
	specific type of transaction to each of the members of the targeted group
	during negotiations conducted prior to use of the transactional group. For
	example, a reservation on inbound bandwidth may have been granted.
</t>
<t>
	The Forwarding Control Agent MAY refresh its mapping from member IP addresses
	to L2 MAC address and then to L2 forwarding port at any time. However it MUST
	do so after receipt of a Refresh Transactional Subset Multicast Group for the
	group.
</t>
<t>
	The sender of a transaction SHOULD send a Refresh Transactional Subset Multicast
	Group message after it fails to receive acknowledgement of an attempted transaction.
</t>
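<t>
	The retry rule above can be sketched as follows (hypothetical API:
	send and refresh stand in for the upper layer protocol's transaction
	attempt and the Refresh Transactional Subset Multicast Group message).
</t>
<figure><artwork>
```python
def run_transaction(send, refresh, attempts=3):
    """Attempt a transaction; refresh the group after each missed ack."""
    for _ in range(attempts):
        if send():
            return True
        refresh()   # SHOULD refresh after an unacknowledged attempt
    return False
```
</artwork></figure>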
</section>
</section>

<section title='Security Considerations'>
<t>
	The methods described here do not enable any sender to multicast messages to any
	destination that was not already addressable by it. Therefore these techniques
	enable no new security vulnerabilities.
</t>
<t>
	Because authentication of subset commands is kept lightweight there is an
	implicit trust within the application that transactional subset groups will
	be formed or selected in accordance with application layer expectations.
	The transport layer lacks sufficient information to enforce application layer
	expectations. If a malicious actor deliberately creates a transactional subset
	multicast group with incorrect membership, it may adversely impact the operation
	of the specific upper layer application. However, in no case can it be used to
	launch a denial of service attack on targets that have not already voluntarily
	joined the reference group.
</t>
<t>
	The protocol does not currently provide any mechanism to guard against selecting
	an existing but unrelated multicast group as a reference multicast group.
	Explicitly enabling use of an existing multicast group to be a reference
	group would not solve the problem that the existing management of multicast
	groups is not aware of the need to explicitly forbid creation of derived 
	multicast groups based upon a multicast group that it creates.
</t>
</section>

<section title="IANA Considerations">
<t>
	To be completed.
</t>
</section>

<section title='Summary'>
<t>
	This proposal provides two new methods to manage multicast group membership.
	These are simple techniques, but they provide a cohesive cluster-wide approach
	to transactional multicasting. These techniques are better suited for
	transactional multicasting than the existing methods, IGMP and MLD, which are
	oriented to streaming use-cases.
</t>
</section>
	
</middle>
<back>
	<references title='Informative References'>
		<reference anchor="Replicast">
			<front>
				<title>
					White Paper: Nexenta Replicast
					http://info.nexenta.com/rs/nexenta/images/Nexenta_Replicast_White_Paper.pdf
				</title>
				<author initials="C" surname="Bestler" fullname="Caitlin Bestler">
					<organization>Nexenta Systems</organization>
				</author>
				<date year="2013" month="November" />
			</front>
		</reference>
		<reference anchor="MPI">
			<front>
				<title>
					Message Passing Interface
				</title>
				<author>
					<organization>MPI Forum</organization>
				</author>
				<date year="2012" />
			</front>
		</reference>
		<reference anchor="AmazonS3">
			<front>
				<title>
					Amazon Simple Storage Service (S3)
					http://aws.amazon.com/s3/
				</title>
				<author>
					<organization>Amazon</organization>
				</author>
				<date year="2014" />
			</front>
		</reference>
		<reference anchor="Swift">
			<front>
				<title>
					OpenStack Object Service (Swift)
					http://docs.openstack.org/developer/swift/
				</title>
				<author>
					<organization>Openstack</organization>
				</author>
				<date year="2014" />
			</front>
		</reference>
		<reference anchor="IEEE.802.1Qau-2011">
			<front>
				<title>
					IEEE Standard for Local and Metropolitan Area Networks:
					Virtual Bridged Local Area Networks
					- Amendment 10: Congestion Notification
				</title>
				<author>
					<organization>IEEE</organization>
				</author>
				<date year="2011" />
			</front>
			<seriesInfo name="IEEE Std" value="802.1Qau" />
		</reference>
		<reference anchor="IEEE.802.1Qaz-2011">
			<front>
				<title>
					IEEE Standard for Local and Metropolitan Area Networks:
					Virtual Bridged Local Area Networks
					- Amendment 18: Enhanced Transmission Selection.
				</title>
				<author>
					<organization>IEEE</organization>
				</author>
				<date year="2011" />
			</front>
			<seriesInfo name="IEEE Std" value="802.1Qaz" />
		</reference>
		<reference anchor="IEEE.802.1Qbb-2011">
			<front>
				<title>
					IEEE Standard for Local and Metropolitan Area Networks:
					Virtual Bridged Local Area Networks
					- Amendment 17: Priority-based Flow Control.
				</title>
				<author>
					<organization>IEEE</organization>
				</author>
				<date year="2011" />
			</front>
			<seriesInfo name="IEEE Std" value="802.1Qbb" />
		</reference>
		<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.5661.xml"?>
	</references>
	<references title='Normative References'>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.3376.xml"?>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.3810.xml"?>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.4541.xml"?>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.4604.xml"?>
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.6325.xml"?>	
	<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.6518.xml"?>
	</references>	
</back>
</rfc>	