This is a good point, and highlights
a very tricky issue in any distributed system design. A protocol should
behave gracefully in the face of failure. One of the definite challenges
here is partition vs. failure. It's impossible to distinguish between a
failed service and one which is lost behind a network partition. In the
case of a partition, with an avatar continuing to run on a momentarily isolated
region, care needs to be taken to recover gracefully not only from the
"can't get to the region service to get it to release the agent" case,
but also from the "the region has returned to contact" case. So, I
think when we talk about being resilient we need to think about the
range of failure, from "the service I was using has crashed"
to "the service I was using has become invisible due
to a network partition." Of course,
they look identical. This rather strongly implies that we need well-defined
semantics for what happens to a "stub" session which becomes isolated, as well
as to the "new" session which we create by bypassing a failed
service. (In the case of teleport, the state at stake is the old region's idea
of agent state and the new one's; in other cases, possibly other isolated state.)
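To make the stub-versus-new session question concrete, here is a minimal sketch of one possible approach: the agent domain stamps each session with an increasing epoch number, so a region that returns to contact with an old epoch can be recognized as holding a stub session. The `AgentDomain` class, its method names, and the epoch scheme itself are assumptions made for illustration; none of this is defined in the drafts.

```python
from dataclasses import dataclass, field


@dataclass
class AgentDomain:
    """Hypothetical agent domain that fences stale region sessions by epoch."""

    # agent_id -> (region_id, epoch) of the one session the AD treats as live
    live: dict = field(default_factory=dict)
    _next_epoch: int = 0

    def place_agent(self, agent_id: str, region_id: str) -> int:
        """Start a new session, superseding any earlier one for this agent."""
        self._next_epoch += 1
        self.live[agent_id] = (region_id, self._next_epoch)
        return self._next_epoch

    def accept_update(self, agent_id: str, region_id: str, epoch: int) -> bool:
        """Accept agent state changes only from the current session."""
        return self.live.get(agent_id) == (region_id, epoch)

    def on_region_reconnect(self, agent_id: str, region_id: str, epoch: int) -> str:
        """Tell a region that has returned to contact what to do with its session."""
        if self.accept_update(agent_id, region_id, epoch):
            return "resume"
        return "release"  # stub session: its agent state should be discarded


if __name__ == "__main__":
    ad = AgentDomain()
    old = ad.place_agent("agent-1", "R1")  # session in R1
    new = ad.place_agent("agent-1", "R2")  # R1 unreachable, so the AD bypasses it
    print(ad.on_region_reconnect("agent-1", "R1", old))
    print(ad.on_region_reconnect("agent-1", "R2", new))
```

The sketch treats a crashed region and a partitioned one identically, which is the point: the AD never needs to know which of the two happened, only which session is current.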
Food for thought
- David
~ Zha
Morgaine <morgaine.dinova at googlemail.com> Sent by: ogpx-bounces at ietf.org
10/23/2009 12:47 PM
To: ogpx at ietf.org
Subject: Re: [ogpx] Teleports and protocol resilience
Looking back at the replies in this thread, I think that
the goal and the means to achieve it didn't quite come across.
I was trying to address only a very specific issue, just protocol resilience
under source region non-responsiveness, since this is common enough that
it merits addressing. I did not suggest that there be any perceivable
change of teleport semantics under normal operation (because no such change
is needed), only a change in service coupling. The semantics we experience
in SL and in Opensim would remain completely unchanged, except in the single
case of source region non-responsiveness. Under this single anomalous
case there would be a perceivable change, but that change
would be a huge improvement.
There would be no new decoherence introduced since exactly the same state
changes would occur on TP as before, with no possibility of agent state
change in the source region once the AD accepts the TP.
All that's needed to achieve such resilience for teleport at the protocol
level is a slight revision of operation phasing to permit greater execution
overlap, as I outlined. This is independent of anything else that
happens in the course of the overall teleport operation --- the change
of phasing would affect only the transfer of agent location,
nothing else.
In particular, it should not be confused with the separate requirement
for instantiation of assets or objects at destination, nor with the matter
of serializing and deserializing script states. The latter has not
even been defined for VWRAP, so it's hard to talk about changing it.
In any event, this isn't about those aspects of teleport, and doesn't affect
them --- they would continue to work as before.
One of the central aspects of VWRAP is that the protocol is based on a
multiple services model, and one of the key approaches in highly scalable
systems design is to keep services decoupled to the largest extent possible.
That's what I'm proposing here, a partial decoupling that has no
normal semantic change but which does have benefits in anomalous situations.
Agent location change can be decoupled significantly from asset
instantiation change and script state transfer. My suggestion referred
to this decoupled agent location change only, not to asset and simulation
services. Those other two services undergo state transitions at the
same time as change of agent location does on TP, but services should never
be coupled together unnecessarily, and in this case the coupling can be
left very weak. The three types of service operations can proceed
each at their own independent rates, coupled at TP initiation time and
nowhere else.
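As a rough illustration of "coupled at TP initiation time and nowhere else", the sketch below starts three stand-in operations together and lets each finish at its own rate. The operation names mirror the three kinds of service operation discussed above; the delays and the `operation` and `teleport` functions are invented for the example and do not correspond to any defined VWRAP message.

```python
import asyncio


async def operation(name: str, seconds: float, finished: list) -> None:
    """Stand-in for one service operation; the delay models its own rate."""
    await asyncio.sleep(seconds)
    finished.append(name)


async def teleport() -> list:
    """Start all three operations at TP initiation and never couple them again."""
    finished: list = []
    await asyncio.gather(
        operation("asset instantiation", 0.05, finished),
        operation("script state transfer", 0.10, finished),
        operation("agent location", 0.01, finished),
    )
    return finished  # names in the order the operations completed


if __name__ == "__main__":
    print(asyncio.run(teleport()))
```

With these made-up delays the agent location change completes first, without waiting on asset instantiation or script state transfer.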
It should be noted that the legacy protocols do some of this already, in
that the agent is already active in the destination region long before
her avatar or objects have appeared. Furthermore, the avatar currently
continues to be visible in the source region for a while after the agent
becomes active in the destination region, because of normal operation latencies,
sim-side queueing, and client lag. This is a normal part of current
operation, and is not considered an anomaly. What's important is that no
new state change to the agent is possible in the source region after TP
is initiated, and that would remain true.
The impact of this on the other parts of the puzzle needs to wait until
those other parts are examined. We're not there yet, but I would
hope that improving teleport protocol resilience would be a desirable
goal when the only noticeable change in semantics occurs under fault conditions
and provides a major improvement on current behaviour.
Morgaine.
======================================
On Tue, Oct 13, 2009 at 6:13 AM, Morgaine <morgaine.dinova at googlemail.com>
wrote:
One of the advantages we have in developing the VWRAP
protocols is that we are able to look back at legacy SL and Opensim protocols
and recognize design mistakes or limitations in them. This allows
us to avoid repeating such mistakes or limitations in the next generation
of systems.
One of the most common sources of frustration and dissatisfaction is simulator
non-responsiveness. While this has many possible causes, in VWRAP
we are not interested in the internal implementation of simulators, but
we ARE interested in the ability of a protocol endpoint to perform
its duty within the protocol. A jammed simulator host is in many
cases quite unable to perform its protocol duties, or in some cases only
exceedingly slowly, often timing out in a TP for example. We have
a huge amount of experience of this happening in both SL and Opensim, so
it is a practical reality. On occasion, simulators will be unable
to fulfil their part in a protocol, and this needs to be taken into account
because it is not uncommon.
One key area in which the above is relevant is in teleports OUT
of a simulator that is under distress. Quite often users wish nothing
more than to leave the region being run by a dying simulator, but
when teleport-out requires cooperation from the host that one is trying
to leave then this is often not possible at all. In this situation,
the only remedy in existing systems is to forcibly terminate the client
and relog in another region. We should avoid such out-of-protocol
remedies being necessary through good protocol design.
In VWRAP, we have both Rez Avatar and Derez Avatar capabilities, which
lead to corresponding protocol operations during teleport. Suppose R1
is a region being run by a non-responsive simulator from which we want
to escape, and R2 is another region to which we wish to go. If the protocol
requires a Derez in R1 to be completed before a Rez in R2 can commence,
then the user will have difficulties. Clearly we don't want this.
In http://tools.ietf.org/html/draft-hamrick-ogp-intro-00, it is made
clear that "The agent domain MUST also remove the
avatar from it's current location before placing the avatar
in the destination location." This suggests that the protocol
will be sensitive to R1 non-responsiveness. While we do not yet have
an actual VWRAP Teleport draft, it seems likely that its initial incarnation
will have that same problem built in.
I suggest that the protocol define Derez and Rez as concurrent
and non-dependent operations to avoid this situation.
The AD can mark R1 as disabled for all further agent state changes ---
this will provide all the protection needed to prevent brief double-presence
anomalies from being significant. If a jammed R1 refuses to give
up its hold on the avatar, then at least the user will not suffer from
it. Reaping dead simulator sessions then becomes a problem for the
region operator alone, and not for the AD, the user, and the region as
happens now.
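A minimal sketch of the suggestion, under stated assumptions: the `derez`, `rez`, and `teleport` functions, the `disabled` set, and the timeout value are all hypothetical, since no VWRAP Teleport draft exists yet. The AD first marks R1 as disabled for agent state changes, then issues Derez and Rez concurrently, so a jammed R1 delays only its own cleanup.

```python
import asyncio


async def derez(region: str, responsive: bool) -> str:
    """Hypothetical Derez request; a jammed simulator never answers."""
    if not responsive:
        await asyncio.Event().wait()  # never set, so this waits like a hung host
    return f"derezzed from {region}"


async def rez(region: str) -> str:
    """Hypothetical Rez request to a healthy destination."""
    await asyncio.sleep(0.01)
    return f"rezzed in {region}"


async def teleport(disabled: set, src: str, dst: str, src_responsive: bool,
                   derez_timeout: float = 0.1) -> dict:
    # The AD first disables the source for all further agent state changes.
    disabled.add(src)
    # Derez is scheduled alongside Rez; Rez does not wait for it.
    derez_task = asyncio.create_task(derez(src, src_responsive))
    rez_result = await rez(dst)
    try:
        derez_result = await asyncio.wait_for(derez_task, derez_timeout)
    except asyncio.TimeoutError:
        # Left for the region operator to reap; the user is already in dst.
        derez_result = f"{src} did not respond"
    return {"rez": rez_result, "derez": derez_result}


if __name__ == "__main__":
    print(asyncio.run(teleport(set(), "R1", "R2", src_responsive=False)))
```

In this sketch the Rez in R2 succeeds whether or not R1 ever answers; the only visible difference under a fault is the Derez outcome, which the AD can hand off for later cleanup.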