
[PMOL] Several questions/suggestions from my review of draft-ietf-pmol-sip-perf-metrics-04



Hi All -

I have several concerns about draft-ietf-pmol-sip-perf-metrics that I would like to discuss. I've asked for a dedicated RAI-review of this document, so there may be additional comments later, but I wanted to get these to you now so we can start working through them.

These comments are more-or-less in document order, with a couple of nits moved to the end. I've numbered them to help split responses into threads later. When replying to just one of the items below, please change the subject line to indicate what you're replying to.

Thanks,

RjS

--------------------------------------------------------------------------------------------------

1 The document should more carefully describe its scope (and consider
  changing its title). This document focuses on the use of SIP for
  simple telephony and relies on measurements in earlier telephony
  networks for guidance.  But telephony is only one use of SIP. These
  aren't the same metrics that would be most useful for observing a
  network that was involved primarily in setting up MSRP sessions for
  file transfer, for instance. An eventual set of generic SIP
  performance metrics will need to focus on the primitives rather than
  artifacts from any particular application.

2 That said, I'm skeptical of the utility of many of these metrics even
  for monitoring systems that are focusing only on delivering basic
  telephony. Has the group surveyed operators to see what they're
  measuring, what they're finding useful, and what they're just
  throwing away? Some additional text motivating why this particular
  set of metrics was chosen should be provided to help
  operators/implementers choose which ones they are going to try to use.

3 "Each session is identified by a unique Call-ID" is incorrect. You
  need at least Call-ID, to-tag, and from-tag here. And to be pedantic,
  you're describing the SIP dialog, not one of the sessions it manages.
  The session is what is described by the Session Description Protocol.
  The metrics in this draft are derived from signaling events, not
  session events, and the draft makes assumptions about how those correlate
  for a simple voice call that may not be true for more advanced uses.
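
  To illustrate the point about dialog identification, here is a minimal
  sketch (not taken from the draft) of a dialog key built from Call-ID plus
  both tags, as RFC 3261 defines it. The names DialogId and
  dialog_id_for_uac are hypothetical, and the header values are made up.

```python
from typing import NamedTuple


class DialogId(NamedTuple):
    """Dialog identifier: Call-ID plus local and remote tags (hypothetical type)."""
    call_id: str
    local_tag: str
    remote_tag: str


def dialog_id_for_uac(call_id: str, from_tag: str, to_tag: str) -> DialogId:
    # At the sender of the dialog-creating request, the local tag is the
    # From tag and the remote tag is the To tag.
    return DialogId(call_id, from_tag, to_tag)


# Two branches of a forked INVITE share a Call-ID and from-tag,
# but each answering endpoint supplies its own to-tag.
branch_a = dialog_id_for_uac("a84b4c76e66710@ua.example.com", "1928301774", "tag-a")
branch_b = dialog_id_for_uac("a84b4c76e66710@ua.example.com", "1928301774", "tag-b")
same_call_id = branch_a.call_id == branch_b.call_id
same_dialog = branch_a == branch_b
```

  Here same_call_id is true while same_dialog is false, which is why
  Call-ID alone cannot key the measurements.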

4 The document is inconsistent about whether the metrics will describe
  any part of an early-dialog/early session. The introduction indicates
  it won't and focuses on the delivery of a 200 OK, but there are
  metrics that measure the arrival time of 180s. This should be
  reconciled. Do take note that early sessions are pervasive in real
  deployments at this point in time.

5 These metrics are intentionally designed to not measure (or be
  perturbed by) the hop-hop retransmission mechanisms. This should be
  made explicit. There should also be some discussion of the effect of
  the end-to-end retransmission of 200OK/ACK on the metrics based on
  those messages.

6 The document should consider the effects of the presence or absence
  of the reliable-provisional extension on its metrics (some of the
  metrics will be perturbed by a lost 18x that isn't sent reliably).

7 Using T1 and T4 as the timing interval measurement tokens is
  unfortunate. SIP uses those symbols already to mean something
  completely different. Is there a reason not to change these and avoid
  the confusion that the collision will cause?

8 The document uses the terms UAC and UAS incorrectly. It is trying to
  use them to mean the initiator and recipient of a simple phone call.
  But the terms are roles scoped to a particular transaction, not to a
  dialog. When an endpoint sends a BYE request, it is by definition
  acting as a UAC.

9 The document uses the word "dialog" in a way that's not the same as
  the formal term with the same name defined in RFC3261 and that will
  lead to confusion. (A sequence of register requests and responses,
  for example, are never part of any dialog. The INVITE/302/ACK
  messages shown in the call setup flows are not part of any dialog.)
  Please choose another word or phrase for this draft. I suggest
  "message exchange".

10 The 3rd to last paragraph of section 4 should be expanded. I think
  it's unlikely that implementers, especially those with other language
  backgrounds, will understand the subtlety of the quotes around
  "final".  Enumerating the cases where you want the measurement to
  span from the request of one transaction to the final response of
  some other transaction will help. (I'm guessing you were primarily
  considering redirection, but I suspect you also wanted to capture the
  additional delay due to Require-based negotiation or 488
  not-acceptable-here style re-attempts?). You may also want to
  consider the effect of the negotiation phase of extensions like
  session-timer on these metrics.

11 The document assumes that a registration will be DIGEST challenged.
  That's a common deployment model, but it is not required. If other
  authentication mechanics are used (such as SIP Identity), the RRD
  metric, for example, becomes muddied.

12 In section 4.2, "Subsequent REGISTER retries are identified by the
  same Call-ID" should say "identified by the same transaction
  identifier (same topmost Via header field branch parameter value)".
  Completely different REGISTER transactions from a given registrant
  are likely to have the same Call-ID.
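
  As a rough sketch of the distinction (not from the draft), a measurement
  point could key retries on the topmost Via branch plus the CSeq method
  rather than on Call-ID. The function name transaction_key and the sample
  header values are assumptions, and the regular expression is a
  simplification of full Via parsing.

```python
import re

# Simplified match for the branch parameter; a real parser would follow the
# full Via grammar.
_BRANCH_RE = re.compile(r";\s*branch\s*=\s*([^;,\s]+)", re.IGNORECASE)


def transaction_key(top_via: str, cseq_method: str) -> tuple:
    """Return (branch, method) for the topmost Via header value (hypothetical helper)."""
    match = _BRANCH_RE.search(top_via)
    if match is None:
        raise ValueError("no branch parameter in topmost Via")
    return (match.group(1), cseq_method.upper())


# A retransmission reuses the branch; a later, unrelated REGISTER from the
# same registrant gets a new branch even if it reuses the Call-ID.
original = transaction_key("SIP/2.0/UDP ua.example.com:5060;branch=z9hG4bK74bf9", "REGISTER")
retransmit = transaction_key("SIP/2.0/UDP ua.example.com:5060;branch=z9hG4bK74bf9", "REGISTER")
refresh = transaction_key("SIP/2.0/UDP ua.example.com:5060;branch=z9hG4bK91cd2", "REGISTER")
```

  With these inputs, original and retransmit compare equal, while refresh
  is a different transaction.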

13 The SRD metric definition in 4.3.1 ignores the effect of forking.
  Unlike 200 OKs, where receiving multiple 200s in response to a single
  INVITE only happens if a race is won, it is the _normal_ state of
  affairs for a UAC to receive provisional responses from multiple
  branches when a request forks. Deployed systems are increasingly
  sending 18x responses reliably with an answer, establishing early
  sessions, so when forking is present it is _highly_ likely that there
  will be multiple 18x's from different branches arriving at the UA.
  This section should provide guidance on what to report when this
  happens.
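
  One possible shape for such guidance, sketched under the assumption that
  each branch is distinguished by its to-tag: record the first provisional
  response per branch instead of a single value per INVITE. The function
  name and the millisecond timestamps are hypothetical.

```python
def first_provisional_per_branch(invite_sent_ms, provisionals):
    """Map each to-tag to the delay (ms) of its earliest 18x response.

    provisionals is an iterable of (arrival_ms, to_tag) pairs.
    """
    first = {}
    for arrival_ms, to_tag in provisionals:
        delay = arrival_ms - invite_sent_ms
        if to_tag not in first or delay < first[to_tag]:
            first[to_tag] = delay
    return first


# INVITE sent at t=1000 ms; 18x responses arrive from two forked branches.
delays = first_provisional_per_branch(
    1000, [(1500, "tag-a"), (1200, "tag-b"), (1900, "tag-a")]
)
earliest_overall = min(delays.values())
```

  Whether to report the per-branch values, only the earliest, or something
  else is exactly the choice the section needs to spell out.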

14 The Failed Session Setup SRD claims to be useful in detecting
  problems in downstream signaling functions. Please provide some text
  or a reference supporting that claim. As written, this metric could
  be dominated by how long the called user lets his phone ring. Is that
  what was intended? You might consider separate treatment for 408s and
  for explicit decline response codes.

15 What was the motivation for making MESSAGE special in section 4.3.3?
  Why didn't the group instead extend the concept to measuring _any_
  non-INVITE transaction (with the possible exception of CANCEL)?

16 In section 4.4, what does it mean to measure the delay in the
  disconnect of a failed session completion? Without a successful
  session completion, there can be no BYE. This section also begs the
  very hard to answer question about what to do when BYEs receive
  failure responses. It would be better to note that edge-case exists
  and what, if anything, the metric is going to say about it if it
  happens.

17 Section 4.5 is a particularly strong example of these metrics
  focusing on the simple telephony application. It may even be falling
  into the same traps that lead to trying to build fraud-resistant
  billing based on the time difference between an INVITE and a BYE.
  Some additional discussion noting that the metric doesn't capture
  early media and recommendation on when to give up on seeing a BYE
  would be useful. (Sometimes BYEs don't happen even when there is no
  malicious intent.)

18 Trying to use Max-Forwards to determine how many hops a request took
  is going to produce incorrect results in any but the most simple of
  network deployments (I would have expected this to be based on
  counting Vias with a note pointing to the discussion on the problems
  B2BUAs introduce). Proxies can reduce Max-Forwards by more than one.
  There are many implementations in the wild that cap Max-Forwards. If
  this metric remains as defined, you should also point out that
  neither endpoint can calculate it. Some third entity will have to
  collect information from each end to make this calculation.
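
  A small sketch of how the two approaches can disagree (not from the
  draft). It assumes the sender started Max-Forwards at 70, the value
  RFC 3261 recommends, and it splits Via values on commas, which ignores
  corner cases such as quoted parameters. Both function names and all
  sample values are hypothetical.

```python
def via_hop_count(via_header_values):
    """Count Via entries in a received request; each hop adds one entry."""
    count = 0
    for value in via_header_values:
        # One header line may carry several comma-separated Via entries.
        count += sum(1 for part in value.split(",") if part.strip())
    return count


def max_forwards_estimate(received_max_forwards, assumed_initial=70):
    """Estimate based on Max-Forwards; only valid if every proxy decrements by exactly one."""
    return assumed_initial - received_max_forwards


# Request as seen at the far end: the originator plus two proxies added Vias,
# but one proxy capped Max-Forwards, so the received value is 60.
vias = [
    "SIP/2.0/UDP p2.example.com;branch=z9hG4bKb2",
    "SIP/2.0/UDP p1.example.com;branch=z9hG4bKb1, SIP/2.0/UDP ua.example.com;branch=z9hG4bKa0",
]
hops_from_vias = via_hop_count(vias)
estimate_from_mf = max_forwards_estimate(60)
```

  Here the Via count gives 3 while the Max-Forwards arithmetic gives 10.
  Neither survives a B2BUA, which starts a fresh Via set.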

19 The ratio metrics don't define (or convey) the interval that totals
  are taken over. Are these supposed to be "# requests received since
  this instance was manufactured" or "since last reboot" or "since last
  reset of statistics" or something else? What is the implementation
  supposed to report when the denominator of a ratio is 0?
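
  For illustration only, one way a report could carry its interval and
  handle an empty denominator explicitly. The RatioReport type and the
  choice of returning None are assumptions, not something the draft
  specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatioReport:
    """A ratio metric together with the interval it covers (hypothetical type)."""
    interval_start: float  # seconds since the epoch
    interval_end: float
    numerator: int
    denominator: int

    @property
    def ratio(self) -> Optional[float]:
        # With no events in the interval the ratio is undefined; report
        # "no data" rather than 0 or 100 percent.
        if self.denominator == 0:
            return None
        return self.numerator / self.denominator


busy_hour = RatioReport(1_700_000_000.0, 1_700_003_600.0, numerator=3, denominator=4)
idle_hour = RatioReport(1_700_003_600.0, 1_700_007_200.0, numerator=0, denominator=0)
```

  busy_hour.ratio is 0.75 and idle_hour.ratio is None, so a consumer can
  tell "nothing happened" apart from "everything failed".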

20 Please add some discussion motivating why all 300s, 401, 402, and 407
  are treated specially (vs. several other candidate 4xx and 6xx
  responses) in sections like section 4.8. Were other codes considered?
  If so, why were they rejected?

21 Section 4.9 seems to be implying that you can't receive a 500 class
  response to a reINVITE which is not true. If you want this metric to
  only reflect the results of initial INVITEs, more definition will be
  needed.

22 ISA in section 4.10 claims that 408s indicate an overloaded state in
  a downstream element. Overload may induce 408s, but 408s do _not_
  indicate overload. It's possible to receive them just because someone
  is not answering a phone.

23 In section 5, why were these correlation dimensions chosen? Was the
  Request-URI considered? If so, why was it rejected?

24 The treatment of forking in section 6.3 is insufficient. As noted
  earlier, provisional messages establishing early sessions are becoming
  common, and there will be multiple early sessions for a given INVITE
  when there is forking. The recommendation to latch onto the "first"
  200 (or 18x) and ignore the others only marginally works for playing
  media for simple telephony applications - we're seeing phones that
  mix or present multiple lines, and applications that go beyond basic
  phone calls (like file transfer) that make use of all the responses.
  Trying to dodge the complexity as the current section does will lead
  to metrics that don't reflect what the network is doing.

25 I'm a little surprised there is no discussion on privacy,
  particularly on profiling the usage patterns of individuals or
  organizations, in the security considerations section.

26 Nits:
    26.1 What does it mean in section 4.3.1 for the "user" to send the
      first bit of a message? Suggest deleting "or user" from the
      sentence.
    26.2 Section 4.11 has a stale internal pointer to a nonexistent
      section 3.5. I suspect it's trying to point back into 4 somewhere.