[PMOL] Several questions/suggestions from my review of draft-ietf-pmol-sip-perf-metrics-04
Robert Sparks <rjsparks@nostrum.com> Tue, 29 September 2009 18:30 UTC
Message-Id: <72A0E3C5-322B-4FEE-B565-9659581BBFC4@nostrum.com>
From: Robert Sparks <rjsparks@nostrum.com>
To: pmol@ietf.org
Date: Tue, 29 Sep 2009 13:31:50 -0500
Cc: pmol-chairs@tools.ietf.org
Subject: [PMOL] Several questions/suggestions from my review of draft-ietf-pmol-sip-perf-metrics-04

Hi All -

I have several concerns about draft-ietf-pmol-sip-perf-metrics that I would like to discuss. I've asked for a dedicated RAI-review of this document, so there may be additional comments later, but I wanted to get these to you now so we can start working through them.

These comments are more-or-less in document order, with a couple of nits moved to the end. I've numbered them to help split responses into threads later.

When replying to just one of the items below, please change the subject line to indicate what you're replying to.

Thanks,

RjS

--------------------------------------------------------------------------------------------------

1 The document should more carefully describe its scope (and consider changing its title). This document focuses on the use of SIP for simple telephony and relies on measurements in earlier telephony networks for guidance. But telephony is only one use of SIP. These aren't the same metrics that would be most useful for observing a network that was involved primarily in setting up MSRP sessions for file transfer, for instance. An eventual set of generic SIP performance metrics will need to focus on the primitives rather than artifacts from any particular application.

2 That said, I'm skeptical of the utility of many of these metrics even for monitoring systems that are focusing only on delivering basic telephony. Has the group surveyed operators to see what they're measuring, what they're finding useful, and what they're just throwing away? Some additional text motivating why this particular set of metrics was chosen should be provided to help operators/implementers choose which ones they are going to try to use.

3 "Each session is identified by a unique Call-ID" is incorrect. You need at least Call-ID, to-tag, and from-tag here. And to be pedantic, you're describing the SIP dialog, not one of the sessions it manages. The session is what is described by the Session Description Protocol.
The metrics in this draft are derived from signaling events, not session events, and the draft makes assumptions about how those correlate for a simple voice call that may not be true for more advanced uses.

4 The document is inconsistent about whether the metrics will describe any part of an early-dialog/early session. The introduction indicates it won't and focuses on the delivery of a 200 OK, but there are metrics that measure the arrival time of 180s. This should be reconciled. Do take note that early sessions are pervasive in real deployments at this point in time.

5 These metrics are intentionally designed to not measure (or be perturbed by) the hop-by-hop retransmission mechanisms. This should be made explicit. There should also be some discussion of the effect of the end-to-end retransmission of 200 OK/ACK on the metrics based on those messages.

6 The document should consider the effects of the presence or absence of the reliable-provisional extension on its metrics (some of the metrics will be perturbed by a lost 18x that isn't sent reliably).

7 Using T1 and T4 as the timing interval measurement tokens is unfortunate. SIP uses those symbols already to mean something completely different. Is there a reason not to change these and avoid the confusion that the collision will cause?

8 The document uses the terms UAC and UAS incorrectly. It is trying to use them to mean the initiator and recipient of a simple phone call. But the terms are roles scoped to a particular transaction, not to a dialog. When an endpoint sends a BYE request, it is by definition acting as a UAC.

9 The document uses the word "dialog" in a way that's not the same as the formal term with the same name defined in RFC3261 and that will lead to confusion. (A sequence of REGISTER requests and responses, for example, is never part of any dialog. The INVITE/302/ACK messages shown in the call setup flows are not part of any dialog.) Please choose another word or phrase for this draft.
I suggest "message exchange".

10 The third-to-last paragraph of section 4 should be expanded. I think it's unlikely that implementers, especially those with other language backgrounds, will understand the subtlety of the quotes around "final". Enumerating the cases where you want the measurement to span from the request of one transaction to the final response of some other transaction will help. (I'm guessing you were primarily considering redirection, but I suspect you also wanted to capture the additional delay due to Require-based negotiation or 488 not-acceptable-here style re-attempts?). You may also want to consider the effect of the negotiation phase of extensions like session-timer on these metrics.

11 The document assumes that a registration will be DIGEST challenged. That's a common deployment model, but it is not required. If other authentication mechanisms are used (such as SIP Identity), the RRD metric, for example, becomes muddied.

12 In section 4.2, "Subsequent REGISTER retries are identified by the same Call-ID" should say "identified by the same transaction identifier (same topmost Via header field branch parameter value)". Completely different REGISTER transactions from a given registrant are likely to have the same Call-ID.

13 The SRD metric definition in 4.3.1 ignores the effect of forking. Unlike 200 OKs, where receiving multiple 200s in response to a single INVITE only happens if a race is won, it is the _normal_ state of affairs for a UAC to receive provisional responses from multiple branches when a request forks. Deployed systems are increasingly sending 18x responses reliably with an answer, establishing early sessions, so when forking is present it is _highly_ likely that there will be multiple 18x's from different branches arriving at the UA. This section should provide guidance on what to report when this happens.

14 The Failed Session Setup SRD claims to be useful in detecting problems in downstream signaling functions.
Please provide some text or a reference supporting that claim. As written, this metric could be dominated by how long the called user lets his phone ring. Is that what was intended? You might consider separate treatment for 408s and for explicit decline response codes.

15 What was the motivation for making MESSAGE special in section 4.3.3? Why didn't the group instead extend the concept to measuring _any_ non-INVITE transaction (with the possible exception of CANCEL)?

16 In section 4.4, what does it mean to measure the delay in the disconnect of a failed session completion? Without a successful session completion, there can be no BYE. This section also raises the very hard-to-answer question of what to do when BYEs receive failure responses. It would be better to note that the edge case exists and what, if anything, the metric is going to say about it if it happens.

17 Section 4.5 is a particularly strong example of these metrics focusing on the simple telephony application. It may even be falling into the same traps that lead to trying to build fraud-resistant billing based on the time difference between an INVITE and a BYE. Some additional discussion noting that the metric doesn't capture early media, and a recommendation on when to give up on seeing a BYE, would be useful. (Sometimes BYEs don't happen even when there is no malicious intent.)

18 Trying to use Max-Forwards to determine how many hops a request took is going to produce incorrect results in any but the most simple of network deployments (I would have expected this to be based on counting Vias with a note pointing to the discussion on the problems B2BUAs introduce). Proxies can reduce Max-Forwards by more than one. There are many implementations in the wild that cap Max-Forwards. If this metric remains as defined, you should also point out that neither endpoint can calculate it. Some third entity will have to collect information from each end to make this calculation.
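
As an illustration of the Via-counting alternative mentioned in item 18, here is a minimal sketch. The helper name `count_via_hops` and the sample request are made up for this example; the code assumes CRLF line endings, no folded header lines, and no commas inside a single Via value, and it does nothing about the B2BUA problem noted above.

```python
def count_via_hops(raw_message: str) -> int:
    """Count Via header field values in a raw SIP message.

    Hypothetical helper: assumes CRLF line endings, no folded header lines,
    and no commas inside a single Via value.
    """
    head = raw_message.split("\r\n\r\n", 1)[0]
    count = 0
    for line in head.split("\r\n")[1:]:  # skip the request/status line
        name, sep, value = line.partition(":")
        # "v" is the compact form of the Via header field name.
        if sep and name.strip().lower() in ("via", "v"):
            count += sum(1 for item in value.split(",") if item.strip())
    return count


# Made-up request as seen by the final recipient.
EXAMPLE_REQUEST = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP p2.example.com;branch=z9hG4bK2, "
    "SIP/2.0/UDP p1.example.com;branch=z9hG4bK1\r\n"
    "v: SIP/2.0/UDP ua.example.com;branch=z9hG4bK0\r\n"
    "Max-Forwards: 60\r\n"
    "\r\n"
)

via_count = count_via_hops(EXAMPLE_REQUEST)
```

In this made-up request the Via count is 3, whereas subtracting the received Max-Forwards from an assumed initial value of 70 would suggest 10 hops, which is the kind of mismatch that a capping or over-decrementing proxy can produce.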

19 The ratio metrics don't define (or convey) the interval that totals are taken over. Are these supposed to be "# requests received since this instance was manufactured" or "since last reboot" or "since last reset of statistics" or something else? What is the implementation supposed to report when the denominator of a ratio is 0?

20 Please add some discussion motivating why all 300s, 401, 402, and 407 are treated specially (vs. several other candidate 4xx and 6xx responses) in sections like section 4.8. Were other codes considered? If so, why were they rejected?

21 Section 4.9 seems to be implying that you can't receive a 500-class response to a re-INVITE, which is not true. If you want this metric to only reflect the results of initial INVITEs, more definition will be needed.

22 ISA in section 4.10 claims that 408s indicate an overloaded state in a downstream element. Overload may induce 408s, but 408s do _not_ indicate overload. It's possible to receive them just because someone is not answering a phone.

23 In section 5, why were these correlation dimensions chosen? Was the Request-URI considered? If so, why was it rejected?

24 The treatment of forking in section 6.3 is insufficient. As noted earlier, provisional messages establishing early sessions are becoming common, and there will be multiple early sessions for a given INVITE when there is forking. The recommendation to latch onto the "first" 200 (or 18x) and ignore the others only marginally works for playing media for simple telephony applications - we're seeing phones that mix or present multiple lines, and applications that go beyond basic phone calls (like file transfer) that make use of all the responses. Trying to dodge the complexity as the current section does will lead to metrics that don't reflect what the network is doing.

25 I'm a little surprised there is no discussion on privacy, particularly on profiling the usage patterns of individuals or organizations, in the security considerations section.
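
Returning to item 19, the sketch below shows one possible way a report could carry its measurement interval and handle a zero denominator. The `RatioReport` type and its field names are hypothetical, and returning `None` for "undefined" is only one option that the draft could choose to specify; nothing here comes from the draft itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RatioReport:
    """Hypothetical ratio report that states the interval its totals cover."""

    interval_start: datetime
    interval_end: datetime
    numerator: int
    denominator: int

    def percentage(self) -> Optional[float]:
        # Assumption: an empty interval is reported as "undefined" (None)
        # rather than as 0% or 100%.
        if self.denominator == 0:
            return None
        return 100.0 * self.numerator / self.denominator


start = datetime(2009, 9, 29, 18, 0, tzinfo=timezone.utc)
end = datetime(2009, 9, 29, 19, 0, tzinfo=timezone.utc)

idle_hour = RatioReport(start, end, numerator=0, denominator=0)
busy_hour = RatioReport(start, end, numerator=1, denominator=2)
```

With this shape, a consumer can tell an interval with no requests apart from one in which every request failed.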

26 Nits:

26.1 What does it mean in section 4.3.1 for the "user" to send the first bit of a message? Suggest deleting "or user" from the sentence.

26.2 Section 4.11 has a stale internal pointer to a nonexistent section 3.5. I suspect it's trying to point back into section 4 somewhere.
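
To make the point in item 3 concrete, here is a minimal sketch of a dialog identifier built from Call-ID plus both tags, as RFC 3261 defines it. The `DialogId` type, the `dialog_id_at_uac` helper, and the sample values are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class DialogId(NamedTuple):
    """Hypothetical dialog key: Call-ID plus local and remote tags."""

    call_id: str
    local_tag: str
    remote_tag: str


def dialog_id_at_uac(call_id: str, from_tag: str, to_tag: str) -> DialogId:
    # At the UAC, the local tag is the From tag and the remote tag is the To tag.
    return DialogId(call_id, from_tag, to_tag)


# One forked INVITE answered by two branches: same Call-ID and From tag,
# different To tags, hence two distinct dialogs.
branch_a = dialog_id_at_uac("a84b4c76e66710@ua.example.com", "1928301774", "a6c85cf")
branch_b = dialog_id_at_uac("a84b4c76e66710@ua.example.com", "1928301774", "9fxced76sl")
```

Keying measurements on Call-ID alone would merge these two dialogs into one record, while the three-part key keeps them separate.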