
Re: [bmwg] WGLC: draft-ietf-bmwg-protection term-06 and meth-05



Jay,

see below.

> -----Original Message-----
> From: Jay Karthik (jakarthi) [mailto:jakarthi at cisco.com] 
> Sent: 04 August 2009 19:33
> To: Kris Michielsen (kmichiel); Al Morton; bmwg at ietf.org
> Subject: RE: [bmwg] WGLC: draft-ietf-bmwg-protection term-06 
> and meth-05
> 
> Hi Kris,
> 
> Thanks for your detailed review. I have responded to your 
> comments on the methodology draft here. We will get back with 
> our response to your comments on the terminology draft.
> 
> Please see inline.
> 
> Cheers,
> Jay
> 
> -----Original Message-----
> From: bmwg-bounces at ietf.org [mailto:bmwg-bounces at ietf.org] On 
> Behalf Of Kris Michielsen (kmichiel)
> Sent: Monday, August 03, 2009 9:56 AM
> To: 'Al Morton'; bmwg at ietf.org
> Subject: Re: [bmwg] WGLC: draft-ietf-bmwg-protection term-06 
> and meth-05
> 
> <Snip>
> 
> comments on meth-05:
> 
> * 5.1 I don't think shutdown is a good method to simulate a 
> failure since the actions taken by a shutdown are very much 
> implementation dependent. Shutdown should only be considered 
> as an administrative action.
> 
> Jay: Agreed that the shutdown is very much implementation 
> dependent and hence the reason for including this as failover 
> trigger event. As you can see this is 1 of the failure events 
> and we have listed a few other events as well.
> 

If the shutdown is implementation dependent, how can we compare benchmarks of different vendors' implementations?
Take "Interface shutdown on remote side with POS Alarm" as an example. Implementation "A" immediately stops receiving traffic
after an interface shutdown and sends AIS. Implementation "B" only sends AIS after the interface shutdown and keeps forwarding
traffic coming in from the shutdown interface. Implementation "B" is also very slow in triggering FRR after receiving AIS (e.g.
~200ms), while implementation "A" reacts swiftly to AIS (e.g. ~10ms). The benchmark test will show some packet loss for
implementation "A" (e.g. ~10ms) and zero packet loss for implementation "B", although for a _real_ failure implementation "B" may
see more packet loss (e.g. ~200ms).
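
To make the A/B example above concrete, here is a minimal sketch in Python. It assumes a toy model that is not in the draft: packets are lost only from the moment the link stops carrying traffic until FRR switches over, and nothing is lost while the link keeps forwarding. The function name and the dictionary layout are hypothetical; the ~10ms and ~200ms figures are the example values used above.

```python
def observed_loss_ms(reaction_ms, traffic_stops_at_trigger):
    """Loss duration seen by the tester under the toy model.

    Assumption: loss lasts for the FRR reaction time if traffic stops at
    the trigger, and is zero if the link keeps forwarding after the trigger.
    """
    return reaction_ms if traffic_stops_at_trigger else 0


# name -> (reaction time to AIS in ms, does a remote shutdown stop traffic at once?)
implementations = {"A": (10, True), "B": (200, False)}

results = {}
for name, (reaction_ms, shutdown_stops_traffic) in implementations.items():
    results[name] = {
        # Failover triggered by an administrative shutdown on the remote side.
        "shutdown": observed_loss_ms(reaction_ms, shutdown_stops_traffic),
        # A real failure always interrupts traffic at the trigger.
        "real_failure": observed_loss_ms(reaction_ms, True),
    }
    print(name, results[name])
```

Under this model the shutdown-triggered test ranks "B" ahead of "A", while a real failure ranks them the other way round.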

To test the reaction to different kinds of layer 1 failure indications, some additional equipment (the same as used in real networks)
is required to inject these failure indications into the DUTs. This can happen with or without traffic loss, just as in reality.
Higher layer failures may also cause immediate traffic loss or not. To be able to compare benchmarks, the characteristics of a
Failover event need to be the same, or at least documented/reported.

An interface shutdown is not to be considered as a real "failure", but merely as an administrative action. Measuring packet loss
following such an administrative action is useful, but a user/tester is free to decide how to take an interface administratively out
of service (which can be more complicated than just doing a "shut" on an interface).

> * 5.2 Can't the remote fault indication in the ethernet 
> auto-negotiation scheme be considered as a type of link 
> failure indicator for directly connected devices? In that 
> configuration it doesn't need to rely on
> layer3 failure detection.
> 
> Jay: Sure - we shall modify the text with your suggestion 
> explicitly for directly connected device.
> 
> * 5.6 Shouldn't reoptimization (signal a new LSP while the 
> failure still
> persists) be added to this section?
> 
> Jay - The authors discussed this and based on the comments we 
> received earlier, we have mentioned the make-before-break 
> scenario towards the end of section 8. Would you prefer that 
> we include this in section 5.6 as well ?

It needs to be explained that a reoptimization can take place while the failure has not yet been recovered, but that it does not
have to take place; this also depends on the DUT configuration. Since it doesn't fit the title/subject of 5.6, it may be better
to put it in a new section before 5.6.
The Reporting Format needs to indicate whether this reoptimization takes place or not.

> 
> * 5.7 If the failover is prefix/LSP dependent, choosing only 3 
> routes to measure is not a good idea. In that case as many 
> prefixes as possible (limited by required accuracy and 
> Throughput) are needed. Maybe a scheme to randomly select as 
> many prefixes as possible from the total number could be 
> considered. One can also first test if the failover is 
> prefix/LSP independent and make use of that characteristic to 
> reduce the number of traffic destinations that need to be 
> measured. If there is a dependency on the number of prefixes, 
> the number of LSPs, ... then how is this reflected in the 
> measurements? Is only the fastest failover time important, or 
> the slowest, or the average?
> 
> Jay - Kris, we have not mandated 3 routes/streams. You 
> probably picked the 3 routes from the example we had 
> furnished. We also mention at the beginning of this section 
> that more streams could be used as long as the flow is steady 
> across all the traffic streams/routes.

Guidelines are needed on how many and which streams need to be used when benchmarking. This is required to avoid misinterpretation
of the benchmark metrics. If one picks only a handful of stream destinations out of a total of e.g. 5000, then in a prefix or tunnel
dependent implementation one cannot know what is being measured (unless maybe when using white-box information). Are those traffic
streams going to the prefixes and/or tunnels that fail over first, or last, or somewhere in between?
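
One possible shape for such a guideline is random selection, as suggested in the original 5.7 comment. The sketch below is only an illustration in Python: the function name, the `max_streams` limit (standing in for whatever the required accuracy and Throughput allow) and the generated prefix list are all hypothetical.

```python
import random


def pick_measured_prefixes(all_prefixes, max_streams, seed=None):
    """Randomly choose which prefixes receive a measured traffic stream.

    Assumption: max_streams is the largest number of streams the tester can
    offer at the required accuracy. A fixed seed makes the selection
    reproducible so that it can be reported with the results.
    """
    rng = random.Random(seed)
    count = min(max_streams, len(all_prefixes))
    # random.Random.sample draws without replacement, so no prefix repeats.
    return rng.sample(list(all_prefixes), count)


# Hypothetical table of 5000 prefixes, matching the "e.g. 5000" above.
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]
measured = pick_measured_prefixes(prefixes, max_streams=500, seed=1)
print(len(measured), "of", len(prefixes), "prefixes selected")
```

A random draw does not remove the prefix/tunnel dependency, but it avoids systematically hitting only the first or the last prefixes to fail over.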

> 
> * 5.8 When tunnels are established from the Tester, it should 
> be capable to resignal/reoptimize LSPs following the receipt 
> of a path error from the PLR, just as real devices do
> 
> Jay - Sure but we did not see much value in including 
> benchmarking scenarios that would validate the Tester as 
> opposed to the DUT. Btw, most if not all of the testers we 
> know do not support reoptimization as of now.

Then why is there a need for Tester capability 1. in section 5.8?
While we're at it, are Tester capabilities 2. and 5. really needed? If so, when?

> 
> * 6 It is confusing to mark R1 as HE and R2 as MID while 
> their actual roles may vary depending on the testcase
> 
> Jay - Not sure where the confusion arises from. We believe 
> all the different scenarios could be tested by positioning R1 
> as HE and R2 as MID nodes

I meant the indications in the topology figures, such as Figure 2. below (indicated with =><=).

#             -------    -------- PRI  --------  
#            |  R1   |  |   R2   |    |   R3   | 
#         TG-|=>HE<= |--|=>MID<= |----|    TE  |-TA 
#            |       |  |  PLR   |----|        | 
#             -------    -------- BKP  -------- 
#      
#                          Figure 2. 

Compare this with e.g. testcase section 7.2, where R2 is the Headend. This does not match the "HE" and "MID" labels in Figure 2 and leads to confusion.

# 7.2. Headend PLR with Link Failure  
...
#      1. Establish the primary LSP on R2 required by the topology 
#         selected. 
#      2. Establish the backup LSP on R2 required by the selected 
#         topology. 

> 
> * 8 "Packet Size" is it layer2, layer3 ? "Forwarding rate" 
> should be "Offered Load"?
> 
> Jay - Packet Size is w.r.t layer3. Forwarding Rate is used in 
> "IGP dataplane convergence term" documents and we are 
> consistent with that.

Then the draft needs to indicate that it is the layer 3 packet size.
Forwarding Rate is not the correct term to use here. It should be Offered Load. Forwarding Rate varies during the test (e.g. it drops
to zero when pulling the cable), while Offered Load is constant/predefined.

> 
> * 8 BGP routes, is this only applicable for VPN PEs?
> 
> Jay - We meant BGP routes from global table, just like we 
> mentioned IGP routes. Please note, we have explicitly listed 
> VPN routes as well 

In the IGP dataplane convergence draft, all other protocols are banned from the DUT. If you permit BGP on the DUT, many more
things need to be specified. How many BGP nexthops? How many BGP prefixes are routed on tunnel(s)? ...

> 
> * 8 What is a "FRR tunnel", the primary or the backup? I 
> think it's important to indicate the scale of what is 
> protected, besides the global network scale. The Benchmarks 
> need to indicate how they are measured, which method.
> 
> Jay - "FRR tunnel" as in "Protected tunnel". We have left the 
> decision making on which method of the PBLM, TBLM and TBM to 
> the individual measuring the benchmarks.

Then it's better to indicate it as "Protected tunnel" in the draft.

It needs to be reported which method is being used to derive the benchmark metrics.

> 
> * 8, p 24, bullet 1. s/Rate/Loss/
> 
> Jay - Not sure what you are referring to. I looked for this 
> pattern and don't see any.

This is the bullet I refer to.

#      1. Packet-Based Loss method (PBLM): (Number of packets 
#        dropped/packets per second * 1000) milliseconds. This method 
#        could also be referred as Rate Derived method. 
                                   ^^^^^^^^^^^^--> Loss-Derived

I just noticed that the PBLM is called Packet-Loss Based Method (PLBM) in term-06.
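
For reference, the formula in the bullet quoted above can be written out as a short Python sketch. It takes "packets per second" to be the constant Offered Load, which is an assumption in line with the Offered Load comment earlier in this mail; the function name and the numbers in the usage line are illustrative only and not measured values.

```python
def pblm_failover_ms(packets_dropped, offered_load_pps):
    """Failover time in milliseconds, derived from packet loss.

    Implements (packets dropped / packets per second) * 1000 from the quoted
    bullet. Assumption: offered_load_pps is the constant rate at which the
    tester offers traffic for the whole duration of the test.
    """
    if offered_load_pps <= 0:
        raise ValueError("offered load must be a positive packet rate")
    return packets_dropped / offered_load_pps * 1000


# Illustrative numbers: 2000 packets lost at an offered load of 10000 pps.
example_ms = pblm_failover_ms(packets_dropped=2000, offered_load_pps=10000)
print(f"{example_ms:.1f} ms")
```

The result is only meaningful if the rate in the denominator stays fixed during the test, which is why the loss is divided by the Offered Load and not by a rate that changes during the failure.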

> 
> * 8 The note at the end of the section should be more 
> elaborated on (and may have implications throughout the document).
> 
> Jay - We can elaborate on that.

Thanks,
Kris

> 
> <Snip>
> 
> Thanks,
> Jay
>