
Re: [ippm] Testing TCP Throughput Capacity in Operator Networks



Hi Matt,

I understand your concerns with relation to TCP and you undoubtedly have
vast experience in detailed TCP implementations and studies.  I have
read several of your papers and they are all excellent.

Let me try to clearly lay out the problem statement and then determine
if it is worthy of a -00 draft.  I tried to keep this email concise;
sorry if it is a little long-winded.

In my work with network providers and network equipment manufacturers
(NEMs), there is a growing realization that RFC-2544 packet level
testing on today's networks is not adequate.  This community wants to
measure network throughput performance at the TCP layer, which gives a
much more meaningful indication of whether the network can meet the end
user's application SLA (and ultimately to reach some level of TCP
testing interoperability, which does not exist today).  As the
complexity of the network grows, the various queuing mechanisms in the
network greatly affect TCP layer performance (e.g. improper default
router settings for queuing), and devices such as firewalls, proxies,
and load-balancers can actively alter the TCP settings (such as window
size and MSS) as a TCP session traverses the network.  These are all
very complex topics for the general network community, and I can't
overemphasize the desire for a standard test methodology/guideline at
the TCP layer.

So the intent behind this draft TCP Throughput work would be to define a
methodology for testing TCP layer performance, and guidelines for
expected TCP throughput results that *should* be experienced in the
network under test.  Network providers and NEMs are wrestling with
end-end complexities of the above (queuing, active proxy devices, etc.);
they desire to standardize the methodology to validate end-end TCP
performance, as this is the precursor to acceptable end-user application
performance.

Before RFC-2544 testing existed, network providers and NEMs deployed a
variety of ad hoc test techniques to verify the Layer2/3 performance of
the network.  RFC-2544 was a huge step forward in the network test
world, standardizing the Layer 2/3 test methodology which greatly
improved the quality of the network (and reduced operational test
expenses).

In the case of TCP, several network providers that I work with employ
the following TCP testing methodology (high level simplified example):

1. Run tests to determine average end-end latency and end-end bottleneck
bandwidth (the minimum bandwidth may be known by design).
2. Calculate bandwidth delay product (BDP) for this situation then run a
series of window size experiments with TCP hosts on each end of the
network.  The network under test should be able to support the full TCP
capacity unless there are packet loss conditions.  If the expected
throughput is not achieved and retransmissions are not the cause, this
may point to devices in the network that are altering the window size.
3. Run multiple connection tests (e.g. 32 connections) to over-utilize
the link.  Proper queuing techniques in the network should allow the TCP
connections to maintain their "fair share" of the available bandwidth.
Improper queuing (which is very common) would show up as bandwidth
volatility between the connections.  This is a very valuable method as
well.
4. Another example is running either single or multiple TCP
connections with a variety of MSS values (similar to varying the frame
size in RFC-2544).
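
As a rough illustration of step 2 above, the sketch below computes the
BDP and the loss-free throughput ceiling for a series of window sizes.
The link speed, RTT, window sizes, and function names are hypothetical
values chosen for the example; they do not come from any provider's
actual test plan.

```python
def bdp_bytes(bottleneck_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes for a given bottleneck and RTT."""
    return bottleneck_bps * rtt_s / 8.0


def window_limited_bps(window_bytes: float, rtt_s: float,
                       bottleneck_bps: float) -> float:
    """Loss-free upper bound on TCP throughput in bits per second.

    A sender can have at most one window in flight per round trip, so
    the rate is window/RTT, capped by the bottleneck bandwidth.
    """
    return min(window_bytes * 8.0 / rtt_s, bottleneck_bps)


if __name__ == "__main__":
    link_bps = 100e6   # hypothetical 100 Mbit/s bottleneck
    rtt = 0.020        # hypothetical 20 ms average end-end RTT

    print(f"BDP: {bdp_bytes(link_bps, rtt):.0f} bytes")

    # Window size sweep: expected ceiling for each configured window.
    for window in (16 * 1024, 64 * 1024, 128 * 1024, 256 * 1024, 512 * 1024):
        ceiling = window_limited_bps(window, rtt, link_bps)
        print(f"window {window:>7d} bytes -> at most {ceiling / 1e6:6.1f} Mbit/s")
```

A measured result well below the ceiling for a given window, with no
retransmissions, is the case where a middlebox altering the window size
would be suspected.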

Right now there is no established methodology for any of the above tests
in the network provider / NEM space.  So even taking a first step toward
standardizing the test methodology at the TCP layer would be a giant
step forward.  As a matter of fact, this paper would probably be
co-authored by me and at least one network provider and/or NEM.

So again, sorry for the long-winded response, but I hope that this
provides better insight into the intent of such a draft.

Thanks,
Barry

 

Principal Member of Technical Staff
JDSU Communication Test (formerly Acterna)
Emerging Markets and Technology Research
One Milestone Center Court
Germantown, MD 20876
(W) 240-404-2227
(C) 301-325-7069


-----Original Message-----
From: Matt Mathis [mailto:matt.mathis at gmail.com] 
Sent: Thursday, October 15, 2009 2:52 PM
To: Henk Uijterwaal
Cc: Lars Eggert; Barry Constantine; Matthew J Zekauskas; ippm at ietf.org
Subject: Re: [ippm] Testing TCP Throughput Capacity in Operator Networks

>>> My thoughts are similar to Henk's.  IPPM has always been interested in
>>> the area (see the bulk transport capacity work), so I'd claim the area
>>> is in-scope, but doing work would require AD discussion.
>>
>> I think it wouldn't be unreasonable for IPPM to take this on, if we more
>> clearly nail down what "this" is.

You might want to consider finishing the work started in RFCs 2330 and
3148 as a sufficient definition of "this".

Developing a Bulk Transport Metric has been sort of a holy grail of
IPPM from the very beginning.  It was my primary agenda when I
co-chaired the BOF at the Danvers IETF (April 1995).  After
publishing a tool [TReno] and making some important progress [RFC2330,
RFC3148, and Mark Allman's [CAP]], the effort was abandoned essentially
because the results were not sufficiently robust to be useful to
support SLAs.  [NPAD] was the latest chapter of my efforts in this
area.

Although there are many difficulties with using TCP (notably
differences in implementations, as noted in other messages), the real
killer is this:

TCP congestion control is an equilibrium process.  Correctly
functioning TCP with sufficient data ALWAYS causes congestion
somewhere in the network.  This congestion manifests itself by raising
the RTT and/or the packet loss until the data rate is consistent with
the model [MODEL]:

    data_rate = (MSS/RTT) * (0.7/sqrt(loss_probability))

The problem is you
can't tell how much of the congestion was caused by your measurement,
and how much was already present in the network.   This causes sort of
a double Heisenberg problem where you can't even tell if you are
measuring bowling balls with pingpong balls or vice versa.  This is
because the "stiffness" of a TCP flow (think "first derivative of the
model") depends on the RTT, so when there are multiple flows sharing a
bottleneck, each flow yields a different amount, depending on their
relative RTTs.   If the measurement flow has a very short RTT, it will
push nearly all other traffic out of the bottleneck; if the cross
traffic has a short RTT, it will greatly depress the data rate of the
measured flow.  As a consequence, a simple measurement with a single
TCP flow is useless for predicting performance of another flow through
the same bottleneck.   Note that a non-predictive measurement is
useless to the point where it can't really be called a metric.
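
To make the RTT dependence concrete, here is a minimal sketch that
simply evaluates the model quoted above for two flows seeing the same
loss probability.  The MSS, RTT, and loss values and the function name
are hypothetical, and the result is scaled to bits per second on the
assumption that MSS is given in bytes.  It does not simulate competing
flows; it only shows how the equilibrium rate scales with 1/RTT.

```python
import math


def model_rate_bps(mss_bytes: float, rtt_s: float,
                   loss_probability: float, c: float = 0.7) -> float:
    """Equilibrium rate from data_rate = (MSS/RTT) * (c/sqrt(p)).

    Returns bits per second, assuming MSS is expressed in bytes.
    The constant c defaults to the 0.7 used in the text.
    """
    if not 0.0 < loss_probability <= 1.0:
        raise ValueError("loss_probability must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8.0 / rtt_s) * (c / math.sqrt(loss_probability))


if __name__ == "__main__":
    mss = 1460     # hypothetical MSS in bytes
    loss = 1e-4    # hypothetical loss probability seen by both flows

    # Same bottleneck loss, different RTTs: the short-RTT flow's
    # equilibrium rate is higher in proportion to the RTT ratio.
    for rtt in (0.010, 0.100):
        rate = model_rate_bps(mss, rtt, loss)
        print(f"RTT {rtt * 1000:5.0f} ms -> {rate / 1e6:7.2f} Mbit/s")
```

With these inputs the two flows differ by the ratio of their RTTs,
which is why a measurement taken with one RTT says little about a flow
with another.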

THIS IS REALLY IMPORTANT: even with an ideal TCP implementation and
network, equilibrium TCP performance is useless as a metric.

The trick (used by NPAD) is to throttle TCP with a controlled
bottleneck, such that it does not cause congestion....

This can be done using an RFC 4898 instrumented stack under real
applications.   I will describe it in email in a bit, and in a
presentation at IPPM, if people are interested.    No, I do not have
enough cycles to generate an ID in time for the submission deadline.

Thanks,
--MM--
-------------------------------------------
Matt Mathis      http://staff.psc.edu/mathis
Work:412.268.3319   Home/Cell:412.654.7529
-------------------------------------------
Evil is defined by mortals who think they know
"The Truth" and use force to apply it to others.