From nobody Mon Apr 1 08:28:26 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 5680A120167 for ; Mon, 1 Apr 2019 08:28:24 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.851 X-Spam-Level: X-Spam-Status: No, score=-1.851 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=0.85, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id sHLWKhvIp39L for ; Mon, 1 Apr 2019 08:28:22 -0700 (PDT) Received: from mx0a-00273201.pphosted.com (mx0a-00273201.pphosted.com [208.84.65.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 96D4A120052 for ; Mon, 1 Apr 2019 08:28:22 -0700 (PDT) Received: from pps.filterd (m0108159.ppops.net [127.0.0.1]) by mx0a-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x31FJjdQ029370 for ; Mon, 1 Apr 2019 08:28:21 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : subject : date : message-id : content-type : mime-version; s=PPS1017; bh=tJUmiyVKyoPvlV/PZTmpK5QVawBJ4JoXk+QDOPJ4MQI=; b=ZCm92yjqpNmjM6mGp2I2M1y08T5XVEr8tN4OAkarU0T8FHykQ9wTv9YzS9Wvrz9lvJRp 3duLzVtqKE1w47gd2Pywd8et/ea3flVi+/6ZzdhQfPNEuqIPZqO8rXwfKhXdVFzf/0e9 CzuLHxxca7htKpnYpNw1ylbRpRskvcwVmJ9Ahcmqbdq2y6ektXv3eClUsPHoyrlC7UQA llt/OE+KCxQtLOtpQI3LY+OI0addEfEp5byPP34noW7L7I7go6ZjTaYLLubD7dDLs84m E2Ppohi8xCrcJCZH6rWyov2Nz6usD/kNb23Ak3QoE3seumwy0ehuxzeYe6hHfYQpx3Z6 9w== Received: from nam05-co1-obe.outbound.protection.outlook.com 
(mail-co1nam05lp2050.outbound.protection.outlook.com [104.47.48.50]) by mx0a-00273201.pphosted.com with ESMTP id 2rkeb90t7m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Mon, 01 Apr 2019 08:28:21 -0700 Received: from MWHPR05MB3279.namprd05.prod.outlook.com (10.173.230.18) by MWHPR05MB3040.namprd05.prod.outlook.com (10.168.246.146) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1771.6; Mon, 1 Apr 2019 15:28:20 +0000 Received: from MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202]) by MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202%9]) with mapi id 15.20.1771.007; Mon, 1 Apr 2019 15:28:20 +0000 From: Antoni Przygienda To: "rift@ietf.org" Thread-Topic: Thu core meet ... Thread-Index: AQHU6J9rJS+EP5Ypx0G7jNcfOUra5g== Date: Mon, 1 Apr 2019 15:28:20 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.129.239.12] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 55f806c4-27b2-4837-0bb2-08d6b6b6abc7 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(5600139)(711020)(4605104)(4618075)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(2017052603328)(7153060)(7193020); SRVR:MWHPR05MB3040; x-ms-traffictypediagnostic: MWHPR05MB3040: x-microsoft-antispam-prvs: x-forefront-prvs: 0994F5E0C5 x-forefront-antispam-report: SFV:NSPM; 
SFS:(10019020)(366004)(396003)(136003)(39860400002)(346002)(376002)(199004)(189003)(105586002)(256004)(6116002)(3846002)(4743002)(66066001)(68736007)(52536014)(14454004)(2501003)(186003)(26005)(486006)(6436002)(5660300002)(6916009)(71190400001)(105004)(74316002)(81166006)(71200400001)(25786009)(7736002)(476003)(19627405001)(6506007)(2351001)(9686003)(558084003)(8936002)(7696005)(99286004)(53936002)(7116003)(478600001)(81156014)(1730700003)(8676002)(106356001)(54896002)(5640700003)(33656002)(55016002)(316002)(102836004)(2906002)(3480700005)(97736004)(86362001); DIR:OUT; SFP:1102; SCL:1; SRVR:MWHPR05MB3040; H:MWHPR05MB3279.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; A:1; MX:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: LjdDFNgJjipm0LSlU/0zmjEF608pX8hgx6Iclh03km9BmovkrX1OIZlVFTlxLK2lIGBWygR4XMgLhccCnVX44DZN/aaUs1LfzMbkYFsqXXTeAlqDJxj6QqJRNuzxG1rSsx38LncKQDQo9+pHvIygXx8Oq+3GK8/3aW+qvyM2eq5k0X/Ix3AUjDocUPfYRekvNQ2C5/hNNQ3mYdzUMSNQMqfFpkiEnu3EfKtjQ6ruKDv1okC1nYbptUlEtN1CzQprTZncjuLgdoeruTYZl/KMt9KdtKtl5tiENIpeLg9OgS8DNWTTv4NvWlTMK9mZlrwjzNHk50rgL1PplQr7EgmSTyHgWCfntAI6OC58Ta9V8Q1vBsLNcdc26q4j7J2bo9mQZpQksoE1Du5e0MGvKGdoPbkSDb5t8GDhuA2P2aYNDPY= Content-Type: multipart/alternative; boundary="_000_MWHPR05MB32799941B4617A11CA645245AC550MWHPR05MB3279namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 55f806c4-27b2-4837-0bb2-08d6b6b6abc7 X-MS-Exchange-CrossTenant-originalarrivaltime: 01 Apr 2019 15:28:20.1255 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR05MB3040 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-01_05:, , signatures=0 X-Proofpoint-Spam-Details: 
rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=692 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904010103 Archived-At: Subject: [Rift] Thu core meet ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Apr 2019 15:28:24 -0000 --_000_MWHPR05MB32799941B4617A11CA645245AC550MWHPR05MB3279namp_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? ... --- tony --_000_MWHPR05MB32799941B4617A11CA645245AC550MWHPR05MB3279namp_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? ...

--- tony
--_000_MWHPR05MB32799941B4617A11CA645245AC550MWHPR05MB3279namp_-- From nobody Mon Apr 1 09:10:15 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id D31431202C0 for ; Mon, 1 Apr 2019 09:10:13 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.998 X-Spam-Level: X-Spam-Status: No, score=-1.998 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 1j-bOMei_5xV for ; Mon, 1 Apr 2019 09:10:10 -0700 (PDT) Received: from mail-qt1-x82b.google.com (mail-qt1-x82b.google.com [IPv6:2607:f8b0:4864:20::82b]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 04D131202C6 for ; Mon, 1 Apr 2019 09:10:10 -0700 (PDT) Received: by mail-qt1-x82b.google.com with SMTP id k14so11544832qtb.0 for ; Mon, 01 Apr 2019 09:10:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=8oe3RWKypqLlZT7bkXyssSN76cq+WID4k7ZLBo1HyZI=; b=HrRqXtG9r+jHRFYca2S1FlZeFBlK0rwLyhoiF00u1uBGNibHvV8J6IEBGwXkVIDvJ5 H3C8UCT8PZCbfb0A7Hm/3orR46fHFPJU9h287PqmjZMyV4R8/odV2Wnqf8fldi6ZZgHK 9wrtRzWKzjuVu/ItzwZxU6ahASOt/sAZvV1H0fNCGwZveH9OxJkZHqP7UC0IqFrhVbDU 15AsM7v/NP6TDAMxBYHmPZrpiy3Qen4DJZGG5+OFHl+RwfqRT50/UMHurL+siWuajINA 54TtwC0eTYlh/EVCiNC+haxSIn4p2K0MMK4ntTops9woh3Ojh2wcdYkWVhsBanNJrfdU 5cBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=8oe3RWKypqLlZT7bkXyssSN76cq+WID4k7ZLBo1HyZI=; b=p8fuaq5AjFjwQCUv8FjjcawHLCJwo1TdD+r3pKUtKekpvHrKGP597aBYtZWJ5HrveE W2ZtspkcWm+x9JtKfPzhKchz//M3y53olwtzO501sZYFK+oG9Pe6Ep9rjHob14JsnoMt wbAW1EOR9jGL/9+2xY5IPt94R5aPx/5aSie0mHeMcqTCU4IFY4Qi1OTItHmnD8QMM4YN LxoZLQy80QgnTOGwNXsXjy/lhOVXvlmyCNLyslqhDRNkxd1P1amLDRt6zfuT/86CZKsh 6l0ZXjo59lLPGY3ge8oX7bKalF1T176iTyI6vRB3SXQ4OIuibUsSBcR8y3UFRfPGQcuw w1rQ== X-Gm-Message-State: APjAAAWMpxe7yjh/+iY+t4Z5DzBzbb+cmmFw/ikAJ3IIODsl2K5k8kQ+ gq25AqmOm6CkjPG2EyCtFVAt7RZ3 X-Google-Smtp-Source: APXvYqyXajQC6dFF6wjth3WOM5FgbE4sgSFNewk9X5cd5T4tKnZDQcsydtgeU0kFWHf8UvCVEbIDtQ== X-Received: by 2002:ac8:2cd1:: with SMTP id 17mr53507049qtx.299.1554135008573; Mon, 01 Apr 2019 09:10:08 -0700 (PDT) Received: from [192.168.20.114] ([200.63.168.130]) by smtp.gmail.com with ESMTPSA id x201sm5592244qkb.92.2019.04.01.09.10.06 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 01 Apr 2019 09:10:07 -0700 (PDT) From: Bruno Rijsman Message-Id: <89F9D5D5-D1D0-4383-9832-2163348B6F34@gmail.com> Content-Type: multipart/alternative; boundary="Apple-Mail=_583E9E56-7F9E-4A84-A065-F3F2D14BB606" Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\)) Date: Mon, 1 Apr 2019 13:10:03 -0300 In-Reply-To: Cc: "rift@ietf.org" To: Antoni Przygienda References: X-Mailer: Apple Mail (2.3445.9.1) Archived-At: Subject: Re: [Rift] Thu core meet ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Apr 2019 16:10:14 -0000 --Apple-Mail=_583E9E56-7F9E-4A84-A065-F3F2D14BB606 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 I have not done any work on the envelope in RIFT-Python yet. 
Right now I am looking into the port of RIFT-Python to Free Range Routing (FRR), in anticipation of an expected go-ahead on that project. Most likely I will not attend the Thursday call on 4 April - I am off to the mountains again for a couple of weeks starting 4 April. - Bruno > On Apr 1, 2019, at 12:28 PM, Antoni Przygienda wrote: > > I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? .... > > --- tony > _______________________________________________ > RIFT mailing list > RIFT@ietf.org > https://www.ietf.org/mailman/listinfo/rift --Apple-Mail=_583E9E56-7F9E-4A84-A065-F3F2D14BB606 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 I have not done any work on the envelope in RIFT-Python yet.

Right now I am looking into the port of RIFT-Python to Free Range Routing (FRR), in anticipation of an expected go-ahead on that project.

Most likely I will not attend the Thursday call on 4 April - I am off to the mountains again for a couple of weeks starting 4 April.

- Bruno

On Apr 1, 2019, at 12:28 PM, Antoni Przygienda <prz=40juniper.net@dmarc.ietf.org> wrote:

I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? ....

--- tony 
_______________________________________________
RIFT mailing list
RIFT@ietf.org
https://www.ietf.org/mailman/listinfo/rift

= --Apple-Mail=_583E9E56-7F9E-4A84-A065-F3F2D14BB606-- From nobody Mon Apr 1 09:16:14 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 420D6120315 for ; Mon, 1 Apr 2019 09:16:12 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -14.5 X-Spam-Level: X-Spam-Status: No, score=-14.5 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_HI=-5, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, USER_IN_DEF_DKIM_WL=-7.5] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (1024-bit key) header.d=cisco.com header.b=EvtMffZm; dkim=pass (1024-bit key) header.d=cisco.onmicrosoft.com header.b=P/zidLUB Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id fK_GlPQ_GC8v for ; Mon, 1 Apr 2019 09:16:09 -0700 (PDT) Received: from alln-iport-3.cisco.com (alln-iport-3.cisco.com [173.37.142.90]) (using TLSv1.2 with cipher DHE-RSA-SEED-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9820312030F for ; Mon, 1 Apr 2019 09:16:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=cisco.com; i=@cisco.com; l=5657; q=dns/txt; s=iport; t=1554135369; x=1555344969; h=from:to:subject:date:message-id:references:in-reply-to: mime-version; bh=19PtDduA08ViMuF0U5wg+fOVtpjSBAd3rpYIDKXO5P0=; b=EvtMffZm8KZMS/zNq5grJ2gJeuZH9OVzL4DGUgagyBH0DRpb5Sp7Xl4c P5IWYIQbPlH8sOvbnNMygcDJwAJsoXTPu2vh80F2qa+v135ZED8ZJ2lLb QeqbpQ1XpSLjQdYga8i3wxUajEnzKU19u1yQidL3xpeJKLHdabbNyB3MI I=; IronPort-PHdr: =?us-ascii?q?9a23=3A6SKCwBTxoDV7W5Vml4ICG9PTmtpsv++ubAcI9p?= =?us-ascii?q?oqja5Pea2//pPkeVbS/uhpkESXBNfA8/wRje3QvuigQmEG7Zub+FE6OJ1XH1?= 
=?us-ascii?q?5g640NmhA4RsuMCEn1NvnvOjQmHNlIWUV513q6KkNSXs35Yg6arw=3D=3D?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0AEAAAGOKJc/4MNJK1jGgEBAQEBAgE?= =?us-ascii?q?BAQEHAgEBAQGBUQUBAQEBCwGBDi9QA2h0BAsnCodLA4RSimFKgg2SRoRJgS6?= =?us-ascii?q?BJANUDgEBLIRAAoVEIjQJDQEBAwEBCQEDAm0cDIVKAQEBAQMtEwEBOA8CAQg?= =?us-ascii?q?RBAEBLzIdCAIEARIIgxuBEUwDFQECoUkCihSCIIJ5AQEFhHoYggwIgS8BizI?= =?us-ascii?q?XgUA/gRFGgkw+hEaDOYImkSaUJwkCk3iULIs/k04CBAIEBQIOAQEFgU04gVZ?= =?us-ascii?q?wFTuCbIIKg26KU3KBKI4UAYEeAQE?= X-IronPort-AV: E=Sophos;i="5.60,297,1549929600"; d="scan'208,217";a="255914171" Received: from alln-core-1.cisco.com ([173.36.13.131]) by alln-iport-3.cisco.com with ESMTP/TLS/DHE-RSA-SEED-SHA; 01 Apr 2019 16:16:08 +0000 Received: from XCH-ALN-013.cisco.com (xch-aln-013.cisco.com [173.36.7.23]) by alln-core-1.cisco.com (8.15.2/8.15.2) with ESMTPS id x31GG88v030163 (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=FAIL); Mon, 1 Apr 2019 16:16:08 GMT Received: from xhs-rcd-002.cisco.com (173.37.227.247) by XCH-ALN-013.cisco.com (173.36.7.23) with Microsoft SMTP Server (TLS) id 15.0.1473.3; Mon, 1 Apr 2019 11:16:07 -0500 Received: from xhs-rcd-002.cisco.com (173.37.227.247) by xhs-rcd-002.cisco.com (173.37.227.247) with Microsoft SMTP Server (TLS) id 15.0.1473.3; Mon, 1 Apr 2019 11:16:06 -0500 Received: from NAM02-CY1-obe.outbound.protection.outlook.com (72.163.14.9) by xhs-rcd-002.cisco.com (173.37.227.247) with Microsoft SMTP Server (TLS) id 15.0.1473.3 via Frontend Transport; Mon, 1 Apr 2019 11:16:06 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cisco.onmicrosoft.com; s=selector1-cisco-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=KbXgJZOoE7C0pJ+96VHe5sLeM0CbQYhLki/luyuXINQ=; b=P/zidLUBA2n/jxWHmYgWZ1F2ULkm730jsDDfij3NDiy7oYiWq7d0sr1f94ZcCMX/CbKNCN4tv3RPQtVYGqH2kShBM7Y/zQDvBp485KCgzGyUZnFPL473tqR0U/b057XOg9Iqi8Vw+E0N5R790DRXoD0pUj4rluJxD9VxVi74NnE= Received: from 
MN2PR11MB3565.namprd11.prod.outlook.com (20.178.250.159) by MN2PR11MB3853.namprd11.prod.outlook.com (20.178.250.147) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1750.17; Mon, 1 Apr 2019 16:16:05 +0000 Received: from MN2PR11MB3565.namprd11.prod.outlook.com ([fe80::975:4644:7891:e2b1]) by MN2PR11MB3565.namprd11.prod.outlook.com ([fe80::975:4644:7891:e2b1%3]) with mapi id 15.20.1750.017; Mon, 1 Apr 2019 16:16:05 +0000 From: "Pascal Thubert (pthubert)" To: Antoni Przygienda , "rift@ietf.org" Thread-Topic: [Rift] Thu core meet ... Thread-Index: AQHU6J9rJS+EP5Ypx0G7jNcfOUra5qYneVtg Date: Mon, 1 Apr 2019 16:16:02 +0000 Deferred-Delivery: Mon, 1 Apr 2019 16:15:57 +0000 Message-ID: References: In-Reply-To: Accept-Language: fr-FR, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: spf=none (sender IP is ) smtp.mailfrom=pthubert@cisco.com; x-originating-ip: [173.38.220.57] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 4fd33dba-81fc-462d-b508-08d6b6bd57ba x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(5600139)(711020)(4605104)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(2017052603328)(7153060)(7193020); SRVR:MN2PR11MB3853; x-ms-traffictypediagnostic: MN2PR11MB3853: x-ms-exchange-purlcount: 2 x-microsoft-antispam-prvs: x-forefront-prvs: 0994F5E0C5 x-forefront-antispam-report: SFV:NSPM; 
SFS:(10009020)(136003)(396003)(366004)(39860400002)(346002)(376002)(189003)(199004)(52536014)(66066001)(8936002)(55016002)(53936002)(25786009)(478600001)(8676002)(33656002)(486006)(9686003)(86362001)(6436002)(6306002)(54896002)(110136005)(3846002)(81156014)(316002)(790700001)(6116002)(97736004)(256004)(76176011)(6246003)(99286004)(186003)(5660300002)(53546011)(106356001)(11346002)(105586002)(102836004)(7696005)(2906002)(6666004)(81166006)(7736002)(476003)(229853002)(26005)(14454004)(71190400001)(2501003)(74316002)(71200400001)(6506007)(446003)(68736007); DIR:OUT; SFP:1101; SCL:1; SRVR:MN2PR11MB3853; H:MN2PR11MB3565.namprd11.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; MX:1; A:1; received-spf: None (protection.outlook.com: cisco.com does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: TCnUQalGaVC573v+Yrl/1yIQ7bA/MP8i3ymMt7zI4oOt+nbraykqHuAg1Qsq+a3vkDTsZNuQqIQ2DMcBSnt3R82NLO2vFHAk3Otxb+C3/VcneMAYC8GjZrQzEtx7FhDt91dO5k0Fv0csqf7rQH5HacQufrjPkKH5mLfR2DbcYKZaMsu9BuV1csmTVhLRhL22WM+t8mqJOirHCdi/WTliylIaMUCCUWJttVr2B3soaXTHMh0LlYJDHcwyE2D2m8p0BWy601x6CqmnCu3PiUXDkJdMJlM3MDYaPMXXvnnSt6sMKDZzayoMCAB3JbffzrkyDa+0NagqC862MhW7aa3wZmEjy2B5MTeDKI1ow1ZE7xqexujX9ihSFm37VXWdK7SyIp3i4+ZVJO5J9Bv/OogfN4EvjAGPlXzlz8Jo8V9kciQ= Content-Type: multipart/alternative; boundary="_000_MN2PR11MB35658CEA98221DCD5D42A59DD8550MN2PR11MB3565namp_" MIME-Version: 1.0 X-MS-Exchange-CrossTenant-Network-Message-Id: 4fd33dba-81fc-462d-b508-08d6b6bd57ba X-MS-Exchange-CrossTenant-originalarrivaltime: 01 Apr 2019 16:16:05.4800 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 5ae1af62-9505-4097-a69a-c1553ef7840e X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN2PR11MB3853 X-OriginatorOrg: cisco.com X-Outbound-SMTP-Client: 173.36.7.23, xch-aln-013.cisco.com X-Outbound-Node: alln-core-1.cisco.com Archived-At: Subject: Re: [Rift] Thu core meet ... 
X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Apr 2019 16:16:12 -0000 --_000_MN2PR11MB35658CEA98221DCD5D42A59DD8550MN2PR11MB3565namp_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Multicast would be good. Jeffrey told me that we still have some disconnect, the paint is still fresh. At least we seemed to reach an agreement with the room that differentiating routing for traffic depending on whether it arrives from north or south was problematic. If the list confirms, then the choice of the BIDIR option is finalized. Building the trees to the subToF is a piece of cake. The game is to form the NBMA RPL and build a tree there (call it NRT?). I presented my understanding of the latest discussions with Jeffrey, about making the NRT not too deep and not too fat either, and transporting the NRT root ID in the node TIE. Happy to chat more about it all, and very soon ready to produce text if we reach agreement. All the best, Pascal From: RIFT On Behalf Of Antoni Przygienda Sent: Monday, 1 April 2019 17:28 To: rift@ietf.org Subject: [Rift] Thu core meet ... I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? .... --- tony --_000_MN2PR11MB35658CEA98221DCD5D42A59DD8550MN2PR11MB3565namp_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

Multicast would be good. Jeffrey told me that we still have some disconnect, the paint is still fresh.

At least we seemed to reach an agreement with the room that differentiating routing for traffic depending on whether it arrives from north or south was problematic.

If the list confirms, then the choice of the BIDIR option is finalized. Building the trees to the subToF is a piece of cake. The game is to form the NBMA RPL and build a tree there (call it NRT?). I presented my understanding of the latest discussions with Jeffrey, about making the NRT not too deep and not too fat either, and transporting the NRT root ID in the node TIE. Happy to chat more about it all, and very soon ready to produce text if we reach agreement.

 

All the best,

 

Pascal

 

From: RIFT <rift-bounces@ietf.org> On Behalf Of Antoni Przygienda
Sent: Monday, 1 April 2019 17:28
To: rift@ietf.org
Subject: [Rift] Thu core meet ...

I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? ....

--- tony

--_000_MN2PR11MB35658CEA98221DCD5D42A59DD8550MN2PR11MB3565namp_-- From nobody Mon Apr 1 10:34:35 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id C2FB61204D4 for ; Mon, 1 Apr 2019 10:34:30 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.85 X-Spam-Level: X-Spam-Status: No, score=-1.85 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=0.85, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=unavailable autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PaL3wmItZkkl for ; Mon, 1 Apr 2019 10:34:28 -0700 (PDT) Received: from mx0a-00273201.pphosted.com (mx0a-00273201.pphosted.com [208.84.65.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id B8F251204D3 for ; Mon, 1 Apr 2019 10:34:27 -0700 (PDT) Received: from pps.filterd (m0108156.ppops.net [127.0.0.1]) by mx0a-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x31HYOsc007146; Mon, 1 Apr 2019 10:34:24 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : subject : date : message-id : references : in-reply-to : content-type : mime-version; s=PPS1017; bh=l5WbI/SMFk33w9O+g62agP3HTHLqvE0hgeuMx4PaeHI=; b=ai6+SsDANG9dAovVtIcTJ29CdfINlmHTBe6pYRGQ3ot37pfq/yp7tAS77scWDjWtS5N9 1kunncqocySdwmIG6AERMs3VfNl3OluSHOeT4Q0N8IE/f8o+AVDQghipfks2U2bTM1iv 0prLfwfF+Z5wodORf7f9xbny9NoM38GdA/jzAUBBj/c7C+9YIjujq0OcPwTklMoVQsnO ANHmXfroO3W3QcsTUXaFgYBfOsWYKuraqYJBI47Q0PTxy1vKKM7+8nY60LfDlShsINlX 
XjIvhohlE5qT4/B+aJ2/nioPJxqJrIeg3PgpiYG+lT50SmnYQYWxjnBZtRK3SeHi/Ztv WQ== Received: from nam04-co1-obe.outbound.protection.outlook.com (mail-co1nam04lp2053.outbound.protection.outlook.com [104.47.45.53]) by mx0a-00273201.pphosted.com with ESMTP id 2rkhsv0p04-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Mon, 01 Apr 2019 10:34:24 -0700 Received: from CO2PR05MB2455.namprd05.prod.outlook.com (10.166.95.137) by CO2PR05MB2775.namprd05.prod.outlook.com (10.166.213.145) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1771.11; Mon, 1 Apr 2019 17:34:22 +0000 Received: from CO2PR05MB2455.namprd05.prod.outlook.com ([fe80::81e2:bbe8:6851:16b2]) by CO2PR05MB2455.namprd05.prod.outlook.com ([fe80::81e2:bbe8:6851:16b2%6]) with mapi id 15.20.1771.006; Mon, 1 Apr 2019 17:34:22 +0000 From: "Jeffrey (Zhaohui) Zhang" To: "Pascal Thubert (pthubert)" , Antoni Przygienda , "rift@ietf.org" Thread-Topic: [Rift] Thu core meet ... Thread-Index: AQHU6J9rJS+EP5Ypx0G7jNcfOUra5qYneVtggAATIFA= Content-Class: Date: Mon, 1 Apr 2019 17:34:21 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.1.100.23 dlp-reaction: no-action msip_labels: MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_Enabled=True; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_SiteId=bea78b3c-4cdb-4130-854a-1d193232e5f4; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_Owner=zzhang@juniper.net; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_SetDate=2019-04-01T17:22:50.8040489Z; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_Name=Juniper Confidential-Controlled Unclassified Information; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_Application=Microsoft Azure Information Protection; MSIP_Label_ec86cea6-d130-4b5a-862d-1c69acc725d6_Extended_MSFT_Method=Manual; Sensitivity=Juniper Confidential-Controlled Unclassified Information 
x-originating-ip: [66.129.241.14] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 26d53e91-117c-4563-7862-08d6b6c846fa x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(5600139)(711020)(4605104)(4618075)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(2017052603328)(7193020); SRVR:CO2PR05MB2775; x-ms-traffictypediagnostic: CO2PR05MB2775: x-ms-exchange-purlcount: 2 x-microsoft-antispam-prvs: x-forefront-prvs: 0994F5E0C5 x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(366004)(136003)(39860400002)(376002)(396003)(346002)(199004)(189003)(2501003)(106356001)(3846002)(105586002)(86362001)(790700001)(6116002)(74316002)(7736002)(256004)(25786009)(14454004)(6246003)(53936002)(6306002)(54896002)(68736007)(55016002)(236005)(229853002)(53946003)(7696005)(6436002)(99286004)(71200400001)(71190400001)(316002)(110136005)(8676002)(9686003)(2906002)(486006)(33656002)(53546011)(6506007)(446003)(102836004)(76176011)(186003)(26005)(5660300002)(478600001)(66066001)(52536014)(8936002)(97736004)(81156014)(81166006)(476003)(11346002)(579004); DIR:OUT; SFP:1102; SCL:1; SRVR:CO2PR05MB2775; H:CO2PR05MB2455.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; A:1; MX:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: i4RRLgz8HIYwVbhBdK3mL8kzVLUHc2cHrVMxBgMiC0UuJ2hHRo3hmImd5lphp7esErpeN22JbijQQbVMMn5EFnRdoTbmZHDwnaPbJk4xvsvMYokiZLzL3DIawzeixi6WNa+835pSonQ+H1TZ9xtfT9OozialG3fepyMCSQ0a3/UounUDSdF89aRqdvEJkH7afgzjf69UB3g5EVD1q9w3IrlOzaFh7mFZ3s44T9JH4sSgTxci79G0olc3LJWfc2v6/Zv/dB+0XS9ASLYgjG/VNHbImdOiA1eIOBFQZ+PBn3oVBMCbPrePeCGehTRVTQxcMyU999Awxmaaz/mHwNOmrbxgbO8asM8qZd4WuILihXgNj4LMtZ9+XP/7SA+eyXwn0slUlD3DuA5iCxOO3/McsrDgxzdeVljNgCmZxJQH2PE= Content-Type: multipart/alternative; boundary="_000_CO2PR05MB2455E39AA000B514759D8BFFD4550CO2PR05MB2455namp_" 
MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 26d53e91-117c-4563-7862-08d6b6c846fa X-MS-Exchange-CrossTenant-originalarrivaltime: 01 Apr 2019 17:34:21.8747 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: CO2PR05MB2775 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-01_05:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904010115 Archived-At: Subject: Re: [Rift] Thu core meet ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Apr 2019 17:34:31 -0000 --_000_CO2PR05MB2455E39AA000B514759D8BFFD4550CO2PR05MB2455namp_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable I won't be able to attend this week (and the next week). Please see zzh> below. Juniper Confidential-Controlled Unclassified Information From: RIFT On Behalf Of Pascal Thubert (pthubert) Sent: Monday, April 1, 2019 12:16 PM To: Antoni Przygienda ; rift@ietf.org Subject: Re: [Rift] Thu core meet ... Multicast would be good. Jeffrey told me that we still have some disconnect, the paint is still fresh. Zzh> The disconnect that I thought we had was whether we build (*,G-prefix) trees or just build a few (*,*) trees and then use group information to prune the forwarding.
Zzh> Before the meeting Pascal and I were focusing on how to connect the sub-trees at the top (what Pascal presented) and we did not get a chance to sync up with others on building (*,G-prefix) trees explicitly. Zzh> I thought Pascal once mentioned just building a few (*,*) trees and then hashing traffic onto different trees and using group information to prune. If Pascal did not mean that, and people do think it's good to build (*,G-prefix) trees (starting with a single (*,*) tree, and adding more (*,G-prefix) or (*,G) trees as needed), then there is no disconnect. At least we seemed to reach an agreement with the room that differentiating routing for traffic depending on whether it arrives from north or south was problematic. If the list confirms, then the choice of the BIDIR option is finalized. Building the trees to the subToF is a piece of cake. The game is to form the NBMA RPL and build a tree there (call it NRT?). I presented my understanding of the latest discussions with Jeffrey, about making the NRT not too deep and not too fat either, and transporting the NRT root ID in the node TIE. Happy to chat more about it all, and very soon ready to produce text if we reach agreement. Zzh> What does NRT stand for? Zzh> Another thing that we need to settle is whether we use RIFT's own TIE exchange to set up the tree or use (extended) PIM signaling/procedures. I remember Tony once preferred a non-RIFT mechanism. I think it would be good to just use native RIFT signaling instead of carrying PIM baggage. Zzh> Jeffrey All the best, Pascal From: RIFT On Behalf Of Antoni Przygienda Sent: Monday, 1 April 2019 17:28 To: rift@ietf.org Subject: [Rift] Thu core meet ... I'm back from IETF & ready to pick up on Thu. I have a bunch of small'ish things on the model based on ongoing deployment discussions, maybe Bruno has something on envelope already & I assume some more mcast? More topics? ....
--- tony --_000_CO2PR05MB2455E39AA000B514759D8BFFD4550CO2PR05MB2455namp_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

--_000_CO2PR05MB2455E39AA000B514759D8BFFD4550CO2PR05MB2455namp_-- From nobody Fri Apr 12 10:44:51 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 14D1C1203CC for ; Fri, 12 Apr 2019 10:44:50 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.337 X-Spam-Level: X-Spam-Status: No, score=-1.337 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id c4rGZpqN6eGG for ; Fri, 12 Apr 2019 10:44:47 -0700 (PDT) Received: from mx0b-00273201.pphosted.com (mx0b-00273201.pphosted.com [67.231.152.164]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C8BBB12030A for ; Fri, 12 Apr 2019 10:44:47 -0700 (PDT) Received: from pps.filterd (m0108163.ppops.net [127.0.0.1]) by mx0b-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3CHdmPi001672; Fri, 12 Apr 2019 10:44:43 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : cc : subject : date : message-id : references : in-reply-to : content-type : mime-version; s=PPS1017; bh=i0Jmj6i2Do99RsaUFkBAowlyE0WOV2xTvaMqAi3+KiM=; b=y7nn6a4QdP5GAy/2zEIGQpM9laSCq2NTVr9a7dUIP27BwyWEhUZZHGJFD7Cd6NhM6X4u HXnqPt8MnKeCJ/sAhH0MfYYkKCct4cSjFuyk6t9QXJXNRDY30loxscSIrd45Q5OB4bZk ljrUnIloTyL+FVdJEudk7dJBGDmzKm+U0mKfrvj+uafc6d6ji+DdB6wLZi8udpjhQDpG sIX0jcfKlOOpvqXtr0iD3coiAUrAtxBsvcBBC/khpt9wdZPrOfv5fJAdAd7LpsO1HZpx 
JYiCMw4Ivshvi75mFSi+wtr/RunshL8xvePFoavoaR0afi7zyH8xm8mnigUbDxtSIfUz tg== Received: from nam01-sn1-obe.outbound.protection.outlook.com (mail-sn1nam01lp2059.outbound.protection.outlook.com [104.47.32.59]) by mx0b-00273201.pphosted.com with ESMTP id 2rtwq8r7b5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Fri, 12 Apr 2019 10:44:43 -0700 Received: from MWHPR05MB3279.namprd05.prod.outlook.com (10.173.230.18) by MWHPR05MB3680.namprd05.prod.outlook.com (10.174.175.33) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1792.11; Fri, 12 Apr 2019 17:44:41 +0000 Received: from MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202]) by MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202%10]) with mapi id 15.20.1792.009; Fri, 12 Apr 2019 17:44:40 +0000 From: Antoni Przygienda To: Kris Price , "brunorijsman@gmail.com" CC: "rift@ietf.org" Thread-Topic: RIFT Thread-Index: AQHU8VRUPBOno7miBkKQqRko+b0GOKY4xxUk Date: Fri, 12 Apr 2019 17:44:40 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.129.239.11] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 43a81197-12cb-45d3-43e6-08d6bf6e8a76 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600139)(711020)(4605104)(4618075)(2017052603328)(7193020); SRVR:MWHPR05MB3680; x-ms-traffictypediagnostic: MWHPR05MB3680: x-ms-exchange-purlcount: 1 x-microsoft-antispam-prvs: x-forefront-prvs: 0005B05917 x-forefront-antispam-report: SFV:NSPM; 
SFS:(10019020)(346002)(396003)(136003)(376002)(366004)(39860400002)(199004)(189003)(476003)(5660300002)(54896002)(25786009)(7736002)(97736004)(7116003)(66066001)(86362001)(6116002)(3846002)(14454004)(236005)(52536014)(2906002)(4326008)(99286004)(33656002)(106356001)(105586002)(55016002)(74316002)(478600001)(110136005)(53936002)(19627405001)(221733001)(316002)(81156014)(446003)(76176011)(81166006)(8936002)(486006)(71200400001)(8676002)(26005)(9686003)(6246003)(11346002)(7696005)(102836004)(186003)(6436002)(71190400001)(606006)(229853002)(6306002)(6506007)(256004)(2501003)(105004)(53546011)(68736007)(966005)(3480700005); DIR:OUT; SFP:1102; SCL:1; SRVR:MWHPR05MB3680; H:MWHPR05MB3279.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; A:1; MX:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: ASP4OTNa05ZI4e0QGznPgbCYv3FHzYsm0rJIdpA/o66PzKhHB8Ye1pwqjEhelNePFwM5mFYe61rT9dBEye3JJv/c7jbnorq4QVDVRQVnlP71idRwzAm4Ucy8wsCwf+7SQL/pNHuXVtz4XkiBax4MjiOtI5iDeV+uBGzYYOB0w+c0ypOFYnLrmSCfHwpJoIdL0Q2UtWmqk4lNH0x0CbHPh/Pat/OOFT8iryx9sbX7zqKHzVx8kzLrFzjFqSQbAudacWTDaVNzyJdoYlLGWRymvu3WfuwF5YXP2dtsXfjOAXU44tFjmxZHiLOC/otCgj+eIcwScJHNIxrTa04p0km8d+9H+jt4CT4pg5hznyULnKYhxqdBiyqMZsTLbVwJZhRwGAmH15AG1GXph/GcCMk4FzsadWVQI5tJ3ZSC+RNBFVM= Content-Type: multipart/alternative; boundary="_000_MWHPR05MB32798B45DD99D8ABF75B875AAC280MWHPR05MB3279namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 43a81197-12cb-45d3-43e6-08d6bf6e8a76 X-MS-Exchange-CrossTenant-originalarrivaltime: 12 Apr 2019 17:44:40.8156 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR05MB3680 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , 
definitions=2019-04-12_10:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904120119 Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Apr 2019 17:44:50 -0000 --_000_MWHPR05MB32798B45DD99D8ABF75B875AAC280MWHPR05MB3279namp_ Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable

Hey Kris, great to see you engaging back ;-) I cc: rift mailing list for posterity

1) yepp, very nice python implementation, especially if one wants to understand RIFT as running protocol rather than paper spec ;-)
2) Yepp, multi-planes did lead to lots of discussions in core-team meetings around acceptable solutions and how we'd explain it properly. Explanation largely due to Pascal's and Ilya's work, it takes a bit to soak the ASCII pictures but once you start to grok the concept of "crossbars crossbaring crossbars" being Clos ;-) it's very easy to think through the stuff.
3) A knob to basically "always disaggregate southbound" is a simple thing to do really. Just like Bruno has it in his computation first phase mostly decides _which_ nodes need disaggregation. The result can be simply replaced by all southbound nodes & then disaggregation happens naturally. Observe that you still want the default origination since a PoD doesn't see other PoDs except via spine and disaggregation is _not_ transitive (I'm talking positive now). There are other cases where you want to advertise southbound some prefixes beside default and it's a normal thing to do, nothing says that you only advertise default southbound anywhere.
4) Observe that we do NOT have a single ring! A ring is only as long as the #planes you have. No one will have 1000 planes ;-) So let's say you have 64 switches in each plane and 4 planes. You will have 64 rings of length 4. Obviously you can double-ring or ring within the plane as well to improve reliability but basically the topology is coherent until 2! links in the same ring break.
5) Always enabling disaggregation: That's point 3. Observe that does NOT solve your multi-plane problem on breakages since positive disaggregation is NOT transitive. Yes, you can turn southbound PGP on and blast whole fabric with all prefixes which basically makes your blast radius uncontained on any change (kind of flat IGP or rather IGP with DV prefixes ;-) and a single link coming/going may lead to massive amount of convergence traffic due to prefix reachability changes. Moreover, all your leafs (which is servers in extreme case) need FIB size of the size of fabric host routes ...

Spec authorship is still open ;-) so if you feel like improving/adding to draft, just let me know your moniker on bitbucket since that's where the newest spec versions live ...

https://bitbucket.org/riftrfc/rift_draft/src/master/

--- tony

________________________________
From: Kris Price <kris@krisprice.nz>
Sent: Friday, April 12, 2019 10:22 AM
To: Antoni Przygienda; brunorijsman@gmail.com
Subject: RIFT

Hey guys,

... couple of thoughts for you:

I see someone much better at explaining things than me talked to you since we last spoke :-) so you've caveated the case with how middle tiers in a Clos don't get full visibility of all nodes at that tier via the reflection because by very nature of the Clos they're not fully striped to that lower tier (that is the essence of how Clos topologies scale so is present in any large fabric). You've covered that under the description as "hyper planes" and work around is to use a ring at those tiers to pass control plane traffic.

Would it be possible to somehow /safely/ have a 'knob' to turn off the aggregation at a tier so it'll always advertise all prefixes southbound? Particularly this is useful in a network that might migrate to this protocol that does not want to go back and cable a ring. And in some cases cabling a ring will be undesirable to some potential users anyway. (In some real world networks that could mean cabling a ring connecting >1000 switches that form the "spine columns". That's a long brittle ring.)

Larger question, would it be possible to disable aggregation network wide? (For someone that might be interested in using RIFT, but for reasons other than the aggregation capability and where that capability may be seen as undesirable.)

There's a semi-related issue at the very bottom of the fabric but it's a bit difficult to explain, and I'm not sure it's really RIFT's problem to solve, (in part really it's due to a bad network design choice that exists) so I might draw that up later.

Cheers
Kris

--_000_MWHPR05MB32798B45DD99D8ABF75B875AAC280MWHPR05MB3279namp_ Content-Type: text/html; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable
--_000_MWHPR05MB32798B45DD99D8ABF75B875AAC280MWHPR05MB3279namp_-- From nobody Wed Apr 17 07:47:24 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2E329120468 for ; Wed, 17 Apr 2019 07:47:19 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.998 X-Spam-Level: X-Spam-Status: No, score=-1.998 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6IUH3qkvfcM9 for ; Wed, 17 Apr 2019 07:47:15 -0700 (PDT) Received: from mail-qt1-x82f.google.com (mail-qt1-x82f.google.com [IPv6:2607:f8b0:4864:20::82f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 742D11204C2 for ; Wed, 17 Apr 2019 07:47:15 -0700 (PDT) Received: by mail-qt1-x82f.google.com with SMTP id z16so27664635qtn.4 for ; Wed, 17 Apr 2019 07:47:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=OX1VbsElFBmsl13FfwdzWPzRD1pdtsilos8AYr5rzuA=; b=opgVAg4fCDiLsC5rceLir95/VOgVZPoqn8He1hlRoggjXi5CoF3Shz4MpgBuD3E+C9 97N48vQp0fY5yGKVE/7Ww6rHtOdd7EusYp4h8jykrtgosf+jx6RcAV5QelwfQn4s6ab4 UQ2ZEY8G9bWtNuA9bhLnLPRRHnIc2xGTSbbBSLYtaErsCeSsFO7qqryxnPg9NvWQUoyr 07vLFaOcX7PMXZSNmTwfpic5TPkNHnCejWrqFu3dXzk5VuXCgkl05JNXEbHE8hDUtbyE 9Jr2rM0Z5r6nSePoJcKJ6WQVhFEMnpp9WfrzmZemje7tsSWiIymVaPqYBVw3A7OE7t+P KaOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=OX1VbsElFBmsl13FfwdzWPzRD1pdtsilos8AYr5rzuA=; b=YHzYoCcmbogynvNKjMNHpnoRQnTMsKyqlr8T7o8NYvaes15V70SWDGTF0Lh8Q5t8/2 9lS+RW3GiJ+yFRyH8YRQ3mb/Of54JntuX6eLd4C+SrtbI4xI1Cc0OQRT4q09Kf58a150 VMyGNTxYSEuKJlInUA9le/u4Ji8/oa/EmTb/tNw3y8SyL9N4yv1YheSXN1+HWv27kUtV 8kzVCQfAo5+m7bBXgkeaGyVWlT55RWtiKbvVioOpDFs6fNfmN6b4RzZRj/hMQsEc6aeo Kkvlw2Bb8JO8rChSqEKho7AZL0TnE4hKhKeWHjkYuQ1OaqD50Ue3xPA9lx7y1+K9RZdn DZUw== X-Gm-Message-State: APjAAAU26D4LValREJ0Wz0JToirCqXHYDtxxjSx5MBIKkORIgJ07q+aw 2FWjzMwnRVhJ0D795hQe/vg= X-Google-Smtp-Source: APXvYqwvVi79BzNLb3UPheACkwbiK00EJgjD9lg+7NVx8snJ/tcLnsGXiuxeVbkvouqYoJhRbpIw+w== X-Received: by 2002:a0c:b8a8:: with SMTP id y40mr71291691qvf.27.1555512434079; Wed, 17 Apr 2019 07:47:14 -0700 (PDT) Received: from [192.168.0.101] (host-cotesma-166-44.smandes.com.ar. [201.220.166.44]) by smtp.gmail.com with ESMTPSA id t34sm47660596qth.36.2019.04.17.07.47.11 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 17 Apr 2019 07:47:13 -0700 (PDT) From: Bruno Rijsman Message-Id: Content-Type: multipart/alternative; boundary="Apple-Mail=_EE0C2654-D7CC-437C-82DB-D7CA62B7579B" Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Date: Wed, 17 Apr 2019 11:47:10 -0300 In-Reply-To: Cc: "rift@ietf.org" To: Tony Przygienda , Kris Price References: X-Mailer: Apple Mail (2.3445.100.39) Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Apr 2019 14:47:22 -0000 --Apple-Mail=_EE0C2654-D7CC-437C-82DB-D7CA62B7579B Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Hi Kris, Thank you very much for your pull request; I merged it into the main = repo. 
I am very happy to see you actively contributing code!

I added the request for a “don’t aggregate south-bound knob” to my (internal) TODO list. It will be a while before I get to it, because I would like to finish the base implementation (security envelope, east-west rings, etc.) first.

-- Bruno

> On Apr 12, 2019, at 2:44 PM, Antoni Przygienda <prz@juniper.net> wrote:
>
> Hey Kris, great to see you engaging back ;-) I cc: rift mailing list for posterity
>
> 1) yepp, very nice python implementation, especially if one wants to understand RIFT as running protocol rather than paper spec ;-)
> 2) Yepp, multi-planes did lead to lots of discussions in core-team meetings around acceptable solutions and how we'd explain it properly. Explanation largely due to Pascal's and Ilya's work, it takes a bit to soak the ASCII pictures but once you start to grok the concept of "crossbars crossbaring crossbars" being Clos ;-) it's very easy to think through the stuff.
> 3) A knob to basically "always disaggregate southbound" is a simple thing to do really. Just like Bruno has it in his computation first phase mostly decides _which_ nodes need disaggregation. The result can be simply replaced by all southbound nodes & then disaggregation happens naturally. Observe that you still want the default origination since a PoD doesn't see other PoDs except via spine and disaggregation is _not_ transitive (I'm talking positive now). There are other cases where you want to advertise southbound some prefixes beside default and it's a normal thing to do, nothing says that you only advertise default southbound anywhere.
> 4) Observe that we do NOT have a single ring! A ring is only as long as the #planes you have. No one will have 1000 planes ;-) So let's say you have 64 switches in each plane and 4 planes. You will have 64 rings of length 4. Obviously you can double-ring or ring within the plane as well to improve reliability but basically the topology is coherent until 2! links in the same ring break.
> 5) Always enabling disaggregation: That's point 3. Observe that does NOT solve your multi-plane problem on breakages since positive disaggregation is NOT transitive. Yes, you can turn southbound PGP on and blast whole fabric with all prefixes which basically makes your blast radius uncontained on any change (kind of flat IGP or rather IGP with DV prefixes ;-) and a single link coming/going may lead to massive amount of convergence traffic due to prefix reachability changes. Moreover, all your leafs (which is servers in extreme case) need FIB size of the size of fabric host routes ...
>
> Spec authorship is still open ;-) so if you feel like improving/adding to draft, just let me know your moniker on bitbucket since that's where the newest spec versions live ...
>
> https://bitbucket.org/riftrfc/rift_draft/src/master/
>
> --- tony
>
> From: Kris Price <kris@krisprice.nz>
> Sent: Friday, April 12, 2019 10:22 AM
> To: Antoni Przygienda; brunorijsman@gmail.com
> Subject: RIFT
>
> Hey guys,
>
> ... couple of thoughts for you:
>
> I see someone much better at explaining things than me talked to you since we last spoke :-) so you've caveated the case with how middle tiers in a Clos don't get full visibility of all nodes at that tier via the reflection because by very nature of the Clos they're not fully striped to that lower tier (that is the essence of how Clos topologies scale so is present in any large fabric). You've covered that under the description as "hyper planes" and work around is to use a ring at those tiers to pass control plane traffic.
>
> Would it be possible to somehow /safely/ have a 'knob' to turn off the aggregation at a tier so it'll always advertise all prefixes southbound? Particularly this is useful in a network that might migrate to this protocol that does not want to go back and cable a ring. And in some cases cabling a ring will be undesirable to some potential users anyway. (In some real world networks that could mean cabling a ring connecting >1000 switches that form the "spine columns". That’s a long brittle ring.)
>
> Larger question, would it be possible to disable aggregation network wide? (For someone that might be interested in using RIFT, but for reasons other than the aggregation capability and where that capability may be seen as undesirable.)
>
> There's a semi-related issue at the very bottom of the fabric but it's a bit difficult to explain, and I'm not sure it's really RIFT's problem to solve, (in part really it's due to a bad network design choice that exists) so I might draw that up later.
>
> Cheers
> Kris

--Apple-Mail=_EE0C2654-D7CC-437C-82DB-D7CA62B7579B Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

= --Apple-Mail=_EE0C2654-D7CC-437C-82DB-D7CA62B7579B-- From nobody Wed Apr 17 11:33:51 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 187C21201BA for ; Wed, 17 Apr 2019 11:33:50 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.338 X-Spam-Level: X-Spam-Status: No, score=-1.338 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id yBMkqpHMoPcp for ; Wed, 17 Apr 2019 11:33:49 -0700 (PDT) Received: from mx0b-00273201.pphosted.com (mx0b-00273201.pphosted.com [67.231.152.164]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id DB7531201B0 for ; Wed, 17 Apr 2019 11:33:48 -0700 (PDT) Received: from pps.filterd (m0108161.ppops.net [127.0.0.1]) by mx0b-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3HITdFL016437 for ; Wed, 17 Apr 2019 11:33:48 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : subject : date : message-id : content-type : content-transfer-encoding : mime-version; s=PPS1017; bh=lhUkv3zPpFk8mzKnI60i9UGoMxc6c+yKBTbfzchDg94=; b=Gso7qELxZ9iiXV/2kNHfga3HiOQTVq85SiBFMGFFNhJfJ7TlmTmRxoVkaB/fKM4a2pCm N3ZBxlvy073ULqkFYVcuOIqQE76LAWP8Z4L2/LcRsrWbBY1TzGnqyLznFUfmIeiSl1+M 5G37Q37aTq+8kSYV8PKm0VGbdQSe5TcYtHaD3ji0OKxpld7oa7PIcnH+oldkQ2lcLVzD 4aPUiIe+ZLPTpyPYJvq2Uh/605UAY7gPoRcdzu3UqvswhPDAnyJCjDvGSwp25QqoeRPK 
I+FDwjhySLQ4sDRSTbN+Y/3Cj+srY9p0UcRgjfVGewtp+wDN/BkJkh3WphJxT7sTGw1f 2g== Received: from nam04-bn3-obe.outbound.protection.outlook.com (mail-bn3nam04lp2051.outbound.protection.outlook.com [104.47.46.51]) by mx0b-00273201.pphosted.com with ESMTP id 2rx4k0gj11-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT) for ; Wed, 17 Apr 2019 11:33:47 -0700 Received: from CO2PR05MB2455.namprd05.prod.outlook.com (10.166.95.137) by CO2PR05MB2421.namprd05.prod.outlook.com (10.166.200.144) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1813.9; Wed, 17 Apr 2019 18:33:46 +0000 Received: from CO2PR05MB2455.namprd05.prod.outlook.com ([fe80::81e2:bbe8:6851:16b2]) by CO2PR05MB2455.namprd05.prod.outlook.com ([fe80::81e2:bbe8:6851:16b2%6]) with mapi id 15.20.1813.011; Wed, 17 Apr 2019 18:33:46 +0000 From: "Jeffrey (Zhaohui) Zhang" To: "rift@ietf.org" Thread-Topic: RIFT session minutes uploaded for IETF104 Thread-Index: AdT1TAQ2jjkO2UvpS/GVL8RvRjnUgg== Content-Class: Date: Wed, 17 Apr 2019 18:33:45 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.1.100.23 dlp-reaction: no-action msip_labels: MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Enabled=True; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_SiteId=bea78b3c-4cdb-4130-854a-1d193232e5f4; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Owner=zzhang@juniper.net; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_SetDate=2019-04-17T18:33:43.3579069Z; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Name=Juniper Internal; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Application=Microsoft Azure Information Protection; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Extended_MSFT_Method=Automatic; Sensitivity=Juniper Internal x-originating-ip: [24.218.241.3] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 38a6f9e2-1f6f-4eff-75cf-08d6c3633a00 
x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(5600141)(711020)(4605104)(4618075)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(2017052603328)(7193020); SRVR:CO2PR05MB2421; x-ms-traffictypediagnostic: CO2PR05MB2421: x-ms-exchange-purlcount: 1 x-microsoft-antispam-prvs: x-forefront-prvs: 0010D93EFE x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(39860400002)(366004)(136003)(346002)(376002)(396003)(199004)(189003)(53936002)(6916009)(7696005)(105586002)(478600001)(3846002)(106356001)(26005)(7736002)(305945005)(81166006)(86362001)(6116002)(6436002)(68736007)(5640700003)(2501003)(1730700003)(99286004)(966005)(14454004)(2906002)(2351001)(8676002)(186003)(6506007)(81156014)(97736004)(25786009)(558084003)(55016002)(5660300002)(6306002)(9686003)(33656002)(102836004)(8936002)(71190400001)(71200400001)(52536014)(256004)(486006)(74316002)(66066001)(316002)(476003); DIR:OUT; SFP:1102; SCL:1; SRVR:CO2PR05MB2421; H:CO2PR05MB2455.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; A:1; MX:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: KHqsXuEQNhJ/9yhlfNxp0eZS7R6esdJfcfViHMFVXGYmfq6/IcQQJEpqOpLpUe9rZ7la0ueX7KHeTepuD7LsxsNtc9/RBocw97am477UEXDml/nY/iYePOLf1N6QDGkle0wjShGWypFJ471ultkfCtTkri4D3ncb6PiuU+tf40yMSFIpQdX6lf/amKwhnuTEDV6uZUMy63oSGizanafWXIJ+l9okNV+CHFy4bBBm8yi0LbaNAcwsBg/Vz6c8qjb8cXkOaUzAtivo2IJKdyfLG5BxHXXfzmbvBHQNqdfy8S7lRONpRDaFScgy+pmn/gSjhRQ7vyRE/XYs0nTVDG4272RUS0sufPBkr189r4uiTmGWwZevgD5mEfKilK/2PAXmgRvYqUuHiTW2IBbTF1CXI6Vf6veBciIl4jccVt2oMyk= Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 38a6f9e2-1f6f-4eff-75cf-08d6c3633a00 X-MS-Exchange-CrossTenant-originalarrivaltime: 17 Apr 2019 18:33:45.9629 (UTC) 
X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: CO2PR05MB2421 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-17_08:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=783 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904170123 Archived-At: Subject: [Rift] RIFT session minutes uploaded for IETF104 X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Apr 2019 18:33:50 -0000 Thanks Sandy Zhang for taking the notes. 
Please review here: https://datatracker.ietf.org/doc/minutes-104-rift/ Jeffrey Juniper Internal From nobody Thu Apr 18 09:43:15 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id EAB10120373 for ; Thu, 18 Apr 2019 09:43:13 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.999 X-Spam-Level: X-Spam-Status: No, score=-1.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Qpzy5Nk-abWy for ; Thu, 18 Apr 2019 09:43:12 -0700 (PDT) Received: from mail-ed1-x534.google.com (mail-ed1-x534.google.com [IPv6:2a00:1450:4864:20::534]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C62FB120110 for ; Thu, 18 Apr 2019 09:43:11 -0700 (PDT) Received: by mail-ed1-x534.google.com with SMTP id i13so2314046edf.11 for ; Thu, 18 Apr 2019 09:43:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=AasowlMs91avS5Ug3QkBesGV4QoRpcCy4WHlbAQFPJ4=; b=aNJIwdZOw/0Gi2dBt3w+/Gw9xCb6asbl77KiskyHSG+F7U4mQUVzFXLe4qjoN/ojZM PAPlVn2bMAiPAae8X3flY2iDPtHu7ol59ZfpRZ6GFxK4TnW+ZhkG9K8rRiVXRrtQT5bq /194GOyZQBX2cduHm53zOH1icnGLwWPymMpfPCso/WbMp63nibdb/H0ChlhyVuAdNq73 MLuvz9cSM0mQpne17cg5aJvRhBNwDIKmnLCbuj/gbQG530VShOR+9n0NvKG14qYxAGzL 64KrCw3mHh+2pHqC5rbE3yt5AQjeelFmuHhXAHgyBER4je5EmkXLkVP331lzz/ulhiyr 7P1Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=AasowlMs91avS5Ug3QkBesGV4QoRpcCy4WHlbAQFPJ4=; b=Rvzwu34P+NnykMDe5Fe7euUsPtY7bmzFtxn1J7p5y6w4JF+5xCQwuc8XeBmmvYlvHe QRXxztA3NrFkbABDBNryQDuSNgREOgyPnRobYVQOxePZvki8D/x0wK2SOdhKqD3mXG5G zpIpJZ8cSoMyRRGj4wMx11L3LJCa3NkFWyYt9Ch76swx5okMz0dNgrDvjp2xKsXRL67P zKzt6923CmV479kg0MRUwx4zsoTskgBrLldO238aGH8LXDjNIDfF00JZX+8DLLqLziB5 Df5OuyECn5vcyUfVmRsy7Xqi5fFpELMBGy0gSXG3gXWjd3QnFDWpgwwNKDUPGPlKgJRQ +96g== X-Gm-Message-State: APjAAAXKSy0agLSo6Fbh9KV4C+TourLN4sikDGlELf+TXqG/2qN7Y0Yx OFPcOmIs10KVr+zSvp/tj3X8pVaQy8VN5Qv3nJEyUab9 X-Google-Smtp-Source: APXvYqyRlcuuFNbQPHRfwaFy3YDm659fimKXNYZRjp4rzJkwbyEdazr6EatExR3gzERbPUjF2FDTbtlve4f7VPhx7Ls= X-Received: by 2002:a05:6402:1494:: with SMTP id e20mr43067249edv.22.1555605790349; Thu, 18 Apr 2019 09:43:10 -0700 (PDT) MIME-Version: 1.0 From: Tony Przygienda Date: Thu, 18 Apr 2019 09:42:33 -0700 Message-ID: To: rift@ietf.org Content-Type: multipart/alternative; boundary="0000000000008b5eb40586d0b254" Archived-At: Subject: [Rift] Today's core call synopsis ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Apr 2019 16:43:14 -0000 --0000000000008b5eb40586d0b254 Content-Type: text/plain; charset="UTF-8" Short synopsis: * Implementation discussion on security envelope, multiple node TIEs, miscabling indication, UDP and MTU implications * Short discussion on multicast status and rest of documents necessary https://www.dropbox.com/sh/ivumrmgy0zci4tt/AAD_H6KILrDFlRFzn_WAfTw9a?dl=0 --- tony --0000000000008b5eb40586d0b254 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Short synopsis:

* Implementation discussion on security envelope, multiple node TIEs, miscabling indication, UDP and MTU implications
* Short discussion on multicast status and rest of documents necessary
--0000000000008b5eb40586d0b254-- From nobody Thu Apr 18 10:28:30 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id DAE36120454 for ; Thu, 18 Apr 2019 10:28:21 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.901 X-Spam-Level: X-Spam-Status: No, score=-1.901 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=krisprice-nz.20150623.gappssmtp.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Jnr4Z1qHCfGN for ; Thu, 18 Apr 2019 10:28:17 -0700 (PDT) Received: from mail-lf1-x12c.google.com (mail-lf1-x12c.google.com [IPv6:2a00:1450:4864:20::12c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 5A407120444 for ; Thu, 18 Apr 2019 10:28:17 -0700 (PDT) Received: by mail-lf1-x12c.google.com with SMTP id v1so2227688lfg.5 for ; Thu, 18 Apr 2019 10:28:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=krisprice-nz.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=H99fQu9xocMWFlDmy75KXeyDa0S7/RPughN0pVWQPtI=; b=NRWu9XJ98moVWZRbS0Hf7jBsdMCKYJIV0YZ3zJ7EHjldBjZKQmfmXAxZOwhgOMbssS pSWUNK9lMohcAs2avE2EaSB2NB1/wTOggae7aN8GLAM8hWS9TvqT21zH7LJ9vv/rRVSC /Pct2yrioVshtBUoaYG6HvXsa34HFr7aiMamVbPveYTzEZpNFV+T4pbIwXubOo3iI4Ue 0E9IiszCIhBYqKKOkG8O0h75nHAlmrTX/KLYK7jBsbfwdwCGL4VXnde2NZLp0c6bmMd5 IzItIFPxB3lpmM0lcTPtbnXJ3ajM6hOiX5JOBuJ26UrdwVk+wCEfnj8XgmNjXlSpHAVD S9Mg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=H99fQu9xocMWFlDmy75KXeyDa0S7/RPughN0pVWQPtI=; b=DGJ6sO6RHLNJJ1/pMCrdNJgNHmrxwDj1reiLj7AmHjQpHjgwrfDF9uoLfxI01xTS/j 3dppGZcXsRYv63+iNfK1mXfMkgco/en5WWSa/1Q71L/r4eTvQmnd4NDajsTmx+nvhIkX J3AYkxtorzH7L0ltmG+Qhg3UPhWTy8EAa9Ew7CfeGiRBVSNNn+cOkDSooEHfFVJFW0uI D97ImtXiICYmgdwb9Cdvxel6z5GzK0d/K27vGtq6PIO3mQmxD5AVBM/eVJu+t7H7sFw2 hQUKCyj0Bc4gGtGfDfy8jhQJs84Ip1ZYiMjpXAFCDU+yeiMDf4kV6fuaCkfd3QCJc5sG KwYg== X-Gm-Message-State: APjAAAX9ECbnji/9T2q1ovQTQ7zbxU3RBl1B6ejC1SejB8aLLRBf6LaG jBNhO0DGMR2xnO4OiKBUh1iLCzqEpC6Dnq1bONz/yg== X-Google-Smtp-Source: APXvYqxBV87Ht/2Xh8tl7IqkRMpnbVoO1toOrx3iIg5McMbqVwHPRm/KBchbjpaUvPjHOrcUtCaeJFGnaRBPERr2kfs= X-Received: by 2002:ac2:482e:: with SMTP id 14mr32600324lft.1.1555608495357; Thu, 18 Apr 2019 10:28:15 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Kris Price Date: Thu, 18 Apr 2019 13:28:14 -0400 Message-ID: To: Antoni Przygienda Cc: "brunorijsman@gmail.com" , "rift@ietf.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Apr 2019 17:28:27 -0000 Hey Tony, On the rings: Ahh! I get it, okay that makes it better. I was also wondering if some kind of designated 'S-TIE' reflector / virtual links / or explicitly configured multi-hop adjacencies solution could be used (the issue being one of how do you route these packets between the peers without needing to do something like source route multiple hops southbound before being default routed northbound). 
Back on the subject of disaggregation:

The other reason for asking for the always disaggregate option is to prevent the transient congestion that can occur on link failures. But I do see now on rereading the draft you've called this out in the second to last paragraph of 5.2.5.1, but it's left as an implementation specific problem to solve.

It seems this would arise frequently at the bottom two tiers of the network. Any loss of any single link to any rack (tier 0) would result in all other nodes at tier 1 disaggregating the prefix(es) for that rack and causing the potential transient incast-like congestion. I'm a bit concerned that this may be a noticeable event in some cases (e.g. a storage row/cluster or maybe where RoCE is in use), and one that would be fairly annoying to debug and remedy post transition to RIFT if you didn't foresee it and have the tools (knobs) in place to prevent it from happening without a PR and s/w upgrade.

Should implementations have a conscious solution in advance for this, and what's the best way to ensure that? The 'always-disaggregate' knob is one. Another might be something like a 'min-next-hops' option where the local RIFT instance on tier 0 won't install a prefix unless it has received it from a minimum number of upstreams.

Both of these do run counter to the low-configuration nature of RIFT. Another might be a protocol change, something like nodes disaggregating prefixes by default until they know they are more than 1 hop from the bottom of fabric? (This may run into other convergence issues during fabric bring up and cold start and maybe there are other issues with it that need doodling out.)

/2c

Cheers :-)
Kris

On Fri, Apr 12, 2019 at 1:44 PM Antoni Przygienda wrote:
>
> Hey Kris, great to see you engaging back ;-) I cc: rift mailing list for posterity
>
>
> 1) yepp, very nice python implementation, especially if one wants to understand RIFT as running protocol rather than paper spec ;-)
> 2) Yepp, multi-planes did lead to lots of discussions in core-team meetings around acceptable solutions and how we'd explain it properly. Explanation largely due to Pascal's and Ilya's work, it takes a bit to soak the ASCII pictures but once you start to grok the concept of "crossbars crossbaring crossbars" being Clos ;-) it's very easy to think through the stuff.
> 3) A knob to basically "always disaggregate southbound" is a simple thing to do really. Just like Bruno has it in his computation first phase mostly decides _which_ nodes need disaggregation. The result can be simply replaced by all southbound nodes & then disaggregation happens naturally. Observe that you still want the default origination since a PoD doesn't see other PoDs except via spine and disaggregation is _not_ transitive (I'm talking positive now). There are other cases you want to advertise southbound some prefixes beside default and it's a normal thing to do, nothing says that you only advertise default southbound anywhere.
> 4) Observe that we do NOT have a single ring! A ring is only as long as the #planes you have. No one will have 1000 planes ;-) So let's say you have 64 switches in each plane and 4 planes. You will have 64 rings of length 4. Obviously you can double-ring or ring within the plane as well to improve reliability but basically the topology is coherent until 2! links in the same ring break.
> 5) Always enabling disaggregation: That's point 3. Observe that does NOT solve your multi-plane problem on breakages since positive disaggregation is NOT transitive. Yes, you can turn southbound PGP on and blast whole fabric with all prefixes which basically makes your blast radius uncontained on any change (kind of flat IGP or rather IGP with DV prefixes ;-) and a single link coming/going may lead to massive amount of convergence traffic due to prefix reachability changes. Moreover, all your leafs (which is servers in extreme case) need FIB size of the size of fabric host routes ...
>
> Spec authorship is still open ;-) so if you feel like improving/adding to draft, just let me know your moniker on bitbucket since that's where the newest spec versions live ...
>
> https://bitbucket.org/riftrfc/rift_draft/src/master/
>
> --- tony
>
>
> ________________________________
> From: Kris Price
> Sent: Friday, April 12, 2019 10:22 AM
> To: Antoni Przygienda; brunorijsman@gmail.com
> Subject: RIFT
>
> Hey guys,
>
> ... couple of thoughts for you:
>
> I see someone much better at explaining things than me talked to you since we last spoke :-) so you've caveated the case with how middle tiers in a Clos don't get full visibility of all nodes at that tier via the reflection because by very nature of the Clos they're not fully striped to that lower tier (that is the essence of how Clos topologies scale so is present in any large fabric). You've covered that under the description as "hyper planes" and work around is to use a ring at those tiers to pass control plane traffic.
>
> Would it be possible to somehow /safely/ have a 'knob' to turn off the aggregation at a tier so it'll always advertise all prefixes southbound? Particularly this is useful in a network that might migrate to this protocol that does not want to go back and cable a ring. And in some cases cabling a ring will be undesirable to some potential users anyway. (In some real world networks that could mean cabling a ring connecting >1000 switches that form the "spine columns". That's a long brittle ring.)
>
> Larger question, would it be possible to disable aggregation network wide? (For someone that might be interested in using RIFT, but for reasons other than the aggregation capability and where that capability may be seen as undesirable.)
>
> There's a semi-related issue at the very bottom of the fabric but it's a bit difficult to explain, and I'm not sure it's really RIFT's problem to solve (in part really it's due to a bad network design choice that exists), so I might draw that up later.
>
> Cheers
> Kris
>
>
>
>
From nobody Thu Apr 18 11:14:20 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id D0959120144 for ; Thu, 18 Apr 2019 11:14:18 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.338 X-Spam-Level: X-Spam-Status: No, score=-1.338 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id QpCt1r8bUxK5 for ; Thu, 18 Apr 2019 11:14:16 -0700 (PDT) Received: from mx0b-00273201.pphosted.com (mx0b-00273201.pphosted.com [67.231.152.164]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 6622612001B for ; Thu, 18 Apr 2019 11:14:16 -0700 (PDT) Received: from pps.filterd (m0108160.ppops.net [127.0.0.1]) by mx0b-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3II9HDt024224; Thu, 18 Apr 2019 11:14:12 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : cc 
: subject : date : message-id : references : in-reply-to : content-type : mime-version; s=PPS1017; bh=SIt+lVRuQ3kqXyZabCe4RhHcD1pWpdle+eqEXHyh3zM=; b=zs6cVinIPEsfa9Qu2EtuquBbAMCWrM+lHsJhBDH7sn9/TLTOSKbtLZd2UUhpbGxmG3t9 K/H46P5H/OP5wLaxiwtbjKAyksfr4D7RoPtpPR0MznE4UiML4vb7ZTRVsh0FNkh7ChNl PO98260J2wc9YAe4vhvpXETv4boiGnxC4tfVEofuhgCx6v5Ixo4xsYFhuir2GOtNhwdw DaWCLZ4Dr/atm3syjrin+UixXenUKvWlFTy9HGcDBcPfY5N46HzLwLEJAewIwdiDgDL8 kOsrenyGpeJ0BcwffVUzRUEsrg1e8ucLMY3Imez+qbJ4p76hpYzoqzeMX6zBWUDjCmhd UA== Received: from nam02-cy1-obe.outbound.protection.outlook.com (mail-cys01nam02lp2052.outbound.protection.outlook.com [104.47.37.52]) by mx0b-00273201.pphosted.com with ESMTP id 2rxq8e8vkv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Thu, 18 Apr 2019 11:14:12 -0700 Received: from MWHPR05MB3279.namprd05.prod.outlook.com (10.173.230.18) by MWHPR05MB3182.namprd05.prod.outlook.com (10.173.229.137) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1813.9; Thu, 18 Apr 2019 18:14:09 +0000 Received: from MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202]) by MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202%10]) with mapi id 15.20.1835.007; Thu, 18 Apr 2019 18:14:09 +0000 From: Antoni Przygienda To: Kris Price CC: "brunorijsman@gmail.com" , "rift@ietf.org" Thread-Topic: RIFT Thread-Index: AQHU8VRUPBOno7miBkKQqRko+b0GOKY4xxUkgAludgCAAAZsrQ== Date: Thu, 18 Apr 2019 18:14:09 +0000 Message-ID: References: , In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.129.239.12] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 156368fb-00f1-41f8-a8ff-08d6c429a752 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(5600141)(711020)(4605104)(4618075)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(2017052603328)(7193020); 
SRVR:MWHPR05MB3182; x-ms-traffictypediagnostic: MWHPR05MB3182: x-microsoft-antispam-prvs: x-forefront-prvs: 0011612A55 x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(136003)(39860400002)(366004)(376002)(346002)(396003)(189003)(199004)(6506007)(55016002)(33656002)(3846002)(14454004)(221733001)(186003)(478600001)(7696005)(3480700005)(486006)(6116002)(19627405001)(86362001)(97736004)(316002)(76176011)(54906003)(102836004)(26005)(25786009)(105004)(66066001)(53936002)(7736002)(11346002)(99286004)(4326008)(81156014)(6246003)(74316002)(53546011)(446003)(476003)(5660300002)(52536014)(68736007)(2906002)(14444005)(256004)(81166006)(229853002)(6916009)(54896002)(71200400001)(6436002)(71190400001)(8936002)(8676002)(7116003)(9686003); DIR:OUT; SFP:1102; SCL:1; SRVR:MWHPR05MB3182; H:MWHPR05MB3279.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; MX:1; A:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: pqmJ2ut9oPqi7C4NZt6zdIgPcBNHoD4z/Lhdagt7JG7AVcYWk868yFj+QEVveb10tRZGTPAKwMPn0Y3XVYhF+hVTmaeeForOJ8+N0hUTzLv9EOsD7ixV2VZjHo5xkoR+HRZFKaVjUaEpC7MNhi1z4EePf9xMNzxBFskfdy53CbKqkO+kUn6TRcW9r1EjG1RV8DiFWSnqUCaeOGFZa/xMZYlTDbrpy7/zDyMKhDftCqVPlbBDg37oTX5F+ITPjHph73AtTvLV6XrCe/1fDfa3NtzZVuBBa4k+FT+B1wdDPmDqYPnvmcOYyREnvNrRq4SDjNqHueJYdcVEWUzB2xATwbKWDaq3p7kL6LvyFOF7QGKaEc+i+xKq+zGQirlMxvnVswlqNlBYMrgoAqD3TrauKXbNFEGvbZUKDvd0wzZMRs0= Content-Type: multipart/alternative; boundary="_000_MWHPR05MB32798005D0A97DCC996CCB11AC260MWHPR05MB3279namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 156368fb-00f1-41f8-a8ff-08d6c429a752 X-MS-Exchange-CrossTenant-originalarrivaltime: 18 Apr 2019 18:14:09.7948 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED 
X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR05MB3182 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-18_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904180113 Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Apr 2019 18:14:19 -0000 --_000_MWHPR05MB32798005D0A97DCC996CCB11AC260MWHPR05MB3279namp_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Hey, Kris, inline ________________________________ From: Kris Price Sent: Thursday, April 18, 2019 10:28 AM To: Antoni Przygienda Cc: brunorijsman@gmail.com; rift@ietf.org Subject: Re: RIFT Hey Tony, On the rings: Ahh! I get it, okay that makes it better. I was also wondering if some kind of designated 'S-TIE' reflector / virtual links / or explicitly configured multi-hop adjacencies solution could be used (the issue being one of how do you route these packets between the peers without needing to do something like source route multiple hops southbound before being default routed northbound). good, I know it takes bit to grok the stuff. 
We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on, within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding horizontal links below ToF are south and @ ToF level north basically) and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...

Back on the subject of disaggregation: The other reason for asking for the always disaggregate option is to prevent the transient congestion that can occur on link failures. But I do see now on rereading the draft you've called this out in the second to last paragraph of 5.2.5.1, but it's left as an implementation specific problem to solve.

well, yes, no free lunch, either you gum up your fabric with all stuff and suffer large blast radius or you dig the beauty of having minimum blast radius and minimal topology info everywhere but on massive failures stuff needs to be sloshed around so e'one has enough info to not blackhole. Funnily enough, today's networks, especially fabrics, allow insane flooding rates without breaking half a sweat (first thing I played with when thinking about RIFT design ;-) and I learned here some lessons from looking @ p2p networks BTW. If you run my free package you'll easily see convergence rate of 7-10+K TIEs in the database per second and that's the "usable rate" in the sense that there is much more flooding on the links and it's the "best TIEs in LSDB" rate already. UDP is really quite phenomenal and with a bit of additional help (look @ the packet number & the "you flood too fast" indications) you can dynamically adjust flooding to walk the edge of having losses. Brave new world ...

It seems this would arise frequently at the bottom two tiers of the network. Any loss of any single link to any rack (tier 0) would result in all other nodes at tier 1 disaggregating the prefix(es) for that rack and causing the potential transient incast-like congestion. I'm a bit concerned that this may be a noticeable event in some cases (e.g. a storage row/cluster or maybe where RoCE is in use), and one that would be fairly annoying to debug and remedy post transition to RIFT if you didn't foresee it and have the tools (knobs) in place to prevent it from happening without a PR and s/w upgrade.

yepp, you call the spade but you're a bit too pessimistic methinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...

Should implementations have a conscious solution in advance for this, and what's the best way to ensure that? The 'always-disaggregate' knob is one. Another might be something like a 'min-next-hops' option where the local RIFT instance on tier 0 won't install a prefix unless it has received it from a minimum number of upstreams.

The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).

The other idea I don't grok, you have to explain in more detail.

Both of these do run counter to the low-configuration nature of RIFT. Another might be a protocol change, something like nodes disaggregating prefixes by default until they know they are more than 1 hop from the bottom of fabric? (This may run into other convergence issues during fabric bring up and cold start and maybe there are other issues with it that need doodling out.)

Yeah, doodle, I'm not concerned about the convergence and interested in your ideas. ZTP has no beef with prefixes, will work regardless. So far I saw no indications that size of the fabric up to any reasonable bound of nodes will prevent it to cold-boot properly. ZTP FSM has no timers for that reason BTW (another no, no I put down there ;-) and flooding only has a single retransmit timer.

And, yes, the more stuff like forced disaggregation you start to fiddle with, the more you lose ZTP (depending whether you want it or not, RIFT will work either way).
BTW, same with security, the more security you desire the less ZTP you get ;-) Nature of the beast ... --- tony --_000_MWHPR05MB32798005D0A97DCC996CCB11AC260MWHPR05MB3279namp_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
Hey, Kris, inline


From: Kris Price <kris@krisprice.nz>
Sent: Thursday, April 18, 2019 10:28 AM
To: Antoni Przygienda
Cc: brunorijsman@gmail.com; rift@ietf.org
Subject: Re: RIFT
 
Hey Tony,

On the rings: Ahh! I get it, okay that makes it better. I was also
wondering if some kind of designated 'S-TIE' reflector / virtual links
/ or explicitly configured multi-hop adjacencies solution could be
used (the issue being one of how do you route these packets between
the peers without needing to do something like source route multiple
hops southbound before being default routed northbound).

good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north, basically), and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...

Back on the subject of disaggregation:

The other reason for asking for the always disaggregate option is to
prevent the transient congestion that can occur on link failures. But
I do see now on rereading the draft you've called this out in the
second to last paragraph of 5.2.5.1, but it's left as an
implementation-specific problem to solve.

well, yes, no free lunch: either you gum up your fabric with all stuff and suffer large blast radius, or you dig the beauty of having minimum blast radius and minimal topology info everywhere, but on massive failures stuff needs to be sloshed around so e'one has enough info to not blackhole. Finely enough, today's networks, especially fabrics, allow insane flooding rates without breaking half a sweat (first thing I played with when thinking about RIFT design ;-) and I learned here some lessons from looking @ p2p networks BTW. If you run my free package you'll easily see convergence rates of 7-10+K TIEs in the database per second, and that's the "usable rate" in the sense that there is much more flooding on the links and it's the "best TIEs in LSDB" rate already. UDP is really quite phenomenal and with a bit of additional help (look @ the packet number & the "you flood too fast" indications) you can dynamically adjust flooding to walk the edge of having losses. Brave new world ...
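
As a rough illustration of "dynamically adjust flooding to walk the edge of having losses", the sketch below backs the send rate off when either a gap in received packet numbers or a "you flood too fast" indication is seen, and otherwise creeps it upward. The additive-increase / multiplicative-decrease policy, the function names and all numeric defaults are assumptions made for this example; neither the thread nor the draft prescribes them.

```python
def count_missing(packet_numbers):
    """Count packets missing from an increasing sequence of packet numbers."""
    missing = 0
    for prev, cur in zip(packet_numbers, packet_numbers[1:]):
        if cur > prev + 1:
            missing += cur - prev - 1
    return missing


def next_flood_rate(rate, lost_packets, peer_says_too_fast,
                    step=250.0, backoff=0.5, floor=100.0, ceiling=20000.0):
    """Return the flooding rate (packets/s) to use for the next interval.

    Hypothetical policy: halve the rate on any loss signal, otherwise
    add a fixed step, always staying within [floor, ceiling].
    """
    if lost_packets > 0 or peer_says_too_fast:
        return max(floor, rate * backoff)
    return min(ceiling, rate + step)


# One clean interval followed by one interval with a gap in packet numbers.
rate = next_flood_rate(1000.0, count_missing([1, 2, 3, 4]), False)
rate = next_flood_rate(rate, count_missing([5, 6, 9, 10]), False)
print(rate)
```

The point of the sketch is only that the two signals named above are enough to drive a simple feedback loop; a real implementation would likely smooth the signals and tune the constants per link.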

It seems this would arise frequently at the bottom two tiers of the
network. Any loss of any single link to any rack (tier 0) would result
in all other nodes at tier 1 disaggregating the prefix(es) for that
rack and causing the potential transient incast-like congestion. I'm a
bit concerned that this may be a noticeable event in some cases (e.g.
a storage row/cluster or maybe where RoCE is in use), and one that
would be fairly annoying to debug and remedy post transition to RIFT
if you didn't foresee it and have the tools (knobs) in place to
prevent it from happening without a PR and s/w upgrade.

yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of leeway how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...

Should
implementations have a conscious solution in advance for this, and
what's the best way to ensure that? The 'always-disaggregate' knob is
one. Another might be something like a 'min-next-hops' option where
the local RIFT instance on tier 0 won't install a prefix unless it has
received it from a minimum number of up streams

The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).

The other idea I don't grok, you have to explain in more detail.

Both of these do run counter to the low-configuration nature of RIFT.
Another might be a protocol change, something like nodes
disaggregating prefixes by default until they know they are more than
1 hop from the bottom of fabric? (This may run into other convergence
issues during fabric bring up and cold start and maybe there are other
issues with it that need doodling out.)

Yeah, doodle, I'm not concerned about the convergence and interested in your ideas. ZTP has no beef with prefixes, will work regardless. So far I saw no indications that size of the fabric up to any reasonable bound of nodes will prevent it from cold-booting properly. ZTP FSM has no timers for that reason BTW (another no-no I put down there ;-) and flooding only has a single retransmit timer.

And, yes, the more stuff like forced disaggregation you start to fiddle with, the more you lose ZTP (depending whether you want it or not, RIFT will work either way). BTW, same with security, the more security you desire the less ZTP you get ;-) Nature of the beast ...

--- tony
From nobody Thu Apr 18 13:07:16 2019
From: Kris Price
Date: Thu, 18 Apr 2019 16:07:08 -0400
To: Antoni Przygienda
Cc: "brunorijsman@gmail.com", "rift@ietf.org"
Subject: Re: [Rift] RIFT

Hey Tony, inline:

[snip]
> On the rings: Ahh! I get it, okay that makes it better. I was also
> wondering if some kind of designated 'S-TIE' reflector / virtual links
> / or explicitly configured multi-hop adjacencies solution could be
> used (the issue being one of how do you route these packets between
> the peers without needing to do something like source route multiple
> hops southbound before being default routed northbound).
>
> good, I know it takes a bit to grok the stuff. We did the best we could
> with ASCII and language but the concepts need some chewing for sure,
> even if you have been around big fabrics for a bit ;-) So, nothing like
> route reflectors and so on: within a plane normal south reflection
> takes care of sync'ing up all you need, outside the plane the ring
> takes care of sync'ing up planes (for flooding, horizontal links below
> ToF are south and @ ToF level north, basically), and with that you have
> all the topology to figure out negative disaggregation. I explicitly
> killed any "virtual link" suggestions, I went through this particular
> hell in my life more than once and don't want to visit it anymore ;-)
> ...

[KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
have customers buying into that then that's cool. (I omitted
describing the physical *shudder* when I wrote "virtual links" ;-))

[snip]
> It seems this would arise frequently at the bottom two tiers of the
> network. Any loss of any single link to any rack (tier 0) would result
> in all other nodes at tier 1 disaggregating the prefix(es) for that
> rack and causing the potential transient incast-like congestion. I'm a
> bit concerned that this may be a noticeable event in some cases (e.g.
> a storage row/cluster or maybe where RoCE is in use), and one that
> would be fairly annoying to debug and remedy post transition to RIFT
> if you didn't foresee it and have the tools (knobs) in place to
> prevent it from happening without a PR and s/w upgrade.
>
> yepp, you call the spade but you're a bit too pessimistic me thinks.
> Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If
> you lose a link in a multi-homed server you basically end up having
> the other ToR de-aggregating just this server prefix to other servers
> (even if you run some kubernetes @ scale you may have 100 prefixes or
> so I'd say, I can't imagine a server hosting thousands really) ...
> Then, if you think about the ToRs on top of PoD then it's not as bad
> as you think. If you lose a single ToR in a PoD towards a spine (I'm
> loose with terminology here) then you will NOT see disaggregation as
> long as the other ToRs in the PoD are still connected to the PoD. Draw
> pictures & run the public consumption package ;-) More interesting
> discussions are bandwidth balancing on link losses (which I think we
> solved well northbound) and whether it even should be done southbound
> since notion of "available bandwidth southbound" is confounding ...
> Spec doesn't forbid it (the beauty of loop-free valley-free routing
> that gives you insane amount of leeway how you choose to forward) BTW
> if someone is smart enough to figure that out ;-) ...

[KP]: You're not the first to describe me as a pessimist. :-) I don't
follow the 2x ToRs and multi-homed servers part, I haven't seen that
used in a very long time, and granted I've been out of the game for a
bit but is anyone still multihoming servers at scale? Maybe certain
enterprise use cases, but they're not pushing the boundaries of scale
so don't need the aggregation anyway.

[KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
or 16 switches (or more) northbound (naturally let's call that next
tier "tier-2"). If any single link between a tier-1 and tier-2 switch
goes down (let's say between tier-1-1 and tier-2-1), all other nodes
in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
that tier-2-1 no longer has southbound reachability for tier-1-1's
prefixes and that they each need to disaggregate these to prevent
tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
would then need to forward up to tier-3 and back down).
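
The decision described in the paragraph above can be sketched in a few lines. This is a simplified illustration, not the algorithm from the draft: it assumes each tier-2 node already knows every same-level node's set of southbound adjacencies, and the function name, the dictionaries and the example prefixes are all hypothetical.

```python
def prefixes_to_disaggregate(me, south_adjacencies, prefixes_behind):
    """Return the prefixes `me` should positively disaggregate.

    south_adjacencies: tier-2 node name -> set of tier-1 neighbors it can reach.
    prefixes_behind:   tier-1 node name -> set of prefixes attached to it.
    A prefix is disaggregated when `me` can reach its tier-1 node but at
    least one other tier-2 node cannot.
    """
    needed = set()
    for child in south_adjacencies[me]:
        peers_missing_child = [
            peer for peer, children in south_adjacencies.items()
            if peer != me and child not in children
        ]
        if peers_missing_child:
            needed.update(prefixes_behind.get(child, ()))
    return sorted(needed)


# The link between tier-1-1 and tier-2-1 has failed.
adjacencies = {
    "tier-2-1": {"tier-1-2", "tier-1-3"},
    "tier-2-2": {"tier-1-1", "tier-1-2", "tier-1-3"},
    "tier-2-3": {"tier-1-1", "tier-1-2", "tier-1-3"},
}
prefixes = {
    "tier-1-1": {"10.0.1.0/24"},
    "tier-1-2": {"10.0.2.0/24"},
    "tier-1-3": {"10.0.3.0/24"},
}
for node in sorted(adjacencies):
    print(node, prefixes_to_disaggregate(node, adjacencies, prefixes))
```

In this toy topology every tier-2 node except tier-2-1 ends up advertising the more specific prefix for tier-1-1, which is the "all other nodes in tier-2" behavior described above.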

[KP]: With positive disaggregation we can introduce transient
congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
because a switch may get the prefix from one upstream node first and
install that before getting it from the remaining upstream nodes. (So
we could for a brief instant go from 8 paths ECMP to 1 path then back
up to 7 paths.) On the other hand if all prefixes are disaggregated,
and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
only announcing a withdraw for the affected prefixes to tier-1-2..n,
we can avoid generating this temporary incast-like scenario by design.

[KP]: It's a preference for more deterministic behavior of the fabric
over less deterministic behavior.

> Should
> implementations have a conscious solution in advance for this, and
> what's the best way to ensure that? The 'always-disaggregate' knob is
> one. Another might be something like a 'min-next-hops' option where
> the local RIFT instance on tier 0 won't install a prefix unless it has
> received it from a minimum number of up streams
>
> The always disaggregate knob is something you can do per level if you
> desire but it's basically a big hammer buying you much bigger blast
> radius in normal operation. And if you pull RIFT onto servers in
> multi-plane fabrics your FIB may blow up if you do that (unless we
> think server adapters with 2M FIB size, probably ain't gonna happen
> ;-).

[KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
design consideration for anyone before thinking about routes from
servers. At a small number selectively it's fine and practiced, e.g.
advertising prefixes from servers that are doing software load
balancing.

[KP]: But advertising them say for every VM so you can move VMs
anywhere... that's still going to have impacts on your network design.
With FIB sizes as they are these days most people below the top 5 (or
so) are going to be fine. And anyone in the top 5 (or so) is still
going to be running into trouble. And if you do something like use the
same switch for top of rack layer as at further layers in the Clos,
then this RIFT scaling feature doesn't apply to you anyway as you have
the same FIB size at all tiers.

[KP]: In any case disaggregation at the bottom two tiers where this is
much more likely to be a problem, still permits aggregation higher.

> The other idea I don't grok, you have to explain in more detail.

[KP]: As an alternative to disaggregation and announcing a withdraw
when the link between tier-1-1 and tier-2-1 goes down, it could be
that we have all the RIFT instances on tier-1 configured to know that
they should not install a prefix *unless* they have seen it advertised
from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
nodes, we might set that to say 4. Now we somewhat avoid the incast
scenario where the switch installs the disaggregated prefix with one
next hop into its FIB. Instead it'll wait until it has a minimum of
four next-hops. (This is spit balling, it may open other problems.)
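
To make the 'min-next-hops' idea concrete, here is a small simulation of a tier-1 switch receiving the disaggregated prefix from its 7 remaining upstream nodes one advertisement at a time. It is a sketch only: `min_next_hops` is the hypothetical knob from this thread, not an option defined by the draft, and a value of 0 in the output means the more specific prefix is not installed yet, so traffic keeps following the default route.

```python
def installed_next_hops(arrival_order, min_next_hops=1):
    """Return the ECMP width of the specific prefix after each advertisement.

    arrival_order: upstream node names in the order their advertisement
                   of the disaggregated prefix arrives.
    min_next_hops: hypothetical knob; the prefix is only installed once at
                   least this many distinct upstream nodes advertise it.
    """
    seen = []
    history = []
    for upstream in arrival_order:
        if upstream not in seen:
            seen.append(upstream)
        history.append(len(seen) if len(seen) >= min_next_hops else 0)
    return history


upstreams = [f"tier-2-{i}" for i in range(2, 9)]  # tier-2-2 .. tier-2-8

print(installed_next_hops(upstreams))                   # install on first advertisement
print(installed_next_hops(upstreams, min_next_hops=4))  # gated install
```

Without the gate the specific route starts life with a single next hop, which is the brief 8 -> 1 -> 7 path collapse described above. With the gate set to 4 it first appears with four next hops, at the cost of the default route (which still includes tier-2-1 and therefore the detour via tier-3) being used a little longer.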

Cheers :-)
Kris

From nobody Thu Apr 18 15:10:29 2019
From: Bruno Rijsman
Date: Thu, 18 Apr 2019 19:10:18 -0300
To: Kris Price
Cc: Tony Przygienda, "rift@ietf.org"
Message-Id: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
Subject: Re: [Rift] RIFT

Kris,

What is your opinion on negative aggregation as a solution for the
transient incast-like congestion after a failure with positive
disaggregation?

-- Bruno

> On Apr 18, 2019, at 5:07 PM, Kris Price wrote:
>
> Hey Tony, inline:
>
> [snip]
>> On the rings: Ahh! I get it, okay that makes it better. I was also
>> wondering if some kind of designated 'S-TIE' reflector / virtual links
>> / or explicitly configured multi-hop adjacencies solution could be
>> used (the issue being one of how do you route these packets between
>> the peers without needing to do something like source route multiple
>> hops southbound before being default routed northbound).
>>
>> good, I know it takes a bit to grok the stuff. We did the best we could
>> with ASCII and language but the concepts need some chewing for sure,
>> even if you have been around big fabrics for a bit ;-) So, nothing like
>> route reflectors and so on: within a plane normal south reflection
>> takes care of sync'ing up all you need, outside the plane the ring
>> takes care of sync'ing up planes (for flooding, horizontal links below
>> ToF are south and @ ToF level north, basically), and with that you have
>> all the topology to figure out negative disaggregation. I explicitly
>> killed any "virtual link" suggestions, I went through this particular
>> hell in my life more than once and don't want to visit it anymore ;-)
>> ...
>
> [KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
> have customers buying into that then that's cool. (I omitted
> describing the physical *shudder* when I wrote "virtual links" ;-))
>
> [snip]
>> It seems this would arise frequently at the bottom two tiers of the
>> network. Any loss of any single link to any rack (tier 0) would result
>> in all other nodes at tier 1 disaggregating the prefix(es) for that
>> rack and causing the potential transient incast-like congestion. I'm a
>> bit concerned that this may be a noticeable event in some cases (e.g.
>> a storage row/cluster or maybe where RoCE is in use), and one that
>> would be fairly annoying to debug and remedy post transition to RIFT
>> if you didn't foresee it and have the tools (knobs) in place to
>> prevent it from happening without a PR and s/w upgrade.
>>
>> yepp, you call the spade but you're a bit too pessimistic me thinks.
>> Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If
>> you lose a link in a multi-homed server you basically end up having
>> the other ToR de-aggregating just this server prefix to other servers
>> (even if you run some kubernetes @ scale you may have 100 prefixes or
>> so I'd say, I can't imagine a server hosting thousands really) ...
>> Then, if you think about the ToRs on top of PoD then it's not as bad
>> as you think. If you lose a single ToR in a PoD towards a spine (I'm
>> loose with terminology here) then you will NOT see disaggregation as
>> long as the other ToRs in the PoD are still connected to the PoD. Draw
>> pictures & run the public consumption package ;-) More interesting
>> discussions are bandwidth balancing on link losses (which I think we
>> solved well northbound) and whether it even should be done southbound
>> since notion of "available bandwidth southbound" is confounding ...
>> Spec doesn't forbid it (the beauty of loop-free valley-free routing
>> that gives you insane amount of leeway how you choose to forward) BTW
>> if someone is smart enough to figure that out ;-) ...
>
> [KP]: You're not the first to describe me as a pessimist. :-) I don't
> follow the 2x ToRs and multi-homed servers part, I haven't seen that
> used in a very long time, and granted I've been out of the game for a
> bit but is anyone still multihoming servers at scale? Maybe certain
> enterprise use cases, but they're not pushing the boundaries of scale
> so don't need the aggregation anyway.
>
> [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> or 16 switches (or more) northbound (naturally let's call that next
> tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> that tier-2-1 no longer has southbound reachability for tier-1-1's
> prefixes and that they each need to disaggregate these to prevent
> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> would then need to forward up to tier-3 and back down).
>
> [KP]: With positive disaggregation we can introduce transient
> congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> because a switch may get the prefix from one upstream node first and
> install that before getting it from the remaining upstream nodes. (So
> we could for a brief instant go from 8 paths ECMP to 1 path then back
> up to 7 paths.) On the other hand if all prefixes are disaggregated,
> and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
> only announcing a withdraw for the affected prefixes to tier-1-2..n,
> we can avoid generating this temporary incast-like scenario by design.
>
> [KP]: It's a preference for more deterministic behavior of the fabric
> over less deterministic behavior.
>
>> Should
>> implementations have a conscious solution in advance for this, and
>> what's the best way to ensure that? The 'always-disaggregate' knob is
>> one. Another might be something like a 'min-next-hops' option where
>> the local RIFT instance on tier 0 won't install a prefix unless it has
>> received it from a minimum number of up streams
>>
>> The always disaggregate knob is something you can do per level if you
>> desire but it's basically a big hammer buying you much bigger blast
>> radius in normal operation. And if you pull RIFT onto servers in
>> multi-plane fabrics your FIB may blow up if you do that (unless we
>> think server adapters with 2M FIB size, probably ain't gonna happen
>> ;-).
>
> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> design consideration for anyone before thinking about routes from
> servers. At a small number selectively it's fine and practiced, e.g.
> advertising prefixes from servers that are doing software load
> balancing.
>
> [KP]: But advertising them say for every VM so you can move VMs
> anywhere... that's still going to have impacts on your network design.
> With FIB sizes as they are these days most people below the top 5 (or
> so) are going to be fine. And anyone in the top 5 (or so) is still
> going to be running into trouble. And if you do something like use the
> same switch for top of rack layer as at further layers in the Clos,
> then this RIFT scaling feature doesn't apply to you anyway as you have
> the same FIB size at all tiers.
>
> [KP]: In any case disaggregation at the bottom two tiers where this is
> much more likely to be a problem, still permits aggregation higher.
>
>> The other idea I don't grok, you have to explain in more detail.
>
> [KP]: As an alternative to disaggregation and announcing a withdraw
> when the link between tier-1-1 and tier-2-1 goes down, it could be
> that we have all the RIFT instances on tier-1 configured to know that
> they should not install a prefix *unless* they have seen it advertised
> from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> nodes, we might set that to say 4. Now we somewhat avoid the incast
> scenario where the switch installs the disaggregated prefix with one
> next hop into its FIB. Instead it'll wait until it has a minimum of
> four next-hops. (This is spit balling, it may open other problems.)
>=20 > Cheers :-) > Kris From nobody Thu Apr 18 15:11:57 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id B2430120075 for ; Thu, 18 Apr 2019 15:11:55 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -2 X-Spam-Level: X-Spam-Status: No, score=-2 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 2v-8VSePNF8u for ; Thu, 18 Apr 2019 15:11:53 -0700 (PDT) Received: from mail-qt1-x82e.google.com (mail-qt1-x82e.google.com [IPv6:2607:f8b0:4864:20::82e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 6F2E51200D8 for ; Thu, 18 Apr 2019 15:11:53 -0700 (PDT) Received: by mail-qt1-x82e.google.com with SMTP id z17so3822054qts.13 for ; Thu, 18 Apr 2019 15:11:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=j/tFjo0MlMONvR28VDFi9GqRBkNPPgGthud6YzX5nZg=; b=SalF8teNFFf4q2UU303ZGqSCwTdjdqifgNAp8D97myGwabx0s305EDsRjI4s3vcM9V gzIAx+irBTacaF+GXKRRIy+uY+x2MDUkj2wZNjEMnS2Yy0ylI5hwYLUoiORy/oumSZp6 SboQIwf4GIkhcJont6FBlT9KNORRR8OjreK6hN7ydGKKz2ZF/qZRVCkRejbLPIkjIlkS 0RyNaljAZ6A2BPPB3ptrT0S6qjmrpi1gP/Ma+qjoUBcmWfXPoQQxWl4HmU3gZFRkI2Sn DG3WK5SpjR/QNFMoaeq9Va2GamXInJQdKScPUSItytHx0OlKf0Z3kV94OT0vEjZn+28E W9lw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=j/tFjo0MlMONvR28VDFi9GqRBkNPPgGthud6YzX5nZg=; b=lAnR2poObOiUvsRek/vaqbFVlTDjXG+5InL0kRoc6eirOForL1yMt05J1GvhBQxBtY HwVZ67A0hTWyMAbcFholR7ZxXB9jxXDM5uGAoyOvhZ5O5T+hwZFcwMBbBT4KL/nCduHx MyWTnDNGeLc9aA+Nt7F+u8p3Y+fVFAG/5kYgqrS745kkI7Dx0FUfguu/dk0aWPZvijY1 Ra901jLS8pbqhn5RTZqujpqX7lNHxmOQGYYCEQSkSWNaFeqny17Lfc9UxNH7PSJwxfLI jzuDlaYs+afYEdKIwqILz/0HWxvUGwVssWZpgBhoPmXTL9+m4vb2GXntCAEWWTh92OWg nQSQ== X-Gm-Message-State: APjAAAW4izOT0FAdvhzVWuK2oEK+JGoq7jZYCtPWVTK8Ldrmx84/O6ki BoNWX/IqtTYwl6FyHTLWKFI= X-Google-Smtp-Source: APXvYqxX1PaAbC9AhnFetwRSAs2L/EQc9H938KpdiKnRldO12hG1z80GJRYGtEItOAlg20W0RJ5Atg== X-Received: by 2002:ac8:24cf:: with SMTP id t15mr473283qtt.112.1555625512397; Thu, 18 Apr 2019 15:11:52 -0700 (PDT) Received: from [192.168.0.102] (host-cotesma-176-174.smandes.com.ar. [201.220.176.174]) by smtp.gmail.com with ESMTPSA id y34sm2134024qta.96.2019.04.18.15.11.50 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 18 Apr 2019 15:11:51 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) From: Bruno Rijsman In-Reply-To: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com> Date: Thu, 18 Apr 2019 19:11:48 -0300 Cc: Tony Przygienda , "rift@ietf.org" Content-Transfer-Encoding: quoted-printable Message-Id: <9AA1773B-CF1F-4446-B3D4-DD6DF7ED131A@gmail.com> References: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com> To: Kris Price X-Mailer: Apple Mail (2.3445.100.39) Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Apr 2019 22:11:56 -0000 Typo: make that negative *disaggregation* > On Apr 18, 2019, at 7:10 PM, Bruno Rijsman = wrote: >=20 > Kris, >=20 > What is your opinion is 
> on negative aggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> -- Bruno
>
>> On Apr 18, 2019, at 5:07 PM, Kris Price wrote:
>>
>> Hey Tony, inline:
>>
>> [snip]
>>> On the rings: Ahh! I get it, okay that makes it better. I was also
>>> wondering if some kind of designated 'S-TIE' reflector / virtual links
>>> / or explicitly configured multi-hop adjacencies solution could be
>>> used (the issue being one of how do you route these packets between
>>> the peers without needing to do something like source route multiple
>>> hops southbound before being default routed northbound).
>>>
>>> good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north, basically), and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
>>
>> [KP]: I'm a bit skeptical of buy-in to rings as a solution, but if you
>> have customers buying into that then that's cool. (I omitted
>> describing the physical *shudder* when I wrote "virtual links" ;-))
>>
>> [snip]
>>> It seems this would arise frequently at the bottom two tiers of the
>>> network. Any loss of any single link to any rack (tier 0) would result
>>> in all other nodes at tier 1 disaggregating the prefix(es) for that
>>> rack and causing the potential transient incast-like congestion. I'm a
>>> bit concerned that this may be a noticeable event in some cases (e.g.
>>> a storage row/cluster or maybe where RoCE is in use), and one that
>>> would be fairly annoying to debug and remedy post transition to RIFT
>>> if you didn't foresee it and have the tools (knobs) in place to
>>> prevent it from happening without a PR and s/w upgrade.
>>>
>>> yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
>>
>> [KP]: You're not the first to describe me as a pessimist. :-) I don't
>> follow the 2x ToRs and multi-homed servers part, I haven't seen that
>> used in a very long time, and granted I've been out of the game for a
>> bit but is anyone still multihoming servers at scale? Maybe certain
>> enterprise use cases, but they're not pushing the boundaries of scale
>> so don't need the aggregation anyway.
>>
>> [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
>> or 16 switches (or more) northbound (naturally let's call that next
>> tier "tier-2"). If any single link between a tier-1 and tier-2 switch
>> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
>> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
>> that tier-2-1 no longer has southbound reachability for tier-1-1's
>> prefixes and that they each need to disaggregate these to prevent
>> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
>> would then need to forward up to tier-3 and back down).
>>
>> [KP]: With positive disaggregation we can introduce transient
>> congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
>> because a switch may get the prefix from one upstream node first and
>> install that before getting it from the remaining upstream nodes. (So
>> we could for a brief instant go from 8 paths ECMP to 1 path then back
>> up to 7 paths.) On the other hand if all prefixes are disaggregated,
>> and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
>> only announcing a withdraw for the affected prefixes to tier-1-2..n,
>> we can avoid generating this temporary incast-like scenario by design.
>>
>> [KP]: It's a preference for more deterministic behavior of the fabric
>> over less deterministic behavior.
>>
>>> Should
>>> implementations have a conscious solution in advance for this, and
>>> what's the best way to ensure that? The 'always-disaggregate' knob is
>>> one. Another might be something like a 'min-next-hops' option where
>>> the local RIFT instance on tier 0 won't install a prefix unless it has
>>> received it from a minimum number of upstreams
>>>
>>> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
>>
>> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
>> design consideration for anyone before thinking about routes from
>> servers. At a small number selectively it's fine and practiced, e.g.
>> advertising prefixes from servers that are doing software load
>> balancing.
>>
>> [KP]: But advertising them say for every VM so you can move VMs
>> anywhere... that's still going to have impacts on your network design.
>> With FIB sizes as they are these days most people below the top 5 (or
>> so) are going to be fine. And anyone in the top 5 (or so) are still
>> going to be running into trouble. And if you do something like use the
>> same switch for top of rack layer as at further layers in the Clos,
>> then this RIFT scaling feature doesn't apply to you anyway as you have
>> the same FIB size at all tiers.
>>
>> [KP]: In any case disaggregation at the bottom two tiers, where this is
>> much more likely to be a problem, still permits aggregation higher.
>>
>>> The other idea I don't grok, you have to explain in more detail.
>>
>> [KP]: As an alternative to disaggregation and announcing a withdraw
>> when the link between tier-1-1 and tier-2-1 goes down. It could be
>> that we have all the RIFT instances on tier-1 configured to know that
>> they should not install a prefix *unless* they have seen it advertised
>> from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
>> nodes, we might set that to say 4. Now we somewhat avoid the incast
>> scenario where the switch installs the disaggregated prefix with one
>> next hop into its FIB. Instead it'll wait until it has a minimum of
>> four next-hops. (This is spit balling, it may open other problems.)
>>
>> Cheers :-)
>> Kris
>

From nobody Thu Apr 18 16:10:53 2019
From: Antoni Przygienda
To: Kris Price
CC: "brunorijsman@gmail.com", "rift@ietf.org"
Date: Thu, 18 Apr 2019 23:10:40 +0000
Subject: Re: [Rift] RIFT
Thread-Topic: RIFT
List-Id: Discussion of Routing in Fat Trees
Content-Type: multipart/alternative; boundary="_000_MWHPR05MB3279E66C8723A77D342EC95AAC260MWHPR05MB3279namp_"
MIME-Version: 1.0

--_000_MWHPR05MB3279E66C8723A77D342EC95AAC260MWHPR05MB3279namp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

________________________________
From: Kris Price
Sent: Thursday, April 18, 2019 1:07 PM
To: Antoni Przygienda
Cc: brunorijsman@gmail.com; rift@ietf.org
Subject: Re: RIFT

Hey Tony, inline:

[snip]
> On the rings: Ahh! I get it, okay that makes it better. I was also
> wondering if some kind of designated 'S-TIE' reflector / virtual links
> / or explicitly configured multi-hop adjacencies solution could be
> used (the issue being one of how do you route these packets between
> the peers without needing to do something like source route multiple
> hops southbound before being default routed northbound).
>
> good, I know it takes a bit to grok the stuff.
> We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north, basically), and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...

[KP]: I'm a bit skeptical of buy-in to rings as a solution, but if you
have customers buying into that then that's cool. (I omitted
describing the physical *shudder* when I wrote "virtual links" ;-))

We spent a lot of time chewing different design points. Your choices are either "flat host routes everywhere" or "in case of failures your servers may become top of your fabric" in case of multi-plane ... Or you ring on top in multi-plane (which BTW some top 5 already do) and generally, 90% of people are happy with single-plane where you don't need any rings @ ToF ;-) ... You missed the core team discussions that have been had for weeks and months ;-) Look @ recordings pls ...

[snip]
> It seems this would arise frequently at the bottom two tiers of the
> network. Any loss of any single link to any rack (tier 0) would result
> in all other nodes at tier 1 disaggregating the prefix(es) for that
> rack and causing the potential transient incast-like congestion. I'm a
> bit concerned that this may be a noticeable event in some cases (e.g.
> a storage row/cluster or maybe where RoCE is in use), and one that
> would be fairly annoying to debug and remedy post transition to RIFT
> if you didn't foresee it and have the tools (knobs) in place to
> prevent it from happening without a PR and s/w upgrade.
>
> yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...

[KP]: You're not the first to describe me as a pessimist. :-) I don't
follow the 2x ToRs and multi-homed servers part, I haven't seen that
used in a very long time, and granted I've been out of the game for a
bit but is anyone still multihoming servers at scale? Maybe certain
enterprise use cases, but they're not pushing the boundaries of scale
so don't need the aggregation anyway.

Modern architectures I see will be moving to a good extent to ROTH IMO due to micro-segmentation and tunnel origination on servers.

[KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
or 16 switches (or more) northbound (naturally let's call that next
tier "tier-2"). If any single link between a tier-1 and tier-2 switch
goes down (let's say between tier-1-1 and tier-2-1), all other nodes
in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
that tier-2-1 no longer has southbound reachability for tier-1-1's
prefixes and that they each need to disaggregate these to prevent
tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
would then need to forward up to tier-3 and back down).

I think we have a disconnect here. ToF level will only disaggregate if a ToF loses _all_ ToP connections to a PoD in a single plane design so I don't follow your argument. If you run multi-plane design you should multi-home each PoD multiple times into your plane as well. If you don't, duh, you must disaggregate since the plane will blackhole.

[KP]: With positive disaggregation we can introduce transient
congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
because a switch may get the prefix from one upstream node first and
install that before getting it from the remaining upstream nodes. (So
we could for a brief instant go from 8 paths ECMP to 1 path then back
up to 7 paths.) On the other hand if all prefixes are disaggregated,
and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
only announcing a withdraw for the affected prefixes to tier-1-2..n,
we can avoid generating this temporary incast-like scenario by design.

[KP]: It's a preference for more deterministic behavior of the fabric
over less deterministic behavior.

Well, having blast radius of whole fabric is in a sense deterministic with every server changing/rebooting shaking whole fabric. I wouldn't call it optimal though.

Far more helpful than "deterministic" is, as in control system theory (https://en.wikipedia.org/wiki/Stability_theory), to think about "stability", where desirable positive stability is correlated with minimal blast radius. The more inputs shake more of your system the less "stability" you have.

But again, if you want to disaggregate e'thing all the time, RIFT won't stop you and you still will be benefiting from flood reduction and N-flooding-only in RIFT which makes for about 25% of normal flooding volume based on empirical data here ...

> Should
> implementations have a conscious solution in advance for this, and
> what's the best way to ensure that? The 'always-disaggregate' knob is
> one. Another might be something like a 'min-next-hops' option where
> the local RIFT instance on tier 0 won't install a prefix unless it has
> received it from a minimum number of upstreams
>
> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).

[KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
design consideration for anyone before thinking about routes from
servers. At a small number selectively it's fine and practiced, e.g.
advertising prefixes from servers that are doing software load
balancing.

Then we disconnect. Think about flat host routing & what rebooting one server does to you in terms of flooding & resulting computation and so on and what RIFT blast radius is. There is a world of difference.

[KP]: But advertising them say for every VM so you can move VMs
anywhere... that's still going to have impacts on your network design.
With FIB sizes as they are these days most people below the top 5 (or
so) are going to be fine. And anyone in the top 5 (or so) are still
going to be running into trouble. And if you do something like use the
same switch for top of rack layer as at further layers in the Clos,
then this RIFT scaling feature doesn't apply to you anyway as you have
the same FIB size at all tiers.

[KP]: In any case disaggregation at the bottom two tiers, where this is
much more likely to be a problem, still permits aggregation higher.

Obviously, your choice. If all servers in a PoD want to have all server prefixes disaggregated, RIFT won't prevent you if you get the implementation knob.

> The other idea I don't grok, you have to explain in more detail.

[KP]: As an alternative to disaggregation and announcing a withdraw
when the link between tier-1-1 and tier-2-1 goes down. It could be
that we have all the RIFT instances on tier-1 configured to know that
they should not install a prefix *unless* they have seen it advertised
from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
nodes, we might set that to say 4. Now we somewhat avoid the incast
scenario where the switch installs the disaggregated prefix with one
next hop into its FIB. Instead it'll wait until it has a minimum of
four next-hops. (This is spit balling, it may open other problems.)

yes, it's always-negative-disaggregation which is possible, however much harder to implement and you would somehow need to ring the ToP to have all the necessary topology information to achieve that (that's why we ring ToF in multi-plane design). Argument has been made before, we spent tons of time with Pascal going through pros and cons until the current design was found to be the best choice.
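
For readers following the "8 paths ECMP to 1 path then back up to 7" point and the 'min-next-hops' idea discussed in this thread, here is a rough sketch of the arithmetic. It is only an illustration under stated assumptions and is not taken from the RIFT specification or any implementation: the function name `ecmp_widths`, the `min_next_hops` parameter, and the simplification that the disaggregated prefix arrives from one tier-2 node at a time are all hypothetical.

```python
from typing import List


def ecmp_widths(advertisers: List[str], default_width: int,
                min_next_hops: int = 1) -> List[int]:
    """Return the ECMP width used for the affected prefix after each
    advertisement of the more-specific (disaggregated) prefix arrives.

    Hypothetical model: until the specific prefix is installed, traffic
    follows the default route over `default_width` paths. Once installed,
    longest-prefix match restricts traffic to the next hops that have
    advertised the specific prefix so far.
    """
    seen = set()
    widths = []
    for node in advertisers:
        seen.add(node)
        if len(seen) >= min_next_hops:
            widths.append(len(seen))      # specific prefix is in the FIB
        else:
            widths.append(default_width)  # still forwarding on the default
    return widths


# tier-2-1 lost its link to tier-1-1; tier-2-2 .. tier-2-8 disaggregate.
tier2_remaining = [f"tier-2-{i}" for i in range(2, 9)]

immediate = ecmp_widths(tier2_remaining, default_width=8)
gated = ecmp_widths(tier2_remaining, default_width=8, min_next_hops=4)

print(immediate)  # [1, 2, 3, 4, 5, 6, 7]
print(gated)      # [8, 8, 8, 4, 5, 6, 7]
```

In this toy model the gate trades the brief single-path window for a longer period on the default route, which still includes the path via tier-2-1. Whether that trade is acceptable is exactly the open question in the exchange above.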

--_000_MWHPR05MB3279E66C8723A77D342EC95AAC260MWHPR05MB3279namp_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable



From: Kris Price <kris= @krisprice.nz>
Sent: Thursday, April 18, 2019 1:07 PM
To: Antoni Przygienda
Cc: brunorijsman@gmail.com; rift@ietf.org
Subject: Re: RIFT
 
Hey Tony, inline:

[snip]
> On the rings: Ahh! I get it, okay that makes it better. I was also
> wondering if some kind of designated 'S-TIE' reflector / virtual links=
> / or explicitly configured multi-hop adjacencies solution could be
> used (the issue being one of how do you route these packets between > the peers without needing to do something like source route multiple > hops southbound before being default routed northbound).
>
> good, I know it takes bit to grok the stuff. We did the best we could = with ASCII and language but the concepts need some chewing for sure, even i= f you have been around big fabrics for a bit ;-) So, nothing like route ref= lectors and so on, within a plane normal south reflection takes care of sync'ing up all you need, outside the plane= the ring takes care of sync'ing up planes (for flooding horizontal links b= elow ToF are south and @ ToF level north basically and with that you have a= ll the topology to figure out negative disaggregation.  I explicitly killed any "virtual link" sug= gestions, I went through this particular hell in my life more than once and= don't want to visit it anymore ;-) ...

[KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
have customer's buying into that then that's cool. (I omitted
describing the physical *shudder* when I wrote "virtual links" ;-= ))

We spent a lot of time chewing different design= points. Your choices are either "flat host routes everywhere" or= "in case of failures your servers may become top of your fabric"= in case of multi-plane ... Or you ring on top in multi-plane (which BTW some top 5 already do and generally, 90% of people are happy wi= th single-plane where you don't need any rings @ ToF ;-) ...  You miss= ed the core team discussions that have been had for weeks and months ;-) Lo= ok @ recordings pls ...

[snip]
> It seems this would arise frequently at the bottom two tiers of the > network. Any loss of any single link to any rack (tier 0) would result=
> in all other nodes at tier 1 disaggregating the prefix(es) for that > rack and causing the potential transient incast-like congestion. I'm a=
> bit concerned that this may be a noticeable event in some cases (e.g.<= br> > a storage row/cluster or maybe where RoCE is in use), and one that
> would be fairly annoying to debug and remedy post transition to RIFT > if you didn't foresee it and have the tools (knobs) in place to
> prevent it from happening without a PR and s/w upgrade.
>
> yepp, you call the spade but you're a bit too pesimistic me thinks. Le= t's assume 2 ToRs dual-homing a rack or couple racks of servers. if you loo= se a link in a multi-homed server you basically end up having the other ToR= de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100= prefixes or so I'd say, I can't imagine a server hosting thousands really)= ... Then, if you think about the ToRs on top of PoD then it's not as bad a= s you think. If you loose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you wi= ll NOT see disaggregation as long the other ToRs in the PoD are still conne= cted to the PoD. Draw pictures & run the public consumption package ;-)=   More interesting discussions are bandwdith balancing on link losses (which I think we solved well northbound) and whe= ther it even should be done southbound since notion of "available band= width southbound" is confounding ... Spec doesn't forbid it (the beaut= y of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if somone is s= mart enough to figure that out ;-) ...

[KP]: You're not the first to describe me as a pessimist. :-) I don't
follow the 2x ToRs and multi-homed servers part, I haven't seen that
used in a very long time, and granted I've been out of the game for a
bit but is anyone still multihoming servers at scale? Maybe certain
enterprise use cases, but they're not pushing the boundaries of scale
so don't need the aggregation anyway.

Modern architectures I see will be moving to go= od extent to ROTH IMO due to micro-segmentation and tunnel origination on s= ervers.

[KP]: A top of rack switch ("tier-1" lets say) may be connected t= o 8
or 16 switches (or more) northbound (naturally let's call that next
tier "tier-2"). If any single link between a tier-1 and tier-2 sw= itch
goes down (let's say between tier-1-1 and tier-2-1), all other nodes
in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
that tier-2-1 no longer has southbound reachability for tier-1-1's
prefixes and that they each need to disagregate these to prevent
tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
would then need to forward up to tier-3 and back down).

I think we have a disconnect here. ToF level wi= ll only disaggregate if a ToF looses _all_ ToP connections to a PoD in a si= ngle plane design so I don't follow your argument. If you run multi-plane d= esign you should multi-home each Pod multiple times into your plane as well. If you don't, dugh, you must disag= gregate since the plane will blackhole.

[KP]: With positive disagregation we can introduce transient
congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
because a switch may get the prefix from one upstream node first and
install that before getting it from the remaining upstream nodes. (So
we could for a brief instant go from 8 paths ECMP to 1 path then back
up to 7 paths.) On the other hand if all prefixes are disaggregated,
and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
only announcing a withdraw for the affected prefixes to tier-1-2..n,
we can avoid generating this temporary incast-like scenario by design.

[KP]: It's a preference for more deterministic behavior of the fabric
over less deterministic behavior.

Well, having blast radius of whole fabric is in= a sense deterministic with every server changing/rebooting shaking whole f= abric. I wouldn't call it optimal though.

Far more helpful than "deterministic"= is in control system theory (https://en.wikipedia.org/wiki/Stability_th= eory) to think about "stability" where desireable positive stability is correlated with minimal blast radius. The more input= s shake more of your system the less "stability" you have.
3D""
In mathematics, stability theory addresses the stability of solutions of di= fferential equations and of trajectories of dynamical systems under small p= erturbations of initial conditions. The heat equation, for example, is a st= able partial differential equation because small perturbations of initial data lead to small variations in te= mperature at a later time as a result of the maximum principle.
en.wikipedia.org

But again, if you want to disaggregate everything all the time, RIFT won't stop you, and you will still benefit from flood reduction and N-flooding-only in RIFT, which makes for about 25% of normal flooding volume based on empirical data here ...

> Should
> implementations have a conscious solution in advance for this, and
> what's the best way to ensure that? The 'always-disaggregate' knob is
> one. Another might be something like a 'min-next-hops' option where
> the local RIFT instance on tier 0 won't install a prefix unless it has
> received it from a minimum number of up streams
>
> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).

[KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
design consideration for anyone before thinking about routes from
servers. At a small number selectively it's fine and practiced, e.g.
advertising prefixes from servers that are doing software load
balancing.

Then we disconnect. Think about flat host routing & what rebooting one server does to you in terms of flooding & resulting computation and so on and what RIFT blast radius is. There is a world of difference.

[KP]: But advertising them say for every VM so you can move VMs
anywhere... that's still going to have impacts on your network design.
With FIB sizes as they are these days most people below the top 5 (or
so) are going to be fine. And anyone in the top 5 (or so) are still
going to be running into trouble. And if you do something like use the
same switch for top of rack layer as at further layers in the Clos,
then this RIFT scaling feature doesn't apply to you anyway as you have
the same FIB size at all tiers.

[KP]: In any case disaggregation at the bottom two tiers where this is
much more likely to be a problem, still permits aggregation higher.

Obviously, your choice. If all servers in a PoD want to have all server prefixes disaggregated, RIFT won't prevent you if you get the implementation knob.

> The other idea I don't grok, you have to explain in more detail.

[KP]: As an alternative to disaggregation and announcing a withdraw
when the link between tier-1-1 and tier-2-1 goes down. It could be
that we have all the RIFT instances on tier-1 configured to know that
they should not install a prefix *unless* they have seen it advertised
from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
nodes, we might set that to say 4. Now we somewhat avoid the incast
scenario where the switch installs the disaggregated prefix with one
next hop into its FIB. Instead it'll wait until it has a minimum of
four next-hops. (This is spit balling, it may open other problems.)
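
The 'min-next-hops' idea can be sketched as a small gate in front of the FIB. This is a minimal Python sketch of the proposal only: `GatedPrefix`, `min_next_hops` and the node names are hypothetical, and nothing here is an existing RIFT knob or implementation. While the gate is closed, traffic keeps following the less-specific route, so part of it still lands on the upstream that lost the link and takes the longer path mentioned earlier.

```python
class GatedPrefix:
    """Hold back a disaggregated prefix until enough next hops are known."""

    def __init__(self, min_next_hops):
        self.min_next_hops = min_next_hops  # hypothetical knob
        self.next_hops = set()

    def learn(self, upstream):
        """Record an advertisement; return next hops to program, or None."""
        self.next_hops.add(upstream)
        if len(self.next_hops) >= self.min_next_hops:
            return sorted(self.next_hops)
        return None  # keep following the less-specific route for now


# Seven surviving tier-2 nodes advertise the prefix one after another.
route = GatedPrefix(min_next_hops=4)
installed = [route.learn(f"tier-2-{i}") for i in range(2, 9)]
widths = [len(nh) if nh else None for nh in installed]
print(widths)  # [None, None, None, 4, 5, 6, 7]
```

With a threshold of 4 the prefix never appears in the FIB with fewer than four next hops, at the cost of a longer period on the less-specific route.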

yes, it's always-negative-disaggregation which is possible, however much harder to implement and you would somehow need to ring the ToP to have all the necessary topology information to achieve that (that's why we ring ToF in multi-plane design). Argument has been made before, we spent tons of time with Pascal going over pros and cons until the current design was found the best choice.
--_000_MWHPR05MB3279E66C8723A77D342EC95AAC260MWHPR05MB3279namp_--

From nobody Fri Apr 19 00:11:06 2019
From: Tony Przygienda
Date: Fri, 19 Apr 2019 00:10:25 -0700
To: rift@ietf.org
Subject: [Rift] New public binary of RIFT 0.9.4 published ...
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000383b4d0586dcd2c5"

--000000000000383b4d0586dcd2c5
Content-Type: text/plain; charset="UTF-8"

https://www.juniper.net/us/en/dm/free-rift-trial/

New free, public version 0.9.4 available now, this covers also 0.9.3 content which has not been uploaded

thanks

--- tony

--000000000000383b4d0586dcd2c5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

https://www.juniper.net/us/en/dm/free-rift-trial/

New free, public version 0.9.4 available now, this covers also 0.9.3 content which has not been uploaded

thanks

--- tony
--000000000000383b4d0586dcd2c5--

From nobody Sat Apr 20 09:16:33 2019
References: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
In-Reply-To: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
From: Kris Price
Date: Sat, 20 Apr 2019 12:16:27 -0400
To: Bruno Rijsman
Cc: Tony Przygienda, "rift@ietf.org"
Subject: Re: [Rift] RIFT
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Bruno,

Not sure if I followed negative disaggregation correctly. Is this used
at levels below the top of fabric or was it something that was
discussed as a possibility?

Cheers
Kris

On Thu, Apr 18, 2019 at 6:10 PM Bruno Rijsman wrote:
>
> Kris,
>
> What is your opinion on negative aggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> -- Bruno
>
> > On Apr 18, 2019, at 5:07 PM, Kris Price wrote:
> >
> > Hey Tony, inline:
> >
> > [snip]
> >> On the rings: Ahh! I get it, okay that makes it better. I was also
> >> wondering if some kind of designated 'S-TIE' reflector / virtual links
> >> / or explicitly configured multi-hop adjacencies solution could be
> >> used (the issue being one of how do you route these packets between
> >> the peers without needing to do something like source route multiple
> >> hops southbound before being default routed northbound).
> >>
> >> good, I know it takes bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on, within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding horizontal links below ToF are south and @ ToF level north basically and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
> >
> > [KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
> > have customers buying into that then that's cool. (I omitted
> > describing the physical *shudder* when I wrote "virtual links" ;-))
> >
> > [snip]
> >> It seems this would arise frequently at the bottom two tiers of the
> >> network. Any loss of any single link to any rack (tier 0) would result
> >> in all other nodes at tier 1 disaggregating the prefix(es) for that
> >> rack and causing the potential transient incast-like congestion. I'm a
> >> bit concerned that this may be a noticeable event in some cases (e.g.
> >> a storage row/cluster or maybe where RoCE is in use), and one that
> >> would be fairly annoying to debug and remedy post transition to RIFT
> >> if you didn't foresee it and have the tools (knobs) in place to
> >> prevent it from happening without a PR and s/w upgrade.
> >>
> >> yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. if you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
> >
> > [KP]: You're not the first to describe me as a pessimist. :-) I don't
> > follow the 2x ToRs and multi-homed servers part, I haven't seen that
> > used in a very long time, and granted I've been out of the game for a
> > bit but is anyone still multihoming servers at scale? Maybe certain
> > enterprise use cases, but they're not pushing the boundaries of scale
> > so don't need the aggregation anyway.
> >
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
> >
> > [KP]: With positive disaggregation we can introduce transient
> > congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> > because a switch may get the prefix from one upstream node first and
> > install that before getting it from the remaining upstream nodes. (So
> > we could for a brief instant go from 8 paths ECMP to 1 path then back
> > up to 7 paths.) On the other hand if all prefixes are disaggregated,
> > and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
> > only announcing a withdraw for the affected prefixes to tier-1-2..n,
> > we can avoid generating this temporary incast-like scenario by design.
> >
> > [KP]: It's a preference for more deterministic behavior of the fabric
> > over less deterministic behavior.
> >
> >> Should
> >> implementations have a conscious solution in advance for this, and
> >> what's the best way to ensure that? The 'always-disaggregate' knob is
> >> one. Another might be something like a 'min-next-hops' option where
> >> the local RIFT instance on tier 0 won't install a prefix unless it has
> >> received it from a minimum number of up streams
> >>
> >> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
> >
> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
> >
> > [KP]: But advertising them say for every VM so you can move VMs
> > anywhere... that's still going to have impacts on your network design.
> > With FIB sizes as they are these days most people below the top 5 (or
> > so) are going to be fine. And anyone in the top 5 (or so) are still
> > going to be running into trouble. And if you do something like use the
> > same switch for top of rack layer as at further layers in the Clos,
> > then this RIFT scaling feature doesn't apply to you anyway as you have
> > the same FIB size at all tiers.
> >
> > [KP]: In any case disaggregation at the bottom two tiers where this is
> > much more likely to be a problem, still permits aggregation higher.
> >
> >> The other idea I don't grok, you have to explain in more detail.
> >
> > [KP]: As an alternative to disaggregation and announcing a withdraw
> > when the link between tier-1-1 and tier-2-1 goes down. It could be
> > that we have all the RIFT instances on tier-1 configured to know that
> > they should not install a prefix *unless* they have seen it advertised
> > from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> > nodes, we might set that to say 4. Now we somewhat avoid the incast
> > scenario where the switch installs the disaggregated prefix with one
> > next hop into its FIB. Instead it'll wait until it has a minimum of
> > four next-hops. (This is spit balling, it may open other problems.)
> >
> > Cheers :-)
> > Kris
>

From nobody Sat Apr 20 09:58:07 2019
From: Antoni Przygienda
To: Kris Price, Bruno Rijsman
CC: "rift@ietf.org"
Thread-Topic: RIFT
Date: Sat, 20 Apr 2019 16:57:56 +0000
References: <8B6D2FC6-15D3-4133-AAB4-160E1D82A827@gmail.com>
Subject: Re: [Rift] RIFT
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="_000_MWHPR05MB3279BEF36FCF955D93B90100AC200MWHPR05MB3279namp_"

--_000_MWHPR05MB3279BEF36FCF955D93B90100AC200MWHPR05MB3279namp_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

Kris, negative disaggregation is used if and only if

a) fabric has multiple planes
b) a node gets completely separated in terms of cross-sectional bandwidth from a plane, call that "fallen leafs"

Negative disaggregation is transitive only to the level where the breakage is healed

Sections 5.2.5.2, 6.5 explain that all in nice detail and spec out the mechanisms necessary. All in protocol since a bit now ...

thanks

--- tony

________________________________
From: Kris Price
Sent: Saturday, April 20, 2019 9:16 AM
To: Bruno Rijsman
Cc: Antoni Przygienda; rift@ietf.org
Subject: Re: RIFT

Hi Bruno,

Not sure if I followed negative disaggregation correctly. Is this used
at levels below the top of fabric or was it something that was
discussed as a possibility?

Cheers
Kris

On Thu, Apr 18, 2019 at 6:10 PM Bruno Rijsman wrote:
>
> Kris,
>
> What is your opinion on negative aggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> -- Bruno
>
> > On Apr 18, 2019, at 5:07 PM, Kris Price wrote:
> >
> > Hey Tony, inline:
> >
> > [snip]
> >> On the rings: Ahh! I get it, okay that makes it better. I was also
> >> wondering if some kind of designated 'S-TIE' reflector / virtual links
> >> / or explicitly configured multi-hop adjacencies solution could be
> >> used (the issue being one of how do you route these packets between
> >> the peers without needing to do something like source route multiple
> >> hops southbound before being default routed northbound).
> >>
> >> good, I know it takes bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on, within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding horizontal links below ToF are south and @ ToF level north basically and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
> >
> > [KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
> > have customers buying into that then that's cool. (I omitted
> > describing the physical *shudder* when I wrote "virtual links" ;-))
> >
> > [snip]
> >> It seems this would arise frequently at the bottom two tiers of the
> >> network. Any loss of any single link to any rack (tier 0) would result
> >> in all other nodes at tier 1 disaggregating the prefix(es) for that
> >> rack and causing the potential transient incast-like congestion. I'm a
> >> bit concerned that this may be a noticeable event in some cases (e.g.
> >> a storage row/cluster or maybe where RoCE is in use), and one that
> >> would be fairly annoying to debug and remedy post transition to RIFT
> >> if you didn't foresee it and have the tools (knobs) in place to
> >> prevent it from happening without a PR and s/w upgrade.
> >>
> >> yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or couple racks of servers. if you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you insane amount of lee-way how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
> >
> > [KP]: You're not the first to describe me as a pessimist. :-) I don't
> > follow the 2x ToRs and multi-homed servers part, I haven't seen that
> > used in a very long time, and granted I've been out of the game for a
> > bit but is anyone still multihoming servers at scale? Maybe certain
> > enterprise use cases, but they're not pushing the boundaries of scale
> > so don't need the aggregation anyway.
> >
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
> >
> > [KP]: With positive disaggregation we can introduce transient
> > congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> > because a switch may get the prefix from one upstream node first and
> > install that before getting it from the remaining upstream nodes. (So
> > we could for a brief instant go from 8 paths ECMP to 1 path then back
> > up to 7 paths.) On the other hand if all prefixes are disaggregated,
> > and when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is now
> > only announcing a withdraw for the affected prefixes to tier-1-2..n,
> > we can avoid generating this temporary incast-like scenario by design.
> >
> > [KP]: It's a preference for more deterministic behavior of the fabric
> > over less deterministic behavior.
> >
> >> Should
> >> implementations have a conscious solution in advance for this, and
> >> what's the best way to ensure that? The 'always-disaggregate' knob is
> >> one. Another might be something like a 'min-next-hops' option where
> >> the local RIFT instance on tier 0 won't install a prefix unless it has
> >> received it from a minimum number of up streams
> >>
> >> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
> >
> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
> >
> > [KP]: But advertising them say for every VM so you can move VMs
> > anywhere... that's still going to have impacts on your network design.
> > With FIB sizes as they are these days most people below the top 5 (or
> > so) are going to be fine. And anyone in the top 5 (or so) are still
> > going to be running into trouble. And if you do something like use the
> > same switch for top of rack layer as at further layers in the Clos,
> > then this RIFT scaling feature doesn't apply to you anyway as you have
> > the same FIB size at all tiers.
> >
> > [KP]: In any case disaggregation at the bottom two tiers where this is
> > much more likely to be a problem, still permits aggregation higher.
> >
> >> The other idea I don't grok, you have to explain in more detail.
> >
> > [KP]: As an alternative to disaggregation and announcing a withdraw
> > when the link between tier-1-1 and tier-2-1 goes down. It could be
> > that we have all the RIFT instances on tier-1 configured to know that
> > they should not install a prefix *unless* they have seen it advertised
> > from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> > nodes, we might set that to say 4. Now we somewhat avoid the incast
> > scenario where the switch installs the disaggregated prefix with one
> > next hop into its FIB. Instead it'll wait until it has a minimum of
> > four next-hops. (This is spit balling, it may open other problems.)
> >
> > Cheers :-)
> > Kris
>

--_000_MWHPR05MB3279BEF36FCF955D93B90100AC200MWHPR05MB3279namp_
Content-Type: text/html; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

Kris, negative disaggregation is used if and only if

a) the fabric has multiple planes
b) a node gets completely separated, in terms of cross-sectional bandwidth, from a plane; call that a "fallen leaf"

Negative disaggregation is transitive only to the level where the breakage is healed.

Sections 5.2.5.2 and 6.5 explain all of that in nice detail and spec out the mechanisms necessary. It has all been in the protocol for a bit now ...

thanks

--- tony

From: Kris Price <kris@krisprice.nz>
Sent: Saturday, April 20, 2019 9:16 AM
To: Bruno Rijsman
Cc: Antoni Przygienda; rift@ietf.org
Subject: Re: RIFT
 
Hi Bruno,

Not sure if I followed negative disaggregation correctly. Is this used
at levels below the top of fabric or was it something that was
discussed as a possibility?

Cheers
Kris

On Thu, Apr 18, 2019 at 6:10 PM Bruno Rijsman <brunorijsman@gmail.com> wrote:
>
> Kris,
>
> What is your opinion on negative disaggregation as a solution for the transient incast-like congestion after a failure with positive disaggregation?
>
> -- Bruno
>
> > On Apr 18, 2019, at 5:07 PM, Kris Price <kris@krisprice.nz> wrote:
> >
> > Hey Tony, inline:
> >
> > [snip]
> >> On the rings: Ahh! I get it, okay that makes it better. I was also
> >> wondering if some kind of designated 'S-TIE' reflector / virtual links
> >> / or explicitly configured multi-hop adjacencies solution could be
> >> used (the issue being one of how do you route these packets between
> >> the peers without needing to do something like source route multiple
> >> hops southbound before being default routed northbound).
> >>
> >> good, I know it takes a bit to grok the stuff. We did the best we could with ASCII and language but the concepts need some chewing for sure, even if you have been around big fabrics for a bit ;-) So, nothing like route reflectors and so on: within a plane normal south reflection takes care of sync'ing up all you need, outside the plane the ring takes care of sync'ing up planes (for flooding, horizontal links below ToF are south and @ ToF level north basically) and with that you have all the topology to figure out negative disaggregation. I explicitly killed any "virtual link" suggestions, I went through this particular hell in my life more than once and don't want to visit it anymore ;-) ...
> >
> > [KP]: I'm a bit skeptical of buy in to rings as a solution, but if you
> > have customers buying into that then that's cool. (I omitted
> > describing the physical *shudder* when I wrote "virtual links" ;-))
> >
> > [snip]
> >> It seems this would arise frequently at the bottom two tiers of the
> >> network. Any loss of any single link to any rack (tier 0) would result
> >> in all other nodes at tier 1 disaggregating the prefix(es) for that
> >> rack and causing the potential transient incast-like congestion. I'm a
> >> bit concerned that this may be a noticeable event in some cases (e.g.
> >> a storage row/cluster or maybe where RoCE is in use), and one that
> >> would be fairly annoying to debug and remedy post transition to RIFT
> >> if you didn't foresee it and have the tools (knobs) in place to
> >> prevent it from happening without a PR and s/w upgrade.
> >>
> >> yepp, you call the spade but you're a bit too pessimistic me thinks. Let's assume 2 ToRs dual-homing a rack or a couple of racks of servers. If you lose a link in a multi-homed server you basically end up having the other ToR de-aggregating just this server prefix to other servers (even if you run some kubernetes @ scale you may have 100 prefixes or so I'd say, I can't imagine a server hosting thousands really) ... Then, if you think about the ToRs on top of PoD then it's not as bad as you think. If you lose a single ToR in a PoD towards a spine (I'm loose with terminology here) then you will NOT see disaggregation as long as the other ToRs in the PoD are still connected to the PoD. Draw pictures & run the public consumption package ;-) More interesting discussions are bandwidth balancing on link losses (which I think we solved well northbound) and whether it even should be done southbound since the notion of "available bandwidth southbound" is confounding ... Spec doesn't forbid it (the beauty of loop-free valley-free routing that gives you an insane amount of lee-way in how you choose to forward) BTW if someone is smart enough to figure that out ;-) ...
> >
> > [KP]: You're not the first to describe me as a pessimist. :-) I don't
> > follow the 2x ToRs and multi-homed servers part, I haven't seen that
> > used in a very long time, and granted I've been out of the game for a
> > bit but is anyone still multihoming servers at scale? Maybe certain
> > enterprise use cases, but they're not pushing the boundaries of scale
> > so don't need the aggregation anyway.
> >
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
> >
> > [KP]: With positive disaggregation we can introduce transient
> > congestion if there's a lot of traffic from tier-1-2..n to tier-1-1
> > because a switch may get the prefix from one upstream node first and
> > install that before getting it from the remaining upstream nodes. (So
> > we could for a brief instant go from 8 paths ECMP to 1 path then back
> > up to 7 paths.) On the other hand, if all prefixes are disaggregated,
> > then when the link between tier-1-1 and tier-2-1 fails, tier-2-1 is
> > only announcing a withdraw for the affected prefixes to tier-1-2..n,
> > and we can avoid generating this temporary incast-like scenario by design.
> >
> > [KP]: It's a preference for more deterministic behavior of the fabric
> > over less deterministic behavior.
> >
> >> Should
> >> implementations have a conscious solution in advance for this, and
> >> what's the best way to ensure that? The 'always-disaggregate' knob is
> >> one. Another might be something like a 'min-next-hops' option where
> >> the local RIFT instance on tier 0 won't install a prefix unless it has
> >> received it from a minimum number of upstreams.
> >>
> >> The always disaggregate knob is something you can do per level if you desire but it's basically a big hammer buying you a much bigger blast radius in normal operation. And if you pull RIFT onto servers in multi-plane fabrics your FIB may blow up if you do that (unless we think server adapters with 2M FIB size, probably ain't gonna happen ;-).
> >
> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
> >
> > [KP]: But advertising them say for every VM so you can move VMs
> > anywhere... that's still going to have impacts on your network design.
> > With FIB sizes as they are these days most people below the top 5 (or
> > so) are going to be fine. And anyone in the top 5 (or so) is still
> > going to be running into trouble. And if you do something like use the
> > same switch for the top of rack layer as at further layers in the Clos,
> > then this RIFT scaling feature doesn't apply to you anyway as you have
> > the same FIB size at all tiers.
> >
> > [KP]: In any case disaggregation at the bottom two tiers, where this is
> > much more likely to be a problem, still permits aggregation higher.
> >
> >> The other idea I don't grok, you have to explain in more detail.
> >
> > [KP]: As an alternative to disaggregation and announcing a withdraw
> > when the link between tier-1-1 and tier-2-1 goes down, it could be
> > that we have all the RIFT instances on tier-1 configured to know that
> > they should not install a prefix *unless* they have seen it advertised
> > from some minimum number of tier-2 nodes. E.g. if there are 8 tier-2
> > nodes, we might set that to say 4. Now we somewhat avoid the incast
> > scenario where the switch installs the disaggregated prefix with one
> > next hop into its FIB. Instead it'll wait until it has a minimum of
> > four next-hops. (This is spitballing, it may open other problems.)
> >
> > Cheers :-)
> > Kris
>
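
[Editor's illustration] The transient "8 paths, then 1, then back up to 7" and the 'min-next-hops' idea floated in the quoted text can be made concrete with a small replay. This is only a sketch under stated assumptions: `replay_advertisements`, `ecmp_widths` and the `min_next_hops` parameter are hypothetical names invented here, not RIFT spec mechanisms or any implementation's knobs. It assumes 8 tier-2 nodes, that tier-2-1 has lost its link, and that the 7 survivors' disaggregated advertisements arrive one at a time.

```python
from typing import FrozenSet, Iterable, List, Optional


def replay_advertisements(
    arrivals: Iterable[str], min_next_hops: int = 1
) -> List[Optional[FrozenSet[str]]]:
    """Replay disaggregated-prefix advertisements in arrival order.

    After each arrival, record the next-hop set installed for the
    more-specific prefix, or None while it is withheld and traffic
    still follows the default route. (Hypothetical gating rule.)
    """
    seen = set()
    history = []
    for upstream in arrivals:
        seen.add(upstream)
        history.append(frozenset(seen) if len(seen) >= min_next_hops else None)
    return history


def ecmp_widths(history, default_width: int) -> List[int]:
    """ECMP fan-out after each arrival; the default route counts as default_width."""
    return [default_width if hops is None else len(hops) for hops in history]


# tier-2-1 lost its link; tier-2-2 .. tier-2-8 disaggregate, one after another.
survivors = [f"tier-2-{i}" for i in range(2, 9)]

eager = ecmp_widths(replay_advertisements(survivors), default_width=8)
gated = ecmp_widths(replay_advertisements(survivors, min_next_hops=4), default_width=8)

print(eager)  # [1, 2, 3, 4, 5, 6, 7]
print(gated)  # [8, 8, 8, 4, 5, 6, 7]
```

Note the trade-off the sketch makes visible: while the gated prefix is withheld, the 8-wide default still includes the path through tier-2-1, so that share of traffic takes the detour via tier-3 instead of collapsing onto a single next hop.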
--_000_MWHPR05MB3279BEF36FCF955D93B90100AC200MWHPR05MB3279namp_-- From nobody Sat Apr 20 11:58:47 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CACA9120164 for ; Sat, 20 Apr 2019 11:58:44 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.9 X-Spam-Level: X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=krisprice-nz.20150623.gappssmtp.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id uIlQxW3P1KnE for ; Sat, 20 Apr 2019 11:58:42 -0700 (PDT) Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id EB9AA120162 for ; Sat, 20 Apr 2019 11:58:41 -0700 (PDT) Received: by mail-lf1-x131.google.com with SMTP id i68so6214431lfi.10 for ; Sat, 20 Apr 2019 11:58:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=krisprice-nz.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=0JUdCnqtE5ocWuTCr6MQrvJ2LCxLPENekz/JFLRf+2M=; b=ylYK3Jo/i99QVEhLuE6P6OBmTRKu+JfnvmzGOws647p37Zkq5mC9+TiPRTjIXPDXWi JcKkRYDJN/MNJrXfAt3fxZRwHER/213203g5b6mY58tdtAyk/jE87wqJP+qkA3olAzAG a0xeox3XX0rvVU5sQkJqRcmC/yBDJRUfoP9uf+2Wysy6Q/hJeN6G/wlQEdESorgu08lU srz9JwfHEHUxXPXA9Tn4VhXA5t7SpFLf4h/Lb4N8UN0w0Z1mc3MMoabdOZHdPAZTfKs8 bnL09FrNXF5gQ70N7SLn3KrNYJ38iuhOCbU37i4TIRk0RZGHBvQ+/6vAig8ssgulI8yY 9j2Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=0JUdCnqtE5ocWuTCr6MQrvJ2LCxLPENekz/JFLRf+2M=; b=filEn5hNJ7wdiUVsO6eC3U3M9qvqBUVE/OS0y5h4aEUSJGXx46OFtVPMaQuOEV4XB0 niFuyrXffAndzXTPE5OxQ+jp1EiKQj6YL+eG/RqiZEsi33m92g4rzv/DpQpWol8BmEvi lAPkdUZRx9gnTp0uMgfZ7NfsZJpXcCeGVr4eXmOBTJIoyAO4yRW2BamJW+ETGBXOhdwh x5rvy3yOtwi83sEbTeEkjsnDEfwaCd9yxCyUt9puGXvG+UzMncA5W1mCjs47HwqWPmhY nrCdxosefLGOzOSWmG9xL2usQa7mOuu1/xHMFxm/tDc2HhI/W9LdDF2o+Z3YQZuLwa4H nSJQ== X-Gm-Message-State: APjAAAUNZjwkzLxuJlVG4pDspp6iO3JL5vtO3TpU4fOKeDzrgSaae61q BaRopgD7xtKYyx4LQUVMFvpTP0ji/uXirIfiNUmDhg== X-Google-Smtp-Source: APXvYqyxIzIbeQkx6vU0yt9DDozXCBDViWNHOrLQeFy8o4LtWmykhTMl7GyzqVy6ae7IIUax9qhWbbH1KKXJYXORVq4= X-Received: by 2002:ac2:5a47:: with SMTP id r7mr6150730lfn.116.1555786720021; Sat, 20 Apr 2019 11:58:40 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Kris Price Date: Sat, 20 Apr 2019 14:58:41 -0400 Message-ID: To: Antoni Przygienda Cc: "brunorijsman@gmail.com" , "rift@ietf.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 20 Apr 2019 18:58:45 -0000

Inline:

[snip]
> [Prz] Modern architectures I see will be moving to a good extent to ROTH
> IMO due to micro-segmentation and tunnel origination on servers.

[KP]: Will respond to this further down.

> [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> or 16 switches (or more) northbound (naturally let's call that next
> tier "tier-2").
> If any single link between a tier-1 and tier-2 switch
> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> that tier-2-1 no longer has southbound reachability for tier-1-1's
> prefixes and that they each need to disaggregate these to prevent
> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> would then need to forward up to tier-3 and back down).
>
> [Prz] I think we have a disconnect here. ToF level will only disaggregate
> if a ToF loses _all_ ToP connections to a PoD in a single plane design, so
> I don't follow your argument. If you run a multi-plane design you should
> multi-home each PoD multiple times into your plane as well. If you don't,
> duh, you must disaggregate since the plane will blackhole.

[KP]: I think the disconnect is due to my not using RIFT labels for
devices. I'm not talking about top of fabric. In RIFT labels I'm
describing a PoD, where the Leafs are top of rack switches. Then when
a Top of PoD<->Leaf link fails, the other Top of PoD switches will
disaggregate the prefixes on and below that Leaf, leading to the incast
problem described. (This is described in the draft.)

[snip]
> [KP]: It's a preference for more deterministic behavior of the fabric
> over less deterministic behavior.
>
> [Prz] Well, having a blast radius of the whole fabric is in a sense
> deterministic, with every server changing/rebooting shaking the whole
> fabric. I wouldn't call it optimal though.

[KP]: Will respond to this further down.

> Far more helpful than "deterministic" is, as in control system theory
> (https://en.wikipedia.org/wiki/Stability_theory), to think about
> "stability", where desirable positive stability is correlated with minimal
> blast radius. The more inputs shake more of your system, the less
> "stability" you have.

[KP]: Absolutely, I'm not a mathematician, but reducing the amount of
change under small perturbations was a concern at the back of my head
when I described a preference for more deterministic behavior. Aside
from addressing the incast concern, it was an intuition that adding
routes and subtracting routes as they come and go would be less
change than mass adds/removals when disaggregating. So e.g. sticking
with the example of one PoD, where the link between a top of rack and
top of PoD switch goes down: that means one top of PoD device
withdraws one route (or maybe 1*many routes if routing on host is
happening), and that is less churn than say 7 other top of PoD
switches advertising 7*many new routes.

> [Prz] But again, if you want to disaggregate everything all the time, RIFT
> won't stop you and you still will be benefiting from flood reduction and
> N-flooding-only in RIFT, which makes for about 25% of normal flooding
> volume based on empirical data here ...

[KP]: That's good, that was my hope. There are benefits to RIFT beyond
its positive disaggregation feature.

[snip]
> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> design consideration for anyone before thinking about routes from
> servers. At a small number selectively it's fine and practiced, e.g.
> advertising prefixes from servers that are doing software load
> balancing.
>
> [Prz] Then we disconnect. Think about flat host routing & what rebooting
> one server does to you in terms of flooding & resulting computation and so
> on, and what RIFT blast radius is. There is a world of difference.

[KP]: I don't fully follow the statements about routing on the host
and shaking the fabric. Sure it would be a bad idea[tm] to do this in
a flat OSPF domain. As I understood it in RIFT with its link state
up, distance vector down design, if we have a route come and go at the
edge then it will be propagated all the way northbound, but will not
be propagated southbound.
If we were disaggregated between the Top of
Rack switches and the Top of PoD switches, then a route appearing or
disappearing does propagate back down from the Top of PoD to the Top
of Rack, but that's it. It doesn't shake the wider fabric any more
than without disaggregation. If we were disaggregated between the Top of
Fabric and Top of PoD, routes would be propagated back down to the
other Top of PoDs, but not below.

[KP]: WRT the routing on host, it also seems in contradiction with
concerns about fabric stability. If fabric stability is a concern I
would think you still want addressing hierarchy and to use another
layer of indirection to achieve service mobility, to keep the fabric
unaware of the services constantly popping in and out of existence.
Yes, selectively this works, e.g. in the software load balancer
scenario, tunnel ingest, etc. That is fine and widely practiced within
limits. But the way you're describing it sounds like it's expected to
be used generously, with every host announcing prefixes, and there's
an expectation to move those prefixes such that you end up with a
random distribution. (Which is fine, I am probably out of touch with
fashion.) So with disaggregation in the PoD, servers being single homed
would still see just the default. But all Top of Rack switches will
see the prefixes from other servers in the PoD, vs. when aggregation
is in effect and they'll only see the default (plus any disaggregated
due to a failure). The Top of PoD will have all prefixes in the PoD
and below, and further up in higher layers they'll all need to scale
up their FIB requirements to see all fabric routes. That's the same in
all cases with RIFT due to link state up.

[KP]: From purely the scaling perspective the aggregation feature is
useful where the number of routes produced by the servers in a PoD can
overwhelm the top of rack switches in that PoD but not the Top of PoD
switches, and so on up higher layers in the fabric.
The FIB available
in devices these days would seem to preclude that.

[snip]
> yes, it's always-negative-disaggregation which is possible, however much
> harder to implement and you would somehow need to ring the ToP to have all
> the necessary topology information to achieve that (that's why we ring ToF
> in multi-plane design). Argument has been made before, we spent tons of
> time with Pascal going over pros and cons until the current design was
> found the best choice.[snip]

[KP]: It looks like negative disaggregation could be an elegant
protocol level solution if feasible and reliable.

[KP]: I think the answer is that RIFT has looked into the problem and
prospective operators are happy with the risks and trade-offs, so I
accept that. :-)

Thanks!
Kris

From nobody Sat Apr 20 18:36:36 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id E378B1201C5 for ; Sat, 20 Apr 2019 18:36:34 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.998 X-Spam-Level: X-Spam-Status: No, score=-1.998 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id H_AnzfZBVQZl for ; Sat, 20 Apr 2019 18:36:31 -0700 (PDT) Received: from mail-ed1-x52a.google.com (mail-ed1-x52a.google.com [IPv6:2a00:1450:4864:20::52a]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 4844E1201A9 for ; Sat, 20 Apr 2019 18:36:31 -0700 (PDT) Received: by
mail-ed1-x52a.google.com with SMTP id k45so7095144edb.6 for ; Sat, 20 Apr 2019 18:36:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=sjslSBdqMNreLQFeCi+MbI/kxd2MpV8PsJmEQq2MgE8=; b=YCoxE1CcKZX34CO4CUnCzllatVxqPWC1+1xaaKCTFvmO94fv9N6YYSPFGDuUdnmvak gfnM+j1BLCLfCpoBFUUH3BY+Uo1x50o8Gw6ybjP/2G0J1z5j5p6nYU7roZQY+oWtdd8d xIFtvgy99JbwvIxw1SGdkvaaxCMRv2/Ib2QodHKSPGPWQpKvpF6GCdEOy8AQj1Mi4QNV VBcQ4rvynya5jtTwCMBSnvTvBYbTApU6TT5iwDgYjJmsypxBdTBaMOo+guCJOLeZkzOy Qce2oXtk0CpRTE/Zy29olitM5PIRD24UrGT3j8Od+jG7PI1ol3zIYhIOWN7bk3uBlPga lGrA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=sjslSBdqMNreLQFeCi+MbI/kxd2MpV8PsJmEQq2MgE8=; b=eiOaDq18MlWUUEYnG5puxSLvpTraRLvpoeYQEypwUQTmlAefB31e7wyeqMvKvTJnvB VP8D/F3vvX6gsOReBwm++GehUjOxCcLrxKObe3+/bMtvBLUP23GkmqvrS61GY7p5f9Wa NjaQokHAK5J6rUaGUcY0uT3ETKwTm+XzjZSM+Kgjt3Kgq0deqhoL6iY8SaFscz7eh5/7 n+7bdc/bHfQe/S3GKCrpa0uuw7apRTmis5ghgOmreCgwkVYFgb75/XmZldn4P1H868To 8IpZnWIY8Sn5spM7W9+YE+xZ+0/rmD7gH2OmtSq7rZh68PaQGnVXPqjaoQQpJ733faWX e5Jw== X-Gm-Message-State: APjAAAUSa742+OenIFBQvuKsaUcYFKPc1Mo2Vfvb1FkdoUDTCVpX9nbs iZ98Q8DqgOSr3XUfl5K3sjimicfYNGTBZbSk3j0= X-Google-Smtp-Source: APXvYqz5FB4JKbpOVjtQRRYZ6xwMA6FnXUvZ0lM7hVgKqvt81gwGolAo2OGp7UXAEjur6BAmnWvAG+de4HqUESDRSL0= X-Received: by 2002:a50:9052:: with SMTP id z18mr7289770edz.256.1555810589544; Sat, 20 Apr 2019 18:36:29 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Tony Przygienda Date: Sat, 20 Apr 2019 18:35:52 -0700 Message-ID: To: Kris Price Cc: Antoni Przygienda , "rift@ietf.org" , "brunorijsman@gmail.com" Content-Type: multipart/alternative; boundary="00000000000087174e05870061e3" Archived-At: Subject: Re: [Rift] RIFT X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of 
Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Apr 2019 01:36:35 -0000

--00000000000087174e05870061e3
Content-Type: text/plain; charset="UTF-8"

On Sat, Apr 20, 2019 at 11:58 AM Kris Price wrote:

> ...
>
> > [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> > or 16 switches (or more) northbound (naturally let's call that next
> > tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> > goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> > in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> > that tier-2-1 no longer has southbound reachability for tier-1-1's
> > prefixes and that they each need to disaggregate these to prevent
> > tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> > would then need to forward up to tier-3 and back down).
>

Let's look @ a figure

.    [A,B,C,D]
.    [E]
.               +-----+      +-----+                Level 2
.               |  E  |      |  F  |  A/32 @ [C,D]
.               +-+-+-+      +-+-+-+  B/32 @ [C,D]
.                 | |          | |    C/32 @ C
.                 | |    +-----+ |    D/32 @ D
.                 | |    |       |
.                 | +------+     |
.                 |      | |     |                  Level 1
.       [A,B]   +-+---+  | | +---+-+  [A,B]
.       [D]     |  C  +--+ +-+  D  |  [C]
.               +-+-+-+      +-+-+-+
. 0/0 @ [E,F]     | |          | |    0/0 @ [E,F]
. A/32 @ A        | |    +-----+ |    A/32 @ A
. B/32 @ B        | |    |       |    B/32 @ B
.                 | +------+     |
.                 |      | |     |
.               +-+---+  | | +---+-+
.               |  A  +--+ +-+  B  |                Level 0
. 0/0 @ [C,D]   +-----+      +-----+  0/0 @ [C,D]

Let's call A a ToR holding 8 server addresses. If you lose D-A, the
only disaggregation you will see is C disaggregating the 8 addresses to B.
This is unavoidable. I assume we agree.

> > [Prz] I think we have a disconnect here. ToF level will only
> > disaggregate if a ToF loses _all_ ToP connections to a PoD in a single
> > plane design, so I don't follow your argument. If you run a multi-plane
> > design you should multi-home each PoD multiple times into your plane as
> > well. If you don't, duh, you must disaggregate since the plane will
> > blackhole.
>
> [KP]: I think the disconnect is due to my not using RIFT labels for
> devices. I'm not talking about top of fabric. In RIFT labels I'm
> describing a PoD, where the Leafs are top of rack switches. Then when
> a Top of PoD<->Leaf link fails, the other Top of PoD switches will
> disaggregate the prefixes on and below that Leaf, leading to the incast
> problem described. (This is described in the draft.)
>

right. It's really the only possible choice: either you give up
aggregation, or you react to a link failure by disaggregating.

>
> > Far more helpful than "deterministic" is, as in control system theory
> > (https://en.wikipedia.org/wiki/Stability_theory), to think about
> > "stability", where desirable positive stability is correlated with
> > minimal blast radius. The more inputs shake more of your system, the
> > less "stability" you have.
>
> [KP]: Absolutely, I'm not a mathematician, but reducing the amount of
> change under small perturbations was a concern at the back of my head
> when I described a preference for more deterministic behavior. Aside
> from addressing the incast concern, it was an intuition that adding
> routes and subtracting routes as they come and go would be less
> change than mass adds/removals when disaggregating. So e.g. sticking
> with the example of one PoD, where the link between a top of rack and
> top of PoD switch goes down: that means one top of PoD device
> withdraws one route (or maybe 1*many routes if routing on host is
> happening), and that is less churn than say 7 other top of PoD
> switches advertising 7*many new routes.
>

if that's your preference you can simply configure all your Level 1
switches to always disaggregate @ the cost of extra flooding on every
address change and FIB size in Level 0.

> > [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> > design consideration for anyone before thinking about routes from
> > servers. At a small number selectively it's fine and practiced, e.g.
> > advertising prefixes from servers that are doing software load
> > balancing.
>

Your blast radius is somewhat bigger. Every server coming up will affect
all Level 0s in the PoD.

>
> [KP]: I don't fully follow the statements about routing on the host
> and shaking the fabric. Sure it would be a bad idea[tm] to do this in
> a flat OSPF domain. As I understood it in RIFT with its link state
> up, distance vector down design, if we have a route come and go at the
> edge then it will be propagated all the way northbound, but will not
> be propagated southbound. If we were disaggregated between the Top of
> Rack switches and the Top of PoD switches, then a route appearing or
> disappearing does propagate back down from the Top of PoD to the Top
> of Rack, but that's it. It doesn't shake the wider fabric any more
> than without disaggregation. If we were disaggregated between the Top of
> Fabric and Top of PoD, routes would be propagated back down to the
> other Top of PoDs, but not below.
>

Yes, I meant if you run a flat OSPF domain with host routes (as people
actually do if the scale holds up). Otherwise, yes, we agree. We just
needed to talk in the same words about the same things ;-)

>
> [KP]: WRT the routing on host, it also seems in contradiction with
> concerns about fabric stability. If fabric stability is a concern I
> would think you still want addressing hierarchy and to use another
> layer of indirection to achieve service mobility, to keep the fabric
> unaware of the services constantly popping in and out of existence.
>

yes and no. Depends what you need. If you want to multi-home servers
(since the impact on your services is non-negligible if you lose e.g. a
ToR) and need automatic bandwidth balancing north (nice thing), no need for
MC-LAG in L2, tunnel origination on server without stitching, automatic
disaggregation on failures, view of full topology on top of fabric and so
on, this has lots of appeal.
And then, if you start to do things like running
BIRD on host & then redistribute a default route and so on, you don't have
lots of these capabilities and on top you get another layer/protocol
instance to manage.

> Yes, selectively this works, e.g. in the software load balancer
> scenario, tunnel ingest, etc. That is fine and widely practiced within
> limits. But the way you're describing it sounds like it's expected to
> be used generously, with every host announcing prefixes, and there's
> an expectation to move those prefixes such that you end up with a
> random distribution. (Which is fine, I am probably out of touch with
> fashion.)

yupp, that is the expectation (i.e. RIFT is designed to be able to support
that if needed). Look @ the mobility section ;-) Then you really have a
"fabric" vs. a "network", i.e. something that gives you bandwidth the same
way chips give you RAM. You don't think about which RAM bank your
allocation has to reside on to work, so why should you be concerned about
where and how you hook stuff up and whether your services move addresses
if all you need is just "more bandwidth".

> So with disaggregation in the PoD, servers being single homed
> would still see just the default.

if your server is single homed, running any kind of routing protocol seems
a waste really (unless you statically provision addresses and want them
carried through rather than using DHCP and so on). You can as well point a
static out, it's not like you can load balance, react to failures or
anything much.

> But all Top of Rack switches will
> see the prefixes from other servers in the PoD, vs. when aggregation
> is in effect and they'll only see the default (plus any disaggregated
> due to a failure). The Top of PoD will have all prefixes in the PoD
> and below, and further up in higher layers they'll all need to scale
> up their FIB requirements to see all fabric routes. That's the same in
> all cases with RIFT due to link state up.
>

Only ToF needs all routes (which is level 2 in a 5-stage folded fabric) in
the case of a single plane fabric. In a multi-plane fabric things are more
complex. Any reasonable failure should be healed by negative disaggregation
in levels higher up, but one could construct completely pathological
scenarios where you have to propagate all the way down: if a server can
reach another server through certain planes only, it must know which planes
to avoid to prevent an up-fabric/down-fabric/up-fabric path that effectively
turns other servers into ToF. We call that "fabric inversion" and it seems
extremely undesirable. BTW, in such scenarios your flooding on normal
protocols also has to go up/down/up, so once that happens you really don't
have any kind of "hierarchical fabric" but a bunch of nodes & links where
traffic tries to get places somehow. I think the draft explains that
decently well.

>
> [KP]: From purely the scaling perspective the aggregation feature is
> useful where the number of routes produced by the servers in a PoD can
> overwhelm the top of rack switches in that PoD but not the Top of PoD
> switches, and so on up higher layers in the fabric. The FIB available
> in devices these days would seem to preclude that.
>

Right, so it's an interesting discussion and you're very focused on the
way you prefer to deploy it, and then that all makes sense. But if, for the
reasons above, you pull RIFT all the way down into multi-homed servers, you
realize that your FIB is small and that storing your underlay routes
competes directly with your overlay routes, which are the ones paying the
bills, so the dynamic changes. If your ToRs are originating overlays (as in
EVPN e.g.) you'll face the same calculus. RIFT is agnostic: run it to the
ToR and disaggregate servers if you want, it will work fine, but it also
allows you to pull it all the way to ROTH with fast mobility of addresses,
or to run EVPN origination on the ToRs and use all-active or MC-LAG or
whatever from servers. It will all work.
So, we have an applicability document pending and this kind of stuff should all go into it IMO. Feel free to drum up the crowd and start/massage it.

> [snip]
>
> > yes, it's always-negative-disaggregation which is possible, however much
> harder to implement and you would somehow need to ring the ToP to have all
> the necessary topology information to achieve that (that's why we ring ToF
> in multi-plane design). Argument has been made before, we spent tons time
> with Pascal going pro and cons until the current design was found the best
> choice.[snip]
>
> [KP]: It looks like negative disaggregation could be an elegant
> protocol level solution if feasible and reliable.
>

we spec'ed it out solid methinks & you find examples and so on in the spec. Implementation doesn't look very challenging, most interesting is the recursive FIB hole punching in case of negative disaggregates but in fabrics it seems very unlikely people will carry lots of aggregates together with more specifics so then the problem doesn't even exist. Silicon is oblivious to it BTW, it all happens in control plane. If you read that and have further input, all interested in that ... nice you're drilling, I think lots of people looked @ the stuff over last year+ and we closed all the holes and discussed out all the design choices but one more pair of experienced eyes never hurts

--- tony
--00000000000087174e05870061e3 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Sat, Apr 20, 2019 at 11:58 AM Kris Price <kris@krisprice.nz> wrote:
...

> [KP]: A top of rack switch ("tier-1" let's say) may be connected to 8
> or 16 switches (or more) northbound (naturally let's call that next
> tier "tier-2"). If any single link between a tier-1 and tier-2 switch
> goes down (let's say between tier-1-1 and tier-2-1), all other nodes
> in tier-2 (that is tier-2-2, tier-2-3, to tier-2-n) will determine
> that tier-2-1 no longer has southbound reachability for tier-1-1's
> prefixes and that they each need to disaggregate these to prevent
> tier-1-2..n from sending any traffic for tier-1-1 via tier-2-1 (which
> would then need to forward up to tier-3 and back down).

Let's look at a figure

              .                                  [A,B,C,D]
              .                                  [E]
              .             +-----+      +-----+
 Level 2      .             |  E  |      |  F  | A/32 @ [C,D]
              .             +-+-+-+      +-+-+-+ B/32 @ [C,D]
              .               | |          | |   C/32 @ C
              .               | |    +-----+ |   D/32 @ D
              .               | |    |       |
              .               | +------+     |
              .               |      | |     |
 Level 1      .       [A,B] +-+---+  | | +---+-+ [A,B]
              .       [D]   |  C  +--+ +-+  D  | [C]
              .             +-+-+-+      +-+-+-+
              .  0/0  @ [E,F] | |          | |   0/0  @ [E,F]
              .  A/32 @ A     | |    +-----+ |   A/32 @ A
              .  B/32 @ B     | |    |       |   B/32 @ B
              .               | +------+     |
              .               |      | |     |
              .             +-+---+  | | +---+-+
              .             |  A  +--+ +-+  B  |
 Level 0      . 0/0 @ [C,D] +-----+      +-----+ 0/0 @ [C,D]

=C2=A0
Let's call A a ToR and say it's holding 8 server addresses. If you lose D-A the only disaggregation you will see is C disaggregating to B the 8 addresses. This is unavoidable. I assume we agree.
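
The single-link failure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm text from the draft: the global adjacency map, the function name `positive_disaggregation` and the 10.0.x.y addresses are all hypothetical, and a real node would learn which leaves a same-level peer still reaches from flooded topology information rather than from a shared dictionary. The sketch also assumes the more-specifics are offered to every other southbound neighbor.

```python
def positive_disaggregation(south_adj, leaf_prefixes, me):
    """Return {receiving_leaf: sorted prefixes} that `me` would advertise
    more specifically southbound (hypothetical helper, for illustration).

    south_adj     -- {level-1 node: set of leaves it currently reaches}
    leaf_prefixes -- {leaf: set of prefixes attached to that leaf}
    me            -- the level-1 node running the computation
    """
    result = {}
    for peer, peer_leaves in south_adj.items():
        if peer == me:
            continue
        # Leaves that I still reach but a same-level peer has lost.
        for lost in south_adj[me] - peer_leaves:
            # My other leaves may still pick `peer` via the default route,
            # so they need the specifics pointing at me instead.
            for leaf in south_adj[me] - {lost}:
                result.setdefault(leaf, set()).update(leaf_prefixes[lost])
    return {leaf: sorted(prefixes) for leaf, prefixes in result.items()}


# Figure above after losing D-A: C still reaches A and B, D only reaches B.
south_adj = {"C": {"A", "B"}, "D": {"B"}}
leaf_prefixes = {
    "A": {f"10.0.0.{i}/32" for i in range(1, 9)},  # the 8 server addresses
    "B": {"10.0.1.1/32"},
}
from_c = positive_disaggregation(south_adj, leaf_prefixes, "C")
from_d = positive_disaggregation(south_adj, leaf_prefixes, "D")
print(from_c)  # C sends A's 8 prefixes to B
print(from_d)  # D has nothing to disaggregate
```

With only two level-1 nodes, C is the sole node that disaggregates; with more nodes at that level, every node that still reaches A would run the same computation.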


>
> [Prz] I think we have a disconnect here. ToF level will only disaggregate if a ToF loses _all_ ToP connections to a PoD in a single plane design so I don't follow your argument. If you run multi-plane design you should multi-home each PoD multiple times into your plane as well. If you don't, duh, you must disaggregate since the plane will blackhole.

[KP]: I think the disconnect is due to my not using RIFT labels for
devices. I'm not talking about top of fabric. In RIFT labels I'm describing a PoD, where the Leafs are top of rack switches. Then when
a Top of PoD<->Leaf link fails, the other Top of PoD switches will disaggregate the prefixes on and below that Leaf leading to the incast
problem described. (This is described in the draft.)

right. It's really the only possible choice between having aggregation and having to react to link failure by disaggregating.

> Far more helpful than "deterministic" is to think, as in control system theory (https://en.wikipedia.org/wiki/Stability_theory), about "stability", where desirable positive stability is correlated with minimal blast radius. The more inputs shake more of your system the less "stability" you have.

[KP]: Absolutely, I'm not a mathematician, but reducing the amount of change under small perturbations is a concern at the back of my head
when I described a preference for more deterministic behavior. Aside
from addressing the incast concern, it was an intuition that adding
routes, and subtracting routes as they come and go would be less
change than mass adds/removals when disaggregating. So e.g. sticking
with the example of one PoD, where the link between a top of rack and
top of PoD switch goes down. That means one top of PoD device
withdraws one route (or maybe 1*many routes if routing on host is
happening), and that was less of a churn than say 7 other top of PoD
switches advertising 7*many new routes.

if that's your preference you can simply configure all your Level 1 switches to always disaggregate @ the cost of extra flooding on every address change and FIB size in Level 0.
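
The trade-off in this exchange can be put into rough numbers. The following is a back-of-the-envelope counting model, assuming a single PoD with identical leaves; the function names and the example sizes (8 Top of PoD switches, 16 leaves, 8 prefixes per leaf) are made up for illustration and do not come from the draft.

```python
def churn_on_link_failure(num_top_of_pod, prefixes_behind_leaf, always_disaggregated):
    """Route updates sent southbound when one ToP-to-leaf link fails."""
    if always_disaggregated:
        # Only the node that lost the link withdraws its more-specifics.
        return prefixes_behind_leaf
    # Every other Top of PoD node starts advertising the more-specifics.
    return (num_top_of_pod - 1) * prefixes_behind_leaf


def leaf_fib_entries(num_leaves, prefixes_per_leaf, always_disaggregated):
    """Steady-state FIB entries on one leaf (Level 0) with no failures."""
    if always_disaggregated:
        # Own prefixes plus every prefix behind every other leaf in the PoD.
        return num_leaves * prefixes_per_leaf
    # Own prefixes plus the default route.
    return prefixes_per_leaf + 1


churn_default = churn_on_link_failure(8, 8, always_disaggregated=False)
churn_always = churn_on_link_failure(8, 8, always_disaggregated=True)
fib_default = leaf_fib_entries(16, 8, always_disaggregated=False)
fib_always = leaf_fib_entries(16, 8, always_disaggregated=True)
print(churn_default, churn_always)  # 56 8
print(fib_default, fib_always)      # 9 128
```

Under these assumptions always-disaggregating cuts the churn on a single link failure from 7*many to 1*many updates, while the Level 0 FIB grows from a handful of entries to one entry per prefix in the PoD, and every address change has to be flooded down.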


> [KP]: Blast radius doesn't seem bigger to me. FIB explosion is a
> design consideration for anyone before thinking about routes from
> servers. At a small number selectively it's fine and practiced, e.g.
> advertising prefixes from servers that are doing software load
> balancing.



[KP]: I don't fully follow the statements about routing on the host
and shaking the fabric. Sure it would be a bad idea[tm] to do this in
a flat OSPF domain. As I understood it in RIFT with its link state
up, distance vector down design, if we have a route come and go at the
edge then it will be propagated all the way northbound, but will not
be propagated southbound. If we were disaggregated between the Top of
Rack switches and the Top of PoD switches, then a route appearing or
disappearing does propagate back down from the Top of PoD to the Top
of Rack, but that's it. It doesn't shake the wider fabric any more
than without disaggregation. If we were disaggregated between the Top of
Fabric and Top of PoD routes would be propagated back down to the
other Top of PoDs, but not below.


Otherwise, yes, we agree. We just needed to talk in same words about same things ;-)

[KP]: WRT to the routing on host it also seems in contradiction with
concerns about fabric stability. If fabric stability is a concern I
would think you still want addressing hierarchy and to use another
layer of indirection to achieve service mobility, to keep the fabric
unaware of the services constantly popping in and out of existence.

yes and no. Depends what you need. if you want to multi-home servers (since impact on your services is non-negligible if you lose e.g. a ToR) and need automatic bandwidth balancing north (nice thing), no need for MC-LAG in L2, tunnel origination on server without stitching, automatic disaggregation on failures, view of full topology on top of fabric and so on this has lots of appeal. And then, if you start to do things like running BIRD on host & then redistribute a default route and so on you don't have lots of these capabilities and on top another layer/protocol instance to manage.
Yes, selectively this works, e.g. in the software load balancer
scenario, tunnel ingest, etc. That is fine and widely practiced within
limits. But the way you're describing it sounds like it's expected to
be used generously, with every host announcing prefixes, and there's an expectation to move those prefixes such that you end up with a
random distribution. (Which is fine, I am probably out of touch with
fashion.)

yupp, that is the expectation (i.e. RIFT is designed to be able to support that if needed). Look @ mobility section ;-) Then you really have a "fabric" vs. a "network", i.e. something that gives you bandwidth the same way chips give you RAM. You don't think about which RAM bank your allocation has to reside on to work, so why should you be all concerned where and how you hook stuff up and whether your services move addresses if all you need is just "more bandwidth".

So with disaggregation in the PoD servers being single homed
would still see just the default.

if your server is single homed running any kind of routing protocol seems a waste really (unless you statically provision addresses and want them carried through rather than using DHCP and so on). You can as well point a static out, it's not like you can load balance, react to failures or anything much.

But all Top of Rack switches will
see the prefixes from other servers in the PoD, vs. when aggregation
is in effect and they'll only see the default (plus any disaggregated due to a failure). The Top of PoD will have all prefixes in the PoD
and below, and further up in higher layers they'll all need to scale up their FIB requirements to see all fabric routes. That's the same in
all cases with RIFT due to link state up.

Only ToF needs all routes (which is level 2 in 5-stage folded) in case of single plane fabric. In multi-plane fabric things are more complex. Any reasonable failure should be healed by negative disaggregation in levels higher up but one could construct completely pathological scenarios where you have to propagate all the way down since if a server can reach another server through certain planes only, it must know which planes to avoid to prevent an up-fabric/down-fabric/up-fabric path again, effectively turning other servers into ToF (which we call "fabric inversion" and seems extremely undesirable, BTW, in such scenarios your flooding on normal protocols also has to go up/down/up so once that happens you really don't have any kind of "hierarchical fabric" but a bunch of nodes & links where traffic tries to get places somehow). I think the draft explains that decently well.

[KP]: From purely the scaling perspective the aggregation feature is
useful where the number of routes produced by the servers in a PoD can
overwhelm the top of rack switches in that PoD but not the Top of PoD
switches, and so on up higher layers in the fabric. The FIB available
in devices these days would seem to preclude that.



So, we have an applicability document pending and this kind of stuff should all go into it IMO. Feel free to drum up the crowd and start/massage it.
[snip]

> yes, it's always-negative-disaggregation which is possible, however much harder to implement and you would somehow need to ring the ToP to have all the necessary topology information to achieve that (that's why we ring ToF in multi-plane design). Argument has been made before, we spent tons time with Pascal going pro and cons until the current design was found the best choice.[snip]

[KP]: It looks like negative disaggregation could be an elegant
protocol level solution if feasible and reliable.



--00000000000087174e05870061e3-- From nobody Mon Apr 22 16:16:56 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 0261712010C for ; Mon, 22 Apr 2019 16:16:56 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.999 X-Spam-Level: X-Spam-Status: No, score=-1.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RshqKn_VEXSQ for ; Mon, 22 Apr 2019 16:16:54 -0700 (PDT) Received: from mail-qt1-x82c.google.com (mail-qt1-x82c.google.com [IPv6:2607:f8b0:4864:20::82c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id E79771200B1 for ; Mon, 22 Apr 2019 16:16:53 -0700 (PDT) Received: by mail-qt1-x82c.google.com with SMTP id k2so14076747qtm.1 for ; Mon, 22 Apr 2019 16:16:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:mime-version:date:subject:message-id:to; bh=zJVwOCuZsyGGCXsQlUJ8ml2q13Q/Veef3tC9SUDqD+k=; b=LOakYmpUTlR6kC1Uk4bsI8FZuwfDerLpHrStc2hh9vg9bwkkHnMtp7HecckPyIp2Vp 3ovar5JX3m8jx7uZkdUHXy9qUU5yejxb1y2nNwfzGE5EnB/yfLZmtCix/hCcVvVoAEua qgYPpScd3Rdhg/yvfrNiXOogyPcum3ENMtY2RWquV1RWSI3KteFuujQ8YXKlMv6Ys7vV qZZ8a2gNP0jF1el7aETl1doU3vwMoCvGECgFWGVMD6A9N7j2BY6p+jhw6D6FCUtqb2uN MG5A32TOcOQHNKrO1RWqHGDFFoULn3EK+eFOTYsejRfjat5N2s0OKzBS/0VHK3vetV/w hanw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:from:mime-version:date:subject:message-id:to; bh=zJVwOCuZsyGGCXsQlUJ8ml2q13Q/Veef3tC9SUDqD+k=; b=dEqTOJ2Q2PFaFr10Zhb4Ebw0H39LZ2OluL+VH59z3NEPG0m1nS8tAtd6+T9LvXQbUC y5qfh82NuPdS/kyEFvIOt1gSY52Av6ma/UjFc/ktavPHfNM3ZUezsMHCms1ArRTECPIE pR2BrNofRfJy89gz/bd96bZPS1DSyYJo9Yz9BdbPHJALHreC7oARiDq2nXBT+sVQz40l 1FcracFvH1yN6QYLatuPzXJy0fRc+U5ft1ESeCO/2ZjzNEACa+/w47UAR1R4DzjH0gy6 +UJrbN3C1ceGOkzR987i9KMnHvsWPhceeFRb8fWa+XJ3z78luc6OS5oQVHQau37DGp5j 4xTg== X-Gm-Message-State: APjAAAWAzbNYmdn600p/kKJSev251UJSunbdgXiTIgTQkMOV3r+2agl5 SjZurN2dDSK5DdCG6qzGfOcR0Act X-Google-Smtp-Source: APXvYqzl6JMEQlWrCzUazWBmn7uvzoUPyUzUX0+ksJYZYzhGC6L5EM4e6T3OX0WuAjtTSdWboAuEEQ== X-Received: by 2002:a0c:9e9a:: with SMTP id r26mr17736673qvd.57.1555975012531; Mon, 22 Apr 2019 16:16:52 -0700 (PDT) Received: from [192.168.0.100] (host-cotesma-169-30.smandes.com.ar. [201.220.169.30]) by smtp.gmail.com with ESMTPSA id p6sm3588247qkc.13.2019.04.22.16.16.49 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 22 Apr 2019 16:16:51 -0700 (PDT) From: Bruno Rijsman Content-Type: multipart/alternative; boundary="Apple-Mail=_0D187170-CC42-46DC-8F1E-0F014122CCDE" Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Date: Mon, 22 Apr 2019 20:16:47 -0300 Message-Id: To: rift@ietf.org X-Mailer: Apple Mail (2.3445.100.39) Archived-At: Subject: [Rift] Initial implementation of security in RIFT-Python is complete X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Apr 2019 23:16:56 -0000 --Apple-Mail=_0D187170-CC42-46DC-8F1E-0F014122CCDE Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii I have finished the initial implementation of security in RIFT-Python = (security envelope, keys, fingerprints, nonces, packet-nr, etc. etc.) 
See http://bit.ly/rift-python-security-feature-guide for a detailed feature guide. While implementing the code, I gathered a number of comments on the security section of the draft -05. I will report these in a follow-up e-mail. -- Bruno --Apple-Mail=_0D187170-CC42-46DC-8F1E-0F014122CCDE Content-Transfer-Encoding: 7bit Content-Type: text/html; charset=us-ascii
I have finished the initial implementation of security in RIFT-Python (security envelope, keys, fingerprints, nonces, packet-nr, etc. etc.)


While implementing the code, I gathered a number of comments on the security section of the draft -05. I will report these in a follow-up e-mail.

-- Bruno
--Apple-Mail=_0D187170-CC42-46DC-8F1E-0F014122CCDE-- From nobody Mon Apr 22 18:17:56 2019 Return-Path: X-Original-To: rift@ietf.org Delivered-To: rift@ietfa.amsl.com Received: from ietfa.amsl.com (localhost [IPv6:::1]) by ietfa.amsl.com (Postfix) with ESMTP id E32841200D5; Mon, 22 Apr 2019 18:17:53 -0700 (PDT) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit From: IETF Meeting Session Request Tool To: Cc: zzhang@juniper.net, rift-chairs@ietf.org, rift@ietf.org, aretana.ietf@gmail.com X-Test-IDTracker: no X-IETF-IDTracker: 6.95.0 Auto-Submitted: auto-generated Precedence: bulk Message-ID: <155598227385.21056.15657401702434711935.idtracker@ietfa.amsl.com> Date: Mon, 22 Apr 2019 18:17:53 -0700 Archived-At: Subject: [Rift] rift - New Meeting Session Request for IETF 105 X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Apr 2019 01:17:54 -0000 A new meeting session request has just been submitted by Zhaohui (Jeffrey) Zhang, a Chair of the rift working group. 
--------------------------------------------------------- Working Group Name: Routing In Fat Trees Area Name: Routing Area Session Requester: Zhaohui Zhang Number of Sessions: 1 Length of Session(s): 2 Hours Number of Attendees: 40 Conflicts to Avoid: First Priority: bier bess rtgwg lsr pce mpls lsvr idr spring Second Priority: pim mboned teas ccamp sfc Third Priority: bfd detnet nvo3 netconf netmod People who must be present: Tony Przygienda Alvaro Retana Zhaohui (Jeffrey) Zhang Jeff Tantsura Resources Requested: Special Requests: --------------------------------------------------------- From nobody Tue Apr 23 07:42:08 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 05463120171 for ; Tue, 23 Apr 2019 07:42:07 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -2 X-Spam-Level: X-Spam-Status: No, score=-2 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PP5_PxqOy3-2 for ; Tue, 23 Apr 2019 07:42:05 -0700 (PDT) Received: from mail-qk1-x72d.google.com (mail-qk1-x72d.google.com [IPv6:2607:f8b0:4864:20::72d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 630E81200F9 for ; Tue, 23 Apr 2019 07:42:05 -0700 (PDT) Received: by mail-qk1-x72d.google.com with SMTP id w73so6540130qkb.13 for ; Tue, 23 Apr 2019 07:42:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; 
h=from:content-transfer-encoding:mime-version:subject:message-id:date :to; bh=nmjViqMITOkxYZhw4AomzAmy6AlYaP/HUMpJ7hX+xRo=; b=MvxlkS1klqVzOWxqS4hP93qztXnTxQeTP/iS+Y1cKVn3+jlKwYrzDEhESpfDNalr7R 2w6D593hfLScFtuHBXODUyoqQ293DPycS7aOnuTf+Ig+mBm5tje131G7dZiEBl9wJlJk eAxfKQrKecKPh0ua5J4eX1f3fpivZI7hAhGOPjbXtFwI/RtHT5WLiaisykQdGUxq9H9x OioIjCR8lKnhAfaoOPMZCbgJyOD1vHZuWrDmDn5lpBtu1RSasnP0GA+4BEefl3ms+vvn 6bZrp5f0bfpK42aOjWvVC6KtQBEtDil1NC3Zr1n3yos8sFp7q5/KVcO/5VKb/eIF0nJ4 hHmQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:content-transfer-encoding:mime-version :subject:message-id:date:to; bh=nmjViqMITOkxYZhw4AomzAmy6AlYaP/HUMpJ7hX+xRo=; b=Me584NLI43Lw/JjDBm1TU7PilHx40wn8SUlxvSze7E6UwMGpNqkWVTYF+6M7Z9+lpN JdVFo64lTWo7M22H3DvqM0Tk8WxV5ge5ycRvoQHewndrwMIF1OC8wcVyVAKYkH4EzP/e 1ypAVqJh/I5yM1Fqsc1k+sZMagyqYwR9AvPZ/m4O9PG7sb5fiSfRHqabL4gr1K/0T205 FNXT9lF6m3aMJpHdcn/ZKMH3ac+g01o9wDiY5SOBrWdipexRfqYebrgvT4Vd5z+rvTqq kg0g9UTZXWyuBWAShEU0uRLDQnQt0fkaX3v6Thh/R0Mgkm95AZ1yd0Y6nZQOLtEVyKkC KUNQ== X-Gm-Message-State: APjAAAU/+VfzSAEHi9vk+dcGv6hRlk1JYga0+CBWcveqDeTVxtAQ7buj 78td2SHfNQZpZFm1IZT8igLBOwIn X-Google-Smtp-Source: APXvYqyxcIprq6LJYGZBqkbJH2WtyN9z81MImHK+447IU7dB5AsRlABiGkZr/EZm0qKEzYLocu5dsQ== X-Received: by 2002:ae9:c005:: with SMTP id u5mr9263328qkk.179.1556030522672; Tue, 23 Apr 2019 07:42:02 -0700 (PDT) Received: from [192.168.0.101] (host-cotesma-177-90.smandes.com.ar. 
[201.220.177.90]) by smtp.gmail.com with ESMTPSA id i24sm10183783qti.76.2019.04.23.07.42.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 23 Apr 2019 07:42:02 -0700 (PDT) From: Bruno Rijsman Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Message-Id: <0660FAD1-B80C-4D37-B4D6-6CE4F6759BCD@gmail.com> Date: Tue, 23 Apr 2019 11:41:59 -0300 To: rift@ietf.org X-Mailer: Apple Mail (2.3445.100.39) Archived-At: Subject: [Rift] 2 weeks for comments on security section X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Apr 2019 14:42:07 -0000 PS: Due to other activities, it will take me a week or two to finish the write-up of my comments on the security section of the RIFT draft. I want to make sure my write-up is accurate and complete before I send it out, because some of my comments are quite subtle or opinionated.
-- Bruno
From nobody Tue Apr 23 10:18:24 2019 Return-Path: X-Original-To: rift@ietf.org Delivered-To: rift@ietfa.amsl.com Received: from ietfa.amsl.com (localhost [IPv6:::1]) by ietfa.amsl.com (Postfix) with ESMTP id 2B68B120478; Tue, 23 Apr 2019 10:18:16 -0700 (PDT) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit From: internet-drafts@ietf.org To: Cc: rift@ietf.org X-Test-IDTracker: no X-IETF-IDTracker: 6.95.0 Auto-Submitted: auto-generated Precedence: bulk Reply-To: rift@ietf.org Message-ID: <155603989611.32473.18257430692008042166@ietfa.amsl.com> Date: Tue, 23 Apr 2019 10:18:16 -0700 Archived-At: Subject: [Rift] I-D Action: draft-ietf-rift-rift-05.txt X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Apr 2019 17:18:16 -0000 A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the Routing In Fat Trees WG of the IETF. Title : RIFT: Routing in Fat Trees Author : The RIFT Team Filename : draft-ietf-rift-rift-05.txt Pages : 138 Date : 2019-04-23 Abstract: This document outlines a specialized, dynamic routing protocol for Clos and fat-tree network topologies.
The protocol (1) deals with fully automated construction of fat-tree topologies based on detection of links, (2) minimizes the amount of routing state held at each level, (3) automatically prunes and load balances topology flooding exchanges over a sufficient subset of links, (4) supports automatic disaggregation of prefixes on link and node failures to prevent black-holing and suboptimal routing, (5) allows traffic steering and re-routing policies, (6) allows loop-free non-ECMP forwarding, (7) automatically re-balances traffic towards the spines based on bandwidth available and finally (8) provides mechanisms to synchronize a limited key-value data-store that can be used after protocol convergence to e.g. bootstrap higher levels of functionality on nodes. The IETF datatracker status page for this draft is: https://datatracker.ietf.org/doc/draft-ietf-rift-rift/ There are also htmlized versions available at: https://tools.ietf.org/html/draft-ietf-rift-rift-05 https://datatracker.ietf.org/doc/html/draft-ietf-rift-rift-05 A diff from the previous version is available at: https://www.ietf.org/rfcdiff?url2=draft-ietf-rift-rift-05 Please note that it may take a couple of minutes from the time of submission until the htmlized version and diff are available at tools.ietf.org. 
Internet-Drafts are also available by anonymous FTP at: ftp://ftp.ietf.org/internet-drafts/ From nobody Tue Apr 23 10:18:59 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 853D51204DC for ; Tue, 23 Apr 2019 10:18:50 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.999 X-Spam-Level: X-Spam-Status: No, score=-1.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 8tYoNoVitEL5 for ; Tue, 23 Apr 2019 10:18:48 -0700 (PDT) Received: from mail-ed1-x535.google.com (mail-ed1-x535.google.com [IPv6:2a00:1450:4864:20::535]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id AB3F1120494 for ; Tue, 23 Apr 2019 10:18:46 -0700 (PDT) Received: by mail-ed1-x535.google.com with SMTP id a6so13376186edv.1 for ; Tue, 23 Apr 2019 10:18:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=Uh6WzPSOonwANByW5AgiirMelJVtEyIcPUhKWbyfdSU=; b=D8t2QnwePMMde7NYQEHQ2GE/nMJm1pEmkd1PEf0IPdzGXGdEqg9gdHd8Lm1A1CLTvg bLEXz2dwDHsc2Zlv54I39ZiNpbFVTvYjeAe98lGZ2PX5E5UI5SGqqwHggQ3X2EVgCLWT wdXOX16veMhdtn5mzFIWlGfIqe7YQhn0ITk6zDmOLrfAMP7y/z1AApm2ivxsyot4Rwo7 LRzRVLSsQq1hh13+So9JjrQZtfPdcfBIcDw57NbskDmmzIokrhepjmRrGP9aL0bMvKkb 3ktjkqdaoFmfnHNvvqZRzVASyezIO+sOqr4RvUR4E0depyGKbEvWJhYvpxGQ2NqhMmS0 NwCg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=Uh6WzPSOonwANByW5AgiirMelJVtEyIcPUhKWbyfdSU=; b=X1jO5VNXa5Xa4P1xofYSZzJZkBl00z7IhILlmiw+6A2fUlLCruhQXNRvxEe/Es91FF 7jxjOQ9YKcG9ay0Cp0FHDVstaSXBipCsi+H8yY3vMiQLP/zgk9/CBPa2CZGqI3JK2DJW aqhve/S1NiTshm4a6uyl9Bsgoq+88NvPzbLuPFxzMctFbckDdDW0ve811RQ1g/uRvHVK w5XmRCoa8W2YzSid8lukWegc883fshdTMlgyIRfDZpJF03jYl49X5iRIubnHdyP1M61l 4G7thM6ncsiLoHROZw+jqQyILWjv7i8GQe9vs8IlVzRc+DvvS6rXEsrsnHo9whsXgCpp Zsvg== X-Gm-Message-State: APjAAAVOEWJcuUGttD5HHKnNLnxIL1BkIf48dRMH9meCp6BDlPr37zW1 /djaA066x1LvwuN/CX0LLB5oZzDt51FRrdY1HG5xNA== X-Google-Smtp-Source: APXvYqyBPTPp2bJK/yKnFLWtc5gPbKyznCacoj7s25Vc+WhMCf/vK41uTVrMPEzEoKG+OMoyEfd+/2FQyVGNxk3173c= X-Received: by 2002:a17:906:25d1:: with SMTP id n17mr12924229ejb.257.1556039924822; Tue, 23 Apr 2019 10:18:44 -0700 (PDT) MIME-Version: 1.0 From: Tony Przygienda Date: Tue, 23 Apr 2019 10:18:08 -0700 Message-ID: To: rift@ietf.org Content-Type: multipart/alternative; boundary="000000000000f9b81e058735c680" Archived-At: Subject: [Rift] -05 will be out ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Apr 2019 17:18:58 -0000 --000000000000f9b81e058735c680 Content-Type: text/plain; charset="UTF-8" I just submitted -05 version including changes based on first round of security reviews ... --- tony --000000000000f9b81e058735c680 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I just submitted -05 version including changes based on first round of security reviews ...

--- tony

--000000000000f9b81e058735c680-- From nobody Tue Apr 23 10:30:52 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 3680112042B for ; Tue, 23 Apr 2019 10:30:51 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.327 X-Spam-Level: X-Spam-Status: No, score=-1.327 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, T_REMOTE_IMAGE=0.01, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id llp55jgz7G9v for ; Tue, 23 Apr 2019 10:30:48 -0700 (PDT) Received: from mx0a-00273201.pphosted.com (mx0a-00273201.pphosted.com [208.84.65.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 5CC5A12024E for ; Tue, 23 Apr 2019 10:30:48 -0700 (PDT) Received: from pps.filterd (m0108158.ppops.net [127.0.0.1]) by mx0a-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3NHTnfH003279; Tue, 23 Apr 2019 10:30:44 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : cc : subject : date : message-id : content-type : mime-version; s=PPS1017; bh=wt2zWwJXmB9rhS+thBQNGgyAhqYAVnK7L5E0ZwKguVs=; b=FnxvbLtnakaftBa2c4gDPL8UmGliTp/exU9+LBoE7eF/H/IuJZ/jGIou2KcnNetfobtS yNtwg/O5mmp8fuxxSw9j0hm0tSKmWFG9iMpfqbHDwNJM6YeMwdHXirsk6u2H5eU2wT8u zZRDuCGQ0IDRZU5psvjlfAPh38YSwvFFT41QF6kxWN+feQJluMwhHb0mQ9G2RHT/UcPt HIa30cLjyAHBNupHFtN1Z2dwbKDF+gSvLJhhJnXaiICNir606jvU66KscrSX5QaPKWiW 
9AvS+53NbIS2waFScnEQCTqcohljZs/5gYJVEYmdzZXxnQrJ5/l2suUOxgbnBhdyNjx1 pA== Received: from nam03-co1-obe.outbound.protection.outlook.com (mail-co1nam03lp2052.outbound.protection.outlook.com [104.47.40.52]) by mx0a-00273201.pphosted.com with ESMTP id 2s20rmgpt0-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Tue, 23 Apr 2019 10:30:43 -0700 Received: from SN2PR05MB2463.namprd05.prod.outlook.com (10.166.213.8) by SN2PR05MB2638.namprd05.prod.outlook.com (10.167.14.138) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1835.12; Tue, 23 Apr 2019 17:30:41 +0000 Received: from SN2PR05MB2463.namprd05.prod.outlook.com ([fe80::f592:9f8f:f5ad:d73a]) by SN2PR05MB2463.namprd05.prod.outlook.com ([fe80::f592:9f8f:f5ad:d73a%5]) with mapi id 15.20.1835.007; Tue, 23 Apr 2019 17:30:41 +0000 From: "Jeffrey (Zhaohui) Zhang" To: "Scott G. Kelly" , Antoni Przygienda , Bruno Rijsman CC: "rift@ietf.org" Thread-Topic: RIFT security review Thread-Index: AdT5+dFkdty2AnMSST2sp6Q1ovQg4Q== Content-Class: Date: Tue, 23 Apr 2019 17:30:40 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.1.100.23 dlp-reaction: no-action msip_labels: MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Enabled=True; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_SiteId=bea78b3c-4cdb-4130-854a-1d193232e5f4; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Owner=zzhang@juniper.net; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_SetDate=2019-04-23T17:30:37.2151907Z; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Name=Juniper Internal; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Application=Microsoft Azure Information Protection; MSIP_Label_0633b888-ae0d-4341-a75f-06e04137d755_Extended_MSFT_Method=Automatic; Sensitivity=Juniper Internal x-originating-ip: [66.129.241.10] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 
c6aca113-b073-4690-38f8-08d6c811686b x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600141)(711020)(4605104)(4618075)(2017052603328)(7193020); SRVR:SN2PR05MB2638; x-ms-traffictypediagnostic: SN2PR05MB2638: x-ms-exchange-purlcount: 5 x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:8273; x-forefront-prvs: 0016DEFF96 x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(376002)(396003)(366004)(346002)(39860400002)(136003)(189003)(199004)(486006)(26005)(186003)(1941001)(14444005)(102836004)(25786009)(966005)(606006)(7696005)(478600001)(99286004)(476003)(110136005)(76116006)(66556008)(64756008)(733005)(66946007)(14454004)(66446008)(53546011)(6436002)(316002)(4326008)(71190400001)(71200400001)(3480700005)(256004)(66476007)(6506007)(6246003)(55016002)(7116003)(236005)(7736002)(9686003)(97736004)(68736007)(229853002)(53936002)(73956011)(66066001)(6306002)(54896002)(33656002)(86362001)(53946003)(9326002)(790700001)(3846002)(6116002)(8936002)(5660300002)(66574012)(8676002)(15650500001)(81156014)(81166006)(2420400007)(861006)(7110500001)(74316002)(52536014)(2906002)(559001)(569006); DIR:OUT; SFP:1102; SCL:1; SRVR:SN2PR05MB2638; H:SN2PR05MB2463.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; MX:1; A:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: jrBZX4pluVoJn06fDQFJbEGFghz/ZHVbFD9ZmA1Hm23WbrHr2EFptBtckY/Pmo3u1feyswxSgQhofAawb76ZwgIiCzk9d+dh6SS4wkMVs6F0sHv+pKXnI2sG2eKjGI0wBt+OhqD8c5YudTkYuCTUtdiibh6rm8OpoxqXyGJGb6ouwkeAjp648XbUXwdUWATCofNfjK7+eVTyc1kQOU0B/kQezA8ASX2BASEI8LwyGaYvHH0/wmOh8hQg2WqvQuPHqj8UUEYCq6o4GNQ5XXBeUEJhqmO0gvtRDuXw7Y/Lyas24KF9yzuozklBJQgtBOb4+wi5wuFiwW5+t/yPA5rOR/ZQgkAlI1jOYz6EMHPmBPPUrVtIcHVcihgkg8YMh0TN7i2oG8XvcGbe7bzsPUGK/6ahY5jL5+n/OO1R9dJFwME= Content-Type: 
multipart/alternative; boundary="_000_SN2PR05MB2463D1833A557E99671C8A57D4230SN2PR05MB2463namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: c6aca113-b073-4690-38f8-08d6c811686b X-MS-Exchange-CrossTenant-originalarrivaltime: 23 Apr 2019 17:30:40.9538 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN2PR05MB2638 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-23_05:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904230120 Archived-At: Subject: Re: [Rift] RIFT security review X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Apr 2019 17:30:51 -0000
--_000_SN2PR05MB2463D1833A557E99671C8A57D4230SN2PR05MB2463namp_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

Hi Scott,

Thanks for your security review on the RIFT spec. I am copying this to the RIFT mailing list.

Please see Tony's response in the email below and the revisions in https://www.ietf.org/id/draft-ietf-rift-rift-05.txt. Could you review again to make sure all your comments/concerns have been addressed?

Thanks!
Jeffrey

Juniper Internal

From: Antoni Przygienda
Sent: Thursday, April 18, 2019 12:34 PM
To: Jeffrey (Zhaohui) Zhang; rift-chairs@ietf.org; Bruno Rijsman
Subject: Re: RIFT security review

My comments on the security review:

I have reviewed this document as part of the security directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the security area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

The summary of the review is ready with issues

From the abstract, this document outlines a specialized, dynamic routing protocol for Clos and fat-tree network topologies.

(should that read CLOS?)

Clos was a French mathematician in Bell Labs who invented the stuff so it's really "Clos"

Wikipedia: "Clos networks are named after Bell Labs researcher Charles Clos, who proposed the model in 1952 as a way to overcome the performance- and cost-related challenges of electromechanical switches then used in telephone networks."

Following is a brief summary of comments and questions by section.

5.4.1 includes this sentence:

   The most security conscious operators will want to have full control
   over which port on which router/switch is connected to the respective
   port on the "other side", which we will call the "port-association
   model" (PAM) achievable e.g. by pairwise-key PKI.

What is "pairwise-key PKI"?

pair-wise set of private/public key, i.e. a designated key pair per port. I'll try to word it better

Section 5.4.2 says "Low processing overhead and efficiency messaging are also a goal."

I suggest replacing efficiency with efficient

ack

It also says "Message privacy achieved through full encryption is a non-goal"

I suggest saying "Message confidentiality is a non-goal" instead.

ack

Section 5.4.3

"Length of Fingerprint:  8 bits.  Length in 32-bit multiples of the
      following fingerprint not including lifetime or nonces.  It allows
      to navigate the structure when an unknown key type is present.  To
      clarify a common cornercase a fingerprint with length of 0 bits is
      presenting this field with value of 0."

Does length 0 mean no fingerprint is present (i.e. fingerprints are not provided)? I don't understand that last sentence.

yes, it does. I'll try to improve the wording.

The definition for "Security Fingerprint" includes this sentence:

"If the fingerprint is shorter than the significant bits are left aligned and remaining bits are set to 0."

I don't understand this sentence. I think you mean that if the fingerprint bit length is not an even multiple of 32, then it is left-aligned, and the rightmost unused bits are set to 0. But that's just a guess.

yes, I'll try to word it better.

5.4.4

"Any implementation including RIFT security MUST generate and wrap around local nonces properly"

I see the term "nonce" used elsewhere, but because it can wrap (and therefore repeat with regularity), I think this is a poor choice for naming this field. It seems to be more of a counter. I think most security folks would agree that a nonce used for security purposes should, by definition, repeat only with negligible probability.

I was under the impression that nonce is a well-known term in cryptography

https://en.wikipedia.org/wiki/Cryptographic_nonce
[https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Nonce-cnonce-uml.svg/1200px-Nonce-cnonce-uml.svg.png]

Cryptographic nonce - Wikipedia
In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication. It is similar in spirit to a nonce word, hence the name. It is often a random or pseudo-random number issued in an authentication protocol to ensure that old communications cannot be reused in replay attacks. They can also be useful as initialization vectors and in cryptographic hash ...
en.wikipedia.org

On a related note, does this really provide anti-replay protection? Elsewhere in the document (e.g. section 5.4.4) it says that implementations could go up to 5 minutes without incrementing nonces. Can they send multiple packets with the same nonce during this interval? If so, what prevents replay of a captured packet within that interval?

Also, because wrapping (of this 16 bit value) is supported, it's also possible that an earlier packet could be replayed (assuming the peer nonce also aligned), right? The odds of this seem low, but could the protocol/endpoint states be manipulated to improve the odds? Not sure. But if you are assuming this can't happen, this security-relevant assumption should be called out.

  1. Correct, for efficiency purposes we open up to a 5 min window which we consider an acceptable risk per point 2
  2. it is the combination of local and remote nonce so it's really a 32 bit number. The chance that the combination repeats is obviously very small.

5.4.7 says

   "If an implementation supports disabling the security envelope
   requirements while sending a security envelope an implementation
   could shut down the security envelope procedures while maintaining an
   adjacency and make changes to the algorithms on both sides then re
   enable the security envelope procedures but that introduces security
   holes during the disabled period."

Aside from the fact that this needs word-smithing, should this be called out in the security considerations section? This seems to be saying that it's not a good idea to temporarily maintain adjacency while disabling security, so is this a SHOULD NOT?

Will improve wording. Yes, it is a SHOULD NOT but sometimes implementations do that to not lose adjacency and change keys easily.

section 8.4

flodding -> flooding

section 8.4 also says

   It is expected that an
   implementation detecting too many fake losses or misorderings due to
   the attack on the number would simply suppress its further processing.

what are "fake losses"?

I am not a routing expert, so there may be additional concerns that someone better versed in routing would raise.

Will improve wording. "Fake losses" is a possible attack vector where an attacker intercepts packets, modifies the packet number to simulate a "packet number loss/misorder" and forwards the packet on.

Please fwd' to reviewer if needed & tell me whether you want notification on new version ...

--- tony

________________________________
From: Jeffrey (Zhaohui) Zhang
Sent: Thursday, April 18, 2019 7:26 AM
To: Antoni Przygienda
Subject: RIFT security review

Hi Tony,

Please see https://datatracker.ietf.org/doc/review-ietf-rift-rift-04-secdir-early-kelly-2019-04-11/.

Thanks.
Jeffrey

Juniper Internal

--_000_SN2PR05MB2463D1833A557E99671C8A57D4230SN2PR05MB2463namp_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

Hi Scott,

 

Thanks for your security review on the RIFT spec. I am copying this to the RIFT mailing list.

Please see Tony's response in the email below and the revisions in https://www.ietf.org/id/draft-ietf-rift-rift-05.txt. Could you review again to make sure all your comments/concerns have been addressed?

Thanks!

Jeffrey

Juniper Internal

From: Antoni Przygienda <prz@juniper.net>
Sent: Thursday, April 18, 2019 12:34 PM
To: Jeffrey (Zhaohui) Zhang <zzhang@juniper.net>; rift-chairs@ietf.org; Bruno Rijsman <brunorijsman@gmail.com>
Subject: Re: RIFT security review

My comments on the security review:

 

I have reviewed this document as part of the security directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the security area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

The summary of the review is ready with issues

From the abstract, this document outlines a specialized, dynamic routing protocol for Clos and fat-tree network topologies.

(should that read CLOS?)

Clos was a French mathematician in Bell Labs who invented the stuff so it's really "Clos"

Wikipedia: "Clos networks are named after Bell Labs researcher Charles Clos, who proposed the model in 1952 as a way to overcome the performance- and cost-related challenges of electromechanical switches then used in telephone networks."

Following is a brief summary of comments and questions by section.
 
5.4.1 includes this sentence:

   The most security conscious operators will want to have full control
   over which port on which router/switch is connected to the respective
   port on the "other side", which we will call the "port-association
   model" (PAM) achievable e.g. by pairwise-key PKI.

What is "pairwise-key PKI"?

pair-wise set of private/public key, i.e. a designated key pair per port. I'll try to word it better


Section 5.4.2 says "Low processing overhead and efficiency messaging are also a goal."

I suggest replacing efficiency with efficient

ack

It also says "Message privacy achieved through full encryption is a non-goal"

I suggest saying "Message confidentiality is a non-goal" instead.

ack

Section 5.4.3

"Length of Fingerprint:  8 bits.  Length in 32-bit multiples of the
      following fingerprint not including lifetime or nonces.  It allows
      to navigate the structure when an unknown key type is present.  To
      clarify a common cornercase a fingerprint with length of 0 bits is
      presenting this field with value of 0."

Does length 0 mean no fingerprint is present (i.e. fingerprints are not provided)? I don't understand that last sentence.

yes, it does. I'll try to improve the wording.

The definition for "Security Fingerprint" includes this sentence:

"If the fingerprint is shorter than the significant bits are left aligned and remaining bits are set to 0."

I don't understand this sentence. I think you mean that if the fingerprint bit length is not an even multiple of 32, then it is left-aligned, and the rightmost unused bits are set to 0. But that's just a guess.

yes, I'll try to word it better.
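
The reading of the fingerprint fields that Tony confirms here can be sketched in Python. This is an illustration only, not text from the draft: it assumes the 8-bit length field counts 32-bit words, that a value of 0 means no fingerprint is present, and that a fingerprint supplied in whole bytes is left-aligned with zero padding on the right. The helper name `encode_fingerprint` is hypothetical.

```python
from typing import Tuple


def encode_fingerprint(fingerprint: bytes) -> Tuple[int, bytes]:
    """Return (length_field, padded_fingerprint).

    Assumptions (not taken from the draft text itself):
    - length_field is an 8-bit count of 32-bit words,
    - 0 means that no fingerprint is present,
    - the fingerprint is left-aligned and zero-padded on the right.
    """
    words = (len(fingerprint) + 3) // 4  # round up to whole 32-bit words
    if words > 0xFF:
        raise ValueError("fingerprint does not fit an 8-bit length field")
    padded = fingerprint.ljust(words * 4, b"\x00")  # pad on the right
    return words, padded
```

Under these assumptions an empty fingerprint yields a length field of 0, and a 5-byte fingerprint yields a length of 2 words with three trailing zero bytes.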
5.4.4

"Any implementation including RIFT security MUST generate and wrap around local nonces properly"

I see the term "nonce" used elsewhere, but because it can wrap (and therefore repeat with regularity), I think this is a poor choice for naming this field. It seems to be more of a counter. I think most security folks would agree that a nonce used for security purposes should, by definition, repeat only with negligible probability.

I was under the impression that nonce is a well-known term in cryptography

https://en.wikipedia.org/wiki/Cryptographic_nonce

In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication. It is similar in spirit to a nonce word, hence the name. It is often a random or pseudo-random number issued in an authentication protocol to ensure that old communications cannot be reused in replay attacks. They can also be useful as initialization vectors and in cryptographic hash ...

en.wikipedia.org



On a related note, does this really provide anti-replay protection? Elsewhere in the document (e.g. section 5.4.4) it says that implementations could go up to 5 minutes without incrementing nonces. Can they send multiple packets with the same nonce during this interval? If so, what prevents replay of a captured packet within that interval?

Also, because wrapping (of this 16 bit value) is supported, it's also possible that an earlier packet could be replayed (assuming the peer nonce also aligned), right? The odds of this seem low, but could the protocol/endpoint states be manipulated to improve the odds? Not sure. But if you are assuming this can't happen, this security-relevant assumption should be called out.

  1. Correct, for efficiency purposes we open up to a 5 min window which we consider an acceptable risk per point 2
  2. it is the combination of local and remote nonce so it's really a 32 bit number. The chance that the combination repeats is obviously very small.
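
The two points in this exchange can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the draft's procedure: the nonces are taken to be 16-bit counters that wrap, and the way the pair is packed into one 32-bit value (local nonce in the high half) is chosen here purely for demonstration. The names `next_nonce` and `nonce_pair` are hypothetical.

```python
NONCE_BITS = 16
NONCE_MOD = 1 << NONCE_BITS  # 65536 possible values per side


def next_nonce(nonce: int) -> int:
    """Advance a 16-bit nonce, wrapping to 0 after 65535."""
    return (nonce + 1) % NONCE_MOD


def nonce_pair(local: int, remote: int) -> int:
    """Combine both 16-bit nonces into one 32-bit value.

    The packing order is an assumption made for this sketch only.
    """
    return (local << NONCE_BITS) | remote


# Two packets sent while neither side has advanced its nonce carry the
# same pair, so the pair alone cannot tell a replay from the original.
first = nonce_pair(local=7, remote=41)
second = nonce_pair(local=7, remote=41)
same_window_collision = first == second
```

The sketch shows why both observations hold at once: the combined value space is 2^32, so a repeat across wraps is unlikely, yet any packet captured inside one unchanged window carries exactly the same pair as later packets in that window.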


5.4.7 says

   "If an implementation supports disabling the security envelope
   requirements while sending a security envelope an implementation
   could shut down the security envelope procedures while maintaining an
   adjacency and make changes to the algorithms on both sides then re
   enable the security envelope procedures but that introduces security
   holes during the disabled period."

Aside from the fact that this needs word-smithing, should this be called out in the security considerations section? This seems to be saying that it's not a good idea to temporarily maintain adjacency while disabling security, so is this a SHOULD NOT?

Will improve wording. Yes, it is a SHOULD NOT but sometimes implementations do that to not lose adjacency and change keys easily.


section 8.4

flodding -> flooding

section 8.4 also says

   It is expected that an
   implementation detecting too many fake losses or misorderings due to
   the attack on the number would simply suppress its further processing.

what are "fake losses"?

I am not a routing expert, so there may be additional concerns that someone better versed in routing would raise.

Will improve wording. "Fake losses" is a possible attack vector where an attacker intercepts packets, modifies the packet number to simulate a "packet number loss/misorder" and forwards the packet on.

 

Please fwd' to reviewer if needed & tell me whether you want notification on new version ...

 

--- tony

 


From: Jeffrey (Zhaohui) Zhang
Sent: Thursday, April 18, 2019 7:26 AM
To: Antoni Przygienda
Subject: RIFT security review


--_000_SN2PR05MB2463D1833A557E99671C8A57D4230SN2PR05MB2463namp_-- From nobody Tue Apr 23 17:33:33 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 749D5120175 for ; Tue, 23 Apr 2019 17:33:32 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.899 X-Spam-Level: X-Spam-Status: No, score=-1.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (1024-bit key) header.d=g001.emailsrvr.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id QF_CmCNPW5Gp for ; Tue, 23 Apr 2019 17:33:29 -0700 (PDT) Received: from smtp89.iad3a.emailsrvr.com (smtp89.iad3a.emailsrvr.com [173.203.187.89]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 1988B120188 for ; Tue, 23 Apr 2019 17:33:29 -0700 (PDT) Received: from smtp12.relay.iad3a.emailsrvr.com (localhost [127.0.0.1]) by smtp12.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id 3888825001; Tue, 23 Apr 2019 20:33:28 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=g001.emailsrvr.com; s=20190322-9u7zjiwi; t=1556066008; bh=4VIqkFEAp1LDDzwT1qlkW6Af+pyyEhet0w6GhyGwNMw=; h=Date:Subject:From:To:From; b=queY0xm7uTFjONlkgPx+FCZuwWUIhswKUdEUihVk0quSG/d94Oe1WW5+uzGEdpzh8 R8tqtAqVGpGUi1NLgbfYzdu93Or6wYD2UyMF+0jbpjyL2U/tyoneG3jSYgBNnjOwD4 zuGEobeqOpa3PN03b3IG2I+dFMvsnlJKoTC26lyA= Received: from app55.wa-webapps.iad3a (relay-webapps.rsapps.net [172.27.255.140]) by smtp12.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id 0C68A23A17; Tue, 23 Apr 2019 20:33:28 -0400 (EDT) X-Sender-Id: scott@hyperthought.com Received: 
from app55.wa-webapps.iad3a (relay-webapps.rsapps.net [172.27.255.140]) by 0.0.0.0:25 (trex/5.7.12); Tue, 23 Apr 2019 20:33:28 -0400 Received: from hyperthought.com (localhost.localdomain [127.0.0.1]) by app55.wa-webapps.iad3a (Postfix) with ESMTP id EDB1D60045; Tue, 23 Apr 2019 20:33:27 -0400 (EDT) Received: by apps.rackspace.com (Authenticated sender: scott@hyperthought.com, from: scott@hyperthought.com) with HTTP; Tue, 23 Apr 2019 17:33:27 -0700 (PDT) X-Auth-ID: scott@hyperthought.com Date: Tue, 23 Apr 2019 17:33:27 -0700 (PDT) From: "Scott G. Kelly" To: "Jeffrey (Zhaohui) Zhang" Cc: "Antoni Przygienda" , "Bruno Rijsman" , "rift@ietf.org" MIME-Version: 1.0 Content-Type: text/plain;charset=UTF-8 Content-Transfer-Encoding: quoted-printable Importance: Normal X-Priority: 3 (Normal) X-Type: plain In-Reply-To: References: Message-ID: <1556066007.970621387@apps.rackspace.com> X-Mailer: webmail/16.4.1-RC Archived-At: Subject: Re: [Rift] RIFT security review X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Apr 2019 00:33:32 -0000

Hi Jeffrey,

I haven't yet reviewed the document, but based on your reply, I think that all of my comments are addressed except for one. From Antoni's reply below:

>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://en.wikipedia.org/wiki/Cryptographic_nonce
> [https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Nonce-cnonce-uml.svg/1200px-Nonce-cnonce-uml.svg.png]

This is exactly my point: read the definition given for cryptographic nonce: "In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication"

The key here is "just once". Because you expect rollover, and because you allow one side to hold this constant up to 5 minutes (and repeat it in any messages flowing during that interval) the same value can be used more than once. So, it does not meet the commonly understood definition of nonce (and it does not have the security properties commonly expected of a nonce).

--Scott


On Tuesday, April 23, 2019 10:30am, "Jeffrey (Zhaohui) Zhang" said:

> Hi Scott,
>
> Thanks for your security review on the RIFT spec. I am copying this to the RIFT
> mailing list.
>
> Please see Tony's response in the email below and the revisions in
> https://www.ietf.org/id/draft-ietf-rift-rift-05.txt. Could you review again to
> make sure all your comments/concerns have been addressed?
>
> Thanks!
> Jeffrey
>
> Juniper Internal
> From: Antoni Przygienda
> Sent: Thursday, April 18, 2019 12:34 PM
> To: Jeffrey (Zhaohui) Zhang; rift-chairs@ietf.org; Bruno Rijsman
> Subject: Re: RIFT security review
>
> My comments on the security review:
>
> I have reviewed this document as part of the security directorate's ongoing effort
> to review all IETF documents being processed by the IESG. These comments were
> written primarily for the benefit of the security area directors. Document
> editors and WG chairs should treat these comments just like any other last call
> comments.
>
> The summary of the review is ready with issues
>
> From the abstract, this document outlines a specialized, dynamic routing protocol
> for Clos and fat-tree network topologies.
>
> (should that read CLOS?)
>
> Clos was a French mathematician in Bell Labs who invented the stuff so it's really
> "Clos"
>
> Wikipedia: "Clos networks are named after Bell Labs researcher
> Charles Clos, who proposed the model in 1952 as a way to overcome the
> performance- and cost-related challenges of electromechanical switches then used
> in telephone networks."
>
> Following is a brief summary of comments and questions by section.
>
> 5.4.1 includes this sentence:
>
>    The most security conscious operators will want to have full control
>    over which port on which router/switch is connected to the respective
>    port on the "other side", which we will call the "port-association
>    model" (PAM) achievable e.g. by pairwise-key PKI.
>
> What is "pairwise-key PKI"?
>
> pair-wise set of private/public key, i.e. a designated key pair per port. I'll try
> to word it better
>
> Section 5.4.2 says "Low processing overhead and efficiency messaging are also a
> goal."
>
> I suggest replacing efficiency with efficient
>
> ack
>
> It also says "Message privacy achieved through full encryption is a non-goal"
>
> I suggest saying "Message confidentiality is a non-goal" instead.
>
> ack
>
> Section 5.4.3
>
> "Length of Fingerprint:  8 bits.  Length in 32-bit multiples of the
>       following fingerprint not including lifetime or nonces.  It allows
>       to navigate the structure when an unknown key type is present.  To
>       clarify a common cornercase a fingerprint with length of 0 bits is
>       presenting this field with value of 0."
>
> Does length 0 mean no fingerprint is present (i.e. fingerprints are not provided)?
> I don't understand that last sentence.
>
> yes, it does. I'll try to improve the wording.
>
> The definition for "Security Fingerprint" includes this sentence:
>
> "If the fingerprint is shorter than the significant bits are left aligned and
> remaining bits are set to 0."
>
> I don't understand this sentence. I think you mean that if the fingerprint
> bit length is not an even multiple of 32, then it is left-aligned, and the
> rightmost unused bits are set to 0. But that's just a guess.
>
> yes, I'll try to word it better.
>
> 5.4.4
>
> "Any implementation including RIFT security MUST generate and wrap around local
> nonces properly"
>
> I see the term "nonce" used elsewhere, but because it can wrap (and therefore
> repeat with regularity),
> I think this is a poor choice for naming this field. It seems to be more of a
> counter.
> I think most security folks would agree that a nonce used for security purposes
> should,
> by definition, repeat only with negligible probability.
>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://en.wikipedia.org/wiki/Cryptographic_nonce
> [https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Nonce-cnonce-uml.svg/1200px-Nonce-cnonce-uml.svg.png]
>
> Cryptographic nonce - Wikipedia
> In cryptography, a nonce is an arbitrary number that can be used just once in a
> cryptographic communication. It is similar in spirit to a nonce word, hence the
> name. It is often a random or pseudo-random number issued in an authentication
> protocol to ensure that old communications cannot be reused in replay attacks. They
> can also be useful as initialization vectors and in cryptographic hash ...
> en.wikipedia.org
>
> On a related note, does this really provide anti-replay protection?
> Elsewhere in the document (e.g. section 5.4.4) it says that implementations could
> go up to
> 5 minutes without incrementing nonces. Can they send multiple packets with the
> same
> nonce during this interval? If so, what prevents replay of a captured packet
> within that interval?
>
> Also, because wrapping (of this 16 bit value) is supported, it's also possible
> that an earlier packet could be replayed (assuming the peer nonce also aligned),
> right? The odds of this seem low, but could the protocol/endpoint states be
> manipulated to improve the odds? Not sure. But if you are assuming this can't
> happen, this security-relevant assumption should be called out.
>
>   1. Correct, for efficiency purposes we open up to a 5 min window which we
> consider an acceptable risk per point 2
>   2. it is the combination of local and remote nonce so it's really a 32 bit
> number. The chance that the combination repeats is obviously very small.
>
> 5.4.7 says
>
>    "If an implementation supports disabling the security envelope
>    requirements while sending a security envelope an implementation
>    could shut down the security envelope procedures while maintaining an
>    adjacency and make changes to the algorithms on both sides then re
>    enable the security envelope procedures but that introduces security
>    holes during the disabled period."
>
> Aside from the fact that this needs word-smithing, should this be called out in
> the security
> considerations section? This seems to be saying that it's not a good idea to
> temporarily
> maintain adjacency while disabling security, so is this a SHOULD NOT?
>
> Will improve wording. Yes, it is a SHOULD NOT but sometimes implementations do
> that to not lose adjacency and change keys easily.
>
> section 8.4
>
> flodding -> flooding
>
> section 8.4 also says
>
>    It is expected that an
>    implementation detecting too many fake losses or misorderings due to
>    the attack on the number would simply suppress its further processing.
>
> what are "fake losses"?
>
> I am not a routing expert, so there may be additional concerns that someone better
> versed in routing would raise.
> Will improve wording. "Fake losses" is a possible attack vector where an attacker
> intercepts packets, modifies the packet number to simulate a "packet number
> loss/misorder" and forwards the packet on.
>
> Please fwd' to reviewer if needed & tell me whether you want notification on new
> version ...
>
> --- tony
>
> ________________________________
> From: Jeffrey (Zhaohui) Zhang
> Sent: Thursday, April 18, 2019 7:26 AM
> To: Antoni Przygienda
> Subject: RIFT security review
>
> Hi Tony,
>
> Please see
> https://datatracker.ietf.org/doc/review-ietf-rift-rift-04-secdir-early-kelly-2019-04-11/.
>
> Thanks.
> Jeffrey
>
> Juniper Internal
>

From nobody Tue Apr 23 18:32:24 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4373E120172 for ; Tue, 23 Apr 2019 18:32:22 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.337 X-Spam-Level: X-Spam-Status: No, score=-1.337 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id JAJh246Rp4lJ for ; Tue, 23 Apr 2019 18:32:19 -0700 (PDT) Received: from mx0b-00273201.pphosted.com (mx0b-00273201.pphosted.com [67.231.152.164]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 1BBB5120156 for ; Tue, 23 Apr 2019 18:32:19 -0700 (PDT) Received: from pps.filterd (m0108160.ppops.net [127.0.0.1]) by
mx0b-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3O1TRqX005958; Tue, 23 Apr 2019 18:32:16 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : cc : subject : date : message-id : references : in-reply-to : content-type : mime-version; s=PPS1017; bh=H2SHLHEUSBne259HGQ3+N5e1bXBLazygTJf55U3o69Y=; b=mV6IcRLjFzVag1Eepn8JB8+RRVOCkIJRjxi7cc0rPXbumrtS+Q1fkn7283TAmd5hGutA zT8+Vl2SFIv387fJbCeKK12iFD8gfNwXPFKeQFjSO5Y1yAmvFssSWSTXDZe1fSH93HgS iMFxv3TiEnNeSLuIwQX0QmJGjSUXxiQXUs7oklZodXgu70NM4R8zIlqsj1rihCavi4NJ 45Y0rhyyieDkww6Mqj/2uEJxrLRbRheqcr1LgDA1TGSkiOzb5FHqTQyPgGIuUFLhyQQE reVjsPGld8iw4KzmhfADb1tqnvucT1HXated56r/0YvGh2wVeFJtP9Hs++Rrqe2n9XI0 ZA== Received: from nam03-co1-obe.outbound.protection.outlook.com (mail-co1nam03lp2053.outbound.protection.outlook.com [104.47.40.53]) by mx0b-00273201.pphosted.com with ESMTP id 2s29asgc5b-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Tue, 23 Apr 2019 18:32:15 -0700 Received: from MWHPR05MB3279.namprd05.prod.outlook.com (10.173.230.18) by MWHPR05MB3197.namprd05.prod.outlook.com (10.173.229.140) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1835.7; Wed, 24 Apr 2019 01:32:13 +0000 Received: from MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202]) by MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202%10]) with mapi id 15.20.1835.010; Wed, 24 Apr 2019 01:32:13 +0000 From: Antoni Przygienda To: "Scott G. 
Kelly" , "Jeffrey (Zhaohui) Zhang" CC: Bruno Rijsman , "rift@ietf.org" Thread-Topic: RIFT security review Thread-Index: AQHU+jVaXs0vLJXv+EuMv17QAGUsTaZKhUum Date: Wed, 24 Apr 2019 01:32:12 +0000 Message-ID: References: ,<1556066007.970621387@apps.rackspace.com> In-Reply-To: <1556066007.970621387@apps.rackspace.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.129.239.11] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 76dd14a0-3dbe-444f-57a9-08d6c854ad54 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600141)(711020)(4605104)(4618075)(2017052603328)(7193020); SRVR:MWHPR05MB3197; x-ms-traffictypediagnostic: MWHPR05MB3197: x-ms-exchange-purlcount: 4 x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:3173; x-forefront-prvs: 00179089FD x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(376002)(136003)(366004)(396003)(39860400002)(346002)(51444003)(189003)(199004)(6506007)(4326008)(5660300002)(15650500001)(14454004)(9686003)(33656002)(55016002)(53936002)(966005)(8936002)(54896002)(236005)(478600001)(81156014)(6306002)(7696005)(6436002)(476003)(76176011)(229853002)(8676002)(446003)(74316002)(53546011)(81166006)(6246003)(99286004)(110136005)(54906003)(97736004)(66574012)(105004)(66476007)(256004)(14444005)(6116002)(25786009)(64756008)(68736007)(73956011)(66946007)(66446008)(66556008)(316002)(30864003)(3846002)(186003)(71190400001)(76116006)(71200400001)(19627405001)(26005)(6636002)(486006)(66066001)(1941001)(2906002)(52536014)(102836004)(7116003)(606006)(7736002)(11346002)(3480700005)(86362001); DIR:OUT; SFP:1102; SCL:1; SRVR:MWHPR05MB3197; H:MWHPR05MB3279.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; MX:1; A:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) 
x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: nmF9yoCmUJI+vjdq3f9ORBVlyhjyZt72Rsp5UXqGT8l48peiIduOJTj1r7HR4YQ5FTXRquy3GqcCG2CQJKf3XqVMwP/Jnqknk1I/IrtsiIRKLC3NbtZQxg35BcdPrtyIv0xGAlsriUalWYtYbcBofiZ4/DpD+DObXsZk2BSCO/xmHujUtWacxgAG4EPEP+nj52tPB1jCI7iqh0iRukYt4InuWE0VaDUyQd4XGU7g1hDQqIZpP7SBr+WYXooX8sq9dPzr0ziZWounf4PJhKdtQ+Vb6V6Q4cXUapTNuRQJuW5/DkKRnkjcHyYko2/KSeWImNzvcvIkdEecxFYUaS9P9k+vsn4uUDuR1p1B3ieyqc5J0uger63E9HGcqIbIT2ajRYszjfyWkQWqnH5vgGO3PqFFR4d2emImbTjC2Kcl2z0= Content-Type: multipart/alternative; boundary="_000_MWHPR05MB32795F5B83AD6EE610BDCF57AC3C0MWHPR05MB3279namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 76dd14a0-3dbe-444f-57a9-08d6c854ad54 X-MS-Exchange-CrossTenant-originalarrivaltime: 24 Apr 2019 01:32:12.8866 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR05MB3197 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-24_01:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904240010 Archived-At: Subject: Re: [Rift] RIFT security review X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Apr 2019 01:32:22 -0000 --_000_MWHPR05MB32795F5B83AD6EE610BDCF57AC3C0MWHPR05MB3279namp_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable thanks Scott, I think I added in the 
document some text saying "that's not _really_ a nonce but close" and that it's really the combination of local & remote that is used as salt, so it's 32 bits (so replay is statistically only likely for the duration of 5min/2 and only after 2^32/2 sequences have been recorded). If you think the text is not sufficient, we can rename it to "pseudo-nonce" or something like this.

I probably misunderstood your comment to an extent ...

I hope that my reasoning as to "performance" was clear: having a "real" nonce and cryptographically fingerprinting _every_ LIE on every interface every time, and every TIE on transmission/retransmission, would represent a very high load ...

And even if we insist on "perfect nonces", we have to allow for a "nonce slip" since LIEs may get lost and so on, so the protocol must allow +/- 5 nonces local/neighbor anyway to work under lossy links. So there is a certain replay vector that is unavoidable, I think ...

thanks

--- tony

________________________________
From: Scott G. Kelly
Sent: Tuesday, April 23, 2019 5:33 PM
To: Jeffrey (Zhaohui) Zhang
Cc: Antoni Przygienda; Bruno Rijsman; rift@ietf.org
Subject: Re: RIFT security review

Hi Jeffrey,

I haven't yet reviewed the document, but based on your reply, I think that all of my comments are addressed except for one.
From Antonio's reply below:

>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Cryptographic-5Fnonce&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=at8GTl8R0Z06MPVIsuXx0_C74NJeGZ1xePUdjnYACtg&e=

This is exactly my point: read the definition given for cryptographic nonce: "In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication"

The key here is "just once". Because you expect rollover, and because you allow one side to hold this constant up to 5 minutes (and repeat it in any messages flowing during that interval) the same value can be used more than once. So, it does not meet the commonly understood definition of nonce (and it does not have the security properties commonly expected of a nonce).

--Scott

On Tuesday, April 23, 2019 10:30am, "Jeffrey (Zhaohui) Zhang" said:

> Hi Scott,
>
> Thanks for your security review on RIFT spec. I am copying this to the RIFT
> mailing list.
>
> Please see Tony's response in the email below and the revisions in
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ietf.org_id_draft-2Dietf-2Drift-2Drift-2D05.txt&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=W2GfTd20tkv2UsNNsAQ5w7JfplP2RJNK1o4Qb-NphAg&e=. Could you review again to
> make sure all your comments/concerns have been addressed?
>
> Thanks!
> Jeffrey
>
> Juniper Internal
> From: Antoni Przygienda
> Sent: Thursday, April 18, 2019 12:34 PM
> To: Jeffrey (Zhaohui) Zhang; rift-chairs@ietf.org; Bruno
> Rijsman
> Subject: Re: RIFT security review
>
> My comments on the security review:
>
> I have reviewed this document as part of the security directorate's ongoing effort
> to review all IETF documents being processed by the IESG. These comments were
> written primarily for the benefit of the security area directors. Document
> editors and WG chairs should treat these comments just like any other last call
> comments.
>
> The summary of the review is ready with issues
>
> From the abstract, this document outlines a specialized, dynamic routing protocol
> for Clos and fat-tree network topologies.
>
> (should that read CLOS?)
>
> Clos was a French mathematician in Bell Labs who invented the stuff so it's really
> "Clos"
>
> Wikipedia: "Clos networks are named after Bell Labs researcher
> Charles Clos, who proposed the model in 1952 as a way to overcome the
> performance- and cost-related challenges of electromechanical switches then used
> in telephone networks."
>
> Following is a brief summary of comments and questions by section.
>
> 5.4.1 includes this sentence:
>
> The most security conscious operators will want to have full control
> over which port on which router/switch is connected to the respective
> port on the "other side", which we will call the "port-association
> model" (PAM) achievable e.g. by pairwise-key PKI.
>
> What is "pairwise-key PKI"?
>
> pair-wise set of private/public key, i.e. a designated key pair per port. I'll try
> to word it better
>
> Section 5.4.2 says "Low processing overhead and efficiency messaging are also a
> goal."
>
> I suggest replacing efficiency with efficient
>
> ack
>
> It also says "Message privacy achieved through full encryption is a non-goal"
>
> I suggest saying "Message confidentiality is a non-goal" instead.
>
> ack
>
> Section 5.4.3
>
> "Length of Fingerprint: 8 bits. Length in 32-bit multiples of the
> following fingerprint not including lifetime or nonces. It allows
> to navigate the structure when an unknown key type is present. To
> clarify a common cornercase a fingerprint with length of 0 bits is
> presenting this field with value of 0."
>
> Does length 0 mean no fingerprint is present (i.e. fingerprints are not provided)?
> I don't understand that last sentence.
>
> yes, it does. I try to improve the wording.
>
> The definition for "Security Fingerprint" includes this sentence:
>
> "If the fingerprint is shorter than the significant bits are left aligned and
> remaining bits are set to 0."
>
> I don't understand this sentence. I think you mean that if the fingerprint
> bit length is not an even multiple of 32, then it is left-aligned, and the
> rightmost unused bits are set to 0. But that's just a guess.
>
> yes, I try to word better.
>
> 5.4.4
>
> "Any implementation including RIFT security MUST generate and wrap around local
> nonces properly"
>
> I see the term "nonce" used elsewhere, but because it can wrap (and therefore
> repeat with regularity),
> I think this is a poor choice for naming this field. It seems to be more of a
> counter.
> I think most security folks would agree that a nonce used for security purposes
> should,
> by definition, repeat only with negligible probability.
>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Cryptographic-5Fnonce&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=at8GTl8R0Z06MPVIsuXx0_C74NJeGZ1xePUdjnYACtg&e=
>
> Cryptographic nonce - Wikipedia
> In cryptography, a nonce is an arbitrary number that can be used just once in a
> cryptographic communication. It is similar in spirit to a nonce word, hence the
> name. It is often a random or pseudo-random number issued in an authentication
> protocol to ensure that old communications cannot be reused in replay attacks. They
> can also be useful as initialization vectors and in cryptographic hash ...
> en.wikipedia.org
>
> On a related note, does this really provide anti-replay protection?
> Elsewhere in the document (e.g. section 5.4.4) it says that implementations could
> go up to
> 5 minutes without incrementing nonces. Can they send multiple packets with the
> same
> nonce during this interval? If so, what prevents replay of a captured packet
> within that interval?
>
> Also, because wrapping (of this 16 bit value) is supported, it's also possible
> that an earlier packet could be replayed (assuming the peer nonce also aligned),
> right? The odds of this seem low, but could the protocol/endpoint states be
> manipulated to improve the odds? Not sure. But if you are assuming this can't
> happen, this security-relevant assumption should be called out.
>
> 1. Correct, for efficiency purposes we open up to a 5 min window which we
> consider an acceptable risk per point 2
> 2. it is the combination of local and remote nonce so it's really a 32 bit
> number. The chance that the combination repeats is obviously very small.
>
> 5.4.7 says
>
> "If an implementation supports disabling the security envelope
> requirements while sending a security envelope an implementation
> could shut down the security envelope procedures while maintaining an
> adjacency and make changes to the algorithms on both sides then re
> enable the security envelope procedures but that introduces security
> holes during the disabled period."
>
> Aside from the fact that this needs word-smithing, should this be called out in
> the security
> considerations section? This seems to be saying that it's not a good idea to
> temporarily
> maintain adjacency while disabling security, so is this a SHOULD NOT?
>
> Will improve wording. Yes, it is a SHOULD NOT but sometimes implementations do
> that to not lose adjacency and change keys easily.
>
> section 8.4
>
> flodding -> flooding
>
> section 8.4 also says
>
> It is expected that an
> implementation detecting too many fake losses or misorderings due to
> the attack on the number would simply suppress its further processing.
>
> what are "fake losses"?
>
> I am not a routing expert, so there may be additional concerns that someone better
> versed in routing would raise.
> Will improve wording. "Fake losses" is a possible attack vector where an attacker
> intercepts packets, modifies the packet number to simulate a "packet number
> loss/misorder" and forwards the packet on.
>
> Please fwd' to reviewer if needed & tell me whether you want notification on new
> version ...
>
> --- tony
>
> ________________________________
> From: Jeffrey (Zhaohui) Zhang
> Sent: Thursday, April 18, 2019 7:26 AM
> To: Antoni Przygienda
> Subject: RIFT security review
>
> Hi Tony,
>
> Please see
> https://urldefense.proofpoint.com/v2/url?u=https-3A__datatracker.ietf.org_doc_review-2Dietf-2Drift-2Drift-2D04-2Dsecdir-2Dearly-2Dkelly-2D2019-2D04-2D11_&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=ph0HbUWIyvt-sCe809MFpw7LlO3iXVYLDPyoZ6Uze10&e=.
>
> Thanks.
> Jeffrey
>
> Juniper Internal
>

--_000_MWHPR05MB32795F5B83AD6EE610BDCF57AC3C0MWHPR05MB3279namp_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
thanks Scott, I think I added in the document some text saying "that's not _really_ a nonce but close" and that it's really the combination of local & remote that is used as salt, so it's 32 bits (so replay is statistically only likely for the duration of 5min/2 and only after 2^32/2 sequences have been recorded). If you think the text is not sufficient, we can rename it to "pseudo-nonce" or something like this.
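
As a rough illustration of the 32-bit arithmetic above, here is a minimal sketch. It is not taken from the draft: the packing order (local nonce in the high half) and the names `combined_nonce`, `TOTAL_COMBINATIONS` and `HALF_SPACE` are assumptions made only for this example. The one thing it relies on from the thread is that each nonce is a 16-bit value.

```python
NONCE_BITS = 16  # width of each nonce, as mentioned in the review


def combined_nonce(local: int, remote: int) -> int:
    """Pack two 16-bit nonces into one 32-bit value.

    The layout (local in the high 16 bits) is an assumption for
    illustration, not something the draft specifies.
    """
    mask = (1 << NONCE_BITS) - 1
    if not (0 <= local <= mask and 0 <= remote <= mask):
        raise ValueError("nonces must fit in 16 bits")
    return (local << NONCE_BITS) | remote


# Number of distinct (local, remote) pairs, and half of that space.
TOTAL_COMBINATIONS = 1 << (2 * NONCE_BITS)  # 2^32
HALF_SPACE = TOTAL_COMBINATIONS // 2        # 2^32 / 2 = 2^31
```

The point of the sketch is only that two independent 16-bit values span 2^32 pairs; whether an attacker can steer either side toward a previously seen pair is a separate question that the code does not address.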

I probably misunderstood your comment to an extent ...

I hope that my reasoning as to "performance" was clear: having a "real" nonce and cryptographically fingerprinting _every_ LIE on every interface every time, and every TIE on transmission/retransmission, would represent a very high load ...

And even if we insist on "perfect nonces", we have to allow for a "nonce slip" since LIEs may get lost and so on, so the protocol must allow +/- 5 nonces local/neighbor anyway to work under lossy links. So there is a certain replay vector that is unavoidable, I think ...
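
The "nonce slip" tolerance could be sketched roughly as follows. This is a hypothetical model, not the draft's procedure: it assumes a 16-bit nonce that wraps around and treats the +/- 5 figure from the paragraph above as a circular distance; the names `nonce_distance` and `within_slip` are invented for the example.

```python
NONCE_MODULUS = 1 << 16  # 16-bit nonce space, wraps around
MAX_SLIP = 5             # tolerance mentioned in the message above


def nonce_distance(a: int, b: int) -> int:
    """Shortest distance between two nonces on the 16-bit circle."""
    diff = (a - b) % NONCE_MODULUS
    return min(diff, NONCE_MODULUS - diff)


def within_slip(received: int, expected: int, max_slip: int = MAX_SLIP) -> bool:
    """Accept a nonce if it is at most max_slip steps from the expected one."""
    return nonce_distance(received, expected) <= max_slip
```

For example, `within_slip(3, 65534)` holds because the two values are five steps apart across the wrap point, while `within_slip(10, 3)` does not. Any packet whose nonce falls inside that window passes this check, which is the replay vector the paragraph calls unavoidable.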

thanks

--- tony


From: Scott G. Kelly <scott@hyperthought.com>
Sent: Tuesday, April 23, 2019 5:33 PM
To: Jeffrey (Zhaohui) Zhang
Cc: Antoni Przygienda; Bruno Rijsman; rift@ietf.org
Subject: Re: RIFT security review
 
Hi Jeffrey,

I haven't yet reviewed the document, but based on your reply, I think that all of my comments are addressed except for one. From Antonio's reply below:

>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Cryptographic-5Fnonce&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=at8GTl8R0Z06MPVIsuXx0_C74NJeGZ1xePUdjnYACtg&e=


This is exactly my point: read the definition given for cryptographic nonce: "In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication"

The key here is "just once". Because you expect rollover, and because you allow one side to hold this constant up to 5 minutes (and repeat it in any messages flowing during that interval) the same value can be used more than once. So, it does not meet the commonly understood definition of nonce (and it does not have the security properties commonly expected of a nonce).
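
A toy model may make the concern concrete. The sketch below is not RIFT's actual validation logic; it assumes a hypothetical receiver that accepts any packet carrying the currently advertised nonce, and the names `accepts`, `original` and `replayed` are invented for illustration.

```python
def accepts(packet_nonce: int, current_nonce: int) -> bool:
    """Toy check: a packet is accepted if it carries the current nonce."""
    return packet_nonce == current_nonce


# The sender holds the same nonce for the whole interval (up to 5 minutes).
current_nonce = 0x1234
original = {"nonce": current_nonce, "payload": b"LIE"}
replayed = dict(original)  # an attacker's byte-for-byte copy of the packet

first = accepts(original["nonce"], current_nonce)   # genuine packet
second = accepts(replayed["nonce"], current_nonce)  # replay in the same interval
```

Both checks succeed, since nothing in the nonce distinguishes the copy from the genuine packet while the value is held constant. A value used strictly once would make the second check fail.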

--Scott


On Tuesday, April 23, 2019 10:30am, "Jeffrey (Zhaohui) Zhang" <zzhang@juniper.net> said:

> Hi Scott,
>
> Thanks for your security review on RIFT spec. I am copying this to the RIFT
> mailing list.
>
> Please see Tony's response in the email below and the revisions in
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ietf.org_id_draft-2Dietf-2Drift-2Drift-2D05.txt&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=W2GfTd20tkv2UsNNsAQ5w7JfplP2RJNK1o4Qb-NphAg&e=. Could you review again to
> make sure all your comments/concerns have been addressed?
>
> Thanks!
> Jeffrey
>
>
>
> Juniper Internal
> From: Antoni Przygienda <prz@juniper.net>
> Sent: Thursday, April 18, 2019 12:34 PM
> To: Jeffrey (Zhaohui) Zhang <zzhang@juniper.net>; rift-chairs@ietf.org; Bruno
> Rijsman <brunorijsman@gmail.com>
> Subject: Re: RIFT security review
>
> My comments on the security review:
>
>
> I have reviewed this document as part of the security directorate's ongoing effort
> to review all IETF documents being processed by the IESG.  These comments were
> written primarily for the benefit of the security area directors.  Document
> editors and WG chairs should treat these comments just like any other last call
> comments.
>
>
>
> The summary of the review is ready with issues
>
>
>
> From the abstract, this document outlines a specialized, dynamic routing protocol
> for Clos and fat-tree network topologies.
>
>
>
> (should that read CLOS?)
>
> Clos was a French mathematician in Bell Labs who invented the stuff so it's really
> "Clos"
>
> Wikipedia: "Clos networks are named after Bell Labs researcher
> Charles Clos, who proposed the model in 1952 as a way to overcome the
> performance- and cost-related challenges of electromechanical switches then used
> in telephone networks."
>
> Following is a brief summary of comments and questions by section.
>
>
>
> 5.4.1 includes this sentence:
>
>
>
>    The most security conscious operators will want to have full control
>    over which port on which router/switch is connected to the respective
>    port on the "other side", which we will call the "port-association
>    model" (PAM) achievable e.g. by pairwise-key PKI.
>
>
>
> What is "pairwise-key PKI"?
>
> pair-wise set of private/public key, i.e. a designated key pair per port. I'll try
> to word it better
>
>
> Section 5.4.2 says "Low processing overhead and efficiency messaging are also a
> goal."
>
>
>
> I suggest replacing efficiency with efficient
>
> ack
>
> It also says "Message privacy achieved through full encryption is a non-goal"
>
>
>
> I suggest saying "Message confidentiality is a non-goal" instead.
>
> ack
>
>  Section 5.4.3
>
> "Length of Fingerprint:  8 bits.  Length in 32-bit multiples of the
>       following fingerprint not including lifetime or nonces.  It allows
>       to navigate the structure when an unknown key type is present.  To
>       clarify a common cornercase a fingerprint with length of 0 bits is
>       presenting this field with value of 0."
>
>
>
> Does length 0 mean no fingerprint is present (i.e. fingerprints are not provided)?
> I don't understand that last sentence.
>
> yes, it does. I try to improve the wording.
>
> The definition for "Security Fingerprint" includes this sentence:
>
>
>
> "If the fingerprint is shorter than the significant bits are left aligned and
> remaining bits are set to 0."
>
>
>
> I don't understand this sentence. I think you mean that if the fingerprint
> bit length is not an even multiple of 32, then it is left-aligned, and the
> rightmost unused bits are set to 0. But that's just a guess.
>
> yes, I try to word better.
>
>  5.4.4
>
> "Any implementation including RIFT security MUST generate and wrap around local
> nonces properly"
>
>
>
> I see the term "nonce" used elsewhere, but because it can wrap (and therefore
> repeat with regularity),
> I think this is a poor choice for naming this field. It seems to be more of a
> counter.
> I think most security folks would agree that a nonce used for security purposes
> should,
> by definition, repeat only with negligible probability.
>
> I was under the impression that nonce is a well-known term in cryptography
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Cryptographic-5Fnonce&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=at8GTl8R0Z06MPVIsuXx0_C74NJeGZ1xePUdjnYACtg&e=
>
> Cryptographic nonce - Wikipedia
> In cryptography, a nonce is an arbitrary number that can be used just once in a
> cryptographic communication. It is similar in spirit to a nonce word, hence the
> name. It is often a random or pseudo-random number issued in an authentication
> protocol to ensure that old communications cannot be reused in replay attacks. They
> can also be useful as initialization vectors and in cryptographic hash ...
> en.wikipedia.org
>
>
>
> On a related note, does this really provide anti-replay protection?
> Elsewhere in the document (e.g. section 5.4.4) it says that implementations could
> go up to
> 5 minutes without incrementing nonces. Can they send multiple packets with the
> same
> nonce during this interval? If so, what prevents replay of a captured packet
> within that interval?
>
>
>
> Also, because wrapping (of this 16 bit value) is supported, it's also possible
> that an earlier packet could be replayed (assuming the peer nonce also aligned),
> right? The odds of this seem low, but could the protocol/endpoint states be
> manipulated to improve the odds? Not sure. But if you are assuming this can't
> happen, this security-relevant assumption should be called out.
>
>
>
>   1.  Correct, for efficiency purposes we open up to a 5 min window which we
> consider an acceptable risk per point 2
>   2.  it is the combination of local and remote nonce so it's really a 32 bit
> number. The chance that the combination repeats is obviously very small.
>
>
>  5.4.7 says
>
>
>
>    "If an implementation supports disabling the security envelope
>    requirements while sending a security envelope an implementation
>    could shut down the security envelope procedures while maintaining an
>    adjacency and make changes to the algorithms on both sides then re
>    enable the security envelope procedures but that introduces security
>    holes during the disabled period."
>
>
>
> Aside from the fact that this needs word-smithing, should this be called out in
> the security
> considerations section? This seems to be saying that it's not a good idea to
> temporarily
> maintain adjacency while disabling security, so is this a SHOULD NOT?
>
> Will improve wording. Yes, it is a SHOULD NOT but sometimes implementations do
> that to not lose adjacency and change keys easily.
>
>
>  section 8.4
>
> flodding -> flooding
>
>
>
> section 8.4 also says
>
>
>
>    It is expected that an
>    implementation detecting too many fake losses or misorderings due to
>    the attack on the number would simply suppress its further processing.
>
>
>
> what are "fake losses"?
>
>
>
> I am not a routing expert, so there may be additional concerns that someone better
> versed in routing would raise.
> Will improve wording. "Fake losses" is a possible attack vector where an attacker
> intercepts packets, modifies the packet number to simulate a "packet number
> loss/misorder" and forwards the packet on.
>
> Please fwd' to reviewer if needed & tell me whether you want notification on new
> version ...
>
> --- tony
>
> ________________________________
> From: Jeffrey (Zhaohui) Zhang
> Sent: Thursday, April 18, 2019 7:26 AM
> To: Antoni Przygienda
> Subject: RIFT security review
>
> Hi Tony,
>
> Please see
> https://urldefense.proofpoint.com/v2/url?u=https-3A__datatracker.ietf.org_doc_review-2Dietf-2Drift-2Drift-2D04-2Dsecdir-2Dearly-2Dkelly-2D2019-2D04-2D11_&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=maKXfKzgRTpiLitqHnJiww&m=ewT6IR2ygwqlSkxXUvque4qSUqKng8_ycsGJH_mNH6U&s=ph0HbUWIyvt-sCe809MFpw7LlO3iXVYLDPyoZ6Uze10&e=.
>
> Thanks.
> Jeffrey
>
> Juniper Internal
>


--_000_MWHPR05MB32795F5B83AD6EE610BDCF57AC3C0MWHPR05MB3279namp_-- From nobody Wed Apr 24 10:47:43 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A392C12040E for ; Wed, 24 Apr 2019 10:47:36 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.337 X-Spam-Level: X-Spam-Status: No, score=-1.337 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, KHOP_DYNAMIC=1.363, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=juniper.net Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id qQa57mJehG9z for ; Wed, 24 Apr 2019 10:47:35 -0700 (PDT) Received: from mx0a-00273201.pphosted.com (mx0a-00273201.pphosted.com [208.84.65.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id EEEE0120359 for ; Wed, 24 Apr 2019 10:47:34 -0700 (PDT) Received: from pps.filterd (m0108156.ppops.net [127.0.0.1]) by mx0a-00273201.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3OHjObv006624; Wed, 24 Apr 2019 10:47:31 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=juniper.net; h=from : to : subject : date : message-id : references : in-reply-to : content-type : mime-version; s=PPS1017; bh=LgsM9oe+vmkYne7Hsez6tTSmYOlWlWLiVqRlNR7W3Ak=; b=mlBtyhxI6cTuL5QHlTneyUqX598/rMkrMFl65nuTX1wMxky/fbMw0T2LmVjo7Mse7A0h +Gystuam8jxuhuYW9oki8wWNGbNBpbOX9uSLoP1PAbzuO/OXGqSZ7gYZCniiNj4vHnB+ oh06Y7oYGMAmW1Cxn62ubcEYRqcubuIHKWQxwUsGHOvmlQUVXZ+o0+8Ho1x5w3JcVVBu 8cHhQixLk8U76UieTc6f2vj38S2PK7yIvWKaK5ZC7ukzO2OhRR83fC+2yC5xBAyQZ9ZZ 
0CTUDxifOdeVBZbLR8Tnj2RRMWZaY7gS5tk8/r/P+/kEtpRTTgxD9yqzfwVcYJL8Ob6z pw== Received: from nam05-co1-obe.outbound.protection.outlook.com (mail-co1nam05lp2054.outbound.protection.outlook.com [104.47.48.54]) by mx0a-00273201.pphosted.com with ESMTP id 2s2ng1grdw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 24 Apr 2019 10:47:31 -0700 Received: from MWHPR05MB3279.namprd05.prod.outlook.com (10.173.230.18) by MWHPR05MB3181.namprd05.prod.outlook.com (10.173.229.136) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1835.9; Wed, 24 Apr 2019 17:47:29 +0000 Received: from MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202]) by MWHPR05MB3279.namprd05.prod.outlook.com ([fe80::c104:c5bd:b877:2202%10]) with mapi id 15.20.1835.010; Wed, 24 Apr 2019 17:47:29 +0000 From: Antoni Przygienda To: Bruno Rijsman , "rift@ietf.org" Thread-Topic: Initial implementation of security in RIFT-Python is complete Thread-Index: AQHU+WF75SyDYoAPLUWqAFRVz5SOtqZLmJe1 Date: Wed, 24 Apr 2019 17:47:29 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [66.129.239.10] x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: 70dbf06c-2d08-462d-bd1a-08d6c8dcebc4 x-ms-office365-filtering-ht: Tenant x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600141)(711020)(4605104)(4618075)(2017052603328)(7193020); SRVR:MWHPR05MB3181; x-ms-traffictypediagnostic: MWHPR05MB3181: x-ms-exchange-purlcount: 1 x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:4502; x-forefront-prvs: 00179089FD x-forefront-antispam-report: SFV:NSPM; 
SFS:(10019020)(366004)(376002)(396003)(346002)(136003)(39860400002)(189003)(199004)(53376002)(446003)(54896002)(52536014)(19627405001)(53366004)(14444005)(99286004)(11346002)(110136005)(33656002)(316002)(256004)(236005)(476003)(53546011)(6246003)(6306002)(53936002)(2906002)(86362001)(9686003)(2501003)(486006)(3846002)(6506007)(6116002)(102836004)(5660300002)(55016002)(76176011)(105004)(186003)(26005)(7696005)(15650500001)(8676002)(25786009)(8936002)(81156014)(71200400001)(81166006)(14454004)(966005)(74316002)(4744005)(7110500001)(66946007)(66476007)(64756008)(7736002)(66446008)(73956011)(68736007)(76116006)(66556008)(229853002)(2420400007)(606006)(97736004)(6436002)(478600001)(66066001)(71190400001)(493534005)(19477635001); DIR:OUT; SFP:1102; SCL:1; SRVR:MWHPR05MB3181; H:MWHPR05MB3279.namprd05.prod.outlook.com; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; MX:1; A:1; received-spf: None (protection.outlook.com: juniper.net does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: MXIMvM0HVbMdcmLgGv8RzrcZXGaJh0VqUkXnwzw1CoVsEDq4rAXP+sXFpFDUDBvaRDAjdAZxBl/3G/K+fxtUY7+xy1/DWdliD8iC2iMzxt/pCb1DB3utngdJDCT07hVhaOgPDQH2P8WUVn0BkNrgvkfS72kf7AyDyKvLtDKq7lR9GLQQjIUQzzc6bfTimhXnzwj+LyehZFEaXr2yETjDh65cXiiywBjap9+g3JUl42xALK6nAzpQalXd/brk+x/jgCQkMlj+DHfp7WQCgrdkfFsIKfUGIBiDrQbJKyTpbDxecZNsPxBtKmWjObMjZMGh1tfzX4/QGT98bH2243rDY4odC7w0t2dF1FHNwh1sX81ZpEFtSTKmHej6VJakUIvEyCBpJhMS8W2A6elgb/OFr+I1iw/5iUdr4Evrsm70074= Content-Type: multipart/alternative; boundary="_000_MWHPR05MB3279AB3BE295959AD2AF1BADAC3C0MWHPR05MB3279namp_" MIME-Version: 1.0 X-OriginatorOrg: juniper.net X-MS-Exchange-CrossTenant-Network-Message-Id: 70dbf06c-2d08-462d-bd1a-08d6c8dcebc4 X-MS-Exchange-CrossTenant-originalarrivaltime: 24 Apr 2019 17:47:29.3033 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: bea78b3c-4cdb-4130-854a-1d193232e5f4 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED 
X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR05MB3181 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-04-24_11:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_spam_notspam policy=outbound_spam score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904240130 Archived-At: Subject: Re: [Rift] Initial implementation of security in RIFT-Python is complete X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Apr 2019 17:47:41 -0000 --_000_MWHPR05MB3279AB3BE295959AD2AF1BADAC3C0MWHPR05MB3279namp_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable read your guide in detail. makes all perfect sense. will extend schema to w= hat you suggest so inter'op ... thanks --- tony ________________________________ From: Bruno Rijsman Sent: Monday, April 22, 2019 4:16 PM To: rift@ietf.org Subject: Initial implementation of security in RIFT-Python is complete I have finished the initial implementation of security in RIFT-Python (secu= rity envelope, keys, fingerprints, nonces, packet-nr, etc. etc.) See http://bit.ly/rift-python-security-feature-guide for a detailed feature guide. While implementing the code, I gathered a number of comments on the securit= y section of the draft -05. I will report these in a follow-up e-mail. -- Bruno --_000_MWHPR05MB3279AB3BE295959AD2AF1BADAC3C0MWHPR05MB3279namp_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
read your guide in detail. makes all perfect sense. will extend schema to what you suggest so inter'op ...

thanks

--- tony


From: Bruno Rijsman <brunorijsman@gmail.com>
Sent: Monday, April 22, 2019 4:16 PM
To: rift@ietf.org
Subject: Initial implementation of security in RIFT-Python is complete

I have finished the initial implementation of security in RIFT-Python (security envelope, keys, fingerprints, nonces, packet-nr, etc. etc.)

See http://bit.ly/rift-python-security-feature-guide for a detailed feature guide.

While implementing the code, I gathered a number of comments on the security section of the draft -05. I will report these in a follow-up e-mail.

-- Bruno
--_000_MWHPR05MB3279AB3BE295959AD2AF1BADAC3C0MWHPR05MB3279namp_-- From nobody Thu Apr 25 08:21:56 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 9900512023C for ; Thu, 25 Apr 2019 08:21:54 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.999 X-Spam-Level: X-Spam-Status: No, score=-1.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id z9wdlY-N5SQa for ; Thu, 25 Apr 2019 08:21:53 -0700 (PDT) Received: from mail-ed1-x52d.google.com (mail-ed1-x52d.google.com [IPv6:2a00:1450:4864:20::52d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 96BE5120203 for ; Thu, 25 Apr 2019 08:21:30 -0700 (PDT) Received: by mail-ed1-x52d.google.com with SMTP id c1so242941edk.5 for ; Thu, 25 Apr 2019 08:21:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=fmbiyzMcp8pvJ/R+wz1Bo/wMtnWFjskJz1vBFLtWmw0=; b=ISopEjnv0Ukc2185xYNx4fVUVUJPa6S4ztpI4eTo9HS2waCp8q22HQ+YrNYCg5pD0B lDf5JUZFrUGkLW2vc5nt7BbiES+DTe+vH/nIZXt5mGqlzhmcxahcddx7I2lYgnvu4eyQ Ma9AFzFIahZdFavFzKQ0OPaXdYOer4BQPrUNLCF0v8CYCDieZGuVgfy1p3VEezc4ELfA FhMJld5pc5grLjN1WCN4BcbFxn3/u8mBxVj7ByoAIZBzYIxNT+qnTKdTO5gtYkJa399K Ms+cWt1A0H/IrbpelQyWmp/riJf+T6meoIk2IMpDsgec6+w1r4Bzj334MeB0YYvjvfLT 9IUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=fmbiyzMcp8pvJ/R+wz1Bo/wMtnWFjskJz1vBFLtWmw0=; b=lIkK9nr7ve2JAraySZ8GnWaNS6uY2dFcjalt9xCC2+xeJjW+FGP2kphBs0sBjFALGw hGVvgQCfXi5AIF9gRbnt/uSNlhASpGpqL78FkNNXBhuZoZFh5Ah3cxdgSaOJnuRWeFsv XvAk/Z9gtS6PZfe4IzP+Rz15v9B4+FJKOzo52uDeDqfqqLSrB6YTyTcpJGrwWGz0/glZ MEZTr0SkjXGmyp9GTq/TNvquYmflZzpWX1EWnXbx1Ifk4/EOJs7CcXgkvVSYBHsCVhdX uOYn5rgm6wC2P0BilB4oQql2uCX7oqcNP+sX+wS1W1suddPOjzhTBJdc0x3g7Ppbo0mt 7AIg== X-Gm-Message-State: APjAAAUyZ3HlCNlB4CxMbix0CU8UrxkEiDXZUNZ5WBQPp9rVlBXT+jRw qszVBpXTPFj5EUNhroSVmxhaduXbCLAzhcPhM0CXj96l X-Google-Smtp-Source: APXvYqwNUR0+NyywikiboOfQhNQE58lI23M+7M+4uSYhTlqUuBkqOF0Fh3xfB6B/DfmFcCd/7AOf9ubGDR2wh8X/PJg= X-Received: by 2002:a17:906:6a03:: with SMTP id o3mr12330899ejr.6.1556205689079; Thu, 25 Apr 2019 08:21:29 -0700 (PDT) MIME-Version: 1.0 From: Tony Przygienda Date: Thu, 25 Apr 2019 08:20:52 -0700 Message-ID: To: rift@ietf.org Content-Type: multipart/alternative; boundary="0000000000004b8b2805875c5f82" Archived-At: Subject: [Rift] today's dial in ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Apr 2019 15:21:55 -0000 --0000000000004b8b2805875c5f82 Content-Type: text/plain; charset="UTF-8" nothing on agenda, just status report and missing items to go last call * either security considerations document or we get AD sign-off to leave all the security in main spec as it is * applicability draft * yang model tasks assigned/being assigned by chairs further: * multicast no further progress * Bruno expected to start a thread on experience with security implications seen on implementation thanks -- tony --0000000000004b8b2805875c5f82 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
nothing on agenda, just status report and missing items to go to last call

* either security considerations document or we get AD sign-off to leave all the security in main spec as it is
* applicability draft
* yang model

tasks assigned/being assigned by chairs

further:

* multicast no further progress
* Bruno expected to start a thread on experience with security implications seen on implementation

thanks

-- tony
--0000000000004b8b2805875c5f82-- From nobody Thu Apr 25 20:15:56 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 3C58512004B for ; Thu, 25 Apr 2019 20:15:54 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.998 X-Spam-Level: X-Spam-Status: No, score=-1.998 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HG1DOZUZMK-P for ; Thu, 25 Apr 2019 20:15:51 -0700 (PDT) Received: from mail-ed1-x52c.google.com (mail-ed1-x52c.google.com [IPv6:2a00:1450:4864:20::52c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 53FE01200B4 for ; Thu, 25 Apr 2019 20:15:51 -0700 (PDT) Received: by mail-ed1-x52c.google.com with SMTP id u57so1915742edm.3 for ; Thu, 25 Apr 2019 20:15:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=LX7ajaYG6paIoaGHkTr9hz2fAaYRmReKJSa6EMdiHh4=; b=auw9vkRK1jgJL4GaZEGsodu43SiUk77pp+sjeNh67vElnAu8o0Gu3h874klqTu/bEl /seTyPjl7xV1c63/Wn9ShM5vN2IuhXlwnoE1Av9lpea4MOFBKz40OFf8ZckGYv8g4YDI rGr4+rb72VFsJHGLE5Tw+yjcUrgr13fqFhvWNsU38VICXVGKb6giqC1G9ET/K0yRvvSu KELE0S2JXB7OMX3Mjy8aGVUMSN+8eCeG6r+jCEfhEF5/iGPBSjdPyg9EWo0MWQFx/NXU YYVi+fgKFkvDE8wcuD/ce/snOLvyLDo5ogGmMMBJGHgbsaWlsxGEKj8HvSXr8t/bl/OW O16w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=LX7ajaYG6paIoaGHkTr9hz2fAaYRmReKJSa6EMdiHh4=; b=M0+OjAdAI/4AVfKFeTlob7zL3A6U5SdhxueM5n1pZbloNmOrwXvWn84zynvyzSXAJr DaCoCnJDvhv7eqJQkALLZ3cb6v8wV2V39haZvO8ES5o8LkDm7MHBHC2RkfzbvoBuXn8n cj+FG8g7nxfkZqj99PUaLi+o5BM7s94WMQYcaXcSp4jYxr0KWMtGIdKp0KjEYQU+xbt7 ErLP3CrdqCx9FQ9T/VobqRIBtjmM5SM1XAzFA1HNFQQCX1cRQAlyDmZrS3Q/6PCgU6qj t+4r/WYCWpEC2/+qJVQC6OmyvYaGQq6mjC97zf7uywO5VXHzsFoy1XDW1F/QmSD0ia1A YkmQ== X-Gm-Message-State: APjAAAVQnJIfmv/DWYBBxCmp2g4ljJur52ukmKG364Qyv5rghYDd7aMr sag5UT6+k+3kAZGa3gsknUv+FllP4f3PqylPCxc= X-Google-Smtp-Source: APXvYqxutDY/Cdm8FfXow8ZdRDDPsdAQdp+v3FlQnd6hL7vMCvwzlfv1Tm5DACzb7xELinp0RD8k/d7lr1gBV7qALcI= X-Received: by 2002:a50:ca0a:: with SMTP id d10mr26649181edi.140.1556248549891; Thu, 25 Apr 2019 20:15:49 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Tony Przygienda Date: Thu, 25 Apr 2019 20:15:12 -0700 Message-ID: To: Antoni Przygienda Cc: Bruno Rijsman , "rift@ietf.org" Content-Type: multipart/alternative; boundary="000000000000ffa1150587665908" Archived-At: Subject: Re: [Rift] Initial implementation of security in RIFT-Python is complete X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Apr 2019 03:15:54 -0000 --000000000000ffa1150587665908 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Ok, I looked over the stuff in more detail again and I see that the yaml schema is only a limited security model and would hence like to extend it a bit. Let me know what you think At the top I'd like to add private-secret to support private/public key and not only shared {?} keys: {+} - id: <24-bit key number> {1} algorithm: [hmac-sha-256] {1} secret: {?} private-secret: *under -name (i.e. 
per node) it would be good to have * {?} tie_validation: [none|permissive|loose|strict] to support testing of the common models of processing of signatures Then under interface we'd need {?} active_key: <8-bit key number> {?} accept_keys: {?} lie_validation: [none|permissive|loose|strict] (6) so we can test mix of interfaces using different keys and not using them ta all (we can share the global keys for that purpose since it's simpler but can only use the 8-bit IDs) --- tony On Wed, Apr 24, 2019 at 10:47 AM Antoni Przygienda wrote: > read your guide in detail. makes all perfect sense. will extend schema to > what you suggest so inter'op ... > > thanks > > --- tony > > ------------------------------ > *From:* Bruno Rijsman > *Sent:* Monday, April 22, 2019 4:16 PM > *To:* rift@ietf.org > *Subject:* Initial implementation of security in RIFT-Python is complete > > I have finished the initial implementation of security in > RIFT-Python (security envelope, keys, fingerprints, nonces, packet-nr, et= c. > etc.) > > See http://bit.ly/rift-python-security-feature-guide > for > a detailed feature guide. > > While implementing the code, I gathered a number of comments on the > security section of the draft -05. I will report these in a follow-up > e-mail. > > -- Bruno > _______________________________________________ > RIFT mailing list > RIFT@ietf.org > https://www.ietf.org/mailman/listinfo/rift > --000000000000ffa1150587665908 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Ok, I looked over the stuff in more detail again and I see that the yaml schema is only a limited security model and would hence like to extend it a bit. Let me know what you think.

At the top I'd like to add private-secret to support private/public keys and not only shared ones:

{?} keys:
{+}   - id: <24-bit key number>
{1}     algorithm: [hmac-sha-256]
{1}     secret: <string>
{?}     private-secret: <string>

under -name (i.e. per node) it would be good to have
{?}      tie_validation: [none|permissive|loose|strict]  


to support testing of the common models of processing of signatures

Then under interface we'd need

{?}       active_key: <8-bit key number>
{?}       accept_keys: <set of 8-bit key numbers>
{?}       lie_validation: [none|permissive|loose|strict] (6)

so we can test a mix of interfaces using different keys and not using them at all (we can share the global keys for that purpose since it's simpler but can only use the 8-bit IDs)
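
To make the proposed schema extension concrete, here is a minimal Python sketch of a consistency check over such a configuration. It is only an illustration under assumptions: the class and function names (`Key`, `Interface`, `check_config`) are hypothetical and not taken from RIFT-Python, the default validation mode of "none" is assumed, and the meaning of the four validation modes is deliberately not modeled since the thread does not define it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Validation modes named in the proposal; their semantics are not modeled here.
VALIDATION_MODES = {"none", "permissive", "loose", "strict"}


@dataclass
class Key:
    id: int                               # 24-bit key number
    algorithm: str                        # e.g. "hmac-sha-256"
    secret: str
    private_secret: Optional[str] = None  # proposed optional addition


@dataclass
class Interface:
    active_key: Optional[int] = None      # 8-bit key number
    accept_keys: Set[int] = field(default_factory=set)
    lie_validation: str = "none"          # assumed default


def check_config(keys: List[Key], interfaces: Dict[str, Interface],
                 tie_validation: str = "none") -> List[str]:
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    known = set()
    for key in keys:
        if not 0 <= key.id < 2 ** 24:
            problems.append(f"key {key.id}: id does not fit in 24 bits")
        known.add(key.id)
    if tie_validation not in VALIDATION_MODES:
        problems.append(f"tie_validation: unknown mode {tie_validation!r}")
    for name, intf in interfaces.items():
        if intf.lie_validation not in VALIDATION_MODES:
            problems.append(f"{name}: unknown lie_validation {intf.lie_validation!r}")
        used = set(intf.accept_keys)
        if intf.active_key is not None:
            used.add(intf.active_key)
        for key_id in sorted(used):
            # Interfaces share the global keys but can only reference 8-bit IDs.
            if not 0 <= key_id < 2 ** 8:
                problems.append(f"{name}: key {key_id} does not fit in 8 bits")
            elif key_id not in known:
                problems.append(f"{name}: key {key_id} is not a configured key")
    return problems
```

An interface that references a global key with an ID above 255 is reported as a problem, which mirrors the remark that interfaces can share the global keys but only through 8-bit IDs.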

--- tony

On Wed, Apr 24, 2019 at 10:47 AM Antoni Przygienda <prz=40juniper.net@dmarc.ietf.org> wrote:
read your guide in detail. makes all perfect sense. will extend schema to what you suggest so inter'op ...

thanks

--- tony


From: Bruno Rijsman <brunorijsman@gmail.com>
Sent: Monday, April 22, 2019 4:16 PM
To: rift@ietf.org
Subject: Initial implementation of security in RIFT-Python is complete

I have finished the initial implementation of security in RIFT-Python (security envelope, keys, fingerprints, nonces, packet-nr, etc. etc.)

See http://bit.ly/rift-python-security-feature-guide for a detailed feature guide.

While implementing the code, I gathered a number of comments on the security section of the draft -05. I will report these in a follow-up e-mail.

-- Bruno
_______________________________________________
RIFT mailing list
RIFT@ietf.org
https://www.ietf.org/mailman/listinfo/rift
--000000000000ffa1150587665908-- From nobody Sun Apr 28 08:19:40 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 888D9120152 for ; Sun, 28 Apr 2019 08:19:38 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -0.999 X-Spam-Level: X-Spam-Status: No, score=-0.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, FREEMAIL_REPLY=1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=no autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id XLRKldqqS5xi for ; Sun, 28 Apr 2019 08:19:36 -0700 (PDT) Received: from mail-ed1-x52c.google.com (mail-ed1-x52c.google.com [IPv6:2a00:1450:4864:20::52c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 03A5012014F for ; Sun, 28 Apr 2019 08:19:36 -0700 (PDT) Received: by mail-ed1-x52c.google.com with SMTP id a8so4993290edx.3 for ; Sun, 28 Apr 2019 08:19:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to:cc; bh=nLWpQHREd0IGGKFnT0DdPWiY2U0Mi5+14gqTtA4KLug=; b=DhDzrNYqPgdsvlcArflQil0ZzVvZ2meq9Ih11m2b30UO45kaDmtOmu3BrSlf5oxTDd +Fbq74Df6avFDw9RTKaPlJ8sVMwEjdBPIQJ5fPKDIB/6hF/Cm8n5uO6vYivkz9mM+W3N xB4g3eFr0fSsai4YH54K0MV0xeoUIqLhl+ji1yNbwM+S5LtrkKMqf2vsZ2RoxGcEDZVl hAK1k836zru9wWXt+S4oN/ykcuBDXTXigiVRKF03AXb2Q0OpkfvCY95eIlEMZTpTlexn 9uMEuAS/GVGmNKyq1RvgC0Q/7u3Daa41xey1rqdpCsz26BtiPNnltO6BL+p8nF8YJg0E fKIg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:mime-version:from:date:message-id:subject:to:cc; bh=nLWpQHREd0IGGKFnT0DdPWiY2U0Mi5+14gqTtA4KLug=; b=jlbuSqH/M2wVXyJ03HZhemCnai4SGHhj5d4zJ5mF4k+eDSNPOixJLWK9VZXHqZN8VG l4XzFpYWOb81KjZtbAbMQgGu622iWVb7neZxzRoXFIDPmk/97Vc4E6jvtABeIK2W3uXA BBON2awGiakvTmystRiUA04+x64V5ncOsU4kjw/wEvCi7h+eoCsySPcgLHsq5+Poe/cW ywQAAyL8kBeVV/inLsEFPwFOSuc5b1f2rcNIooe55wmJBklrwzZQEoma7EYFb0QoDFaU D/vniJCYPBF9jDEtp6pAAN/zFZwdqlwd3GIY5DWnR6sTkfEnDxNf0356JCz9cmB8RkCx 19WQ== X-Gm-Message-State: APjAAAVX6sxETAp8QjVxAz8IcHuTOLBJxWcN4r5xQ/Wg+Fh6PjuUzH20 SHHd2e2gSyFFZlyeJuhORbSlwsvzUHmX2DdwRXI= X-Google-Smtp-Source: APXvYqz73Ku+n7N8xWh8toqv66i45GjxRrwZZDWVluLj+kER7G5brafBK78/bUyw+RxUtTm3rMjBMmKbp5x7FpdQpU8= X-Received: by 2002:a05:6402:1256:: with SMTP id l22mr9013207edw.22.1556464774458; Sun, 28 Apr 2019 08:19:34 -0700 (PDT) MIME-Version: 1.0 From: Tony Przygienda Date: Sun, 28 Apr 2019 08:18:58 -0700 Message-ID: To: xu.benchong@zte.com.cn Cc: rift@ietf.org Content-Type: multipart/alternative; boundary="000000000000fcb01d058798b12d" Archived-At: Subject: [Rift] cc: from private thread, flooding rules ... 
X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Apr 2019 15:19:38 -0000 --000000000000fcb01d058798b12d Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Benchong, I'm copy'ing mailing list from the private thread we started through for posterity ;-) My answers marked with bold > And yes, very good catch on flooding in first try, you must have really thought stuff through carefully ;-) --- tony ------------------------------ *From:* xu.benchong@zte.com.cn *Sent:* Sunday, April 28, 2019 2:31 AM *To:* Jeffrey (Zhaohui) Zhang *Cc:* EXT-zhang.zheng@zte.com.cn; Antoni Przygienda; jefftant.ietf@gmail.co= m; EXT-zzhang_ietf@hotmail.com *Subject:* Re:RE: Re:I would drive fwd' adoption of drafts ... 1. Something about ZTP Now the rift protocol ZTP model can auto build rift networks without configuration(except IP address/IPv6 address/vpn, and ToF level/leaf-2-leaf as rift-05 5.2.7, and Pod of ToP/Mobility attr). > *in fact, v6 address can be obtained by ND, yes, ToF must be set (there is no way what it up or down in a fabric otherwise). Yes, PoD of top is the same discussion as ToF. I'm not aware we need any configuration for mobility attributes * Node or some links status changing may cause level change and then more nodes need rebuild the neighbors, which break the REQ10 and REQ12. NODE1-----NODE2 | | | | NODE3-----NODE4 | | NODE5 ...... | | NODE6 ...... For example: NODE1 AND NODE2 level =3D 10, then after ZTP, NODE3 level =3D = 9, NODE4 =3D 9, NODE5 =3D 8, NODE6 =3D 7. When link between NODE1 and NODE3 break, after ZTP, NODE3 level =3D 8, NOD= E5 =3D 7, NODE6 =3D 6. Link unstable case network unstable. In the above example, Link between NODE1 and NODE3 unstable will case NODE3/NODE5/NODE6 unstable. I think rift should support two stage, ZTP and STABLE. 
*> your picture fell apart in the formatting so I'm trying to reconstruct or you have to resend it in fixed width ASCII (or just join the weekly on Zoom if you can and we'll draw). I don't know how Node4 would be 9 in this stage, the horizontal link you draw would be really an uplink for node 4 and with that node 4 would be @ level 8. Also I see two links between NOde1 and node 3. If node 2 is level 10 and node 3 has link to node 2 then node 3 would not change level. * All node learned its level and PoD in ZTP stage. And they should not change level In STABLE stage when HAL change, or they can get level and PoD configuration by KV tie in any stage, *> Yes, it would be possible to configure PoDs by KV but I don't see that as being much different from configuring it on the device itself. The KV would need to be directed so the node knows it is being targeted. Level you cannot get from KV ;-) since you can't form 3-way until you have level (observe that LIEs carry the level offers and with that levels are negotiated to form 3-way). And you can't flood KV until you have an established topology. * 2=E3=80=81KV tie can also take IP address/IPv6 address/vpn from ToF to othe= rs, and the ToF get all networks configuration by Netconf or any other way. *> yes, all addressing could be configured via KV but that's lots info on a large fabric and again, I don't see a benefit compared to configuring the device itself. Generally, information that you want "gathered" @ top of the fabric like e.g. ARP/IP bindings or properties of nodes in fabric is helped by KV, stuff that is node specific like "this node needs this ID" is not very much since you flood part/all of fabric with it with no use & extend the blast radius. Also, information that needs propagating south through all or part fabric is helped by KV, e.g. key rollovers. * KV tie should at least include type key-length key-value value-length value *> well, no ;-) * 1. 
*schema itself contains the length already, both key and value are strings to be interpreted as seen fit* 2. *type is not important, it's a string and the implementation can parse it any way it wants. There is no value is having e.g. a union allowing for all possible thrift types IMO. * 3. *As I said since a bit, what we need is a well-known K/V document that reserves stuff lots of people want as fair I see like ARP/IP bindings an= d then something where a key can be used by a vendor using his OUI prefix, something like well-known "OUI:XXXX:..." * 3.There are some other questions C.3.2.2 6. if DBTIE.HEADER < HEADER then I) if originator is this node then bump_own_tie else i. if this is a N-TIE header from a northbound neighbor then override DBTIE in LSDB with HEADER ii. else put HEADER into REQKEYS C.3.3.2 a. for every HEADER in TIRE do 1. DBTIE =3D find HEADER in current LSDB 2. if DBTIE not found then do nothing 3. if DBTIE.HEADER < HEADER then put HEADER into REQKEYS Why not put DBTIE.HEADER but put HEADER into REQKEYS *> well, DBTIE.HEADER is _smaller_ than HEADER but we really want to get the header (in fact its content) from the neighbor so when we request we have to request the newest HEADER. But yes, albeit IMO rather confusing, putting DBTIE.HEADER into REQKEYS would also work since we would request from neighbor an obsolete header which he would answer with the newest TIE version. * C.3.4. b. 3. if DBTIE.HEADER > TIE.HEADER then i. if DBTIE has content already then TXTIE =3D TIE Why not TXTIE =3D DBTIE but TXTIE =3D TIE *> right on ;-) and accolades since you found an oversight in flooding rules write-down that good amount of people fine-combed already and are implemented at least twice (and you seem to be doing it at least in a third version ;-) The protocol would stabilize nevertheless on TIDEs as far I see but still, a very good catch. Yes, this must be DBTIE (in fact, would you mind to check Bruno's python code for this omission, please). 
I will correct & add you to accolades section ;-) * --000000000000fcb01d058798b12d Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Benchong,

I'm copying the mailing list from the private thread we started, for posterity ;-) My answers are marked with bold >

And yes, very good catch on flooding in first try, you must have really thought stuff through carefully ;-)

--- tony


From: xu.benchong@zte.com.cn <xu.benchong@zte.com.cn>
Sent: Sunday, April 28, 2019 2:31 AM
To: Jeffrey (Zhaohui) Zhang
Cc: EXT-zhang.zheng@zte.com.cn; Antoni Przygienda; jefftant.ietf@gmail.com; EXT-zzhang_ietf@hotmail.com
Subject: Re:RE: Re:I would drive fwd' adoption of drafts ...


1. Something about ZTP

Now the RIFT protocol ZTP model can automatically build RIFT networks without configuration (except IP address/IPv6 address/VPN, ToF level/leaf-2-leaf as in rift-05 5.2.7, and PoD of ToP/mobility attributes).

> in fact, v6 addresses can be obtained by ND. yes, ToF must be set (there is no way to know what is up or down in a fabric otherwise). Yes, PoD of ToP is the same discussion as ToF. I'm not aware we need any configuration for mobility attributes

Node or link status changes may cause level changes, and then more nodes need to rebuild their neighbors, which breaks REQ10 and REQ12.


NODE1-----NODE2
  |         |
  |         |
NODE3-----NODE4
  |
  |
NODE5 ......
  |
  |
NODE6 ......

For example: NODE1 and NODE2 level = 10, then after ZTP, NODE3 level = 9, NODE4 = 9, NODE5 = 8, NODE6 = 7.

When the link between NODE1 and NODE3 breaks, after ZTP, NODE3 level = 8, NODE5 = 7, NODE6 = 6.

An unstable link causes an unstable network. In the above example, an unstable link between NODE1 and NODE3 will cause NODE3/NODE5/NODE6 to be unstable.

I think RIFT should support two stages, ZTP and STABLE.


> your picture fell apart in the formatting so I'm trying to reconstruct or you have to resend it in fixed width ASCII (or just join the weekly on Zoom if you can and we'll draw). I don't know how Node4 would be 9 in this stage, the horizontal link you draw would be really an uplink for node 4 and with that node 4 would be @ level 8. Also I see two links between Node1 and node 3. If node 2 is level 10 and node 3 has link to node 2 then node 3 would not change level.
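
The level numbers in the question can be reproduced with a deliberately simplified Python sketch of the "level = HAL - 1" rule, where HAL is the highest level offered by any neighbor. This is an illustration under assumptions, not the ZTP procedure of the draft: it treats every link as carrying level offers (including the NODE3-NODE4 link), recomputes from scratch, and ignores all other ZTP rules. The function name `derive_levels` and the node labels are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def derive_levels(links: Iterable[Tuple[str, str]],
                  configured: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Derive levels as (highest neighbor level - 1); configured nodes keep theirs."""
    neighbors: Dict[str, Set[str]] = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    levels: Dict[str, Optional[int]] = {n: configured.get(n) for n in neighbors}
    # Synchronous rounds until nothing changes; one round per node is enough
    # for level offers to travel across the whole topology.
    for _ in range(len(neighbors)):
        updated = dict(levels)
        for node in neighbors:
            if node in configured:
                continue
            offers = [levels[n] for n in neighbors[node] if levels[n] is not None]
            updated[node] = max(offers) - 1 if offers else None
        if updated == levels:
            break
        levels = updated
    return levels


LINKS = [("NODE1", "NODE2"), ("NODE1", "NODE3"), ("NODE2", "NODE4"),
         ("NODE3", "NODE4"), ("NODE3", "NODE5"), ("NODE5", "NODE6")]
TOF = {"NODE1": 10, "NODE2": 10}

intact = derive_levels(LINKS, TOF)
broken = derive_levels([l for l in LINKS if l != ("NODE1", "NODE3")], TOF)
```

Under these assumptions, removing the NODE1-NODE3 link lowers NODE3, NODE5 and NODE6 by one level each, which is the ripple effect the question describes. Whether the NODE3-NODE4 link really behaves this way is exactly what the reply above disputes.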


All nodes learn their level and PoD in the ZTP stage. They should not change level in the STABLE stage when HAL changes, or they can get level and PoD configuration by KV TIE in any stage.

> Yes, it would be possible to configure PoDs by KV but I don't see that as being much different from configuring it on the device itself. The KV would need to be directed so the node knows it is being targeted. Level you cannot get from KV ;-) since you can't form 3-way until you have level (observe that LIEs carry the level offers and with that levels are negotiated to form 3-way). And you can't flood KV until you have an established topology.

2. KV TIE can also carry IP address/IPv6 address/VPN from ToF to others, and the ToF gets all network configuration by Netconf or any other way.

> yes, all addressing could be configured via KV but that's lots of info on a large fabric and again, I don't see a benefit compared to configuring the device itself. Generally, information that you want "gathered" @ top of the fabric like e.g. ARP/IP bindings or properties of nodes in fabric is helped by KV, stuff that is node specific like "this node needs this ID" is not very much since you flood part/all of fabric with it with no use & extend the blast radius. Also, information that needs propagating south through all or part of the fabric is helped by KV, e.g. key rollovers.


KV TIE should at least include

type

key-length

key-value

value-length

value


> well, no ;-)


  1. schema itself contains the length already, both key and value are strings to be interpreted as seen fit
  2. type is not important, it's a string and the implementation can parse it any way it wants. There is no value in having e.g. a union allowing for all possible thrift types IMO.
  3. As I have been saying for a while, what we need is a well-known K/V document that reserves stuff lots of people want, as far as I see like ARP/IP bindings, and then something where a key can be used by a vendor using his OUI prefix, something like well-known "OUI:XXXX:..."


3. There are some other questions

C.3.2.2

6.  if DBTIE.HEADER < HEADER then
            I)    if originator is this node then bump_own_tie else
                  i.     if this is a N-TIE header from a northbound
                         neighbor then override DBTIE in LSDB with HEADER
                  ii.    else put HEADER into REQKEYS


C.3.3.2

    a.  for every HEADER in TIRE do
        1.  DBTIE = find HEADER in current LSDB
        2.  if DBTIE not found then do nothing
        3.  if DBTIE.HEADER < HEADER then put HEADER into REQKEYS

Why not put DBTIE.HEADER instead of HEADER into REQKEYS?


> well, DBTIE.HEADER is _smaller_ than HEADER but we really want to get the header (in fact its content) from the neighbor so when we request we have to request the newest HEADER. But yes, albeit IMO rather confusing, putting DBTIE.HEADER into REQKEYS would also work since we would request from neighbor an obsolete header which he would answer with the newest TIE version.
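
The three quoted C.3.3.2 steps can be sketched in Python as follows. This is a minimal illustration under assumptions: headers are compared by sequence number only (the draft's comparison involves more than that), the LSDB is a plain dictionary keyed by a TIE identifier, and the names `Header` and `process_tire` are hypothetical. Any steps of the draft beyond the three quoted ones are not modeled.

```python
from typing import Dict, List, NamedTuple


class Header(NamedTuple):
    tie_id: str   # stands in for the full TIE identifier
    seq_nr: int   # assumption: ordering is by sequence number only


def process_tire(tire: List[Header], lsdb: Dict[str, Header]) -> List[Header]:
    """Return REQKEYS: headers to request because the neighbor has a newer version."""
    reqkeys = []
    for header in tire:
        dbtie = lsdb.get(header.tie_id)   # 1. find HEADER in current LSDB
        if dbtie is None:                 # 2. not found: do nothing
            continue
        if dbtie.seq_nr < header.seq_nr:  # 3. ours is older: request the newer one
            reqkeys.append(header)
    return reqkeys
```

Appending `header` rather than `dbtie` matches the answer above: the request names the newest version known to exist, although asking for the older one would still be answered with the newest TIE.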


C.3.4.  b.

      3.  if DBTIE.HEADER > TIE.HEADER then
            i.     if DBTIE has content already then TXTIE = TIE

Why not TXTIE = DBTIE instead of TXTIE = TIE?


> right on ;-) and accolades since you found an oversight in the flooding rules write-down that a good number of people have already fine-combed and that are implemented at least twice (and you seem to be doing it at least in a third version ;-) The protocol would stabilize nevertheless on TIDEs as far as I see but still, a very good catch. Yes, this must be DBTIE (in fact, would you mind checking Bruno's python code for this omission, please). I will correct & add you to accolades section ;-)
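
The corrected rule can be illustrated with a short Python sketch. It covers only the single branch quoted from C.3.4 and rests on assumptions: versions are compared by sequence number alone, a header-only database entry is modeled as `content=None`, and the names `Tie` and `tie_to_transmit` are hypothetical rather than taken from any implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tie:
    tie_id: str
    seq_nr: int                      # assumption: ordering is by sequence number only
    content: Optional[bytes] = None  # None models a header-only LSDB entry


def tie_to_transmit(dbtie: Tie, tie: Tie) -> Optional[Tie]:
    """Return TXTIE for the quoted branch, or None when that branch does not apply."""
    if dbtie.seq_nr > tie.seq_nr and dbtie.content is not None:
        # Corrected rule: send our newer copy back (TXTIE = DBTIE),
        # not the stale TIE that was just received.
        return dbtie
    return None  # the remaining branches of C.3.4 are not modeled here
```

Sending back the received TIE would only echo the stale version to its sender, which is why the database copy is the one to transmit.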


--000000000000fcb01d058798b12d-- From nobody Tue Apr 30 07:06:52 2019 Return-Path: X-Original-To: rift@ietfa.amsl.com Delivered-To: rift@ietfa.amsl.com Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 29CCC1200D6 for ; Tue, 30 Apr 2019 07:06:50 -0700 (PDT) X-Virus-Scanned: amavisd-new at amsl.com X-Spam-Flag: NO X-Spam-Score: -1.998 X-Spam-Level: X-Spam-Status: No, score=-1.998 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id oU0kGwhm6NxV for ; Tue, 30 Apr 2019 07:06:44 -0700 (PDT) Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com [IPv6:2a00:1450:4864:20::52e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 192DF1200B6 for ; Tue, 30 Apr 2019 07:06:44 -0700 (PDT) Received: by mail-ed1-x52e.google.com with SMTP id e56so6127037ede.7 for ; Tue, 30 Apr 2019 07:06:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=k5CVZjv6wx/C7B7qQucqTUe30Eljw65mn2OHDraonSY=; b=gQiq77uCCooWYE5pwO1O7Wka/Mk6RCz40IKcjP96LGRFEUNYCJ1pLkeJt3vIvF15cx SkyWYmqCklfAbSfTbQoLlFCCFvD5qPzZQ2GyqG1zqNt1JmbXJ+Uhfk6CNH6iqoE6TIU9 o5FtPM0564WHILjjXF7e7+nnUUgScnQKxfkSgG5+xknGeBXSX7eTd2XB1l3DlABuJMyr v/GnTrXRbYXCEH9m/lj7nfTy8+yr726FuUJxQaCVpsNKlCePe/TSR53re0CZdc1oI7f+ FEknx5Iym7BwvaqvZUsNvYUmA12A/xMuQtCpks8tIhmk8DbjMgTn0exKpptPa5YdY8+P D8jg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=k5CVZjv6wx/C7B7qQucqTUe30Eljw65mn2OHDraonSY=; b=kUybbp/ZcgBsVAmQFr2PigGKfacb/vwu2ngX8cXcq76mrpBYRWM6TCSRFBsSER7BTQ SutPcQM6dez4lsG0NcLua+zx30bJ72ynoXnEShL+anz5HDIcujSe24VLS3B/W0kMS9JG MAMHELUwGLC+pjCX3U6kbab5oR89lm9sIJUDb1Co3RsZYTPoge0pZJsUhKMMmVrLaimX cP03o5atYtLZNw167V3b51Cti13KEgEl+LFedb0MtD/qd+0aTwbFkMkHdccrZeJR+djh XV0qwB1u7ci9clmrDiFUnnidRhUU2gCdd3F44RQ+oqCkJvj7BXsuj4zIamX3m3VKrhV/ KzYQ== X-Gm-Message-State: APjAAAWicUzlV5rDsqSx3teMtlypQ8XtjRIrUPm1U6rpy9s2vhwoPiSd AaJQwkwnJxbFYhr2hVtaKckEmm5LewqA7+QQbmU= X-Google-Smtp-Source: APXvYqza0EE0dilyY2lNDngbkLlrw9AeXWZhRUjOQCK9Dd7njFvXFXeWneWtXITsBzghn6lhLke5LjAe9kRkQnkDJtI= X-Received: by 2002:a50:ac02:: with SMTP id v2mr43245477edc.86.1556633202614; Tue, 30 Apr 2019 07:06:42 -0700 (PDT) MIME-Version: 1.0 References: <201904301702426933703@zte.com.cn> In-Reply-To: <201904301702426933703@zte.com.cn> From: Tony Przygienda Date: Tue, 30 Apr 2019 07:06:06 -0700 Message-ID: To: xu.benchong@zte.com.cn Cc: rift@ietf.org, "zhang.zheng" Content-Type: multipart/alternative; boundary="0000000000001667aa0587bfe94d" Archived-At: Subject: Re: [Rift] cc: from private thread, flooding rules ... X-BeenThere: rift@ietf.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion of Routing in Fat Trees List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 30 Apr 2019 14:06:50 -0000 --0000000000001667aa0587bfe94d Content-Type: text/plain; charset="UTF-8"

On Tue, Apr 30, 2019 at 2:02 AM wrote:

>
> Tony,
>
> Thanks for your reply.
>

Xu, my pleasure ;-)

>
> 1. It's not a standard Fat Tree, NODE3 has only one north neighbor NODE1,
> and NODE4 has a north neighbor NODE2. So when the link between NODE1 and
> NODE3 goes down, HAL of NODE3 changes from 10 to 9.
>

ok, what's the link node3-node4 and node1-node2? horizontal links?
if node3 is down, it cannot have a HAL so I try to still interpret it.
Basically, if you are thinking about Node1 _OR_ Node2 going away then the
other node will always seed. let's assume node1 died, then you'll see

Node2
|
Node4
|
Node3

hierarchy being built, which is intended. if e.g. node3 goes down, your
network is partitioned. there will be two networks

node1-node2
 |
node4

the rest will not be able to obtain a seed (basically without a ToF node
you can't build up a fabric using ZTP since you simply don't know what is
up and what is down)

>
> NODE1-----NODE2
>   |         |
>   |         |
> NODE3-----NODE4
>   |
>   |
> NODE5 ......
>   |
>   |
> NODE6 ......
>
> ZTP is very good for network initialization. If ZTP is working all the
> way, its use may be limited for security reasons. Any new higher level
> node connecting to the network, or the last highest level neighbor going
> down, may cause a network rebuild. It will be easy to attack, so I think
> it's better for it to work only in the node start stage.
>

I think the security section explains very well the trade-off between
security needs and ZTP possibilities ... This is nothing special for RIFT,
it's universal.

>
> 2. About mobility attributes, the leaf is dispatch address route with 32
> mask and connect route with shorter mask, whether is it support mobility
> prefix can be controlled.
>

I can't parse that really. Does RIFT support overlapping prefixes? Yes,
sure. if the /32 moves then it will be a more specific match. The problem
here is obviously if traffic forwarded out of a PoD hits a match southbound
on the aggregate while the more specific has moved into another PoD. Then
it will greedily blackhole. Again, nothing specific for RIFT really,
aggregates blackhole if they attract traffic they cannot route. Multiple
solutions exist. Such prefixes could be leaked south e.g. but the question
is really, WHY would you need aggregates since RIFT automatically
aggregates/de-aggregates for you in sufficient fashion. An example would be
good.

>
> 3.
> > Yes, it would be possible to configure PoDs by KV but I don't see that
> > as being much different from configuring it on the device itself. The
> > KV would need to be directed so the node knows it is being targeted.
> > Level you cannot get from KV ;-) since you can't form 3-way until you
> > have level (observe that LIEs carry the level offers and with that
> > levels are negotiated to form 3-way). And you can't flood KV until you
> > have an established topology.
>
> If we support getting configuration by KV TIE, the network can be easily
> controlled by the ToF node, manually or automatically. ToF(s) can learn
> the network topology by N-TIE and send node level and PoDs by KV TIE to
> them after the ZTP stage to confirm the network is stable. Or the network
> can be centrally controlled by the ToF; the controller doesn't need to
> remember the IP addr of every node.
>

Well, you CAN'T send level from ToF since you won't have 3-way adjacencies
and with that flooding ;-) You need ZTP working first which will establish
level and then you _could_ send PoD to specific nodes via KV, that's true
but it's really _per_ node while KV is pruned flooding so it's a bit tricky
since you need to flood to the "level of the PoD" and then prune. But
again, picture/example would be good.

>
> 4. About the KV struct, I am not very familiar with thrift, and I'll learn
> about it :-)
>

sure, thrift is widely deployed, stable, elegant, powerful and easy so it's
worth learning ;-) It vacuums the floors as well but that's a hidden
feature :-P

>
> 5. I have checked Bruno's python code, there is no such problem
>
> process_rx_tie_packet_info
>
>             elif comparison > 0:
>                 # We have a newer version of the TIE, send it
>                 start_sending_tie_header = db_tie_packet.header
>

thanks for that. Like I said, a good number of people were in the weekly
meetings & we chewed the flooding rules over and over again ;-) while they
have been implemented to make sure we don't miss stuff.
So we probably talked through that in detail, agreed and then it was me who
miswrote them in the throes of editorship'ing ;-) on one of the iterations
since both implementations did the correct thing while the spec was wrong
as you pointed out. Again, great catch ;-)

--- tony

--0000000000001667aa0587bfe94d
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--0000000000001667aa0587bfe94d--
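
The aggregate blackholing described under point 2 of the message above can be illustrated with a small longest-prefix-match sketch. This is only an illustration under stated assumptions, not RIFT code: the function name `longest_prefix_match`, the prefixes, and the next-hop labels `pod1`/`pod2` are all hypothetical, and the table is a plain dictionary rather than any real RIB.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the (prefix, next_hop) pair with the longest prefix covering
    destination, or None if no route covers it."""
    addr = ipaddress.ip_address(destination)
    candidates = [(prefix, hop) for prefix, hop in table.items() if addr in prefix]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0].prefixlen)


# Hypothetical southbound view of a node above PoD 1: an aggregate plus a
# host route for a leaf address that currently lives in PoD 1.
host = ipaddress.ip_network("10.1.0.5/32")
table = {
    ipaddress.ip_network("10.1.0.0/16"): "pod1",  # aggregate
    host: "pod1",                                 # more specific
}
before_move = longest_prefix_match(table, "10.1.0.5")

# The host moves into PoD 2 and this node no longer holds the /32.
# Traffic now matches only the aggregate and is still sent into PoD 1,
# which can no longer deliver it: the aggregate blackholes.
del table[host]
after_move = longest_prefix_match(table, "10.1.0.5")

# If the moved /32 is made visible here (e.g. leaked south), the more
# specific route wins again and traffic follows the host.
table[host] = "pod2"
after_leak = longest_prefix_match(table, "10.1.0.5")
```

The point of the sketch is only that the most specific covering prefix decides the next hop, so an aggregate attracts traffic for every address it covers unless a more specific route is present.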