Secure Inter-Domain Routing WG (sidr) Minutes
IETF 87 - Berlin, DE

CHAIR(s): Sandra Murphy Sandra.Murphy at Sparta.com
          Chris Morrow morrowc at ops-netman.net
          Alexey Melnikov alexey.melnikov at isode.com

================================================================================

WEDNESDAY, July 31, 2013
1510-1610 CEST         Wednesday Afternoon Session II Potsdam 1

1)  Administrivia & Draft status                                       1510-1520

    Presenter: Chairs                                                           

   - Mailing list: http://www.ietf.org/mail-archive/web/sidr/index.html
   - WG Resources: http://tools.ietf.org/wg/sidr/ 
   - Minute taker: Andrew Chi
   - Jabber Scribe: Dan York, Warren Kumari
   - Blue Sheets
   - Agenda Bashing

2)  Existing WG Drafts                                                 
    - AS migration status adopted, comments incorporated
    - Paused (blocked) on reference
    - BGPSEC threats submitted to IESG, open season on requirements
    - ltamgmt will be discussed today, after a long silence
    - grandparenting draft now dead
    - A number of drafts are expired: rpsl-sig may need another author.

a)  Revisiting the RPKI LTAM                                           1520-1540
    Local Trust Anchor Management for the Resource Public Key         
    Infrastructure                                                       
    draft-ietf-sidr-ltamgmt-08.txt
    http://tools.ietf.org/html/draft-ietf-sidr-ltamgmt-08 

    Presenter: Steve Kent  

This is essentially LTAM version 2.  Additional motivation for this
version: (1) a nation wishes to inform the rest of the world that an
error above them in the RPKI hierarchy has occurred.  (2) Richard
Barnes caught a problem with LTAM version 1: covering ROA from an
upstream ISP may be disrupted by the perforation algorithm in version
1.  Note on slide 12: Solution details aren't finalized.  Exact
contents of INRRD TBD.

Roque Gagliano (Cisco): Are you planning to stop the current Local TA work?
Steve: The plan is to abandon the original approach and replace with version 2.
Rob Austein (DRL): [1] I wish to see the draft. [2] Let's not identify resource holders by SKI. [3] What will RPs do when a discrepancy is detected?
Steve: We won't dictate RP actions in discrepancy situations.
Ruediger Volk (DT): Glad to see this work addressing anomalies in the cert data space.  Does it make sense to split the monitoring (analyzing state changes of the system) vs the reporting vs the restoration (old LTAM)?  Not sufficient detail.
Steve:  Version 1 was local and did not address how RP *gets* the information that allows detecting anomalies against INR holder claims.  Response is always up to the RP.
Ruediger: We are in agreement.  Need to make sure there are no false positives.
Sam Weiler: I would like to see document on the list first. Request SIDR chairs to disallow presentations without pre-uploaded documents.
Rob: Use of SKI in version 1 is undesirable because it changes every time someone rekeys.
Steve: Fair.
Randy Bush (IIJ): Is anything the same between versions 1 and 2?
Steve: 2 requirements are preserved, a 3rd requirement added.  But the mechanism has changed.
Randy: This is another publication mechanism, and could be a wild path.  We need to see a document.  Also, Rob is already implementing version 1, with a different syntax.  It is not reasonable to change this unilaterally.
Steve: That's correct, I don't get to move it unilaterally.  We'll see what the WG thinks.
Arturo Servin (LACNIC): I like this, and you are solving a real problem.  Waiting for document to see how monitoring will work.
Carlos Martinez-Cagnazzo (LACNIC): I like the approach much better than version 1.  Let's see a document.
Doug Montgomery (NIST): Is the LOCK record under a local trust anchor?
Steve: No, it's signed by the same issuer as the normal objects, ROAs, etc.
Doug: Why can't the (outsourced) issuer change the LOCK URL?
Steve: This is addressed in the (TBD) document: the two-phase attack where the LOCK URL is first changed, and then later <bad stuff>.  RP can detect a LOCK URL change and put a hold on it to add inertia.
Doug: How does a RP know that something bad actually happened and this was not an accidental mistake?
Steve: Public mechanism (slashdot, etc.) will tell you if something was really malicious.  The LTAM mechanism just buys you time by falling back to previous good state.
Eric Osterweil (Verisign): I have problems with this approach, and would like to see a document.

b)  Multiple Repository Publication Points support in the Resource     1540-1555
    Public Key Infrastructure (RPKI)
    draft-ietf-sidr-multiple-publication-points-00
    http://tools.ietf.org/html/draft-ietf-sidr-multiple-publication-points 

    Presenter: Roque Gagliano 
    
Current validators have problems with multiple publication points in the TAL format.  Should we obsolete RFC 6490?

Rob Austein: Speaking as rcynic author, it's surprising that rcynic accepts the TAL format, since I didn't intentionally implement it.
Ruediger: Is there an analysis of what happens when there are multiple pub points?  What happens if there are inconsistencies?
Roque:  Today, you can load balance via DNS, which means there already might be inconsistencies.
Ruediger:  Sure.  But this approach makes the question more urgent.
Roque: This is relatively simple when there's only one trust anchor.
Tim Bruijnzeels (RIPE NCC): In principle, I am very interested in multiple pub points, but also see complications with it.  I do feel that multiple TAs is good, but want wording that instructs CAs to keep the TAs up to date, as well as advice to RPs on which ones to use, etc.  This could get quite complicated if we go to multiple pub points in certificates.  More on Friday (my presentation).
Andy Newton (ARIN): We did have a case where we suspect a validator was getting unexpected results due to DNS round robin underneath.
Randy:  The issue is not the TA cert, it's the rest of the tree underneath.  If you're dealing with redundant data, you must have a plan.  What happens when they are not the same?  Do you want RPs to fetch all and compare?  Do you prescribe what to do: e.g., fetch first one and then only if it fails?
Carlos: Same problem happens w/ DNS round robin.
Randy:  Yes, but you're the one proposing a modification to RFC 6490.
Arturo: We don't have the data.
Ruediger: It's important to do analysis rather than trials.
Rob: The only way I could see this working with rcynic already is if the OpenSSL code just ignores non-base64 data when it doesn't parse.
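
The two RP strategies Randy raises above (fetch the first publication point and move on only if it fails, or fetch all and compare) can be sketched as follows.  This is an illustration only, not something the draft prescribes; the function names, the example URIs, and the caller-supplied `fetch` callable are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def fetch_first_available(uris: Iterable[str],
                          fetch: Callable[[str], bytes]) -> Optional[bytes]:
    """Try each publication point in order; return the first successful result.

    `fetch` is a hypothetical caller-supplied function that raises OSError
    when a publication point cannot be reached.
    """
    for uri in uris:
        try:
            return fetch(uri)
        except OSError:
            continue  # fall through to the next publication point
    return None


def fetch_all_and_compare(uris: Iterable[str],
                          fetch: Callable[[str], bytes]) -> Tuple[Dict[str, bytes], bool]:
    """Fetch from every publication point and report whether all
    reachable copies are byte-for-byte identical."""
    copies = {}
    for uri in uris:
        try:
            copies[uri] = fetch(uri)
        except OSError:
            pass  # unreachable points are simply left out of the comparison
    consistent = len(set(copies.values())) <= 1
    return copies, consistent


# In-memory stand-in for the network, for illustration only.
fake_repo = {"rsync://b.example/ta.cer": b"cert-bytes"}


def fake_fetch(uri: str) -> bytes:
    if uri not in fake_repo:
        raise OSError("unreachable: " + uri)
    return fake_repo[uri]


uris = ["rsync://a.example/ta.cer", "rsync://b.example/ta.cer"]
print(fetch_first_available(uris, fake_fetch))
print(fetch_all_and_compare(uris, fake_fetch)[1])
```

Neither function says what an RP should do when the copies disagree, which is the open question in the discussion.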

================================================================================

FRIDAY, August 2, 2013
1120-1220 CEST         Friday Afternoon Session I Charlottenburg 2/3

3)  Administrivia                                                       1120-1125
    Presenter: chairs

2c)  Manifest EE Certificate Validity Times                            1125-1140
    RFC6486 & EE certificates
    http://tools.ietf.org/html/rfc6486
    Presenter: Tim Bruijnzeels

Proposal to change RFC 6486: EE cert dates must encompass manifest thisUpdate/nextUpdate.
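
A minimal sketch of the proposed check, assuming the four timestamps have already been parsed from the manifest and its EE certificate.  The function and parameter names and the sample dates are illustrative and are not taken from RFC 6486.

```python
from datetime import datetime, timezone


def ee_cert_covers_manifest(not_before: datetime, not_after: datetime,
                            this_update: datetime, next_update: datetime) -> bool:
    """Return True if the EE certificate validity period encompasses
    the manifest's thisUpdate..nextUpdate window."""
    return not_before <= this_update and next_update <= not_after


# Hypothetical values, for illustration only.
this_update = datetime(2013, 8, 2, 0, 0, tzinfo=timezone.utc)
next_update = datetime(2013, 8, 3, 0, 0, tzinfo=timezone.utc)
not_before = datetime(2013, 8, 1, 23, 55, tzinfo=timezone.utc)
not_after = datetime(2013, 8, 3, 0, 5, tzinfo=timezone.utc)

print(ee_cert_covers_manifest(not_before, not_after, this_update, next_update))
```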

Steve Kent (BBN):  I agree with all the suggestions.  This is not an erratum, this is an update/obsolete.  By the way, I'd like to write a document containing guidelines for CAs and RPs, with related suggestions.  Would you like to collaborate?
Tim:  I will think about it.  It would be good for such a draft to be informational (non-normative).
Rob: I think this is a good idea.

4)  Deployment
a)  Some available RPKI tools                                          1140-1155
    Presenter: Carlos M. Martinez (LACNIC) and Benno Overeinder (NLnet Labs)

The workshop on Saturday before IETF identified some gaps in the RPKI toolset.  Here are some.
Tool 1: "Looking Glass"
Tool 2: "RPKI dashboard" - URL for demo will be forwarded to sidr by Benno
Tool 3: "The ROA Wizard" - suggest ROAs based on current routing table
http://rpki.surfnet.nl/

Andrei Robachevsky (ISOC): Would it make sense for people to be able to upload their routing info and create the whole ROA structure from that?
Carlos: We've been thinking about that, but it's complicated.
Taiji (JPNIC): We have a testbed server for ISPs in Japan to try issuing ROAs, and there are a huge number of ROAs.  Please be careful about that.
Carlos: Optimizing ROAs (packing them as much as possible) is an interesting problem.

b)  Per-RIR Statistics and A Very Real Problem                         1155-1200
    Presenter: Randy Bush

Randy: Motivation is: "Could you please tell me the ROA set necessary to rescue the customers I'm about to step on?" (gave examples of clobbering in the slides)
Report from the audience: Orange *had* these problems, but they are now fixed.
Randy: Terminology side note: Let's be careful about saying "invalid ROAs".  *BGP announcements* become "valid/invalid" based on the ROAs.  ROAs are validated through the cert hierarchy.
Randy: It would be useful to have a tool that proposes a ROA set that says "here is a ROA set that would save your customers".  But purists like Sandy would say, "What's the point of all of this? We're importing all the junk from BGP."
Sandy: I remember a previous conversation when our roles were reversed!
Randy: Next slide: ARIN is sabotaging the RPKI.  On the other hand, congrats to LACNIC and RIPE for efforts on RPKI, as shown in the numbers.  In a few years, RPKI has 3x more than IPv6 :-)
Carlos: We as a group need to be better about separating RPKI validated database vs origin validation in routers.
Sandy: Is there any recommendation based on this data of what the WG must work on?
Arturo: Reliability.  (e.g., people ask "5 servers hosting all the data??")
Carlos: We as a group need to separate the concept of RPKI as a database and origin validation as a function of the routers.
Steve: Question for Randy: When you said that the 2 ISPs should be issuing ROAs on behalf of some ISPs: are those ISPs homed to anyone else upstream?
Randy: Doesn't matter.
Steve: These kinds of experiments and data are useful.  We need more of a user's manual: here are the things you should be doing, here are tools that are available, watch out for stomping on people, etc.
Randy: Would you like me to quote the origin ops doc?
Benno: +1 to Steve's comment, we need more guidelines.
Mikael Meulle: This is good, but we need more timing information on all of this data for debugging.
Chris: To clarify, you want the same data, but with timestamp indicating last refresh time.
Matthias: Based on our statistics, ISPs are doing much better than they were one year ago.
Taiji: It would be great to have measurements on how much errors/hijacking in BGP are reduced by the use of RPKI.
Randy: How many of you doing these measurements are looking at it across time (about 4 hands)?  We've done this for years.
Sriram Kotikalapudi (NIST): How does your tool recommend max-length on ROAs?
Carlos: The tool tries to group prefixes together as much as possible.
Sriram: Please also show the expiration time in the ROA.
Carlos: That is not the role of the tool, but you can click twice to get to it.
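
Randy's terminology note above (BGP announcements, not ROAs, are what become valid or invalid) can be illustrated with a simplified origin-validation sketch in the spirit of RFC 6811.  The function name, the tuple layout for validated ROA payloads, and the documentation prefixes and AS numbers are assumptions made for this example; special cases such as AS 0 are not handled.

```python
import ipaddress


def origin_validation_state(prefix, origin_asn, vrps):
    """Classify a BGP announcement against validated ROA payloads.

    vrps: iterable of (roa_prefix, max_length, asn) tuples (hypothetical layout).
    Returns "valid", "invalid", or "not-found".
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in vrps:
        roa_net = ipaddress.ip_network(roa_prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: the route is invalid.
    return "invalid" if covered else "not-found"


# One ROA issued by a provider for its own aggregate (illustrative values).
vrps = [("192.0.2.0/24", 24, 64496)]

print(origin_validation_state("192.0.2.0/24", 64496, vrps))     # provider's own route
print(origin_validation_state("192.0.2.128/25", 64511, vrps))   # customer's more-specific
print(origin_validation_state("198.51.100.0/24", 64511, vrps))  # no covering ROA
```

The second call shows the "stepping on customers" case: a covering ROA without matching ROAs for a customer's more-specific announcement makes that announcement invalid.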

5)  Revisited Topics

a)  RSYNC Performance Study                                            1200-1220
    Presenter: David Mandelberg

Comment in voiceover: The case of 100% change in an RPKI repository is not designed to be realistic; it's just a sanity check.

Kaveh Ranjbar (RIPE NCC): Thanks for this work.  Do you know what the high CPU was from?  Did you measure I/O or other?
Dave: We measured I/O, but didn't test adding more CPU cores (Amazon EC2 node limitation).
Tim (RIPE NCC): Good stuff, our work wasn't looking for real scaling numbers.  It's good to know that we can use rsync.  I'm still interested in looking for alternatives.
Dave: We can give you our code/scripts.
Rob: How many rsync connections per client synchronizing?
Dave: 1
Rob: Flat or hierarchical repo structure?
Dave: 10 files each in 43,000 directories.  <some discussion>  All were downloaded in the same rsync connection.
Steve: That's right.  This was a worst case.  Reasoning for this experiment was to ask, "How urgent is it to adopt another mechanism?"  This tells us that rsync won't be a limiting factor for several years.
Andy Newton (ARIN): Was there any attempt to see if RPs need to re-fetch something?
Dave: No.
Tim: Responding to Steve, I agree we should do our homework on alternatives to rsync.  Our current fear is not normal repository access; I'm more worried about intentional DoS.
Steve: Absolutely.  You can distribute over multiple servers, etc.; there are others who do this for a living.  But I don't disagree.  In response to Andy: our RP software does revisit upon errors, and runs N connections in parallel, e.g., to the 5 RIRs.
Andy: This is about server load.  If the RP is making an additional connection, it's more server load.  It would be nice to simulate this.
Steve: However, RPs will behave differently w.r.t. retries.  YMMV.

1230-1330 CEST         Friday Afternoon Session II Charlottenburg 2/3

0)  Administrivia                                                       1230-1230
    Presenter: chairs

5b)  RPKI Repository Documents                                         1230-1245
     Presenter: Tim Bruijnzeels

#1: We'd like to dissociate RPKI validation from fetching mechanism.  This relies on a strict interpretation of manifests (using it as a directory listing).
#2: Delta protocol.  Comments welcome.
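
A rough sketch of what treating the manifest as a directory listing could look like on the RP side, independent of how the objects were fetched.  It assumes the manifest has already been parsed and validated and that file hashes are SHA-256; the function name, data layout, and file names are hypothetical and do not describe any proposed protocol.

```python
import hashlib


def check_against_manifest(manifest_entries, fetched_objects):
    """Compare fetched objects with a manifest used as a directory listing.

    manifest_entries: dict of file name -> expected SHA-256 hex digest.
    fetched_objects: dict of file name -> object bytes, from any transport.
    Returns (missing, mismatched, unlisted) as sorted lists of file names.
    """
    missing = sorted(n for n in manifest_entries if n not in fetched_objects)
    mismatched = sorted(
        n for n, digest in manifest_entries.items()
        if n in fetched_objects
        and hashlib.sha256(fetched_objects[n]).hexdigest() != digest
    )
    unlisted = sorted(n for n in fetched_objects if n not in manifest_entries)
    return missing, mismatched, unlisted


# Illustrative data only.
manifest = {
    "a.roa": hashlib.sha256(b"roa-a").hexdigest(),
    "b.roa": hashlib.sha256(b"roa-b").hexdigest(),
    "c.cer": hashlib.sha256(b"cert-c").hexdigest(),
}
fetched = {"a.roa": b"roa-a", "b.roa": b"tampered", "stray.roa": b"x"}

print(check_against_manifest(manifest, fetched))
```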

Steve: Manifest-driven fetching is reasonable.  Another way, like in traditional PKIs, is to use the nextUpdate field.  Are you trying to focus exclusively on manifest?
Tim: I'm looking at a different notification mechanism.  I'm sure nextUpdate is very useful.  Currently, we update (server-side) every 24 hours.  There may be good reasons for faster propagation times (fixing a bad ROA), but that also needs to be balanced against load.  So I'm not sure that nextUpdate is the best way to do it.
Steve: In the experience with other PKIs, I've seen the exact same thing arise in the case of CRLs.  People thought they needed to quickly push out revocations.  But that need is not as high as one might think, and mistakes have been made in other places where the load/burden is increased on everyone, even when the administrative backend can't act that quickly.
Rob: We're experimenting with manifest nextUpdate.  Has some entertaining properties.  Mostly, one really pays for it if one gets the nextUpdate wrong; have to whack manually to recover in reasonable time.
Ruediger: We need to figure out who is responsible for setting what update intervals.
Tim: Yes, agreed.  We'd like to do the pilot work first.

Break for Benno's demo: http://rpki.surfnet.nl/
Comment missed <look up later>

6)  New Topics

a)  RPKI Validation Reconsidered                                       1245-1305
    draft-huston-rpki-validation-00.txt
    http://tools.ietf.org/html/draft-huston-rpki-validation 
    Presenter: Geoff Huston 

Steve Kent (BBN): (1) You can revoke certificates at any point. You didn't say this was a transfer of live resources. (2) Bigger issue: RFC 3779 is still relevant.
Geoff: It would probably be wiser if you reissued before you revoked, in which case you run into these problems.
Steve: We disagree on that, but 
Ruediger: We haven't really discussed how resource transfers work.  Currently only a few can do it.  We should document how transfers are handled, and see if it's really painful.
Geoff: I'm not saying "we should do this"...
Andy Newton (ARIN): I think we ought to pursue this.  Current validation makes transfers very tricky.  This would simplify operations significantly.
Rob Austein: I.e., the OpenSSL RFC 3779 code has shipped and is not going to be updated quickly no matter what the WG says.
Geoff: I appreciate that things might not happen quickly, but when would we think about it?
Steve Kent: I agree with the comment on OpenSSL, but this would also mean RFC 3779 changes.
Geoff: See my previous response.
Randy: Is transfer the only case? Perhaps we should just focus on the transfer issue.
Geoff: It's not just transfer, it's more general (some LTAM cases).

b)  BGP operations and security                                        1305-1315
    draft-ietf-opsec-bgp-security-01.txt
    http://tools.ietf.org/html/draft-ietf-opsec-bgp-security 
    Presenter: Gunter Van de Velde

Gunter (WG chair) presenting for three authors who could not be at IETF.  Requesting feedback on this document on the OPSEC mailing list.

Sandy: What do you mean by communities scrubbing?
Gunter: <not sure of details>
Chris: Probably "strip communities before you ship them outside".
Mikael Meulle: Monitoring is important; not a question of "if" but "how".

c)  BGPSEC Error Handling                                              1315-1330
    Presenter: Sandra Murphy

Sandy presenting the sidr WG message in slide form.  Comments requested here.

Randy: Are there errors here that do not fall under the current IDR-proposed error-handling mechanisms?
Sandy: 4 possible responses: (1) keep RFC 4271, (2) (3) (4) drop update and treat as withdraw
Jeff Haas: <pontification>
Slide 7.  Dropping the session due to malformed input is an attack on BGP.  Let's not perpetrate that in BGPSEC.
Signature verification: one possibility here is that the router doesn't currently have enough information to do this verification.  You should probably treat this as an unusable route and not pass it on.
</pontification>
Sandy: 
Matt Lepinski (BBN): I thought we'd been through this before.  From a BGP *protocol* point of view, this is a correct BGP update, but "Invalid" from a BGPSEC perspective.  So it becomes local policy.
Jeff: Yes
Matt: Want someone to help me with error-handling on IDR side.  (John Scudder volunteers)
???
Randy: Time sequencing of normative reference
John: I intend to move it forward rapidly


Sandy: General comment.  The mailing list is often silent, but then lots of activity happens when the topic is brought up at IETF or at last call.  Request review for certain drafts:
- anything that's 00
- AS migration http://tools.ietf.org/html/draft-ietf-sidr-as-migration-00 recently adopted and the author specifically requested more review on list
- multiple pub points http://tools.ietf.org/html/draft-ietf-sidr-multiple-publication-points-00 no comments since adoption, authors have asked mailing list to comment
- key rollover http://tools.ietf.org/html/draft-ietf-sidr-bgpsec-rollover-02 latest version is just a refresh, has received little/no comment on the list since -01
- keying of routers http://tools.ietf.org/html/draft-ietf-sidr-rtr-keying-01 latest version has received little/no comment, more comment needed

Taiji: RPKI operations has two operations - RP and CA.  Who is looking at key management for the CAs?
Steve: What specific aspect are you asking about?
Taiji: How do we secure the key and dispose of it?
Steve: Read the CP, but if you have questions, email me.