Transport Area Open Meeting
IETF-80, Prague, Czech Republic
===============================
proceedings: http://www.ietf.org/proceedings/80/tsvarea.html
slides:
audio: http://ietf80streaming.dnsalias.net/ietf80/ ietf80-ch4-wed-am-mp3
Chairs: David Harrington, Lars Eggert [incoming: Wes Eddy]
(thanks to Matt Zekauskas for note-taking!)

- Agenda Bashing
- Note Well
- State of the Area

AD Hand-Over
  Administrivia
  Transport document progress
  WG activity:
    CDNI - could be in either the transport or applications area; see the BoF on Thursday
    TSV picked up ALTO and DECADE (related to storage); trying to give storage and peer-to-peer a home in TSV
    Concluded WG: NSIS
    Winding-down WGs: DCCP, PCN, LEDBAT, FECFRAME
  Wes should be picking up the groups Lars was responsible for; this may shift as WGs close down and adjustments are made.

-----------------------------
Welcome to TSV, ALTO & DECADE
presented by Vijay Gurbani - co-chair of ALTO

ALTO grew out of a P2P workshop at MIT in May 2008: give apps (e.g., BitTorrent) information about the network so they can selectively pick peers.
Two WGs were originally created: ALTO (apps) and LEDBAT (tsv).
ALTO looks at "static" costs, not very dynamic information.
Produced one RFC, the problem statement; the protocol itself is now at -06.
Monday had an ALTO demo; looking for interop in Quebec City.
Sizeable implementation experience already; starting to adopt guidelines.
New charter deliverable for deployment considerations.
Looking for a new milestone: ALTO for CDN.
Rich Woundy: comment about the last part (CDNI). Rich is DECADE co-chair and CDNI BoF chair; his perspective is that the CDNI BoF is the better place for the CDN use case - keep expertise where it belongs.
Vijay: agree; this will be informational anyway.

DECADE co-chair, Haibin Song:
DECADE was formed last year, after the March meeting. It addresses last-mile bandwidth problems by adding in-network storage for P2P.
The problem statement and survey have finished WGLC; the architecture and requirements documents are under way.
Starting a rechartering discussion.
No questions.

-------------------------------------------
Bufferbloat: "Dark" Buffers in the Internet
Jim Gettys
(see slides)

This is like putting together a puzzle.
-- There is a design assumption in TCP that congestion notification happens in a timely fashion. What happens if it is not timely anymore? (Packet loss is avoided by buffering.)
-- "The Internet is slow"
-- Slide of a network map: server to house is 5-6 hops.
-- Jim had lunch with Comcast trying to understand the PowerBoost technology; it pointed to big buffers. Pointer to ICSI's Netalyzr for measurement and analysis.
-- The dslreports smokeping was a nice smoking gun: latencies up around 1 second. Why? This happened while doing a big copy of gigabytes of data; it was killing the connection for reading mail and browsing the web.
-- tcptrace plot: took traces, knew what to expect.
-- The measurement didn't look like what was expected.
Aaron Falk: were the axes the same on both plots?
Answer: not in terms of quantity; they're labeled. Bursty behavior on 10-second intervals, not nice behavior.
-- The plots look like no TCP behavior I've ever seen: bursts of dupacks, retransmissions, SACKs, on long timescales. Knew the cable gear worked well; had various people reproduce it.
-- A week or two later, an experiment with the in-laws' FiOS service: the path is slightly longer, 20 vs 10 ms, but roughly the same sort of thing.
-- Similar tcptrace signature: peaks on large time scales, a lot of data outstanding, 200 ms latency.
-- Next is the wireless side of the home router: it's worse - latency on the order of 0.5 sec, high loss.
Question: are wireless and wired on the same router?
Answer: yes.
-- Called in experts for help; they agreed there is a lot of buffering in the network.
-- Triggers: saturation of the path. It happens in circumstances we don't routinely think about; TCP tries to fill the bottleneck buffers (examples; "even web browsing").
-- What are the effects? A whole list of problems: DNS failures, VoIP is destroyed, gamers will get fragged.
-- The Netalyzr tool shows the key to what's going on in broadband. Data published by the Netalyzr folks; colors indicate which technology (FiOS, DSL, fiber). The task was to infer buffer capacity - look at the paper. The interesting things to note are the diagonal lines: potential latency. Green, which I don't consider good, is 0.5 sec; the arrow is increasing latency. This is also a lower bound on the severity of the problem: there was a bug in Netalyzr that caused it to sometimes not fill the buffers (it is *at least* this bad; we don't know how much worse). It also mixes wired and wireless data, so home routers contribute.
-- Paranoia sets in: it doesn't stop there (broadband access). This is a systematic mistake we're all making together.
-- The plot thickens - home router experiments, just looking at the home. Tests caught the home router doing bad things: 8 seconds of latency, much worse than broadband, provoked only by simple copies of data.
-- Home router bufferbloat led to host bufferbloat. Played with commercial home routers; easy to reproduce. Installed OpenWrt; figured out the Linux transmit queue could be the issue. Is txqueuelen 0 better? Nope - when copying upstream, the bottleneck is the 802.11 link in the house, therefore the buffers are in the laptop. With 50/10 broadband service you see this routinely.
-- Host bufferbloat and the home router: home routers often use a general-purpose OS under the covers (Linux), and buffers hide in many places in a modern OS. Back to first principles: 256 packets is about 3 Mbits - 1/3 of a second on wireless. But in a busy conference like here, your fair share is much less (so those buffers represent many seconds), and that defeats congestion avoidance.
-- Why is this not seen on Ethernet? You can see it trivially: a gigabit laptop into a 100 Mbps switch, and buffers build on the local machine; ring buffers in device drivers are ~300 packets, so 10 ms. Windows is interesting - good latency... by default Windows bandwidth-shapes to just below what would saturate a 100 Mbps switch; there is a Microsoft tech note saying why.
-- Where does bufferbloat hurt? End devices, potentially including phones; in home routers, and in data centers too.
-- Aggregate network behavior, back to the future: what happens with buffers all over, as in the 1990s before AQM? The classic congestion problem yet again. In wireless; saw it in satellite links [MM's apocalypse :)]; Dave Reed's 3G observation a while ago - 30 sec. So we are running badly congested with large buffers, in aggregate form in 3G as well as 802.11. The symptom is large latencies with no drops, so senders are not told to slow down.
-- Conversations with Van Jacobson: I had thought RED was out there and should prevent this. Starting last August... see "RED in a Different Light", a 1999 draft paper, which should be available soon (I hope). A large part of our gear is impulsing the net at particular frequencies; if there are any synchronization phenomena, we could get various sorts of resonance - "time-based congestion phenomena".
-- Aggregate bufferbloat locations: 3G nets; head-end gear (a paper characterizing residential broadband found 30% of broadband head-ends have no queue management); problems in various places; hotel networks are interesting...
-- Fat subnets: go back to first principles on wireless technologies. On a busy 802.11 net, far away from the access point, even a single buffered packet can be a lot of latency. What if there are 20, 200, 1000? Then add further sins: retransmitted packets.
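To make the packet-buffer arithmetic above concrete, here is a rough back-of-the-envelope sketch; the 1500-byte packet size, the ~9 Mbit/s wireless rate, and the 1 Mbit/s "fair share at a busy conference" are illustrative assumptions, not measurements from the talk.

```python
# Rough buffer-drain arithmetic for the numbers quoted above.
# Packet size and link rates are illustrative assumptions.

PACKET_BITS = 1500 * 8  # assume full-size 1500-byte packets

def buffer_delay_seconds(packets_buffered, drain_rate_bps):
    """Time to drain a FIFO of full-size packets at the rate actually
    available to this station."""
    return packets_buffered * PACKET_BITS / drain_rate_bps

# 256 packets ~= 3 Mbits: roughly 1/3 s if the 802.11 link delivers ~9 Mbit/s
print(buffer_delay_seconds(256, 9e6))   # ~0.34 s

# At a busy conference your fair share of airtime might be ~1 Mbit/s,
# so the same 256-packet buffer now represents several seconds of delay.
print(buffer_delay_seconds(256, 1e6))   # ~3.1 s

# "What if 20, 200, 1000 packets?" at that 1 Mbit/s fair share:
for n in (20, 200, 1000):
    print(n, "packets ->", round(buffer_delay_seconds(n, 1e6), 2), "s")
```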
-- What do we get? An ugly sight in conferences, schools, hotels - especially hotels - and on broadband systems with bad latencies, saturated in the evening. We set up a testbed to replicate failures at OLPC, but missed bufferbloat; it's subtle.
-- Buffers are only detectable when they are next to a saturated bottleneck; they are hidden until saturation. So you need to think about all the places they can be - a long list... Hardware has "insanely large ring buffers", although there are good reasons for putting them in; even switches come into play. The goal is to make you paranoid: it can bite you anywhere, and you don't see it until a link saturates.
-- Other locations of bufferbloat: looked at Lucent's internal corporate network and saw spikes I didn't like. They do classification, but not AQM. As Windows XP retires, hosts can window-scale and saturate internal links [actually a good thing!], so this will become a more common problem. I have personally seen RTTs on the order of 20 sec going to Peru. With an IPsec tunnel into the company, you see latency where the firewall complex is - another place, buffering for encryption...
-- Where do buffers hide? Almost everywhere. We don't have the best tools for identifying them, but we know how such tools can be built; PingPlotter on Windows is not so bad.
-- Web browsers/servers: the TCP initial congestion window changes. The intent in HTTP 1.1 was at most a couple of connections along a given path; the original reasons (self-congestion on dialup modem banks that had too little buffering) are long gone - now we have the opposite problem. There is a proposal to raise the initial window. With web pages containing a bunch of embedded images, it is really easy to get flights of hundreds of packets arriving at a broadband connection with a single queue; they go splat at the bottleneck. You see transient latency [who cares, unless something interactive is in the middle] - more than 150 ms of latency just browsing to a page. We want to use the Internet connection for more than just browsing: telephony, games. I believe the right solution is to replace HTTP with something better and stop papering over sins; we are encouraging network-unfriendly behavior.
-- Network neutrality implications of bufferbloat: I believe BitTorrent was saturating links, and that was the real issue... You can't deploy low-latency apps with any reliability, so we had better fix this to have any innovation.
-- What should the IETF do? We have a big problem on our hands that is pervasive on the Internet.

Q&A:
Eliot Lear: back up to net neutrality. A clarification: the issue is that the carriers' telephony gets a major bump - is that because of fewer buffers, or...?
Jim: no, they're provisioning telephony on a separate channel. The problem is that I can't classify my own traffic: it's all one pipe, without any way to sort packets.
Edward Craft / Google: to amplify the point, an interface typically has 100-200 ms of buffer. People are following the classic bandwidth-delay sizing that Curtis Villamizar was a proponent of and the IETF has propagated, ignoring Nick McKeown's work and assuming worst-case latency. I think Nick's work is overly pessimistic with regard to the buffers supportable; modern TCPs like CUBIC require fewer buffers to fill a link.
Jim: buffers are grossly oversized, picked out of thin air. There is never a right answer for what the size should be: you don't really know the bandwidth you are going to get, or the RTT, and you don't know the number of flows or the RTT distribution.
Edward: as a humorous counterpoint, vendors are offering up to a second of buffering as a feature. When you go home, you can make most of the problem go away within the home, to lessen the suffering.
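For reference, the bandwidth-delay-product sizing rule being discussed here is roughly buffer = bandwidth x RTT. A minimal sketch, with illustrative link rates and an assumed 100 ms "worst-case" RTT:

```python
# Classic rule-of-thumb buffer sizing: buffer = bandwidth * RTT.
# Link rates and the 100 ms "worst-case" RTT are illustrative assumptions.

def bdp_bytes(link_rate_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to keep the link
    full for a single long-lived TCP flow."""
    return link_rate_bps * rtt_seconds / 8

for rate_mbps in (1, 10, 100):
    b = bdp_bytes(rate_mbps * 1e6, 0.100)
    print(f"{rate_mbps:>3} Mbit/s -> {b / 1024:.0f} KiB of buffer")
# When the link is saturated, a buffer sized this way adds ~100 ms of queueing
# delay by itself; assuming a larger "worst-case" RTT adds proportionally more.
```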
Stuart Cheshire / Apple: great presentation. I wrote something along these lines myself about 12 years ago.
Jim: I quoted it.
Stuart: one of the calls to action is to fix/replace HTTP 1.1 - I guess the next presentation will address that - but AQM was also not on the slides.
Jim: yes, AQM everywhere is the ultimate solution. But chatting with Van, RED is flawed; Van doesn't believe it can work in environments like a home network. We are looking at algorithms - please come help. Still trying to extricate the "RED Light" algorithm from Van. There is research from Ireland - Tianji Li, Dave Malone... - running on a laptop. Even queue management on hosts is the right thing; you can't fix 802.11 without it.
Stuart: I'm making the same argument at Apple; we make home gateway base stations, but with small market share. Come to Apple in a month and give the talk :)
Mark Nottingham: tomorrow at the httpbis session there is a presentation on pipelining, which is pretty promising; Mike will also come at this from the HTTP perspective with SPDY.
Jim: pipelining was the intended solution all along, but it is hard to implement; I'm pessimistic about getting it to work.
Mark: I'm not nearly as pessimistic - we have a draft to address it - but I'm not sure what it will buy us, whether we use a bunch of connections or a small number. Web developers want lots of bits in flight and are latency-sensitive for the first page hit. We need to somehow penalize misbehaving people; everyone wants to micro-optimize, and there is no push-back on behalf of the greater set of apps.
Mark: congestion control has been a gentlemen's agreement, and some don't live up to it.
Joel Jaeggli: two observations. I need a 2 GB packet buffer on my 16x10GE link because I have to drain it - but it is a loaded gun that can be pointed at your foot. Second observation, from the data center: apps that will happily move data with 2 sec of latency destroy everything else trying to use the link. The fact that I have a 2 sec RTT across the US through deep buffers doesn't bother the app developer, because they're moving 6.5 Gb/s.
Jim: there is no single right answer for what the buffering should be.
Joel: also, really big flows are hard to shape: IPsec tunnels between data centers, 10G interfaces, 2-6 Gbps IPsec flows. I don't want to touch the traffic inside the flow outside the firewall, so I have to take it; if I toss packets or shape it, performance is bad.
Jim: buffer pain is only adjacent to the bottleneck link.
Joel: which is also where I can do something about it. A 2 Gbps IPsec flow - I need to eat it.
Jim: I'm not sure I have all the solutions; the message is just "we have a problem".
Matthew Kaufman: second/third to say I'm happy this was brought up - only 16 years overdue; happy Stuart did 12 years ago. I used to build ISPs; we had 2-3 seconds of buffer in ISDN in 1994, a big problem then. We talked to vendors; they complain that small buffers make performance tests fail.
Jim: we need to shift the market so it's not just a bandwidth game; we need a good metric. For geeks, smaller latency is better, but that's not the right thing for marketing (hard to explain).
Jim: the real solution is queue management everywhere, but we can mitigate the problem quickly - don't delay waiting for the perfect long-term solution. We have devices designed for hundreds of Mbps operating at 10-20 Mbps, so the sizing is off by a factor of 10-20. We can make things a lot better before the problem is "solved" properly.
Colin Perkins: active queue management and ECN are good things; I have a draft on how to bring ECN into RTP. But I somewhat question the complexity: just reducing the amount of buffering in the network will drastically reduce the problem, without queue-management schemes and lots of complexity.
Jim: I think AQM is an incomplete solution, but do what we can without waiting for nirvana. We are in a bad and increasingly worse situation: apps saturate links, OSes saturate links; things are getting worse badly, although I have no data.
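Since RED keeps coming up in this discussion, here is a minimal sketch of the classic, textbook RED idea (an EWMA of queue length, probabilistic early drop/mark between two thresholds). The parameter values are illustrative, and this is not the revised algorithm Van Jacobson is described as working on.

```python
import random

# Minimal textbook-style RED sketch: keep an EWMA of the queue length and
# drop/mark arriving packets with increasing probability between two
# thresholds. Parameter values are illustrative, not recommendations.
MIN_TH, MAX_TH = 5, 15      # queue-length thresholds (packets)
MAX_P = 0.1                 # drop probability at MAX_TH
WEIGHT = 0.002              # EWMA weight for the average queue length

avg_q = 0.0

def on_packet_arrival(current_queue_len):
    """Return True if the arriving packet should be dropped (or ECN-marked)."""
    global avg_q
    avg_q = (1 - WEIGHT) * avg_q + WEIGHT * current_queue_len
    if avg_q < MIN_TH:
        return False                     # queue is short: accept
    if avg_q >= MAX_TH:
        return True                      # persistent queue: drop/mark
    p = MAX_P * (avg_q - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p           # early, probabilistic drop/mark
```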
Joe Touch: will talk off-list about the other half-dozen sources of latency we haven't discussed. Re: the size of buffers - would like to see the 1997 ICMP paper on the hazard of making buffers too small: with lots of people doing telnet across a transatlantic link, no one could get even one segment through reliably. You need to make sure there is one unit of transmission for everyone - so not just fewer packets, but shorter packets; a lesson ATM almost got right, but it segmented at a lower layer. Re: pipelining - bad idea. You recreate the problem one layer up and reimplement all of IP in HTTP. I pointed that out N years ago when you and I were working on this; it is still just as true now.
Murari, Microsoft: great presentation. Ever since we shipped window scaling I've wanted people to get sensitive to this problem. Unless we make AQM brain-dead simple, it is not going to get deployed.
Jim: yes, we must make AQM "just work". It will take time to fix the middle of the net, and we will never be able to avoid congestion everywhere :)
Murari: the problem is not AQM, but getting to line rate while operating with few buffers; the delays so far have destroyed congestion avoidance.
Murari: we need some sort of delay-based congestion algorithms... [you can do much from the endpoints]. Also, you said Microsoft turned off the initial congestion window; I investigated the problem - it was a load balancer that terminated connections.
Mark Handley: glad that you've correctly identified the cause of the problem. The symptoms depend on the TCP implementations being run; it's clear from the diagrams that you run CUBIC - great work at filling high delay-bandwidth-product pipes, and guess what, we're filling them. One thing we have to do in the IETF is revisit high-speed TCPs and high-bandwidth pipes and see what makes sense. Microsoft's CTCP is delay-sensitive, so it will fill the pipe only until it sees delay rise.
Jim: the fundamental problem is timely notification: if we had timely notification, it wouldn't matter. But the algorithms were not designed to operate with buffers like these.
Jana Iyengar: buffers at the end host: with multiplexing we are adding several layers, so buffering is increasing both up and down the stack.
Jim: yes.
Jana: the second thing I want to comment on: you talked about a situation where several students are trying to get on the net at the same time. I don't know what degree of sharing there is, or what the access link bandwidth is, but I think the buffers are helping there.
Jim: no - well, let's take it offline.
Jana: the reason is that TCP can't go to less than 1 packet per RTT.

---------------------------------------------
SPDY, TCP, and the Single Connection Throttle
Mike Belshe / Google
40 min

Chrome 6 through Chrome 11 have had SPDY enabled; it has been pretty much 100% on for several months, and all Google servers are deployed with it. It is used for SSL traffic only, because of internal policy. Google has data, and is now trying to figure out what's working and what's not.
Speed here means latency to the end user. Before Chrome we tinkered with incremental changes; there were lots of choices - TCP, server-side tinkering. At one point we thought: let's change it all and see what we can do in the best case, tackling TCP or application-area problems. It was tempting to go after the transport, but from a practical perspective we elected not to: there are pretty easily identifiable problems at the HTTP layer, so work on those first. SPDY is an application-level change to decrease latency.
-- State of the web page (see slides)
Trying to work around the problem that HTTP is single-threaded: we want more stuff as quickly as possible, but a browser will only use 6 connections at a time per host, so sites use images1.google.com, images2.google.com, ...
-- Quick SPDY background
All open source; contributions welcome.
It is not PIPELINING, which has head-of-line blocking issues, ...
Sometimes you know you will need something but don't tell the server, because you don't want a low-priority request to interfere with a high-priority request [e.g., a search result vs. the images that go along with it - search: lots of compute, images: not much, so don't ask for the images too early].
Stuff has gotten bigger over time; there is a lot of redundant stuff in cookies. We want to work on server push as a secondary feature; SPDY provides the building blocks. We're trying to build the basic "blocking and tackling" for networking.
-- HTTP connection use today
Not trying to pick on IE; all browsers are like this. Most use 6 connections per domain; around 29 connections per page on average, and even small pages use about 10 connections per page.
Stuart Cheshire: 6 connections per target host - that's why it is 20-30 connections, because there are different host names.
Mike: yes, domain == host name here.
-- Reducing upload bytes
By adding a little bit of compression. (We thought it had to be sophisticated... but just gzipping the text headers works really well for cookies; cookies are 1-4 KB today.) 51% reduction.
-- Download bytes
We don't do much here; SPDY does not force compression. Compression on the net is far lower than it could be today: about 2/3 of the content that could be compressed is - except it's only 50% on SSL; the SSL folks haven't paid as much attention.
-- Reducing total packets
20% fewer packets over the network using SPDY.
-- Increasing parallelism
Time to first byte for the response is reduced. With a multiplexed protocol, do responses come back faster? On average, much faster.
-- The single connection throttle
Using a single connection should be a good thing from a transport perspective: sending/receiving less should be rewarded with a larger performance boost. SPDY is faster, but... it runs into the single connection throttle.
-- Throttle #1: CWND
The initial cwnd. RTTs to clients are large - 100 ms on average, worse on mobile - so browsers open lots of connections to get requests off the client; a side effect is the boost from the cumulative cwnd. Google is using an initial cwnd of 10, and there is a proposal in TCPM to make this real. With SPDY using one connection, 29 connections drop to 7; if you get rid of subdomain sharding it goes lower (with SPDY we haven't done any application-level reduction of subdomain sharding).
-- cwnd vs. number of connections
5 Mbps download / 1 Mbps upload. If you start with a low cwnd, there is a big difference between 6 connections and 1. Reduced latency increases how long people stay at a site, which relates to how much ends up in shopping carts; not sure how we can put the genie back in the bottle here - the cat is already out of the bag. Going to an initial cwnd of 10 is a bit better; 16 is a bit better still for SPDY. This is the Facebook home page; it shows this well.
-- Throttle #2: receive windows
A 4-packet rwnd trumps a larger cwnd on the server. (Graph showing this on the next slide.)
-- Throttle #3: intermediaries
A bug, but it ends up being a problem for using fewer connections: small window sizes. Looking at matched client/server traces, someone in the middle is mucking with the window scale; the server is left thinking the window size is 700 bytes. HTTP had 6 of these connections, so it did better than SPDY (because of more connections). Not sure how to fix this.
-- Throttle #4: congestion control
Decreasing the send rate is exactly the right thing to do, but TCP will "unfairly penalize" a single-connection protocol for losses compared to multiple connections: with 1 packet loss, a single connection cuts its rate in half; if the loss hits one of 6 connections, the cut is only 1/12th of the aggregate. Fairness favors multiple flows - just use more connections :)
-- Am I too obsessed with one connection?
I want to make things more efficient. Multiple connections end up on different servers, so stateful compression is worse; you can't prioritize; matching server push to a request is hard. I would like a single stream, and I want the transport to help a protocol that is "doing the right thing".
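A quick arithmetic sketch of the Throttle #4 point above; it assumes idealized AIMD halving and equal load sharing, so it is an illustration rather than a model of real recovery dynamics.

```python
# Illustrative arithmetic for the "unfair penalty" on a single connection:
# one loss halves that connection's rate, so the hit to the aggregate depends
# on how many connections share the load. Idealized AIMD halving only.

def aggregate_after_one_loss(n_connections, total_rate=1.0):
    per_conn = total_rate / n_connections
    # one connection halves its rate; the others are untouched
    return (n_connections - 1) * per_conn + per_conn / 2

for n in (1, 6):
    drop = 1.0 - aggregate_after_one_loss(n)
    print(f"{n} connection(s): aggregate rate drops by {drop:.0%}")
# 1 connection : drops by 50%
# 6 connections: drops by ~8% (1/12th of the aggregate, as noted above)
```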
-- How much does a handshake cost?
Off topic, but... what happens when you add latency to the handshake (including all the setup: DNS, TCP; or DNS, TCP, SSL ...)? On 100 ms RTT links with a 5 Mbps download, adding an extra 100 ms of delay into the connect sequence ends up adding 300-400 ms of delay to the total request (amplification because of spreading across multiple domains, etc.).
-- What's next?
Things I think TCP needs to address: data in the initial handshake, the single-connection taxes, and adding both server authentication and encryption to TCP.

Q&A:
Joe Touch: I have to bite. You're upset when someone changes your cwnd, but it's OK to start with an initial window that's not in the spec?
Mike: well, how do you learn how to change the spec?
Joe: sauce for the goose is sauce for the gander.
Mike: with SPDY we use one connection, not six, so we are cutting back down; I think the community considers this a good thing. The Linux kernel has already made the change.
Jim Gettys: the issue I have is what this does to other, delay-sensitive apps running at the same time.
Yoshifumi Nishida: one question - have you thought about SCTP? You're talking about multiplexing, and head-of-line blocking is solved in SCTP; why not use that?
Mike: SCTP might work, and a group at U. Delaware is looking at that. Looking at the single connection throttle, it might hit the same limits. SCTP deployability is also an issue.
Yoshifumi: if deployment is the issue... if Google says "we're going to use it", that might get fixed.
Stuart Cheshire: great presentation, good work. Go back to slide 3, the state of the web page. I'm interested in the properties of pipelining and would like to hear more about that.
Mike: Mark Nottingham mentioned he thinks he can make pipelining work; Patrick at Firefox is trying to do that. It was written up 10 years ago and never successfully deployed. The reasons are proxies that could not support it, and browsers not willing to take it on as a risk; other things come into play too. To know whether pipelining will work at all, you need to know the server is an HTTP/1.1 server - that's an RTT - and there is no guarantee that a server that was 1.1 in the past will be 1.1 in the future, so it's hard to cache. Long-hanging GETs are a problem.
Stuart: does pipelining allow out-of-order delivery?
Mike: no. Anyway, the Firefox folks are close and can get a long way; to make it work they have had to put in a lot of extra mechanism, "hacks", since you can't know immediately whether pipelining will work correctly.
Stuart: I hope they make pipelining work; I think there's a lot of promise there. In iTunes it's HTML being rendered inside, using the same WebKit that Safari and Chrome use. Just opening a single connection and doing 280 GETs made a dramatic improvement in page load time: pack lots of GETs into TCP packets and a firehose comes back at the other end. We also have the benefit of talking to a known server.
Mike: I don't think it's one or the other; research on pipelining should continue. Second comment: headers in the clear have been a problem, and there are places where we want to do better. Well-intended folks - maybe an anti-virus vendor - who are not quite ready to implement gzip compression will rewrite the Accept-Encoding header to some x-encoding they don't handle, and then the server doesn't know about the client's support for compression.
Mark Nottingham: to tweak that last answer a bit - I think people haven't deployed pipelining because they haven't tried; they were busy getting to 1.1. But people are becoming more performance-savvy, so we're seeing a resurgence of interest. I believe it's worth spending the next year seeing whether improving HTTP is good enough. In the common case you get the HTTP page, which blocks, and then a bunch of assets. I think SPDY is very interesting, but before we throw out the big investment in HTTP infrastructure and implementations, it's worth trying to improve our game.
Mike: I'm good with that. If we can feed SPDY back into HTTP, that's better, because it's already implemented.
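As background for the pipelining exchange above, a toy sketch of the head-of-line-blocking difference between in-order pipelined responses and a multiplexed connection; the request names and service times are made up, and transfer time on the wire is ignored.

```python
# Toy illustration of head-of-line blocking: with pipelining, responses must
# come back in request order, so a slow response delays everything behind it.
# Names and service times are invented; wire transfer time is ignored.

requests = [("search", 0.30), ("image1", 0.02), ("image2", 0.02)]  # (name, server seconds)

# HTTP pipelining: strictly in-order responses on one connection,
# so each response waits for all the responses requested before it.
t, pipelined = 0.0, {}
for name, cost in requests:
    t += cost
    pipelined[name] = round(t, 2)

# Multiplexed single connection (SPDY-style): each response can be sent as
# soon as it is ready, so small responses are not stuck behind slow ones.
multiplexed = {name: cost for name, cost in requests}

print("pipelined:  ", pipelined)    # image1 completes at ~0.32 s
print("multiplexed:", multiplexed)  # image1 completes at ~0.02 s
```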
Matthew Kaufman: WebSocket has support for fragmentation, and could but doesn't support multiplexing. Curious how integrated the two efforts are.
Mike: not sure. SPDY started up independently, with different motivations: they wanted a bidirectional protocol usable from JavaScript. docs.google.com is using SPDY for a WebSocket-like thing. Hanging GETs are a nuisance: a lawyer might have 30 documents open, and we're doing server push, where we want notifications to the browser. A hanging GET works - it's been used for 5 years and people understand it - but with 30 documents there are 30 of them, and you hit the connection-throttle problem (6 connections per domain). So SPDY works well for those folks; that's where it overlaps with WebSockets.
Matthew: I think there is more overlap than is initially evident; more work should be done to integrate them.
Mike: there is also flow control on SPDY streams.
Salvatore: a comment on that, in my HYBI chair role: WebSocket has no support for multiplexing, but we are getting lots of feedback from the Google SPDY folks.
Eliot Lear: at least two application protocols have experience with the pipelining problem outside HTTP. IMAP is one - Mark Crispin looked at it early on. BEEP had considerably less success, but consider an experiment running SPDY over BEEP. Also, how long do you think the better behavior will last?
Mike: intermediaries... SPDY requires SSL, so they can't see the middle and don't touch it. If you believe traffic must be secure, that's not a problem. There are some who want to do an unsecured version; I'm not one of them. You could do even better if you ditched the security requirement, but if it's 2020 and we are still being sniffed in a cafe, we as application providers have failed. Gmail is all SSL; Twitter and Facebook are offering this too. There are a lot of challenges on the security side, but I think it is critical to where we're going.
Eliot: just because you're using SSL, sometimes intermediaries get involved on behalf of the user.
Mike: true, and evil! There are lots of problems with the certificate infrastructure; it's hard to change.
Lars: what are the plans for an I-D or a BoF or...?
Mike: we could put out an I-D at any time; anyone can help. We're not happy with the performance angle over SSL and have spent a lot of time optimizing SSL and HTTP. False Start is something browsers should do - easy. Snap Start was an experiment we tried, but it was rejected; now we're working on a third form. We're trying to get low-latency security protocols; if you all yell more, it will get prioritized.
Mark: I'd encourage decoupling SSL from the protocol. Browsers can use it, but we find SPDY invaluable in the back end without encryption.
Mike: fair enough.

----------------------------------------
1:58
A Storage Menagerie: NAS, SAN & the IETF
David Black and Brian Pawlowski
40 min
slides: http://www.ietf.org/proceedings/80/slides/tsvarea-2.pdf

"And now for something completely different."
David Black is co-chair of the storm WG; Brian Pawlowski is co-chair of the nfsv4 WG.
This presentation is a request from the ADs to explain what is happening in Internet storage technologies, and the relevant IETF pieces.
-- Storage networking: NAS and SAN
NAS: Network Attached Storage - remote files. NFS and CIFS are widespread; NFS is an IETF standard.
SAN: Storage Area Network - virtual disks, remote blocks. SCSI and Fibre Channel; iSCSI and iFCP are IETF standards.
-- [Brian] Two worlds of storage
SCSI and NFS started at about the same time (1982-1983); the similarity of language can be an "experiment in miscommunication".
HBAs are the things that connect to a network in a SAN; NICs are the things that connect to a network in NAS.
SANs started out running over Fibre Channel (non-routed); iSCSI over TCP/IP added routing capability. NAS started out running over TCP/IP.
There is a "one wire convergence" occurring: everything is starting to go over Ethernet, so there is at least one common physical infrastructure.
-- NAS
From the IETF perspective, NAS is NFS.
-- Protocol vs. implementation
NFS supports vendor independence: an idealized file system with POSIX-like interfaces, ideally stateless. There are bake-a-thons three times a year to cover semantics and make sure implementations fail properly.
-- Stacks and standards
NFS / RPC / XDR / TCP/IP / Ethernet
-- Domains and features (2:06)
The important thing about RPC is that it is the security-negotiation layer of the stack. Knowledge is shared between the NFS layer and the RPC layer to implement security principals and access control lists. These security mappings cause lots of heartburn in implementation, because client operating systems differ subtly around security.
- Parallel NFS (pNFS) is new in the 4.1 spec (2:07:36). It allows for a hybrid network to provide performance; it's like having a SAN attached to the same storage as NFS. NFS carries the metadata and common layout information, and maps provided by the NFS server to clients allow large block fetches over a high-performance network for high-speed access to regions of data. Probably the trickiest thing we have done yet in NFS. It supports direct file access, mapping to iSCSI and Fibre Channel SANs, or block storage as defined by the T10 standards.
-- NFS future work
NFSv4.2: small enhancements; server-side copy (similar to an iSCSI feature); space reservations.
Data center concerns: we need to discuss low latency in high-bandwidth networks with the TCP groups.
Expanding use cases include more sharing of large amounts of data, such as large streaming data transactions, and storage for virtualized environments.
-- [David Black] (2:09:45) SAN - Storage Area Networks
More background here, because these systems have largely grown up outside the Internet and the IETF.
-- Overview (see slides)
Storage arrays make logical disks out of physical disks. There is a massive difference in size, from a few disks in a rack mount to large room-size units.
-- Storage protocol classes
The most common way to see SAN protocols on the Internet is not what is in the diagram on this slide. The slide shows server-to-storage access, where Fibre Channel or iSCSI or NFS or CIFS provides access to storage for a server. But on the SAN side, what you see on the Internet are protocols where the arrays talk to each other - often based on a server-storage protocol, but doing something fundamentally different. The big thing: replication protocols, SANs talking to each other over the Internet (more on this below).
-- Why remote replication: robustness
If your storage is in Miami and a hurricane like Andrew comes through, you have a problem: power is down, phone lines are down, the network is down.
-- Remote replication rationale
That's why we do storage replication - Andrew did not go through Chicago. The goal of storage replication is to store the same data in two places that are far enough apart that a single event won't take them both out; you really do need geographical separation.
-- Remote replication: two types (2:13:04)
1) synchronous: identical copy of the data; distance limited
2) asynchronous: delayed consistent copy of the data; higher latencies, longer distances
Usually based on an access protocol (FC, iSCSI, etc.), but with magic on top, and specific to the vendor's disk arrays.
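To illustrate why synchronous replication is distance-limited, here is a rough speed-of-light sketch; the fiber propagation speed and the "one inter-site round trip per write" model are simplifying assumptions.

```python
# Rough sketch of why synchronous replication is distance-limited: each write
# must be acknowledged by the remote array before it completes, so every write
# pays at least one inter-site round trip. The propagation speed and the
# one-RTT-per-write model are simplifying assumptions.

FIBER_KM_PER_MS = 200  # light travels roughly 200 km per ms in fiber (~2/3 c)

def sync_write_penalty_ms(distance_km):
    """Minimum latency added to every synchronous write: one round trip."""
    return 2 * distance_km / FIBER_KM_PER_MS

for km in (10, 100, 1000):   # across town, across a region, across a continent
    print(f"{km:>5} km -> at least {sync_write_penalty_ms(km):.1f} ms per write")
# Since disk I/O is millisecond-sensitive (see the SCSI notes below), long
# distances quickly make synchronous copies impractical; asynchronous
# replication trades consistency lag for distance.
```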
-- The SCSI protocol family: the foundation of SAN storage
-- SCSI [see Dilbert cartoon with Ratbert]
Client-server architecture (really master-slave); the server is the slave to the client. The slave is essentially the disk drive; the disk is resource-constrained, while the client is often a computer with more resources. The disk does what it's told.
It scales to very large disk arrays and is very latency-sensitive: at the bottom of the OS, accessing files and disk storage, milliseconds matter.
-- SCSI architecture (2:14:34)
SCSI managed to abuse the word "transport". It is a layered architecture: command sets and transports. The same command sets are used with all transports. Command sets: primary commands, block commands, stream commands. SCSI transports haul commands and data around; in the IETF we put them on top of Internet transport technology.
SCSI transports: Fibre Channel via the SCSI Fibre Channel Protocol (FCP); iSCSI (Internet SCSI); SAS (Serial Attached SCSI).
Most functionality is specified in the INCITS standards organization, in T10 (SCSI).
-- The SCSI protocol family
SCSI Architecture (SAM) and Commands (SCSI-3): where read and write, config and ready, and errors are defined. There are lots of realizations:
- parallel SCSI & SAS: run over SCSI and SAS cables, generally short distances
- iSCSI: same commands and architecture, runs over TCP/IP on any IP network
- Fibre Channel: runs over FC fibers and switches. Fibre Channel is complex - it is multiple protocols; in addition to SCSI it handles mainframe I/O and IP over FC.
-- The SCSI protocol family and SANs
DAS: direct-attached storage - a computer hooked up to storage via a cable, using parallel SCSI or SAS cables.
IP SAN: using iSCSI. FC SAN: using Fibre Channel.
-- IP SAN: iSCSI (2:17)
This is the work being done in the IETF: SCSI over TCP/IP - RFC 3720 and friends. Typical usage is within the data center (1G and 10G Ethernet); a separate LAN or VLAN is recommended for iSCSI traffic, and data center bridging Ethernet helps with VLAN behavior.
The STORM WG does iSCSI maintenance: consolidating the existing RFCs into one document; the new RFC adds new SCSI transport features. Most SCSI functionality is above the iSCSI level (see T10, not the IETF).
-- Why is iSCSI interesting?
One example: live virtual machine migration (VMware's VMotion) - just move it and it works. To make migration work effectively, you don't want to actually move the storage: we need large **shared** storage. We move the virtual machine, but not its shared storage.
-- The SCSI protocol family and Fibre Channel
Mostly about distance extension. FC doesn't normally run over TCP/IP; it runs over Fibre Channel links.
-- Native FC links
In the data center, FC links are always optical, always single-lane serial, with credit-based flow control - not pause/resume. Credits are managed on the wire, giving absolute control over (and limits on) the data in flight.
-- FC timing and data recovery (2:21)
Very strict timing requirements, and heavyweight error recovery: there is no reliable FC transport protocol, because the link is supposed to just work, and the base protocol is a datagram service. On a drop or reorder... on a drop, the whole operation fails, so there is a timeout 30 seconds later. Tape I/O is even worse: the tape stops in its tracks on an error; on an error during tape streaming, the streaming just stops. FC is very sensitive to drops and reordering; people typically overprovision to avoid congestion-induced drops, and need to avoid reordering. Very primitive compared to TCP/IP.
-- The SCSI protocol family and Fibre Channel: touching IP
Variants and extensions: FC can be encapsulated over TCP/IP (used for replication), or put in a pseudowire over MPLS; FCoE runs over lossless Ethernet.
-- FC pseudowire encapsulation
Brand new; IETF Last Call has completed.
-- FCIP and iFCP
Typically for distance extension; there are timeouts in there, not bad. The most important protocol is FCIP.
-- How FCIP is used
Network customer examples for storage replication: in a large facility, the question is not whether they need a 10G WAN link, but how many.
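A conceptual sketch of the credit-based link flow control mentioned under "Native FC links" above: a sender may transmit only while it holds buffer credits, and each credit comes back when the receiver frees a buffer. This shows the general idea, not actual Fibre Channel buffer-to-buffer credit signaling.

```python
from collections import deque

# Conceptual credit-based flow control: the transmitter holds a fixed number of
# credits (receiver buffers). A frame may be sent only if a credit is available,
# so the data in flight is strictly bounded - unlike pause/resume schemes.
# This is the general idea, not real FC buffer-to-buffer credit signaling.

class CreditedLink:
    def __init__(self, credits):
        self.credits = credits      # receiver buffers advertised at link setup
        self.in_flight = deque()

    def try_send(self, frame):
        if self.credits == 0:
            return False            # must wait: no buffer free at the far end
        self.credits -= 1
        self.in_flight.append(frame)
        return True

    def on_receiver_ready(self):
        """The receiver freed a buffer and returned a credit."""
        self.in_flight.popleft()
        self.credits += 1

link = CreditedLink(credits=2)
print([link.try_send(f) for f in ("f1", "f2", "f3")])  # [True, True, False]
link.on_receiver_ready()                               # a credit comes back
print(link.try_send("f3"))                             # True
```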
-- FCoE
Data center technology: putting FC on "lossless Ethernet". The protocol relies on Ethernet multicast and uses data center bridging.
-- The SCSI protocol family and standards
This world is distributed collaboration:
T10 - SCSI standards
T11 - Fibre Channel standards organization
IETF - handles a bunch: iSCSI, iFCP, pseudowires (PWE)
IEEE - Ethernet
This collaboration makes work in this area interesting.

Q&A:
Joe Touch: 1. There are two different paths to IP - why? What is the benefit of SCSI -> FC -> TCP vs. going direct?
A: yes - distance extension: storage replication. The traffic came in on native Fibre Channel at the array and wants to speak to an array on the far side, and you need to extend replication over the Internet. The most effective way is to rewrap it in TCP/IP and keep SCSI in the stack, because you can think in terms of read and write I/Os.
Joe: that misses my point. If it's iSCSI over TCP/IP, and that over FC... it seems backwards: down to FC, then back up to TCP, rather than letting TCP/IP do what it was designed for and run over any medium.
A: it has to do with the implementation infrastructure. Storage replication was not initially over TCP/IP; it was dark fiber, then WDM, then...
Joe: OK, historical. 2. On the one hand, all of this is designed to run over the one true layer, Ethernet; yet almost everything we've seen says it is sensitive to the layer below and must understand it. What did you mean by "everything over Ethernet"?
A: that was a data center comment. Look in an existing data center: FC SANs that do storage, and three or four separate Ethernets. As people do rack-scale deployments, that is too many cables, so they want to push it all onto Ethernet to reduce cabling. Most transports in the diagram are data-center focused; the other two are WAN protocols.
???: Q: has anyone used erasure coding for replication?
A: there are people working on erasure coding, but the use case for replication is getting back to work quickly - you want all the data in one place so you can get back to work quickly. Erasure coding's property of reassembling shards from various places over a few hours is too expensive.