ForCES Working Group                                    Jamal Hadi Salim
Internet Draft                                             Znyx Networks
                                                        Hormuzd Khosravi
                                                                   Intel
                                                              Andi Kleen
                                                                    Suse
                                                        Alexey Kuznetsov
                                                              INR/Swsoft
                                                               June 2002

                   Netlink as an IP Services Protocol
                    draft-ietf-forces-netlink-03.txt

jhs_hk_ak_ank                                draft-forces-Netlink-03.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC-2119].

1. Abstract

   This document describes Linux Netlink, which is used in Linux both
   as an intra-kernel messaging system and as a messaging system
   between kernel and user space.  This document is intended as
   informational in the context of prior art for the ForCES IETF
   working group.
   The focus of this document is to describe Netlink from the
   perspective of a protocol between a Forwarding Engine Component
   (FEC) and a Control Plane Component (CPC), the two components that
   define an IP service.

   The document ignores the ability of Netlink as an intra-kernel
   messaging system, as an inter-process communication scheme (IPC), or
   as a configuration tool for other non-networking or non-IP network
   services (such as decnet, etc.).

2. Introduction

   The concept of IP Service control-forwarding separation was first
   introduced in the early 1980s by the BSD 4.4 routing sockets
   [Stevens].  The focus at that time was a simple IP(v4) forwarding
   service and how the CPC, either via a command line configuration
   tool or a dynamic route daemon, could control forwarding tables for
   that IPv4 forwarding service.

   The IP world has evolved considerably since those days.  Linux
   Netlink, when observed from a service provisioning and management
   point of view, takes routing sockets one step further by breaking
   the barrier of focus around IPv4 forwarding.  Since the Linux 2.1
   kernel, Netlink has been providing the IP service abstraction to a
   few services other than the classical RFC 1812 IPv4 forwarding.

   The motivation for this document is not to list every possible
   service for which Netlink is applied.  In fact, we leave out a lot
   of services (multicast routing, tunnelling, policy routing, etc.).
   Neither is this document intended to be a tutorial on Netlink.  The
   idea is to explain the overall Netlink view with a special focus on
   the mandatory building blocks within the ForCES charter (i.e., IPv4
   and QoS).  This document also serves to capture prior art to many
   mechanisms that are useful within the context of ForCES.  The text
   is limited to a subset of what is available in kernel 2.4.6, the
   newest kernel when this document was first written.  It is also
   limited to IPv4 functionality.

   We first give some concept definitions and then describe how
   Netlink fits in.

2.1. Definitions

   A Control Plane (CP) is an execution environment that may have
   several sub-components, which we refer to as CPCs.  Each CPC
   provides control for a different IP service being executed by a
   Forwarding Engine (FE) component.  This relationship means that
   there might be several CPCs on a physical CP, if it is controlling
   several IP services.  In essence, the cohesion between a CP
   component and an FE component is the service abstraction.

2.1.1. Control Plane Components (CPCs)

   Control Plane Components encompass signalling protocols, with
   diversity ranging from dynamic routing protocols, such as OSPF
   [RFC2328], to tag distribution protocols, such as CR-LDP [RFC3036].
   Classical management protocols and activities also fall under this
   category.  These include SNMP [RFC1157], COPS [RFC2748], and
   proprietary CLI/GUI configuration mechanisms.

   The purpose of the control plane is to provide an execution
   environment for the above-mentioned activities with the ultimate
   goal being to configure and manage the second Network Element (NE)
   component: the FE.  The result of the configuration defines the way
   that packets traversing the FE are treated.

2.1.2. Forwarding Engine Components (FECs)

   The FE is the entity of the NE that incoming packets (from the
   network into the NE) first encounter.

   The FE's service-specific component massages the packet to provide
   it with a treatment to achieve an IP service, as defined by the
   Control Plane Components for that IP service.  Different services
   will utilize different FECs.  Service modules may be chained to
   achieve a more complex service (refer to the Linux FE model,
   described later).  When built for providing a specific service, the
   FE service component will adhere to a forwarding model.

2.1.2.1. Linux IP Forwarding Engine Model

                        ____      +---------------+
                   +->-| FW |---> | TCP, UDP, ... |
                   |   +----+     +---------------+
                   |                   |
                   ^                   v
                   |                  _|_
                   +----<----+       | FW |
                             |       +----+
                             ^         |
                             |         Y
                          To host  From host
                           stack     stack
                             ^         |
                             |_____    |
Ingress                            ^   Y
device   ____    +-------+        +|---|--+   ____   +--------+ Egress
->----->| FW |-->|Ingress|-->---->| Forw- |->| FW |->| Egress | device
        +----+   |  TC   |        |  ard  |  +----+  |   TC   |-->
                 +-------+        +-------+          +--------+

   The figure above shows the Linux FE model per device.  The only
   mandatory part of the datapath is the Forwarding module, which is
   RFC 1812 conformant.  The different Firewall (FW), Ingress Traffic
   Control, and Egress Traffic Control building blocks are not
   mandatory in the datapath and may even be used to bypass the RFC
   1812 module.  These modules are shown as simple blocks in the
   datapath but, in fact, could be multiple cascaded, independent
   submodules within the indicated blocks.  More information can be
   found at [Netfilter] and [Diffserv].

   Packets arriving at the ingress device first pass through a
   firewall module.  Packets may be dropped, munged, etc., by the
   firewall module.  The incoming packet, depending on set policy, may
   then be passed via an Ingress Traffic Control module.  Metering and
   policing activities are contained within the Ingress TC module.
   Packets may be dropped, depending on metering results and policing
   policies, at this module.  Next, the packet is subjected to the only
   non-optional module, the RFC 1812-conformant Forwarding module.
   The packet may be dropped if it is nonconformant (to the many RFCs
   complementing 1812 and 1122).  This module is a juncture point at
   which packets destined to the forwarding NE may be sent up to the
   host stack.

   Packets that are not for the NE may further traverse a policy
   routing submodule (within the forwarding module), if so provisioned.
   Another firewall module is walked next.  The firewall module can
   drop or munge/transform packets, depending on the configured
   submodules encountered and their policies.  If all goes well, the
   Egress TC module is accessed next.

   The Egress TC may drop packets for policing, scheduling, congestion
   control, or rate control reasons.  Egress queues exist at this
   point and any of the drops or delays may happen before or after the
   packet is queued.  All is dependent on configured module algorithms
   and policies.

2.1.3. IP Services

   An IP service is the treatment of an IP packet within the NE.  This
   treatment is provided by a combination of both the CPC and the FEC.

   The time span of the service is from the moment when the packet
   arrives at the NE to the moment that it departs.  In essence, an IP
   service in this context is a Per-Hop Behavior.  CP components
   running on NEs define the end-to-end path control for a service by
   running control and signaling protocols and management applications.
   These distributed CPCs unify the end-to-end view of the IP service.
   As noted above, these CP components then define the behavior of the
   FE (and therefore the NE) for a described packet.

   A simple example of an IP service is the classical IPv4 Forwarding.
   In this case, control components, such as routing protocols (OSPF,
   RIP, etc.) and proprietary CLI/GUI configurations, modify the FE's
   forwarding tables in order to offer the simple service of
   forwarding packets to the next hop.  Traditionally, NEs offering
   this simple service are known as routers.
   In the diagram below, we show a
   simple FE<->CP setup to provide an example of the classical IPv4
   service with an extension to do some basic QoS egress scheduling
   and illustrate how the setup fits in this described model.

                         Control Plane (CP)
                       .------------------------------------
                       |     /^^^^^^\    /^^^^^^\          |
                       |    |        |  | COPS   |-\       |
                       |    | ospfd  |  |  PEP   |  \      |
                       |    \       /    \_____/     |     |
                      /------\_____/        |       /      |
                     | |        |           |     /        |
                     | |_________\__________|____|_________|
                     |           |          |    |
                  ******************************************
    Forwarding    ************* Netlink layer ************
    Engine (FE)   *****************************************
       .-------------|-----------|----------|----|------------
       |     IPv4 forwarding     |               |            |
       |     FE Service         /               /             |
       |     Component         /               /              |
       |       ---------------/---------------/---------      |
       |       |              |              /         |      |
packet |       |      --------|--       ----|-----     |  packet
in     |       |     |  IPv4    |      | Egress   |    |   out
-->--->|------>|---->|Forwarding|----->| QoS      |--->| ---->|->
       |       |     |          |      | Scheduler|    |      |
       |       |      -----------       ----------     |      |
       |       |                                       |      |
       |        ---------------------------------------       |
       |                                                      |
        -------------------------------------------------------

   The above diagram illustrates ospfd, an OSPF protocol control
   daemon, and a COPS Policy Enforcement Point (PEP) as distinct CPCs.
   The IPv4 FE component includes the IPv4 Forwarding service module
   as well as the Egress Scheduling service module.  Another service
   might add a policy forwarder between the IPv4 forwarder and the QoS
   egress scheduler.  A simpler classical service would have
   constituted only the IPv4 forwarder.

   Over the years, it has become important to add additional services
   to routers to meet emerging requirements.  More complex services
   extending classical forwarding have been added and standardized.
   These newer services might go beyond the layer 3 contents of the
   packet header.
   However, the name "router," although a misnomer, is
   still used to describe these NEs.  Services (which may look beyond
   the classical L3 service headers) include firewalling, QoS in
   Diffserv and RSVP, NAT, policy based routing, etc.  Newer control
   protocols or management activities are introduced with these new
   services.

   One extreme definition of an IP service is something for which a
   service provider would be able to charge.

3. Netlink Architecture

   Control of IP service components is defined by using templates.

   The FEC and CPC participate to deliver the IP service by
   communicating using these templates.  The FEC might continuously
   get updates from the Control Plane Component on how to operate the
   service (e.g., for v4 forwarding or for route additions or
   deletions).

   The interaction between the FEC and the CPC, in the Netlink
   context, defines a protocol.  Netlink provides mechanisms for the
   CPC (residing in user space) and the FEC (residing in kernel space)
   to have their own protocol definition.  Kernel space and user space
   just mean different protection domains.  Therefore, a wire protocol
   is needed to communicate.  The wire protocol is normally provided
   by some privileged service that is able to copy between multiple
   protection domains.  We will refer to this service as the Netlink
   service.  The Netlink service can also be encapsulated in a
   different transport layer, if the CPC executes on a different node
   than the FEC.  The FEC and CPC, using Netlink mechanisms, may choose
   to define a reliable protocol between each other.  By default,
   however, Netlink provides an unreliable communication.

   Note that the FEC and CPC can both live in the same memory
   protection domain and use the connect() system call to create a
   path to the peer and talk to each other.
   We will not discuss this
   mechanism further other than to say that it is available.
   Throughout this document, we will use FEC interchangeably with
   kernel space and CPC interchangeably with user space.  This
   denomination is not meant, however, to restrict the two components
   to these protection domains or to the same compute node.

   Note: Netlink allows participation in IP services by both service
   components.

3.1. Netlink Logical Model

   In the diagram below we show a simple FEC<->CPC logical
   relationship.  We use the IPv4 forwarding FEC (NETLINK_ROUTE, which
   is discussed further below) as an example.

                      Control Plane (CP)
                    .------------------------------------
                    |      /^^^^^\     /^^^^^\          |
                    |     |       |   / CPC-2 \         |
                    |     | CPC-1 |   | COPS  |         |
                    |     | ospfd |   | PEP   |         |
                    |     \       /    \_____/          |
                    |      \_____/        |             |
                    |         |           |             |
                ****************************************|
                ************* BROADCAST WIRE ************
   FE---------- *****************************************.
   |    IPv4 forwarding  |    |           |              |
   |          FEC        |    |           |              |
   |      --------------/ ----|-----------|--------      |
   |      |            /      |           |       |      |
   |      |   .-------.   .-------.    .------.   |      |
   |      |   |Ingress|   | IPv4  |    |Egress|   |      |
   |      |   |police |   |Forward|    | QoS  |   |      |
   |      |   |_______|   |_______|    |Sched |   |      |
   |      |                             ------    |      |
   |       ---------------------------------------       |
   |                                                     |
    -----------------------------------------------------

   Netlink logically models FECs and CPCs in the form of nodes
   interconnected to each other via a broadcast wire.

   The wire is specific to a service.  The example above shows the
   broadcast wire belonging to the extended IPv4 forwarding service.

   Nodes (CPCs or FECs as illustrated above) connect to the wire and
   register to receive specific messages.  CPCs may connect to
   multiple wires if it helps them to control the service better.  All
   nodes (CPCs and FECs) dump packets on the broadcast wire.
   Packets
   can be discarded by the wire if they are malformed or not
   specifically formatted for the wire.  Dropped packets are not seen
   by any of the nodes.  The Netlink service MAY signal an error to the
   sender if it detects a malformed Netlink packet.

   Packets sent on the wire can be broadcast, multicast, or unicast.
   FECs or CPCs register for specific messages of interest for
   processing or just monitoring purposes.

   Appendices 1 and 2 have a high level overview of this interaction.

3.2. Message Format

   There are three levels to a Netlink message: The general Netlink
   message header, the IP service specific template, and the IP
   service specific data.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                    Netlink message header                     |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                      IP Service Template                      |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |               IP Service specific data in TLVs                |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Netlink message is used to communicate between the FEC and CPC
   for parametrization of the FECs, asynchronous event notification
   of FEC events to the CPCs, and statistics querying/gathering
   (typically by a CPC).

   The Netlink message header is generic for all services, whereas the
   IP Service Template header is specific to a service.  Each IP
   Service then carries parametrization data (CPC->FEC direction) or
   response (FEC->CPC direction).  These parametrizations are in TLV
   (Type-Length-Value) format and are unique to the service.

3.3. Protocol Model

   This section expands on how Netlink provides the mechanism for
   service-oriented FEC and CPC interaction.
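
   As a rough illustration of the three-level layout from Section 3.2
   (generic header, service template, service data in TLVs), the
   following Python sketch packs and unpacks one message.  It is not
   taken from this document: the 16-byte header layout, the 4-byte TLV
   header, and the 4-byte TLV padding are assumptions based on the
   Linux headers <linux/netlink.h> and <linux/rtnetlink.h>, and the
   function names are hypothetical.  The service template is treated
   as an opaque byte string.

```python
import struct

# Assumed layouts (from the Linux headers, not from this document):
NLMSG_HDR = struct.Struct("=IHHII")  # length, type, flags, sequence, PID
TLV_HDR = struct.Struct("=HH")       # TLV length (header + value), TLV type
ALIGNTO = 4                          # TLVs are padded to 4-byte boundaries


def align4(n):
    """Round n up to the next multiple of ALIGNTO."""
    return (n + ALIGNTO - 1) & ~(ALIGNTO - 1)


def pack_tlv(tlv_type, value):
    """Encode one Type-Length-Value attribute, zero-padded to 4 bytes."""
    length = TLV_HDR.size + len(value)
    padding = b"\0" * (align4(length) - length)
    return TLV_HDR.pack(length, tlv_type) + value + padding


def pack_message(msg_type, flags, seq, pid, template, tlvs):
    """Build header + service template + TLVs as one Netlink message."""
    body = template + b"".join(pack_tlv(t, v) for t, v in tlvs)
    total = NLMSG_HDR.size + len(body)
    return NLMSG_HDR.pack(total, msg_type, flags, seq, pid) + body


def split_message(msg, template_len):
    """Split one message into (header fields, template, list of TLVs)."""
    header = NLMSG_HDR.unpack_from(msg, 0)
    end = min(header[0], len(msg))       # trust the header length, bounded
    start = NLMSG_HDR.size
    template = msg[start:start + template_len]
    tlvs, off = [], start + template_len
    while off + TLV_HDR.size <= end:
        length, tlv_type = TLV_HDR.unpack_from(msg, off)
        if length < TLV_HDR.size or off + length > end:
            break                        # malformed TLV: stop parsing
        tlvs.append((tlv_type, msg[off + TLV_HDR.size:off + length]))
        off += align4(length)
    return header, template, tlvs


if __name__ == "__main__":
    # Hypothetical request: message type 18, flag 0x01, an all-zero
    # 16-byte template, and one TLV of type 3 carrying a device name.
    msg = pack_message(18, 0x01, 1, 0, bytes(16), [(3, b"eth0\0")])
    print(split_message(msg, 16))
```

   The sketch keeps the template opaque on purpose: only the header
   and the TLV framing are shared by every service, while the
   template in between is defined per service.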

3.3.1. Service Addressing

   Access is provided by first connecting to the service on the FE.
   The connection is achieved by making a socket() system call to the
   PF_NETLINK domain.  Each FEC is identified by a protocol number.
   One may open either SOCK_RAW or SOCK_DGRAM type sockets, although
   Netlink does not distinguish between the two.  The socket
   connection provides the basis for the FE<->CP addressing.

   Connecting to a service is followed (at any point during the life
   of the connection) by either issuing a service-specific command
   (from the CPC to the FEC, mostly for configuration purposes),
   issuing a statistics-collection command, or
   subscribing/unsubscribing to service events.  Closing the socket
   terminates the transaction.  Refer to Appendices 1 and 2 for
   examples.

3.3.2. Netlink Message Header

   Netlink messages consist of a byte stream with one or multiple
   Netlink headers and an associated payload.  If the payload is too
   big to fit into a single message, it can be split over multiple
   Netlink messages, collectively called a multipart message.  For
   multipart messages, the first and all following headers have the
   NLM_F_MULTI Netlink header flag set, except for the last header
   which has the Netlink header type NLMSG_DONE.

   The Netlink message header is shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Length                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Type              |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Process ID (PID)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The fields in the header are:

   Length: 32 bits
   The length of the message in bytes, including the header.

   Type: 16 bits
   This field describes the message content.
   It can be one of the standard message types:
      NLMSG_NOOP      Message is ignored.
      NLMSG_ERROR     The message signals an error and the payload
                      contains a nlmsgerr structure.  This can be
                      looked at as a NACK and typically it is from
                      FEC to CPC.
      NLMSG_DONE      Message terminates a multipart message.

   Individual IP services specify more message types, e.g.,
   NETLINK_ROUTE service specifies several types, such as RTM_NEWLINK,
   RTM_DELLINK, RTM_GETLINK, RTM_NEWADDR, RTM_DELADDR, RTM_NEWROUTE,
   RTM_DELROUTE, etc.

   Flags: 16 bits
   The standard flag bits used in Netlink are
      NLM_F_REQUEST   Must be set on all request messages (typically
                      from user space to kernel space)
      NLM_F_MULTI     Indicates the message is part of a multipart
                      message terminated by NLMSG_DONE
      NLM_F_ACK       Request for an acknowledgment on success.
                      Typical direction of request is from user
                      space (CPC) to kernel space (FEC).
      NLM_F_ECHO      Echo this request.  Typical direction of
                      request is from user space (CPC) to kernel
                      space (FEC).

   Additional flag bits for GET requests on config information in
   the FEC.
      NLM_F_ROOT      Return the complete table instead of a
                      single entry.
      NLM_F_MATCH     Return all entries matching criteria passed in
                      message content.
      NLM_F_ATOMIC    Return an atomic snapshot of the table being
                      referenced.  This may require special privileges
                      because it has the potential to interrupt
                      service in the FE for a longer time.

   Convenience macros for flag bits:
      NLM_F_DUMP      This is NLM_F_ROOT or'ed with NLM_F_MATCH

   Additional flag bits for NEW requests
      NLM_F_REPLACE   Replace existing matching config object with
                      this request.
      NLM_F_EXCL      Don't replace the config object if it already
                      exists.
      NLM_F_CREATE    Create config object if it doesn't already
                      exist.
      NLM_F_APPEND    Add to the end of the object list.

   For those familiar with BSDish use of such operations in route
   sockets, the equivalent translations are:

      - BSD ADD operation equates to NLM_F_CREATE or-ed
        with NLM_F_EXCL
      - BSD CHANGE operation equates to NLM_F_REPLACE
      - BSD Check operation equates to NLM_F_EXCL
      - BSD APPEND equivalent is actually mapped to
        NLM_F_CREATE

   Sequence Number: 32 bits
   The sequence number of the message.

   Process ID (PID): 32 bits
   The PID of the process sending the message.  The PID is used by the
   kernel to multiplex to the correct sockets.  A PID of zero is used
   when sending messages to user space from the kernel.

3.3.2.1. Mechanisms for Creating Protocols

   One could create a reliable protocol between an FEC and a CPC by
   using the combination of sequence numbers, ACKs, and retransmit
   timers.  Both sequence numbers and ACKs are provided by Netlink;
   timers are provided by Linux.

   One could create a heartbeat protocol between the FEC and CPC by
   using the ECHO flags and the NLMSG_NOOP message.

3.3.2.2. The ACK Netlink Message

   This message is actually used to denote both an ACK and a NACK.
   Typically, the direction is from FEC to CPC (in response to an ACK
   request message).  However, the CPC should be able to send ACKs
   back to FEC when requested.  The semantics for this are IP
   service-specific.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Netlink message header                     |
   |                      type = NLMSG_ERROR                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Error code                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  OLD Netlink message header                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Error code: integer (typically 32 bits)

   An error code of zero indicates that the message is an ACK
   response.  An ACK response message contains the original Netlink
   message header, which can be used to compare against (sent sequence
   numbers, etc).

   A non-zero error code message is equivalent to a Negative ACK
   (NACK).  In such a situation, the Netlink data that was sent down
   to the kernel is returned appended to the original Netlink message
   header.  An error code printable via perror() is also set (not
   in the message header, rather in the executing environment state
   variable).

3.3.3. FE System Services' Templates

   These are services that are offered by the system for general use
   by other services.  They include the ability to configure, gather
   statistics and listen to changes in shared resources.  IP address
   management, link events, etc. fit here.  We create this section for
   these services for logical separation, despite the fact that they
   are accessed via the NETLINK_ROUTE FEC.  The reason that they exist
   within NETLINK_ROUTE is due to historical cruft: the BSD 4.4 Route
   Sockets implemented them as part of the IPv4 forwarding sockets.

3.3.3.1. Network Interface Service Module

   This service provides the ability to create, remove, or get
   information about a specific network interface.  The network
   interface can be either physical or virtual and is network protocol
   independent (e.g., an x.25 interface can be defined via this
   message).  The Interface service message template is shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |   Reserved    |          Device Type          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Device Flags                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Change Mask                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
   This is always set to AF_UNSPEC.

   Device Type: 16 bits
   This defines the type of the link.  The link could be Ethernet, a
   tunnel, etc.  We are interested only in IPv4, although the link
   type is L3 protocol-independent.

   Interface Index: 32 bits
   Uniquely identifies interface.

   Device Flags: 32 bits

      IFF_UP            Interface is administratively up.
      IFF_BROADCAST     Valid broadcast address set.
      IFF_DEBUG         Internal debugging flag.
      IFF_LOOPBACK      Interface is a loopback interface.
      IFF_POINTOPOINT   Interface is a point-to-point link.
      IFF_RUNNING       Interface is operationally up.
      IFF_NOARP         No ARP protocol needed for this interface.
      IFF_PROMISC       Interface is in promiscuous mode.
      IFF_NOTRAILERS    Avoid use of trailers.
      IFF_ALLMULTI      Receive all multicast packets.
      IFF_MASTER        Master of a load balancing bundle.
      IFF_SLAVE         Slave of a load balancing bundle.
      IFF_MULTICAST     Supports multicast.
      IFF_PORTSEL       Is able to select media type via ifmap.
      IFF_AUTOMEDIA     Auto media selection active.
      IFF_DYNAMIC       Interface was dynamically created.

   Change Mask: 32 bits
   Reserved for future use.  Must be set to 0xFFFFFFFF.

   Applicable attributes:
      Attribute         Description
      ...........................................................
      IFLA_UNSPEC       Unspecified.
      IFLA_ADDRESS      Hardware address interface L2 address.
      IFLA_BROADCAST    Hardware address L2 broadcast
                        address.
      IFLA_IFNAME       ASCII string device name.
      IFLA_MTU          MTU of the device.
      IFLA_LINK         ifindex of link to which this device
                        is bound.
      IFLA_QDISC        ASCII string defining egress root
                        queueing discipline.
      IFLA_STATS        Interface statistics.

   Netlink message types specific to this service:
   RTM_NEWLINK, RTM_DELLINK, and RTM_GETLINK

3.3.3.2. IP Address Service Module

   This service provides the ability to add, remove, or receive
   information about an IP address associated with an interface.  The
   address provisioning service message template is shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |    Length     |     Flags     |     Scope     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
   Address Family: AF_INET for IPv4; and AF_INET6 for IPv6.

   Length: 8 bits
   The length of the address mask.

   Flags: 8 bits
      IFA_F_SECONDARY    For secondary address (alias interface).
      IFA_F_PERMANENT    For a permanent address set by the user.
                         When this is not set, it means the address
                         was dynamically created (e.g., by stateless
                         autoconfiguration).
      IFA_F_DEPRECATED   Defines deprecated (IPV4) address.
      IFA_F_TENTATIVE   Defines tentative (IPv6) address (duplicate
                        address detection is still in progress).

   Scope: 8 bits
      The address scope in which the address stays valid.
         SCOPE_UNIVERSE:          Global scope.
         SCOPE_SITE (IPv6 only):  Only valid within this site.
         SCOPE_LINK:              Valid only on this device.
         SCOPE_HOST:              Valid only on this host.

   Applicable attributes:

      Attribute          Description
      .........................................................
      IFA_UNSPEC         Unspecified.
      IFA_ADDRESS        Raw protocol address of interface.
      IFA_LOCAL          Raw protocol local address.
      IFA_LABEL          ASCII string name of the interface.
      IFA_BROADCAST      Raw protocol broadcast address.
      IFA_ANYCAST        Raw protocol anycast address.
      IFA_CACHEINFO      Cache address information.

   Netlink messages specific to this service: RTM_NEWADDR,
   RTM_DELADDR, and RTM_GETADDR.

4. Currently Defined Netlink IP Services

   Although many other IP services are defined that use Netlink, as
   mentioned earlier, we discuss only a handful of those integrated
   into kernel version 2.4.6. These are:

      NETLINK_ROUTE, NETLINK_FIREWALL, and NETLINK_ARPD.

4.1. IP Service NETLINK_ROUTE

   This service allows CPCs to modify the IPv4 routing table in the
   Forwarding Engine. It can also be used by CPCs to receive routing
   updates, as well as to collect statistics.

4.1.1. Network Route Service Module

   This service provides the ability to create, remove, or receive
   information about a network route. The service message template is
   shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |  Src length   |  Dest length  |      TOS      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Table ID    |   Protocol    |     Scope     |     Type      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Flags                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address Family: AF_INET for IPv4; and AF_INET6 for IPv6.

   Src length: 8 bits
      Prefix length of source IP address.

   Dest length: 8 bits
      Prefix length of destination IP address.

   TOS: 8 bits
      The 8-bit TOS (should be deprecated to make room for DSCP).

   Table ID: 8 bits
      Table identifier. Up to 255 route tables are supported.
         RT_TABLE_UNSPEC    An unspecified routing table.
         RT_TABLE_DEFAULT   The default table.
         RT_TABLE_MAIN      The main table.
         RT_TABLE_LOCAL     The local table.

      The user may assign arbitrary values between
      RT_TABLE_UNSPEC(0) and RT_TABLE_DEFAULT(253).

   Protocol: 8 bits
      Identifies what/who added the route.

         Protocol           Route origin.
         ..............................................
         RTPROT_UNSPEC      Unknown.
         RTPROT_REDIRECT    By an ICMP redirect.
         RTPROT_KERNEL      By the kernel.
         RTPROT_BOOT        During bootup.
         RTPROT_STATIC      By the administrator.

      Values larger than RTPROT_STATIC(4) are not interpreted by the
      kernel; they are just for user information. They may be used to
      tag the source of routing information or to distinguish between
      multiple routing daemons. See for the routing daemon identifiers
      that are already assigned.

   Scope: 8 bits
      Route scope (valid distance to destination).
         RT_SCOPE_UNIVERSE  Global route.
         RT_SCOPE_SITE      Interior route in the local autonomous
                            system.
         RT_SCOPE_LINK      Route on this link.
         RT_SCOPE_HOST      Route on the local host.
         RT_SCOPE_NOWHERE   Destination does not exist.

      The values between RT_SCOPE_UNIVERSE(0) and RT_SCOPE_SITE(200)
      are available to the user.

   Type: 8 bits
      The type of route.

         Route type         Description
         ----------------------------------------------------
         RTN_UNSPEC         Unknown route.
         RTN_UNICAST        A gateway or direct route.
         RTN_LOCAL          A local interface route.
         RTN_BROADCAST      A local broadcast route (sent as a
                            broadcast).
         RTN_ANYCAST        An anycast route.
         RTN_MULTICAST      A multicast route.
         RTN_BLACKHOLE      A silent packet dropping route.
         RTN_UNREACHABLE    An unreachable destination. Packets
                            dropped and host unreachable ICMPs are
                            sent to the originator.
         RTN_PROHIBIT       A packet rejection route. Packets are
                            dropped and communication prohibited
                            ICMPs are sent to the originator.
         RTN_THROW          When used with policy routing, continue
                            routing lookup in another table. Under
                            normal routing, packets are dropped and
                            net unreachable ICMPs are sent to the
                            originator.
         RTN_NAT            A network address translation rule.
         RTN_XRESOLVE       Refer to an external resolver (not
                            implemented).

   Flags: 32 bits
      Further qualify the route.
         RTM_F_NOTIFY       If the route changes, notify the user.
         RTM_F_CLONED       Route is cloned from another route.
         RTM_F_EQUALIZE     Allow randomization of next hop path in
                            multi-path routing (currently not
                            implemented).

   Attributes applicable to this service:

      Attribute          Description
      ---------------------------------------------------
      RTA_UNSPEC         Ignored.
      RTA_DST            Protocol address for route destination
                         address.
      RTA_SRC            Protocol address for route source address.
      RTA_IIF            Input interface index.
      RTA_OIF            Output interface index.
      RTA_GATEWAY        Protocol address for the gateway of the
                         route.
      RTA_PRIORITY       Priority of route.
      RTA_PREFSRC        Preferred source address in cases where
                         more than one source address could be used.
      RTA_METRICS        Route metrics attributed to route and
                         associated protocols (e.g., RTT, initial
                         TCP window, etc.).
      RTA_MULTIPATH      Multipath route next hop's attributes.
      RTA_PROTOINFO      Firewall based policy routing attribute.
      RTA_FLOW           Route realm.
      RTA_CACHEINFO      Cached route information.

   Additional Netlink message types applicable to this service:
   RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE

4.1.2. Neighbour Setup Service Module

   This service provides the ability to add, remove, or receive
   information about a neighbour table entry (e.g., an ARP entry or an
   IPv6 neighbour solicitation, etc.). The service message template is
   shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             State             |     Flags     |     Type      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address Family: AF_INET for IPv4; and AF_INET6 for IPv6.

   Interface Index: 32 bits
      The unique interface index.

   State: 16 bits
      A bitmask of the following states:
         NUD_INCOMPLETE     Still attempting to resolve.
         NUD_REACHABLE      A confirmed working cache entry.
         NUD_STALE          An expired cache entry.
         NUD_DELAY          Neighbour no longer reachable. Traffic
                            sent, waiting for confirmation.
         NUD_PROBE          A cache entry that is currently being
                            re-solicited.
         NUD_FAILED         An invalid cache entry.
         NUD_NOARP          A device which does not do neighbor
                            discovery (ARP).
         NUD_PERMANENT      A static entry.

   Flags: 8 bits
         NTF_PROXY          A proxy ARP entry.
         NTF_ROUTER         An IPv6 router.

   Attributes applicable to this service:

      Attributes         Description
      ------------------------------------
      NDA_UNSPEC         Unknown type.
      NDA_DST            A neighbour cache network layer destination
                         address.
      NDA_LLADDR         A neighbour cache link layer address.
      NDA_CACHEINFO      Cache statistics.

   Additional Netlink message types applicable to this service:
   RTM_NEWNEIGH, RTM_DELNEIGH, and RTM_GETNEIGH.

4.1.3. Traffic Control Service

   This service provides the ability to provision, query, or listen to
   events under the auspices of traffic control. These include
   queueing disciplines (schedulers and queue treatment algorithms,
   e.g., a priority-based scheduler or the RED algorithm) and
   classifiers. Linux Traffic Control Service is very flexible and
   allows for hierarchical cascading of the different blocks for
   traffic resource sharing.

       ++    ++                 +-----+   +-------+   ++     ++ .++
       || .  ||     +------+    |     |-->| Qdisc |-->||     ||  ||
       ||    ||---->|Filter|--->|Class|   +-------+   ||-+   ||  ||
       ||    ||  |  +------+    |     +---------------+| |   ||  ||
       || .  ||  |              +----------------------+ |   || .||
       || .  ||  |  +------+                             |   ||  ||
       ||    ||  +->|Filter|-_  +-----+   +-------+   ++ |   || .||
       || -->||  |  +------+  ->|     |-->| Qdisc |-->|| |   ||->||
       || .  ||  |              |Class|   +-------+   ||-+-->|| .||
->dev->||    ||  |  +------+ _->|     +---------------+|     ||  ||
       ||    ||  +->|Filter|-   +----------------------+     || .||
       ||    ||     +------+                                 || .||
       || .  |+----------------------------------------------+|  ||
       ||    |           Parent Queuing discipline            | .||
       || .  +------------------------------------------------+ .||
       || . . .. . . .. . .           . .. .. .. .          ..  ||
       |+--------------------------------------------------------+|
       |                Parent Queuing discipline                 |
       |               (attached to egress device)                |
       +----------------------------------------------------------+

   The above diagram shows an example of the Egress TC block. We try
   to be very brief here. For more information, please refer to
   [Diffserv]. A packet first goes through a filter that is used to
   identify a class to which the packet may belong. A class is
   essentially a terminal queueing discipline and has a queue
   associated with it. The queue may be subject to a simple algorithm,
   like FIFO, or a more complex one, like RED or a token bucket. The
   outermost queueing discipline, which is referred to as the parent,
   is typically associated with a scheduler. Within this scheduler
   hierarchy, however, may be other scheduling algorithms, making the
   Linux Egress TC very flexible.

   The service message template that makes this possible is shown
   below. This template is used in both the ingress and the egress
   queueing disciplines (refer to the egress traffic control model in
   the FE model section). Each of the specific components of the
   model has unique attributes that describe it best. The common
   attributes are described below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Family     |   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Interface Index                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Qdisc handle                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Parent Qdisc                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           TCM Info                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Family: 8 bits
      Address Family: AF_INET for IPv4; and AF_INET6 for IPv6.

   Interface Index: 32 bits
      The unique interface index.

   Qdisc handle: 32 bits
      Unique identifier for an instance of a queueing discipline.
      Typically, this is split into major:minor of 16 bits each. The
      major number would also be the major number of the parent of
      this instance.

   Parent Qdisc: 32 bits
      Used in hierarchical layering of queueing disciplines. If this
      value and the Qdisc handle are the same and equal to TC_H_ROOT,
      then the defined qdisc is the topmost layer, known as the root
      qdisc.

   TCM Info: 32 bits
      Typically set by the FE to 1, except when the Qdisc instance is
      in use, in which case it is set to imply a reference count. From
      the CPC towards the FEC, this is typically set to 0, except when
      used in the context of filters. In that case, this 32-bit field
      is split into a 16-bit priority field and a 16-bit protocol
      field. The protocol is defined in the kernel source; the most
      commonly used one is ETH_P_IP (the IP protocol).

      The priority is used for conflict resolution when filters
      intersect in their expressions.
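
   As a minimal sketch of the two 32-bit encodings described above,
   the following Python fragment splits a Qdisc handle into its
   major:minor halves and composes a TCM Info value for a filter. The
   helper names are hypothetical. Placing the priority in the upper 16
   bits and the protocol in the lower 16 bits is an assumption (the
   text states only that the field is split in two), as is the value
   0x0800 for ETH_P_IP. Byte-order handling of the protocol half is
   omitted.

```python
# Hypothetical helpers: the names and the priority/protocol bit placement
# are assumptions for illustration, not taken from this document.
ETH_P_IP = 0x0800  # assumed numeric value of ETH_P_IP


def make_handle(major, minor):
    """Combine 16-bit major and minor numbers into a 32-bit handle."""
    return ((major & 0xFFFF) << 16) | (minor & 0xFFFF)


def split_handle(handle):
    """Return the (major, minor) halves of a 32-bit handle."""
    return (handle >> 16) & 0xFFFF, handle & 0xFFFF


def make_filter_info(priority, protocol):
    """Pack a filter's 16-bit priority and 16-bit protocol into TCM Info."""
    return ((priority & 0xFFFF) << 16) | (protocol & 0xFFFF)


handle = make_handle(0x100, 0x1)      # the handle written as 100:1
parent = make_handle(0x100, 0x0)      # its parent, 100:0
info = make_filter_info(1, ETH_P_IP)  # priority 1, IP protocol
print(hex(handle), hex(parent), hex(info))
```

   The handle 100:1 is read here as hexadecimal, which agrees with the
   Qdisc handle value 0x1000001 used in the Appendix 3 example.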

   Generic attributes applicable to this service:

      Attribute          Description
      ------------------------------------
      TCA_KIND           Canonical name of FE component.
      TCA_STATS          Generic usage statistics of FEC.
      TCA_RATE           Rate estimator being attached to FEC. Takes
                         snapshots of stats to compute rate.
      TCA_XSTATS         Specific statistics of FEC.
      TCA_OPTIONS        Nested FEC-specific attributes.

   Appendix 3 has an example of configuring an FE component for a FIFO
   Qdisc.

   Additional Netlink message types applicable to this service:
   RTM_NEWQDISC, RTM_DELQDISC, RTM_GETQDISC, RTM_NEWTCLASS,
   RTM_DELTCLASS, RTM_GETTCLASS, RTM_NEWTFILTER, RTM_DELTFILTER, and
   RTM_GETTFILTER.

4.2. IP Service NETLINK_FIREWALL

   This service allows CPCs to receive, manipulate, and re-inject
   packets via the IPv4 firewall service modules in the FE. A firewall
   rule is first inserted to activate packet redirection. The CPC
   informs the FEC whether it would like to receive just the metadata
   on the packet or the actual data and, if the actual data is
   desired, the maximum data length to be redirected. The redirected
   packets are still stored in the FEC, awaiting a verdict from the
   CPC. The verdict could constitute a simple accept or drop decision
   of the packet, in which case the verdict is imposed on the packet
   still sitting on the FEC. The verdict may also include a modified
   packet to be sent on as a replacement.

   Two types of messages exist that can be sent from CPC to FEC.
   These are: Mode messages and Verdict messages. Mode messages are
   sent immediately to the FEC to describe what the CPC would like to
   receive. Verdict messages are sent to the FEC after a decision has
   been made on the fate of a received packet. The formats are
   described below.

   The mode message is described first.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Mode      |   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Range                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Mode: 8 bits
      Control information on the packet to be sent to the CPC. The
      different types are:

         IPQ_COPY_META     Copy only packet metadata to CPC.
         IPQ_COPY_PACKET   Copy packet metadata and packet payloads
                           to CPC.

   Range: 32 bits
      If IPQ_COPY_PACKET, this defines the maximum length to copy.

   A packet and its associated metadata, as received by the CPC in
   user space, look as follows.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Packet ID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Mark                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          timestamp_m                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          timestamp_u                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             hook                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          indev_name                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          outdev_name                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          hw_protocol          |            hw_type            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  hw_addrlen   |                   Reserved                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            hw_addr                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           data_len                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Payload . . .                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Packet ID: 32 bits
      The unique packet identifier as passed to the CPC by the FEC.

   Mark: 32 bits
      The internal metadata value set to describe the rule in which
      the packet was picked.

   timestamp_m: 32 bits
      Packet arrival time (seconds).

   timestamp_u: 32 bits
      Packet arrival time (useconds in addition to the seconds in
      timestamp_m).

   hook: 32 bits
      The firewall module from which the packet was picked.

   indev_name: 128 bits
      ASCII name of incoming interface.

   outdev_name: 128 bits
      ASCII name of outgoing interface.

   hw_protocol: 16 bits
      Hardware protocol, in network order.

   hw_type: 16 bits
      Hardware type.

   hw_addrlen: 8 bits
      Hardware address length.

   hw_addr: 64 bits
      Hardware address.

   data_len: 32 bits
      Length of packet data.

   Payload: size defined by data_len
      The payload of the packet received.

   The Verdict message format is as follows.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Value                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Packet ID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Data Length                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Payload . . .                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Value: 32 bits
      This is the verdict to be imposed on the packet still sitting
      in the FEC. Verdicts could be:
         NF_ACCEPT   Accept the packet and let it continue its
                     traversal.
         NF_DROP     Drop the packet.

   Packet ID: 32 bits
      The packet identifier as passed to the CPC by the FEC.

   Data Length: 32 bits
      The data length of the modified packet (in bytes). If the packet
      is not modified, set this to 0.

   Payload:
      Size as defined by the Data Length field.

4.3. IP Service NETLINK_ARPD

   This service is used by CPCs for managing the neighbor table in the
   FE. The message format used between the FEC and CPC is described
   in the section on the Neighbour Setup Service Module.

   The CPC service is expected to participate in neighbor solicitation
   protocol(s).

   A neighbor message of type RTM_NEWNEIGH is sent towards the CPC by
   the FE to inform the CPC of changes that might have happened on
   that neighbour's entry (e.g., a neighbor being perceived as
   unreachable).

   RTM_GETNEIGH is used to solicit the CPC for information on a
   specific neighbor.

5. Security Considerations

   Netlink lives in a trusted environment of a single host separated
   by kernel and user space. Linux capabilities ensure that only
   someone with CAP_NET_ADMIN capability (typically, the root user) is
   allowed to open sockets.

6. References

   [RFC1633]    R. Braden, D. Clark, and S. Shenker, "Integrated
                Services in the Internet Architecture: an Overview",
                RFC 1633, ISI, MIT, and PARC, June 1994.

   [RFC1812]    F. Baker, "Requirements for IP Version 4 Routers",
                RFC 1812, June 1995.

   [RFC2475]    M. Carlson, W. Weiss, S. Blake, Z. Wang, D. Black,
                and E. Davies, "An Architecture for Differentiated
                Services", RFC 2475, December 1998.

   [RFC2748]    J. Boyle, R. Cohen, D. Durham, S. Herzog, R. Rajan,
                A. Sastry, "The COPS (Common Open Policy Service)
                Protocol", RFC 2748, January 2000.

   [RFC2328]    J. Moy, "OSPF Version 2", RFC 2328, April 1998.

   [RFC1157]    J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin,
                "Simple Network Management Protocol (SNMP)", RFC 1157,
                May 1990.

   [RFC3036]    L. Andersson, P.
                Doolan, N. Feldman, A. Fredette, B. Thomas, "LDP
                Specification", RFC 3036, January 2001.

   [Stevens]    G.R. Wright, W. Richard Stevens, "TCP/IP Illustrated
                Volume 2, Chapter 20", June 1995.

   [Netfilter]  http://netfilter.samba.org

   [Diffserv]   http://diffserv.sourceforge.net

7. Acknowledgements

   1) Andi Kleen, for man pages on netlink and rtnetlink.

   2) Alexey Kuznetsov is credited for extending Netlink to the IP
      service delivery model. The original Netlink character device
      was written by Alan Cox.

   3) Jeremy Ethridge, for taking the role of someone who did not
      understand Netlink and reviewing the document to make sure that
      it made sense.

8. Authors' Addresses

   Jamal Hadi Salim
   Znyx Networks
   Ottawa, Ontario
   Canada
   hadi@znyx.com

   Hormuzd M Khosravi
   Intel
   2111 N.E. 25th Avenue JF3-206
   Hillsboro OR 97124-5961
   USA
   1 503 264 0334
   hormuzd.m.khosravi@intel.com

   Andi Kleen
   SuSE
   Stahlgruberring 28
   81829 Muenchen
   Germany

   Alexey Kuznetsov
   INR/Swsoft
   Moscow
   Russia

9. Appendix 1: Sample Service Hierarchy

   In the diagram below we show a simple IP service, foo, and the
   interaction it has between CP and FE components for the service
   (labels 1-3).

   The diagram is also used to demonstrate CP<->FE addressing. In
   this section, we illustrate only the addressing semantics. In
   Appendix 2, the diagram is referenced again to define the protocol
   interaction between service foo's CPC and FEC (labels 4-10).

                               CP
   [--------------------------------------------------------.
   |           .-----.                                      |
   |           |               . -------.                   |
   |           | CLI |       /                              |
   |           |     |      |  CP protocol                  |
   |           /->> -.      |   component   | <-.           |
   |      __ _/     |       |      For      |   |           |
   |                |       |  IP service   |   ^           |
   |                Y       |      foo      |   |           |
   |                |          ___________/     ^           |
   |                Y 1,4,6,8,9   /  ^ 2,5,10   | 3,7       |
    --------------- Y------------/---|----------|-----------
                    |           ^    |          ^
                  **|***********|****|**********|**********
                  ************* Netlink layer *************
                  **|***********|****|**********|**********
          FE        |           |    ^          ^
          .-------- Y-----------Y----|--------- |----.
          |                     |            /       |
          |                     Y           /        |
          |           . --------^-------.  /         |
          |          |FE component/module|/          |
          |          |  for IP Service   |           |
   --->---|------>---|        foo        |----->-----|------>--
          |           -------------------            |
          |                                          |
          |                                          |
           ------------------------------------------

   The control plane protocol for IP service foo does the following to
   connect to its FE counterpart. The steps below are also numbered
   above in the diagram.

   1) Connect to the IP service foo through a socket connect. A
      typical connection would be via a call to: socket(AF_NETLINK,
      SOCK_RAW, NETLINK_FOO).

   2) Bind to listen to specific asynchronous events for service foo.

   3) Bind to listen to specific asynchronous FE events.

10. Appendix 2: Sample Protocol for the Foo IP Service

   Our example IP service foo is used again to demonstrate how one can
   deploy a simple IP service control using Netlink.

   These steps are continued from Appendix 1 (hence the numbering).

   4) Query for current config of FE component.

   5) Receive response to (4) via channel on (3).

   6) Query for current state of IP service foo.

   7) Receive response to (6) via channel on (2).

   9) Register the protocol-specific packets you would like the FE to
      forward to you.

   10) Send service-specific foo commands and receive responses for
       them, if needed.

10.1. Interacting with Other IP services

   The diagram in Appendix 1 shows another control component
   configuring the same service.
   In this case, it is a proprietary Command Line Interface. The CLI
   may or may not be using the Netlink protocol to communicate to the
   foo component. If the CLI issues commands that will affect the
   policy of the FEC for service foo, then the foo CPC is notified. It
   could then make algorithmic decisions based on this input. For
   example, if an FE allowed another service to delete policies
   installed by a different service and a policy that foo installed
   was deleted by service bar, there might be a need to propagate this
   to all the peers of service foo.

11. Appendix 3: Examples

   In this example, we show a simple configuration Netlink message
   sent from a TC CPC to an egress TC FIFO queue. This queue algorithm
   is based on packet counting and drops packets when the queue length
   exceeds the limit of 100 packets. We assume that the queue is in a
   hierarchical setup with a parent 100:0 and a classid of 100:1 and
   that it is to be installed on a device with an ifindex of 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Length (52)                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type (RTM_NEWQDISC)      |      Flags (NLM_F_EXCL |      |
   |                               | NLM_F_CREATE | NLM_F_REQUEST) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              Sequence Number (arbitrary number)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Process ID (0)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Family(AF_INET)|   Reserved1   |           Reserved2           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Interface Index (4)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Qdisc handle (0x1000001)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Parent Qdisc (0x1000000)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TCM Info (0)                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Type (TCA_KIND)        |           Length(4)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Value ("pfifo")                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type (TCA_OPTIONS)       |           Length(4)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Value (limit=100)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
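
   The message above can be sketched as a byte string with Python's
   struct module. This is an illustration under stated assumptions,
   not a definitive encoding. The numeric constants are assumed values
   from the Linux headers and do not appear in this document. Each
   attribute header is packed as a 16-bit length followed by a 16-bit
   type, with the length covering the 4-byte header, and attributes
   are padded to 4-byte boundaries (both assumptions). The helper
   names are hypothetical. With these assumptions, the NUL-terminated
   string "pfifo" takes 6 bytes and is padded to 8, so the sketch
   yields a 56-byte message, which differs from the Length values
   shown in the diagram.

```python
import struct

# Assumed numeric values from the Linux headers; not given in this document.
AF_INET = 2
RTM_NEWQDISC = 36
NLM_F_REQUEST = 0x001
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
TCA_KIND = 1
TCA_OPTIONS = 2


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Pack one attribute: 16-bit length, 16-bit type, padded payload."""
    length = 4 + len(payload)  # length covers the header, not the padding
    padding = b"\0" * (align4(length) - length)
    return struct.pack("=HH", length, attr_type) + payload + padding


def build_pfifo_request(ifindex, handle, parent, limit, seq):
    """Build a request that installs a pfifo qdisc (hypothetical helper)."""
    # Service template: Family, Reserved1, Reserved2, Interface Index,
    # Qdisc handle, Parent Qdisc, TCM Info.
    template = struct.pack("=BBHiIII", AF_INET, 0, 0,
                           ifindex, handle, parent, 0)
    attrs = (pack_attr(TCA_KIND, b"pfifo\0") +
             pack_attr(TCA_OPTIONS, struct.pack("=I", limit)))
    flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL
    total = 16 + len(template) + len(attrs)  # 16-byte Netlink header
    header = struct.pack("=IHHII", total, RTM_NEWQDISC, flags, seq, 0)
    return header + template + attrs


msg = build_pfifo_request(ifindex=4, handle=0x01000001,
                          parent=0x01000000, limit=100, seq=1)
print(len(msg), msg.hex())
```

   The resulting bytes would be written to a socket opened as
   described in Appendix 1. Opening the socket is left out of the
   sketch so that it runs without special privileges.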