ForCES Working Group Jamal Hadi Salim Internet Draft Znyx Networks Hormuzd Khosravi Intel Andi Kleen Suse Alexey Kuznetsov INR/Swsoft November 2001 Netlink as an IP services protocol draft-ietf-forces-netlink-01.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.'' The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC-2119]. 1. Abstract This document describes Linux Netlink, which is used in Linux both as an inter-kernel messaging system as well as between kernel and draft-forces-netlink-01.txt ^L[Page 1] jhs_hk_ak_ank draft-forces-netlink-01.txt user-space. The purpose of this document is intended as informa- tional in the context of prior art for the ForCES IETF working group. The focus of this document is to describe netlink from a context of a protocol between a Forwording Engine Component (FEC) and a Control Plane Component(CPC) that define an IP service. The document ignores the ability of netlink as a inter-kernel mes- saging system, as a an inter-process communication scheme (IPC) or its use in configuring other non-network as well as network but non-IP services (such as decnet etc). 2. Introduction The concept of IP Service control-forwarding separation was first introduced in the early 1980s by the BSD 4.4 routing sock- ets[stevens]. The focus at that time was a simple IP(v4) forward- ing service and how the CPC, either via a command line configura- tion tool or a dynamic route daemon, can control forwarding tables for that IPV4 forwarding service. The IP world has evolved considerably since those days. Linux netlink, when observed from a service provisioning point of view takes routing sockets one step further by breaking the barrier of focus around IPV4 forwarding. Since the linux 2.1 kernel, netlink has been providing the IP service abstraction to a few services other than the classical IPv4 forwarding. We first give some concept definitions and then describe how netlink fits in. 2.1. Some definitions A Control plane(CP) is an execution environment that may have sev- eral components which we refer to as CPCs. Each CPC provides con- trol for a different IP service being executed by a FE component. This means that there might be several CPCs on a physical CP if it is controlling several IP services. In essence, the cohesion between a CP component and a FE component is the service abstrac- tion. In the diagram below we show a simple FE<->CP setup to provide an example of the classical IPv4 service with an extension to do some basic QoS egress scheduling and how it fits in this described draft-forces-netlink-01.txt ^L[Page 2] jhs_hk_ak_ank draft-forces-netlink-01.txt model. Control Plane (CP) .------------------------------------ | /^^^^^ /^^^^^ | | | | | COPS |- | | | ospfd | | PEP | | | / _____/ | | /------_____/ | | | | | | | | | | |_____________________|___|_________| | | | | ****************************************** Forwarding ************* Netlink layer ************ Engine (FE) ***************************************** .-------------|-----------|------------|---|----------- | IPv4 forwading | / | | FE Service / / | | Component / / | | ---------------/---------------/--------- | | | | / | | packet | | --------|-- ----|----- | packet in | | | IPV4 | | Egress | | out -->--->|------>|---->|Forwading |----->| QoS |--->| ---->|----> | | | | | Scheduler| | | | | ----------- ---------- | | | | | | | --------------------------------------- | | | ------------------------------------------------------- 2.1.1. Control Plane Components (CPCs) Control plane components would encompass signalling protocols with diversity ranging from dynamic routing protocols such as OSPF [RFC2328] to tag distribution protocols such as CR-LDP [RFC3036]. Classical Management protocols and activities also fall under this category. These include SNMP [RFC1157], COPS [RFC2748] or propri- etary CLI/GUI configuration mechanisms. draft-forces-netlink-01.txt ^L[Page 3] jhs_hk_ak_ank draft-forces-netlink-01.txt The purpose of the control plane is to provide an execution envi- ronment for the above mentioned activities with the ultimate goal being to configure and manage the second NE component: the FE. The result of the configuration would define the way packets travesing the FE are treated. In the above diagram, ospfd and COPS are distinct CPCs. 2.1.2. Forwarding Engine Components The FE is the entity of the NE that incoming packets (from the net- work into the NE) first encounter. The FE's service specific component massages the packet to provide it with a treatment to achieve a IP service as defined by the con- trol plane components for that IP service. Different services will utilize different FEC. Service modules maybe chained to achieve a more complex service (as shown in the diagram). When built for providing a specific service, the FE service component will adhere to a Forwading Model. In the above diagram, the IPV4 FE component includes both the IPV4 Forwarding service module as well as the Egress Scheduling service module. Another service might may add a policy forwarder between the IPV4 forwader and the QoS egress Scheduler. A simpler classi- cal service would have constituted only the IPV4 forwarder. 2.1.3. IP Services An IP Service is the treatment of an IP packet within the NE. This treatment is provided by a combination of both the CPC and FEC The time span of the service is from the moment when the packet arrives at the NE to the moment it departs. In essence an IP ser- vice in this context is a Per-Hop Behavior. A service control/sig- naling protocol/management-application (CP components running on NEs defining the end to end path) unifies the end to end view of the IP service. As noted above, these CP components then define the behavior of the FE (and therefore the NE) to a described packet. A simple example of an IP service is the classical IPv4 Forwarding. In this case, control components such as routing protocols(OSPF, draft-forces-netlink-01.txt ^L[Page 4] jhs_hk_ak_ank draft-forces-netlink-01.txt RIP etc) and proprietary CLI/GUI configurations modify the FE's forwarding tables in order to offer the simple service of forward- ing packets to the next hop. Traditionally, NEs offering this sim- ple service are known as routers. Over the years it has become important to add aditional services to the routers to meet emerging requirements. More complex services extending classical forwarding were added and standardized. These newer services might go beyond the layer 3 contents of the packet header. However, the name "router", although a misnomer, is still used to describe these NEs. Services (which may look beyond the classical L3 headers) here include firewalling, Qos in Diffserv and RSVP, NATs, policy based routing etc. Newer control protocols or management activities are introduced with these new services. One extreme definition of a IP service is something a service provider would be able to charge for. 3. Netlink Architecture IP services components control is defined by using templates. The FEC and CPC participate to deliver the IP service by communi- cating using these templates. The FEC might continously get updates from the control plane component on how to operate the ser- vice (example for V4 forwarding route additions or deletions). The interaction between the FEC and the CPC, in the netlink con- text, would define a protocol. Netlink provides the mechanism for the CPC(residing in user space) and FEC(residing in kernel space) to define their own protocol definition. Kernel space and user space just mean different protection domains direct where direct memory access is not allowed inbetween. Therefore a wire protocol is needed to communicate. The wire protocol would be normally be provided by some privileged service that is able to copy between multiple protection domains. We will call this service netlink service. Netlink service could also be mapped to a different transport layer if the CPC should be running on a different node than the CPC. The FEC and CPC, using netlink mechanisms, may choose to define a reliable protocol between each other, for exam- ple. By default netlink provides an unreliable communication. Note that the FEC and CPC can both live in the same memory protec- tion domain and use the connect() system call to create a path to the peer and talk to each other. We will not discuss this further draft-forces-netlink-01.txt ^L[Page 5] jhs_hk_ak_ank draft-forces-netlink-01.txt other than to say it is available as a mechanism. Through out this document we will refer interchangbly to the FEC to mean kernel- space and the CPC to mean user-space. Note: Netlink allows participation in IP services by both service components. 3.1. Netlink Logical model In the diagram below we show a simple FEC<->CPC logical relation- ship. We use the example of IPV4 forwarding FEC (NETLINK_ROUTE, which is discussed further below) as an example. Control Plane (CP) .------------------------------------ | /^^^^^ /CPC-2 | | | CPC-1 | | COPS | | | | ospfd | | PEP | | | / _____/ | | _____/ | | | | | | ****************************************| ************* BROADCAST WIRE ************ FE---------- *****************************************. | IPv4 forwading | | / | | FEC | | | | | --------------/-----|-----------|-------- | | | / | | | | | | .-------. .-------. .------. | | | | |ingress| | IPV4 | |Egress| | | | | |police | |Forward| | QoS | | | | | |_______| |_______| |Sched | | | | | ------ | | | --------------------------------------- | | | ----------------------------------------------------- Netlink logically models FECs and CPCs in the form of nodes inter- connected to each other via a broadcast wire. The wire is specific to a service. The example above shows the broadcast wire belonging to the extended IPV4 forwarding service. draft-forces-netlink-01.txt ^L[Page 6] jhs_hk_ak_ank draft-forces-netlink-01.txt Nodes connect to the wire and register to receive specific mes- sages. CPCs may connect to multiple wires if it helps them to con- trol the service better. All nodes(CPCs and FECs) dump packets on the broadcast wire. Packets could be discarded by the wire if mal- formed or not specifically formated for the wire. Dropped packets are not seen by any of the nodes. The netlink service MAY signal an error to the original if it detects an malformatted netlink packet. Packets sent on the wire could be broadcast, multicast or unicast. FECs or CPCs pick specific messages of interest for processing or just monitoring purposes. 3.2. The message format There are three levels to a netlink message: The general netlink message header, the IP service specific template, the IP service specific data. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | Netlink message header | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | IP Service Template | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | IP Service specific data in TLVs | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3.3. Protocol Model This section expands on how netlink provides the mechanism for ser- vice oriented FEC and CPC interaction. draft-forces-netlink-01.txt ^L[Page 7] jhs_hk_ak_ank draft-forces-netlink-01.txt 3.3.1. Service Addressing Access is provided by first connecting to the service on the FE. This is done by making a socket() system call to the PF_NETLINK domain. Each FEC is identified by a protocol number. One may open either SOCK_RAW or SOCK_DGRAM type sockets although netlink doesnt distinguish the two. The socket connection provides the basis for the FE<->CP addressing. Connecting to a service is followed (at any point during the life of the connection) by issuing either a service specific command mostly for configuration purposes (from the CPC to the FEC) or sub- scribing/unsubscribing to service(s') events. 3.3.1.1. Sample Service Hierachy In the diagram below we show a simple IP service, foo, and the interaction it has between CP and FE components for the ser- vice(labels 1-3). We introduce the diagram below to demonstrate CP<->FE addressing. In this section we illustrate only the addressing semantics. In section 4, the diagram is referenced again to define the protocol interaction between srevice foo's CPC and FEC (labels 4-10). draft-forces-netlink-01.txt ^L[Page 8] jhs_hk_ak_ank draft-forces-netlink-01.txt CP [--------------------------------------------------------. | .-----. | | | . -------. | | | CLI | / | | | | | CP protocol | | /->> -. | component | <-. | | __ _/ | | For | | | | | | IP service | ^ | | Y | foo | | | | | ___________/ ^ | | Y 1,4,6,8,9 / ^ 2,5,10 | 3,7 | --------------- Y------------/---|----------|----------- | ^ | ^ **|***********|****|**********|********** ************* Netlink layer ************ **|***********|****|**********|********** FE | | ^ ^ .-------- Y-----------Y----|--------- |----. | | / | | Y / | | . --------^-------. / | | |FE component/module|/ | | | for IP Service | | --->---|------>---| foo |----->-----|------>-- | ------------------- | | | | | ------------------------------------------ The control plane protocol for IP service foo does the following to connect to its FE counterpart. The steps below are also numbered above in the diagram. 1) Connect to IP service foo through a socket connect. A typical con- nection would be via a call to: socket(AF_NETLINK, SOCK_RAW, NETLINK_FOO) 2) Bind to listen to specific async events for service foo 3) Bind to listen to specific async FE events draft-forces-netlink-01.txt ^L[Page 9] jhs_hk_ak_ank draft-forces-netlink-01.txt 3.3.2. Netlink message header Netlink messages consist of a byte stream with one or multiple Netlink headers and associated payload. If the payload is too big to fit into a single message it can be split over multiple netlink messages. This is called a multipart message. For multipart mes- sages the first and all following headers have the NLM_F_MULTI netlink header flag set, except for the last header which has the netlink header type NLMSG_DONE. The netlink message header is shown below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type | Flags | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Sequence Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Process PID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The fields in the header are: draft-forces-netlink-01.txt ^L[Page 10] jhs_hk_ak_ank draft-forces-netlink-01.txt Length: 32 bits The length of the message in bytes including the header. Type: 16 bits This field describes the message content. It can be one of the standard message types: NLMSG_NOOP message is ignored NLMSG_ERROR the message signals an error and the payload contains a nlmsgerr structure. This can be looked at as a NACK and typically it is from FEC to CPC. NLMSG_DONE message terminates a multipart message Individual IP Services specify more message types, for e.g., NETLINK_ROUTE Service specifies several types such as RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK, RTM_NEWADDR, RTM_DELADDR, RTM_NEWROUTE, RTM_DELROUTE, etc. Flags: 16 bits The standard flag bits used in netlink are NLM_F_REQUEST Must be set on all request messages (typically from user space to kernel space) NLM_F_MULTI Indicates the message is part of a multipart message terminated by NLMSG_DONE NLM_F_ACK Request for an acknowledgment on success. Typical direction of request is from user space to kernel space. NLM_F_ECHO Echo this request. Typical direction of request is from user space to kernel space. Additional flag bits for GET requests on config information in the FEC. NLM_F_ROOT Return the complete table instead of a single entry. NLM_F_MATCH Return all matching criteria passed in message content NLM_F_ATOMIC Return an atomic snapshot of the table being referenced. This may require special privileges because it has the potential to interrupt service in the FE for a longer time. Convenience macros for flag bits: NLM_F_DUMP This is NLM_F_ROOT or'ed with NLM_F_MATCH Additional flag bits for NEW requests NLM_F_REPLACE Replace existing matching config object with this request. NLM_F_EXCL Don't replace the config object if it already draft-forces-netlink-01.txt ^L[Page 11] jhs_hk_ak_ank draft-forces-netlink-01.txt exists. NLM_F_CREATE Create config object if it doesn't already exist. NLM_F_APPEND Add to the end of the object list. For those familiar with BSDish use of such operations in route sockets, the equivalent translations are: - BSD ADD operation equates to NLM_F_CREATE or-ed with NLM_F_EXCL - BSD CHANGE operation equates to NLM_F_REPLACE - BSD Check operation equates to NLM_F_EXCL - BSD APPEND equivalent is actually mapped to NLM_F_CREATE Sequence Number: 32 bits The sequence number of the message. Process PID: 32 bits The PID of the process sending the message. The PID is used by the kernel to multiplex to the correct sockets. A PID of zero is used when sending messages to user space from the kernel. netlink service fills in an appropiate value when zero. 3.3.2.1. Mechanisms for creating protocols One could create a reliable protocol between an FEC and a CPC by using the combination of sequence numbers, ACKs and retransmit timers. Both sequence numbers and sequence numbers are provided by netlink. Timers are provided by Linux. One could create a heartbeat protocol between the FEC and CPC by using the ECHO flags and the NLMSG_NOOP message. 3.3.2.2. The ACK netlink message This message is actually used to denote both an ACK and a NACK. Typically the direction is from kernel to user space (in response to an ACK request message that is sent). However, user space should be able to send ACKs back to kernel space when requested. This is IP service specific. draft-forces-netlink-01.txt ^L[Page 12] jhs_hk_ak_ank draft-forces-netlink-01.txt 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Netlink message header | | type = NLMSG_ERROR | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | error code | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OLD Netlink message header | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Error code: integer (typically 32 bits) Error code of zero indicates that the message is an ACK response. An ACK response message contains the original netlink message header that can be used to compare against (sent sequence numbers etc). A non-zero error message is equivalent to a Negative ACK (NACK). In such a situation, the netlink data that was sent down to the kernel is returned appended to the original netlink message header. An error code printable via the perror() is also set (not in the message header, rather in the executing environment state vari- able). 3.3.3. FE services' templates These are services that are offered by the system for general use by other services. They include ability to configure and listen to changes in resource management. IP address management, link events etc fit here. We separate them into this section here for logical purposes despite the fact that they are accessed via the NETLINK_ROUTE FEC. The reason that they exist within NETLINK_ROUTE is due to historical cruft based on the fact that BSD 4.4 rather narrowly focussed Route Sockets implemented them as part of the IPV4 forwarding sockets. 3.3.3.1. Network Interface Service Module draft-forces-netlink-01.txt ^L[Page 13] jhs_hk_ak_ank draft-forces-netlink-01.txt This service provides the ability to create, remove or get informa- tion about a specific network interface. The network interface could be either pohysical or virtual and is network protocol inde- pendent (example an x.25 interface can be defined via this mes- sage). The Interface service message template is shown below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Family | Padding | Device Type | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Interface Index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Device Flags | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Change Mask | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Family: This is always set to AF_UNSPEC Device Type: This defines the type of the link. The link could be ethernet, a tunnel etc. Although we are interested only in IPV4, the link type is protocol independent. Interface Index: uniquely identifies interface. Device Flags: IFF_UP Interface is running. IFF_BROADCAST Valid broadcast address set. IFF_DEBUG Internal debugging flag. IFF_LOOPBACK Interface is a loopback interface. IFF_POINTOPOINT Interface is a point-to-point link. IFF_RUNNING Resources allocated. IFF_NOARP No arp protocol IFF_PROMISC Interface is in promiscuous mode. IFF_NOTRAILERS Avoid use of trailers. IFF_ALLMULTI Receive all multicast packets. IFF_MASTER Master of a load balancing bundle. IFF_SLAVE Slave of a load balancing bundle. IFF_MULTICAST Supports multicast IFF_PORTSEL Is able to select media type via ifmap. IFF_AUTOMEDIA Auto media selection active. IFF_DYNAMIC Interface Address is not permanent. draft-forces-netlink-01.txt ^L[Page 14] jhs_hk_ak_ank draft-forces-netlink-01.txt Change Mask: Reserved for future use. Must be set to 0xFFFFFFFF. Applicable attributes: attribute description ....................................................... IFLA_UNSPEC - unspecified. IFLA_ADDRESS hardware address interface L2 address IFLA_BROADCAST hardware address L2 broadcast address. IFLA_IFNAME ascii string Device name. IFLA_MTU MTU of the device. IFLA_LINK Link type. IFLA_QDISC ascii string defining Queueing disci- pline. IFLA_STATS Interface Statistics. Netlink message types specific to this service: RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK 3.3.3.2. IP Address Service module This service provides the ability to add, remove or receive information about an IP address associated with an interface. The Address provi- sioning service message template is shown below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Family | Length | Flags | Scope | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Interface Index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Family: AF_INET for IPV4 or AF_INET6 for IPV6. Length: the length of the address mask Flags: IFA_F_SECONDARY for secondary address (old alias interface), IFA_F_PERMANENT for a permanent address set by the user as opposed to dynamic addresses. other flags include: IFA_F_DEPRECATED which defines deprecated (IPV6) address IFA_F_TENTATIVE which defines tentative (IPV6) address Scope: the address scope draft-forces-netlink-01.txt ^L[Page 15] jhs_hk_ak_ank draft-forces-netlink-01.txt Applicable attributes: attribute description ....................................................... IFA_UNSPEC - unspecified. IFA_ADDRESS raw protocol address of interface IFA_LOCAL raw protocol local address IFA_LABEL ascii string name of the interface reffered to. IFA_BROADCAST raw protocol broadcast address. IFA_ANYCAST raw protocol anycast address IFA_CACHEINFO cacheinfo address information. Define cacheinfo here -- JHS netlink messages specific to this service: RTM_NEWADDR, RTM_DELADDR, RTM_GETADDR 4. Sample Protocol for The foo IP service Our proverbial IP service "foo" is used again to demonstrate how one can deploy a simple IP service control using netlink. These steps are continued from the "Sample Service Hierachy" sec- tion. 4) query for current config of FE component 5) receive response to 4) via channel on 3) 6) query for current state of IP service foo 7) receive response to 6) via channel on 2) 9) register the protocol specific packets you would like the FE to forward to you 10) send specific service foo commands and receive responses for them if needed 4.1. Interacting with other IP services draft-forces-netlink-01.txt ^L[Page 16] jhs_hk_ak_ank draft-forces-netlink-01.txt The last diagram shows another control component configuring the same service. In this case, it is a proprietary Command Line Inter- face. The CLI (may or ) may not be using the netlink protocol to communicate to the foo component. If the CLI should issue commands that will affect the policy of the FEC for service "foo" then, then the "foo" CPC is notified. It could then make algorithmic decisions based on this input (example if a policy that foo installed was deleted, there might be need to propagate this to all the peers of service "foo"). 5. Currently Defined netlink IP services Although there are many other IP services defined which are using netlink, we will only mention those integrated into the kernel today (kernel version 2.4.6). These are: NETLINK_ROUTE,NETLINK_FIREWALL,NETLINK_ARPD,NETLINK_ROUTE6,NETLINK_IP6_FW NETLINK_TAPBASE,NETLINK_SKIP,NETLINK_USERSOCK. 5.1. IP Service NETLINK_ROUTE This service allows CPCs to modify the IPv4 routing table in the Forwarding Engine. It can also be used by CPCs to receive routing updates. 5.1.1. Network Route Service Module This service provides the ability to create, remove or receive informa- tion about a network route. The service message template is shown below. draft-forces-netlink-01.txt ^L[Page 17] jhs_hk_ak_ank draft-forces-netlink-01.txt 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Family | Src length | Dest length | TOS | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Table ID | Protocol | Scope | Type | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Flags | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Family: Address family of route. AF_INET for IPV4 and AF_INET6 for IPV6. Src length: prefix length of source Dest length: Prefix length of destination IP address TOS: the 8 bit tos (should be deprecated to make room for DSCP) Table ID: Table identifier. Upto 255 route tables are supported. RT_TABLE_UNSPEC an unspecified routing table RT_TABLE_DEFAULT the default table RT_TABLE_MAIN the main table RT_TABLE_LOCAL the local table The user may assign arbitary values between RT_TABLE_UNSPEC and RT_TABLE_DEFAULT. Protocol: identifies what/who added the route. Described further below. draft-forces-netlink-01.txt ^L[Page 18] jhs_hk_ak_ank draft-forces-netlink-01.txt protocol Route origin. .............................................. RTPROT_UNSPEC unknown RTPROT_REDIRECT by an ICMP redirect (currently unused) RTPROT_KERNEL by the kernel RTPROT_BOOT during boot RTPROT_STATIC by the administrator Values larger than RTPROT_STATIC are not inter- preted by the kernel, they are just for user infor- mation. They may be used to tag the source of a routing information or to distingush between multi- ple routing daemons. See for the routing daemon identifiers which are already assigned. Scope: Route scope (distance to destination). RT_SCOPE_UNIVERSE global route RT_SCOPE_SITE interior route in the local autonomous system RT_SCOPE_LINK route on this link RT_SCOPE_HOST route on the local host RT_SCOPE_NOWHERE destination doesn't exist The values between RT_SCOPE_UNIVERSE and RT_SCOPE_SITE are available to the user. Type: The type of route. draft-forces-netlink-01.txt ^L[Page 19] jhs_hk_ak_ank draft-forces-netlink-01.txt Route type description ------------------------------------------------- RTN_UNSPEC unknown route RTN_UNICAST a gateway or direct route RTN_LOCAL a local interface route RTN_BROADCAST a local broadcast route (sent as a broadcast) RTN_ANYCAST a local broadcast route (sent as a unicast) RTN_MULTICAST a multicast route RTN_BLACKHOLE a packet dropping route RTN_UNREACHABLE an unreachable destination RTN_PROHIBIT a packet rejection route RTN_THROW continue routing lookup in another table RTN_NAT a network address translation rule RTN_XRESOLVE refer to an external resolver (not implemented) Flags: further qualify the route. RTM_F_NOTIFY if the route changes, notify the user via rtnetlink RTM_F_CLONED route is cloned from another route RTM_F_EQUALIZE a multicast equalizer (not yet implemented) Attributes applicable to this service: draft-forces-netlink-01.txt ^L[Page 20] jhs_hk_ak_ank draft-forces-netlink-01.txt Attribute description ----------------------------------------------- RTA_UNSPEC ignored. RTA_DST protocol address for route destination address. RTA_SRC protocol address for route source address. RTA_IIF Input interface index. RTA_OIF Output interface index. RTA_GATEWAY protocol address for the gateway of the route RTA_PRIORITY Priority of route. RTA_PREFSRC RTA_METRICS Route metric RTA_MULTIPATH RTA_PROTOINFO RTA_FLOW RTA_CACHEINFO additional netlink message types applicable to this service: RTM_NEWROUTE, RTM_DELROUTE, RTM_GETROUTE 5.1.2. Neighbour Setup Service Module This service provides the ability to add, remove or receive infor- mation about a neighbour table entry (e.g. an ARP entry). The ser- vice message template is shown below. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Family | Padding | Padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Interface Index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | State | Flags | Type | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Family: Address Family Interface Index: The unique interface index State: is a bitmask of the following states: draft-forces-netlink-01.txt ^L[Page 21] jhs_hk_ak_ank draft-forces-netlink-01.txt NUD_INCOMPLETE a currently resolving cache entry NUD_REACHABLE a confirmed working cache entry NUD_STALE an expired cache entry NUD_DELAY an entry waiting for a timer NUD_PROBE a cache entry that is currently reprobed NUD_FAILED an invalid cache entry NUD_NOARP a device with no destination cache NUD_PERMANENT a static entry Flags: one of: NTF_PROXY a proxy arp entry NTF_ROUTER an IPv6 router Attributes applicable to this service: Attribute$ description ------------------------------------ NDA_UNSPEC unknown type NDA_DST a neighbour cache network layer destination address NDA_LLADDR a neighbour cache link layer address NDA_CACHEINFO cache statistics. Describe the NDA_CACHEINFO nda_cacheinfo header later --JHS additional netlink message types applicable to this service: RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH 5.1.3. Traffic Control Service This service provides the ability to add, remove or get a queueing dis- cipline. The service message template is shown below. draft-forces-netlink-01.txt ^L[Page 22] jhs_hk_ak_ank draft-forces-netlink-01.txt 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Family | Padding | Padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Interface Index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Qdisc handle | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Parent Qdisc | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | TCM Info | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5.2. IP Service NETLINK_FIREWALL This service allows CPCs to receive packets sent by the IPv4 fire- wall service in the FE. Two types of messages exist that can be sent from CPC to FEC. These are: Mode messages and Verdict messages. The formats are described below. The Verdict message format is as follows 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Value | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Packet ID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Data Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Payload ... | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ A ipq_packet_msg packet type is sent from the FEC to the CPC. The format is described below ==> We need to complete this later draft-forces-netlink-01.txt ^L[Page 23] jhs_hk_ak_ank draft-forces-netlink-01.txt 5.3. IP Service NETLINK_ARPD This service is used by CPCs for managing the ARP table in FE. 5.4. IP Service NETLINK_ROUTE6 This service allows CPCs to modify the IPv6 routing table in the FE. It can also be used by CPCs to receive routing updates. draft-forces-netlink-01.txt ^L[Page 24] jhs_hk_ak_ank draft-forces-netlink-01.txt 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 dst addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 dst addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 dst addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 dst addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 src addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 src addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 src addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 src addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 gw addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 gw addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 gw addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | IPv6 gw addr | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | dst length | src length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Metric | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Info | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Flags | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Interface Index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5.5. IP Service NETLINK_IP6_FW This service allows CPCs to receive packets that failed the IPv6 draft-forces-netlink-01.txt ^L[Page 25] jhs_hk_ak_ank draft-forces-netlink-01.txt firewall checks by that module in the FE. 5.6. IP Service NETLINK_TAPBASE This service allows CPCs to simulate an ethernet driver belonging to the FE. //are the instances of the ethertap device. Ethertap //is a pseudo network tunnel device that allows an //ethernet driver to be simulated from user space. 5.7. IP Service NETLINK_SKIP This service is reserved for ENskip (?). 5.8. IP Service NETLINK_USERSOCK This service is reserved for future Control Plane to FE protocols. 6. Security Considerations Netlink lives in a trusted environment of a single host separated by kernel and user space. Linux capabilities ensures that only someone with CAP_NET_ADMIN capability (typically root user) is allowed to open sockets. 7. References [RFC1633] R. Braden, D. Clark, and S. Shenker, "Integrated Services in the Internet Architecture: an Overview", RFC 1633, draft-forces-netlink-01.txt ^L[Page 26] jhs_hk_ak_ank draft-forces-netlink-01.txt ISI, MIT, and PARC, June 1994. [RFC1812] F. Baker, "Requirements for IP Version 4 Routers", RFC 1812, June 1995. [RFC2475] M. Carlson, W. Weiss, S. Blake, Z. Wang, D. Black, and E. Davies, "An Architecture for Differentiated Services", RFC 2475, December 1998. [RFC2748] J. Boyle, R. Cohen, D. Durham, S. Herzog, R. Rajan, A. Sastry, "The COPS (Common Open Policy Service) Pro- tocol", RFC 2748, January 2000. [RFC2328] J. Moy, "OSPF Version 2", RFC 2328, April 1998. [RFC1157] J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, May 1990. [RFC3036] L. Andersson, P. Doolan, N. Feldman, A. Fredette, B. Thomas "LDP Specification", RFC 3036, January 2001. [stevens] G.R Wright, W. Richard Stevens. "TCP/IP Illus- trated Volume 2, Chapter 20", June 1995 8. Acknowledgements 1) Andi Kleen for man pages on netlink and rtnetlink. 2) Alexey Kuznetsov is credited for extending netlink to the IP ser- vice delivery model. The original netlink character device was written by Alan Cox. draft-forces-netlink-01.txt ^L[Page 27] jhs_hk_ak_ank draft-forces-netlink-01.txt 9. Author's Address: Jamal Hadi Salim Znyx Networks Ottawa, Ontario Canada hadi@znyx.com Hormuzd M Khosravi Intel 2111 N.E. 25th Avenue JF3-206 Hillsboro OR 97124-5961 USA 1 503 264 0334 hormuzd.m.khosravi@intel.com Andi Kleen SuSE Stahlgruberring 28 81829 Muenchen Germany draft-forces-netlink-01.txt ^L[Page 28]