Internet Engineering Task Force                               R. Montero
Internet-Draft                                    University of A Coruna
Intended status: Informational                           August 13, 2020
Expires: February 14, 2021


Protocol for Evaluating Reinforcement Learning Environments in Real Time
                          draft-perlert-wg-00

Abstract

   This document defines a simple UDP protocol for communication
   between a server simulating a reinforcement learning environment and
   a client observing it and responding with actions.

   Reinforcement learning problems are usually defined within the scope
   of a Markov Decision Process (MDP) where an agent sends an action
   belonging to an action space to an environment.
   The environment acts as a black box returning an observation and a
   reward for the agent, whose goal is to maximize the total reward
   obtained.

   Although the problem statement is easy to understand, there are no
   conventions on how to connect a reinforcement learning simulation
   with a client agent, either in a local network or over the Internet.
   Addressing this gap can be especially useful for multiagent support
   and analysis.

   The PERLERT protocol defined in this document assumes that server
   and client have shared certain information beforehand through
   another channel, such as a web page served over HTTP.  For example,
   the client must know a port number and an instance number before
   participating in a simulation run on a server.

   Also, although full feedback from the environment is often desired,
   PERLERT focuses on real-time interaction in which human agents can
   interact with AI agents, even if that means that information can be
   lost due to network packet loss.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Communication Phases
   3.  Messages Specification
     3.1.  Terms
     3.2.  Client Message Types
     3.3.  Server Message Types
   4.  UDP/IP Ports
   5.  Example Case
   6.  Additional Considerations
   7.  IANA Considerations
   8.  Security Considerations
   9.  Normative References
   Author's Address

1.  Introduction

   This document specifies PERLERT (Protocol for Evaluating
   Reinforcement Learning Environments in Real Time).

   It is intended to be used in the analysis of reinforcement learning
   problems.  In such problems an agent sends an action to an
   environment.
   The environment acts as a black box returning an observation and a
   reward for the agent, whose goal is to maximize the total reward
   obtained.

   The main purpose of PERLERT is to make it easier to test and
   integrate differently implemented agents and to run simulation
   servers separately from those agents.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Communication Phases

   There are two separate phases in which client and server shall
   exchange PERLERT messages.

   lobby
      This phase lets agent clients register themselves in the
      available slots announced by the server.  It is especially useful
      for environments with multiagent support.

   rollout
      This is the main phase.  The term "rollout" here acts as a
      synonym of "simulation".  In this phase the loop:

      (action) -> (observation, reward)

      ...takes place until clients are notified by the server that the
      simulation has finished.

3.  Messages Specification

   Messages defined in the following sections MUST be implemented as
   UDP/IP datagrams [RFC768].

   Also, all messages SHOULD use the same text encoding.  It is
   RECOMMENDED that both server and client encode messages using UTF-8
   [RFC3629].

3.1.  Terms

   In order of appearance:

   SERVER_INSTANCE_NAME  Tag used to distinguish different environments
      hosted by the same server, e.g.: "cartpole".

   SERVER_INSTANCE_NUMBER  Non-negative integer used to distinguish
      different instances of the same environment hosted by the same
      server, e.g.: "0".

   HEADER  Shorthand for SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER,
      e.g.: "cartpole:0".

   SERVER_LOBBY_PORT  UDP/IP port on which the server listens for
      incoming messages related to the lobby phase.  Clients need to
      know the SERVER_LOBBY_PORT beforehand.

   SERVER_ROLLOUT_PORT  UDP/IP port on which the server listens for
      incoming messages related to the rollout phase.  The server
      notifies it to the clients right before the simulation starts.

   CLIENT_PORT  UDP/IP port of agent clients.  Servers SHOULD NOT send
      datagrams to clients that have not been registered first,
      following the process explained in the next section.

   AGENT_KEY  Key used to identify one available agent slot, e.g.:
      "agent0".

   AGENT_TAG  Tag used to identify one agent filling one available slot.
      Specific clients can use a custom tag to identify themselves
      within the scope of the lobby phase, e.g.: "john_doe_q_learning".

   BOOL_VALUE  Either "true" or "false", written without the quotation
      marks.

   ACTION  Action chosen by an agent.  It MUST NOT contain the colon
      character (:), semicolon (;), or equal sign (=).  There are no
      other restrictions on how this field is formed as long as it is
      well understood by both client and server, e.g.: "move_left" or
      "5,6.78".

   SLOT_STATUS  Either "open" or "close", written without the quotation
      marks.

   AGENT_KIND  Freeform field used to differentiate aspects of agents
      relevant during the lobby phase, e.g.: "citizen" or "zombie".  It
      MUST NOT contain the colon character (:), semicolon (;), comma
      (,), or equal sign (=).  There are no other restrictions on how
      this field is formed as long as it is well understood by both
      client and server.

   READY_STATUS  Either "ready" or "not_ready", written without the
      quotation marks.

   AGENT_SLOT  Shorthand for
      AGENT_KEY=SLOT_STATUS,AGENT_KIND,AGENT_TAG,READY_STATUS;

   [AGENT_SLOT]  One or more (1..n) occurrences of AGENT_SLOT.

   MESSAGE  Informative message sent by server instances during the
      lobby phase.

   TIMESTAMP  Number of milliseconds since the UNIX Epoch (Jan 1, 1970)
      according to server time.

   STEP_NUMBER  Non-negative integer indicating the step number for a
      running simulation.

   OBSERVATION  Observation for an agent received upon a simulation step
      run on the server.  It MUST NOT contain the semicolon character
      (;) or the equal sign (=).  There are no other restrictions on how
      this field is formed as long as it is well understood by both
      client and server, e.g.: "x:0.54,y:0.95".

   REWARD  Reward for an agent received upon a simulation step run on
      the server, usually modeled as a single floating point value.  It
      MUST NOT contain the semicolon character (;) or the equal sign
      (=).

   EXTRA  Additional information for an agent received upon a simulation
      step run on the server.  It MUST NOT contain the semicolon
      character (;) or the equal sign (=).  There are no other
      restrictions on how this field is formed as long as it is well
      understood by both client and server, e.g.:
      "did_jump:true,jump_length:6.84".

3.2.  Client Message Types

   This section specifies the content format for the message types that
   shall be implemented by PERLERT clients.

   lobby information request
      Message sent by clients to request lobby information associated
      with a given server instance.

      HEADER;lobby

   lobby registration request
      Message sent by clients to request participation in a simulation
      server instance.

      HEADER;register=AGENT_KEY,AGENT_TAG

      Clients are allowed to issue multiple lobby registration
      requests, but only the last one correctly received by the server
      will take effect.

   lobby ready request
      Message sent by clients to inform the server whether they are
      ready to participate in the simulation or not.

      HEADER;ready=AGENT_KEY,BOOL_VALUE

   rollout action
      Message sent by clients to inform the server about the desired
      action to be run in the simulation.
      Clients do not need to send a "rollout action" message for each
      simulation timestep.  Instead, the server will use the last
      action received from each client and feed it into the environment
      until it receives a new action.  Server instances can choose
      which action to feed to the environment simulation until agent
      clients provide a valid action.

      HEADER;action=ACTION

3.3.  Server Message Types

   This section specifies the content format for the message types that
   shall be implemented by PERLERT servers.

   lobby information
      Message sent by servers to inform clients about lobby agent
      slots.  This datagram MUST be sent to a client upon receiving a
      "lobby information request", and to all clients whenever the
      lobby is altered due to a "lobby registration request" or a
      "lobby ready request".

      HEADER;[AGENT_SLOT]

      The message format MAY omit the trailing semicolon character (;).

   lobby registration response
      Message sent by servers upon a successful registration request.

      HEADER;registered=AGENT_KEY

      Servers MUST NOT allow a single client to be registered in
      multiple slots.  Before a client is registered in an agent slot,
      it must be removed from any slot in which it was previously
      registered.

      Servers MUST register clients with a default "not_ready" status.

   lobby message
      Message sent by servers to registered clients containing relevant
      general information.

      HEADER;message=MESSAGE

   lobby start
      Message sent by servers to all registered clients announcing the
      UDP/IP port for the rollout once the simulation is about to
      start.  The server can choose to start the simulation at any time
      but it MUST NOT do so if any client is in a "not_ready" status.

      HEADER;start=port:SERVER_ROLLOUT_PORT

   rollout step
      Message sent by servers to all registered clients containing the
      information provided by the environment for a single step.
      Note that "rollout step" messages should be sent as a regular
      stream carrying enough data per time unit for clients to render
      the environment properly, without exceeding a reasonable number
      of UDP packets.  It is RECOMMENDED to limit the rate to a maximum
      of 30 "rollout step" packets per second.

      HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE

      Servers MAY send additional information by appending an extra
      field like this:

      HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE;extra=EXTRA

      Because many messages of this type will be sent over the network,
      it is recommended that they be as condensed as possible.  For
      example, it is RECOMMENDED that floating point values belonging
      to either the OBSERVATION or the REWARD be rounded to the minimum
      number of decimals needed.

4.  UDP/IP Ports

   All messages sent by one client MUST use the same UDP/IP source
   CLIENT_PORT during the whole information exchange process, from the
   moment the agent sends a "lobby registration request" to the server
   until it receives a "rollout step" response with the "done" flag set
   to "true".

   "lobby information", "lobby registration response", "lobby message",
   and "lobby start" datagrams MUST use the same UDP/IP source
   SERVER_LOBBY_PORT for a given server instance.

   "rollout step" datagrams MUST use the same UDP/IP source
   SERVER_ROLLOUT_PORT for a given server instance.

5.  Example Case

   This section provides a brief example of datagrams exchanged by one
   client and one server during a PERLERT session.
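
   As a non-normative illustration of the formats in Section 3, the
   following Python sketch builds the four client messages and parses a
   "rollout step" datagram such as those in the trace of this section.
   The function names are hypothetical and not part of PERLERT.  The
   sketch also assumes that SERVER_INSTANCE_NAME contains no colon and
   that REWARD is a single floating point value, neither of which this
   document mandates.

```python
# Non-normative sketch; all names below are hypothetical helpers.
FORBIDDEN_IN_ACTION = ":;="  # characters ACTION MUST NOT contain (Section 3.1)


def make_header(instance_name: str, instance_number: int) -> str:
    """HEADER is SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER."""
    return f"{instance_name}:{instance_number}"


def lobby_information_request(header: str) -> str:
    return f"{header};lobby"


def lobby_registration_request(header: str, agent_key: str, agent_tag: str) -> str:
    return f"{header};register={agent_key},{agent_tag}"


def lobby_ready_request(header: str, agent_key: str, ready: bool) -> str:
    # BOOL_VALUE is the literal text "true" or "false".
    return f"{header};ready={agent_key},{'true' if ready else 'false'}"


def rollout_action(header: str, action: str) -> str:
    if any(ch in FORBIDDEN_IN_ACTION for ch in action):
        raise ValueError("ACTION must not contain ':', ';' or '='")
    return f"{header};action={action}"


def parse_rollout_step(datagram: str) -> dict:
    """Parse HEADER:TIMESTAMP:STEP_NUMBER;obs=..;reward=..;done=..[;extra=..]."""
    # Split on ';' first: OBSERVATION and EXTRA may contain ':' but never ';'.
    prefix, *fields = datagram.rstrip(";").split(";")
    # Assumption: the instance name has no ':', so the prefix has four parts.
    name, number, timestamp, step = prefix.split(":")
    # Field values never contain '=', so one split per field is enough.
    values = dict(field.split("=", 1) for field in fields)
    return {
        "header": f"{name}:{number}",
        "timestamp_ms": int(timestamp),
        "step": int(step),
        "obs": values["obs"],
        "reward": float(values["reward"]),  # assumption: REWARD is one float
        "done": values["done"] == "true",
        "extra": values.get("extra"),  # None when the optional field is absent
    }


if __name__ == "__main__":
    header = make_header("city", 7)
    print(lobby_registration_request(header, "agent0", "patrick"))
    step = parse_rollout_step("city:7:1590853134833:0;obs=51;reward=1;done=true")
    print(step["done"], step["reward"])
```

   To put a message on the wire, a client could encode the resulting
   string as UTF-8 and pass it to the sendto() method of a UDP socket
   that stays bound to the same CLIENT_PORT for the whole session, as
   Section 4 requires.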

   CLIENT                                          SERVER

   ==================== LOBBY PHASE ======================
   UDP port: 55555                        UDP port: 32322

   city:7;lobby -------------------------------------->

   <-------------- city:7;agent0=open,citizen,cpu,ready

   city:7;register=agent0,patrick -------------------->

   <-------------------------- city:7;registered=agent0

   city:7;ready=agent0,true -------------------------->

   <--------- city:7;agent0=close,citizen,patrick,ready
   <----------- city:7;message=Simulation will start...

   <--------------------------- city:7;start=port:32323

   ==================== ROLLOUT PHASE =====================
   UDP port: 55555                        UDP port: 32323

   city:7;action=walk -------------------------------->

   <-- city:7:1590853116323:0;obs=45;reward=0;done=false
   <-- city:7:1590853121058:0;obs=47;reward=0;done=false
   <-- city:7:1590853126423:0;obs=48;reward=1;done=false
   <-- city:7:1590853130429:0;obs=49;reward=0;done=false
   <--- city:7:1590853134833:0;obs=51;reward=1;done=true

                               Figure 1

6.  Additional Considerations

   Because packet loss might prevent some PERLERT information from
   arriving at the other end, the following considerations are to be
   taken into account:

   After sending the "lobby start" message, the server instance SHOULD
   keep the SERVER_LOBBY_PORT open for five (5) seconds and resend the
   "lobby start" message to any client communicating with that port
   after the simulation has started.

   After the simulation has finished for a given client, that is, once
   a "rollout step" message carries the "done" flag set to "true", the
   server instance SHOULD keep the SERVER_ROLLOUT_PORT open for ten
   (10) seconds, listening for datagrams from that client.  The server
   instance SHOULD resend the appropriate "rollout step" datagram upon
   receiving a client message within that period.

7.  IANA Considerations

   This memo includes no request to IANA.

8.  Security Considerations

   Both client and server implementations SHOULD use a fixed buffer size
   as small as possible for receiving the UDP/IP packets.

   Both client and server MAY cipher the content of the messages.
   Although the use of asymmetric public/private key pairs is
   recommended, symmetric ciphering with a pre-shared key is also
   encouraged.

   PERLERT is especially vulnerable to IP spoofing attacks, because
   actions received during the rollout phase are identified only by the
   IP address of the sender.  Using a VPN to tunnel the information
   exchange is RECOMMENDED.

9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC768]   Postel, J., "User Datagram Protocol", August 1980.

Author's Address

   Ruben Montero
   University of A Coruna
   Rua San Roque 9
   A Coruna, Galicia  15002
   ES

   Phone: +34 692 983 851
   Email: ruben.montero@udc.es