idnits 2.17.1 

draft-perlert-wg-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (August 13, 2020) is 1352 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Internet Engineering Task Force                               R. Montero
3	Internet-Draft                                    University of A Coruna
4	Intended status: Informational                           August 13, 2020
5	Expires: February 14, 2021

7	Protocol for Evaluating Reinforcement Learning Environments in Real Time
8	                          draft-perlert-wg-00

10	Abstract

12	   This document defines a simple UDP protocol for communication
13	   between a server simulating a reinforcement learning environment
14	   and a client that observes it and responds with actions.

16	   Reinforcement learning problems are usually defined within the scope
17	   of a Markov Decision Process (MDP) where an agent sends an action
18	   belonging to an action space to an environment.  The environment acts
19	   as a black box returning an observation and a reward for the agent,
20	   whose goal is to maximize the total obtained rewards.

22	   Although the problem statement is easy to understand, there are no
23	   conventions on how a reinforcement learning simulation communicates
24	   with a client agent, either in a local network or over the Internet.
25	   Such a convention is especially useful when it comes to multiagent
26	   support and analysis.

28	   The PERLERT protocol defined in this document assumes that server
29	   and client have shared certain information beforehand through some
30	   other channel, such as a web page served over HTTP.  For example,
31	   the client must know a port number and an instance number before
32	   participating in a simulation run on a server.

34	   Also, although it is often desired to know the full feedback from the
35	   environment, PERLERT focuses on real-time interaction where human
36	   agents can interact with AI agents even if that means that
37	   information can be lost due to network packet loss.

39	Status of This Memo

41	   This Internet-Draft is submitted in full conformance with the
42	   provisions of BCP 78 and BCP 79.

44	   Internet-Drafts are working documents of the Internet Engineering
45	   Task Force (IETF).  Note that other groups may also distribute
46	   working documents as Internet-Drafts.  The list of current Internet-
47	   Drafts is at https://datatracker.ietf.org/drafts/current/.

49	   Internet-Drafts are draft documents valid for a maximum of six months
50	   and may be updated, replaced, or obsoleted by other documents at any
51	   time.  It is inappropriate to use Internet-Drafts as reference
52	   material or to cite them other than as "work in progress."

54	   This Internet-Draft will expire on February 14, 2021.

56	Copyright Notice

58	   Copyright (c) 2020 IETF Trust and the persons identified as the
59	   document authors.  All rights reserved.

61	   This document is subject to BCP 78 and the IETF Trust's Legal
62	   Provisions Relating to IETF Documents
63	   (https://trustee.ietf.org/license-info) in effect on the date of
64	   publication of this document.  Please review these documents
65	   carefully, as they describe your rights and restrictions with respect
66	   to this document.  Code Components extracted from this document must
67	   include Simplified BSD License text as described in Section 4.e of
68	   the Trust Legal Provisions and are provided without warranty as
69	   described in the Simplified BSD License.

71	Table of Contents

73	   1.  Introduction
74	     1.1.  Requirements Language
75	   2.  Communication Phases
76	   3.  Messages Specification
77	     3.1.  Terms
78	     3.2.  Client Message Types
79	     3.3.  Server Message Types
80	   4.  UDP/IP Ports
81	   5.  Example Case
82	   6.  Additional Considerations
83	   7.  IANA Considerations
84	   8.  Security Considerations
85	   9.  Normative References
86	   Author's Address

88	1.  Introduction

90	   This document specifies PERLERT (Protocol for Evaluating
91	   Reinforcement Learning Environments in Real Time).

93	   It is intended to be used in the analysis of reinforcement learning
94	   problems.  In reinforcement learning problems an agent sends an
95	   action to an environment.  The environment acts as a black box
96	   returning an observation and a reward for the agent, whose goal is to
97	   maximize the total obtained rewards.

99	   The main purpose of PERLERT is to make it easier to test and
100	   integrate differently implemented agents and to run simulation
101	   servers separately from those agents.

103	1.1.  Requirements Language

105	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
106	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
107	   document are to be interpreted as described in RFC 2119 [RFC2119].

109	2.  Communication Phases

111	   There are two separate phases in which client and server shall
112	   exchange PERLERT messages.

114	   lobby
115	       This phase lets agent clients register themselves in the
116	       available slots announced by the server.  It is especially
117	       useful when it comes to environments with multiagent
118	       support.

120	   rollout
121	       This is the main phase.  The term "rollout" here acts as a
122	       synonym of "simulation".  In this phase the loop:

124	       (action) -> (observation, reward)

126	       ...takes place until clients are notified by the server that the
127	       simulation has finished.

129	3.  Messages Specification

131	   Messages defined in the following sections MUST be implemented as
132	   UDP/IP datagrams [RFC768].

134	   Also, all messages SHOULD use the same text encoding.  It is
135	   RECOMMENDED that both server and client encode messages using UTF-8
136	   [RFC3629].

138	3.1.  Terms

140	   In order of appearance:

142	   SERVER_INSTANCE_NAME  Tag used to distinguish different environments
143	       hosted by the same server, e.g.: "cartpole".

145	   SERVER_INSTANCE_NUMBER  Non-negative integer used to distinguish
146	       different instances of the same environment hosted by the
147	       same server, e.g.: "0".

149	   HEADER  Shorthand for SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER,
150	       e.g.: "cartpole:0".

152	   SERVER_LOBBY_PORT  UDP/IP port on which the server listens for
153	       incoming messages related to the lobby phase.  Clients need to
154	       know the SERVER_LOBBY_PORT beforehand.

156	   SERVER_ROLLOUT_PORT  UDP/IP port on which the server listens for
157	       incoming messages related to the rollout phase.  The server
158	       notifies clients of it right before the simulation starts.

161	   CLIENT_PORT  UDP/IP port of agent clients.  The server SHOULD NOT
162	       send datagrams to clients that have not registered first,
163	       as described in the following sections.

165	   AGENT_KEY  Key used to identify one available agent slot, e.g.:
166	       "agent0".

168	   AGENT_TAG  Tag used to identify one agent filling one available slot.
169	       Specific clients can use a custom tag to identify themselves
170	       within the scope of the lobby phase, e.g.: "john_doe_q_learning".

172	   BOOL_VALUE  "true" or "false" particles, without the quotes.

174	   ACTION  Action chosen by an agent.  It MUST NOT contain the colon
175	       character (:), semicolon (;), or equal sign (=).  There are no
176	       other restrictions on how this field is formed as long as it is
177	       well understood by both client and server, e.g.: "move_left" or
178	       "5,6.78".

180	   SLOT_STATUS  "open" or "close" particles, without the quotes.

182	   AGENT_KIND  Freeform field used to differentiate aspects of agents
183	       relevant during the lobby phase, e.g.: "citizen" or "zombie".  It
184	       MUST NOT contain the colon character (:), semicolon (;), comma
185	       (,) or equal sign (=).  There are no other restrictions on how
186	       this field is formed as long as it is well understood by both
187	       client and server.

189	   READY_STATUS  "ready" or "not_ready" particles, without the quotes.

191	   AGENT_SLOT  Shorthand for
192	       AGENT_KEY=SLOT_STATUS,AGENT_KIND,AGENT_TAG,READY_STATUS;

194	   [AGENT_SLOT]  One or more occurrences of AGENT_SLOT.

196	   MESSAGE  Informative message sent by server instances during lobby
197	       phase.

199	   TIMESTAMP  Number of milliseconds since UNIX Epoch (Jan 1, 1970)
200	       according to server time.

202	   STEP_NUMBER  Non-negative integer indicating the step number for a
203	       running simulation.

205	   OBSERVATION  Observation for an agent received upon a simulation step
206	       run on the server.  It MUST NOT contain the semicolon character
207	       (;), or equal sign (=).  There are no other restrictions on how
208	       this field is formed as long as it is well understood by both
209	       client and server, e.g.: "x:0.54,y:0.95".

211	   REWARD  Reward for an agent received upon a simulation step run on
212	       the server, usually modeled as a single floating point value.  It
213	       MUST NOT contain the semicolon character (;), or equal sign (=).

215	   EXTRA  Additional information for an agent received upon a simulation
216	       step run on the server.  It MUST NOT contain the semicolon
217	       character (;), or equal sign (=).  There are no other
218	       restrictions on how this field is formed as long as it is well
219	       understood by both client and server, e.g.:
220	       "did_jump:true,jump_length:6.84".
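
   As a non-normative illustration, the character restrictions listed
   above could be checked with a small helper.  The following Python
   sketch is not part of PERLERT; the names FORBIDDEN, is_valid_field,
   and make_header are hypothetical and chosen only for this example.

```python
# Characters that each field MUST NOT contain, as listed in the Terms
# section. The dictionary layout is an assumption of this sketch.
FORBIDDEN = {
    "ACTION": ":;=",
    "AGENT_KIND": ":;,=",
    "OBSERVATION": ";=",
    "REWARD": ";=",
    "EXTRA": ";=",
}


def is_valid_field(kind: str, value: str) -> bool:
    """Return True if value has none of the characters forbidden for kind."""
    return not any(ch in value for ch in FORBIDDEN[kind])


def make_header(instance_name: str, instance_number: int) -> str:
    """Build HEADER as SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER."""
    return f"{instance_name}:{instance_number}"
```

   For instance, is_valid_field("OBSERVATION", "x:0.54,y:0.95") holds
   because colons and commas are allowed in observations, while the
   same string would be rejected as an ACTION.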

222	3.2.  Client Message Types

224	   This section specifies the content format for the message types that
225	   shall be implemented by PERLERT clients.

227	   lobby information request
228	       Message sent by clients to request lobby information associated
229	       with a given server instance.

231	       HEADER;lobby

233	   lobby registration request
234	       Message sent by clients to request to participate in a simulation
235	       server instance.

237	       HEADER;register=AGENT_KEY,AGENT_TAG

239	       Clients are allowed to issue multiple lobby registration
240	       requests, but only the last one correctly received by the server
241	       will take effect.

243	   lobby ready request
244	       Message sent by clients to inform the server whether they are
245	       ready to participate in the simulation or not.

247	       HEADER;ready=AGENT_KEY,BOOL_VALUE

249	   rollout action
250	       Message sent by clients to inform the server of the action to be
251	       run in the simulation.  Clients need not send a "rollout
252	       action" message for every simulation timestep.  Instead, the
253	       server will use the last received action for each client and feed
254	       it into the environment until it receives a new action.  Server
255	       instances can choose which action to feed to the environment
256	       simulation until agent clients provide a valid action.

258	       HEADER;action=ACTION
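
   The four client message formats above, and the rule that the server
   keeps feeding the last received action, could be sketched in Python
   as follows.  This is an illustration under assumptions, not a
   reference implementation: all function names and the ActionBuffer
   class are hypothetical, and the default action is whatever the
   server instance chooses.

```python
def lobby_information_request(header: str) -> str:
    return f"{header};lobby"


def lobby_registration_request(header: str, agent_key: str, agent_tag: str) -> str:
    return f"{header};register={agent_key},{agent_tag}"


def lobby_ready_request(header: str, agent_key: str, ready: bool) -> str:
    # BOOL_VALUE is the lowercase particle "true" or "false".
    return f"{header};ready={agent_key},{'true' if ready else 'false'}"


def rollout_action(header: str, action: str) -> str:
    # ACTION MUST NOT contain a colon, semicolon, or equal sign.
    if any(ch in action for ch in ":;="):
        raise ValueError("ACTION contains a forbidden character")
    return f"{header};action={action}"


class ActionBuffer:
    """Server-side holder for the most recent action of each client."""

    def __init__(self, default_action: str):
        # Action fed to the environment before a client sends a valid one.
        self._default = default_action
        self._latest = {}

    def update(self, client_addr, action: str) -> None:
        self._latest[client_addr] = action

    def current(self, client_addr) -> str:
        return self._latest.get(client_addr, self._default)
```

   A server would call update() for each "rollout action" datagram and
   current() once per simulation step, so a client that stays silent
   keeps repeating its previous action.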

260	3.3.  Server Message Types

262	   This section specifies the content format for the message types that
263	   shall be implemented by PERLERT servers.

265	   lobby information
266	       Message sent by servers to inform clients about lobby agent
267	       slots.  This datagram MUST be sent to a client upon receiving a
268	       "lobby information request", and to all clients whenever the
269	       lobby is altered due to a "lobby registration request" or a
270	       "lobby ready request".

272	       HEADER;[AGENT_SLOT]

274	       The message format MAY omit the trailing semicolon character (;).

276	   lobby registration response
277	       Message sent by servers upon a successful registration request.

279	       HEADER;registered=AGENT_KEY

281	       Servers MUST NOT allow a single client to be registered in
282	       multiple slots.  Before registering a client in an agent slot,
283	       the server must remove that client from any slot where it was
284	       previously registered.

286	       Servers MUST register clients with a default "not_ready" status.

288	   lobby message
289	       Message sent by servers to registered clients containing relevant
290	       general information.

292	       HEADER;message=MESSAGE

294	   lobby start
295	       Message sent by servers to all registered clients announcing
296	       the UDP/IP port for the rollout when the simulation is about to
297	       start.  The server can choose to start the simulation at any time
298	       but it MUST NOT do so while any client is in "not_ready" status.

300	       HEADER;start=port:SERVER_ROLLOUT_PORT

302	   rollout step
303	       Message sent by servers to all registered clients containing the
304	       information provided by the environment for a single step.  Note
305	       that "rollout step" messages should be sent as a regular stream
306	       carrying enough data per time unit for clients to render the
307	       environment properly, without exceeding a reasonable number of
308	       UDP packets.  It is RECOMMENDED to send at most 30 "rollout
309	       step" packets per second.

311	       HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE

314	       Server MAY send additional information by concatenating an extra
315	       particle like this:

317	       HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE;extra=EXTRA

320	       Because many messages of this type will be sent over the
321	       network, it is recommended that they be as condensed as
322	       possible.  For example, it is RECOMMENDED that floating point
323	       values belonging to either the OBSERVATION or the REWARD be
324	       rounded to the minimum number of decimal places needed.
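
   A client could decode a "rollout step" datagram along the following
   lines.  This Python sketch is only illustrative: RolloutStep,
   parse_rollout_step, and compact_float are hypothetical names, and the
   default of two decimal places is an assumption, not a PERLERT
   requirement.

```python
from typing import NamedTuple, Optional


class RolloutStep(NamedTuple):
    header: str
    timestamp: int
    step_number: int
    observation: str
    reward: str
    done: bool
    extra: Optional[str]


def parse_rollout_step(text: str) -> RolloutStep:
    """Split a "rollout step" message into its fields."""
    prefix, *particles = text.rstrip(";").split(";")
    # HEADER contains a colon itself, so TIMESTAMP and STEP_NUMBER are
    # taken from the right-hand side of the prefix.
    header, timestamp, step_number = prefix.rsplit(":", 2)
    # Field values cannot contain "=", so one split per particle suffices.
    fields = dict(p.split("=", 1) for p in particles)
    return RolloutStep(
        header=header,
        timestamp=int(timestamp),
        step_number=int(step_number),
        observation=fields["obs"],
        reward=fields["reward"],
        done=fields["done"] == "true",
        extra=fields.get("extra"),  # optional particle
    )


def compact_float(value: float, decimals: int = 2) -> str:
    """Render a float with few decimals and no trailing zeros."""
    text = f"{value:.{decimals}f}"
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text
```

   OBSERVATION and REWARD are left as strings because the protocol only
   requires that client and server agree on their meaning.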

326	4.  UDP/IP Ports

328	   All messages sent by one client MUST use the same UDP/IP source
329	   CLIENT_PORT for the whole exchange, from the "lobby registration
330	   request" it sends to the server until it receives a "rollout step"
331	   message with the "done" flag set to "true".

333	   "lobby information", "lobby registration response", "lobby message",
334	   and "lobby start" datagrams MUST use the same UDP/IP source
335	   SERVER_LOBBY_PORT for a given server instance.

337	   "rollout step" datagrams MUST use the same UDP/IP source
338	   SERVER_ROLLOUT_PORT for a given server instance.
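
   On the client side, the single-source-port rule is most easily met by
   reusing one bound socket for both phases.  The Python sketch below
   uses only the standard socket module; the helper names and the
   1024-byte BUFFER_SIZE are assumptions made for this example.

```python
import socket

# Fixed, small receive buffer; the exact size is an assumption.
BUFFER_SIZE = 1024


def open_client_socket(client_port: int = 0) -> socket.socket:
    """Create the one UDP socket a client uses for lobby and rollout.

    Port 0 lets the operating system pick a free CLIENT_PORT.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", client_port))
    return sock


def send_message(sock: socket.socket, host: str, port: int, message: str) -> None:
    """Send one PERLERT message as a UTF-8 encoded datagram."""
    sock.sendto(message.encode("utf-8"), (host, port))


def receive_message(sock: socket.socket):
    """Receive one datagram; return its text and the sender address."""
    data, addr = sock.recvfrom(BUFFER_SIZE)
    return data.decode("utf-8"), addr
```

   Because the same socket object sends to SERVER_LOBBY_PORT and later
   to SERVER_ROLLOUT_PORT, the server sees the same source port in both
   phases.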

340	5.  Example Case

342	   This section provides a brief example of datagrams exchanged by one
343	   client and one server during a PERLERT session.

345	          CLIENT                                           SERVER

347	          ==================== LOBBY PHASE ======================
348	          UDP port: 55555                         UDP port: 32322

350	          city:7;lobby -------------------------------------->

352	             <-------------- city:7;agent0=open,citizen,cpu,ready

354	          city:7;register=agent0,patrick -------------------->

356	             <-------------------------- city:7;registered=agent0

358	          city:7;ready=agent0,true -------------------------->

360	             <--------- city:7;agent0=close,citizen,patrick,ready
361	             <----------- city:7;message=Simulation will start...

363	             <--------------------------- city:7;start=port:32323

365	          ==================== ROLLOUT PHASE =====================
366	          UDP port: 55555                         UDP port: 32323

368	          city:7;action=walk -------------------------------->

370	             <-- city:7:1590853116323:0;obs=45;reward=0;done=false
371	             <-- city:7:1590853121058:1;obs=47;reward=0;done=false
372	             <-- city:7:1590853126423:2;obs=48;reward=1;done=false
373	             <-- city:7:1590853130429:3;obs=49;reward=0;done=false
374	             <--- city:7:1590853134833:4;obs=51;reward=1;done=true

376	                                 Figure 1
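
   The "lobby information" datagrams shown in Figure 1 could be decoded
   by a client roughly as follows.  This Python sketch is illustrative
   only: AgentSlot and parse_lobby_information are hypothetical names,
   and the sketch assumes that AGENT_TAG contains no comma.

```python
from typing import List, NamedTuple, Tuple


class AgentSlot(NamedTuple):
    agent_key: str
    slot_status: str
    agent_kind: str
    agent_tag: str
    ready_status: str


def parse_lobby_information(text: str) -> Tuple[str, List[AgentSlot]]:
    """Split HEADER;[AGENT_SLOT] into the header and a list of slots."""
    # The trailing semicolon MAY be omitted, so strip it if present.
    header, *raw_slots = text.rstrip(";").split(";")
    slots = []
    for raw in raw_slots:
        agent_key, values = raw.split("=", 1)
        slot_status, agent_kind, agent_tag, ready_status = values.split(",")
        slots.append(
            AgentSlot(agent_key, slot_status, agent_kind, agent_tag, ready_status)
        )
    return header, slots
```

   Applied to the first server datagram of Figure 1, the helper yields
   the header "city:7" and a single open slot keyed "agent0".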

378	6.  Additional Considerations

380	   Because packet loss might prevent some PERLERT information from
381	   reaching the other end, the following considerations are to be
382	   taken into account:

384	   After sending the "lobby start" message, the server instance SHOULD
385	   keep the SERVER_LOBBY_PORT open for five (5) seconds and resend the
386	   "lobby start" message to any client that contacts that port after
387	   the simulation has started.

389	   After the simulation is finished for a given client, that is, once a
390	   "rollout step" message carries the "done" flag as "true", the server
391	   instance SHOULD keep the SERVER_ROLLOUT_PORT open for ten (10)
392	   seconds and keep listening for datagrams from that client.  The
393	   server instance SHOULD resend the appropriate "rollout step"
394	   datagram upon receiving a client message within that period.
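
   A server could decide whether a late client datagram still deserves
   a resend with a check like the one below.  This is a minimal Python
   sketch; should_resend and the two constant names are hypothetical,
   and times are assumed to be seconds from a monotonic clock.

```python
# Grace periods taken from the two paragraphs above, in seconds.
LOBBY_GRACE_SECONDS = 5.0
ROLLOUT_GRACE_SECONDS = 10.0


def should_resend(now: float, event_time: float, grace_seconds: float) -> bool:
    """Return True while now is still inside the grace period.

    event_time is when "lobby start" was sent, or when the "rollout
    step" with done=true was sent to the client.
    """
    return 0.0 <= now - event_time <= grace_seconds
```

   With event_time taken from time.monotonic() when the final message
   was sent, the same helper covers both the lobby and rollout cases.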

396	7.  IANA Considerations

398	   This memo includes no request to IANA.

400	8.  Security Considerations

402	   Both client and server implementations SHOULD use a fixed buffer size
403	   as small as possible for receiving the UDP/IP packets.

405	   Both client and server MAY encrypt the content of the messages.  Use
406	   of asymmetric public/private key pairs is recommended, but symmetric
407	   encryption with a pre-shared key is also encouraged.

409	   PERLERT is especially vulnerable to IP spoofing attacks, because
410	   actions received during the rollout phase are only identified by the
411	   IP of the sender.  Using a VPN is RECOMMENDED in order to tunnel
412	   the information exchange.

414	9.  Normative References

416	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
417	              Requirement Levels", BCP 14, RFC 2119,
418	              DOI 10.17487/RFC2119, March 1997,
419	              <https://www.rfc-editor.org/info/rfc2119>.

421	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
422	              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
423	              2003, <https://www.rfc-editor.org/info/rfc3629>.

425	   [RFC768]   Postel, J., "User Datagram Protocol", STD 6, RFC 768,
426	              August 1980, <https://www.rfc-editor.org/info/rfc768>.

428	Author's Address

430	   Ruben Montero
431	   University of A Coruna
432	   Rua San Roque 9
433	   A Coruna, Galicia  15002
434	   ES

436	   Phone: +34 692 983 851
437	   Email: ruben.montero@udc.es