idnits 2.17.1 draft-dowser-spec-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 17 longer pages, the longest (page 18) being 62 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 2 instances of too long lines in the document, the longest one being 16 characters in excess of 72. ** There are 42 instances of lines with control characters in the document. == There are 1 instance of lines with non-RFC3849-compliant IPv6 addresses in the document. If these are example addresses, they should be changed. -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 109 has weird spacing: '...mselves are k...' == Line 111 has weird spacing: '...s to be retri...' == Line 877 has weird spacing: '...for the purpo...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 04, 2004) is 7084 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC1945' is mentioned on line 87, but not defined == Missing Reference: 'URI' is mentioned on line 109, but not defined == Missing Reference: 'RFC2616' is mentioned on line 114, but not defined ** Obsolete undefined reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Looks like a reference, but probably isn't: '34' on line 202 == Unused Reference: 'HTTP' is defined on line 827, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986) ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Downref: Normative reference to an Informational RFC: RFC 3174 (ref. 'SHA') Summary: 8 errors (**), 0 flaws (~~), 11 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet-Draft Carlos Bueno 2 A Distributed Web Search Protocol -- Dowser/0.1 3 draft-dowser-spec-00.txt 5 Status of This Memo 7 This document is an Internet-Draft and is subject to all provisions 8 of Section 10 of RFC2026. 10 Internet-Drafts are working documents of the Internet Engineering 11 Task Force (IETF), its areas, and its working groups. Note that 12 other groups may also distribute working documents as 13 Internet-Drafts. 15 Internet-Drafts are draft documents valid for a maximum of six 16 months and may be updated, replaced, or obsoleted by other 17 documents at any time. It is inappropriate to use Internet- 18 Drafts as reference material or to cite them other than as 19 "work in progress." 21 The list of current Internet-Drafts can be accessed at 22 http://www.ietf.org/1id-abstracts.html 24 The list of Internet-Draft Shadow Directories can be accessed at 25 http://www.ietf.org/shadow.html 27 This document Expires on November 04, 2004 29 Abstract 31 Dowser is an application-level, peer-to-peer protocol for creating a 32 searchable index and cache of documents. It is intended for both 33 small-scale intranets and Worldwide Web-scale indexes. This document 34 describes the messages that members, or "nodes", of the network pass 35 to each other to distribute workload, request & respond to queries, 36 and maintain the index. 38 Table of contents 40 1. Introduction ................................................. 3 41 1.1 Purpose .................................................... 3 42 1.2 Definitions ................................................ 3 43 1.3 URL vs. URI vs. URN ........................................ 4 44 1.4 Requirements ............................................... 5 45 2. Some Examples ................................................ 5 46 2.1 NODEFIND ................................................... 5 47 2.1.1 Node-id & seed ............................................ 6 48 2.1.2 Last-key .................................................. 6 49 2.1.3 Ring-id ................................................... 7 50 2.1.4 Port ...................................................... 7 51 2.1.5 Server Responses .......................................... 7 52 2.2 SEARCH ...................................................... 8 53 2.2.1 Expires ................................................... 9 54 2.2.2 Content-key ............................................... 9 55 2.2.3 Search syntax ............................................. 9 56 2.3. URL caching ............................................... 10 57 2.4. Content caching ........................................... 11 58 2.5. CRAWL ..................................................... 11 59 2.6.INDEXADD ................................................... 12 60 3. Announcing .................................................. 13 61 4. Redundancy .................................................. 14 62 5. Optional headers ............................................ 14 63 6. Error conditions ............................................ 14 64 7. Notes on implementation ..................................... 15 65 7.1 Ranking .................................................... 15 66 7.2 Proxying ................................................... 15 67 7.3 TCP vs UDP ................................................. 16 68 7.4 Ping/Pong .................................................. 16 69 7.5 MHTML ...................................................... 16 70 7.6 Spam ....................................................... 16 71 7.7 Indexing ................................................... 16 72 8. Acknowledgements ............................................ 16 73 9. References .................................................. 17 74 9.1 Informative ................................................ 17 75 9.2 Normative .................................................. 17 76 10. Security Considerations .................................... 17 77 10.1 Personal Information ...................................... 18 78 10.2 Sensitive data ............................................ 18 79 11. IPR & Copyright ............................................ 18 80 12. Contact Information ........................................ 19 81 1. Introduction 83 1.1 Purpose 85 The express purpose of the Dowser protocol is to encourage open 86 research into web-scale, decentralized indexing systems. The 87 specification for the Hypertext Transfer Protocol [RFC1945] 88 has this observation: 90 Practical information systems require more functionality than 91 simple retrieval, including search, front-end update, and 92 annotation. HTTP allows an open-ended set of methods to be used to 93 indicate the purpose of a request. 95 HTTP is in widespread use by hundreds of millions of people every 96 day, and they issue hundreds of millions of searches for information. 97 Most of these searches are served by centralized "search engines" 98 that cache copies of as many websites as possible to create their 99 indexes. The problems of scale have so far been admirably met 100 [GOOGLE], but the operating costs of web-scale engines are now out of 101 the reach of most Universities and corporations. Conversely, larger 102 and larger amounts of storage and processing power are available to 103 personal computer users every year. 105 Dowser is an extension to HTTP that can be used by itself or added to 106 existing HTTP implementations. Each node on the network claims a 107 small part of the keyspace of a distributed hash table [CHORD]. 108 Indexes of documents are keyed under the hash of the word; documents 109 themselves are keyed by their Uniform Resource Identifier [URI]. 110 There is also a "cache of caches", to allow popular documents or 111 indexes to be retrieved from any node that has a copy. 113 The syntax and formatting rules are inherited from HTTP/1.1 114 [RFC2616]. The required "Host" header in HTTP/1.1 does not really 115 apply to Dowser, but including it shouldn't hurt anyone. Several 116 other headers and return codes are adopted for use. 118 1.2 Definitions 120 node 122 In this protocol there is no difference between a "client" or a 123 "server". They are all equal "nodes" on the network, capable of 124 sending and serving requests. Each node is identified by a 125 40-digit hexadecimal number called a "node-id". They participate 126 in creating, storing and serving a distributed hash table. This 127 hash table can contain document caches, indexes of those 128 documents, and other data. 130 For sanity's sake, the term "server" can be taken to mean 131 "responding node" and "client" to mean "requesting node" in the 132 context of the conversation being described. 134 hash, key 136 Hash functions are used to evenly distribute data around the 137 network and to uniquely identify nodes and pieces of information. 138 Dowser uses the Secure Hash Algorithim, version 2 [SHA]. The terms 139 "hash" and "key" refer to the 40-digit hexadecimal number gotten 140 from passing some piece of information through the SHA function, 141 e.g.: 143 SHA("foo") == 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 145 distributed hash table 147 A distributed hash table is a key-->value mapping that is 148 contained on many computers, where they keys are hashed. 150 range 152 The "range" of a node is defined by the first and last key of the 153 hash table it is expected to store data under. 155 left-hand, right-hand 157 The total keyspace is imagined as the perimeter of a circle, or a 158 "ring". While standing in the center, points to the left generally 159 decrease and points to the right generally increase, except at the 160 point where "fff..." meets "000...". 162 A useful analog is timezones, where West is left and East is 163 right. The time is always earlier to the left except at the 164 International Date Line. 166 1.3 URL vs. URI vs. URN 168 The protocol descibed here has a focus different from most 169 filesharing protocols. Searches for files are not broadcast but 170 targeted; nodes are authoritative regarding the location and 171 checksums of data in their ranges. Searches for a particular node are 172 sent to arbitrary nodes but not broadcast per se; they are not 173 forwarded but responded to directly. 175 In most ad-hoc filesharing protocols the only unique identifier is 176 the checksum. The filesystem names and metadata may be different for 177 different copies but the checksum remains the same. In HTTP jargon 178 this is a Universal Resource Name or URN , that which uniquely 179 identifies the file without necessarily referencing its location or 180 common name [RFC2396]. 182 The content of a cached web page may change with time. This means the 183 checksum will change while its Univeral Resource Locator, the origin 184 of the page, remains constant. 186 Indexes built from these cached pages are also treated as files, one 187 index file per word or phrase. An index's Universal Resource Name is 188 that word. (More correctly it would be "dowser://foo", not "foo".) 190 Searching in Dowser therefore starts with a lookup for the node(s) 191 who handle the search term(s). The search terms (URN) are then sent 192 to those nodes, resulting in either a) the content of index file or 193 b) the latest checksum of the index file and a list of nodes that 194 have previously downloaded a copy. Retrieving a cache of a web page 195 is done the same way. 197 1.4 Requirements 199 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 200 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 201 document are to be interpreted as described in RFC 2119 [34]. 203 An implementation is not compliant if it fails to satisfy one or more 204 of the MUST or REQUIRED level requirements for the protocols it 205 implements. An implementation that satisfies all the MUST or REQUIRED 206 level and all the SHOULD level requirements for its protocols is said 207 to be "unconditionally compliant"; one that satisfies all the MUST 208 level requirements but not all the SHOULD level requirements for its 209 protocols is said to be "conditionally compliant." 211 In reality, the complications of this protocol are mostly in the 212 state information, not the ins and outs of the messages being passed. 214 2. Some Examples 216 We will assume the reader is familiar with HTTP transactions. It is 217 probably most instructive to give examples using the new methods and 218 then a discussion of particulars. Elipses ("...") are used to 219 truncate long hashes for readability. 221 2.1 NODEFIND 223 NODEFINDs are the bread and butter of routing between nodes. The 224 client wants to find the node(s) that have items under a certain hash 225 key, in this example, "5c89379d0aa9840ac910fd8cacbde2dbf877214a". 226 This can be a search term, url, or anything. The responding node may 227 or may not have this key. If not, it tells the client about nodes it 228 knows about that are closer to the target. 230 NODEFIND 5c89379d0aa9840ac910fd8cacbde2dbf877214a Dowser/0.1 231 Ring-Id: deadbeef00000000000000000000000000000000 232 Node-Id: 12222... a5754... 233 Last-key: 13333... 234 Port: 4666 236 - Response - 237 Dowser/0.1 310 Not in my range 238 Ring-Id: deadbeef00000000000000000000000000000000 239 Node-Id: 48888... de34.... 240 Last-key: 49999... 242 192.0.2.10 8000 50000f... 599999... 243 192.0.2.20 6876 43333a... 610000... 245 - Or - 247 Dowser/0.1 211 That's me 248 Ring-Id: deadbeef00000000000000000000000000000000 249 Node-Id: 4000... de34.... 250 Last-key: 7aaaa... 252 There is a lot to cover here. The first line of the request is like 253 HTTP, identifying the protocol and version, the 'method' or action to 254 be done, and the 'path' or key being requested. There are three 255 additional headers which MUST be included in any message: Ring-id, 256 Node-id, and Last-key. Additionally, the Port header MUST be included 257 in every request. 259 2.1.1 Node-id & seed 261 The node-id is comprised of two 40-digit hexadecimal numbers, 262 separated by whitespace. The first number serves as the FIRST key in 263 this node's range. This number MUST be generated by hashing some 264 random data iteratively. To help ensure this has occurred, the second 265 number (the "seed") MUST be the second-to-last result of the 266 iterations. 268 Node-id: b274f2e2a8d2881035af5866014e9ad5510ab15d 269 cdd2ae2594a83ef90c05ee6014b78631db8538d8 271 This example (linewrapped to fit) is well-formed because the first 272 number is the SHA hash of the second number. Servers MUST check that 273 the node-id and its "seed" are correct. If not, they MUST send a 412 274 "Precondition failed" error. 276 2.1.2 Last-key 278 Last-key is also an SHA hash, and is the LAST key in this node's 279 range. The last-key may change over time, as nodes leave or join the 280 network. 282 Note: it has been brought up that a node might be able to route a 283 large portion of traffic to itself by making its Last-key equal to 284 the node-id minus 1. In Internet Protocol terms, this would be 285 similar to setting your network address to 255.255.255.255. There is 286 nothing in the protocol to prevent this, but implementations should 287 come up with a way to deal with it reasonably. Our feeling that it is 288 self-correcting. If someone tries it on a network with a lot of nodes 289 one hopes he or she has good fire-supression equipment handy; but the 290 concern is the possible damage to the network routing. 292 2.1.3 Ring-id 294 Ring-id is yet another SHA hash to identify the group of nodes it 295 wishes to communicate with. This allows many Dowser networks to be 296 running on an internet at once. Alternate rings are useful for 297 testing implementations of the protocol on a large scale, or for 298 partitioning networks in general. 300 The "official" public ring-id: 302 Dowser/0.1 --> deadbeef00000000000000000000000000000000 304 If the Ring-id header sent does not match the ring-id of the server, 305 it MUST send a 412 "Precondition falied" message, and may provide 306 information in the body of the response as to the reason. Note: there 307 is a security risk in actually revealing the server's ring-id in the 308 response, if the ring is intended for private data. 310 2.1.4 Port 312 Simple enough: we need to advertise what TCP or UDP port we are 313 listening on for future reference. 315 2.1.5 Server Responses 317 If the hash is not in the listening node's range, it returns a 310 318 code and a list of nodes that are closer to the target. The list of 319 nodes takes this form: 321 IP PORT NODE-ID LAST-KEY 323 If it is in range, it sends a 211 "That's Me" code. 325 When overlap occurs a node may choose which one to contact first in 326 an implementation-dependent way, such as IP or network distance, or 327 closeness of the desired key to the node-id. 329 Note: IPv4 addresses are used in these examples, but implementations 330 SHOULD be written to handle IPv6 addresses as well, e.g.: 332 FEDC:BA98:7654:3210:FEDC:BA98:7654:3210 333 1080:0:0:0:8:800:200C:4171 334 3ffe:2a00:100:7031::1 335 1080::8:800:200C:417A 336 ::192.9.5.5 337 ::FFFF:129.144.52.38 338 2010:836B:4179::836B:4179 340 2.2 SEARCH 342 We have the basis of all conversations, and have described how nodes 343 identify themselves and each other. Now we can go to the next level: 344 content routing. 346 - Request - 348 SEARCH foo+bar+baz Dowser/0.1 349 Ring-Id: deadbeef00000000000000000000000000000000 350 Node-Id: 1d3470c... 55754... 351 Last-key: 1e400cc... 352 Port: 4666 354 - Response - 356 Dowser/0.1 200 OK 357 Ring-Id: deadbeef00000000000000000000000000000000 358 Node-Id: 48888... de34... 359 Last-key: 49999... 360 Expires: 864000 361 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 363 URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] 364 URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] 365 ... 367 - Or - 369 Dowser/0.1 300 Multiple choices 370 Ring-Id: deadbeef00000000000000000000000000000000 371 Node-Id: 48888... de34... 372 Last-key: 49999... 373 Expires: 864000 374 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 376 192.0.2.10 8000 50000f... 599999... 377 192.0.2.20 6876 43333a... 610000... 379 A SEARCH request is almost exactly like a nodefind except the "path" 380 is a list of search terms. The server determines if any of the terms 381 fall into its range. How to break up the terms may be 382 implementation-specific, but the simplest is to break on whitespace. 384 If any are in-range it returns a 300 code, the "content-key" of the 385 data (a SHA checksum of the content) a list of the nodes that have 386 requested that data. 388 Or, it may return the index data itself. Requesting nodes SHOULD keep 389 a copy of the data it receives for a reasonable amount of time, as 390 hinted by the Expires: header. The data should be keyed under the 391 content-key, for retrieval via a CACHE request. This helps distribute 392 the workload. The index is in the form: 394 URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] 396 And is tab-delimited. AGE is the number of seconds since the record 397 was created. One or more "SNIPPET"s contain text that form an 398 abstract of the document. 400 As always, the server can alternately send the standard 310 Not in 401 Range, or some error code. 403 2.2.1 Expires: 405 Indexes and caches of web documents are not long-lived in the way a 406 media file in traditional peer-to-peer networks are. The Expires 407 header hints at how long a node should cache this data, expressed in 408 seconds from the present time. 410 2.2.2 Content-key: 412 This is a SHA hash of the data's content, used to positively identify 413 it. This is different from the hash of the URL or search term. A page 414 whose URL is 416 "http://foo.example.com" 418 ... and whose content is 420 "

Foo!

" 422 ... would have a URL hash of cfab46bb7dbd11e6187360d429586e2942f2d42e 423 and a content-key of cf5ce65061218164e4148038cc3a56a9e988fe7a. The 424 URL hash is a unique identifier of the document's origin, while the 425 content-key is an identifier of a particular version of that 426 document. 428 2.2.3 Search syntax 430 There are a few schemes for search syntax out in the wild. The most 431 common has operands for words that must or must not appear, and exact 432 phrases. Implementations MUST honor these operands to the best of 433 their ability. 435 foo +bar 437 Means that the documents MUST contain "bar" and MAY contain "foo" 439 foo -bar 441 Means that the documents MUST contain "foo" and MUST NOT contain 442 "bar". When there is only one term without a "-" operand, the "+" is 443 implied. 445 "foo bar" 446 'foo bar' 448 Means that the documents MUST contain the exact phrase "foo bar", in 449 that order. 451 Note: in the SEARCH example in section 2.0, it may seem like the "+" 452 signs in "foo+bar+baz" are search syntax. They are not; the terms are 453 "url-encoded", meaning spaces are encoded as "+" and other non-word 454 characters are encoded as %HEX where HEX is the hexadecimal ASCII 455 code. Therefore "foo +bar" would appear as "foo+%2Bbar" 457 Implementations may honor other operands as they see fit. 459 2.3. URL caching 461 Like a SEARCH, a URLCACHE does not necessarily return the data -- it 462 can return the content-hash of the data and a list of the last X 463 nodes to request that data, where X is implementation-specific. If 464 none of the listed nodes have the data, the requesting node may then 465 request it from the original node. 467 - Request - 469 URLCACHE fb9642989a2222305832a5d9c29b34... Dowser/0.1 470 Ring-Id: deadbeef00000000000000000000000000000000 471 Node-Id: 1d3470c... a5754...... 472 Last-key: 1e400... 473 Url: http://foo.example.com/some/page.html 474 Port: 4666 476 - Response - 478 Dowser/0.1 300 Multiple choices 479 Ring-Id: deadbeef00000000000000000000000000000000 480 Node-Id: 48888... de34... 481 Last-key: 49999... 482 Expires: 864000 483 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 485 192.0.2.10 8000 50000f... 599999... 486 192.0.2.20 6876 43333a... 610000... 488 - Or - 489 Dowser/0.1 200 OK 490 Ring-Id: deadbeef00000000000000000000000000000000 491 Node-Id: 48888... de34... 492 Last-key: 49999... 493 Expires: 864000 494 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 496 ... (file data) ... 498 2.4. Content caching 500 This can be page caches, indexes, or really anything. A client sends 501 a request with the content-key as the path: 503 - Request - 505 CACHE 3fdb13677b10691debb3909dd917b00ee751115a Dowser/0.1 506 Ring-Id: deadbeef00000000000000000000000000000000 507 Node-Id: 1d3470c... de34... 508 Last-key: 1e400... 509 Url: http://foo.example.com/some/page.html 510 Port: 4666 512 - Response - 514 Dowser/0.1 200 OK 515 Ring-Id: deadbeef00000000000000000000000000000000 516 Node-Id: 48888... de34... 517 Last-key: 49999... 518 Expires: 864000 519 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 521 ... (file data) ... 523 - Or - 525 Dowser/0.1 404 not found 526 Ring-Id: deadbeef00000000000000000000000000000000 527 Node-Id: 48888... de34... 528 Last-key: 49999... 530 ## todo: content-length? simultaneous downloading? the Range: & Content-length: headers? 532 2.5. CRAWL 534 So we know how to find nodes and content. The last question is: where 535 does the content come from? And the indexes? 536 CRAWL http://foo.example.com/foo/bar/baz.html Dowser/0.1 537 Ring-Id: deadbeef00000000000000000000000000000000 538 Node-Id: 12222... a5754...... 539 Last-key: 13333... 540 Referer: http://foo.example.com/foo/index.html 541 Port: 4666 542 Expires: 864000 543 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 545 - Response - 547 Dowser/0.1 202 Accepted 548 Ring-Id: deadbeef00000000000000000000000000000000 549 Node-Id: 48888... de34... 550 Last-key: 49999... 552 - Or - 554 Dowser/0.1 310 Not in my range 555 Ring-Id: deadbeef00000000000000000000000000000000 556 Node-Id: 48888... de34... 557 Last-key: 49999... 559 192.0.2.10 8000 50000f... 599999... 560 192.0.2.20 6876 43333a... 610000... 562 When a node crawls a page, it usually has links on it. If the hash of 563 those second-level URLs are not in the node's range, it sends a CRAWL 564 annoucement to the nodes that do handle that range, so they can 565 continue the crawl. 567 A node SHOULD first check (via URLCACHE) if a cached copy exists 568 before downloading a page directly. 570 Implementations SHOULD respect the "robots.txt" protocol for nice 571 spiders. 573 ##todo: so where do we START crawling? that is not part of the 574 protocol because it could be a network of office machines sharing 575 their spreadsheets, or whatever -- need to say that in a nice way. 577 2.6.INDEXADD 579 An INDEXADD is an informational message to a node that a particular 580 URL contains a keyword inside the server's range. 582 INDEXADD 09dd917b00ee751115a3fdb13677b10691debb39 Dowser/0.1 583 Ring-Id: deadbeef00000000000000000000000000000000 584 Node-Id: 12222... a5754...... 585 Last-key: 13333... 586 Term: foo bar baz 587 Url: http://foo.example.com/foo/bar/baz.html 588 Expires: 864000 589 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 590 Port: 4666 592 ##todo: positions, counts, tdf*idf? 594 - Response - 596 Dowser/0.1 202 Accepted 597 Ring-Id: deadbeef00000000000000000000000000000000 598 Node-Id: 48888... de34.. 599 Last-key: 49999... 601 - Or - 603 Dowser/0.1 310 Not in my range 604 Ring-Id: deadbeef00000000000000000000000000000000 605 Node-Id: 48888... de34.. 606 Last-key: 49999... 608 192.0.2.10 8000 50000f... 599999... 609 192.0.2.20 6876 43333a... 610000... 611 The Term: header contains what words the server may be interested in, 612 separated by tabs. An INDEXADD message implies that the requesting 613 node has a copy of the document. Listening nodes should retrieve it 614 from that node via a CONTENTCACHE message. This puts a cost on adding 615 words to the index. 617 3. Announcing 619 When a new node joins the network, it must first be provided with a 620 list of "well-known" nodes and a ring-id. Knowing nothing else about 621 the network topology, it generates a node-id and sends a NODEFIND 622 request on that key to those well-known nodes for its own key: 624 NODEFIND 12222...0 Dowser/0.1 625 Ring-Id: deadbeef00000000000000000000000000000000 626 Node-Id: 12222... a5754... 627 Last-key: 12222... 628 Port: 4666 630 This eventually leads the new node to its "neighbors" on the ring. 631 The new node's range SHOULD initially be of the same length as its 632 left-hand neighbor, but MAY grow or shrink according to how much 633 capacity it has at hand, but the Last-key MUST NOT be less than the 634 node-id of its closest neighbor. 636 4. Redundancy 638 All nodes may not be available at all times, and the "overlap" is not 639 guaranteed, requiring some fiat redundancy. So: 641 Each node assumes a range starting with its nodeid and That defines 642 its "core" range. It SHALL also handle an "auxilary" range of equal 643 length, calculated by adding 8 to the first digit. The addtion does 644 not carry. For example: 646 Node A "Core" range: 648 9000000000000000000000000000000000000000 to 649 c000000000000000000000000000000000000000 651 Auxilary range: 653 1000000000000000000000000000000000000000 to 654 4000000000000000000000000000000000000000 656 5. Optional headers 658 To save on bandwidth, nodes MAY compress their responses. A 659 requesting node MAY indicate that it can accept compressed data with 660 this optional header: 662 Accept-Encoding: gzip, deflate 664 If a server compresses a response, it MUST inidcate that with this 665 header: 667 Content-encoding: gzip 669 The server MAY also send this optional header to SEARCH responses to 670 help guide the user: 672 Related-terms: meta, syntax, jargon 674 6. Error conditions 676 Many error codes from HTTP/1.1 apply to Dowser. 678 500 Internal Server error 680 501 Not Implemented 682 503 Server busy (Service unavailable) 683 505 HTTP Version Not Supported 685 400 Bad request 687 404 Not Found 689 412 Precondition failed 691 When a node's id and seed don't check out, for instance. This is 692 different from a mal-formed request which should be responded to 693 with a 400. Also, nodes may keep a "blacklist" of node-ids and/or 694 IPs that they do not wish to do business with (See SPAM). In the 695 case of blacklists, the server MAY simply close the connection, 696 but it's polite to give some error information. 698 Implementations MAY elect to withold the Ring-id header in this 699 case, for security. 701 7. Notes on implementation 703 A reference implementation is currently in alpha testing and should be 704 released by mid-summer 2004 [IMPL] 706 7.1 Ranking 708 There is not a lot of ranking data returned by SEARCH requests. The 709 idea is to return somewhat raw indexes to the client and let it sort 710 things out through intersecting multiple responses and its own 711 biases. 713 One time-saving trick may be employed by implementations: recall in 714 the SEARCH example, all of the search terms were sent, even though 715 it's unlikely one node will handle all of them. The reason is to 716 allow nodes to pick out terms that tend to accompany the terms they 717 are responsible for. The node may then keep an extra index of those 718 accompanying terms and use it to trim and/or rank the results sent 719 back. 721 7.2 Proxying 723 If a relatively new node receives a request for a key that is in its 724 range but not its datastore, it may create a new request to its 725 upstream neighbor for it, on the theory that it may still be in that 726 node's datastore. It then keeps a copy while forwarding it to the 727 original requesting node. This allows new nodes to build on the work 728 of older ones. 730 Implementations might want to work out semantics for long-lived 731 connections on the lines of HTTP/1.1 Keep-alive headers. 733 7.3 TCP vs UDP 735 Several methods (NODEFIND, CRAWL, INDEXADD) have two interesting 736 properties: they will be used often, and involve a very small amount 737 of data. It is tempting to implement those parts over UDP and the 738 rest over TCP, or to come up with some tricks to do it all over UDP. 740 7.4 Ping/Pong 742 Neighboring nodes should be in the habit of pinging their nearest 743 neighbors, if they have not heard from them in a while. A "ping" can 744 take the form of a NODEFIND using the neighbor's node-id. 746 7.5 MHTML 748 HTML documents of today tend to be comprised of many files including 749 graphics, stylesheets, etc. It may be useful for implementations to 750 offer "MHTML" versions of cached documents that contain these 751 supporting files in one MIME-encoded file. If a client supports 752 MHTML, it should include it in the Accept: header. Common-sense 753 restrictions apply, such as not including executable scripts or 754 plugins and files that do not come from the same domain. 756 7.6 Spam 758 The network described above may one day contain thousands of nodes. 759 Thousands of nodes implies thousands of users looking for 760 information, which nowadays means spam. An implementation may elect 761 to deal with it in the way email clients do, with filters built from 762 analysis of links classified by the user as "spam" or "useful". Even 763 in the abscence of spurious traffic a good Bayesian filter may help 764 tailor results. 766 A sophisticated method of injecting targeted spam involves generating 767 node-ids until one pair is "close enough" to the hash of a desired 768 keyword, where "close enough" is matching the first 4 or 5 digits. 769 Even thousands of nodes will be sparse in the keyspace, so an 770 exhaustive search is not necessary. 772 7.7 Indexing 774 A large sample of web documents suggests that there is a fairly 775 constant number of terms, something on the close order of 14 million 776 [GOOGLE]. 778 An unpublished survey of a large sample of search engine queries 779 suggests that there is also a fairly constant number of unique terms 780 searched, on the close order of 1 million. This is very interesting. 782 Suppose we give more "weight" to words found in the query stream over 783 those that are not. As we've seen in the adventures of Internet 784 search engines over the last few years, documents often lie about 785 their contents, but users rarely lie about their desires. They are 786 vauge, perhaps, but not false. 788 . . . that is, until the query stream becomes a thing of value. A 789 clever person might include a popular word and a made-up one into his 790 or her web page, then send thousands of queries with both of those 791 words to artificially boost the correlation. I have a wonderful proof 792 describing how this gambit might be defeated, but unfortunately there 793 is not enough room to include it here. 795 8. Acknowledgements 797 This specification builds on the years of work put into designing and 798 implementing Dr. Berner-Lee's HTTP protocol by many thousands of 799 people around the world. You know who you are. 801 Ideas and algorithims were stolen liberally from Chord, Circle, and 802 Freenet. Lots and lots of ideas from the original distributed search 803 projects, Harvest and WebAnts. 805 Special thanks to Richard Vasquez and Thomas Lackner for their 806 devious minds and help with early implementations. 808 9. References 810 9.1 Informative 812 [GOOGLE] Brin & Page, "The Anatomy of a Search Engine", Stanford 813 University, 1997 815 [CHORD] Stoica, et al. "A Scalable Peer-to-peer Lookup Service for 816 Internet Applications", Sigcomm 2001 817 http://www.pdos.lcs.mit.edu/papers/chord:sigcomm01/ 819 [IMPL] Dowser P2P client, http://dowser.sourceforge.net 821 9.2 Normative 823 [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 824 Resource Identifiers (URI): Generic Syntax and Semantics", RFC 825 2396, August 1998. 827 [HTTP] Feilding, et al. "Hypertext Transfer Protocol -- HTTP/1.1", 828 RFC2616, June 1999 830 [SHA] Eastlake & Jones, US Secure Hash Algorithm 1 (SHA1), RFC3174, 831 September 2001 833 10. Security Considerations 835 10.1 Personal Information 837 Dowser deals directly with people's searching habits. While spreading 838 this information around is kind of the point, implementations SHOULD 839 take care not to "leak" any more than neccessary. For instance, it 840 may be tempting to skip the NODEFIND sequence and just issue SEARCHes 841 or URLCACHEs, since they also return node-finding information. This 842 sends people's search terms and requested urls to nodes that don't 843 strictly need it, and SHOULD be avoided. 845 Implementations can be expected to work closely with traditional HTTP 846 applications, with their own privacy & security considerations. 848 10.2 Sensitive data 850 Dowser was originally intended to deal only with publically 851 accessible information, but it can also be used for sharing 852 sensitive data among trusted computers, say, all the word processor 853 documents on an office LAN. For these situations, implementations 854 should use some sort of IP whitelisting or subnetting and make the 855 user aware of the risks. 857 11. IPR & Copyright 859 The IETF invites any interested party to bring to its 860 attention any copyrights, patents or patent applications, or 861 other proprietary rights which may cover technology that may be 862 required to practice this standard. Please address the 863 information to the IETF Executive Director. 865 Copyright (C) The Internet Society (2004). All Rights 866 Reserved. 868 This document and translations of it may be copied and 869 furnished to others, and derivative works that comment on or 870 otherwise explain it or assist in its implmentation may be 871 prepared, copied, published and distributed, in whole or in 872 part, without restriction of any kind, provided that the above 873 copyright notice and this paragraph are included on all such 874 copies and derivative works. However, this document itself may 875 not be modified in any way, such as by removing the copyright 876 notice or references to the Internet Society or other Internet 877 organizations, except as needed for the purpose of developing 878 Internet standards in which case the procedures for copyrights 879 defined in the Internet Standards process must be followed, or 880 as required to translate it into languages other than English. 882 The limited permissions granted above are perpetual and will 883 not be revoked by the Internet Society or its successors or 884 assigns. 886 This document and the information contained herein is provided 887 on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 888 ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 889 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE 890 OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY 891 IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A 892 PARTICULAR PURPOSE. 894 12. Contact Information 896 Carlos Bueno 897 6231 SW 78th St., Ste. 20 898 Miami, FL, USA 33143 899 Email: carlos@bueno.org 901 This document expires on November 4, 2004