idnits 2.17.1 draft-ietf-morg-fuzzy-search-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (November 16, 2010) is 4882 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 3501 (Obsoleted by RFC 9051) Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Message Organization Working Group T. Sirainen 3 Internet-Draft November 16, 2010 4 Intended status: Standards Track 5 Expires: May 20, 2011 7 IMAP4 Extension for Fuzzy Search 8 draft-ietf-morg-fuzzy-search-03 10 Abstract 12 This document describes an IMAP protocol extension enabling server to 13 perform searches with inexact matching and assigning relevancy scores 14 for matched messages. 16 Note 18 A revised version of this draft document will be submitted to the RFC 19 editor as a Proposed Standard for the Internet Community. Discussion 20 and suggestions for improvement are requested, and should be sent to 21 morg@ietf.org. 23 Status of this Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at http://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on May 20, 2011. 40 Copyright Notice 42 Copyright (c) 2010 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 1. Conventions used in this document 57 In examples, "C:" indicates lines sent by a client that is connected 58 to a server. "S:" indicates lines sent by the server to the client. 60 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 61 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 62 document are to be interpreted as described in RFC 2119 [Kwds]. 64 2. Introduction 66 When humans perform searches in IMAP clients, they typically want to 67 see the most relevant search results first. IMAP servers are able to 68 do this in the most efficient way when they're free to internally 69 decide how searches should match messages. This document describes a 70 new SEARCH=FUZZY extension that provides such functionality. 72 3. The FUZZY Search Key 74 FUZZY search key takes another search key as its argument. Server is 75 allowed to perform all matching in an implementation-defined manner 76 for this search key, including ignoring the active comparator as 77 defined by [RFC5255]. Typically this would be used to search for 78 strings, for example: 80 C: A1 SEARCH FUZZY (SUBJECT "IMAP break") 81 S: * SEARCH 1 5 10 82 S: A1 OK Search completed. 84 Besides matching messages with subject "IMAP break", the above search 85 may also match messages with subjects "broken IMAP", "IMAP is 86 broken", or anything else the server decides that might be a good 87 match. 89 This example does a fuzzy SUBJECT search, but a non-fuzzy FROM 90 search: 92 C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com 93 S: * SEARCH 1 4 94 S: A2 OK Search completed. 96 It is implementation-defined how server handles multiple separate 97 FUZZY search keys. 99 4. Relevancy Scores for Search Results 101 Servers SHOULD assign a search relevancy score for each matched 102 message when the FUZZY search key is given. Relevancy scores are 103 given in range 1-100, where 100 is the highest relevancy. The 104 relevancy scores SHOULD use the full 1-100 range, so that clients can 105 show them to users in a meaningful way, such as a percentage value. 107 As the name already tells, relevancy scores specify how relevant to 108 the search the matched message is. It's not necessarily the same as 109 how precisely the message matched. For example a message whose 110 subject matches fuzzily the search string might get a higher 111 relevancy score than a message whose body had the exact string in the 112 middle of a sentence. When multiple search keys are matched fuzzily, 113 it's server-dependent on how the relevancy score is calculated. 115 If server also advertises the ESEARCH capability as defined by 116 [ESEARCH], the relevancy scores can be retrieved using the new 117 RELEVANCY return option for SEARCH: 119 C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo" 120 S: * ESEARCH (TAG "B2") ALL 1,5,10 RELEVANCY (4 99 42) 121 S: B1 OK Search completed. 123 The RELEVANCY return option MUST NOT be used unless FUZZY search key 124 is also given. Note that SEARCH results aren't sorted by relevancy, 125 SORT is needed for that. 127 5. Fuzzy matching with non-string search keys 129 Fuzzy matching is not limited to just string matching. All search 130 keys SHOULD be matched fuzzily, although what exactly that means for 131 different search keys is left up to server implementations to decide 132 -- including deciding that fuzzy matching is meaningless for a 133 particular key, and falling back to exact matching. Some suggestions 134 are given below. 136 Dates: A typical example could be when a user wants to find a message 137 "from Dave about a week ago". A client could perform this search 138 using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE 24-Jan- 139 2009). Server could return messages outside the specified date 140 range, but the further away the message is, the lower the relevancy 141 score. 143 Sizes: These should be handled similar to dates. If a user wants to 144 search for "about 1 MB attachments", the client could do this by 145 sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again the 146 further away the message size is from the specified range, the lower 147 the relevancy score. 149 Flags: Server could return messages that don't have the specified 150 flags, but with a lower relevancy score. 152 UIDs, sequences, modification sequences: These are examples of keys 153 for which exact matching is probably what makes sense. 154 Alternatively, a server might choose, for instance, to expand a UID 155 range by 5% on each side. 157 6. Extensions to SORT 159 If server also advertises the SORT capability as defined by [SORT], 160 the results can be sorted by the new RELEVANCY sort criteria: 162 C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo" 163 S: * SORT 5 10 1 164 S: C1 OK Sort completed. 166 The message with the highest score is returned first. As with 167 RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used 168 unless FUZZY search key is also given. 170 If server also advertises the ESORT capability as defined by 171 [CONTEXT], the relevancy scores can be retrieved using the new 172 RELEVANCY return option for SORT: 174 C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT "Helo" 175 S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4) 176 S: C2 OK Sort completed. 178 To limit the number of returned messages, use the PARTIAL return 179 option. For example this returns the 10 most relevant messages: 181 C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT "World" 182 S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82) 183 S: C3 OK Sort completed. 185 7. Formal Syntax 187 The following syntax specification uses the augmented Backus-Naur 188 Form (BNF) as described in [ABNF]. It includes definitions from 189 [RFC3501], [IMAP-ABNF] and [SORT]. 191 capability =/ "SEARCH=FUZZY" 193 score = 1*3DIGIT 194 ;; (1 <= n <= 100) 196 score-list = "(" [score *(SP score)] ")" 198 search-key =/ "FUZZY" SP search-key 200 search-return-data =/ "RELEVANCY" SP score-list 201 ;; Conforms to , from [IMAP-ABNF] 203 search-return-opt =/ "RELEVANCY" 204 ;; Conforms to , from [IMAP-ABNF] 206 sort-key =/ "RELEVANCY" 208 8. Security Considerations 210 Implementation of this extension might enable a denial-of-service 211 attack if the implementation isn't careful to prevent them. Fuzzy 212 search engines are often complex with non-obvious disk space, memory 213 and/or CPU usage patterns. Implementors should test at least the 214 behavior of large messages that contain very long words and/or unique 215 random strings. Also very long search keys might cause excessive 216 memory or CPU usage. 218 Invalid input may also be problematic. For example if the search 219 engine takes UTF-8 stream as input, it might fail more or less badly 220 when illegal UTF-8 sequences are fed to it from a message whose 221 character set was claimed to be UTF-8. This could be avoided by 222 validating all the input and, for example, replacing illegal UTF-8 223 sequences with the Unicode replacement character (U+FFFD). 225 Search relevancy rankings might be susceptible to "poisoning" by 226 smart attackers using certain keywords or hidden markup (e.g. HTML) 227 in their messages to boost the rankings. This can't be fully 228 prevented by servers, so clients should prepare for it by at least 229 allowing user to see all the search results, rather than hide results 230 below a certain score. 232 9. IANA Considerations 234 IMAP4 capabilities are registered by publishing a standards track or 235 IESG approved experimental RFC. The registry is currently located 236 at: 238 http://www.iana.org/assignments/imap4-capabilities 240 This document defines the X-DRAFT-I03-SEARCH=FUZZY [[anchor7: Note to 241 RFC Editor: fix before publication]] IMAP capability. IANA is 242 requested to add it to the registry. 244 10. Acknowledgements 246 Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo and Dave 247 Cridland have helped with this document. 249 11. Normative References 251 [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax 252 Specifications: ABNF", RFC 5234, January 2008. 254 [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267, 255 July 2008. 257 [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH 258 Command for Controlling What Kind of Information Is 259 Returned", RFC 4731, November 2006. 261 [IMAP-ABNF] 262 Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4 263 ABNF", RFC 4466, April 2006. 265 [Kwds] Bradner, S., "Key words for use in RFCs to Indicate 266 Requirement Levels", RFC 2119, March 1997. 268 [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION 269 4rev1", RFC 3501, March 2003. 271 [RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet 272 Message Access Protocol Internationalization", RFC 5255, 273 June 2008. 275 [SORT] Crispin, M. and K. Murchison, "Internet Message Access 276 Protocol - SORT and THREAD Extensions", RFC 5256, 277 June 2008. 279 Author's Address 281 Timo Sirainen 283 Email: tss@iki.fi