| < draft-ietf-morg-fuzzy-search-02.txt | draft-ietf-morg-fuzzy-search-03.txt > | |||
|---|---|---|---|---|
| Message Organization Working Group T. Sirainen | Message Organization Working Group T. Sirainen | |||
| Internet-Draft January 23, 2010 | Internet-Draft November 16, 2010 | |||
| Intended status: Standards Track | Intended status: Standards Track | |||
| Expires: July 27, 2010 | Expires: May 20, 2011 | |||
| IMAP4 Extension for Fuzzy Search | IMAP4 Extension for Fuzzy Search | |||
| draft-ietf-morg-fuzzy-search-02 | draft-ietf-morg-fuzzy-search-03 | |||
| Abstract | Abstract | |||
| This document describes an IMAP protocol extension enabling server to | This document describes an IMAP protocol extension enabling server to | |||
| perform searches with inexact matching and assigning relevancy scores | perform searches with inexact matching and assigning relevancy scores | |||
| for matched messages. | for matched messages. | |||
| Note | Note | |||
| A revised version of this draft document will be submitted to the RFC | A revised version of this draft document will be submitted to the RFC | |||
| editor as a Proposed Standard for the Internet Community. Discussion | editor as a Proposed Standard for the Internet Community. Discussion | |||
| and suggestions for improvement are requested, and should be sent to | and suggestions for improvement are requested, and should be sent to | |||
| morg@ietf.org. | morg@ietf.org. | |||
| Status of this Memo | Status of this Memo | |||
| This Internet-Draft is submitted to IETF in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF). Note that other groups may also distribute | |||
| other groups may also distribute working documents as Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | This Internet-Draft will expire on May 20, 2011. | |||
| http://www.ietf.org/ietf/1id-abstracts.txt. | ||||
| The list of Internet-Draft Shadow Directories can be accessed at | ||||
| http://www.ietf.org/shadow.html. | ||||
| This Internet-Draft will expire on July 27, 2010. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2010 IETF Trust and the persons identified as the | Copyright (c) 2010 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the BSD License. | described in the Simplified BSD License. | |||
| 1. Conventions used in this document | 1. Conventions used in this document | |||
| In examples, "C:" indicates lines sent by a client that is connected | In examples, "C:" indicates lines sent by a client that is connected | |||
| to a server. "S:" indicates lines sent by the server to the client. | to a server. "S:" indicates lines sent by the server to the client. | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [Kwds]. | document are to be interpreted as described in RFC 2119 [Kwds]. | |||
| skipping to change at page 2, line 37 ¶ | skipping to change at page 2, line 31 ¶ | |||
| When humans perform searches in IMAP clients, they typically want to | When humans perform searches in IMAP clients, they typically want to | |||
| see the most relevant search results first. IMAP servers are able to | see the most relevant search results first. IMAP servers are able to | |||
| do this in the most efficient way when they're free to internally | do this in the most efficient way when they're free to internally | |||
| decide how searches should match messages. This document describes a | decide how searches should match messages. This document describes a | |||
| new SEARCH=FUZZY extension that provides such functionality. | new SEARCH=FUZZY extension that provides such functionality. | |||
| 3. The FUZZY Search Key | 3. The FUZZY Search Key | |||
| FUZZY search key takes another search key as its argument. Server is | FUZZY search key takes another search key as its argument. Server is | |||
| allowed to perform all matching in an implementation-defined manner | allowed to perform all matching in an implementation-defined manner | |||
| for this search key. Typically this would be used to search for | for this search key, including ignoring the active comparator as | |||
| defined by [RFC5255]. Typically this would be used to search for | ||||
| strings, for example: | strings, for example: | |||
| C: A01 SEARCH FUZZY (SUBJECT "IMAP break") | C: A1 SEARCH FUZZY (SUBJECT "IMAP break") | |||
| S: * SEARCH 1 5 10 | S: * SEARCH 1 5 10 | |||
| S: A01 OK Search completed. | S: A1 OK Search completed. | |||
| Besides matching messages with subject "IMAP break", the above search | Besides matching messages with subject "IMAP break", the above search | |||
| may also match messages with subjects "broken IMAP", "IMAP is | may also match messages with subjects "broken IMAP", "IMAP is | |||
| broken", or anything else the server decides that might be a good | broken", or anything else the server decides that might be a good | |||
| match. | match. | |||
| This example does a fuzzy SUBJECT search, but a non-fuzzy FROM | ||||
| search: | ||||
| C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com | ||||
| S: * SEARCH 1 4 | ||||
| S: A2 OK Search completed. | ||||
| It is implementation-defined how server handles multiple separate | ||||
| FUZZY search keys. | ||||
| 4. Relevancy Scores for Search Results | 4. Relevancy Scores for Search Results | |||
| Servers SHOULD assign a search relevancy score for each matched | Servers SHOULD assign a search relevancy score for each matched | |||
| message when the FUZZY search key is given. Relevancy scores are | message when the FUZZY search key is given. Relevancy scores are | |||
| given in range 1-100, where 100 is the highest relevancy. The | given in range 1-100, where 100 is the highest relevancy. The | |||
| relevancy scores SHOULD use the full 1-100 range, so that clients can | relevancy scores SHOULD use the full 1-100 range, so that clients can | |||
| show them to users in a meaningful way, such as a percentage value. | show them to users in a meaningful way, such as a percentage value. | |||
| As the name already tells, relevancy scores specify how relevant to | As the name already tells, relevancy scores specify how relevant to | |||
| the search the matched message is. It's not necessarily the same as | the search the matched message is. It's not necessarily the same as | |||
| how precisely the message matched. For example a message whose | how precisely the message matched. For example a message whose | |||
| subject matches fuzzily the search string might get a higher | subject matches fuzzily the search string might get a higher | |||
| relevancy score than a message whose body had the exact string in the | relevancy score than a message whose body had the exact string in the | |||
| middle of a sentence. | middle of a sentence. When multiple search keys are matched fuzzily, | |||
| it's server-dependent on how the relevancy score is calculated. | ||||
| If server advertises the ESEARCH capability as defined by [ESEARCH], | If server also advertises the ESEARCH capability as defined by | |||
| the relevancy scores can be retrieved using the new RELEVANCY return | [ESEARCH], the relevancy scores can be retrieved using the new | |||
| option for SEARCH: | RELEVANCY return option for SEARCH: | |||
| C: A02 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo" | C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo" | |||
| S: * ESEARCH (TAG "A02") ALL 1,5,10 RELEVANCY (4 99 42) | S: * ESEARCH (TAG "B2") ALL 1,5,10 RELEVANCY (4 99 42) | |||
| S: A02 OK Search completed. | S: B1 OK Search completed. | |||
| The RELEVANCY return option MUST NOT be used unless FUZZY search key | The RELEVANCY return option MUST NOT be used unless FUZZY search key | |||
| is also given. | is also given. Note that SEARCH results aren't sorted by relevancy, | |||
| SORT is needed for that. | ||||
| 5. Fuzzy matching with non-string search keys | 5. Fuzzy matching with non-string search keys | |||
| Fuzzy matching is not limited to just string matching. All search | Fuzzy matching is not limited to just string matching. All search | |||
| keys SHOULD be matched fuzzily, although what exactly that means for | keys SHOULD be matched fuzzily, although what exactly that means for | |||
| different search keys is left up to server implementations to decide | different search keys is left up to server implementations to decide | |||
| -- including deciding that fuzzy matching is meaningless for a | -- including deciding that fuzzy matching is meaningless for a | |||
| particular key, and falling back to exact matching. Some suggestions | particular key, and falling back to exact matching. Some suggestions | |||
| are given below. | are given below. | |||
| skipping to change at page 4, line 15 ¶ | skipping to change at page 4, line 23 ¶ | |||
| Flags: Server could return messages that don't have the specified | Flags: Server could return messages that don't have the specified | |||
| flags, but with a lower relevancy score. | flags, but with a lower relevancy score. | |||
| UIDs, sequences, modification sequences: These are examples of keys | UIDs, sequences, modification sequences: These are examples of keys | |||
| for which exact matching is probably what makes sense. | for which exact matching is probably what makes sense. | |||
| Alternatively, a server might choose, for instance, to expand a UID | Alternatively, a server might choose, for instance, to expand a UID | |||
| range by 5% on each side. | range by 5% on each side. | |||
| 6. Extensions to SORT | 6. Extensions to SORT | |||
| If server advertises the SORT capability as defined by [SORT], the | If server also advertises the SORT capability as defined by [SORT], | |||
| results can be sorted by the new RELEVANCY sort criteria: | the results can be sorted by the new RELEVANCY sort criteria: | |||
| C: A03 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo" | C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo" | |||
| S: * SORT 5 10 1 | S: * SORT 5 10 1 | |||
| S: A03 OK Sort completed. | S: C1 OK Sort completed. | |||
| The message with the highest score is returned first. As with | The message with the highest score is returned first. As with | |||
| RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used | RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used | |||
| unless FUZZY search key is also given. | unless FUZZY search key is also given. | |||
| If server advertises the ESORT capability as defined by [CONTEXT], | If server also advertises the ESORT capability as defined by | |||
| the relevancy scores can be retrieved using the new RELEVANCY return | [CONTEXT], the relevancy scores can be retrieved using the new | |||
| option for SORT: | RELEVANCY return option for SORT: | |||
| C: A04 SORT RETURN (RELEVANCY ALL) (RELEVANCY) FUZZY TEXT "Helo" | C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT "Helo" | |||
| S: * ESEARCH (TAG "A04") ALL 5,10,1 RELEVANCY (99 42 4) | S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4) | |||
| S: A04 OK Sort completed. | S: C2 OK Sort completed. | |||
| To limit the number of returned messages, use the PARTIAL return | ||||
| option. For example this returns the 10 most relevant messages: | ||||
| C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT "World" | ||||
| S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82) | ||||
| S: C3 OK Sort completed. | ||||
| 7. Formal Syntax | 7. Formal Syntax | |||
| The following syntax specification uses the augmented Backus-Naur | The following syntax specification uses the augmented Backus-Naur | |||
| Form (BNF) as described in [ABNF]. It includes definitions from | Form (BNF) as described in [ABNF]. It includes definitions from | |||
| [RFC3501], [IMAP-ABNF] and [SORT]. | [RFC3501], [IMAP-ABNF] and [SORT]. | |||
| capability =/ "SEARCH=FUZZY" | capability =/ "SEARCH=FUZZY" | |||
| score = 1*3DIGIT | score = 1*3DIGIT | |||
| skipping to change at page 5, line 24 ¶ | skipping to change at page 5, line 30 ¶ | |||
| search-return-data =/ "RELEVANCY" SP score-list | search-return-data =/ "RELEVANCY" SP score-list | |||
| ;; Conforms to <search-return-data>, from [IMAP-ABNF] | ;; Conforms to <search-return-data>, from [IMAP-ABNF] | |||
| search-return-opt =/ "RELEVANCY" | search-return-opt =/ "RELEVANCY" | |||
| ;; Conforms to <search-return-opt>, from [IMAP-ABNF] | ;; Conforms to <search-return-opt>, from [IMAP-ABNF] | |||
| sort-key =/ "RELEVANCY" | sort-key =/ "RELEVANCY" | |||
| 8. Security Considerations | 8. Security Considerations | |||
| This document is believed not to have any security implications. | Implementation of this extension might enable a denial-of-service | |||
| attack if the implementation isn't careful to prevent them. Fuzzy | ||||
| search engines are often complex with non-obvious disk space, memory | ||||
| and/or CPU usage patterns. Implementors should test at least the | ||||
| behavior of large messages that contain very long words and/or unique | ||||
| random strings. Also very long search keys might cause excessive | ||||
| memory or CPU usage. | ||||
| Invalid input may also be problematic. For example if the search | ||||
| engine takes UTF-8 stream as input, it might fail more or less badly | ||||
| when illegal UTF-8 sequences are fed to it from a message whose | ||||
| character set was claimed to be UTF-8. This could be avoided by | ||||
| validating all the input and, for example, replacing illegal UTF-8 | ||||
| sequences with the Unicode replacement character (U+FFFD). | ||||
| Search relevancy rankings might be susceptible to "poisoning" by | ||||
| smart attackers using certain keywords or hidden markup (e.g. HTML) | ||||
| in their messages to boost the rankings. This can't be fully | ||||
| prevented by servers, so clients should prepare for it by at least | ||||
| allowing user to see all the search results, rather than hide results | ||||
| below a certain score. | ||||
| 9. IANA Considerations | 9. IANA Considerations | |||
| IMAP4 capabilities are registered by publishing a standards track or | IMAP4 capabilities are registered by publishing a standards track or | |||
| IESG approved experimental RFC. The registry is currently located | IESG approved experimental RFC. The registry is currently located | |||
| at: | at: | |||
| http://www.iana.org/assignments/imap4-capabilities | http://www.iana.org/assignments/imap4-capabilities | |||
| This document defines the X-DRAFT-I02-SEARCH=FUZZY [[anchor7: Note to | This document defines the X-DRAFT-I03-SEARCH=FUZZY [[anchor7: Note to | |||
| RFC Editor: fix before publication]] IMAP capability. IANA is | RFC Editor: fix before publication]] IMAP capability. IANA is | |||
| requested to add it to the registry. | requested to add it to the registry. | |||
| 10. Acknowledgements | 10. Acknowledgements | |||
| Alexey Melnikov, Zoltan Ordogh and Barry Leiba have helped with this | Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo and Dave | |||
| document. | Cridland have helped with this document. | |||
| 11. Normative References | 11. Normative References | |||
| [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 5234, January 2008. | Specifications: ABNF", RFC 5234, January 2008. | |||
| [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267, | [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267, | |||
| July 2008. | July 2008. | |||
| [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH | [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH | |||
| skipping to change at page 6, line 22 ¶ | skipping to change at page 6, line 44 ¶ | |||
| [IMAP-ABNF] | [IMAP-ABNF] | |||
| Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4 | Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4 | |||
| ABNF", RFC 4466, April 2006. | ABNF", RFC 4466, April 2006. | |||
| [Kwds] Bradner, S., "Key words for use in RFCs to Indicate | [Kwds] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", RFC 2119, March 1997. | Requirement Levels", RFC 2119, March 1997. | |||
| [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION | [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION | |||
| 4rev1", RFC 3501, March 2003. | 4rev1", RFC 3501, March 2003. | |||
| [RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet | ||||
| Message Access Protocol Internationalization", RFC 5255, | ||||
| June 2008. | ||||
| [SORT] Crispin, M. and K. Murchison, "Internet Message Access | [SORT] Crispin, M. and K. Murchison, "Internet Message Access | |||
| Protocol - SORT and THREAD Extensions", RFC 5256, | Protocol - SORT and THREAD Extensions", RFC 5256, | |||
| June 2008. | June 2008. | |||
| Author's Address | Author's Address | |||
| Timo Sirainen | Timo Sirainen | |||
| Email: tss@iki.fi | Email: tss@iki.fi | |||
| End of changes. 24 change blocks. | ||||
| 40 lines changed or deleted | 78 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||